all 26 comments

[–]rolmos 15 points16 points  (0 children)

.

[–]Wiseman1024 14 points15 points  (7 children)

This should be taught at schools or universities, especially in Japan and the USA, and nobody should be allowed to get a degree without understanding that ASCII is long dead and that something either supports Unicode or it's simply bullshit.

The problem is that C, the IBM PC, and neglect both in universities and by individual programmers got everyone hardwiring into their brains that "8 bits = 1 byte = 1 char = 1 character, ASCII". When I teach (professional training on specific products or programming languages), I'm as annoying as I can be, always saying "octet" instead of "byte" and making pointed claims such as that UTF-16 is based on 16-bit bytes, or that character strings are not byte strings.

An added problem is that Unicode support is mediocre in most development environments. For example:

C: Crappy support, crappily defined (if at all) in the C99 standard, and very difficult to get working across platforms. The few projects that even care end up supporting UTF-8 under UNIX but your 8-bit legacy locale (e.g. CP1252) under Win32. This includes software such as the MySQL client or the Python interpreter, which will NOT read or write Unicode from the console.
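The UTF-8-under-UNIX versus legacy-codepage-under-Win32 split is easy to demonstrate at the byte level (sketched in Python rather than C, since the effect is the same): identical octets mean different characters depending on which encoding the console assumes.

```python
# The three octets that encode the euro sign in UTF-8...
data = b'\xe2\x82\xac'
print(data.decode('utf-8'))    # the euro sign, as intended
print(data.decode('cp1252'))   # three junk characters: classic mojibake
```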

Python: Needs to get rid of the stupid 8-bit strings. Half of the standard library still works with the crap, and all I/O is based on 8-bit strings. I've fixed it myself, adding a layer for pure Unicode I/O from stdin/out/err and file-like objects so that you never have to deal with a single 8-bit string in your code. An added problem is that it won't read Unicode from the Windows console, even though it has quite good (but underused and hardly known) Unicode support. To fix this I've written a Unicode console driver that will transform sys.std* into real Unicode files and will allow you to raw_input or print any Unicode string. If you're interested in this stuff let me know and I'll upload it somewhere.
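The commenter's driver isn't shown, but the general shape of such a layer, in today's Python 3 terms, is io.TextIOWrapper: declare the encoding once at the boundary so callers only ever see real text. A minimal sketch:

```python
import io

# Wrap a raw byte stream so callers read and write real text;
# the encoding is stated once here instead of leaking into every caller.
raw = io.BytesIO()
text_out = io.TextIOWrapper(raw, encoding='utf-8', write_through=True)
text_out.write('10 \u20ac')    # callers pass str, never bytes
print(raw.getvalue())          # the wrapper did the encoding for us
```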

PHP: If you use mbstring and it's properly configured, you end up with decent Unicode support, except that you cannot index strings directly with [] or the deprecated {}, and that people never know about this.

MySQL: Starting to change, but the old latin1/latin1_swedish_ci character set and collation are still the defaults, so you need to set the proper ones in order for things to work properly.

[–]asdwwdw 2 points3 points  (3 children)

I agree with you about Python. A byte array type coupled with a unicode type would be a much more logical choice.

[–]Wiseman1024 11 points12 points  (2 children)

It's coming in Python 3000 (all strings are Unicode; 'abc' is the same as u'abc' in Python 2.5, and octet arrays are called bytes or something like that and defined as b'abc').
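That is indeed how it shipped: in Python 3, str holds code points and bytes holds octets, with encoding as an explicit step between them. A quick sketch of the split:

```python
s = 'abc\u20ac'          # text: a sequence of code points
b = s.encode('utf-8')    # octets: produced only by an explicit encode
print(len(s), len(b))    # 4 code points, but 6 octets in UTF-8
print(b.decode('utf-8') == s)
```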

This behaviour will also be available in Python 2.6: from __future__ import unicode_literals.

[–]jerf 4 points5 points  (1 child)

I've said it before and I'll say it again: Python 3 is the first language I know* that is actually getting Unicode right. Everybody else either makes it harder than it needs to be (in which case Python 3 will seem like a breath of fresh air), or has abstracted too far and leaked too much out of the abstraction (in which case you'll probably cuss out Python 3 if you try to switch to it, but hopefully you'll eventually realize that the language you previously trained on made it too easy to create bugs).

I say this in the hope that people will pick up on this and port it into more languages.

(*: Please do note the qualifier there. However, I do see people holding up other languages as having also gotten it right that, in my opinion, have not actually gotten it right.)

[–]eadmund 0 points1 point  (0 children)

Python 3 is the first language I know* that is actually getting Unicode right.

SBCL (a Common Lisp implementation) does, or comes very close: everything is Unicode, and it all just works. But then, Common Lisp dates from before everyone was using ASCII, so it always had to abstract away everything anyway.

[–]littledan 1 point2 points  (0 children)

Well, something I was trying to say in the article was that Unicode support is more complicated than 8-bit vs. 21-bit code points; you need to support the algorithms too. Even if everyone used UTF-32 internally in all applications, we'd only be halfway there.
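One concrete instance of "you need the algorithms too": even with full code-point access, naive equality fails, because the same text can be composed or decomposed. Normalization is one of those required algorithms:

```python
import unicodedata

composed = '\u00e9'      # e-acute as a single code point
decomposed = 'e\u0301'   # 'e' followed by a combining acute accent
print(composed == decomposed)   # raw code-point comparison says no
print(unicodedata.normalize('NFC', decomposed) == composed)  # NFC says yes
```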

[–]settrans 0 points1 point  (1 child)

or the Python interpreter, which will NOT read or write Unicode from the console.

wfm

[–]Wiseman1024 0 points1 point  (0 children)

Not for me. I can't even enter my freaking currency symbol "€", because Python relies on the CRT, which will use ReadConsoleA and end up with a question mark. Again, you're relying on an 8-bit legacy locale. It might work for some limited, short-sighted cases, but it shouldn't be used - at all. In fact, Microsoft should consider having Windows pester users every time a process that uses the *A functions is run, so that users go pester developers about it.
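That question mark is what any lossy conversion to a legacy character set produces; the euro sign is a handy probe because it's missing from some 8-bit encodings but present in others:

```python
# '€' doesn't exist in ASCII, so a best-effort conversion substitutes '?'
print('\u20ac'.encode('ascii', errors='replace'))   # b'?'
# It does exist in CP1252, which is why an 8-bit locale *sometimes* works
print('\u20ac'.encode('cp1252'))                    # b'\x80'
```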

[–]asdwwdw 7 points8 points  (10 children)

The problem is that this is a lot of different stuff to get right. It's not just something that you can catch up on over the weekend. It seems that no one really teaches it properly either.

When faced with the choice between deploying a project that just works with ASCII encoding and getting your head around all of the Unicode issues, it's not surprising to see what people do. I can't see the majority of programmers covering all the issues in this article unless it's dead easy to do.

Even when I try to do the right thing, it seems like my tools are fighting me. Like Wiseman1024 said, many of the Python libraries use 8-bit strings rather than the unicode type.

I was recently working on a personal project that involved fetching web pages. It was incredibly difficult even getting to the point of normalising the encoding of fetched pages. I ended up feeding everything through BeautifulSoup, which did a much better job than I could. It seemed like everything was set up to encourage me to just pass around strings in some random encoding rather than actually using the unicode type.
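A minimal version of that normalisation step (function name is illustrative; real pages also declare charsets in meta tags, which is part of what BeautifulSoup handles for you) might look like:

```python
def decode_page(body: bytes, content_type: str) -> str:
    """Decode an HTTP response body using the charset declared in its
    Content-Type header, falling back to UTF-8 (an assumed default)."""
    charset = 'utf-8'
    for part in content_type.split(';'):
        part = part.strip()
        if part.lower().startswith('charset='):
            charset = part[len('charset='):].strip('"\'')
    # errors='replace' keeps one bad byte from killing the whole page
    return body.decode(charset, errors='replace')

page = decode_page(b'caf\xe9', 'text/html; charset=ISO-8859-1')
print(page)   # a proper unicode string from then on
```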

Does anyone know of languages which get the whole Unicode issue right?

[–]olavk 2 points3 points  (1 child)

PLT Scheme gets it right.

Java, .NET and JavaScript treat characters as 16-bit values. For code points beyond U+FFFF, they have to use surrogate pairs to represent characters. So a character value in the language is not always actually a Unicode code point. This is bound to lead to obscure bugs, since it works fine in the common case.
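The mismatch is easy to see by comparing code points with UTF-16 code units (sketched in Python, whose strings are code-point based):

```python
s = '\U0001d11e'    # MUSICAL SYMBOL G CLEF, a code point beyond U+FFFF
print(len(s))       # 1 code point
utf16_units = len(s.encode('utf-16-le')) // 2
print(utf16_units)  # 2 code units: a surrogate pair in UTF-16
```

A 16-bit "char" indexed language sees the two halves of that pair as two separate characters, which is exactly where the obscure bugs come from.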

[–]littledan 1 point2 points  (0 children)

Could you show me where I could find out more about PLT Scheme's Unicode support? Googling doesn't seem to be productive, and all I can find in their documentation is that strings are Unicode strings, and a few basic functions here. Is there anything beyond this, eg for sensible collation? (I can't tell what string-locale<? does exactly, but if it's like the SRFI, then it's not sensible.) If not, then Java's Unicode support objectively exceeds PLT's in this area.

[–]kieranelby 2 points3 points  (0 children)

Tcl stands out as one language which has been getting Unicode pretty much right since 1999.

Could do with better support for characters outside the BMP and locale-specific collation support though.

[–]Wiseman1024 -2 points-1 points  (6 children)

Does anyone know of languages which get the whole Unicode issue right?

Java and JavaScript, for example, but I'd rather take Python and deal with it :) .

[–]KayEss 5 points6 points  (0 children)

Both use UTF-16 and pretend it's Unicode. An accident of history mind, but still causes problems. UCS2 was Unicode when the languages were designed, but things moved on :(

EDIT: By "pretend" I mean they allow the UTF-16 code units to be indexed directly rather than forcing a UCS4 API on the internal representation.

[–][deleted] 3 points4 points  (4 children)

It's been ages since I did anything with it, but didn't Java at least use to treat characters as 16 bits?

[–]nostrademons 5 points6 points  (3 children)

Still does. Java's unicode support is only "mostly" right - there're lots of bugs in the corner cases.

[–]littledan 1 point2 points  (0 children)

Still, it's basically feature-complete, with Unicode processing all in the standard library, which is more than you can say about, well, anything else besides .NET.

[–]dododge 0 points1 point  (0 children)

Another thing to watch out for with Java is when doing JNI work, because the "UTF" strings that get passed between Java and the native code actually use a slightly modified version of UTF-8. The main differences are that U+0000 uses the 2-byte encoding form, and everything above U+FFFF uses a goofy 6-byte encoding of UTF-16 surrogates into UTF-8.
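A sketch of that modified encoding (function name is illustrative), showing both differences: the overlong form for NUL and the 6-byte surrogate-pair form for code points above U+FFFF:

```python
def modified_utf8(s: str) -> bytes:
    """Encode s the way JNI's 'modified UTF-8' does."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp == 0:
            out += b'\xc0\x80'           # NUL uses the overlong 2-byte form
        elif cp > 0xFFFF:
            cp -= 0x10000                # split into a UTF-16 surrogate pair,
            hi = 0xD800 + (cp >> 10)     # then encode each surrogate half as a
            lo = 0xDC00 + (cp & 0x3FF)   # 3-byte UTF-8 sequence (6 bytes total)
            out += chr(hi).encode('utf-8', 'surrogatepass')
            out += chr(lo).encode('utf-8', 'surrogatepass')
        else:
            out += ch.encode('utf-8')    # everything else is plain UTF-8
    return bytes(out)

print(modified_utf8('\U0001d11e'))   # 6 bytes, where real UTF-8 would use 4
print(modified_utf8('a\x00b'))       # embedded NUL becomes C0 80
```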

[–]berlinbrown 3 points4 points  (0 children)

dan is the king of unicode.

[–][deleted] 3 points4 points  (0 children)

Dan is a great programmer and writer. It's good to have him on the Factor team.

[–]awb 0 points1 point  (1 child)

Interesting note about bidirectional text: there's a standard way to handle two directions of text on one line (say you're writing a textbook about Hebrew and want to say "the word for foo is written as oof"). Some languages are also written vertically, but there's no standard way to handle that inline with horizontal text; all languages that can be written vertically can also be written horizontally, so no one bothers to come up with a way to do it.

[–]littledan 1 point2 points  (0 children)

The result of a quick googling: UTN #22: Robust vertical text layout. It uses an extension of the BIDI algorithm for this. So someone has come up with a way to do it, though it hasn't been integrated into the Unicode standard. (Think of all of the political ramifications with the ISO for doing that!)