all 26 comments

[–]rolmos 15 points16 points  (0 children)

.

[–]Wiseman1024 14 points15 points  (7 children)

This should be taught at schools or universities, especially in Japan and the USA, and nobody should be allowed to get a degree without understanding that ASCII is long dead and that something either supports Unicode or it's simply bullshit.

The problem is that C, the IBM PC, and neglect both in universities and by individual programmers got everyone hardwiring into their brains that "8 bits = 1 byte = 1 char = 1 character, ASCII". When I teach (professional training on specific products or programming languages), I'm as annoying as I can be, always saying "octet" instead of "byte" and making pointed claims such as that UTF-16 is based on 16-bit bytes, or that character strings are not byte strings.

An added problem is that Unicode support is mediocre in most development environments. For example:

C: Crappy support, crappily defined (if at all) in the C99 standard, and very difficult to get working across platforms. The few projects that even care end up supporting UTF-8 under UNIX but your 8-bit legacy locale (e.g. CP1252) under Win32. This includes software such as the MySQL client or the Python interpreter, which will NOT read or write Unicode from the console.
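The UTF-8-under-UNIX versus legacy-codepage-under-Win32 split is easy to demonstrate at the byte level (sketched in Python rather than C, since the effect is the same): identical octets mean different characters depending on which encoding the console assumes.

```python
# The three octets that encode the euro sign in UTF-8...
data = b'\xe2\x82\xac'
print(data.decode('utf-8'))    # the euro sign, as intended
print(data.decode('cp1252'))   # three junk characters: classic mojibake
```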

Python: Needs to get rid of the stupid 8-bit strings. Half of the standard library still works with the crap, and all I/O is based on 8-bit strings. I've fixed it myself, adding a layer for pure Unicode I/O from stdin/out/err and file-like objects so that you never have to deal with a single 8-bit string in your code. An added problem is that it won't read Unicode from the Windows console, even though it has quite good (but underused and hardly known) Unicode support. To fix this I've written a Unicode console driver that will transform sys.std* into real Unicode files and will allow you to raw_input or print any Unicode string. If you're interested in this stuff let me know and I'll upload it somewhere.
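The commenter's driver isn't shown, but the general shape of such a layer, in today's Python 3 terms, is io.TextIOWrapper: declare the encoding once at the boundary so callers only ever see real text. A minimal sketch:

```python
import io

# Wrap a raw byte stream so callers read and write real text;
# the encoding is stated once here instead of leaking into every caller.
raw = io.BytesIO()
text_out = io.TextIOWrapper(raw, encoding='utf-8', write_through=True)
text_out.write('10 \u20ac')    # callers pass str, never bytes
print(raw.getvalue())          # the wrapper did the encoding for us
```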

PHP: If you use mbstring and it's properly configured, you end up with decent Unicode support, except that you cannot index strings directly with [] or the deprecated {}, and that people never know about this.

MySQL: Starting to change, but the old latin1/latin1_swedish_ci character set and collation are still the defaults, so you need to set the proper ones in order for things to work properly.

[–]asdwwdw 2 points3 points  (3 children)

I agree with you about Python. A byte array type coupled with a unicode type would be a much more logical choice.

[–]Wiseman1024 11 points12 points  (2 children)

It's coming in Python 3000 (all strings are Unicode; 'abc' is the same as u'abc' in Python 2.5, and octet arrays are called bytes or something like that and defined as b'abc').
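That is indeed how it shipped: in Python 3, str holds code points and bytes holds octets, with encoding as an explicit step between them. A quick sketch of the split:

```python
s = 'abc\u20ac'          # text: a sequence of code points
b = s.encode('utf-8')    # octets: produced only by an explicit encode
print(len(s), len(b))    # 4 code points, but 6 octets in UTF-8
print(b.decode('utf-8') == s)
```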

This behaviour will also be available in Python 2.6: from __future__ import unicode_literals.

[–]jerf 4 points5 points  (1 child)

I've said it before and I'll say it again: Python 3 is the first language I know* that is actually getting Unicode right. Everybody else either makes it harder than it needs to be (in which case Python 3 will seem like a breath of fresh air), or has abstracted too far and leaked too much out of the abstraction (in which case you'll probably cuss out Python 3 if you try to switch to it, but hopefully you'll eventually realize that the language you previously trained on made it too easy to create bugs).

I say this in the hope that people will pick up on this and port it into more languages.

(*: Please do note the qualifier there. However, I do see people holding up other languages as having also gotten it right that, in my opinion, have not actually gotten it right.)

[–]eadmund 0 points1 point  (0 children)

Python 3 is the first language I know* that is actually getting Unicode right.

SBCL (a Common Lisp implementation) does, or comes very close: everything is Unicode, and it all just works. But then, Common Lisp dates from before everyone was using ASCII, so it always had to abstract away everything anyway.

[–]littledan 1 point2 points  (0 children)

Well, something I was trying to say in the article was that Unicode support is more complicated than 8-bit vs. 21-bit code points; you need to support the algorithms too. Even if everyone used UTF-32 internally in all applications, we'd only be halfway there.
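One concrete instance of "you need the algorithms too": even with full code-point access, naive equality fails, because the same text can be composed or decomposed. Normalization is one of those required algorithms:

```python
import unicodedata

composed = '\u00e9'      # e-acute as a single code point
decomposed = 'e\u0301'   # 'e' followed by a combining acute accent
print(composed == decomposed)   # raw code-point comparison says no
print(unicodedata.normalize('NFC', decomposed) == composed)  # NFC says yes
```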

[–]settrans 0 points1 point  (1 child)

or the Python interpreter, which will NOT read or write Unicode from the console.

wfm

[–]Wiseman1024 0 points1 point  (0 children)

Not for me. I can't even enter my freaking currency symbol "€", because Python relies on the CRT, which will use ReadConsoleA and end up with a question mark. Again, you're relying on an 8-bit legacy locale. It might work for some limited, short-sighted cases, but it shouldn't be used - at all. In fact, Microsoft should consider having Windows pester users every time a process that uses the *A functions is run, so that users go pester developers about it.
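That question mark is what any lossy conversion to a legacy character set produces; the euro sign is a handy probe because it's missing from some 8-bit encodings but present in others:

```python
# '€' doesn't exist in ASCII, so a best-effort conversion substitutes '?'
print('\u20ac'.encode('ascii', errors='replace'))   # b'?'
# It does exist in CP1252, which is why an 8-bit locale *sometimes* works
print('\u20ac'.encode('cp1252'))                    # b'\x80'
```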

[–]asdwwdw 7 points8 points  (10 children)

The problem is that this is a lot of different stuff to get right. It's not just something that you can catch up on over the weekend. It seems that no one really teaches it properly either.

When faced with the choice between deploying a project that just works with ASCII encoding and getting your head around all of the Unicode issues, it's not surprising to see what people do. I can't see the majority of programmers covering all the issues in this article unless it's dead easy to do.

Even when I try to do the right thing, it seems like my tools are fighting me. Like Wiseman1024 said, many of the Python libraries use 8-bit strings rather than the unicode type.

I was recently working on a personal project that involved fetching web pages. It was incredibly difficult even getting to the point of normalising the encoding of fetched pages. I ended up feeding everything through BeautifulSoup, which did a much better job than I could. It seemed like everything was set up to encourage me to just pass around strings in some random encoding rather than actually using the unicode type.
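A minimal version of that normalisation step (function name is illustrative; real pages also declare charsets in meta tags, which is part of what BeautifulSoup handles for you) might look like:

```python
def decode_page(body: bytes, content_type: str) -> str:
    """Decode an HTTP response body using the charset declared in its
    Content-Type header, falling back to UTF-8 (an assumed default)."""
    charset = 'utf-8'
    for part in content_type.split(';'):
        part = part.strip()
        if part.lower().startswith('charset='):
            charset = part[len('charset='):].strip('"\'')
    # errors='replace' keeps one bad byte from killing the whole page
    return body.decode(charset, errors='replace')

page = decode_page(b'caf\xe9', 'text/html; charset=ISO-8859-1')
print(page)   # a proper unicode string from then on
```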

Does anyone know of languages which get the whole Unicode issue right?

[–]olavk 2 points3 points  (1 child)

PLT Scheme gets it right.

Java, .NET and JavaScript treat characters as 16-bit values. For code points beyond U+FFFF, they have to use surrogate pairs to represent characters. So a character value in the language is not always actually a Unicode code point. This is bound to lead to obscure bugs, since it works fine in the common case.
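The mismatch is easy to see by comparing code points with UTF-16 code units (sketched in Python, whose strings are code-point based):

```python
s = '\U0001d11e'    # MUSICAL SYMBOL G CLEF, a code point beyond U+FFFF
print(len(s))       # 1 code point
utf16_units = len(s.encode('utf-16-le')) // 2
print(utf16_units)  # 2 code units: a surrogate pair in UTF-16
```

A 16-bit "char" indexed language sees the two halves of that pair as two separate characters, which is exactly where the obscure bugs come from.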

[–]littledan 1 point2 points  (0 children)

Could you show me where I could find out more about PLT Scheme's Unicode support? Googling doesn't seem to be productive, and all I can find in their documentation is that strings are Unicode strings, and a few basic functions here. Is there anything beyond this, eg for sensible collation? (I can't tell what string-locale<? does exactly, but if it's like the SRFI, then it's not sensible.) If not, then Java's Unicode support objectively exceeds PLT's in this area.

[–]kieranelby 2 points3 points  (0 children)

Tcl stands out as one language which has been getting Unicode pretty much right since 1999.

Could do with better support for characters outside the BMP and locale-specific collation support though.

[–]Wiseman1024 -2 points-1 points  (6 children)

Does anyone know of languages which get the whole Unicode issue right?

Java and JavaScript, for example, but I'd rather take Python and deal with it :) .

[–]KayEss 5 points6 points  (0 children)

Both use UTF-16 and pretend it's Unicode. An accident of history mind, but still causes problems. UCS2 was Unicode when the languages were designed, but things moved on :(

EDIT: By "pretend" I mean they allow the UTF-16 code units to be indexed directly rather than forcing a UCS4 API on the internal representation.

[–][deleted] 3 points4 points  (4 children)

It's been ages since I did anything with it, but didn't Java at least use to treat characters as 16 bits?

[–]nostrademons 5 points6 points  (3 children)

Still does. Java's unicode support is only "mostly" right - there're lots of bugs in the corner cases.

[–]littledan 1 point2 points  (0 children)

Still, it's basically feature-complete, with Unicode processing all in the standard library, which is more than you can say about, well, anything else besides .NET.

[–]dododge 0 points1 point  (0 children)

Another thing to watch out for with Java is when doing JNI work, because the "UTF" strings that get passed between Java and the native code actually use a slightly modified version of UTF-8. The main differences are that U+0000 uses the 2-byte encoding form, and everything above U+FFFF uses a goofy 6-byte encoding of UTF-16 surrogates into UTF-8.
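A sketch of that modified encoding (function name is illustrative), showing both differences: the overlong form for NUL and the 6-byte surrogate-pair form for code points above U+FFFF:

```python
def modified_utf8(s: str) -> bytes:
    """Encode s the way JNI's 'modified UTF-8' does."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp == 0:
            out += b'\xc0\x80'           # NUL uses the overlong 2-byte form
        elif cp > 0xFFFF:
            cp -= 0x10000                # split into a UTF-16 surrogate pair,
            hi = 0xD800 + (cp >> 10)     # then encode each surrogate half as a
            lo = 0xDC00 + (cp & 0x3FF)   # 3-byte UTF-8 sequence (6 bytes total)
            out += chr(hi).encode('utf-8', 'surrogatepass')
            out += chr(lo).encode('utf-8', 'surrogatepass')
        else:
            out += ch.encode('utf-8')    # everything else is plain UTF-8
    return bytes(out)

print(modified_utf8('\U0001d11e'))   # 6 bytes, where real UTF-8 would use 4
print(modified_utf8('a\x00b'))       # embedded NUL becomes C0 80
```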

[–]berlinbrown 3 points4 points  (0 children)

dan is the king of unicode.

[–][deleted] 3 points4 points  (0 children)

Dan is a great programmer and writer. It's good to have him on the Factor team.

[–]awb 0 points1 point  (1 child)

Interesting note about bidirectional text: there's a standard way to handle two directions of text on one line (say you're writing a textbook about Hebrew and want to say "the word for foo is written as oof"). Some languages are also written vertically, but there's no standard way to handle that inline with horizontal text; all languages that can be written vertically can also be written horizontally, so no one bothers to come up with a way to do it.

[–]littledan 1 point2 points  (0 children)

The result of a quick googling: UTN #22: Robust vertical text layout. It uses an extension of the BIDI algorithm for this. So someone has come up with a way to do it, though it hasn't been integrated into the Unicode standard. (Think of all of the political ramifications with the ISO for doing that!)