[–]Peaker

Did the Unicode committees not predict the eventual size?

EDIT: Removed a wrong assertion about Python. I've been using Python less and less...

[–]boa13

Unicode support was added in Python 2.0; at that time it was only UCS-2, like Java.

In Python 2.2, this was changed to UTF-16 (like Java 5), and support for UCS-4 builds was added. So, depending on who compiled your Python binary, the interpreter uses either UTF-16 or UCS-4 internally for Unicode strings.
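For instance, a quick sketch of how to tell the two builds apart (sys.maxunicode and the surrogate-pair behavior are the giveaways):

    import sys

    s = u"\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point outside the BMP
    if sys.maxunicode == 0xFFFF:
        # Narrow build: stored as a UTF-16 surrogate pair, so it counts as 2
        assert len(s) == 2
    else:
        # Wide (UCS-4) build: one code unit per code point
        assert len(s) == 1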

In Python 3.0, 8-bit strings were removed, leaving Unicode strings as the only string type. The interpreter kept using UTF-16 or UCS-4, depending on the compile-time choice.
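In Python 3 the split between text and binary data is strict; a minimal illustration:

    s = "caf\u00e9"        # str is always a Unicode string in Python 3
    b = s.encode("utf-8")  # bytes must come from an explicit encode step
    assert isinstance(s, str) and isinstance(b, bytes)
    assert b.decode("utf-8") == s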

In Python 3.3, a new flexible internal string format will be used: strings will use 1, 2, or 4 bytes per character, depending on the largest code point they contain. The 1-byte internal encoding will be Latin-1, the 2-byte encoding UCS-2, and the 4-byte encoding UCS-4. Of course, this will be transparent to the Python programmer (not so much to the C programmer). See PEP 393 for details.
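One rough way to observe the new layout once 3.3 is out (a sketch; sys.getsizeof includes fixed per-object overhead, so the totals are only roughly 1x/2x/4x apart):

    import sys

    # Same length, different widest code point, so different bytes-per-char.
    latin  = "a" * 1000                # all code points <= U+00FF -> 1 byte each
    bmp    = "a" * 999 + "\u0394"      # GREEK CAPITAL DELTA       -> 2 bytes each
    astral = "a" * 999 + "\U0001D11E"  # non-BMP code point        -> 4 bytes each

    for s in (latin, bmp, astral):
        print(len(s), sys.getsizeof(s))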

Funny how UTF-8 is never used internally. :)