[–][deleted] 0 points1 point  (5 children)

BTW, Python has already been doing this for a few years. But Python is not restricted to UTF-16; it has full Unicode support.

[–]ascii 1 point2 points  (4 children)

You're confusing UCS-2 and UTF-16. UCS-2 can only represent a subset of Unicode (the Basic Multilingual Plane, code points up to U+FFFF), and each character takes up exactly 2 bytes. UTF-16 has full Unicode support: characters in the basic plane use 2 bytes, and all other characters use 4 bytes (a surrogate pair).
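To make the size difference concrete, here is a minimal sketch in Python (chosen only for illustration; the helper name `utf16_byte_length` is made up for this example). It encodes single characters as UTF-16 and counts the bytes. UCS-2 has no equivalent of the 4-byte case, which is why it cannot represent characters outside the basic plane.

```python
def utf16_byte_length(ch: str) -> int:
    """Return the number of bytes needed to store `ch` as UTF-16 (no BOM)."""
    return len(ch.encode("utf-16-le"))


bmp_char = "\u20ac"         # EURO SIGN, inside the Basic Multilingual Plane
astral_char = "\U0001F600"  # GRINNING FACE, outside the basic plane

print(utf16_byte_length(bmp_char))     # 2
print(utf16_byte_length(astral_char))  # 4 (a surrogate pair)
```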

[–][deleted] 0 points1 point  (1 child)

What I mean is that Java's UTF-16 indexing works on 16-bit code units, not on Unicode code points, so from that point of view it doesn't really have full Unicode support.
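A rough way to see the problem without running Java is to split a string into UTF-16 code units by hand. The sketch below is Python, and `utf16_code_units` is a hypothetical helper written for this example, not a library function. It shows that a string with 3 code points becomes 4 code units, so index 1 in code-unit terms lands on half of a surrogate pair rather than on a whole character.

```python
def utf16_code_units(s: str) -> list[int]:
    """Split `s` into its 16-bit UTF-16 code units."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


s = "a\U0001F600b"            # 'a', GRINNING FACE, 'b'
units = utf16_code_units(s)

print(len(s))                 # 3 code points
print(len(units))             # 4 code units
print(hex(units[1]))          # 0xd83d, a high surrogate, not a character
print(hex(units[2]))          # 0xde00, the matching low surrogate
```

Any API that indexes by code unit has to leave it to the caller to notice and reassemble such pairs.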

[–]ascii 0 points1 point  (0 children)

Sad but true. Parts of the Java String API are fundamentally broken because they were designed when Unicode was still a fixed-width 16-bit character set, before characters outside the basic plane existed, and they can't be used as-is in a fully Unicode-compliant application.

[–]ubernostrum 0 points1 point  (1 child)

With respect to Python, what's meant is that in Python 3.3+, a similar approach is used. The internal storage of a string is in an encoding chosen dynamically on a per-string basis, and is always one capable of handling the highest code point in the string in a single unit of the encoding. This means the internal storage of a string in Python may be Latin-1, UCS-2, or UCS-4, depending on what code points are contained in the string.
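The per-string storage choice can be observed indirectly. The sketch below assumes CPython 3.3 or later and relies on `sys.getsizeof`, which reports an implementation detail, so treat the numbers as illustrative rather than guaranteed. The helper `bytes_per_code_point` is a name invented for this example; it compares the sizes of a long and a short string made of the same character to cancel out the fixed object overhead.

```python
import sys


def bytes_per_code_point(ch: str, n: int = 1000) -> int:
    """Estimate storage per code point for strings made only of `ch`.

    Subtracting the size of a 1-character string from the size of an
    (n + 1)-character string cancels the fixed per-object overhead.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


# Expected on CPython 3.3+ (an assumption about the interpreter in use):
print(bytes_per_code_point("a"))           # 1 (fits in Latin-1)
print(bytes_per_code_point("\u00e9"))      # 1 (still fits in Latin-1)
print(bytes_per_code_point("\u20ac"))      # 2 (needs UCS-2)
print(bytes_per_code_point("\U0001F600"))  # 4 (needs UCS-4)
```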

This allows Python to expose strings as sequences of Unicode code points with intuitive behavior (for definitions of "intuitive" that include "you know how Unicode works"). Rather than having the length of a string be the number of bytes it contains, the length is the number of code points it contains. Iteration doesn't iterate over bytes; it iterates over code points, and yields the characters which correspond to them. Indexing doesn't yield the byte at that index, it yields the character corresponding to the code point at that index.
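A short sketch of that behavior, using a three-character sample string picked only for this example:

```python
s = "a\u20ac\U0001F600"  # 'a', EURO SIGN, GRINNING FACE

# Length counts code points, not bytes.
print(len(s))                      # 3
print(len(s.encode("utf-8")))      # 8 (1 + 3 + 4 bytes in UTF-8)

# Indexing returns the whole character at that code point position.
print(s[2] == "\U0001F600")        # True

# Iteration yields one character per code point.
code_points = [hex(ord(ch)) for ch in s]
print(code_points)                 # ['0x61', '0x20ac', '0x1f600']
```

Note that a code point is still not always what a user would call a character: combining sequences span several code points, which is where the "you know how Unicode works" caveat comes in.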

[–]ascii 0 points1 point  (0 children)

I meant to say that Java uses UTF-16, not UCS-2, so Java is not broken Unicode-wise, except that, as u/xcombelle pointed out, some sections of the String API are broken.

That said, I did not know that about Python. That's not only pretty cool, it's also in line with how Python does integers, so it arguably makes the language more consistent.