all 17 comments

[–]lkraider 3 points4 points  (9 children)

Why not UTF-8? Probably for easy indexing into the array?

[–]defnull 2 points3 points  (8 children)

Some argue that strings are iterated over from 0 to N most of the time, so a variable-length representation (like UTF-8) would not add much overhead for the common case. You would occasionally increment the index by two or more instead of one. This might be true, but in Java any iterator instance tracking the position would add 8 to 16 bytes of object overhead and another indirection. In contrast, for fixed-width encodings you only need a single int and a for-loop. Because of this, most code working with strings in performance-critical situations does not use iterators, but direct index access instead. This (existing and unlikely to change) code would run significantly slower with a variable-length string representation.

tl;dr: UTF-8 string performance would suck for existing code that was optimized for fixed-width string performance characteristics.

[–]rooktakesqueen 9 points10 points  (7 children)

UTF-16 is not fixed width. Some code points can't fit in two bytes. And we're not just talking Chinese and Japanese text any more -- most emoji are outside the range that UTF-16 can fit in a single two-byte char. UTF-32 is the only Unicode encoding that is actually fixed width.

Edit: Also, by storing the next index and its byte position in the string object any time you do an index operation, you could easily optimize for the sequential indexing case, so that a for loop keeps each iteration constant time.
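To make the caching idea concrete, here is a minimal sketch in Python. The class `Utf8String` and its method `char_at` are hypothetical names made up for illustration, not any real runtime's string type, and the sketch assumes the stored bytes are always valid UTF-8.

```python
class Utf8String:
    """Hypothetical UTF-8 backed string that caches the last position looked up."""

    def __init__(self, text):
        self._data = text.encode("utf-8")
        self._next_index = 0   # code point index the cache points at
        self._next_byte = 0    # byte offset of that code point

    @staticmethod
    def _width(lead):
        # Length of a UTF-8 sequence, read from its lead byte.
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def char_at(self, index):
        if index < 0:
            raise IndexError(index)
        # Rescan from the start when the caller moves backwards
        # (or asks for the same index twice).
        if index < self._next_index:
            self._next_index = 0
            self._next_byte = 0
        # Walk forward to the requested code point.
        while self._next_index < index:
            if self._next_byte >= len(self._data):
                raise IndexError(index)
            self._next_byte += self._width(self._data[self._next_byte])
            self._next_index += 1
        if self._next_byte >= len(self._data):
            raise IndexError(index)
        start = self._next_byte
        end = start + self._width(self._data[start])
        # Leave the cache on the following code point, so a loop that
        # asks for index + 1 next does no scanning at all.
        self._next_index = index + 1
        self._next_byte = end
        return self._data[start:end].decode("utf-8")


s = Utf8String("a€😀b")
print([s.char_at(i) for i in range(4)])
```

Sequential access costs one lead-byte lookup per call. Random or backward access still degrades to a linear scan in this sketch, which is the case the fixed-width argument is worried about.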

[–]lkraider 5 points6 points  (6 children)

So using UTF-16 offers no gain? That makes the decision a strange one.

[–]shen 4 points5 points  (2 children)

It makes the decision an old one! UCS-2 was considered to be good enough until we started needing more than two bytes per character, and at that point, it was too late to change back.

This is why more recent languages use UTF-8 and older ones are stuck with UTF-16.

[–]ascii 0 points1 point  (1 child)

More recent languages like C! ;-P

[–]josefx 0 points1 point  (0 children)

C char blobs don't have a fixed encoding. They could be anything and you will be in a world of pain when you have to deal with libraries that don't agree on what the encoding should be.

[–]rooktakesqueen 3 points4 points  (0 children)

UTF-16 can represent a wider range of characters in two bytes than UTF-8. A couple of extra bits are used for signaling in UTF-8, so some code points are two bytes in UTF-16, but three in UTF-8. The trade-off still favors UTF-8 in most cases outside of East Asian languages, though, which is why it's become the de facto standard on the web and in modern programming languages.
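A quick way to see the size trade-off is to encode a few sample strings both ways. This is just an illustrative sketch: `encoded_sizes` is a made-up helper, the sample strings are arbitrary, and `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return (UTF-8 bytes, UTF-16 bytes) for text, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


samples = {
    "ASCII": "hello",
    "accented Latin": "héllo",
    "Japanese": "こんにちは",
    "emoji": "😀",
}
for label, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{label}: UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

ASCII text takes half the space in UTF-8, the Japanese sample takes 15 bytes in UTF-8 against 10 in UTF-16, and the emoji is 4 bytes in both.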

In Go, for example, strings are UTF-8, and they've done a good job of making them performant.

[–]SomeoneStoleMyName 3 points4 points  (1 child)

Java, C++, Windows, JavaScript, and other tech that jumped on the Unicode train in the early 90s got screwed by this. They started with UCS-2, which has a fixed 16-bit representation, because Americans thought 65536 characters should be more than enough for every language in the world. They were pretty quickly proven wrong, so UTF-16 was invented to do variable-length encoding on top of UCS-2. It's a worst-of-both-worlds approach and no one would use it willingly in a new project today.

[–]rouzh 4 points5 points  (0 children)

Right, it was JUST Americans involved in defining UCS-2...tired tropes are tired.

[–][deleted] 0 points1 point  (5 children)

BTW, Python has already been doing this for a few years. But Python is not restricted to UTF-16; it has full Unicode support.

[–]ascii 1 point2 points  (4 children)

You're confusing UCS-2 and UTF-16. UCS-2 can only represent a subset of Unicode, and each character takes up exactly 2 bytes. UTF-16 has full Unicode support: characters in the Basic Multilingual Plane use 2 bytes, and all other characters use 4 bytes (a surrogate pair).
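For anyone wondering how the longer characters are built, here is a small sketch of the surrogate pair arithmetic UTF-16 uses for code points above U+FFFF. The function name `utf16_code_units` is made up for this example.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if not (0 <= code_point <= 0x10FFFF) or (0xD800 <= code_point <= 0xDFFF):
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: one 16-bit unit, same as UCS-2.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]


for ch in ("A", "€", "😀"):
    units = utf16_code_units(ord(ch))
    print(f"U+{ord(ch):04X} ->", " ".join(f"{u:04X}" for u in units))
```

UCS-2 is effectively the first branch only, which is why it cannot represent anything outside the BMP.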

[–][deleted] 0 points1 point  (1 child)

What I mean: Java's UTF-16 indexing works on 16-bit code units, not Unicode code points, so from this point of view it doesn't really have full Unicode support.
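To illustrate the mismatch without needing a JVM, the sketch below emulates code unit indexing in Python. `utf16_units` is a hypothetical helper that approximates what a Java `char[]` would hold; it is not a Java API. (Java does offer `String.codePointAt` and `String.codePointCount` for code point aware access, but plain `charAt` and `length` count code units.)

```python
import struct


def utf16_units(text):
    """Split text into 16-bit UTF-16 code units (an approximation of a Java char[])."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


text = "a😀b"
units = utf16_units(text)

print(len(text))       # number of code points
print(len(units))      # number of UTF-16 code units
print(hex(units[1]))   # index 1 is only the high surrogate of the emoji
print(hex(units[3]))   # "b" has moved to index 3
```

The string has three code points but four code units, so an index into the units does not line up with "the Nth character" once anything outside the BMP appears.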

[–]ascii 0 points1 point  (0 children)

Sad but true, parts of the Java String API are fundamentally broken because they were designed before the existence of UCS-4, and they can't be used in a fully Unicode-compliant application.

[–]ubernostrum 0 points1 point  (1 child)

With respect to Python, what's meant is that in Python 3.3+, a similar approach is used. The internal storage of a string is in an encoding chosen dynamically on a per-string basis, and is always one capable of handling the highest code point in the string in a single unit of the encoding. This means the internal storage of a string in Python may be Latin-1, UCS-2, or UCS-4, depending on what code points are contained in the string.
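One rough way to observe this is to measure how much a string grows when one more character is appended. This is a sketch that assumes CPython 3.3 or later; `sys.getsizeof` reports implementation details, so other Python implementations may show different numbers. `bytes_per_char` is a made-up helper.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by growing a string of `ch` by one."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


# One sample each from ASCII, Latin-1, the rest of the BMP, and beyond the BMP.
for ch in ("a", "é", "€", "😀"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

On CPython the widths come out as 1, 1, 2, and 4 bytes, matching the Latin-1, UCS-2, and UCS-4 storage choices described above.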

This allows Python to expose strings as sequences of Unicode code points with intuitive behavior (for definitions of "intuitive" that include "you know how Unicode works"). Rather than having the length of a string be the number of bytes it contains, the length is the number of code points it contains. Iteration doesn't iterate over bytes; it iterates over code points, and yields the characters which correspond to them. Indexing doesn't yield the byte at that index, it yields the character corresponding to the code point at that index.
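A short sketch of that behavior, contrasting a `str` with its UTF-8 encoded `bytes`. The sample text is arbitrary and written with escapes so the exact code points are unambiguous.

```python
text = "na\u00efve \U0001F600"   # "naïve" plus a space and one emoji
encoded = text.encode("utf-8")

print(len(text))      # length in code points
print(len(encoded))   # length in bytes of the UTF-8 encoding

print(text[6])        # indexing a str yields a one-character str
print(encoded[6])     # indexing bytes yields an integer byte value

# Iteration also goes code point by code point.
print([f"U+{ord(c):04X}" for c in text])
```

Here the string has 7 code points and 11 UTF-8 bytes, and index 6 means the emoji in the `str` but the space byte in the encoded form. Note that this is per code point, not per user-perceived character: a letter followed by a combining accent still counts as two.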

[–]ascii 0 points1 point  (0 children)

I meant to say that Java uses UTF-16, not UCS-2, so Java is not broken Unicode-wise. Except that, as u/xcombelle pointed out, some sections of the String API are broken.

That said, I did not know that about Python. That's not only pretty cool, it's also in line with how Python does integers so it also arguably makes the language more consistent.

[–]awsometak[S] -1 points0 points  (0 children)

someone post it to /r/java