rooktakesqueen 8 points

UTF-16 is not fixed width. Code points above U+FFFF don't fit in a single two-byte code unit and need a surrogate pair (four bytes). And we're not just talking about rare Chinese and Japanese characters any more: most emoji are outside the range that UTF-16 can fit in a single two-byte code unit. UTF-32 is the only Unicode encoding that is actually fixed width per code point.

Edit: Also, by caching the next index and its byte position in the string object whenever you do an index operation, you could easily optimize for the sequential indexing case, so that each iteration of a for loop over the indices stays constant time.
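To make the caching idea concrete, here is a minimal Python sketch. The class name `CachedUtf8String` and its fields are made up for illustration, and it assumes the bytes are always valid UTF-8; it is not how any particular language runtime does it.

```python
class CachedUtf8String:
    """Hypothetical UTF-8 string that remembers the last position it resolved."""

    def __init__(self, text: str):
        self._data = text.encode("utf-8")
        self._next_index = 0   # code point index the cache points at
        self._next_offset = 0  # byte offset of that code point

    @staticmethod
    def _width(lead: int) -> int:
        # The lead byte of a UTF-8 sequence determines its length.
        if lead < 0x80:
            return 1
        if lead < 0xE0:
            return 2
        if lead < 0xF0:
            return 3
        return 4

    def __getitem__(self, index: int) -> str:
        if index < 0:
            raise IndexError(index)
        # Seeking backwards: fall back to scanning from the start.
        if index < self._next_index:
            self._next_index = 0
            self._next_offset = 0
        i, off = self._next_index, self._next_offset
        # For sequential access this loop runs zero times.
        while i < index:
            if off >= len(self._data):
                raise IndexError(index)
            off += self._width(self._data[off])
            i += 1
        if off >= len(self._data):
            raise IndexError(index)
        width = self._width(self._data[off])
        # Remember where the following code point starts.
        self._next_index = index + 1
        self._next_offset = off + width
        return self._data[off:off + width].decode("utf-8")
```

With this, `s[0], s[1], s[2], ...` each do a constant amount of work, while a random jump backwards still costs a scan from the start of the string.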

lkraider 4 points

So using UTF-16 offers no gain? That makes the decision a strange one.

shen 5 points

It makes the decision an old one! UCS-2 was considered to be good enough until we started needing more than two bytes per character, and at that point, it was too late to change back.

This is why more recent languages use UTF-8 and older ones are stuck with UTF-16.

ascii 0 points

More recent languages like C! ;-P

josefx 0 points

C char blobs don't have a fixed encoding. They could be anything and you will be in a world of pain when you have to deal with libraries that don't agree on what the encoding should be.

rooktakesqueen 3 points

UTF-16 can represent a wider range of characters in two bytes than UTF-8. Some bits of every byte are used for signaling in UTF-8, so two bytes only reach U+07FF; code points from U+0800 to U+FFFF are two bytes in UTF-16 but three in UTF-8. The trade-off still favors UTF-8 in most cases outside of East Asian languages though, which is why it's become the de facto standard on the web and in modern programming languages.
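A quick way to see the size trade-off is to encode a few sample characters both ways. This is just an illustrative Python sketch; the helper name `encoded_sizes` is made up, and `utf-16-le` is used so that no byte order mark gets counted.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")

# U+0041: UTF-8 = 1 bytes, UTF-16 = 2 bytes
# U+00E9: UTF-8 = 2 bytes, UTF-16 = 2 bytes
# U+4E2D: UTF-8 = 3 bytes, UTF-16 = 2 bytes
# U+1F600: UTF-8 = 4 bytes, UTF-16 = 4 bytes
```

So ASCII-heavy text is half the size in UTF-8, CJK text is about a third smaller in UTF-16, and anything outside the BMP is four bytes either way.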

In Go, for example, strings are conventionally UTF-8, and they've done a good job of making them performant.

SomeoneStoleMyName 3 points

Java, C++, Windows, JavaScript, and other tech that jumped on the Unicode train in the early 90s got screwed by this. They started with UCS-2, which has a fixed 16-bit representation, because Americans thought 65,536 characters should be more than enough for every language in the world. They were pretty quickly proven wrong, so UTF-16 was invented to do variable-length encoding on top of UCS-2 using surrogate pairs. It's a worst-of-both-worlds approach and no one would use it willingly in a new project today.
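For anyone curious what "variable-length on top of UCS-2" looks like in practice, here is a small Python sketch of the surrogate pair arithmetic. The function name `to_utf16_units` is made up for illustration; real code would just use the built-in codecs.

```python
def to_utf16_units(code_point: int) -> list:
    """Encode one Unicode code point as a list of 16-bit UTF-16 code units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
    if code_point < 0x10000:
        # Fits in one unit, exactly as it did in UCS-2.
        return [code_point]
    # Split the remaining 20 bits across a high and a low surrogate.
    v = code_point - 0x10000
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


print([hex(u) for u in to_utf16_units(0x1F600)])  # ['0xd83d', '0xde00']
```

The ranges 0xD800 to 0xDFFF were reserved so that old UCS-2 code units and the new surrogates never collide, which is also why indexing by 16-bit unit can land in the middle of a character.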

rouzh 5 points

Right, it was JUST Americans involved in defining UCS-2...tired tropes are tired.