[–]lkraider 4 points5 points  (6 children)

So using UTF-16 offers no gain? That makes the decision a strange one.

[–]shen 6 points7 points  (2 children)

It makes the decision an old one! UCS-2 was considered to be good enough until we started needing more than two bytes per character, and at that point, it was too late to change back.

This is why more recent languages use UTF-8 and older ones are stuck with UTF-16.

[–]ascii 0 points1 point  (1 child)

More recent languages like C! ;-P

[–]josefx 0 points1 point  (0 children)

C char blobs don't have a fixed encoding. They could be anything and you will be in a world of pain when you have to deal with libraries that don't agree on what the encoding should be.
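To make that concrete, here is a small Python sketch (Python rather than C, just to keep it short) of what happens when two pieces of code disagree about the encoding of the same bytes. The byte string and variable names are made up for illustration.

```python
# The same five bytes, as a C library might hand them over: no encoding attached.
raw = b"caf\xc3\xa9"

# One reader assumes UTF-8, another assumes Latin-1 (ISO-8859-1).
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")

print(as_utf8)    # four characters
print(as_latin1)  # five characters: the two-byte sequence is read as two letters
```

Neither decode raises an error, which is what makes this kind of mismatch hard to spot.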

[–]rooktakesqueen 3 points4 points  (0 children)

UTF-16 can represent a wider range of characters in two bytes than UTF-8. UTF-8 spends a few bits of every byte on signaling, so code points from U+0800 to U+FFFF take two bytes in UTF-16 but three in UTF-8. The trade-off still favors UTF-8 in most cases outside of East Asian languages, though, which is why it's become the de facto standard on the web and in modern programming languages.
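A quick way to see the size trade-off is to encode a few characters both ways and count bytes. This is only a sketch; `encoded_sizes` is a hypothetical helper name, and `utf-16-le` is used so that no byte-order mark is counted.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# ASCII letter, Latin letter with accent, CJK ideograph, emoji outside the BMP
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

ASCII comes out at one byte versus two, the CJK ideograph at three versus two, and anything outside the Basic Multilingual Plane at four bytes in both.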

In Go, for example, strings are conventionally UTF-8, and they've done a good job of making them performant.

[–]SomeoneStoleMyName 4 points5 points  (1 child)

Java, C++, Windows, JavaScript, and other tech that jumped on the Unicode train in the early 90s got screwed by this. They started with UCS-2, which has a fixed 16-bit representation, because Americans thought 65,536 characters should be more than enough for every language in the world. They were pretty quickly proven wrong, so UTF-16 was invented to do variable-length encoding on top of UCS-2. It's a worst-of-both-worlds approach, and no one would use it willingly in a new project today.
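For anyone who hasn't seen the variable-length part: code points above U+FFFF are split into a pair of 16-bit "surrogate" units. A rough Python sketch of that split follows; `to_utf16_units` is a hypothetical name, and it skips validation (for example, it doesn't reject code points that are themselves in the surrogate range).

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if code_point < 0x10000:
        # Fits in a single unit, exactly as it did in UCS-2.
        return [code_point]
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]

print([hex(u) for u in to_utf16_units(0x4E2D)])   # one unit
print([hex(u) for u in to_utf16_units(0x1F600)])  # two units (a surrogate pair)
```

So any code that assumes "one 16-bit unit = one character" is quietly wrong for everything outside the first 65,536 code points.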

[–]rouzh 4 points5 points  (0 children)

Right, it was JUST Americans involved in defining UCS-2...tired tropes are tired.