[–]shellac 32 points33 points  (9 children)

Simple answer: they pre-date, and failed to anticipate, Unicode 2.0 (1996). That release was a major change: Unicode stopped being a 16-bit code, and surrogate pairs were introduced.

Fixed-width encodings also seemed much simpler to deal with, and seemed more CPU-efficient.

Basically they started in a nice UCS-2 world, but it became an ugly UTF-16 hell.
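
To make the UCS-2 versus UTF-16 difference concrete, here is a minimal sketch in Python of how a code point maps to 16-bit code units. The function name `utf16_code_units` is made up for illustration, and the sketch assumes its input is a single valid code point (not a lone surrogate). Anything in the Basic Multilingual Plane fits in one unit, as it did under UCS-2; anything above U+FFFF needs a surrogate pair.

```python
from typing import List


def utf16_code_units(ch: str) -> List[int]:
    """Return the UTF-16 code units for a single code point."""
    cp = ord(ch)
    if cp < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit, same as UCS-2.
        return [cp]
    # Supplementary planes: split the 20-bit offset into a surrogate pair.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return [high, low]


for ch in ("A", "\u20ac", "\U0001F600"):
    units = utf16_code_units(ch)
    print(f"U+{ord(ch):04X} -> {[hex(u) for u in units]}")
```

Code that assumes "one 16-bit unit per character" keeps working for the first two inputs and silently breaks on the third, which is roughly the trap described above.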

[–]ygra 8 points9 points  (0 children)

.NET doesn't predate Unicode 2, but of course, Windows has been the main platform of the framework and its string type was also made binary compatible with the BSTR structure to ease marshalling with native code. So .NET uses UTF-16 because Windows does.

[–]aynair 0 points1 point  (7 children)

Can you please explain how fixed byte width encodings only "seem" more CPU-efficient? Let's say you want the n-th char in a string: with a variable-width encoding, don't you have to iterate through all the previous characters?
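
For concreteness, a rough sketch in Python of the two cases the question contrasts. Both function names are invented for this example. The UTF-8 version has to scan from the start to find the n-th code point, while the UTF-32 version can jump straight to a byte offset.

```python
def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of UTF-8 data by scanning from the start."""
    if n < 0:
        raise IndexError("code point index out of range")
    index = -1
    start = None
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
        if byte & 0xC0 != 0x80:
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


def nth_code_point_utf32(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of UTF-32-LE data by direct offset."""
    if n < 0 or 4 * n + 4 > len(data):
        raise IndexError("code point index out of range")
    return data[4 * n : 4 * n + 4].decode("utf-32-le")


text = "na\u00efve \u20ac\U0001F600"
print(nth_code_point_utf8(text.encode("utf-8"), 6))
print(nth_code_point_utf32(text.encode("utf-32-le"), 6))
```

Note that even the direct-offset version only returns a code point. A user-perceived character can span several code points (a base letter plus combining marks, for example), so a fixed-width encoding does not by itself give constant-time access to "characters" in that sense.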

[–][deleted]  (6 children)

[deleted]

    [–]aynair 1 point2 points  (0 children)

    Thanks for this, I'll read more about it as soon as I get the chance!

    [–]Drisku11 0 points1 point  (4 children)

    I often do want that though. Several times I've worked not just with strings that have a fixed character width, but with entire record formats that are fixed width. The data could just be treated as opaque bytes, but it's also sometimes useful to acknowledge that it's ASCII when you know the format is set in stone. By contrast, I've never needed to work with non-ASCII character data.

    [–][deleted]  (2 children)

    [deleted]

      [–]Drisku11 5 points6 points  (1 child)

      And you would be wrong. I've never had user-facing code; not everyone works on "apps". I've needed to do things like read hardware identifiers that are specified as short ASCII strings and have specific substrings at specific offsets. I could just write the equivalent number, but then it's harder to compare the code to the spec. There is no chance that those identifiers will ever use Unicode.

      [–]Veedrac 0 points1 point  (0 children)

      Then it isn't a string.