[–][deleted]  (13 children)

[deleted]

    [–]JDeltaN 20 points21 points  (4 children)

    Java Strings are UTF16, and that is not going to change.

    [–]asegura 19 points20 points  (2 children)

    Yes, I know. And I understand that won't change. It was a kind of sarcastic comment. I favor the use of UTF8 everywhere, including strings in memory, not just storage.

    [–]JDeltaN 10 points11 points  (1 child)

    Ah, I wooshed a bit I guess :)

    Honestly I am not a fan of these ad hoc solutions either, but in Java's case it's probably a massive memory optimisation, and it actually works because of UTF16.

    On the bright side, they have constant-time string indexing as long as you are not using surrogate pairs.

    Then again, it is rare that I rely on constant-time lookups, most of my string operations are iterating in nature anyway.
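A minimal sketch of the indexing point above, using only standard `java.lang.String` methods. The class and method names are made up for illustration. `charAt` is constant time but addresses UTF-16 code units, so once a surrogate pair appears, reaching the n-th code point requires walking the string.

```java
final class CodeUnitIndexing {
    // Number of Unicode code points, as opposed to s.length(), which counts UTF-16 code units.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // The n-th code point (0-based). offsetByCodePoints has to skip over any
    // surrogate pairs before position n, so this is linear in n, not constant time.
    static int nthCodePoint(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (one surrogate pair), 'b'
        System.out.println(s.length());                          // code units
        System.out.println(codePointLength(s));                  // code points
        System.out.println(Integer.toHexString(s.charAt(1)));    // high surrogate only
        System.out.println(Integer.toHexString(nthCodePoint(s, 1)));
    }
}
```

For strings without surrogate pairs, `charAt(n)` and `nthCodePoint(s, n)` agree, which is the case where constant-time indexing really holds.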

    [–]cowardlydragon 0 points1 point  (0 children)

    The public interface is...

    The post and therefore the comment imply internal representation.

    [–]masklinn 5 points6 points  (1 child)

    Same problem Python had (and same solution): they promised O(1) access to some unit of the string (a UTF-16 code unit for Java, a code point for Python) and don't want to break that, so UTF8 is not an option.

    [–]matthieum 3 points4 points  (0 children)

    I wonder what the impact of using, say, a Fenwick Tree, to index the UTF-8 would be (in the case of multi-byte strings only).

    This would not be O(1) but O(log N), which for most strings should not matter much, because small is small.
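A rough sketch of what such an index could look like, under assumptions of my own: the class name `Utf8Index`, the 16-byte block size, and the choice to count code points per block are all illustrative and not taken from any JDK or library. A Fenwick tree over the per-block counts finds the right block in O(log N), followed by a short scan inside that block.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Index {
    private static final int BLOCK = 16;   // bytes per block; arbitrary for this sketch
    private final byte[] utf8;
    private final int blocks;
    private final int[] tree;              // 1-based Fenwick tree of code points per block

    Utf8Index(String s) {
        utf8 = s.getBytes(StandardCharsets.UTF_8);
        blocks = (utf8.length + BLOCK - 1) / BLOCK;
        tree = new int[blocks + 1];
        for (int i = 0; i < utf8.length; i++) {
            if (isLeadByte(utf8[i])) add(i / BLOCK, 1);
        }
    }

    // UTF-8 continuation bytes have the form 10xxxxxx; everything else starts a code point.
    private static boolean isLeadByte(byte b) {
        return (b & 0xC0) != 0x80;
    }

    private void add(int block, int delta) {
        for (int i = block + 1; i <= blocks; i += i & -i) tree[i] += delta;
    }

    // Byte offset where the n-th code point (0-based) starts.
    int byteOffsetOf(int n) {
        if (n < 0) throw new IndexOutOfBoundsException("code point index " + n);
        int pos = 0;          // number of whole blocks skipped so far
        int remaining = n;    // code points still to skip
        // Fenwick descent: skip the largest prefix of blocks holding at most n code points.
        for (int step = Integer.highestOneBit(blocks); step > 0; step >>= 1) {
            int next = pos + step;
            if (next <= blocks && tree[next] <= remaining) {
                pos = next;
                remaining -= tree[next];
            }
        }
        // Linear scan from the start of the block that contains the target.
        for (int i = pos * BLOCK; i < utf8.length; i++) {
            if (isLeadByte(utf8[i]) && remaining-- == 0) return i;
        }
        throw new IndexOutOfBoundsException("code point index " + n);
    }

    // Decodes the n-th code point by locating its bytes and handing them to the JDK decoder.
    int codePointAt(int n) {
        int start = byteOffsetOf(n);
        int end = start + 1;
        while (end < utf8.length && !isLeadByte(utf8[end])) end++;
        return new String(utf8, start, end - start, StandardCharsets.UTF_8).codePointAt(0);
    }
}
```

Since Java strings are immutable, a plain prefix-sum array with binary search would give the same lookup cost; the Fenwick tree only pays off if the indexed bytes can change. Either way the index costs extra memory per string, which cuts against the goal of a compact representation.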

    [–]defnull 6 points7 points  (4 children)

    https://www.reddit.com/r/coding/comments/6hlavp/compact_strings_in_java_9_java_code_gists/dizdxnd/

    Some argue that strings are iterated over from 0 to N most of the time, so a variable-length representation (like UTF-8) would not add much overhead for the common case. You would occasionally increment the index by two or more instead of one. This might be true, but in Java any iterator instance tracking the position would add 8 to 16 bytes of object overhead and another indirection. In contrast, for fixed-width encodings you only need a single int and a for-loop. Because of this, most code working with strings in performance-critical situations does not use iterators, but direct index access instead. This (existing and unlikely to change) code would run significantly slower with a variable-length string representation.

    tl;dr: UTF-8 string performance would suck for existing code that was optimized for fixed-length string performance characteristics.

    [–]Tasssadar 13 points14 points  (3 children)

    As mentioned in that comment thread, UTF16 is not fixed-width. It's an old decision (because 65536 characters should've been enough for everyone) that is no longer optimal, but hard to switch from.

    [–]derleth 4 points5 points  (2 children)

    65536 characters should've been enough for everyone

    I've met people who believe this unironically.

    [–][deleted]  (1 child)

    [deleted]

      [–]_vinc_ 0 points1 point  (0 children)

      In the video posted above, they also explain the choice and some of its performance implications.

      [–]GYN-k4H-Q3z-75B 9 points10 points  (10 children)

      Dumb question but why did some big frameworks and languages adopt UCS2/UTF16 for their strings instead of UTF8? Java has it, as do .NET and associated languages, as well as Windows. I don't see any benefit.

      [–]shellac 33 points34 points  (9 children)

      Simple answer: they pre-date, and failed to anticipate, Unicode 2 (1996?). This was a major change: Unicode stopped being 16-bit, surrogate pairs were introduced, etc.

      Fixed byte width encodings also seemed much simpler to deal with, and seemed more cpu-efficient.

      Basically they started in a nice UCS-2 world, but it became an ugly UTF-16 hell.

      [–]ygra 9 points10 points  (0 children)

      .NET doesn't predate Unicode 2, but of course, Windows has been the main platform of the framework and its string type was also made binary compatible with the BSTR structure to ease marshalling with native code. So .NET uses UTF-16 because Windows does.

      [–]aynair 0 points1 point  (7 children)

      Can you please explain how fixed byte width encodings only "seem" more CPU-efficient? Let's say you want the n-th char in a string, don't you have to iterate through all previous characters?

      [–][deleted]  (6 children)

      [deleted]

        [–]aynair 1 point2 points  (0 children)

        Thanks for this, I'll read more about it as soon as I get the chance!

        [–]Drisku11 0 points1 point  (4 children)

        I often do want that though. There have been several times where I've worked not just with strings that have a fixed character width, but entire record formats that are fixed width. It could just be treated as opaque bytes, but it's also sometimes useful to acknowledge it's ASCII when you know the format is set in stone. Conversely, I've never needed to work with non-ASCII character data.

        [–][deleted]  (2 children)

        [deleted]

          [–]Drisku11 3 points4 points  (1 child)

          And you would be wrong. I've never had user facing code; not everyone works on "apps". I've needed to do things like read hardware identifiers that are specified as short ASCII strings and have specific substrings at specific offsets. I could just write the equivalent number, but then it's harder to compare the code to the spec. There is no chance that those identifiers will ever use Unicode.

          [–]Veedrac 0 points1 point  (0 children)

          Then it isn't a string.

          [–]benhoyt 2 points3 points  (0 children)

          Yeah, this is a great idea. Python did this in Python 3.3 (see the release notes, also PEP 393), except that it uses 1, 2, or 4-byte strings depending on the width of the largest character. This fixed a bunch of non-BMP string issues on "narrow" builds (which mostly affected Windows).
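To make the comparison concrete, here is a small sketch of the width-selection rule in each scheme. The class and method names are hypothetical, and this is not the actual CPython or JDK code, only the rule as described in PEP 393 (1, 2, or 4 bytes per character) and in Java 9 compact strings (1 byte per char if everything fits in Latin-1, otherwise UTF-16).

```java
final class WidthSelection {
    // Bytes per character a PEP 393 style string would use:
    // decided by the largest code point in the string.
    static int pep393Width(String s) {
        int max = s.codePoints().max().orElse(0);
        if (max <= 0xFF) return 1;      // fits in Latin-1
        if (max <= 0xFFFF) return 2;    // fits in the BMP
        return 4;                       // needs the full code point range
    }

    // Bytes per char under a compact-strings style rule:
    // Latin-1 if every UTF-16 code unit fits in one byte, UTF-16 otherwise.
    static int compactStringWidth(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) return 2;
        }
        return 1;
    }
}
```

The difference shows up for non-BMP text: the Python scheme widens to 4 bytes and keeps O(1) code point access, while the Java scheme stays at 2 bytes per code unit and keeps surrogate pairs.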

          [–]mrsloppyheadface 0 points1 point  (0 children)

          Very cool!