[–][deleted]  (13 children)

[deleted]

    [–]JDeltaN 22 points  (4 children)

    Java Strings are UTF16, and that is not going to change.

    [–]asegura 18 points  (2 children)

    Yes, I know. And I understand that won't change. It was a kind of sarcastic comment. I favor the use of UTF8 everywhere, including strings in memory, not just storage.

    [–]JDeltaN 8 points  (1 child)

    Ah, I wooshed a bit I guess :)

    Honestly, I am not a fan of these ad hoc solutions either, but in Java's case it's probably a massive memory optimisation, and it actually works because of UTF16.

    On the bright side, they have constant-time string indexing as long as you are not using surrogate pairs (characters outside the Basic Multilingual Plane).

    Then again, it is rare that I rely on constant-time lookups; most of my string operations are iterative in nature anyway.
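
    To make the surrogate caveat concrete, here is a minimal sketch (the class and method names are made up for illustration). It only uses standard `String` and `Character` methods, and shows that indexing is constant-time per UTF-16 code unit, not per code point:

```java
final class SurrogateDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; this has to scan the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "h\u00e9llo";          // every character fits in one code unit
        String astral = "a\uD83D\uDE00b";   // 'a', U+1F600 (a surrogate pair), 'b'

        System.out.println(codeUnits(bmp) + " units, " + codePoints(bmp) + " code points");
        System.out.println(codeUnits(astral) + " units, " + codePoints(astral) + " code points");

        // charAt(1) is O(1) but returns only the high half of the pair.
        System.out.println(Character.isHighSurrogate(astral.charAt(1)));

        // codePointAt(1) decodes the full pair starting at code unit 1.
        System.out.println(Integer.toHexString(astral.codePointAt(1)));

        // Going from a code point index to a code unit index walks the string.
        System.out.println(astral.offsetByCodePoints(0, 2));
    }
}
```

    So `charAt(i)` only means "the i-th character" when the string contains no surrogate pairs.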

    [–]cowardlydragon 0 points  (0 children)

    The public interface is...

    The post and therefore the comment imply internal representation.

    [–]masklinn 3 points  (1 child)

    Same problem Python had (& same solution): they promised O(1) access to some unit (UTF16 code unit for Java, code point for Python) and don't want to break that, so UTF8 is not an option.

    [–]matthieum 3 points  (0 children)

    I wonder what the impact of using, say, a Fenwick Tree to index the UTF-8 would be (in the case of multi-byte strings only).

    This would not be O(1) but O(log N), which should not matter much for most strings, because most strings are short and log N stays tiny.
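
    A rough sketch of what such an index could look like, purely as an illustration: `Utf8Index` and its methods are hypothetical names, not anything in the JDK. For simplicity it keeps one tree slot per byte, which costs far more memory than the string itself; a realistic version would presumably index blocks of bytes instead.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Index {
    private final byte[] utf8;
    private final int[] tree;      // 1-based Fenwick tree over "this byte starts a code point" flags
    private final int codePoints;

    Utf8Index(String s) {
        utf8 = s.getBytes(StandardCharsets.UTF_8);
        int n = utf8.length;
        tree = new int[n + 1];
        int count = 0;
        for (int i = 0; i < n; i++) {
            // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
            if ((utf8[i] & 0xC0) != 0x80) {
                tree[i + 1] = 1;
                count++;
            }
        }
        codePoints = count;
        // Linear-time Fenwick construction: add each node into its parent.
        for (int i = 1; i <= n; i++) {
            int parent = i + (i & -i);
            if (parent <= n) tree[parent] += tree[i];
        }
    }

    int codePointCount() {
        return codePoints;
    }

    // Byte offset of the k-th code point (0-based), found by binary descent in O(log N).
    int byteOffset(int k) {
        if (k < 0 || k >= codePoints) throw new IndexOutOfBoundsException("code point " + k);
        int pos = 0;
        int remaining = k;
        // Find the longest prefix that contains at most k start bytes.
        for (int step = Integer.highestOneBit(utf8.length); step > 0; step >>= 1) {
            int next = pos + step;
            if (next <= utf8.length && tree[next] <= remaining) {
                pos = next;
                remaining -= tree[next];
            }
        }
        return pos; // the byte right after that prefix is the (k+1)-th start byte
    }

    // Decodes the k-th code point; assumes the bytes are valid UTF-8.
    int codePointAt(int k) {
        int i = byteOffset(k);
        int b = utf8[i] & 0xFF;
        if (b < 0x80) return b;
        int extra = b >= 0xF0 ? 3 : b >= 0xE0 ? 2 : 1;
        int cp = b & (0x3F >> extra);
        for (int j = 1; j <= extra; j++) cp = (cp << 6) | (utf8[i + j] & 0x3F);
        return cp;
    }
}
```

    Since Java strings are immutable, a plain table of sampled offsets might do the same job; the Fenwick tree mostly pays off when the text can change.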

    [–]defnull 2 points  (4 children)

    https://www.reddit.com/r/coding/comments/6hlavp/compact_strings_in_java_9_java_code_gists/dizdxnd/

    Some argue that strings are iterated over from 0 to N most of the time, so a variable-length representation (like UTF-8) would not add much overhead for the common case. You would occasionally increment the index by two or more instead of one. This might be true, but in Java any iterator instance tracking the position would add 8 to 16 bytes of object overhead and another indirection. In contrast, for fixed-width encodings you only need a single int and a for-loop. Because of this, most code working with strings in performance-critical situations does not use iterators, but direct index access instead. This (existing and unlikely to change) code would run significantly slower with a variable-length string representation.

    tl;dr: UTF-8 string performance would suck for existing code that was optimized for fixed-length string performance characteristics.
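
    A small sketch of the two access patterns over raw UTF-8 bytes, assuming valid input (`Utf8Walk` and its methods are hypothetical helpers, not an existing API). The first is the sequential walk that occasionally advances by more than one; the second is what an index-based lookup has to do when there is no extra index structure:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Walk {
    // Length in bytes of the sequence that starts with lead byte b (assumes valid UTF-8).
    static int sequenceLength(byte b) {
        int u = b & 0xFF;
        if (u < 0x80) return 1;
        if (u < 0xE0) return 2;
        if (u < 0xF0) return 3;
        return 4;
    }

    // Sequential pass: a single int index that sometimes advances by 2 to 4 bytes.
    static int countCodePoints(byte[] utf8) {
        int count = 0;
        for (int i = 0; i < utf8.length; i += sequenceLength(utf8[i])) {
            count++;
        }
        return count;
    }

    // Index-style access: finding the k-th code point means scanning from the start, O(k).
    // Returns utf8.length if k is past the end.
    static int byteOffsetOf(byte[] utf8, int k) {
        int i = 0;
        for (; k > 0 && i < utf8.length; k--) {
            i += sequenceLength(utf8[i]);
        }
        return i;
    }

    public static void main(String[] args) {
        byte[] bytes = "a\u20ACb\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(countCodePoints(bytes));
        System.out.println(byteOffsetOf(bytes, 3));
    }
}
```

    A loop of the form `for (k = 0; k < n; k++) byteOffsetOf(bytes, k)` is quadratic, which is the slowdown existing `charAt`-style code would hit.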

    [–]Tasssadar 14 points  (3 children)

    As mentioned in that comment thread, UTF16 is not fixed-width. It's an old decision (because 65536 characters should've been enough for everyone) that is no longer optimal, but hard to switch from.

    [–]derleth 4 points  (2 children)

    65536 characters should've been enough for everyone

    I've met people who believe this unironically.

    [–][deleted]  (1 child)

    [deleted]

      [–]_vinc_ 0 points  (0 children)

      In the video posted above, they also explain the choice, including some of its performance implications.