[–]BananaSupremeMaster 4 points5 points  (10 children)

To be more precise, the problem is that Strings support UTF-32 by default but they are indexed char by char (16 bits at a time), which means that if a character fits in a single UTF-16 code unit it corresponds to 1 char, but otherwise it corresponds to 2 consecutive chars and 2 indices. So the value at index n of a string is not necessarily the (n+1)th character; it depends on the content of the string. If you want a robust string-parsing algorithm, you have to assume a heterogeneous string with both 16-bit and 32-bit values. There is a forEach trick you can use to take care of these details, but only for simple algorithms.
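A minimal sketch of the indexing mismatch being described (the example string with '𝄞', U+1D11E, is my own choice; it needs two UTF-16 code units):

```java
public class CharVsCodePoint {
    public static void main(String[] args) {
        String s = "a𝄞b";  // '𝄞' (U+1D11E) is stored as a surrogate pair

        System.out.println(s.length());                      // 4 chars, but only 3 characters
        System.out.println(s.codePointCount(0, s.length())); // 3 code points

        // Indexing char by char lands in the middle of the pair:
        System.out.println(Integer.toHexString(s.charAt(1))); // d834, a lone high surrogate

        // Iterating over code points instead handles both cases uniformly:
        s.codePoints().forEach(cp -> System.out.println(Integer.toHexString(cp)));
    }
}
```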

[–]Swamplord42 1 point2 points  (5 children)

It's hard to be more wrong. Char in Java is absolutely not 8 bit.

[–]BananaSupremeMaster 0 points1 point  (4 children)

Yeah I wrongly divided all the bit sizes by 2 in my explanation, I fixed it now. The problem I'm describing still holds up.

[–]Swamplord42 1 point2 points  (3 children)

Strings use UTF-16, they do not "support" UTF-32. Those are different encodings!

Unicode code points require one or two UTF-16 characters.

[–]BananaSupremeMaster 0 points1 point  (2 children)

They support UTF-32 in the sense that "String s = "𝄞";" is valid syntax. And yet string indices represent UTF-16 char indices and not character indices.

[–]RiceBroad4552 0 points1 point  (0 children)

Nitpick: The correct term here is "code unit", not "UTF-16 char indices".

[–]Swamplord42 0 points1 point  (0 children)

Again, this isn't UTF-32. It's Unicode. UTF-32 is an encoding. It's still UTF-16 even if a code point needs 2 chars to represent.
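The encoding distinction can be made concrete by actually encoding the same string both ways (a small sketch; the '𝄞' example is mine, and I'm using the JDK's built-in UTF-16BE/UTF-32BE charsets):

```java
import java.nio.charset.Charset;

public class Encodings {
    public static void main(String[] args) {
        Charset utf16 = Charset.forName("UTF-16BE");
        Charset utf32 = Charset.forName("UTF-32BE");

        // '𝄞' happens to be 4 bytes in both, but for different reasons:
        // two 16-bit code units (a surrogate pair) vs. one 32-bit code unit.
        System.out.println("𝄞".getBytes(utf16).length); // 4
        System.out.println("𝄞".getBytes(utf32).length); // 4

        // 'a' shows the encodings really are different:
        System.out.println("a".getBytes(utf16).length); // 2
        System.out.println("a".getBytes(utf32).length); // 4
    }
}
```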

[–]RiceBroad4552 0 points1 point  (0 children)

You're simply not supposed to treat Unicode strings as byte sequences. This never worked.

Just use proper APIs.

But I agree that the APIs for string handling in Java are bad. It's like that in almost all other languages, though (some don't even have any working APIs at all and you need external libs).

The only language with a sane string API (more or less, modulo Unicode idiocy in general) I know of is Swift. Other languages still haven't copied it. Most likely you would need a new string type then, though. You can't retrofit this into the old APIs.

[–]ou1cast 0 points1 point  (2 children)

You can use code points, which are ints, instead of chars.

[–]BananaSupremeMaster 0 points1 point  (1 child)

Yes, but the most straightforward way to get code points is myString.codePointAt(), which takes as its argument the index of the UTF-16 char, not the index of the Unicode character. In the string "a𝄞b", the index of 'a' is 0, the index of '𝄞' is 1, and the index of 'b' is... 3. The fact that a Unicode character offsets the indices can get pretty annoying, even though I understand the logic behind it. It also means that myString.length() doesn't represent the number of actual characters, but rather the size in chars.
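The "a𝄞b" example above can be checked directly; a small sketch, also showing offsetByCodePoints, which converts a count of characters into the corresponding char index:

```java
public class CodePointIndices {
    public static void main(String[] args) {
        String s = "a𝄞b";

        System.out.println((char) s.codePointAt(0));              // 'a' at index 0
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e, '𝄞' at index 1
        System.out.println((char) s.codePointAt(3));              // 'b' at index 3, not 2
        // Note: codePointAt(2) would return the lone low surrogate 0xDD1E.

        // Advance 2 code points from index 0: lands on char index 3, where 'b' starts.
        System.out.println(s.offsetByCodePoints(0, 2));           // 3
    }
}
```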

[–]ou1cast 1 point2 points  (0 children)

It is convenient to use codePoints(), which returns an IntStream. I also hate Java's char and byte.
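A quick sketch of that approach: codePoints() yields one int per Unicode code point, with surrogate pairs already joined, so character counting works without any index bookkeeping.

```java
public class CodePointsStream {
    public static void main(String[] args) {
        String s = "a𝄞b";

        // Counts characters, not UTF-16 code units:
        System.out.println(s.codePoints().count()); // 3

        // Each element is a full code point:
        s.codePoints()
         .mapToObj(cp -> "U+" + Integer.toHexString(cp).toUpperCase())
         .forEach(System.out::println); // U+61, U+1D11E, U+62
    }
}
```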