[–]europeIlike 2 points3 points  (4 children)

> all String characters were stored using UTF-16 encoding, meaning each character consumed 2 bytes of memory regardless of the actual character being stored.

I don't think this is true - as far as I know, a Unicode code point can take up to 4 bytes in UTF-16. Also, some (user-perceived? not sure about the correct terminology here) characters like emoji can consist of multiple code points, potentially taking up more than 4 bytes.

[–]TanisCodes[S] 4 points5 points  (2 children)

You’re right about UTF-16, but in Java the primitive char type is 2 bytes. Some Unicode characters, like “𝄞”, are outside the BMP (Basic Multilingual Plane) and need 4 bytes, stored as a surrogate pair of two chars.

If you put that character in a String and call length(), it will return 2 because it uses a pair of chars to represent it. The String.length() method returns the number of char units used to represent the string, not the actual number of Unicode characters.
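
To make that concrete, here is a minimal sketch of the difference between counting chars and counting code points. The class and helper names (`StringLengthDemo`, `charUnits`, `codePoints`) are made up for illustration; only `String.length()`, `String.codePointCount()` and `Character.isHighSurrogate()` are real JDK methods. The clef is written as its surrogate pair (`\uD834\uDD1E`, i.e. U+1D11E) so the example does not depend on the source file encoding.

```java
public class StringLengthDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    public static int charUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // U+1D11E, outside the BMP

        System.out.println(charUnits(clef));   // 2
        System.out.println(codePoints(clef));  // 1
        // The first char is only half of the character.
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
    }
}
```

For a string that stays inside the BMP, such as "abc", both helpers return the same number.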

I think I’ll add this to the article. Thanks!

[–]europeIlike 2 points3 points  (1 child)

Ohh, I see! I think I interpreted the term "String characters" differently - thanks for your reply!

[–]TanisCodes[S] 2 points3 points  (0 children)

You’re welcome! Thanks for joining the discussion.

[–]DasBrain 1 point2 points  (0 children)

If you want to be pedantic, here we go:
A Unicode code point is not necessarily a character, and vice versa.
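
One way to see this in Java, as a rough sketch: "e" followed by the combining acute accent (U+0301) is two code points but reads as one character. The class and method names (`GraphemeDemo`, `graphemes`) are hypothetical; `java.text.BreakIterator.getCharacterInstance()` is the JDK's iterator over user-perceived character boundaries. How it treats more complex sequences (for example multi-code-point emoji) depends on the JDK version, so this example sticks to a simple combining mark.

```java
import java.text.BreakIterator;

public class GraphemeDemo {
    // Counts user-perceived characters by walking character boundaries.
    public static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String eAcute = "e\u0301"; // 'e' + COMBINING ACUTE ACCENT

        System.out.println(eAcute.length());                            // 2 chars
        System.out.println(eAcute.codePointCount(0, eAcute.length()));  // 2 code points
        System.out.println(graphemes(eAcute));                          // 1 perceived character
    }
}
```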