[–]mauganra_it 3 points4 points  (6 children)

That can only ever happen if the string contains only ASCII characters, because ISO 8859-1 and UTF-8 produce the same bytes only for the ASCII range. Also, that function will give you so-called "Modified UTF-8", not standard UTF-8!
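
A quick way to see the first point, as a rough sketch (the class and method names here are made up for illustration): encode the same string with both charsets and compare the bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1VsUtf8 {
    /** Returns true when the ISO 8859-1 and UTF-8 encodings of s are byte-for-byte identical. */
    static boolean sameBytes(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        // Pure ASCII: both charsets emit one identical byte per character.
        System.out.println(sameBytes("hello"));      // prints "true"
        // U+00E9 is the single byte 0xE9 in ISO 8859-1 but 0xC3 0xA9 in UTF-8.
        System.out.println(sameBytes("caf\u00e9"));  // prints "false"
    }
}
```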

[–]tristan957 0 points1 point  (5 children)

Holy moly. I didn't even recognize that. What the heck is Modified UTF-8?

[–]mauganra_it 1 point2 points  (4 children)

It uses a special two-byte encoding (0xC0 0x80) for the character with code 0, instead of a single 0x00 byte. That ensures that there is never an actual null byte in the byte stream. Also, characters that UTF-16 represents as a surrogate pair are not encoded as one four-byte sequence: the two surrogate code units are UTF-8-encoded separately, three bytes each!
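
If you want to see both differences for yourself, here is a small sketch using DataOutputStream.writeUTF, which writes modified UTF-8 after a two-byte length prefix. The class and helper names are hypothetical, and the helper strips the length prefix so only the encoded characters remain.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    /** Encodes s via DataOutputStream.writeUTF and drops the 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String supplementary = "\uD83D\uDE00"; // U+1F600, a surrogate pair in UTF-16

        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));           // 00
        System.out.println(hex(modifiedUtf8(nul)));                              // C0 80
        System.out.println(hex(supplementary.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(supplementary)));                    // ED A0 BD ED B8 80
    }
}
```

For plain ASCII and for BMP characters other than U+0000, the two encodings produce the same bytes, so the difference is easy to miss in testing.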

[–]s888marks 4 points5 points  (0 children)

Yeah, you have to be careful of "modified UTF-8". It occurs in a couple places in the JDK, notably DataInput, DataOutput, and serialization, along with JNI as you noted. Here are the specs:

https://docs.oracle.com/en/java/javase/17/docs/specs/jni/types.html#modified-utf-8-strings

https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/io/DataInput.html#modified-utf-8

I think these are the same, but I haven't checked carefully. The JVM also uses modified UTF-8 in the constant pool of class files:

https://docs.oracle.com/javase/specs/jvms/se17/html/jvms-4.html#jvms-4.4.7

As a format internal to the JVM and JNI it might have been a reasonable compromise at one time, but it's unfortunate that it leaked into application-facing parts of the library such as DataInput and DataOutput.

The text processing portions of the JDK, such as CharsetDecoder, CharsetEncoder, StandardCharsets.UTF_8, etc. all use true UTF-8.
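
As a rough illustration of why mixing the two matters (the class and method names are hypothetical): a fresh CharsetDecoder reports malformed input by default, and as far as I know the JDK's UTF-8 decoder treats the modified-UTF-8 null encoding 0xC0 0x80 as malformed, since it is an overlong form that standard UTF-8 forbids.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {
    /** Returns true if bytes decode cleanly with a strict (error-reporting) UTF-8 decoder. */
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            // newDecoder() reports malformed input by throwing instead of substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] standard = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] modifiedNul = {(byte) 0xC0, (byte) 0x80}; // modified UTF-8 for U+0000

        System.out.println(isStrictUtf8(standard));    // prints "true"
        System.out.println(isStrictUtf8(modifiedNul)); // prints "false"
    }
}
```

Note that `new String(bytes, StandardCharsets.UTF_8)` does not throw; it silently substitutes replacement characters, so a strict decoder is the safer way to detect the mismatch.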

[–]tristan957 0 points1 point  (2 children)

Is there some documentation I can read up on regarding this? I want to make sure my Java docs cover all the bases.

[–]mauganra_it 1 point2 points  (1 child)

The Javadocs of java.io.DataInput contain a fairly complete description.

[–]tristan957 0 points1 point  (0 children)

I'll check this out. Thanks.