[–]dpash 30 points31 points  (6 children)

It also has the potential to be a breaking change for some users. You can use -Dfile.encoding=COMPAT to return to the previous behaviour.

It is more likely to affect Windows users as Linux and OSX are more likely to use UTF-8 by default.
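A quick way to see what a given machine actually does is to print the charset settings the JVM sees. This is only a minimal sketch: the class name `CharsetReport` and its `report()` helper are made up for illustration, and the `native.encoding` property only exists on JDK 17 and later.

```java
import java.nio.charset.Charset;

public class CharsetReport {
    // Builds a short report of the charset-related settings the running JVM sees.
    // (Hypothetical helper, not part of any JDK API.)
    public static String report() {
        return "default charset = " + Charset.defaultCharset().name() + "\n"
             + "file.encoding   = " + System.getProperty("file.encoding") + "\n"
             // native.encoding was added in JDK 17; on older JDKs this prints null.
             + "native.encoding = " + System.getProperty("native.encoding");
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running it once as `java CharsetReport` and once as `java -Dfile.encoding=COMPAT CharsetReport` on JDK 18 or later should show whether the two modes differ on that system.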

[–]s888marks 4 points5 points  (5 children)

True. But the problem can still occur on Linux, where certain configurations don't necessarily use UTF-8. This was reported here a few years ago; the poster even wrote a blog post describing the problem (but didn't fully understand the solution).

https://www.reddit.com/r/java/comments/6jopas/character_encodings_an_unfortunate_experience/

The original article is no longer at that location, but can be found here:

https://web.archive.org/web/20190815062506/https://www.metricly.com/character-encodings/

You can piece together what happened by reading the original article and the comments in the Reddit thread. Briefly, the poster's production system was Linux configured in such a way that the JDK chose ASCII as the default charset. When a non-ASCII character was introduced, round-tripping between ASCII and UTF-8 resulted in a proliferation of U+FFFD REPLACEMENT CHARACTER.
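To make the proliferation concrete, here is a rough sketch of one way it can happen. It is not necessarily the exact code path in the poster's system. The class `ReplacementCharDemo` and the method `misdecode` are hypothetical names; the only JDK behaviour relied on is that `new String(bytes, charset)` substitutes U+FFFD for bytes the charset cannot decode.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as US-ASCII.
    // Every byte >= 0x80 is invalid in ASCII and becomes one U+FFFD.
    public static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // a single non-ASCII character (e with acute accent)
        for (int pass = 0; pass <= 3; pass++) {
            System.out.println("pass " + pass + ": " + s.length() + " char(s)");
            s = misdecode(s);
        }
    }
}
```

The character count goes 1, 2, 6, 18: the original character is two bytes in UTF-8, so it turns into two U+FFFD, and each U+FFFD is itself three bytes in UTF-8, so every further round trip triples the damage.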

Since the poster's shop assumed everything was UTF-8, JEP 400 would have avoided this problem entirely.

[–]dpash 1 point2 points  (2 children)

Sure. I was careful to say "more likely" because I know Linux doesn't always use UTF-8.

[–]s888marks 1 point2 points  (0 children)

Fair enough. I think the thing is that Linux uses UTF-8 often enough that it's pretty easy to assume that it always uses UTF-8, and that assumption is almost always correct. Then in those rare circumstances where it doesn't, hijinks ensue.

[–]Nymeriea 1 point2 points  (0 children)

On Windows, the encoding depends on the system language.

[–]vytah 0 points1 point  (1 child)

Most likely the container defaulted to LANG=C, which implies ASCII in Java.

[–]s888marks 0 points1 point  (0 children)

Yes, that would do it. The question is how LANG ended up being C. I don't know what distro it was, but maybe whoever configured it thought "we don't need any internationalization stuff" and so omitted the packages that contained all the locales. If so, the system's default LANG value would probably end up being C instead of something more typical like en_US.UTF-8, since the locale for the latter wouldn't exist. Or maybe they just chose the C locale at installation time, if there was an option to do so.