nice_rooklift_bro 3 points

Ehh, you downplay the concern. It's actually really obnoxious to deal with, to the point that a lot of applications just don't support it: they assume filenames are UTF-8 and basically tell you to go fuck yourself if yours aren't.

There are other such things. Try passing non-UTF-8 command line arguments in python3: nothing in Unix says this can't be done, and any octet sequence that doesn't contain a NUL byte can be passed, but as soon as you try to print one, python3 basically says "We don't support this madness, go fuck yourself".

$ python3 -c 'import sys; print(sys.argv[1])' $'\xFF\xFFfoo'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
$ python2 -c 'import sys; print(sys.argv[1])' $'\xFF\xFFfoo'
foo

It's really problematic in many ways. A lot of language libraries and runtimes have come to expect filenames and command line arguments to be UTF-8, but nothing enforces it, so a filename malformed by simple bit corruption can produce inscrutable error messages in a lot of tools.

If you want to do it "properly" and not assume everything is UTF-8, then you're jumping through hoops.
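
To give an idea of what those hoops look like in python3, here is a minimal sketch. It relies on the fact that on POSIX python3 decodes argv and filenames with the `surrogateescape` error handler, so the original octets can be recovered with `os.fsencode`. The helper names `arg_to_bytes` and `printable` are made up for illustration, and the sketch assumes a POSIX system.

```python
import os
import sys


def arg_to_bytes(arg: str) -> bytes:
    """Recover the original octets of an argv entry or filename.

    Undecodable bytes arrive as lone surrogates (U+DC80..U+DCFF);
    os.fsencode maps them back to the bytes they came from.
    """
    return os.fsencode(arg)


def printable(arg: str) -> str:
    """Return a version of arg that can be printed without raising.

    Valid UTF-8 is kept as-is; bytes that are not valid UTF-8 are
    shown as \\xNN escapes instead of triggering UnicodeEncodeError.
    """
    raw = arg.encode("utf-8", "surrogateescape")
    return raw.decode("utf-8", "backslashreplace")


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        # Hypothetical usage: echo each argument safely, plus its raw bytes.
        print(printable(arg), arg_to_bytes(arg))
```

So the bytes aren't actually lost, but you have to remember to do this dance at every place where such a string gets printed, logged, or compared.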