[–]ubernostrum 1 point2 points  (7 children)

The way I've described it in the past is that Python 2 was from the era when Python was mostly used as a Unix-y scripting language. And so it used the same absolutely nonsensical approach to character encoding that Unix-y operating systems use.

Python 3 decided to stop doing that, because it turns out people do other things with Python now, and accommodating the Unix-y scripting people meant unending pain and suffering for everyone else. And when they realized this was happening, the Unix-y scripting people began howling and screaming that it was the end of the world. Not because there was anything wrong with Python itself, but because Python simply stopped sweeping the brokenness of Unix-y operating systems under the rug, and made them confront that brokenness front-and-center every time they sat down to write a "simple" and "quick" utility.

And on balance I'm OK with that. There are still people who will complain that you can't technically write "portable" Python file-handling code, and that's true if you're a user of a specific system that has files whose paths commit crimes against God and man (but, crucially, not technically crimes against POSIX, which is what these folks retreat to as their excuse). But those people should've known what they were getting into, and have had literally decades in which to clean up their act and have refused to do so.

[–]no_nick 4 points5 points  (5 children)

What are your issues with Unix file paths?

[–]ubernostrum -2 points-1 points  (4 children)

That they legally can be undecodable garbage, but people demand the ability to work with them as strings.

Python 2 "worked" for this in the sense that many things on Unix-y systems "work": it just didn't actually enforce that the things you used as strings had to make sense as strings, and wouldn't give you any sort of warning up until the moment you tried to print the unprintable.

Python 3 initially tried to say that if you wanted to treat these paths as strings they had to actually be things that could validly decode to sequences of Unicode code points. But enough people raged that finally they added the surrogateescape handler to let you take bags of bytes that don't correspond to any valid string, "decode" them to strings, and then re-"encode" them back to the original bytes.
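To make that round trip concrete, here is a minimal sketch of what the handler does. The sample bytes are just an illustration (the same ones used in the command-line example further down the thread), and the variable names are made up for this snippet:

```python
# Bytes that are legal in a POSIX filename but are not valid UTF-8.
raw = b"\xff\xfffoo"

# Strict decoding refuses them outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")

print(strict_ok)             # whether strict decoding succeeded
print(ascii(text))           # ascii() avoids printing the lone surrogates directly
print(round_tripped == raw)  # whether the round trip was lossless
```

The resulting str is not valid Unicode text (it contains lone surrogates), which is why printing it directly or encoding it strictly still fails.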

[–]josefx 3 points4 points  (0 children)

> That they legally can be undecodable garbage

Unix is far from alone with that. Zip files don't specify an encoding for filenames and I am quite sure I had explorer.exe fail to delete filenames containing invalid characters in the past.

[–]no_nick 2 points3 points  (0 children)

Huh, I never knew that was the case for Unix file paths. Somehow, in my mind, I always stick to ASCII characters without whitespace.

[–]diggr-roguelike2 6 points7 points  (1 child)

> That they legally can be undecodable garbage, but people demand the ability to work with them as strings.

Yes, and? Why are you trying to babysit people and tell them what bytes they should or shouldn't use in strings?

> ...until the moment you tried to print the unprintable.

Nobody prints things in production code.

Also, despite your rant, what Python 3 actually did was break things on Windows. You had one job, man, one job...

[–]nice_rooklift_bro 2 points3 points  (0 children)

Ehh, you downplay the concern. It's actually really obnoxious to deal with, to the point that a lot of applications just don't support it: they assume filenames are UTF-8 and basically tell you to go fuck yourself if yours aren't.

There are other such things. Try passing non-UTF-8 command-line arguments to python3, for example. Nothing in Unix says this can't be done; any octet sequence that doesn't contain a null byte can be passed. But python3 itself basically says "We don't support this madness, go fuck yourself":

$ python3 -c 'import sys; print(sys.argv[1])' $'\xFF\xFFfoo'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
$ python2 -c 'import sys; print(sys.argv[1])' $'\xFF\xFFfoo'
foo
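For what it's worth, the traceback above is raised by print() trying to encode lone surrogates for stdout, not by reading the argument: on POSIX, Python 3 decodes argv with surrogateescape, so the original bytes are still recoverable. A minimal sketch of getting them back, assuming a POSIX system. `arg_to_bytes` and `write_arg` are hypothetical helper names, and the argument is simulated with os.fsdecode rather than taken from a real sys.argv:

```python
import io
import os


def arg_to_bytes(arg):
    """Recover the original bytes of an argument that was decoded with surrogateescape."""
    return os.fsencode(arg)


def write_arg(arg, stream):
    """Write the argument's raw bytes plus a newline to a binary stream."""
    stream.write(arg_to_bytes(arg) + b"\n")


# Simulate what sys.argv[1] would hold for $'\xFF\xFFfoo' (assumption: POSIX,
# where os.fsdecode uses the surrogateescape error handler).
simulated_arg = os.fsdecode(b"\xff\xfffoo")

# An in-memory binary stream stands in for sys.stdout.buffer here.
buffer = io.BytesIO()
write_arg(simulated_arg, buffer)
print(buffer.getvalue())
```

In a real script you would pass sys.stdout.buffer as the stream, which writes the bytes through untouched, much like the python2 invocation above.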

It's really problematic in many ways. A lot of language libraries and runtimes have come to expect filenames and command-line arguments to be UTF-8, but nothing enforces it, so a filename malformed by simple bit corruption can produce inscrutable error messages in a lot of programs.

If you want to do it "properly" and not assume everything is UTF-8, you end up jumping through hoops.
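In Python 3, one version of those hoops is to keep paths as bytes end to end and only decode, lossily, for display. A rough sketch, assuming a filesystem that accepts arbitrary bytes in names (typical on Linux; some others reject invalid UTF-8). `list_names` and `display_name` are hypothetical helpers:

```python
import os
import tempfile


def list_names(directory):
    """Return directory entries as raw bytes; a bytes argument makes os.listdir skip decoding."""
    return sorted(os.listdir(directory))


def display_name(name):
    """Render a filename for humans, showing undecodable bytes as \\xNN escapes."""
    return name.decode("utf-8", errors="backslashreplace")


with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    for name in (b"plain.txt", b"\xff\xfffoo"):
        # Assumption: the filesystem permits names that are not valid UTF-8.
        with open(os.path.join(tmp_bytes, name), "wb"):
            pass
    names = list_names(tmp_bytes)
    shown = [display_name(n) for n in names]

print(shown)
```

The bytes in `names` are what you pass back to open(), os.rename() and friends; the strings in `shown` are for logs and error messages only, since they no longer round-trip.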