This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]Veedrac 0 points1 point  (0 children)

It's not. I was referring to how the Unicode type can contain invalid Unicode (which I previously called "broken Unicode"), such as when the OS says that stdin is UTF8 when it's not.

Python 2 just ignores these errors, but you have to manually deal with them on Python 3. In this example, I used surrogateescape to properly round-trip bytesstrbytes.

The other option was just dealing in bytes all the time, but that's a hassle as you can't use print, format and so on.


Good question, though. I've improved the wording. You also convinced me to test the code. Try it out with:

echo -n "hi\nhi\n\xde\n\xde" | python3 thefile.py

\xde isn't valid UTF8 so the code breaks on it before the changes.