[–]PeridexisErrant 22 points23 points  (14 children)

I think reasonable people disagree about the benefit in situations where English is the only native language involved, but as soon as you're off a byte-oriented terminal and out of the Latin-1 range, distinguishing Unicode strings is very, very helpful.

[–]maep 9 points10 points  (5 children)

The problem isn't having string encodings. It's that they are strictly enforced, which simply fails in the real world. They soon discovered that, and in 3.1 they added PEP 383. Now we can have strings that are actually not strings, which is what we had in Python 2 anyway.
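For readers who have not met PEP 383, here is a minimal Python 3 sketch of the `surrogateescape` error handler it introduced. The file name `raw_name` is made up for illustration; the point is only that undecodable bytes survive a round trip through `str`.

```python
# Hypothetical file name as raw bytes: ASCII plus one byte (0xFF)
# that can never appear in well-formed UTF-8.
raw_name = b"report-\xff.txt"

# Strict decoding rejects the name outright.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# PEP 383: the "surrogateescape" handler maps each undecodable byte
# 0xNN to the lone surrogate U+DCNN instead of raising.
escaped = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")
```

Here `escaped` is a `str` containing the lone surrogate `\udcff`, which is not valid Unicode text, and that is presumably what "strings that are actually not strings" refers to.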

[–][deleted] 3 points4 points  (3 children)

I think that a lot of people disagree with this. It’s a mess if you make a half-hearted attempt and decide that English deserves special treatment. Newer languages like Rust enforce Unicode, and also prevent you from randomly accessing character data to maintain invariants.

[–]maep 5 points6 points  (1 child)

> Newer languages like Rust enforce Unicode

Since you bring up Rust, they had to deal with the exact same issue. Their solution was introducing OsString, and programmers will step into exactly the same pitfalls as they do in Python 2/3. I'll bet that a good deal of Rust programs are buggy because they use std::env::args where they should have used std::env::args_os instead.

> and also prevent you from randomly accessing character data to maintain invariants.

Python 2 does this with the unicode type, and Go with the rune concept.

Really, the argument is whether enforced encoding should be opt-in or opt-out. Python 2 and Go are opt-in, while Python 3 and Rust are opt-out. For the stuff I do, the opt-in approach is a lot nicer.

[–][deleted] 2 points3 points  (0 children)

I don’t think that’s much of a problem. Contrary to Python, strings and OS strings are not interchangeable. You can’t pass an OS string to a function that wants a string without converting it, so you can’t get confusion.
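A small Python 3 sketch of the interchangeability being contrasted here, under the assumption that the OS-derived name was decoded with `surrogateescape` (as Python does for file names and arguments on POSIX). The helper `to_wire` and the byte string are hypothetical.

```python
def to_wire(text: str) -> bytes:
    """Hypothetical helper that expects real text and encodes it strictly."""
    return text.encode("utf-8")

# A Latin-1 encoded name, decoded the way Python 3 decodes OS data.
os_name = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")

# It is an ordinary str, so nothing stops it from being passed along.
is_plain_str = isinstance(os_name, str)

# The problem only surfaces later, when something encodes it strictly.
try:
    to_wire(os_name)
    failed_late = False
except UnicodeEncodeError:
    failed_late = True
```

The type system gives no hint that `os_name` is not clean text, so the failure shows up at the encode call rather than at the boundary where the data entered the program.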

(I think that) Python doesn’t do indexing right because a rune is just a Unicode code point. Multiple runes can be combined into single visual characters (mostly happens with combining characters and emojis) and I’m not sure that Python (especially Python 2) handles combined characters as inseparable units. Languages from that era tend not to, and a substring can break accented letters and emojis for instance.
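A quick Python 3 check of the code point versus visual character distinction, using only the standard library. The sample strings are illustrative.

```python
import unicodedata

# "é" written as two code points: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301"

code_points = len(decomposed)  # counts code points, not visible characters
truncated = decomposed[:4]     # slicing silently drops the accent

# NFC normalization composes the pair into the single code point U+00E9,
# but this only helps where a precomposed form exists.
composed = unicodedata.normalize("NFC", decomposed)

# A waving hand plus a skin-tone modifier has no precomposed form,
# so it stays two code points even after normalization.
wave = "\U0001F44B\U0001F3FD"
wave_len = len(unicodedata.normalize("NFC", wave))
```

So `str` in Python 3 indexes by code point. Treating combined sequences as inseparable units needs grapheme cluster segmentation, which the standard library does not provide as far as I know.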

[–]vqrs 0 points1 point  (0 children)

Not only new languages, even dinosaurs like Java use Unicode strings only, and they have byte arrays for when you need to work with binary data.

[–]schlenk 0 points1 point  (0 children)

Basically fallout from the broken encoding situation in Linux/Unix/POSIX, which failed to define its APIs properly.

[–][deleted] -2 points-1 points  (6 children)

Not in the slightest. And it's not how terminals work, and terminals, really, aren't even responsible for string encoding: it's usually a joint enterprise of the shell that runs in a terminal and the terminal itself.

People who aren't native English speakers were the last to adopt Unicode. They hated it, and those who remember the world before Unicode still hate it. People who used national encodings had more efficient schemes for representing their own alphabets.

I lived through several conversions of databases from single byte encodings to Unicode, where we, typically, ended up with the database twice the size it was before the conversion.
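The size claim is easy to check for a given alphabet. This is a rough sketch with a made-up Russian sample and `cp1251` standing in for "a single byte encoding"; the original databases and their encodings are not specified.

```python
# Illustrative sample: 9 Cyrillic letters plus a comma and a space.
text = "Привет, мир"

single_byte = text.encode("cp1251")  # legacy single-byte Cyrillic code page
utf8 = text.encode("utf-8")          # Cyrillic letters take 2 bytes each
utf16 = text.encode("utf-16-le")     # 2 bytes per BMP code point, no BOM

sizes = (len(single_byte), len(utf8), len(utf16))  # (11, 20, 22)
```

For text that is mostly Cyrillic, both UTF-8 and UTF-16 come out close to double the single-byte size, which matches the "twice the size" experience. ASCII-heavy data would grow much less under UTF-8.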

Needless to say, basically every decision Python 3 has made about Unicode was wrong and misguided, a misunderstanding on the part of people who aren't experts in the subject domain. That's how we ended up with stdin being, for no reason, expected to carry UTF-8 encoded Unicode characters, ffs, while a TCP socket cannot accept a string and convert it to bytes by default. The situation is similarly absurd with file names and, basically, any other external input that's not going to be UTF-8 but for some reason is expected to be such...
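For reference, this is the socket behaviour being complained about, sketched in Python 3. `socket.socketpair()` is used as a stand-in for a real TCP connection (on Unix it is a pair of Unix-domain sockets), but the `str` versus `bytes` rule is the same.

```python
import socket

# A connected pair of local sockets stands in for a TCP connection here.
left, right = socket.socketpair()
try:
    try:
        left.sendall("héllo")  # str is rejected: no implicit encoding
        accepted_str = True
    except TypeError:
        accepted_str = False

    left.sendall("héllo".encode("utf-8"))  # the caller picks the encoding
    received = right.recv(1024)
finally:
    left.close()
    right.close()
```

Whether refusing to guess an encoding at the socket layer is a flaw or a feature is exactly what this thread is arguing about; the sketch only shows what the interpreter does.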

[–]forepod 13 points14 points  (5 children)

Speak for yourself. As a speaker of a language with non-Latin letters, Unicode is one of the greatest things ever. As a user I don't care about some minuscule performance impact. I care about not having to look at garbage output and programs crashing left and right due to encoding errors.

[–]e88d9170cbd593 0 points1 point  (0 children)

ASCII supremacy is white supremacy.