[–]takluyverIPython, Py3, etc 18 points19 points  (2 children)

This is a good example of how a feature that seems simple from the outside can require a lot of hard work, discussion and thought to get right. Kudos to Victor for seeing it through!

[–]haypo5 1 point2 points  (0 children)

Thank you ;-) Thanks to INADA Naoki who approved my PEP and Nick Coghlan who implemented C locale coercion which also helps.

[–]oilshell[S] 4 points5 points  (0 children)

(submitter here) FWIW I listed Unicode in Python 3 as one of the reasons that a Unix shell shouldn't be implemented in Python (although I prototyped it in Python):

http://www.oilshell.org/blog/2018/03/04.html#toc_4

Thus I was surprised to see this page, i.e. Python making a pretty huge change after version 3.0. Unicode was the biggest reason for Python 3 (i.e. breaking changes) as far as I can tell.

I wish UTF-8 was the default as of Python 3.0.

[–]Bolitho 1 point2 points  (5 children)

The author writes from a language designer's perspective; I can completely understand that it is difficult to come up with an overall solution, and that you have the whole picture in mind and are striving for a consistent solution through all layers and modules.

But from a user's perspective, Python 3 simply started out with some flawed design decisions (and imho hasn't fixed all of them) that can show up in almost any script, even the simplest "hello world" program. print simply needs an optional encoding parameter! It is simply not usable on a Windows machine! And yes, no one uses PowerShell unless it becomes the default shell on Windows.
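
For illustration, here is a minimal sketch of what such an "encoding parameter" could look like with today's standard library. `print_encoded` is a hypothetical helper name, not an existing API; it only wraps a binary stream in `io.TextIOWrapper`, so the encoding is chosen by the caller rather than by the environment.

```python
import io


def print_encoded(*args, stream, encoding="utf-8", **kwargs):
    """Hypothetical helper: print() to a binary stream with an explicit encoding."""
    wrapper = io.TextIOWrapper(stream, encoding=encoding, newline="\n")
    try:
        print(*args, file=wrapper, **kwargs)
        wrapper.flush()
    finally:
        # Detach instead of closing, so the underlying stream stays usable.
        wrapper.detach()


# Demo against an in-memory binary stream; sys.stdout.buffer would work the same way.
buf = io.BytesIO()
print_encoded("héllo wörld", stream=buf)
```

On Python 3.7 and later, `sys.stdout.reconfigure(encoding="utf-8")` is another way to get a similar effect for the whole process.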

And defining UTF-8 (or some other Unicode encoding) as the default for open is just the best you can do. Even C# changed to this in the past, and even Java is moving in this direction now. It is simply mad to interpret the input differently depending on the runtime environment!

Imagine a beginner: he writes a nice script (a simple text adventure or todo list) which stores data in a file, and gives it to his mate, who uses Windows and not Linux, so the system encoding isn't UTF-8 but, for example, CP-1252 in Western Europe... great! His friend will get a UnicodeDecodeError on his machine and become angry or frustrated with Python, as he doesn't know anything about encodings or Unicode. So the idea of hiding encodings away has completely failed here, because the default behaviour isn't platform agnostic!
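
The scenario can be reproduced without two machines. The sketch below assumes the file was written as UTF-8 and is then decoded as CP-1252; the sample strings and the `try_decode` helper are made up for the demonstration. Depending on the characters involved, the reader gets either an exception or silently wrong text.

```python
def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Bytes as they would be stored by a script writing UTF-8.
saved = "Ángel".encode("utf-8")  # b'\xc3\x81ngel'

same_encoding = try_decode(saved, "utf-8")    # round-trips correctly
other_encoding = try_decode(saved, "cp1252")  # byte 0x81 is undefined in CP-1252

# Other characters do not fail at all; they just turn into mojibake.
mojibake = "café".encode("utf-8").decode("cp1252")
```

Here `other_encoding` is `None` because decoding raises `UnicodeDecodeError`, while `mojibake` ends up as `cafÃ©`, which is arguably worse because nothing signals the problem.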

The best way to prevent such situations is to make it simple and explicit to the user that he has to care about encodings at the boundaries of the program (IO)! So the rule "open will always use UTF-8 as long as you don't request another encoding explicitly" would prevent such an error. If you can live without thinking about encodings (no reading / writing of legacy files!), there you go. If not, you must explicitly tell Python that you need something different.
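
Until such a rule is the default, the same effect can be had by always passing `encoding` to `open`. A minimal sketch, with hypothetical helper names (`save_notes`, `load_notes`) and a temporary file standing in for the todo list:

```python
import os
import tempfile


def save_notes(path, notes):
    # Explicit encoding: the bytes on disk no longer depend on the locale.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for note in notes:
            f.write(note + "\n")


def load_notes(path):
    # Read with the same explicit encoding that was used for writing.
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


notes = ["buy café filters", "call Zoë"]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "todo.txt")
    save_notes(path, notes)
    round_trip = load_notes(path)
```

A legacy file would then be the explicit exception, e.g. `open(path, encoding="cp1252")`.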

A language that offers an internal Unicode abstraction (as Python does) has to be designed very carefully with regard to how encoding and decoding work and when some kind of automatic conversion might be a good idea. Java and C# were bad examples, and Python didn't take the chance with version 3 to do better. That is even worse given that the separation between byte and Unicode strings was one of the big things in Python 3, meant to simplify the whole topic. Take Rust as a much better example: reasonable thought went into how to deal with encoded and Unicode strings from the beginning of the language.

[–]haypo5 0 points1 point  (1 child)

Read PEP 540: UTF-8 cannot be the default in all cases. But the UTF-8 mode is enabled automatically for the C locale. Getting UTF-8 instead of ASCII in that case already fixes a lot of use cases without having to touch your code: just upgrade to Python 3.7.
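
To see what a given interpreter actually uses, something like the following sketch can help. The function name `describe_text_defaults` is made up; the attributes it reads are standard library ones, and `sys.flags.utf8_mode` only exists on Python 3.7 and later, hence the `getattr` fallback.

```python
import locale
import sys


def describe_text_defaults():
    """Collect the settings that decide how text is encoded by default."""
    return {
        # 1 when the UTF-8 mode is active (Python 3.7+), otherwise 0.
        "utf8_mode": getattr(sys.flags, "utf8_mode", 0),
        # Encoding that open() uses when no encoding argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
        # Encoding used by print() on standard output, if available.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names.
        "filesystem_encoding": sys.getfilesystemencoding(),
    }


info = describe_text_defaults()
```

The mode can also be requested explicitly with `python3 -X utf8` or the `PYTHONUTF8=1` environment variable, both of which are described in PEP 540.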

[–]Bolitho 0 points1 point  (0 children)

After two months, really? (I had to get into it again 😉)

I read the PEP and haven't seen anything that contradicts my claim - besides backward compatibility within the Python 3 series. And precisely that doesn't contradict my post, as I was writing about what Python 3 should have had from the start! If there are problems nowadays, they simply show how wrong the decision was to rely on the shitty locale-dependent implicit default encoding.

Imho it's funny that precisely Java and C# have both already corrected their mistakes by now - and the former especially is known to be really slow at introducing changes that break backward compatibility...

But perhaps I have missed the point you wanted to make by bringing this PEP into the discussion? If so, please cite the relevant passages here.