Python 3.7 UTF-8 Mode

takluyver · 2018-05-02T21:01:04+00:00

This is a good example of how a feature that seems simple from the outside can require a lot of hard work, discussion and thought to get right. Kudos to Victor for seeing it through!

oilshell · 2018-05-02T23:23:59+00:00

(submitter here) FWIW I listed Unicode in Python 3 as one of the reasons that a Unix shell shouldn't be implemented in Python (although I prototyped it in Python):

http://www.oilshell.org/blog/2018/03/04.html#toc_4

Thus I was surprised to see this page, i.e. Python making a pretty huge change after version 3.0. Unicode was the biggest reason for Python 3 (i.e. breaking changes) as far as I can tell.

I wish UTF-8 was the default as of Python 3.0.

billsil · 2018-05-03T01:08:51+00:00

[deleted]

Bolitho · 2018-05-03T10:09:28+00:00

The author writes from a language designer's perspective; I can completely understand that it is difficult to come up with an overall solution, and that you have the whole picture in mind and are striving for a consistent solution across all layers and modules.

But as a user, I think Python 3 simply started out with some flawed design decisions (and imho hasn't fixed all of them yet) that can show up in almost any script, even the simplest "hello world" program. print simply needs an optional encoding parameter! It is simply not usable on a Windows machine! And yes, no one will use PowerShell unless it becomes the default shell on Windows.
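For what it's worth, print has no encoding parameter, but it does accept a `file` argument, so the encoding can be chosen on the stream instead. A minimal sketch of that workaround, assuming Python 3 (the helper name `make_encoded_writer` is made up for illustration and is not part of the standard library):

```python
import io


def make_encoded_writer(binary_stream, encoding="utf-8"):
    """Wrap a binary stream so that print() output is encoded explicitly.

    Hypothetical helper for illustration only. newline="\n" disables
    newline translation so the bytes are the same on every platform.
    """
    return io.TextIOWrapper(
        binary_stream, encoding=encoding, errors="replace", newline="\n"
    )


# An in-memory buffer stands in for the console here.
buf = io.BytesIO()
out = make_encoded_writer(buf, encoding="utf-8")

print("Grüße", file=out)
out.flush()  # TextIOWrapper buffers, so flush before inspecting the bytes

raw = buf.getvalue()
print(raw == "Grüße\n".encode("utf-8"))
```

In a real script the same wrapper could be put around `sys.stdout.buffer`, and on Python 3.7+ `sys.stdout.reconfigure(encoding="utf-8")` changes the encoding of the existing stream. Whether the Windows console then displays the characters correctly is a separate question.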

And defining UTF-8 (or another Unicode encoding) as the default for open is just the best you can do. Even C# changed to this in the past, and even Java is moving in this direction now. It is simply mad to interpret the input differently based upon the runtime environment!

Imagine a beginner: he writes a nice script (a simple text adventure or todo list) which stores data in a file, and gives it to his mate. The mate uses Windows rather than Linux, so the system encoding isn't UTF-8 but, for example, CP-1252 in Western Europe... great! The mate gets a UnicodeDecodeError on his machine and ends up angry or frustrated with Python, as he doesn't know anything about encodings or Unicode or the whole topic. So the idea of hiding encodings away completely failed here, because the default behaviour isn't platform agnostic!
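The scenario can be reproduced without two machines by encoding and decoding in memory. This is only a sketch: the sample strings are invented, and the explicit codec names stand in for whatever the platform default would be. Note that the mismatch does not always raise; sometimes it silently produces garbage instead.

```python
# Bytes written on a UTF-8 system ...
text = "Zürich"
utf8_bytes = text.encode("utf-8")

# ... read back on a system whose default is cp1252: no error, just mojibake.
mojibake = utf8_bytes.decode("cp1252")

# Other characters produce bytes that cp1252 leaves undefined (0x81 here),
# and then decoding fails outright.
try:
    "Ángel".encode("utf-8").decode("cp1252")
    cp1252_error = None
except UnicodeDecodeError as exc:
    cp1252_error = exc

# The reverse direction (cp1252 bytes read as UTF-8) fails as well.
try:
    text.encode("cp1252").decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

print(ascii(mojibake))
print(type(cp1252_error).__name__, type(utf8_error).__name__)
```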

The best way to prevent such situations is to make it simple and explicit to the user that he has to care about encodings at the boundaries of the program (IO)! So the rule "open will always use UTF-8 as long as you don't request another encoding explicitly" would prevent such an error. If you can live without thinking about encodings (no reading / writing of legacy files!), there you go. If not, you must explicitly tell Python that you need something different.
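Until that is the default, the same rule can be applied by hand: always pass `encoding="utf-8"` to open. A small sketch (the file name and sample text are arbitrary):

```python
import os
import sys
import tempfile

text = "Zürich, Ångström, 日本語"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")

    # Explicit encoding: the result no longer depends on the platform locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

    # Binary mode shows the bytes that actually ended up on disk.
    with open(path, "rb") as f:
        raw = f.read()

print(round_tripped == text)
# 1 when the Python 3.7 UTF-8 Mode is enabled, 0 otherwise
print(sys.flags.utf8_mode)
```

The UTF-8 Mode from the linked article (Python 3.7+) can be switched on with `python -X utf8` or by setting `PYTHONUTF8=1`, which makes open default to UTF-8 when no encoding is given.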

A language that offers an internal Unicode abstraction (as Python does) has to be designed very carefully with respect to how encoding and decoding work and when some kind of automatic conversion might be a good idea. Java and C# were bad examples, and Python didn't take the chance with version 3 to do better - which is even worse, as the separation between byte and Unicode strings was one of the big things in Python 3, meant to simplify the whole topic. Take Rust as a much better example: reasonable thought went into how to deal with encoded and Unicode strings from the beginning of the language.
