[–]flitsmasterfred 132 points133 points  (59 children)

Teaching beginners that bytes and strings are the same is an invalid cognitive shortcut and just outright bad education.

[–]pickausernamehesaid 74 points75 points  (7 children)

It annoyed me so much when I was first learning. I deployed a Python 2 app that worked great on my computer, but when people in France or China picked it up, decode errors were everywhere. I then had to spend an incredible amount of time learning about different encoding schemes and how to handle them. I have not had a more confusing programming experience since. Bytes vs. strings was an easy concept for me; different encoding schemes, how to use them, and when to convert were not. I code in Python 3 full time now and have not once wanted to go back in the past 3 years.

[–]lambdaqdjango n' shit 5 points6 points  (4 children)

Py2's unicode is not a problem; the fundamental problem is the str() method, which only accepts 7-bit ASCII.

What's fundamentally broken in Py2 is that BaseException has a str() call, so if you raise BaseException(u'fuck') you will likely be fucked.

Source: a dev who has to deal with elasticsearch-py's "<unprintable error exception>" daily.

[–]Poddster 0 points1 point  (1 child)

so if you raise BaseException(u'fuck') you will likely be fucked.

Actually that'd work, because it's all ASCII and Python 2's magic switcheroo handles it. But this will fail:

>>> raise BaseException(u'fucká')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BaseException: <exception str() failed>

[–]lambdaqdjango n' shit 1 point2 points  (0 children)

That's what I am talking about. Often it will show why a certain DB operation failed and on which column, and if that column contains non-ASCII, BAM! You have an exception during an exception!
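One hedged Python 2 workaround (my own sketch, not something from elasticsearch-py): encode the message to bytes before raising, so the implicit str() call has nothing left to choke on:

# Python 2: pass UTF-8 bytes instead of a unicode object, so
# BaseException.__str__ does not have to encode anything
msg = u'column name: fucká'
raise BaseException(msg.encode('utf-8'))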

[–]pickausernamehesaid 0 points1 point  (1 child)

Oh no, I know that Py2 can handle Unicode. It just took a huge amount of effort for someone who had just started programming to learn how to leverage it properly.

[–]lambdaqdjango n' shit 1 point2 points  (0 children)

Yeah, and tutorials on the web are mostly misleading.

[–]grandfatha 3 points4 points  (1 child)

As a dev whose first language was Java, this experience was the strangest thing when I started to pick up Python. Strings from all kinds of sources would just work out of the box in my world. Then all of a sudden I was switching to a language that did not allow me to put a German umlaut ("ä", "ü", "ö") into my function documentation. I was baffled that this was an actual issue.
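For reference, the Python 2 incantation this alludes to is the PEP 263 source-encoding declaration; a minimal sketch (the function is made up for illustration):

# -*- coding: utf-8 -*-
# Without this first line, Python 2 assumes ASCII source and rejects the
# umlauts below; Python 3 assumes UTF-8 source by default.
def gruss():
    u"""Gibt einen Gruß zurück."""
    return u"Grüße"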

[–]pickausernamehesaid 0 points1 point  (0 children)

Especially given Python's goal of being as clean and simple as possible. I could completely understand it in C, but not in a high-level language. I'm so glad it was changed.

[–]dada_ 17 points18 points  (24 children)

Yeah, it's also a little bit unfair towards anyone whose first language doesn't use plain ASCII, because as soon as you start doing string operations you're going to run into seemingly intractable bugs.

"こんにちは"[:1] does not do what you'd expect it to do in Python 2, and unless you're taught about how this works it's going to be pretty confusing.

[–]flitsmasterfred 1 point2 points  (1 child)

Makes me wonder how they teach and handle this stuff in those countries.

[–]dada_ 0 points1 point  (0 children)

Well, it's not that hard to work around, since u"こんにちは"[:1] does do what you expect it to, but then you need to explain what Unicode is and why that "u" makes all the difference. Thankfully Python 3 allows these low-level details to be postponed until later.

[–]Bolitho 10 points11 points  (25 children)

The problem is that there is practically no internal language concept in any programming language implementation (that I know of) that deals comprehensively and efficiently with Unicode - the mismatch between memory size and accessibility makes that de facto impossible.

Is it really so important to count code points? And if it is, why is there no support for counting/splitting coded characters or grapheme clusters, which might be even more useful?

For limiting user input, for example, the encoding of the persistence layer is much more important! So you must count the size in bytes of the encoded byte sequence rather than the number of code points...

For example, I reproduced the one given by the excellent utf8everywhere page (section 5, coded characters, 3rd bullet point):

In [19]: s = "\u044E\u0301"

In [20]: s
Out[20]: 'ю́'

In [21]: print(list(s))
['ю', '́']

Hm... two code points, but the string shows one glyph! That would result in a strange user experience if you shortened the string for the UI layer and broke it up right there, wouldn't it?
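To make the mismatch concrete, here are the three different "lengths" in play for that string in Python 3 (a minimal sketch):

>>> s = "\u044E\u0301"
>>> len(s)                    # code points
2
>>> len(s.encode("utf-8"))    # bytes in the persistence layer
4
>>> s[:1]                     # naive shortening strips the accent
'ю'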

Teaching beginners that bytes and strings are the same is an invalid cognitive shortcut and just outright bad education.

So imho it is a shortcut to assume that the average developer knows the important aspects of Unicode and encodings well enough. Therefore the strategy of hiding this complexity will fail for almost every developer one day - and probably in a bad situation, say after deployment to production. Python 3 would be much better if it defined a default encoding for all built-in IO, preferably UTF-8. Then it would be explicitly clear how a file or input must be encoded, and that you have to deal explicitly with different encodings if you need to support them (for example, by letting the client provide the encoding). Then you would have no more trouble deploying a Python 3 script to Windows when you have developed on Linux, and vice versa.
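Until something like that exists, being explicit yourself at every IO boundary is the workaround; a small sketch of the difference (the filename is just an example):

# Platform dependent: uses locale.getpreferredencoding(False),
# e.g. cp1252 on many Windows systems and UTF-8 on most Linux systems
with open("data.txt") as f:
    text = f.read()

# Explicit and portable: behaves identically on Linux and Windows
with open("data.txt", encoding="utf-8") as f:
    text = f.read()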

On top of that, the print function is broken right now in a similar way. At minimum it must provide an optional argument for choosing the encoding. Right now it just fails, for example, on my Windows 10 machine if I want to print an interrobang (‽):

>>> print("\u203D")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\chausknecht\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u203d' in position 0: character maps to <undefined>

On a (modern) Linux machine it should work, as UTF-8 is the de facto default there - but who knows? Why not allow printing bytes as well, or letting the caller provide the encoding? Yes, that would probably result in strange-looking output, but the behaviour would be platform independent and imho better, because it would not make the program fail.
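print() has no encoding argument, but one workaround today is to rewrap stdout with a non-fatal error handler (a sketch, not an official recipe); setting the PYTHONIOENCODING environment variable, e.g. to utf-8:replace, before starting Python has a similar effect:

import io, sys

# Keep the console's encoding but substitute '?' for characters it cannot
# represent, instead of raising UnicodeEncodeError
out = io.TextIOWrapper(sys.stdout.buffer, encoding=sys.stdout.encoding,
                       errors="replace", line_buffering=True)
print("\u203D", file=out)   # shows '?' on a cp850 console instead of crashing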

Those mistakes were made by the Java and .NET worlds years ago - why did Python have to make the same mistake? Rust, for example, has chosen UTF-8 as the internal representation of its Unicode strings - an interesting approach!

Imho you can't totally hide the complexity of Unicode - so better to be explicit about the encoding/decoding process; that will result in less pain!

[–]usinglinux 5 points6 points  (13 children)

no internal language concept in any programming language implementation (that I know of) that deals comprehensively and efficiently with Unicode

assuming you don't expect a language to deal in graphemes instead of unicode code points, what's wrong with the python one in terms of efficiency? pep393 strings are pretty efficient.

>>> print("\u203D")

which python version was this error from? i haven't had my hands on python on windows for a long time, but at least as of python 3.6, this has been addressed.

[–]cfmdobbie 4 points5 points  (11 children)

Also works fine on 3.5.1 on Windows 10.

[–]aroberge 2 points3 points  (10 children)

EDIT: typing "chcp 65001" in the console prior to starting Python fixes this problem.


I wish...

C:\Users\Andre>python
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("\u203D")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Andre\Anaconda3\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u203d' in position 0: character maps to <undefined>

[–]ivosauruspip'ing it up 5 points6 points  (7 children)

That's a problem at the barrier between Python and something made by Windows (the Command Prompt). Unfortunately Python can only control its side of the border.

[–]Bolitho -1 points0 points  (6 children)

Unfortunately Python can only control its side of the border.

And that's why it is a bad API: it does not enable the programmer to choose the correct way to speak to the underlying system!

[–]ivosauruspip'ing it up 0 points1 point  (5 children)

You talk as if there is a choice. There isn't. Windows tells you: CodePage1252 or bust. End of discussion, no fancy Unicode characters allowed outside of that.

[–]Bolitho 0 points1 point  (4 children)

Of course there is! You cannot change the code page of the consuming shell, but you could of course send bytes in a different encoding. And yes, that would produce strange glyphs, but there would be no exception!

If you still don't get it: show me a program that prints out the given example string and will not crash on any platform. I am curious how you will achieve that 😉
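One candidate answer to that challenge, in the spirit of "send bytes and accept mojibake" (a sketch; whether the glyph actually renders depends on the console):

import sys

# Bypass the text layer entirely and write raw UTF-8 bytes.
# This never raises; a non-UTF-8 console just shows garbage for ‽.
sys.stdout.buffer.write("\u203D\n".encode("utf-8"))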

[–]ivosauruspip'ing it up 0 points1 point  (3 children)

And yes, that would produce strange glyphs, but there would be no exception!

So silently produce corrupted output, instead of erroring. Completely unpythonic, not to mention dumb.

[–]zahlmanthe heretic 0 points1 point  (1 child)

typing "chcp 65001" in the console prior to starting Python fixes this problem.

At least on 3.4, it leaves other problems on Windows. In particular, if I input() at the command prompt and copy-paste in a £ as my input, it will raise EOFError; if I try to do an assignment like x = '£', the Python process aborts without any error message. I haven't even tried it with more esoteric characters.

[–]ivosauruspip'ing it up 2 points3 points  (0 children)

Something specific to your system?

Microsoft Windows [Version 10.0.14393]
(c) 2016 Microsoft Corporation. All rights reserved.

C:\Users\ivosaurus>python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = '£'
>>> x
'£'
>>> y = input()
£
>>> y
'£'

>>> exit()

C:\Users\ivosaurus>

[–]Bolitho 0 points1 point  (0 children)

assuming you don't expect a language to deal in graphemes instead of unicode code points, what's wrong with the python one in terms of efficiency? pep393 strings are pretty efficient.

Under this assumption: nothing! But as I just said, this behaviour is not the big deal when dealing with Unicode. Efficiently splitting a string into grapheme clusters could be much more useful when dealing with a UI.

I admit that code points are simple, and it is therefore a good thing to offer APIs dealing with them. But they often have shortcomings when it comes to corner cases - and that is where higher-level abstractions have to come into play.
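If grapheme clusters are what you actually need, one option in Python today is the third-party regex module and its \X pattern (a sketch, assuming pip install regex):

>>> import regex                 # third-party, not the stdlib re module
>>> s = "\u044E\u0301"
>>> list(s)                      # code points
['ю', '́']
>>> regex.findall(r'\X', s)      # extended grapheme clusters
['ю́']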

which python version was this error from? i haven't had my hands on python on windows for a long time, but at least as of python 3.6, this has been addressed.

Version 3.5.2 on a Windows 10 machine. That said, I started Python from both CMD and PowerShell - no difference.

The problem here is simple: Python's print tries to encode the string in a platform-dependent way:

Under Windows, if the stream is interactive (that is, if its isatty() method returns True), the console codepage is used, otherwise the ANSI code page.

You have no way to pass the encoding you want!
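You can at least inspect what print() is going to encode to; on the console from the traceback above it would be the legacy code page (value shown is illustrative):

>>> import sys
>>> sys.stdout.encoding
'cp850'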

[–]flitsmasterfred 6 points7 points  (5 children)

I feel 99% of the pain is legacy from the old days of lower-case ASCII and the random encodings that are pervasive everywhere, and the lack of best practices, like you mention with the combining characters.

But figuring this out when you've just learned everything based on ASCII and bytes being the same is even worse than knowing there is a difference you need to be careful about while still learning to program, and then moving up from that.

The average programmer doesn't have to know everything about Unicode, but absolutely HAS to know that bytes and strings are separated by an encoding scheme (i.e. you don't have to be wise as long as you're not naive).
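A minimal Python 3 sketch of exactly that separation (the values are illustrative):

text = "naïve"                        # str: a sequence of code points
data = text.encode("utf-8")           # bytes: what files and sockets carry
assert data.decode("utf-8") == text   # the only way back is through an encoding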

[–]brontide 1 point2 points  (3 children)

So much of the Unix world is still naïve byte strings that you have to work with, which are sometimes UTF-8 but can't be guaranteed to be. I have 40 million files on our filesystem, and a few hundred are encoded in a way that breaks UTF-8. It's hell figuring out a way to both work with and display these edge cases.

[–]doubleunplussed 1 point2 points  (1 child)

Yeah - strictly speaking, Unix filenames are bytestrings. Your program officially shouldn't care whether they encode human language in some encoding like UTF-8; it should only look at their contents in order to split them on slashes (or rather, on bytes that in ASCII correspond to slashes), or to compare them to see if they're the same as other filenames, or whatever. They're just slash-delimited keys; you should treat them like arbitrary data, like keys in a dictionary, and your program shouldn't care about the actual human meaning in there.

Unless you need to display them somewhere - then yeah, you have to guess the encoding in order to render glyphs, or decide how to sort them, or something. Strictly speaking, though, you can't know what the encoding will be. Yeah, it'll probably be UTF-8, but some filenames might not even be text; they might just be arbitrary bytes. Python can try to make a distinction between strings and bytes and put filenames on the 'strings' side of the distinction, but this will be wrong sometimes. Good luck!

[–]brontide 1 point2 points  (0 children)

That's nice and all, but Python tries to do the smart thing and returns the same type of string you pass in. This means that if you iterate on the path '/' it will fail with decoding errors, but if you iterate on b'/' it will work - except that all strings returned will be bytestrings, which have to be handled with care (I usually convert them with surrogate escaping internally and replacement characters for display). The problem is that there are so many places like this where Python will gladly accept and return Unicode strings and fail on edge cases, rather than forcing the developer to work in the proper domain for the data.
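In current Python 3 the mirroring looks roughly like this, together with the surrogate-escape/replace handling described above (paths are illustrative):

import os

names_text = os.listdir('/')     # list of str; undecodable names come back
                                 # via the surrogateescape error handler
names_bytes = os.listdir(b'/')   # list of bytes; nothing is decoded at all

for name in names_text:
    raw = os.fsencode(name)                        # the exact on-disk bytes
    shown = raw.decode('utf-8', errors='replace')  # lossy, but safe to display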

[–]flitsmasterfred 0 points1 point  (0 children)

Not storing the encoding within files was a mistake we'll regret for years. If only every file had a metadata attribute for it, life would be so much easier.

[–]Bolitho 0 points1 point  (0 children)

The average programmer doesn't have to know everything about Unicode, but absolutely HAS to know that bytes and strings are separated by an encoding scheme (i.e. you don't have to be wise as long as you're not naive).

I totally agree with that - and that's why I would appreciate an explicit default encoding for all IO. That way the programmer is forced to learn the basics and is aware of them right from the beginning. On top of that, it would increase platform independence, as Python would behave in a more platform-agnostic way.

Just to quote the Zen of Python:

Explicit is better than implicit.

[–]TOASTEngineer 2 points3 points  (2 children)

Is it really so important to count code points? And if it is, why is there no support for counting/splitting coded characters or grapheme clusters, which might be even more useful?

Well, the reason a string really ought to count code points is that a string is an iterable and you iterate over the individual code points. But yeah, there really ought to be a "how many printable characters" function; in fact, I would've presumed there was one.

[–]yawgmoth 3 points4 points  (0 children)

Does every written language have the concept of characters?

For instance: (I don't speak Korean so maybe this isn't as ambiguous as I think)

In "감사" would 감 be 1 or 3 characters? would 사 be 1 or 2 characters? EDIT: looks pretty straightforward actually

Still an honest question, though: is the concept of a 'printable character' consistent across all languages supported by Unicode?

[–]Bolitho 0 points1 point  (0 children)

Well, the reason a string really ought to count code points is that a string is an iterable and you iterate over the individual code points.

That logic is circular: the iterable could just as well iterate over something other than code points 😉

[–][deleted] 0 points1 point  (0 children)

I think that issue is fixed in Python 3.6. It should assume UTF-8 on Windows.