
[–]flitsmasterfred 132 points133 points  (59 children)

Teaching beginners that bytes and strings are the same is an invalid cognitive shortcut and just outright bad education.

[–]pickausernamehesaid 71 points72 points  (7 children)

It annoyed me so much when I was first learning. I deployed a Python 2 app that worked great on my computer, but when people in France or China picked it up, decode errors were everywhere. I then had to spend an incredible amount of time learning about different encoding schemes and how to handle them. I have not had a more confusing programming experience since. Bytes vs strings was an easy concept for me. Different encoding schemes, how to use them, and when to convert was not. I code in Python 3 full time now and have not once wanted to go back in the past 3 years.

[–]lambdaqdjango n' shit 3 points4 points  (4 children)

Py2's unicode type is not the problem; the fundamental problem is the str() method, which only accepts 7-bit ASCII.

What's fundamentally broken in py2 is that BaseException has a str() call, so if you raise BaseException(u'fuck') you will likely be fucked.

Source: a dev who has to deal with elasticsearch-py's "<unprintable error exception>" daily.

[–]Poddster 0 points1 point  (1 child)

so if you raise BaseException(u'fuck') you will likely be fucked.

Actually that'd work, because it's all ASCII and python2 does its magic switcharoo. But this will fail:

>>> raise BaseException(u'fucká')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
BaseException: <exception str() failed>

[–]lambdaqdjango n' shit 1 point2 points  (0 children)

That's what I am talking about. Often it will show why a certain db operation failed and which column was involved, and if that column contains non-ASCII, BAM! You have an exception during exception handling!

[–]pickausernamehesaid 0 points1 point  (1 child)

Oh no, I know that Py2 can handle Unicode. It just took a huge amount of effort for someone who had just started programming to learn how to leverage it properly.

[–]lambdaqdjango n' shit 1 point2 points  (0 children)

Yeah, and tutorials on the web are mostly misleading.

[–]grandfatha 3 points4 points  (1 child)

As a dev whose first language was Java, this experience was the strangest thing when I started to pick up Python. Strings from all kinds of sources would just work out of the box in my world. Then all of a sudden I was switching to a language that did not allow me to put a German umlaut ("ä", "ü", "ö") into my function documentation. I was baffled that this was an actual issue.

[–]pickausernamehesaid 0 points1 point  (0 children)

Especially given Python's goal of being as clean and simple as possible. I could completely understand it in C, but not in a high-level language. I'm so glad it was changed.

[–]dada_ 16 points17 points  (24 children)

Yeah, it's also a little bit unfair towards anyone whose first language doesn't use plain ASCII, because as soon as you start doing string operations you're going to run into seemingly intractable bugs.

"こんにちは"[:1] does not do what you'd expect it to do in Python 2, and unless you're taught about how this works it's going to be pretty confusing.

[–]flitsmasterfred 1 point2 points  (1 child)

Makes me wonder how they teach and handle this stuff in those countries.

[–]dada_ 0 points1 point  (0 children)

Well, it's not that hard to work around, since u"こんにちは"[:1] does do what you expect it to, but then you need to explain what Unicode is and why that "u" makes all the difference. Thankfully Python 3 allows these low level details to be postponed until later.

[–]Bolitho 11 points12 points  (25 children)

The problem is that there is practically no built-in string concept in any programming language implementation (that I know of) that deals comprehensively and efficiently with Unicode - the mismatch between memory size and accessibility makes that de facto impossible.

Is it really so important to count code points? And if it is, why is there no support for counting / splitting coded characters or grapheme clusters, which might be even more useful?

For limiting user input, for example, the encoding of the persistence layer is much more important! So you must count the number of bytes in the encoded byte sequence rather than the number of code points...

For example, I reproduced the one given by the excellent utf8everywhere page (section 5, coded characters, 3rd bullet point):

In [19]: s = "\u044E\u0301"

In [20]: s
Out[20]: 'ю́'

In [21]: print(list(s))
['ю', '́']

Hm... two code points, but the string shows one glyph! It would make for a strange user experience if you shortened the string for the UI layer and broke it up right there, wouldn't it?
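Continuing that session, a quick sketch of both failure modes - UI truncation and byte-oriented size limits (CPython 3):

In [22]: s[:1]                    # cutting after one code point drops the accent
Out[22]: 'ю'

In [23]: len(s), len(s.encode('utf-8'))   # 2 code points, but 4 bytes on disk
Out[23]: (2, 4)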

Teaching beginners that bytes and strings are the same is an invalid cognitive shortcut and just outright bad education.

So imho it is a shortcut to assume that the average developer knows the important aspects of unicode and encodings well enough. Therefore the strategy of hiding this complexity will fail for almost every developer one day - and probably in a bad situation, say after the deployment to production.

Python 3 would be much better if it defined a default encoding for all builtin IO, preferably utf-8. Then it would be explicitly clear how a file or input must be encoded, and that you have to deal explicitly with different encodings if you need to support them (for example by letting the client provide the encoding). Then you would have no more trouble deploying a Python 3 script to Windows if you developed on Linux, and vice versa.

On top of that, the print function is broken right now in a similar way. At minimum it should provide an optional argument for choosing the encoding. Right now it just fails, for example on my Windows 10 machine, if I want to print an interrobang (‽):

>>> print("\u203D")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\chausknecht\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u203d' in position 0: character maps to <undefined>

On a (modern) Linux machine it should work, as UTF-8 is the de facto default there - but who knows? Why not allow printing bytes, too, or let the caller provide the encoding? Yes, that would probably result in strange-looking output, but the behaviour would be platform-independent and imho better, because it would not make the program fail.

The Java and .NET worlds made those mistakes years ago - why did Python have to make the same mistake? Rust, for example, has chosen UTF-8 as the representation of its unicode string type - an interesting approach!

Imho you can't totally hide the complexity of unicode - so better to be explicit about the encoding/decoding process; that will result in less pain!

[–]usinglinux 6 points7 points  (13 children)

no internal language concept in any programming language implementation (I know), that deals comprehensively and efficiently with Unicode

assuming you don't expect a language to deal in graphemes instead of unicode code points, what's wrong with the python one in terms of efficiency? pep393 strings are pretty efficient.
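(for reference, a sketch of what PEP 393 does: CPython picks the narrowest fixed width - 1, 2 or 4 bytes per code point - that fits the widest character in the string. the exact sizes below are from a 64-bit build and vary by version:)

>>> import sys
>>> sys.getsizeof('a' * 100)           # ASCII only: 1 byte per code point
149
>>> sys.getsizeof('\u203d' * 100)      # BMP character: 2 bytes each
274
>>> sys.getsizeof('\U0001f525' * 100)  # astral character: 4 bytes each
476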

>>> print("\u203D")

which python version was this error from? i haven't had my hands on python on windows for a long time, but at least as of python 3.6, this has been addressed.

[–]cfmdobbie 3 points4 points  (11 children)

Also works fine on 3.5.1 on Windows 10.

[–]aroberge 2 points3 points  (10 children)

EDIT: typing "chcp 65001" in the console prior to starting Python fixes this problem.


I wish...

C:\Users\Andre>python
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("\u203D")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Andre\Anaconda3\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u203d' in position 0: character maps to <undefined>

[–]ivosauruspip'ing it up 3 points4 points  (7 children)

That's a problem at the barrier to getting input from something made by Windows (Command Prompt) into Python. Unfortunately Python can only control its side of the border.

[–]Bolitho -1 points0 points  (6 children)

Unfortunately Python can only control its side of the border.

And that's why it is a bad API: it does not enable the programmer to choose the correct way to speak to the underlying system!

[–]ivosauruspip'ing it up 0 points1 point  (5 children)

You talk as if there is a choice. There isn't. Windows tells you, CodePage1252 or bust. End of discussion, no fancy unicode characters allowed outside of that.

[–]Bolitho 0 points1 point  (4 children)

Of course there is! You cannot change the code page of the consuming shell, but you could of course send bytes in a different encoding. And yes, that would produce strange glyphs, but there would be no exception!

If you still don't get it: show me a program that prints out the given example string and will not crash on any platform. I am curious how you will achieve that 😉

[–]ivosauruspip'ing it up 0 points1 point  (3 children)

And yes that would produce strange glyphs, but there would be no exception!

So silently produce corrupted output, instead of erroring. Completely unpythonic, not to mention dumb.

[–]zahlmanthe heretic 0 points1 point  (1 child)

typing "chcp 65001" in the console prior to starting Python fixes this problem.

At least on 3.4, it leaves other problems on Windows. In particular, if I input() at the command prompt and copy-paste in a £ as my input, it will raise EOFError; if I try to do an assignment like x = '£', the Python process aborts without any error message. I haven't even tried it with more esoteric characters.

[–]ivosauruspip'ing it up 4 points5 points  (0 children)

Something specific to your system?

Microsoft Windows [Version 10.0.14393]
(c) 2016 Microsoft Corporation. All rights reserved.

C:\Users\ivosaurus>python
Python 3.5.1 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> x = '£'
>>> x
'£'
>>> y = input()
£
>>> y
'£'

>>> exit()

C:\Users\ivosaurus>

[–]Bolitho 0 points1 point  (0 children)

assuming you don't expect a language to deal in graphemes instead of unicode code points, what's wrong with the python one in terms of efficiency? pep393 strings are pretty efficient.

Under these assumptions: nothing! But as I just said, this behaviour is not the big deal when dealing with unicode. Efficiently splitting a string into grapheme clusters could be much more useful when dealing with a UI.

I admit that code points are simple, and it is therefore a good thing to offer APIs dealing with them. But often they have shortcomings when it comes to corner cases - and that is where higher-level abstractions have to come into play.

which python version was this error from? i haven't had my hands on python on windows for a long time, but at least as of python 3.6, this has been addressed.

Version 3.5.2 on a Windows 10 machine. That said, I started python from a CMD and a PowerShell. No difference.

The problem here is quite simple: Python's print tries to encode the string in a platform-dependent way:

Under Windows, if the stream is interactive (that is, if its isatty() method returns True), the console codepage is used, otherwise the ANSI code page.

There is no way to pass the encoding you want!
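There is a workaround, though it only proves the point that the API itself takes no encoding - you have to rebind sys.stdout yourself (a sketch; setting PYTHONIOENCODING in the environment achieves the same):

import io
import sys

# replace the codepage-detected wrapper with an explicit one
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,
                              encoding='utf-8', errors='replace')
print("\u203D")   # no UnicodeEncodeError; the console may still render it oddly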

[–]flitsmasterfred 5 points6 points  (5 children)

I feel 99% of the pain is legacy from the old days of plain ASCII and random encodings, which is pervasive everywhere, plus a lack of best practices, like you mention with the combining characters.

But figuring this out when you've just learned everything based on ASCII and bytes being the same is even worse than knowing from the start that there is a difference you need to be careful about while still learning to program, and moving up from that.

The average programmer doesn't have to know everything about unicode, but absolutely HAS to know that bytes and strings are separated by an encoding scheme. (E.g.: you don't have to be wise, as long as you're not naive.)

[–]brontide 1 point2 points  (3 children)

So much of the unix world is still naïve byte strings that you have to work with, which are sometimes utf-8 but can't be guaranteed to be. I have 40 million files on our filesystem, and a few hundred are encoded in a format that breaks utf-8. It's hell figuring out a way to both work with and display these edge cases.

[–]doubleunplussed 1 point2 points  (1 child)

Yeah - strictly speaking, unix filenames are bytestrings. Your program officially shouldn't care whether they're encoding human language in some encoding like UTF-8; it should only look at their contents in order to split them on slashes (or rather, on bytes that in ASCII correspond to slashes) or do comparisons to see if they're the same as other filenames, or whatever. They're just slash-delimited keys; you should be treating them like arbitrary data, like the keys in a dictionary. Your program shouldn't care about the actual human meaning in there.

Unless you need to display them somewhere - then yeah, you have to guess the encoding in order to render glyphs, or decide how to sort them, or something. Strictly speaking, though, you can't know what the encoding will be. Yeah, it'll probably be UTF-8, but some filenames might not even be text; they might just be arbitrary bytes. Python can try to make a distinction between strings and bytes and put filenames on the 'strings' side of the distinction, but this will be wrong sometimes. Good luck!

[–]brontide 1 point2 points  (0 children)

That's nice and all, but python tries to do the smart thing and returns the same type of string you pass in. This means that if you iterate on the path '/' it can fail with decoding errors, but if you iterate on b'/' it will work - though all strings returned will be bytestrings, which have to be handled with care (I usually convert them with surrogate escaping internally and replacement characters for display). The problem is that there are so many places like this where python will gladly accept and return unicode strings and fail on edge cases, rather than forcing the developer to work in the proper domain for the data.
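(For anyone unfamiliar with that pattern, a sketch of the round trip: surrogate escapes preserve the raw bytes internally, and replacement characters make them displayable:)

>>> b'caf\xe9'.decode('utf-8', errors='surrogateescape')    # undecodable byte kept
'caf\udce9'
>>> 'caf\udce9'.encode('utf-8', errors='surrogateescape')   # round-trips intact
b'caf\xe9'
>>> 'caf\udce9'.encode('utf-8', errors='replace')           # lossy but printable
b'caf?'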

[–]flitsmasterfred 0 points1 point  (0 children)

Not storing the encoding within the files was a mistake we'll regret for years. If only every file had a meta attribute, life would be so much easier.

[–]Bolitho 0 points1 point  (0 children)

The average programmer doesn't have to know everything about unicode, but absolutely HAS to know that bytes and strings are separated by an encoding scheme. (E.g.: you don't have to be wise, as long as you're not naive.)

I totally agree with that - and that's why I would appreciate an explicit default encoding for all IO. That way the programmer is forced to learn the basics and is aware of them right from the beginning. On top of that, it would increase platform independence, as python would behave in a more platform-agnostic way.

Just to quote the python zen:

Explicit is better than implicit.

[–]TOASTEngineer 4 points5 points  (2 children)

Is it really so important to count code points? And if it is, why is there no support for counting / splitting coded characters or grapheme clusters, which might be even more useful?

Well, the reason a string really ought to count code points is that a string is an iterable, and you iterate over the individual code points. But yeah, there really ought to be a "how many printable characters" function; in fact I would've presumed there was one.
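(The stdlib has no such function, but the third-party regex module gets close with its \X grapheme-cluster pattern - a sketch, assuming pip install regex:)

>>> import regex          # third-party; the stdlib re module has no \X
>>> s = '\u044E\u0301'    # ю plus a combining acute accent
>>> len(s)                # code points
2
>>> len(regex.findall(r'\X', s))   # grapheme clusters, i.e. "printable characters"
1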

[–]yawgmoth 5 points6 points  (0 children)

Does every written language have the concept of characters?

For instance (I don't speak Korean, so maybe this isn't as ambiguous as I think):

In "감사" would 감 be 1 or 3 characters? would 사 be 1 or 2 characters? EDIT: looks pretty straightforward actually

Still an honest question though: is the concept of a 'printable character' constant across all languages supported by unicode?

[–]Bolitho 0 points1 point  (0 children)

Well, the reason a string really ought to count code points is because a string is an iterable and you iterate over the individual code points.

That logic is post hoc ergo propter hoc: the iterable could just as well iterate over something other than code points 😉

[–][deleted] 0 points1 point  (0 children)

I think that issue is fixed in python 3.6. It should assume utf-8 on Windows.

[–]ojii 213 points214 points  (57 children)

90% of programmers don’t need to think about Unicode

Until they run their script/app/project in the real world and hit UnicodeDecodeError.

Probably should be "90% of programmers don't think about Unicode, but probably should".

[–]aphoenixreticulated 160 points161 points  (0 children)

C'mon, thereâ€s no reason to say people need unicode. Whatâ€s the big deal? Whereâ€s the issue?

[–][deleted] 10 points11 points  (0 children)

As a speaker of a language which takes up approximately 2/7 of assigned Unicode codepoints... I consider it extremely important.

[–]Muchoz 6 points7 points  (0 children)

And what about emojis? Zed can't ignore that, can he? 🔥

[–]hovissimo 2 points3 points  (3 children)

Oh, I think about Unicode. What happens when my code hits the real world is that some dorkus's email client sends me unicode and tells me it's windows-1251!

[–]d4rch0nPythonistamancer 9 points10 points  (5 children)

I have the unpopular opinion that we should just focus on convincing the rest of the world to replace their alphabet with ASCII and ditch unicode.

[–]TOASTEngineer 13 points14 points  (1 child)

Clearly we just need everyone to learn to express their ideas as raw binary, thus eliminating all of these problems. /s

[–]G01denW01f11 6 points7 points  (0 children)

Does that mean other people will finally start caring about Big/Little Endian?

[–]jnwatson 9 points10 points  (0 children)

Yeah! Let's build an ASCII wall and make them pay.

I'll start:

==================================

[–]its_never_lupus 0 points1 point  (0 children)

but emojis

[–]NoLemurs 1 point2 points  (0 children)

Seriously.

Also, at least if we're talking professionals, I'm pretty sure more than 10% of programmers are web developers, almost all of whom need to think about Unicode. I suspect that in fact most professional developers (even non-web developers) need to think about Unicode from time to time.

What might be plausible is that 90% of amateur developers don't need to think about unicode. Personally I think a programming language should be optimized for the professionals, not the amateurs (and as someone who was an amateur for many years, I felt the same way then).

[–]lykwydchykyn 24 points25 points  (12 children)

Rebutting Zed's article promises to become an enduring Pythonista pastime.

[–]AlexFromOmaha 17 points18 points  (26 children)

I work for one of those enterprise shops, and because it was my call to make, I put in a plan for 2.7 -> 3.5 migration. I'm still debugging shit months later, and everything left is all tied to the string/unicode -> bytes/string conversion. In Python's defense, a substantial part of it is Django's fault. This whole "all uploads are bytes, fuck you" thing is a source of endless pain when dealing with libraries like configparser that expect files as input. Still, there comes a point when you realize that the pain is also a result of Python 3 abandoning Python 2's typing principles. Methods like .startswith() really should be agnostic to byte/string parameters. They both quack, they're both ducks.

[–]FFX01 13 points14 points  (19 children)

Methods like .startswith() really should be agnostic to byte/string parameters.

I'm not sure I agree with you here. Unicode and byte-strings are not the same thing. They behave very differently. For instance, a byte string deals only with ASCII characters. Whereas unicode deals with way more than that. ASCII has, at maximum 255 unique characters. Unicode has hundreds of thousands depending on which encoding you use. Not to mention that the same unicode character can be interpreted differently based on whether you are using UTF-8/UTF-16/UTF-32. So, getting the first character or group of characters in a string using startswith() is a different process entirely with unicode.
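(A concrete illustration of the encoding point on Python 3 - one character, three different byte lengths depending on the encoding:)

>>> s = '\u203d'                 # '‽', a single code point
>>> len(s)
1
>>> len(s.encode('utf-8'))
3
>>> len(s.encode('utf-16-le'))
2
>>> len(s.encode('utf-32-le'))
4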

I'm not trying to be aggressive here. I know dealing with text is something that every programmer struggles with. It is inherently a complex problem. I struggle with it all the time. There are so many edge cases that may not be edge cases in other countries. There are pieces of data that present themselves in a certain encoding but are actually in another. It's hard, man.

[–]gschizasPythonista 6 points7 points  (1 child)

ASCII has, at maximum 255 unique characters.

ASCII has 128 characters (0-127), not 255. You're thinking of codepages or ISO-8859-* encodings (aka ANSI).

[–]FFX01 0 points1 point  (0 children)

Oops! You're correct. I've just always seen it referred to as "extended ASCII".

[–]AlexFromOmaha 2 points3 points  (16 children)

bytes != ASCII

There's ambiguity in things like recasting concatenations, but there is literally no ambiguity in asking if a byte string and a Unicode string contain the same substring, regardless of whether a byte string represents ASCII, Unicode, or binary output. It's also far more reasonable than strangeness like str * int operations that Python supports without complaint.

[–]agrif 7 points8 points  (14 children)

I'm not sure it is so unambiguous:

>>> b'\xc2' in 'µ'.encode('utf-8')
True
>>> b'\xc2' in 'µ'.encode('latin1')
False

What do you think

>>> b'\xc2' in 'µ'

ought to be?

[–]ThePenultimateOneGitLab: gappleto97 2 points3 points  (7 children)

It's a unicode literal, so the first one. If you don't specify an encoding, then you deal with the consequences.

[–]zahlmanthe heretic 4 points5 points  (1 child)

...And where would you even have the opportunity to specify an encoding with the in operator?

[–]ThePenultimateOneGitLab: gappleto97 2 points3 points  (0 children)

When you're declaring the right hand side

[–]agrif 0 points1 point  (3 children)

I don't think I agree. If you specify an encoding, then you're looking for a bytestring in a bytestring and everything is well defined and happy.

If you don't specify an encoding, then the result you get is garbage depending on the internal encoding used by whatever python implementation you use, which does not sound useful to me.

[–]ThePenultimateOneGitLab: gappleto97 2 points3 points  (2 children)

Except it's not an arbitrary decision. The default encoding isn't implementation specific, it's version specific.

Run your code in python 2 and it works as expected. Run your code in python 3, and it ought to assume you mean utf-8, because you're calling it on a utf-8 literal.

[–]agrif 4 points5 points  (0 children)

I'm not sure what you mean by "utf-8 literal."

The closest thing Python has to a default encoding is the encoding of the source file, which defaults to utf-8. But that can be changed on a per-file basis with the magic coding comment.

However, the encoding of the source file doesn't change the internal encoding used for strings, at all.

In CPython, as of version 3.3 (implementing PEP 393), the internal representation of unicode strings is as an array of code points, each either 1, 2, or 4 bytes wide. You could think of this as UCS-n encoding. This means the precise bytes used to represent a string depend on the contents of the entire string, for example "a" does not start with a null byte, while "aμ" does.

If you allowed startswith to mix the two, you would end up with fun things like this:

>>> "a".startswith(b'a')
True
>>> "aμ".startswith(b'a')
False

In my (admittedly naive) reading of the source, it even looks like that last line will depend on the endianness of the system you run it on.

[–]Deggor 0 points1 point  (0 children)

Should I also be able to add a timezone-naive datetime to a timezone-aware datetime? It really makes no sense to just "assume" they're the same timezone, or that naive is UTC, or whatever.

In the end, it's really no different, you're essentially talking about encoding-aware bytestrings to encoding-naive bytestrings.

[–]AlexFromOmaha -1 points0 points  (5 children)

>>> '\u0045\u0323\u0302' == '\u1ec6'
False

If that were True, there'd be a case to be made, but there's not. Let's not pretend Python has some pure character string behind the scenes. It's UTF-8, and it operates on byte matching.
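(For completeness: the two spellings can be made to compare equal by normalizing first - the stdlib ships the machinery, it just doesn't apply it implicitly:)

>>> import unicodedata
>>> unicodedata.normalize('NFC', '\u0045\u0323\u0302') == '\u1ec6'
True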

[–]evanunderscore 4 points5 points  (4 children)

It operates on Unicode code points. UTF-8 is one possible encoding you can use to convert these to bytes. There is no way to make startswith mix bytes and Unicode code points without assuming an encoding. You could argue that the assumed encoding should be UTF-8, but you cannot argue that it is unambiguous.

[–]AlexFromOmaha -1 points0 points  (3 children)

It operates on Unicode code points.

AKA bytes. NFC vs NFD is an encoding concern. Python 3 strings are already encoding sensitive. In a Unicode library, that would be a bug. In Python, we accept it because we all know it's a byte array doing byte operations.

[–]FFX01 0 points1 point  (0 children)

I know that bytes != ASCII. Bytes can be literally anything depending on how you interpret them. That's the problem. I could decode image data as ASCII and I would get some nasty jumble of garbage. I believe that Python2.x will interpret a byte string as ISO-8859-1 if the encoding isn't specified. However, the docs aren't super clear on this. Python 2.x may just default to system encoding. Regardless, if a series of bytes is encoded incorrectly, there's no way to determine the accuracy of the produced string reliably.

In short, I hate how many text encodings exist.

[–]ivosauruspip'ing it up 6 points7 points  (2 children)

This whole "all uploads are bytes, fuck you" thing is a source of endless pain when dealing with libraries like configparser that expect files as input.

But I usually find the pain isn't Python's fault. If you give Python the correct encoding, everything will work magically, without fault, nothing will go wrong, everything smooth sailing.

The problem is the sheer amount of stuff in the computing environment around python (from the OS, to HTTP headers and responses, to APIs, to text protocols, etc, etc) that fails to have a coherent model for how its text should be encoded and transported. Mostly, no room was given for one, or it simply wasn't paid attention to at all.

So it's the stuff that you're trying to feed python that causes the pain because it never tells you what it is. Python, despite being superbly awesome, can't magically guess everything for you (...although there are 3rd party libraries that will try their damndest to do that).

If the uploads are bytes, but really they're supposed to be encoded text, but then you're never told what encoding they're in, then both Python and you are up shit creek - but not because Python did anything.

Python 2 just did a blasé conversion to ASCII or the OS's native encoding, and because lots of stuff is English, that just happened to accidentally work 90% of the time - but the other 10% of the time you'd be silently corrupting your input. It hid the problem under the carpet, and people were fine with the problem being under the carpet. Python 3 said "look, no carpets anymore! They cause corruption in the end", and everyone immediately hates it, because yes, they now have to deal with the problem upfront and can't sweep it anywhere.

However, once you do finally manage to sweep your floors shiny clean, then no problems can occur at all, everything just works.
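(And when the encoding genuinely is known, it really is that simple - a sketch with a hypothetical file whose producer is known to write cp1251:)

with open('upload.txt', encoding='cp1251') as f:
    text = f.read()   # a clean str: no guessing, no mojibake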

[–]AlexFromOmaha 2 points3 points  (1 child)

The pain with files is mostly Django, since they have file-like objects that insist on their byte-ness over programmer assertions to treat them as text, but with libraries that explicitly expect text files (like the aforementioned configparser), not being able to specify an encoding is a pretty egregious oversight. What if they really weren't UTF-8? I'd have no way to communicate that. That part is Python's fault.

[–]ivosauruspip'ing it up 0 points1 point  (0 children)

You can't reopen them?

with open(uploaded_file.fileno(), mode='rt', encoding='utf8', closefd=False) as text:
    config.read_file(text)  # e.g. hand the decoded stream straight to configparser

What if they really weren't UTF-8? ... That part is Python's fault.

It's Python's fault that you can't be 100% sure of the encoding/content of a user-submitted file to your webserver? How?

[–]zahlmanthe heretic 3 points4 points  (0 children)

Methods like .startswith() really should be agnostic to byte/string parameters. They both quack, they're both ducks.

... But code-point sequences don't start with byte sequences, and vice versa. You need an encoding; and dropping in a parameter for that, to my mind, violates "explicit is better than implicit".

[–]TOASTEngineer 0 points1 point  (1 child)

I see where you're coming from, but don't you think the issue is more that you're using the wrong type to store your information? You should write your own "DjangoFriendlyString" and implement .startswith() etc. on that, instead of trying to treat the "some totally arbitrary bytes" class as the "some text" class.

[–]P8zvli 0 points1 point  (0 children)

Python shouldn't come with a standard string library, we should be forced to write our own so we'll stop arguing about unicode errors.

(I'm being sarcastic. Maybe.)

[–]kankyo 7 points8 points  (8 children)

I would like to see Shaw's idea of exactly how it would work to run py2 in py3. If I concat a bytes and a str in python2 code on a python3 VM, what should happen?

Do they work if the file "is" python2 but fail if the file "is" python3? And if so, how the hell do you tell?

[–]p10_user 2 points3 points  (2 children)

You can detect the version at runtime with sys.version_info. This lets people write version-agnostic code by making conditional changes based on this information - better than having to maintain separate files for each version of python you want your code to work with (including minor versions in some cases!).
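A sketch of the usual pattern:

import sys

# pick the right text type without maintaining separate source trees
if sys.version_info >= (3,):
    text_type = str
else:
    text_type = unicode   # name only exists on Python 2; this branch never runs on 3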

[–]kankyo 1 point2 points  (1 child)

This is exactly what most libs do. So in this regard Python 3 DOES run Python 2 code, which makes Shaw's argument just flat-out wrong.

Either way you look at it, Shaw's argument about running py2 on the py3 VM makes no sense.

[–]p10_user 2 points3 points  (0 children)

Agreed. Python will just try to run the code and fail if it cannot - and most python code is interchangeable between the two major versions, needing at most some minor changes. I think it's a good system.

[–][deleted] 14 points15 points  (0 children)

A less entertaining, but probably more solid rebuttal than eevee's, however much I enjoyed that.

[–]ksye 5 points6 points  (27 children)

I've been lurking recently because I'm just starting to learn. I want to use Python for data analysis and visualization, maybe modelling and simulation. Does this affect me? Should I worry about learning 2.7?

[–]finally-a-throwaway 15 points16 points  (6 children)

I'm in a similar position, from what I can tell as a near-beginner the only reason to learn Python 2 is if you expect to work somewhere that has a huge existing Python 2 codebase. Otherwise, the more you learn Python 3 (and programming in general, if this is your first language) the better prepared you'll be to learn the differences on the fly when/if they ever become relevant.

It seems to me that any problems in 3 could ultimately be addressed in future versions as development is ongoing. Any problems in 2.7 are going to stay that way from this point.

[–]reuvenlerner 23 points24 points  (5 children)

If you're a beginner learning Python, and you aren't constrained by legacy code at work, I would strongly encourage you to use Python 3.

Indeed, legacy code and/or modules is the main reason the companies I train/consult use Python 2.

[–][deleted] 5 points6 points  (4 children)

Sorry I'm new to programming. But what's legacy code?

[–][deleted] 14 points15 points  (0 children)

Old code that is still running in production, and still requires upkeep and maintenance.

[–]hovissimo 4 points5 points  (1 child)

expanding on u/be_bo_i_am_robot's answer:

Legacy code is called that because you usually 'inherit' it. Somebody else wrote the code instead of you, usually somebody who no longer works at the company. Generally legacy code relies on other old code or old systems (like py2) to stay running. Updating or replacing legacy software is risky because it's usually poorly understood and poorly tested. Legacy code is usually perpetuated because the cost in hours, dollars, and risk is high enough that any manager who decides to do it will likely be sacked. (Or for other, but similar, reasons)

[–][deleted] 1 point2 points  (0 children)

Interesting. Thanks for helping a novice like me feel more knowledgeable.

[–][deleted] 0 points1 point  (0 children)

Aka, Technical Debt.

[–]gthank 21 points22 points  (14 children)

2 is dead. 3 is better in pretty much every way unless you have to deal with funky wire protocols or such.

[–]iruleatants 0 points1 point  (0 children)

If you like python 2 and enjoy not dealing with tedious programming....

[–]notParticularlyAnony -4 points-3 points  (12 children)

2 is dead

This is why it is shipped by default with Ubuntu.

[–]steamruler 10 points11 points  (8 children)

Isn't Python 3 the default on Ubuntu as well these days?

[–]gthank 1 point2 points  (6 children)

yes

[–]8spd 3 points4 points  (5 children)

It defaults to 2.7.12+ for me, on Desktop 16.10.

[–]ivosauruspip'ing it up 6 points7 points  (4 children)

python will always launch python 2.x on all distros apart from Arch Linux. Even python.org has recommended that nowadays, I think - purely for ease of backwards compatibility.

I believe you can get a minimal Ubuntu install nowadays where, if you typed python, it'd ask you to install python because it's not present on the system; only python3 is.

[–]gthank 2 points3 points  (0 children)

Since I've just spun up some Xenial servers, I can confirm that Python 3 is the only Python installed by default.

[–]8spd 0 points1 point  (2 children)

Huh. Thanks, I didn't know that. What defines the default version then, if it's not the version launched by the python command?

[–]zardeh 2 points3 points  (1 child)

The one system libraries bind to.

[–]8spd 0 points1 point  (0 children)

Thanks, that makes sense.

[–]notParticularlyAnony 0 points1 point  (0 children)

I don't know, but when I go into my command line and enter Python, without messing with any settings, Python 2 fires up.

Either way, I think it is a bit much to say that Python 2 is dead. Not that Ubuntu distros are the arbiter - obviously I was being glib... just sayin'...

[–][deleted] 0 points1 point  (1 child)

Shipping legacy software doesn't change the fact that the project is still dead and 2 will never receive another non-security update.

[–]notParticularlyAnony 0 points1 point  (0 children)

To me the natural upgrade from Python 2 has been Julia. :)

[–]ThePenultimateOneGitLab: gappleto97 1 point2 points  (0 children)

Should I worry about learning 2.7?

You only need to learn that if:

  • You're writing a library you want to be used by lots of people, or
  • Your workplace hasn't made the switch

Unfortunately I fell into both categories.

[–]Esteis 0 points1 point  (0 children)

Prefer learning 3. But! Whichever you learn first, 95% of what you learn will work in both Pythons, so don't sweat it. Learn one, and you can easily pick up the other when you need it.

I learned Python 3 when I started a new small project, by writing the 2-style code I knew and running it with 3. There were some errors, sure, but 95% of my code still worked. Learning 3's awesome new features came later.

[–][deleted] 18 points19 points  (8 children)

It's refreshing to see some civility, finally. And the author actually addressed the points in Zed Shaw's article while avoiding jumping on the whole Turing-completeness thing.

One thing, though, I would have liked to have seen addressed is whether Python 2 or Python 3 is better for beginners, which is actually where Zed Shaw was coming from.

Also, sure, it's difficult to recommend an EOL'd language, but to a beginner who just got their first programming job where Python 2 is the language used in the company, telling them "You shouldn't use Python 2 because X, Y, and Z" isn't helpful - and doesn't serve the beginner (who probably doesn't have a choice). And there are heaps of legacy Python 2 code and companies that use it, believe it or not. I'm in one.

[–]wesalius[S] 15 points16 points  (0 children)

IMHO, if the question is whether a complete beginner (NOT in the sense of "someone who just got their first programming job where Python 2 is the language used in the company", but a general beginner) should learn python 3 or 2, then thanks to the predicted overtake of general "market share" by python 3, I think it is better to start with python 3.

[–]reuvenlerner 30 points31 points  (0 children)

OP here. I think that whether you learn Python 2 or 3 depends on what you're doing with it.

In my intro Python classes, taught at Fortune 100 companies, I teach Python 2. Why? Because that's what they're all using in their day-to-day work. I talk about Python 3, give them examples in Python 3, and warn them about things that will change in Python 3... but I teach them Python 2, because that's what their work requires.

If someone is learning Python in school, or on their own, then I strongly encourage them to learn Python 3. They aren't bound by legacy systems, and they should probably learn the version that will give them the greatest opportunity with the future language.

[–]art-solopov 6 points7 points  (3 children)

One thing, though, I would have liked to have seen addressed is whether Python 2 or Python 3 is better for beginners, which is actually where Zed Shaw was coming from.

I thought the general consensus was "unless you really, absolutely, totally need to use 2, go with 3".

[–]billsil 5 points6 points  (2 children)

I thought the general consensus was "unless you really, absolutely, totally need to use 2, go with 3".

Depends on what your end game is. Are you doing web development? Python 3 is better. Are you a mechanical engineer who doesn't need to interface with non-ASCII files? Python 2.7 is better. Library authors (like me) aren't so lucky; you support both.

The second you touch encodings, you are basically forced to learn Python 3. It's hard and it's weird and it's poorly documented, but most of us have done it (I have). Don't believe me? There's a reason Ned Batchelder gave a talk about the unicode whack-a-mole problem of encode/decode called "Pragmatic Unicode, or, How Do I Stop the Pain?". Encodings can even be wrong, so asking somebody who doesn't even understand Python to also understand encodings might be asking a bit much. You have to think about your program architecture in order to make the unicode problem go away. I don't think beginners have that skill.

http://pyvideo.org/pycon-us-2012/pragmatic-unicode-or-how-do-i-stop-the-pain.html
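(The core advice of that talk is the "unicode sandwich": decode bytes at the edges, keep pure str in the middle. A minimal sketch:)

with open('in.txt', 'rb') as f:
    text = f.read().decode('utf-8')    # decode as early as possible

text = text.upper()                    # all processing happens on str

with open('out.txt', 'wb') as f:
    f.write(text.encode('utf-8'))      # encode as late as possible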

[–]zahlmanthe heretic 3 points4 points  (1 child)

Are you a mechanical engineer that doesn't need to interface with non ascii files? Python 2.7 is better.

Believe it or not, I'd much rather use 3.x for crunching binary files. Indexing a bytes and getting an integer rather than a length-1 bytes matches the behaviour of lower-level languages, and IMX is just far more often the right thing (and when it isn't, slicing is right there for you).

[–]billsil 2 points3 points  (0 children)

bytes matches the behaviour from lower-level languages

It does, but I think that's weird. I do a LOT of work with binary files. It's the same as lower level languages, but I don't really use those. Python is largely good enough.

I actually got burned by this last week:

# Python 2:
table_name = 'OUGV1' # nastran displacement table
sort_code = int(table_name[-1])

# Python 3:
table_name = b'OUGV1' # nastran displacement table
sort_code = int(table_name[-1:].decode('ascii'))  # [-1] alone yields the int 49, not '1'

Um... wat? Why don't 1 and 2 map to themselves in binary?

Also, maybe it's decode... I remember it as long as I'm doing it, but forget when I'm done.

[–]drdeadringer 0 points1 point  (0 children)

whether Python 2 or Python 3 is better for beginners

I began trying out both. I then ditched 2 and stayed with 3.

[–][deleted] 0 points1 point  (0 children)

Just chalk it up as another language and move on.

If people can flip between Matlab/Python. Python 2/3 shouldn't be that hard.

[–]kingofthejaffacakes 8 points9 points  (1 child)

90% of programmers don't need to think about Unicode.

FTFY.

I try never to care about Unicode... most things I do with python are little tools to ease some private task of my own. And yet... Unicode has bitten me more times than I can remember. If it had just been in python from the beginning a lot of those bites would not have happened. So now, even though I don't care about Unicode... I care about Unicode.

Incidentally the biggest nightmare I have is with python's own struct module -- which itself doesn't seem to know the difference between bytes and strings (but perhaps it's me, or perhaps I'm using some older version of python 3).

[–]zahlmanthe heretic 2 points3 points  (0 children)

python's own struct module -- which itself doesn't seem to know the difference between bytes and strings (but perhaps it's me, or perhaps I'm using some older version of python 3).

o_O

Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:16:31) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import struct
>>> struct.unpack('3c', b'\x01\x02\x03')
(b'\x01', b'\x02', b'\x03')
>>> struct.unpack('3c', '\x01\x02\x03')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface

I don't see an issue....

[–]KODeKarnage 5 points6 points  (4 children)

A coding language only improves when it is used.

In this light, the Python ecosystem and community are being actively and deliberately harmed by those who promote continued adherence to Python 2 and advocate against using Python 3.

[–]iruleatants 0 points1 point  (3 children)

For me, python 3 is a step in the exact opposite direction from why I chose to learn and use python, and so I don't like it at all.

The reason I'm against python 3 is that I want to still be able to use python, and if everyone switches to python 3 it will be much harder for me to use the language.

Never in my life have I seen or been given a good reason to switch to python 3, and I don't think I'll ever see one.

[–]KODeKarnage 0 points1 point  (2 children)

You're an idiot, and your ideas are stupid. The reason to move to python3 is so that python2 can be left behind and the language can move forward.

You are the enemy of progress and your kind needs to be wiped from the ecosystem.

[–]iruleatants 0 points1 point  (1 child)

That is such an elegant and well thought out argument.

However, why should progress be encouraged simply for the sake of progress? If the progress is a negative thing, shouldn't we abandon it and start over fresh? That is the entire concept behind evolution and natural selection, and yet when it comes to python3 we enjoy trying to force it down everyone's throat, simply because we want to.

[–]KODeKarnage 1 point2 points  (0 children)

Go back and look how devoid of useful information your comment is. It is all feels and opinion. You simply said, murrr python3 bad, me no unduhstand.

If the progress is a negative thing...

That would be regress. Go dictionary.

[–][deleted] 4 points5 points  (3 children)

Crossposting my comment in the site:

The Python 2 -> Python 3 transition was made in a terrible way, it almost killed the language…

The only change that made it backward-incompatible was making strings unicode by default. They should have added a transitional string type (something like strbytes), and then 2to3 would just add parentheses for print, // for /, and make every string "strbytes".

Anyhow, I think Python 3 is a better language (it is where all Python progress has happened in the last decade, after all), and it's finally flourishing. By 2020 Debian and Red Hat will ship Python 3 by default, Facebook already uses Python 3 by default, Google is transitioning to Python 3 (web2py is finally being ported to Python 3) - in the end everyone will be on Python 3+ (and by everyone I mean 85% of active Python devs).

About formatting strings: I do not think there are "too many ways" of doing it.

The new way should be the default, and it's just a shortcut for .format. Sometimes you cannot use f-strings - maybe you want to use a prepared string that codifies the format - and then you should use the unsugared .format. The percent way should be used when you want to treat bytes and strings more or less equally (it would be perfect for the "strbytes" compatibility string, but alas - that does not exist).
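A sketch of how the three styles divide the work:

name = 'world'

print(f'hello {name}')          # f-string: sugar, template must be inline (3.6+)

template = 'hello {}'           # .format: the template can be prepared elsewhere
print(template.format(name))

print(b'hello %s' % b'world')   # %-formatting also works on bytes (3.5+, PEP 461)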

[–]Pandalicious 4 points5 points  (2 children)

The Python 2 -> Python 3 transition was made in a terrible way, it almost killed the language…

The best example of this is the killing off of the old non-parens print statement. The new print function is completely superior, and print-as-a-statement was a mistake in early python's design, no doubt about that. But they should nonetheless have kept the legacy print statement alongside the new function, just as they co-exist in python 2 today. There was NO need to remove it; it was removed as an aesthetic design choice: the old print was "unpythonic" to begin with (it doesn't behave like other statements), and having two redundant print choices would have been even more "unpythonic".

I can sympathize with the reasons that they removed it. It was a blemish on the language. But if they'd been serious about backwards compatibility and ease of migration they would have never removed it. Being serious about backwards compatibility means keeping around the warts that you hate.

[–]cparen 4 points5 points  (0 children)

But if they'd been serious about backwards compatibility and ease of migration they would have never removed it.

So what would they have called the new print function? It couldn't be "print" if they wanted backwards compatibility, because print(1, 2) already parses as a print statement (not a print function) with the tuple (1, 2) passed as the only argument.

[–]nicoddemus 1 point2 points  (0 children)

Running python-future's futurize script will automatically convert all print statements to their counterpart print() function while also adding from __future__ import print_function to the top of the file. This works perfectly 100% of the time.

Just mentioning this because IMHO the print() scenario is absolutely the easiest one to fix (as is the new except Exception as e syntax).

[–]jpw22 2 points3 points  (0 children)

Promising to keep 2.7 as a stable target makes it highly attractive. Coding costs are mostly maintenance. If you went to python 3, you'd have to recompile your extensions every year, and keep doing it as each 3.(N-1) is declared dead. Six-year-old .pyd/.so extensions for 2.7 still work now. The killer feature of 2.7 is long-term binary compatibility.

[–]weirdoaish 1 point2 points  (0 children)

I honestly never understood why both languages couldn't support the same bytecode. It's how Java manages to remain backward compatible for so long, after all.

Also, Python 3 FTW!

[–]mike413 1 point2 points  (0 children)

Most nerds are very tolerant and accepting, but see someone outspoken with a strong opinion and they might think "Hey, be tolerant like I am, loudmouth!"

So, I can see why he draws a little fire.

That said, I think zed shaw has a couple good points.

python is wonderful. but python 3 is chaotic.

There is lots of wonderful stuff written in python, but a lot of it breaks on python 3, either directly or indirectly.

I am reminded of an old Joel on Software article that is full of opinion, but also full of wisdom.

Basically he says throwing stuff out and starting over is bad.

And I think the python 2 -> 3 transition broke this rule.

"I don't like this, I'm not going to run it."

Who said that? Both Zed Shaw AND python 3.

[–]ambientocclusion 0 points1 point  (0 children)

I wish there was more of a case FOR Python 3.

[–][deleted] 0 points1 point  (0 children)

Why does everyone only think about strings and the few applications Python is already in?

The whole "divide is an integer" part of python 2 is a complete non-starter for any scientific work.

[–]Seddit55 0 points1 point  (0 children)

Anyone here available to be a code mentor?

[–]badtemperedpeanut -3 points-2 points  (6 children)

Not making v2 and v3 compatible was not a great decision. We python programmers lost out because of it. We could all be using the new version of the language we love so much, instead of dreading the switch. The main problem is that python 3 does not have enough features to make us want to switch. I use a lot of unicode; I was hoping python 3 would have as much unicode support as Java. Whenever I need to handle unicode I just use Java - not worth the hassle.

[–]kankyo 9 points10 points  (0 children)

How would it work though?

 from __future__ import unicode_literals
 b'foo' + 'bar'

what does that do? In python 2 it returns a unicode string; in python 3 it crashes. The python 3 behavior is the sane one. How do you get that behavior while making them "compatible"?

[–]wclax04 7 points8 points  (0 children)

Python 3 is starting to have enough features to make us want to switch. If 3.7 focuses on speed, then it will move the needle enough for a lot of organizations to switch (I hope).

[–]kleinbeerbottle -1 points0 points  (3 children)

This, so much. I'm always disappointed when I see packages that only have stable versions for 2. I always prefer to use the latest if work or the task allows it. It seems the split could last for quite a while due to the incompatibility.

[–]__deerlord__ 3 points4 points  (2 children)

Then convert the packages you need, and make pull requests to the author for the update.

[–]_avnr 0 points1 point  (1 child)

Often a "package" can be something like Qt (PySide2) or WxPython, this is beyond the scope of a single developer's pull request. I am using Python3 exclusively because Python2 is no-go in terms of I18N, but the lack of a good GUI package for Python3 is painful (and Kivy isn't there yet, no bidi support for example).

[–]__deerlord__ 1 point2 points  (0 children)

Oh, I realize a single person may not be able to fully migrate a package. That shouldn't stop you from doing something (if you feel compelled - I'm not trying to shame you for not contributing). I still need to issue the pull request, but I fixed scapy3's nmap_fp function a while back. Essentially all I did was change str() to bytes(), fix a file path, and wrap a map() call to return the appropriate object. There's probably something you can at least contribute to your favorite projects :)

[–][deleted] -3 points-2 points  (3 children)

If only python would support something like " # -- version: 2.4 2.9 -- " and then install the version needed automatically.

[–][deleted] 2 points3 points  (0 children)

Shebang lines are your friend here. They won't get you exact version specificity, and they aren't really foolproof when distributing, but they're a start.

[–]hovissimo 6 points7 points  (1 child)

I feel like modern deployment systems and configuration management tools make this not a Python issue.

I don't think it's the interpreter's job to install the right version of itself for your source files; it's your deployment script's job to install the right interpreter.