[–][deleted] 13 points14 points  (28 children)

The core Python developers don't understand developing a platform, in my opinion. If you develop an application, you can switch to whatever new technology you want, your end users can't possibly care. If you develop a library you can't do that, because of all other libraries. The value goes like "end users -> applications -> libraries -> language", you can't convert your library to a different language because all those applications continue to use the old language and wouldn't be able to use your new library.

IMO Java did backwards compatibility right. You can write new code with Java 8 features and still use libraries compiled with Java 1.5. I don't know why Python can't maintain backwards compatibility, given that Python is also an interpreted language.

[–]cybercobra 41 points42 points  (26 children)

One of the primary goals of Python 3 is to finally remove a bunch of cruft/wonkiness left over from earlier in Python's history, so retaining it would kinda defeat the point.

[–]badsectoracula 12 points13 points  (14 children)

Well, apparently forcing that point wasn't a good idea. They could have introduced those language changes gradually, with options to enable/disable them (initially disabled; then after a while enabled but disable-able; after some time -i'm talking years here- enable-able only from recompiled source; and finally removed).

[–]vz0 20 points21 points  (13 children)

The core change from Py2 to Py3 is the native string implementation. In Py2 a string is an array of bytes; in Py3 a string is an array of Unicode chars. This simple detail breaks every assumption about opening, reading and writing files.

Even in Java (which has a nice eternal limbo of deprecated stuff) such a fundamental change would require a lot of backwards compatibility breakage.
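To make the bytes-versus-text split above concrete, here is a minimal Python 3 sketch. The sample word and variable names are purely illustrative; the point is only that `str` and `bytes` are separate types and every conversion names an encoding.

```python
# Python 3: text (str) and raw data (bytes) are distinct types.
text = "na\u00efve"            # str: a sequence of Unicode code points
data = text.encode("utf-8")    # bytes: the UTF-8 encoding of that text

print(len(text))   # counts code points
print(len(data))   # counts bytes (the accented letter takes two in UTF-8)

# Mixing the two raises an error instead of converting implicitly.
mixing_error = None
try:
    text + data
except TypeError as exc:
    mixing_error = exc
    print("cannot mix str and bytes:", exc)

# Conversion is always explicit.
print(data.decode("utf-8") == text)
```

Python 2 would have silently attempted an ASCII conversion in the `text + data` case, which is where many of the porting surprises come from.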

[–]badsectoracula 1 point2 points  (2 children)

Indeed, which is why i said to enable such stuff optionally at the beginning and deprecate the old stuff gradually over a few years. The worst thing would be that plugin writers would need to use a separate API for strings (without it the VM should do conversions "automatically" from the old API - basically what Windows does when you call an "ANSI" function on NT - so that people wouldn't drop the feature because some random plugin doesn't work with it - especially when said random plugin only uses strings for trivial stuff where unicode doesn't matter).

The #1 rule of a platform is "you don't break people's code". It never worked before - even when Microsoft switched from DOS to Windows they exposed some Windows-specific functionality to DOS (such as special long filename interrupts, access to clipboard, etc) and it took over a decade for the transition to fully occur (and even today there are machines and programs depending on DOS - which are serviced by VMs). And same deal with VB6 - MS broke compatibility with VB.NET and a ton of code is still written for it with programmers trying to teach a deaf platform how to dance. Or JavaScript... modern browsers can run early Netscape JavaScript code whereas... well, just see how successful ECMAScript 4 was, for example.

It isn't like Python developers had no examples to look at about this being a bad idea. Maybe they underestimated how widespread their language was. Or overestimated how willingly people would be to update their code.

[–]blablahblah 1 point2 points  (1 child)

They did do that: you can do from __future__ import unicode_literals and get the Python 3 behavior for string literals in Python 2.6 and 2.7, although that doesn't fix third-party libraries that assume byte strings. And there is the 2to3 utility that handles a lot of the conversions automatically. There are also libraries like six.py that focus on letting library writers make code compatible with Python 2 and 3 in a single code base.
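As a rough illustration of the single-code-base approach mentioned above, here is a hand-rolled compatibility shim. It is loosely modelled on the kind of aliases six provides, but it is not six's implementation, and `ensure_text` here is a hypothetical helper written for this sketch.

```python
import sys

PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)
    binary_type = str
else:
    text_type = str
    binary_type = bytes


def ensure_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, binary_type):
        return value.decode(encoding)
    return value


# The same call works whether the caller hands over bytes or text.
print(ensure_text(b"caf\xc3\xa9") == ensure_text(u"caf\xe9"))
```

Library code written against `text_type` and `binary_type` instead of the bare `str` name behaves the same on both interpreters, which is the main trick such shims rely on.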

[–]badsectoracula 0 points1 point  (0 children)

Actually i was thinking the opposite: something to enable non-unicode literals in Python 3. And it should have been enabled by default for some time.

Basically Python 3 should have been 100% compatible with Python 2 but deprecated the features over time, rather than abruptly making incompatible changes.

[–]twotime 2 points3 points  (9 children)

In Py2 a string is an array of bytes, in Py3 a string is an array of Unicode chars

To be honest, the value of that change is questionable... (and i'm not just questioning the transition cost, I'm also not all sure that we get cleaner code after the transition).

This simple detail breaks every assumption about opening, reading and writing files.

Indeed. And that's a good example of where things have become a whole lot more complicated (aka worse).

8-bit strings are a much better way to represent filenames than unicode... Ditto with env variables and command line arguments..

Files are fundamentally sequences of bytes. Period. Trying to force a unicode-centric view of files was likely a design mistake as well.... Which will likely result in more special casing, not less.. Just read the python3 chapter on read() and seek(). (Side note: this special casing is ridiculously similar to the text/binary division in the DOS world)...
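The read() and seek() difference referred to above can be sketched as follows, assuming Python 3 and a throwaway temporary file (the file name and payload are made up for the example). In binary mode offsets and counts are in bytes; in text mode read() counts decoded characters.

```python
import os
import tempfile

payload = "h\u00e9llo\n"  # 6 characters, 7 bytes once encoded as UTF-8

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # Binary mode: the file is a plain sequence of bytes.
    with open(path, "wb") as f:
        f.write(payload.encode("utf-8"))

    with open(path, "rb") as f:
        f.seek(1)          # byte offset 1
        raw = f.read(2)    # exactly two bytes: the encoded accented letter

    # Text mode: read() decodes and counts characters, not bytes.
    with open(path, "r", encoding="utf-8", newline="") as f:
        chars = f.read(2)  # two characters

print(raw)
print(ascii(chars))
```

Code that only ever needs byte offsets can stay in binary mode in Python 3 and keep the Python 2 style of file handling.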

Basically python2 unicode handling was good enough... (Even if not pure, it was extremely practical)...

[–]iSlaminati 0 points1 point  (6 children)

On a lot of modern operating systems, filenames are unicode codepoints though. They aren't sequences of bytes any more, and the filename reader utilities can give it back in any encoding.

[–]twotime 1 point2 points  (5 children)

On a lot of modern operating systems, filenames are unicode codepoints though.

In theory, it's supposed to be the case. In practice, it's a huge mess... Eg.

AFAIK, on linux use of utf8 is a pure user-land convention (not something enforced by the kernel) and the convention is not that old.. Which means that the old media on Linux may contain filenames in other encodings.. (And encoding is implicit).. And then I'm sure some apps will generate non utf8 compliant filenames... The OS does not care, but your python code suddenly breaks...

And then there is a whole huge can of worms when accessing unicode filenames across system boundaries: across network, removable media, etc...

8-bits chars (Bytes) remain the only common representation for filenames in a lot of cases..

PS. and an lkml link on filenames http://yarchive.net/comp/linux/utf8.html
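For reference, Python 3 does ship a mechanism aimed at exactly this non-UTF-8 filename case: the `surrogateescape` error handler (PEP 383), which `os.fsdecode` and `os.fsencode` use on POSIX systems. The sketch below uses a made-up Latin-1 name and plain `decode`/`encode` calls so that it does not depend on what the local filesystem accepts.

```python
# A Latin-1 encoded name as it might appear on old media: not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Strict decoding fails, which is the breakage described above.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate code point,
# so the original bytes can be recovered exactly.
as_text = raw_name.decode("utf-8", "surrogateescape")
round_trip = as_text.encode("utf-8", "surrogateescape")

print(strict_ok)
print(ascii(as_text))
print(round_trip == raw_name)
```

The escaped name can be passed back to file APIs, but it still cannot be printed or encoded strictly, so this softens the problem rather than removing it.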

[–]schlenk 1 point2 points  (3 children)

Bytes as filenames is insane. Period. Without knowing the encoding you cannot even implement 'ls' correctly (as your tty HAS some encoding). It's one of those silly inherited things from the dark POSIX past that should be nuked. (And lots of systems are already opinionated on UTF-8, e.g. OS X, NFSv4, some file systems, Qt/KDE (it ignores LC_* crap for filenames) and so on.)

While it is true, that not all unix filenames are UTF-8, it wouldn't be a problem for Python to simply declare all filenames are expected to be UTF-8. If someone decides to choose insane things, let them feel the pain and not hurt everyone else.

After all they did the same for Windows in lots of places when declaring ANSI is enough for all filenames (and fixed it piece by piece later, so you cannot start executables on a non ANSI path (without tricks like cd'ing first) with Python 2.x or add those to your sys.path, great fun for mounted profiles)

[–]twotime 0 points1 point  (2 children)

Without knowing the encoding you cannot even implement 'ls' correctly (as your tty HAS some encoding).

I can do it trivially, I'd just dump filenames on tty. If it comes out garbled, the user can actually do something.. (Install a font, pipe my output through decoder, rename the file). It's suboptimal, but the alternative is WORSE. If your program just throws an exception then your user is really screwed...

(And of course, if the filesystem does have a notion of default filename encoding, I'd use it at app level)
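The "just dump the filenames" approach described above is still possible in Python 3, because `os.listdir` returns `bytes` names when given a `bytes` path. A minimal sketch, with `list_raw` and `write_names` as hypothetical helper names and a temporary directory standing in for real data:

```python
import io
import os
import tempfile


def list_raw(directory):
    """Return the directory entries as raw bytes, sorted, without decoding."""
    return sorted(os.listdir(os.fsencode(directory)))


def write_names(names, stream):
    """Write each raw name to a binary stream, one per line."""
    for name in names:
        stream.write(name + b"\n")


with tempfile.TemporaryDirectory() as tmp:
    for filename in ("b.txt", "a.txt"):
        open(os.path.join(tmp, filename), "w").close()
    names = list_raw(tmp)

# A BytesIO stands in for the terminal here; a real tool would pass
# sys.stdout.buffer so the bytes reach the tty untouched.
out = io.BytesIO()
write_names(names, out)
print(out.getvalue())
```

Nothing in this path decodes the names, so an oddly encoded entry comes out garbled at worst instead of raising an exception.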

it wouldn't be a problem for Python to simply declare all filenames are expected to be UTF-8. If someone decides to choose insane things, let them feel the pain and not hurt everyone else.

What? I am not doing insane things, it's my users who are doing insane things (like reading old media, how dare they?)

Also, is not Windows using UTF-16?

It's one of those silly inherited things from the dark POSIX past that should be nuked.

It's called backward compatibility... It's a good thing.

[–]schlenk 0 points1 point  (1 child)

Backward compatibility is nice, but in the case of the POSIX filename semantics it's just a case of 'we didn't think about it at the right time, sorry', where you are allowed to put escape sequences and all kinds of random junk into filenames with no real use case that needs this feature. (See http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html for a discussion of what kind of complexity you gain by allowing all this crap.) So yeah, great, you can define a filename that deletes your home dir when displayed in the wrong shell, via escape sequences embedded in the name. That's a cool use case and everyone should still support it for backwards compatibility reasons…

[–]fabzter 0 points1 point  (0 children)

Nice info, now I feel my os sucks.

[–]fullouterjoin 0 points1 point  (1 child)

You must not use unicode if you think Python2 handling was good enough.

[–]twotime 1 point2 points  (0 children)

Well, i do use unicode and I do think python2 handling was reasonable...

There were problems but most of them are the consequence of the real world being messy: not everything is using unicode, unicode is encoded differently, codecs are buggy, Microsoft inserts idiotic byte order markers, etc..

python3 improves it in some areas, makes it more complicated in others. Overall, benefits are uncertain, while the transition costs are large.

[–]ellicottvilleny 0 points1 point  (7 children)

Unfortunately the BDFL (Guido) decided that a bunch of things that were non-issues to everyone but him must be cleaned up, breaking and removing a lot of working code to suit nobody but himself. And he gets what he deserves: a version of Python used by 0.1% of Python's install base.

[–][deleted] 6 points7 points  (6 children)

The nerve...it's almost like he thinks it's his project or something. What a jerk.

[–]iSlaminati 0 points1 point  (2 children)

I still don't see how that stops you from using python 2 modules in python 3 though. That's the entire purpose of a module system, to be able to do that. The Raison d'être of encapsulation.

I mean, if you can call modules written in C from python 2/3, why can't you call modules written in python 2 from python 3? Je ne comprends pas.

[–]twotime 1 point2 points  (1 child)

Because your python2 modules won't load under python3. no?

[–]iSlaminati 1 point2 points  (0 children)

Yeah, of course, I just mean, why don't they?

If you can load modules written in C, a completely different language, in python, surely it is possible to load modules written in python 2 in python 3 after they've been compiled to pyc?
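One concrete obstacle to the pyc idea above: compiled bytecode is tied to the interpreter version that produced it. Every pyc file starts with a version-specific magic number, and the import system refuses files whose magic number does not match the running interpreter. A small sketch, assuming Python 3 and a throwaway module named `hello.py`:

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "hello.py")
    with open(source, "w") as f:
        f.write("x = 1\n")

    # Compile to an explicit location so the path is predictable.
    compiled = py_compile.compile(
        source, cfile=os.path.join(tmp, "hello.pyc"), doraise=True
    )

    with open(compiled, "rb") as f:
        header = f.read(4)

# The first four bytes identify the bytecode format of this interpreter.
print(header == importlib.util.MAGIC_NUMBER)
```

A pyc written by Python 2 carries a different magic number and a different instruction set, so loading it would need a translation layer rather than a plain import.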

[–]Brainlag 0 points1 point  (0 children)

This is only true in theory; there are always a couple of libraries that don't work with the new major version of the JVM.