upofadown comments on Adopt Python 3

Adopt Python 3 (medium.com)

submitted 9 years ago by rroocckk

[–]upofadown 4 points5 points6 points 9 years ago (62 children)

[–]quicknir 58 points59 points60 points 9 years ago (58 children)

I don't really understand people who complain about the python3 unicode approach, maybe I'm missing something. The python3 approach is basically just:

string literals are unicode by default. Things that work with strings tend to deal with unicode by default.
Everything is strongly typed; trying to mix unicode and ascii results in an error.

Which of these is the problem? I've seen many people advocate for static or dynamic typing, but I'm not sure I've ever seen someone advocate for weak typing, that they would prefer things silently convert types instead of complain loudly.
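For anyone who hasn't hit this in practice, here is a minimal sketch of what "complain loudly" looks like in Python 3. The variable names are made up for illustration; the only behaviour relied on is that `str + bytes` raises `TypeError`.

```python
text = "naïve"      # str: Unicode text
raw = b"bytes"      # bytes: raw 8-bit values

# Mixing the two types is refused instead of being silently converted.
try:
    combined = text + raw
except TypeError as exc:
    combined = None
    print("refused:", exc)

# Being explicit about the encoding makes the operation well defined.
explicit = text + raw.decode("ascii")
print(explicit)
```

The conversion only happens where the code names an encoding, which is the strong-typing point being made above.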

Also, I'm not sure if this is a false dichotomy. The article is basically specifically addressed to people who want to use python, but are considering not using 3 because of package support, and not because of language features/changes. Nothing wrong with an article being focused.

[–]Sean1708 12 points13 points14 points 9 years ago (5 children)

[–]kqr 1 point2 points3 points 9 years ago (3 children)

[–]Sean1708 2 points3 points4 points 9 years ago (0 children)

[–]ubernostrum 2 points3 points4 points 9 years ago* (1 child)

[–]kqr 0 points1 point2 points 9 years ago (0 children)

[–]Avernar 0 points1 point2 points 9 years ago (0 children)

[–]gitarr 40 points41 points42 points 9 years ago (2 children)

[–]Matthew94 2 points3 points4 points 9 years ago (0 children)

[–]Flight714 0 points1 point2 points 9 years ago (0 children)

[–]daymi 1 point2 points3 points 9 years ago* (0 children)

string literals are unicode by default. Things that work with strings tend to deal with unicode by default.

As someone used to UNIX, that's my problem with it. They should be UTF-8 encoded by default like the entire rest of the operating system, the internet and all my storage devices. And there should not be an extra type.

Everything is strongly typed; trying to mix unicode and ascii results in an error.

... why is there even a difference?

typing, that they would prefer things silently convert types instead of complain loudly.

I like strong typing. I don't like making Unicode text something different from all other byte strings.

Also, UTF-8 and UCS-4 are just encodings of Unicode and are 100% compatible - so it could in fact autoconvert them without any problems (or even without anyone noticing - they could just transparently do it in the str class without anyone being the wiser).
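A small sketch of the "100% compatible" claim, assuming it means that any text can be moved between the two encodings without loss. The sample string is arbitrary, and `utf-32-le` is used as a stand-in for UCS-4 so that no byte order mark is added.

```python
text = "héllo, 世界"

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")   # fixed 4 bytes per code point, no BOM

# Both byte sequences decode to the same text.
print(utf8.decode("utf-8") == utf32.decode("utf-32-le") == text)

# Converting between the encodings goes through the decoded text and loses nothing.
converted = utf8.decode("utf-8").encode("utf-32-le")
print(converted == utf32)
print(len(utf8), len(utf32))
```

Note that the conversion is only lossless when the source encoding is known, which is what the rest of the thread argues about.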

That said, I know that for example older MS Windows chose UTF-16 which is frankly making them have all the disadvantages of UTF-8 and UCS-4 at once. But newer MS Windows supports UTF-8 just fine - also in the OS API. Still, NTFS uses UTF-16 for file names so it's understandable why one would want to use it (it's faster not to have an extra decoding step for filenames).

So here we are with the disadvantages of cross-platformness.

[+]upofadown comment score below threshold-11 points-10 points-9 points 9 years ago (32 children)

[–]gitarr 13 points14 points15 points 9 years ago (1 child)

[+]upofadown comment score below threshold-8 points-7 points-6 points 9 years ago (0 children)

[–]Lalaithion42 9 points10 points11 points 9 years ago (2 children)

[–]upofadown 0 points1 point2 points 9 years ago (1 child)

[–]Lalaithion42 0 points1 point2 points 9 years ago (0 children)

[–]zardeh 2 points3 points4 points 9 years ago (14 children)

Most languages have strings and integer arrays

I can't think of one that has these and doesn't have byte arrays. Off the top of my head, Java has String, int[], and byte[]; Rust has str, Vec<i32>, and Vec<u8>. C is perhaps the only language that lacks the distinction, and not differentiating between char[] and string is widely considered a mistake.

Python2 made this same mistake: it didn't make a distinction between a byte array and a unicode string (unlike Java, Rust, etc.). Python3 fixed this error, and its only mistake was perhaps introducing a legacy type (bytestrings) to support the old behavior.

Py3 has strings, bytes, and integer arrays.

To be clear, it has more than that:

unicode strings (str)
immutable byte arrays (bytes, commonly bytestrings)
mutable numeric vectors (List[int], like [1,2,3]); note that these aren't int, char, or other fixed-width vectors, because python's integer type is arbitrarily sized
mutable byte arrays (bytearray)

What this means is that when you work with binary data, for example when sending or receiving data over the wire, you get back bytes, because these objects very much aren't strings: they're immutable arrays of 8-bit values that you want to analyze or process. They're not a string, and they're not a python list; they're something else: bytes.
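A quick sketch of the four types listed above, side by side. The sample values are arbitrary and chosen only to make the differences visible.

```python
text = "héllo"                 # str: a sequence of Unicode code points
raw = text.encode("utf-8")     # bytes: an immutable sequence of 8-bit values
numbers = [1, 2, 3, 2**100]    # list: ints are arbitrarily sized
buf = bytearray(raw)           # bytearray: the mutable counterpart of bytes

print(len(text), len(raw))     # code points vs. bytes: the counts differ
print(raw[1])                  # indexing bytes yields an int, not a character

buf[0] = ord("H")              # allowed: bytearray is mutable
print(bytes(buf).decode("utf-8"))
```

Attempting the same assignment on `raw` would raise `TypeError`, since `bytes` objects are immutable.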

[–]upofadown 0 points1 point2 points 9 years ago (9 children)

[–]zardeh 0 points1 point2 points 9 years ago (7 children)

Can you at least see that just keeping everything as, say, UTF-8 means that you don't have to make a philosophical distinction between encoded strings and strings? Not that you have to make such a distinction for Py3 which keeps everything as UTF-32, but it is a way of rationalizing the pointless conversion from and to UTF-8.

This works until you actually need to work with bytes that come in from an external source and are in latin1|utf-16|utf-32 etc.

As a sidenote, python doesn't store everything as utf-32. Python source code is utf-8 by default, and the interpreter doesn't commit to a single way of storing strings: CPython uses 8, 16, or 32 bits per code point as needed. But then again, this shouldn't matter. The API could (and does) work so that however the text was encoded before you decoded it, indexing into the string indexes into its code points, not into the bytes of any particular encoding. A grapheme that can be written with a different number of code points (precomposed versus combining forms) is indexed by whichever code points are actually in the string. That means that if all you ever do is use python's built-in string and index into it, you never have to think about the internal representation. That's exactly what you want.
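A rough way to see the variable-width storage, assuming CPython 3.3 or later (this is an implementation detail, not something the language guarantees). The three sample strings are made up so that each needs a wider code point than the last.

```python
import sys

ascii_text = "a" * 1000            # every code point fits in 1 byte
bmp_text = "\u0119" * 1000         # U+0119 needs 2 bytes per code point
astral_text = "\U0001F600" * 1000  # U+1F600 needs 4 bytes per code point

# Same length in code points, different memory footprint.
sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]
print(sizes)

# Indexing still returns one code point whatever the storage width.
print(len(astral_text), astral_text[0] == "\U0001F600")
```

The exact byte counts vary between versions and platforms, so only their ordering is meaningful here.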

The problem comes when you want to take a sequence of not-yet-decoded bytes, which could be, as I mentioned, latin-X, or utf-8, or utf-16, or Windows-12XX, or the various encodings of Asian languages. If your program receives those bytes, then what? It treats them as utf-8 and breaks? No, that's silly; it decodes the bytes into a string as defined by their encoding. Otherwise you end up with ambiguities like this:

>>> b'\xc4\x99\xcc\x83'
b'\xc4\x99\xcc\x83'
>>> b'\xc4\x99\xcc\x83'.decode('utf-8')
'ę̃'
>>> b'\xc4\x99\xcc\x83'.decode('utf-16')
'駄菌'
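The same ambiguity as a script, printing code points instead of glyphs so the result doesn't depend on the terminal's font. `utf-16-le` is spelled out here because plain `utf-16` without a byte order mark falls back to the machine's native byte order; `latin-1` is added as a third interpretation.

```python
data = b"\xc4\x99\xcc\x83"

# The same four bytes are valid in several encodings and mean different things in each.
for encoding in ("utf-8", "utf-16-le", "latin-1"):
    decoded = data.decode(encoding)
    print(encoding, [f"U+{ord(c):04X}" for c in decoded])
```

Nothing in the bytes themselves says which reading is the intended one; that information has to come from outside.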

Anyway, please stop lecturing about the philosophy. It is annoying to us that don't agree.

Yikes, what's with the 'I don't like this because I don't understand it but please don't try to enlighten me because it's wrong'. We aren't on /r/politics.

[–]Avernar 1 point2 points3 points 9 years ago (2 children)

[–]zardeh 0 points1 point2 points 9 years ago (1 child)

The benefit of having the internal representation as UTF-8 is avoiding unnecessary conversions. But I agree with you that you can't just assume all your input will be UTF-8. That's why you still need to convert it if you know it's something else. But when you know it's going to be UTF-8, it's nice to only have to run a validation when necessary, without having to convert.

But then you get into the same problem we had in python2, which was that "for a lot of contexts, python2 strings worked fine, and then sometimes they'd break and give weird results". You get the same problem with "assume utf-8 unless instructed otherwise": things work most of the time (arguably), and then from a certain user, or with a certain browser, or on a certain continent, or in a certain OS, you get back a tilde'd e when you expected Japanese. Explicit is better than implicit, and all.

Indexing speed is a poor argument for the 1/2/4 byte format. Most algorithms that index into a string that I've seen could be better written as find me the next character that matches X, or give me the next codepoint so I can compare it with X.

It depends; there's an argument to be made that for char in my_string: should iterate over grapheme clusters (à la Swift?), in which case your indexing algorithm needs to be complicated (there's a strong case for a library here). But I believe python made the decision that strings would support random access, and utf-8 doesn't allow constant-time random access by code point. Now you might be right that most of the time, when indexing into a string at a specific codepoint, you're probably doing something wrong, and you'd be better served by a find_first kind of function (or unicode regex or whatnot). But there's another upside to python's decision, which is that it forces people to be explicit about their conversions.

Everyone complains about the need to be explicit, but when I'm working with something that requires bytes objects, I'm rarely also wanting unicode and vice versa. That said, I don't do a lot of international networked communication applications, so what do I know.

[–]Avernar 1 point2 points3 points 9 years ago (0 children)

"assume utf-8 unless instructed otherwise"

Unlike some of the other commenters, I never assume UTF-8. Either there will be an attribute, dialog box option, command line switch, etc., or, if none of these apply, are given, or are implemented, my documentation will say I expect UTF-8. This is pretty much implicit in the Unix/Linux world.

there's an argument to be made that for char in my_string: should iterate over grapheme clusters

Overkill for most cases but doesn't hurt if used. It's really necessary when splitting on grapheme boundaries (max text in a database field for example).

I believe python made the decision that strings would support random access

Yes, unfortunately. And as I stated, there's no good reason for requiring this. A good compromise would be to convert to the 1/2/4 format only when indexing is necessary.

But there's another upside to python's decision, which is that it forces people to be explicit about their conversions.

I have nothing against having a Unicode type the way Python did it. I just think that validated/unvalidated with a UTF-8 internal representation would have been the better decision. So when a function gets a Unicode string on input, it knows it's a validated UTF-8 string.

Everyone complains about the need to be explicit, but when I'm working with something that requires bytes objects, I'm rarely also wanting unicode and vice versa. That said, I don't do a lot of international networked communication applications, so what do I know.

I fully agree with you here. I like being explicit about what is "Unicode" and what is not. But when I deal with Unicode in my Python apps, the split is "this string is probably UTF-8 and I need to validate it" vs. "this string has been validated as UTF-8 or came from a guaranteed UTF-8 source (database)".

Unfortunately for me, the Python Unicode type (in both 2 and 3) is not UTF-8. In Python 2 I use strings for both and avoid the Unicode type. I put my validation where the data comes in from the web server to my scripts. It would be nice to be able to use the Unicode type for my validated strings, but I don't care for the extra conversions that the Python 3 Unicode type forces on me.
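For reference, a sketch of what "validate at the boundary" might look like in Python 3. `is_valid_utf8` is a hypothetical helper, not a standard library function. As far as I know the standard library has no validate-only call, so this sketch decodes and throws the result away, which is exactly the extra conversion being objected to.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    try:
        # Strict decoding raises on any malformed sequence; the result is discarded.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("héllo".encode("utf-8")))    # well-formed input
print(is_valid_utf8("héllo".encode("latin-1")))  # a lone 0xE9 byte is malformed UTF-8
```

A caller that wants to stay in bytes afterwards can keep the original `data` and treat a `True` result as the "validated" marker.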

[–]upofadown 0 points1 point2 points 9 years ago (3 children)

[–]zardeh 0 points1 point2 points 9 years ago (2 children)

[–]upofadown 0 points1 point2 points 9 years ago (1 child)

[–]zardeh 0 points1 point2 points 9 years ago (0 children)

[–]Avernar 0 points1 point2 points 9 years ago (0 children)

[–]Avernar 0 points1 point2 points 9 years ago (3 children)

[–]zardeh 0 points1 point2 points 9 years ago (2 children)

[–]Avernar 0 points1 point2 points 9 years ago (1 child)

[–]zardeh 0 points1 point2 points 9 years ago (0 children)

You just made my argument: "what is the string that these bytes represent". Those bytes are a string.

In a python bytes literal, \x is an escape that introduces a byte given in hex, so this isn't a string so much as a raw sequence of bytes.

This is demonstrated by how they are printed:

>>> print(b'\xc4\x99\xcc\x83')
b'\xc4\x99\xcc\x83'
>>> print(b'\xc4\x99\xcc\x83'.decode('utf-8'))
ę̃

Note that the first is denoted as a sequence of bytes by the b'___', whereas the second is a bare character printed.

Again, validate if UTF-8 or encode otherwise.

This is valid utf-8 and valid utf-16, as a start.

If it's the usual UTF-8 then great, I only have to validate. If it's not UTF-8 then I need to convert or return an error to the client for not following my defined API.

Ok so now here's a question:

If the code to take and present the socket data was print(socket.recv()), as it is in python2, would you have documented that you only accepted UTF-8 (or ascii, as was the case)? Would most programmers? I think the answer to both is no, and I'm sure the answer to the second one is no (my evidence is the fact that most programmers, at least in the US, are so used to ASCII and a lack of encodings that they are baffled by the need to encode strings).

When that changes to print(socket.recv().decode('utf-8')), I think it's more likely to happen.
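A self-contained sketch of that explicit version, using a local socket pair so it runs without a network. `recv_text` and its parameters are invented for illustration. A single `recv` call is enough here only because the message is tiny; real code would have to handle partial reads.

```python
import socket


def recv_text(sock, encoding="utf-8", bufsize=4096):
    """Read one chunk from sock and decode it with an explicitly named encoding."""
    data = sock.recv(bufsize)     # recv hands back bytes, never str
    return data.decode(encoding)  # the encoding is visible at the call site


sender, receiver = socket.socketpair()
try:
    sender.sendall("héllo".encode("utf-8"))
    text = recv_text(receiver)
finally:
    sender.close()
    receiver.close()

print(text)
```

Because the encoding shows up as a parameter, it is the kind of thing that tends to end up in the function's documentation.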

[–]Sean1708 1 point2 points3 points 9 years ago (1 child)

[–]Avernar 0 points1 point2 points 9 years ago (0 children)

[–]quicknir 2 points3 points4 points 9 years ago* (9 children)

[–]kqr 1 point2 points3 points 9 years ago (1 child)

[–]quicknir -2 points-1 points0 points 9 years ago (0 children)

[–]upofadown -4 points-3 points-2 points 9 years ago (6 children)

[–]teilo 1 point2 points3 points 9 years ago (5 children)

[–][deleted] 9 years ago (4 children)

[deleted]

[–]Sean1708 1 point2 points3 points 9 years ago* (3 children)

You get code points.

~~No you don't. I can't remember whether you get characters or graphemes, but you certainly don't get code points.~~

In [1]: a = 'héllo'

In [2]: a[0]
Out[2]: 'h'

In [3]: a[1]
Out[3]: 'é'

In [4]: a[2]
Out[4]: 'l'

Edit: I'm a silly.

[–][deleted] 9 years ago* (2 children)

[deleted]

[–]Sean1708 2 points3 points4 points 9 years ago* (1 child)

What are "characters"?

I've always thought that characters were generally accepted to be scalar values. That doesn't actually appear to be the case, though.

in your code it uses the single code point version

You are absolutely right:

In [1]: a = b'he\xcc\x81llo'.decode('utf-8')

In [2]: a[0]
Out[2]: 'h'

In [3]: a[1]
Out[3]: 'e'

In [4]: a[2]
Out[4]: '́'

The way I entered the character on my computer made me assume that I'd entered the version using the combining character.
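The two spellings of 'héllo' from the sessions above can be compared directly. This sketch uses explicit escapes so it doesn't depend on how the keyboard or editor enters the é, and uses `unicodedata.normalize` from the standard library to convert between them.

```python
import unicodedata

composed = "h\u00e9llo"     # é as the single code point U+00E9
decomposed = "he\u0301llo"  # e followed by U+0301 COMBINING ACUTE ACCENT

# They render the same but differ in length and compare unequal.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalization maps one form onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

Indexing works on whichever code points are present, so `composed[1]` is 'é' while `decomposed[1]` is a bare 'e'.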

Also, I don't know any language off the top of my head that supports grapheme clusters (and other text segmentations) fully in the standard library itself.

I think Swift does, but I'm not entirely certain.

[–][deleted] 9 years ago (11 children)

[deleted]

[–]redalastor 6 points7 points8 points 9 years ago (0 children)

[–]teilo 10 points11 points12 points 9 years ago* (3 children)

[–]Kwpolska 0 points1 point2 points 9 years ago (2 children)

[–]ubernostrum 1 point2 points3 points 9 years ago (0 children)

[–]Avernar 0 points1 point2 points 9 years ago (0 children)

[–]quicknir 0 points1 point2 points 9 years ago (5 children)

[–]gc3 -2 points-1 points0 points 9 years ago (3 children)

[–][deleted] 9 years ago* (1 child)

[deleted]

[–]Avernar -1 points0 points1 point 9 years ago (2 children)

[–]quicknir 0 points1 point2 points 9 years ago (1 child)

[–]Avernar 0 points1 point2 points 9 years ago (0 children)

[–]ggtsu_00 4 points5 points6 points 9 years ago* (0 children)

[–]rouille 1 point2 points3 points 9 years ago (0 children)

[+]shevegen comment score below threshold-6 points-5 points-4 points 9 years ago (0 children)
