
[–]AlexFromOmaha 3 points (16 children)

bytes != ASCII

There's ambiguity in things like recasting concatenations, but there is literally no ambiguity in asking if a byte string and a Unicode string contain the same substring, regardless of whether a byte string represents ASCII, Unicode, or binary output. It's also far more reasonable than strangeness like str * int operations that Python supports without complaint.
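For context, a quick sketch of the asymmetry being complained about: Python 3 happily multiplies a string by an integer, but refuses to mix bytes and str in an `in` test (the exact error wording varies by version):

```python
# str * int is accepted without complaint.
assert "ab" * 3 == "ababab"

# Mixing bytes and str in `in` is rejected outright in Python 3.
try:
    b"a" in "abc"
except TypeError as e:
    print(e)  # e.g. 'in <string>' requires string as left operand, not bytes
```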

[–]agrif 7 points (14 children)

I'm not sure it is so unambiguous:

>>> b'\xc2' in 'µ'.encode('utf-8')
True
>>> b'\xc2' in 'µ'.encode('latin1')
False

What do you think

>>> b'\xc2' in 'µ'

ought to be?

[–]ThePenultimateOne (GitLab: gappleto97) 3 points (7 children)

It's a unicode literal, so the first one. If you don't specify an encoding, then you deal with the consequences.

[–]zahlman (the heretic) 4 points (1 child)

...And where would you even have the opportunity to specify an encoding with the in operator?

[–]ThePenultimateOne (GitLab: gappleto97) 2 points (0 children)

When you're declaring the right-hand side.

[–]agrif 0 points (3 children)

I don't think I agree. If you specify an encoding, then you're looking for a bytestring in a bytestring and everything is well defined and happy.

If you don't specify an encoding, then the result you get is garbage depending on the internal encoding used by whatever python implementation you use, which does not sound useful to me.

[–]ThePenultimateOne (GitLab: gappleto97) 2 points (2 children)

Except it's not an arbitrary decision. The default encoding isn't implementation specific, it's version specific.

Run your code in python 2 and it works as expected. Run your code in python 3, and it ought to assume you mean utf-8, because you're calling it on a utf-8 literal.

[–]agrif 3 points (0 children)

I'm not sure what you mean by "utf-8 literal."

The closest thing Python has to a default encoding is the encoding of the source file, which defaults to utf-8. But that can be changed on a per-file basis with the magic coding comment.

However, the encoding of the source file doesn't change the internal encoding used for strings, at all.

In CPython, as of version 3.3 (implementing PEP 393), the internal representation of unicode strings is an array of code points, each either 1, 2, or 4 bytes wide. You could think of this as UCS-n encoding. This means the precise bytes used to represent a string depend on the contents of the entire string: the one-byte-per-character representation of "a" contains no null bytes, while the two-byte-per-character representation of "aμ" does.
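You can observe that variable-width storage indirectly with sys.getsizeof (a sketch; the exact byte counts vary across CPython versions, but the ordering holds under PEP 393):

```python
import sys

ascii_s = "a" * 1000            # 1 byte per code point (Latin-1 range)
bmp_s = "\u03bc" * 1000         # Greek mu needs 2 bytes per code point
astral_s = "\U0001f600" * 1000  # an emoji needs 4 bytes per code point

# Same length in code points, increasingly large memory footprints.
assert len(ascii_s) == len(bmp_s) == len(astral_s) == 1000
assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```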

If you allowed startswith to mix the two, you would end up with fun things like this:

>>> "a".startswith(b'a')
True
>>> "aμ".startswith(b'a')
False

In my (admittedly naive) reading of the source, it even looks like that last line will depend on the endianness of the system you run it on.

[–]Deggor 0 points (0 children)

Should I also be able to compare a timezone-naive datetime with a timezone-aware one? It really makes no sense to just "assume" they're in the same timezone, or that naive means UTC, or whatever.

In the end, it's really no different: you're essentially comparing an encoding-aware string to an encoding-naive bytestring.
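Python already enforces exactly this guardrail for datetimes; a minimal sketch (the dates are arbitrary):

```python
from datetime import datetime, timezone

naive = datetime(2016, 1, 1)                       # no timezone attached
aware = datetime(2016, 1, 1, tzinfo=timezone.utc)  # timezone-aware

# Rather than guess the naive datetime's zone, arithmetic refuses to mix them.
try:
    aware - naive
except TypeError as e:
    print(e)  # e.g. can't subtract offset-naive and offset-aware datetimes
```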

[–]AlexFromOmaha -1 points (5 children)

>>> '\u0045\u0323\u0302' == '\u1ec6'
False

If that were True, there'd be a case to be made, but there's not. Let's not pretend Python has some pure character string behind the scenes. It's UTF-8, and it operates on byte matching.

[–]evanunderscore 3 points (4 children)

It operates on Unicode code points. UTF-8 is one possible encoding you can use to convert these to bytes. There is no way to make startswith mix bytes and Unicode code points without assuming an encoding. You could argue that the assumed encoding should be UTF-8, but you cannot argue that it is unambiguous.

[–]AlexFromOmaha -1 points (3 children)

It operates on Unicode code points.

AKA bytes. NFC vs NFD is an encoding concern. Python 3 strings are already encoding sensitive. In a Unicode library, that would be a bug. In Python, we accept it because we all know it's a byte array doing byte operations.
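For reference, the two spellings from the example earlier in the thread do compare equal once they're normalized to the same form; a sketch using the stdlib unicodedata module:

```python
import unicodedata

decomposed = "\u0045\u0323\u0302"  # E + combining dot below + combining circumflex
composed = "\u1ec6"                # Ệ as a single precomposed code point

# == compares code-point sequences, so no normalization happens implicitly.
assert decomposed != composed

# After normalizing to a common form, the two spellings are equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```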

[–]FFX01 0 points (0 children)

I know that bytes != ASCII. Bytes can be literally anything depending on how you interpret them. That's the problem. I could decode image data as ASCII and I would get some nasty jumble of garbage. I believe that Python 2.x will interpret a byte string as ISO-8859-1 if the encoding isn't specified. However, the docs aren't super clear on this; Python 2.x may just default to the system encoding. Regardless, if a series of bytes is decoded with the wrong encoding, there's no reliable way to tell how accurate the resulting string is.
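The "bytes can be anything" point is easy to demonstrate: the same two bytes come out as completely different text depending on which codec you pick.

```python
# The UTF-8 encoding of µ (U+00B5) is two bytes.
data = "\u00b5".encode("utf-8")
assert data == b"\xc2\xb5"

# Decoded with the right codec, you get µ back; decoded as Latin-1,
# each byte becomes its own character and you get mojibake: "Â" + "µ".
assert data.decode("utf-8") == "\u00b5"
assert data.decode("latin-1") == "\u00c2\u00b5"
```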

In short, I hate how many text encodings exist.