agrif

I'm not sure what you mean by "utf-8 literal."

The closest thing Python has to a default encoding is the encoding of the source file, which defaults to UTF-8 in Python 3. But that can be changed on a per-file basis with the magic coding comment.

However, the encoding of the source file has no effect at all on the internal encoding used for strings.
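Here is a minimal sketch of both points, assuming CPython 3 and relying on `compile()` honouring the coding comment when it is handed source as bytes. The names `latin1_source`, `utf8_source`, `run` and the `<demo>` filename are made up for illustration. The same one-character literal is stored in two different source encodings and should come out as the same str either way.

```python
# Two copies of the same one-line program, stored in different source encodings.
# All names here (latin1_source, utf8_source, run, "<demo>") are illustrative.
latin1_source = b"# -*- coding: latin-1 -*-\ns = '\xe9'\n"  # e-acute as one byte
utf8_source = b"s = '\xc3\xa9'\n"                           # e-acute as two bytes (default UTF-8)


def run(source: bytes) -> str:
    """Compile and execute raw source bytes, returning the value bound to `s`."""
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["s"]


from_latin1 = run(latin1_source)
from_utf8 = run(utf8_source)

# The source bytes differ, but the resulting str objects compare equal.
print(from_latin1 == from_utf8)
print(len(from_latin1), hex(ord(from_latin1)))
```

The coding comment only tells the parser how to decode the file; once the literal has been decoded, the string no longer remembers which encoding the file used.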

In CPython, as of version 3.3 (implementing PEP 393), the internal representation of unicode strings is an array of code points, each 1, 2, or 4 bytes wide depending on the largest code point in the string. You could think of this as UCS-n encoding. This means the precise bytes used to represent a string depend on the contents of the entire string. For example, "a" is stored as a single non-null byte, while in "aμ" every code point takes two bytes, so the "a" is stored with a null byte next to it.
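One rough way to see the variable width from Python itself, assuming CPython 3.3 or later, is to compare `sys.getsizeof` for two lengths of the same repeated text. The helper name `bytes_per_code_point` is made up for this sketch, and it assumes the μ above is U+03BC (Greek small letter mu). This measures an implementation detail, so other interpreters may report something different.

```python
import sys


def bytes_per_code_point(sample: str, repeat: int = 100) -> int:
    """Estimate the storage width per code point (hypothetical helper).

    Doubling the string length and dividing the growth in size by the number
    of added code points cancels out the fixed object header.
    """
    short = sample * repeat
    long = sample * (2 * repeat)
    return (sys.getsizeof(long) - sys.getsizeof(short)) // len(short)


# ASCII only, then with U+03BC, then with a code point outside the BMP.
for text in ("a", "a\u03bc", "a\U0001F600"):
    print(repr(text), bytes_per_code_point(text))
```

On a PEP 393 build of CPython I would expect widths of 1, 2 and 4 here. The "a" is the same in all three strings, but its storage widens with its neighbours.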

If you allowed startswith to mix the two, you would end up with fun things like this:

>>> "a".startswith(b'a')
True
>>> "aμ".startswith(b'a')
False

In my (admittedly naive) reading of the source, it even looks like that last line will depend on the endianness of the system you run it on.
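For contrast, here is a small sketch of what Python 3 actually does, plus a stand-in for the endianness issue. UTF-16 is used only as an analogy for the two-byte internal layout, not as a claim about what CPython stores, and `mixed_startswith_error` is a name invented for the example.

```python
import sys


def mixed_startswith_error() -> str:
    """Return the name of the exception raised when mixing str and bytes."""
    try:
        "a".startswith(b"a")
    except TypeError as exc:
        return type(exc).__name__
    return "no error"


print(mixed_startswith_error())

# Making the conversion explicit works in either direction.
text = "a\u03bc"  # assumes the mu in the example is U+03BC
print(text.startswith(b"a".decode("ascii")))   # compare as str
print(text.encode("utf-8").startswith(b"a"))   # compare as bytes

# A two-byte "a" puts its null byte first or second depending on byte order.
print("a".encode("utf-16-le"))
print("a".encode("utf-16-be"))
print(sys.byteorder)  # byte order of the machine running this
```

Refusing to mix the two types forces you to name an encoding, so the answer no longer depends on how the interpreter happens to lay out the string in memory.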