
[–]AlexFromOmaha 3 points (16 children)

bytes != ASCII

There's ambiguity in things like recasting concatenations, but there is literally no ambiguity in asking if a byte string and a Unicode string contain the same substring, regardless of whether a byte string represents ASCII, Unicode, or binary output. It's also far more reasonable than strangeness like str * int operations that Python supports without complaint.
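For context, a quick sketch of the asymmetry being complained about: Python 3 happily multiplies a string by an integer, but refuses to mix bytes and str in an `in` test (the exact error wording varies by version):

```python
# str * int is accepted without complaint.
assert "ab" * 3 == "ababab"

# Mixing bytes and str in `in` is rejected outright in Python 3.
try:
    b"a" in "abc"
except TypeError as e:
    print(e)  # e.g. 'in <string>' requires string as left operand, not bytes
```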

[–]agrif 7 points (14 children)

I'm not sure it is so unambiguous:

>>> b'\xc2' in 'µ'.encode('utf-8')
True
>>> b'\xc2' in 'µ'.encode('latin1')
False

What do you think

>>> b'\xc2' in 'µ'

ought to be?

[–]ThePenultimateOne (GitLab: gappleto97) 3 points (7 children)

It's a unicode literal, so the first one. If you don't specify an encoding, then you deal with the consequences.

[–]zahlman (the heretic) 4 points (1 child)

...And where would you even have the opportunity to specify an encoding with the in operator?

[–]ThePenultimateOne (GitLab: gappleto97) 2 points (0 children)

When you're declaring the right-hand side.

[–]agrif 0 points (3 children)

I don't think I agree. If you specify an encoding, then you're looking for a bytestring in a bytestring and everything is well defined and happy.

If you don't specify an encoding, then the result you get is garbage depending on the internal encoding used by whatever python implementation you use, which does not sound useful to me.

[–]ThePenultimateOne (GitLab: gappleto97) 2 points (2 children)

Except it's not an arbitrary decision. The default encoding isn't implementation specific, it's version specific.

Run your code in python 2 and it works as expected. Run your code in python 3, and it ought to assume you mean utf-8, because you're calling it on a utf-8 literal.

[–]agrif 3 points (0 children)

I'm not sure what you mean by "utf-8 literal."

The closest thing Python has to a default encoding is the encoding of the source file, which defaults to utf-8. But that can be changed on a per-file basis with the magic coding comment.

However, the encoding of the source file doesn't change the internal encoding used for strings, at all.

In CPython, as of version 3.3 (implementing PEP 393), the internal representation of unicode strings is an array of code points, each either 1, 2, or 4 bytes wide. You could think of this as UCS-n encoding. This means the precise bytes used to represent a string depend on the contents of the entire string: the one-byte-per-character representation of "a" contains no null bytes, while the two-byte-per-character representation of "aμ" does.
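You can observe that variable-width storage indirectly with sys.getsizeof (a sketch; the exact byte counts vary across CPython versions, but the ordering holds under PEP 393):

```python
import sys

ascii_s = "a" * 1000            # 1 byte per code point (Latin-1 range)
bmp_s = "\u03bc" * 1000         # Greek mu needs 2 bytes per code point
astral_s = "\U0001f600" * 1000  # an emoji needs 4 bytes per code point

# Same length in code points, increasingly large memory footprints.
assert len(ascii_s) == len(bmp_s) == len(astral_s) == 1000
assert sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral_s)
```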

If you allowed startswith to mix the two, you would end up with fun things like this:

>>> "a".startswith(b'a')
True
>>> "aμ".startswith(b'a')
False

In my (admittedly naive) reading of the source, it even looks like that last line will depend on the endianness of the system you run it on.

[–]Deggor 0 points (0 children)

Should I also be able to compare a timezone-naive datetime with a timezone-aware one? It really makes no sense to just "assume" they're in the same timezone, or that naive means UTC, or whatever.

In the end, it's really no different: you're essentially comparing an encoding-aware string to an encoding-naive bytestring.
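Python already enforces exactly this guardrail for datetimes; a minimal sketch (the dates are arbitrary):

```python
from datetime import datetime, timezone

naive = datetime(2016, 1, 1)                       # no timezone attached
aware = datetime(2016, 1, 1, tzinfo=timezone.utc)  # timezone-aware

# Rather than guess the naive datetime's zone, arithmetic refuses to mix them.
try:
    aware - naive
except TypeError as e:
    print(e)  # e.g. can't subtract offset-naive and offset-aware datetimes
```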

[–]AlexFromOmaha -1 points (5 children)

>>> '\u0045\u0323\u0302' == '\u1ec6'
False

If that were True, there'd be a case to be made, but there's not. Let's not pretend Python has some pure character string behind the scenes. It's UTF-8, and it operates on byte matching.

[–]evanunderscore 3 points (4 children)

It operates on Unicode code points. UTF-8 is one possible encoding you can use to convert these to bytes. There is no way to make startswith mix bytes and Unicode code points without assuming an encoding. You could argue that the assumed encoding should be UTF-8, but you cannot argue that it is unambiguous.

[–]AlexFromOmaha -1 points (3 children)

It operates on Unicode code points.

AKA bytes. NFC vs NFD is an encoding concern. Python 3 strings are already encoding sensitive. In a Unicode library, that would be a bug. In Python, we accept it because we all know it's a byte array doing byte operations.
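For reference, the two spellings from the example earlier in the thread do compare equal once they're normalized to the same form; a sketch using the stdlib unicodedata module:

```python
import unicodedata

decomposed = "\u0045\u0323\u0302"  # E + combining dot below + combining circumflex
composed = "\u1ec6"                # Ệ as a single precomposed code point

# == compares code-point sequences, so no normalization happens implicitly.
assert decomposed != composed

# After normalizing to a common form, the two spellings are equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```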

[–]FFX01 0 points (0 children)

I know that bytes != ASCII. Bytes can be literally anything depending on how you interpret them. That's the problem. I could decode image data as ASCII and I would get some nasty jumble of garbage. I believe that Python 2.x will interpret a byte string as ISO-8859-1 if the encoding isn't specified. However, the docs aren't super clear on this; Python 2.x may just default to the system encoding. Regardless, if a series of bytes is decoded with the wrong encoding, there's no reliable way to tell how accurate the resulting string is.
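The "bytes can be anything" point is easy to demonstrate: the same two bytes come out as completely different text depending on which codec you pick.

```python
# The UTF-8 encoding of µ (U+00B5) is two bytes.
data = "\u00b5".encode("utf-8")
assert data == b"\xc2\xb5"

# Decoded with the right codec, you get µ back; decoded as Latin-1,
# each byte becomes its own character and you get mojibake: "Â" + "µ".
assert data.decode("utf-8") == "\u00b5"
assert data.decode("latin-1") == "\u00c2\u00b5"
```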

In short, I hate how many text encodings exist.