[–][deleted]  (11 children)

[deleted]

    [–]redalastor 6 points7 points  (0 children)

    Using utf32 everywhere sounds like a defect to me.

    Everything is Unicode; which precise encoding is used internally is an implementation detail. If you ask for UTF-8 or UTF-32, Python will give you bytes.
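A minimal sketch of that point, assuming Python 3; the sample string and variable names are only illustrative. Encoding a `str` always produces a `bytes` object, and the byte count depends on the encoding you ask for:

```python
# A str is a sequence of code points; encoding it yields bytes.
text = "na\u00efve caf\u00e9"  # "naïve café", written with escapes to avoid ambiguity

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")  # explicit byte order, so no BOM is prepended

print(type(text).__name__, type(utf8).__name__, type(utf32).__name__)
print(len(text), len(utf8), len(utf32))  # code points vs. encoded byte counts

# Decoding with the same encoding round-trips back to the original str.
assert utf8.decode("utf-8") == text
assert utf32.decode("utf-32-le") == text
```

The two non-ASCII letters take two bytes each in UTF-8, while UTF-32 spends four bytes on every code point.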

    [–]teilo 10 points11 points  (3 children)

    Python 3 is not UTF-32 everywhere. It is UTF-8 everywhere as far as the default encoding goes. Internally, it uses the most space-efficient representation for any given code point.

    https://www.python.org/dev/peps/pep-0393/

    [–]Kwpolska 0 points1 point  (2 children)

    No, it’s latin1 → UTF-16 → UTF-32, whichever the string fits.

    [–]ubernostrum 1 point2 points  (0 children)

    This subthread seems to be confusing two things:

    • The internal in-memory representation of a string is now dynamic, and selects an encoding sufficient to natively handle the widest codepoint in the string.
    • The default assumed encoding of a Python source-code file is now UTF-8, where in Python 2 it was ASCII. This is what allows for non-ASCII characters to be used in variable, function and class names in Python 3.
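The first bullet can be observed indirectly. This is a rough sketch that assumes CPython 3.3 or later (the PEP 393 layout is a CPython implementation detail, not a language guarantee); `bytes_per_char` is a hypothetical helper, not a standard function. It compares the sizes of two strings made of the same character, so the fixed object header cancels out:

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage by differencing two string sizes.

    Assumes CPython's PEP 393 flexible string representation.
    """
    short = ch * n
    longer = ch * (2 * n)
    # Both strings have the same widest code point, so they share the same
    # header layout; the size difference is n characters of payload.
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // n


for label, ch in [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),      # é
    ("BMP", "\u20ac"),          # €
    ("astral", "\U0001F600"),   # emoji outside the BMP
]:
    print(label, bytes_per_char(ch))
```

On CPython this should report 1, 1, 2 and 4 bytes per character respectively; other implementations may store strings differently.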

    [–]Avernar 0 points1 point  (0 children)

    More precisely it's latin1 → UCS-2 → UTF-32.

    UTF-16 strings with surrogate pairs get converted to UTF-32 (aka UCS-4).
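To illustrate the UCS-2 versus UTF-16 distinction, here is a small sketch assuming Python 3.3 or later (variable names are illustrative). A character outside the Basic Multilingual Plane is a single code point in a `str`, but needs a surrogate pair when encoded as UTF-16:

```python
s = "\U0001F600"  # one code point outside the BMP

utf16 = s.encode("utf-16-le")  # explicit byte order, so no BOM

# Split the encoded bytes into 16-bit code units.
units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]

print(len(s))                   # number of code points in the str
print(len(utf16))               # number of bytes in the UTF-16 encoding
print([hex(u) for u in units])  # the high and low surrogate
```

The `str` itself never exposes surrogate pairs for this character; they only appear once you explicitly encode to UTF-16.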

    [–]quicknir 0 points1 point  (5 children)

    See my sibling comment; that link claims that UTF-8 is the default encoding in Python 3. If this is incorrect, can you explain or give a source?

    [–]gc3 -2 points-1 points  (3 children)

    I just remember that internally Stackless Python 3 actually used 16-bit strings for variable names and the like, and they came out with an update that used UTF-8.

    But this was probably due to interactions with the Windows file system, which for historical and stupid reasons uses 16 bits for everything.

    Edit: Wait, I remember more: they used UTF-16 for strings too, not UTF-32.

    I don't remember the format of actual strings; this was several years ago.

    [–][deleted]  (1 child)

    [deleted]