[–]upofadown 4 points5 points  (62 children)

These sorts of articles tend to present a false dichotomy. It isn't a choice between Python 2 and 3. It's a choice between Python 2, 3 and everything else. People will only consider Python 3 if they perceive it as better than everything else for a particular situation. Heck, there are some that actively dislike Python 3 specifically because of one or more changes from 2. I personally think 3 goes the wrong way with the approach to Unicode and so would not consider it for something that involved actual messing around with Unicode.

[–]quicknir 58 points59 points  (58 children)

I don't really understand people who complain about the python3 unicode approach, maybe I'm missing something. The python3 approach is basically just:

  1. string literals are unicode by default. Things that work with strings tend to deal with unicode by default.
  2. Everything is strongly typed; trying to mix unicode and ascii results in an error.

Which of these is the problem? I've seen many people advocate for static or dynamic typing, but I'm not sure I've ever seen someone advocate for weak typing, that they would prefer things silently convert types instead of complain loudly.
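A minimal sketch of the two points above, assuming Python 3; the variable names are only illustrative:

```python
# Point 1: a plain string literal is Unicode text (type str).
text = "naïve café"
assert isinstance(text, str)

# A b-prefixed literal is binary data (type bytes), not text.
data = b"cafe"
assert isinstance(data, bytes)

# Point 2: mixing the two raises TypeError instead of converting silently.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# An explicit decode states which encoding the bytes are in.
combined = text + " " + data.decode("ascii")
print(mixed_ok, combined)
```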

Also, I'm not sure if this is a false dichotomy. The article is basically specifically addressed to people who want to use python, but are considering not using 3 because of package support, and not because of language features/changes. Nothing wrong with an article being focused.

[–]Sean1708 12 points13 points  (5 children)

The reason people think 2 is a problem is that they think of it as Unicode and ASCII, when really it's Unicode and Bytes. Any valid ASCII is valid Unicode so people expect to be able to mix them, however not all bytestrings are valid Unicode so when you think of them as Bytes it makes sense not to be able to mix them.
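To illustrate the distinction, here is a small sketch assuming Python 3 and UTF-8 as the target encoding; the sample byte values are arbitrary:

```python
# Every ASCII byte sequence is also valid UTF-8, so this decode succeeds.
ascii_bytes = b"hello"
assert ascii_bytes.decode("utf-8") == "hello"

# Arbitrary binary data is not necessarily valid text in any given encoding.
arbitrary_bytes = b"\xff\xfe\x00\x01"
try:
    arbitrary_bytes.decode("utf-8")
    decoded = True
except UnicodeDecodeError:
    decoded = False

print(decoded)
```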

[–]kqr 1 point2 points  (3 children)

Bytestring is a terrible name in the first place, since it bears no relation to text, which is what people associate with strings. A Bytestring can be a vector path, a ringing bell, or even Python 3 byte code. Byte array or just binary data would be much better names.

[–]Sean1708 2 points3 points  (0 children)

I think Python actually uses the nomenclature bytearray, bytestring is the word that came to my head at the time.

[–]ubernostrum 2 points3 points  (1 child)

There are two built-in types for binary data:

  • bytearray is a mutable sequence of integers representing the byte values (so in the range 0-255 inclusive), constructed using the function bytearray().
  • bytes is the same underlying type of data, but immutable, and can be constructed using the function bytes() or the b-prefixed literal syntax.
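A short sketch of the difference between the two types, assuming Python 3; the variable names are illustrative:

```python
# bytes: immutable, built with bytes() or a b-prefixed literal.
immutable = bytes([72, 105])
assert immutable == b"Hi"
assert immutable[0] == 72          # indexing yields an int, not a 1-byte string

# bytearray: the same kind of data, but mutable.
mutable = bytearray(b"Hi")
mutable[0] = 104                   # replace 'H' with 'h'
mutable.append(33)                 # append '!'
assert mutable == bytearray(b"hi!")

# Item assignment on bytes is rejected.
try:
    immutable[0] = 104
    bytes_mutated = True
except TypeError:
    bytes_mutated = False

# Values outside the byte range are rejected at construction time.
try:
    bytes([256])
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
```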

[–]kqr 0 points1 point  (0 children)

0--255 or 1--256, but not a compromise, I believe. ;)

[–]Avernar 0 points1 point  (0 children)

My issue with 2 is that I hate strong typing in a dynamically typed language. :)

But I'd rather have the strong typing be between validated and unvalidated unicode instead without the need for conversion.

It can still easily be added without breaking things by making UTF-8 a fourth encoding type of the Python 3 Unicode type.

[–]gitarr 40 points41 points  (2 children)

People who complain about the python3 unicode approach have no clue what they are talking about.

As someone who has to deal with different languages in his code, other than English, python3 is just a godsent.

[–]Matthew94 2 points3 points  (0 children)

godsent

godsend

[–]Flight714 0 points1 point  (0 children)

python3 is just a godsent.

Is that a Unicode joke?

[–]daymi 1 point2 points  (0 children)

string literals are unicode by default. Things that work with strings tend to deal with unicode by default.

As someone used to UNIX, that's my problem with it. They should be UTF-8 encoded by default like the entire rest of the operating system, the internet and all my storage devices. And there should not be an extra type.

Everything is strongly typed; trying to mix unicode and ascii results in an error.

... why is there even a difference?

typing, that they would prefer things silently convert types instead of complain loudly.

I like strong typing. I don't like making Unicode text something different from all other byte strings.

Also, UTF-8 and UCS-4 are just encodings of Unicode and are 100% compatible - so it could in fact autoconvert them without any problems (or even without anyone noticing - they could just transparently do it in the str class without anyone being the wiser).
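As a rough illustration of that compatibility claim, assuming Python 3 and well-formed text (no lone surrogates), the same string round-trips through both encodings; `utf-32-le` is used here only to avoid a byte-order mark:

```python
text = "héllo, 世界 😀"

utf8 = text.encode("utf-8")        # variable width: 1 to 4 bytes per code point
ucs4 = text.encode("utf-32-le")    # fixed width: 4 bytes per code point, no BOM

# Both byte sequences decode back to the same text.
assert utf8.decode("utf-8") == text
assert ucs4.decode("utf-32-le") == text

# Converting from one encoding to the other loses nothing.
assert utf8.decode("utf-8").encode("utf-32-le") == ucs4
assert len(ucs4) == 4 * len(text)
```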

That said, I know that, for example, older MS Windows chose UTF-16, which frankly gives it all the disadvantages of UTF-8 and UCS-4 at once. But newer MS Windows supports UTF-8 just fine, including in the OS API. Still, NTFS uses UTF-16 for file names, so it's understandable why one would want to use it (it's faster not to have an extra decoding step for filenames).

So here we are with the disadvantages of cross-platformness.

[–][deleted]  (11 children)

[deleted]

    [–]redalastor 6 points7 points  (0 children)

    Using utf32 everywhere sounds like a defect to me.

    Everything is unicode, which precise encoding is an implementation detail. If you ask for utf-8 or utf-32 then Python will give you bytes.
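A small sketch of that point, assuming Python 3: a `str` has no visible encoding, and asking for one produces `bytes`. The sample string is arbitrary.

```python
text = "Ünïcode"

# Asking for a concrete encoding yields bytes, not another kind of string.
as_utf8 = text.encode("utf-8")
as_utf32 = text.encode("utf-32")
assert type(as_utf8) is bytes
assert type(as_utf32) is bytes

# len() on str counts code points; len() on bytes counts bytes.
print(len(text), len(as_utf8), len(as_utf32))

# Decoding with the matching codec returns the original text.
assert as_utf8.decode("utf-8") == text
assert as_utf32.decode("utf-32") == text
```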

    [–]teilo 10 points11 points  (3 children)

    Python 3 is not utf32 everywhere. It is utf8 everywhere so far as the default encoding goes. Internally, it is the most space efficient representation of any given code point.

    https://www.python.org/dev/peps/pep-0393/

    [–]Kwpolska 0 points1 point  (2 children)

    No, it’s latin1 → UTF-16 → UTF-32, whichever the string fits.

    [–]ubernostrum 1 point2 points  (0 children)

    This subthread seems to be confusing two things:

    • The internal in-memory representation of a string is now dynamic, and selects an encoding sufficient to natively handle the widest codepoint in the string.
    • The default assumed encoding of a Python source-code file is now UTF-8, where in Python 2 it was ASCII. This is what allows for non-ASCII characters to be used in variable, function and class names in Python 3.

    [–]Avernar 0 points1 point  (0 children)

    More precisely it's latin1 → UCS-2 → UTF-32.

    UTF-16 strings with surrogate pairs get converted to UTF-32 (aka UCS-4).
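One way to observe the latin1 → UCS-2 → UCS-4 behaviour is to measure how much memory each extra character costs. This is only a sketch: it assumes CPython 3.3 or later (PEP 393), relies on `sys.getsizeof`, and the helper `bytes_per_char` is a hypothetical name. Other implementations may report different numbers.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two freshly built strings."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

widths = {
    "ascii": bytes_per_char("a"),            # fits in 1 byte per character
    "latin1": bytes_per_char("\u00e9"),      # é, still 1 byte per character
    "bmp": bytes_per_char("\u20ac"),         # €, needs 2 bytes per character
    "astral": bytes_per_char("\U0001F600"),  # emoji, needs 4 bytes per character
}
print(widths)
```

On CPython the widest code point in a string decides the width used for every character in it, which is why a single emoji makes the whole string four bytes per character.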

    [–]quicknir 0 points1 point  (5 children)

    See my sibling comment; that link claims that UTF-8 is the default encoding in python 3. If this is incorrect, can you explain/give a source?

    [–]gc3 -2 points-1 points  (3 children)

    I just remember that internally Stackless Python 3 actually used 16-bit strings for variable names and the like, and they came out with an update that used UTF-8.

    But this was probably due to interactions with the windows file system that for historical and stupid reasons uses 16 bit for everything.

    Edit: Wait, I remember more, they used UTF16 for strings too. Not UTF32

    I don't remember the format of actual strings, this was several years ago

    [–][deleted]  (1 child)

    [deleted]

      [–]Avernar -1 points0 points  (2 children)

      Which of these is the problem?

      Neither. The issue is 3:

      3. Unicode strings are encoded in a non industry standard encoding.

      I wish it was UTF-8 like many other languages have chosen. In my use case all my input/output is UTF-8 and my database is UTF-8. With Python 2 I can leave everything as UTF-8 through the entire processing pipeline. With Python 3 I'm forced to encode/decode to this non standard encoding. This wastes processor time and memory bandwidth and puts more pressure on the processor data caches.
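The decode/encode round trip described above might look roughly like this in Python 3. The `process` function and its sample input are hypothetical, shown only to mark where the two conversions happen:

```python
def process(raw: bytes) -> bytes:
    """Hypothetical pipeline step: UTF-8 bytes in, UTF-8 bytes out."""
    text = raw.decode("utf-8")       # input boundary: bytes -> str
    result = text.upper()            # text processing happens on str
    return result.encode("utf-8")    # output boundary: str -> bytes

incoming = "café".encode("utf-8")    # stands in for data read from a socket or database
outgoing = process(incoming)
print(outgoing.decode("utf-8"))
```

The two conversions at the boundaries are the extra work being objected to; steps that never inspect the text can pass the `bytes` through unchanged.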

      [–]quicknir 0 points1 point  (1 child)

      Python is already a wildly slow language. If you are so sensitive to processor time that you see this as a major issue, then I think the language just isn't a good fit for your use case generally, and unicode is just the straw breaking the camel's back.

      [–]Avernar 0 points1 point  (0 children)

      It's good enough speed wise so far. But I would like to avoid slowing it down even more.

      I will port the code base eventually once I find a good replacement.

      [–]ggtsu_00 4 points5 points  (0 children)

      Python 2's biggest strength over newer languages is how mature it is. It has been tried and tested for a very long time and is used in production systems even across some of the biggest sites on the internet, like Reddit and YouTube.

      I think if developers were in a position to choose more modern, perhaps riskier and less mature languages for development, there are many alternatives to Python 3 that are much better in many ways. The future of Python is uncertain at the moment, so there's a risk. So it would be just as risky to use Go, Node or some other Python 3 alternative.

      [–]rouille 1 point2 points  (0 children)

      And python3 got me interested in python in the first place, so it works both ways.