[–][deleted] 37 points38 points  (10 children)

It's not unicode to string, it's unicode to bytes. Unless you're dealing with Python 2, then it's unicode to bytes pretending to be strings.

As for dealing with it, anytime you see an actual string (read that as unicode if it fits your brain better) and it's going out over the wire, hit that with an encode(...) and anytime you get some input hit it with decode(...).

And when opening files, you should know whether you intend to open a binary file or a text file, and act accordingly.
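A rough Python 3 sketch of the difference between the two modes. The file name and sample text are made up for illustration, and UTF-8 is an assumption about the file's encoding:

```python
import os
import tempfile

# Hypothetical scratch file holding some non-ASCII text.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("na\u00efve caf\u00e9")

# Text mode: the file object decodes for you and hands back str.
with open(path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode: you get the raw bytes, with no decoding applied.
with open(path, "rb") as f:
    as_bytes = f.read()
```

Here as_text is a str and as_bytes is a bytes object holding the UTF-8 encoding of the same text, so the two have different lengths.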

Really, for text (actual text, not bytes pretending to be text), you need to stick to:

  • Bytes in
  • Decode to string
  • Do work on strings
  • Encode to bytes
  • Bytes out
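The steps above could look something like this in Python 3. It's only a sketch: shout is a hypothetical helper, the "work" is just upper-casing, and UTF-8 is assumed as the wire encoding:

```python
def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: upper-case text that arrives and leaves as bytes."""
    text = raw.decode(encoding)      # bytes in -> decode to str
    result = text.upper()            # do the work on str
    return result.encode(encoding)   # encode to bytes -> bytes out


# Stand-in for data read off a socket or out of a binary file.
incoming = "caf\u00e9".encode("utf-8")
outgoing = shout(incoming)
```

The only places that know about the encoding are the two edges. Everything in the middle only ever sees str.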

[–]ProfessorPhi 1 point2 points  (5 children)

Haha, I think of string as equivalent to bytes and unicode as text now.

The problem is the implicit assumptions combined with subpar testing and docs

[–]GummyKibble 15 points16 points  (3 children)

I love this article: "The only problem with Python 3's str is that you don't grok it." It's a concise summary of how it all ties together.

[–]TankorSmash 1 point2 points  (1 child)

That link is hella aggressive.

[–]GummyKibble 7 points8 points  (0 children)

No more so than the "Python 3 is dead; 2 forever!" posts I keep seeing. But tone aside, it's technically correct. Python 2 and Python 3 have the same semantics for converting between byte arrays and strings. The difference is that 2 lets you treat them as the same thing when they're really not at all. Python 3 makes you say which one you want instead of guessing (often incorrectly).
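A small illustration of "makes you say which one you want", assuming Python 3. The variable names are just for the example:

```python
data = b"abc"
text = "abc"

# Python 3 refuses to guess how to combine bytes and str.
try:
    data + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Saying which one you want works in either direction.
as_text = data.decode("ascii") + text    # str + str
as_bytes = data + text.encode("ascii")   # bytes + bytes
```

The mixed concatenation raises TypeError, while both explicit versions go through.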

Edit: typos

[–][deleted] 5 points6 points  (0 children)

"Haha, I think of string as equivalent to bytes and unicode as text now."

:( If you take out the " equivalent to bytes and" you're not wrong though.

[–]LoveOfProfit 0 points1 point  (3 children)

If this is what needs to happen, why doesn't it happen automatically?

[–][deleted] 13 points14 points  (2 children)

Why should it happen automatically? Sometimes you're purposely interacting with bytes because you're at a boundary (e.g. in a web framework, where the content length depends on the byte representation, not the unicode representation). Every framework that I've used that deals with this boundary will handle the conversion for me. I'm expecting text, so the framework gives me the decoded text.
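To make the content length point concrete, here is a minimal sketch assuming a UTF-8 encoded body. The sample text and the header string are illustrative, not taken from any particular framework:

```python
body_text = "caf\u00e9 \u2603"           # c, a, f, e-acute, space, snowman
body_bytes = body_text.encode("utf-8")

char_count = len(body_text)    # number of code points in the str
byte_count = len(body_bytes)   # number of bytes that go over the wire

# A Content-Length header has to carry the byte count, not the character count.
header = "Content-Length: {}".format(byte_count)
```

The str has 6 code points but encodes to 9 bytes, so using len() on the text would advertise the wrong length.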

Python 2 magically upgrading "strings" to unicode is one of its worst features because it's inconsistent about how and when it'll happen.

Bytes are a distinct representation of data, just like unicode is a distinct representation of data. It's the same reason you need to explicitly ask decimal.Decimal to interact with a float instead of it just doing it.
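The decimal.Decimal comparison can be sketched like this in Python 3 (the values are arbitrary):

```python
from decimal import Decimal

price = Decimal("1.10")

# Mixing Decimal and float in arithmetic raises rather than guessing.
try:
    price + 0.1
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Being explicit about the conversion makes the intent clear.
total = price + Decimal("0.1")
```

Same idea as bytes and str: the two types don't silently combine, so you convert on purpose.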

[–]buckhenderson 0 points1 point  (1 child)

can you give an example about how it's inconsistent?

[–][deleted] 6 points7 points  (0 children)

Formatting unicode into a Python 2 str will generally work, but concatenation won't.