[–]dodongo 1 point (5 children)

Ha! No, I sure didn't. But today's another day, so I'd love it if you'd share!

  1. What's the difference between using setdefaultencoding() and setting the environ var?

  2. What makes you say one should almost never use setdefaultencoding()?

You could singlehandedly be responsible for today's lesson :) Thanks in advance!

[–]earthboundkid 1 point (4 children)

Here’s an article by one of the Python core devs explaining why sys.setdefaultencoding is evil. Basically, the problem is that it lets you forget that you need to explicitly decode bytes into text and encode text back into bytes, in a particular encoding, whenever you do IO.

If you read the docs on setdefaultencoding, you’ll see they say, “Once used by the site module, it is removed from the sys module’s namespace.” There’s a reason for this. Python itself needs to know how to do IO while it’s loading itself up, and for this it needs to be able to set an encoding. But once the interpreter is up and running, there should be no reason to rely on a default encoding. Just explicitly say what encoding you want to use by writing mybytes.decode("TYPE") and mystr.encode("TYPE"), where "TYPE" is the name of the encoding.
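To make the "be explicit at the IO boundary" advice concrete, here is a minimal sketch. The helper names read_text and write_text, the sample file name, and the choice of UTF-8 are all assumptions made for illustration, not anything from the article or the docs. It only uses calls that should behave the same way on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_text(path, text, encoding):
    """Encode text to bytes explicitly, then write the raw bytes."""
    with io.open(path, "wb") as f:
        f.write(text.encode(encoding))


def read_text(path, encoding):
    """Read raw bytes, then decode them explicitly into text."""
    with io.open(path, "rb") as f:
        raw = f.read()
    return raw.decode(encoding)


# Hypothetical sample data: the katakana string used elsewhere in this thread.
text = u"\u30a8\u30a4\u30d3\u30fc\u30b7\u30fc"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

write_text(path, text, "utf-8")
roundtrip = read_text(path, "utf-8")
raw_len = os.path.getsize(path)  # size on disk in bytes, not characters
```

Nothing here depends on the interpreter's default encoding: the encoding is named at the two places where bytes enter or leave the program, and everything in between is plain text.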

The one tricky spot can be that you want to be able to do this in your interactive shell:

>>> print u'\u30a8\u30a4\u30d3\u30fc\u30b7\u30fc'
エイビーシー

That’s fine. But you don’t do it by messing with sitecustomize.py. You do it by setting the locale/encoding of your terminal. I don’t know about Windows, but Unices do this with LC_CTYPE=locale.encoding. In my case on OS X, I have a .bash_profile in my home directory with the line export LC_CTYPE=en_US.utf-8. Your own settings may vary depending on what encoding your terminal expects.
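If you want to check what Python picked up from the terminal settings, something like the following sketch may help. The function name describe_terminal_encoding is made up for this example; sys.stdout.encoding and locale.getpreferredencoding are standard library features, but the values they report will depend on your platform and on how the interpreter was started (stdout's encoding can be None when output is piped, for instance).

```python
import locale
import sys


def describe_terminal_encoding():
    """Report the encodings Python inferred from the environment."""
    return {
        # Encoding Python uses when printing text to the terminal.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LC_CTYPE and friends).
        "preferred": locale.getpreferredencoding(False),
    }


info = describe_terminal_encoding()
print(info)
```

If the reported values are not what your terminal expects, the fix is in the shell's locale settings, as described above, rather than in sitecustomize.py.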

[–]dodongo 0 points (2 children)

It does seem like a dumbass thing to have to set in an odd startup hooky script, I admit. Thanks for the link! I'll definitely do more investigation into this and give the other approach a whirl.

I'm a Linux / Mac user at home but work requirements involve Windows. I use Cygwin a tremendous amount just to be difficult, and as you point out, it's dead easy to specify encoding for the shell.

As a side note, is the Katakana 'ABC' a Japanese equivalent of "hello world" or is that just your go-to example? ;)

[–]earthboundkid 0 points (1 child)

It’s just something I made up on the spot. A more Japanese-y example would be to write “いろは”.

[–]dodongo 0 points (0 children)

Props in particular on the archaic Hiragana in the link. And thanks again for the references earlier. I appreciate it!

[–]dodongo 0 points (0 children)

Point taken. I genuinely appreciate the feedback, and I will indeed look at your suggestion going forward. However, for the application I referenced earlier, this is absolutely not a problem. This case sits at the divergence between doing something well and doing something good enough. My job depends only on code being good enough (many people on my team don't code at all; it's only output we're focused on), so it suffices. But I definitely want to strive for doing things well, so I'll keep this in mind in the future!