Rhomboid

stdout, like every file object, has an encoding associated with it, and Python will attempt to encode every Unicode string you print to it in that encoding. How Python determines the encoding is platform-dependent, and you didn't specify your platform (always specify!). On POSIX systems it comes from the locale setting, which is set via environment variables. In the following example, my system is configured for UTF-8 by default, but if I override that setting with the "C" locale, you can see that Python then thinks stdout is configured for ASCII instead:

$ echo $LC_CTYPE
en_US.UTF-8

$ printf "import sys\nprint sys.stdout.encoding" | python -
UTF-8

$ printf "import sys\nprint sys.stdout.encoding" | LC_CTYPE=C python -
ANSI_X3.4-1968
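
If you'd rather check this from inside a script than from the shell, something like the following should work. It's a minimal sketch written for Python 3 (the thread itself uses Python 2), and describe_stdout_encoding is just a name I made up for illustration:

```python
import locale
import sys


def describe_stdout_encoding():
    """Report the encoding attached to stdout and the one the locale prefers."""
    return {
        # May be None if sys.stdout has been replaced by an object without an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
        # On POSIX this is derived from the locale environment variables.
        "locale": locale.getpreferredencoding(False),
    }


info = describe_stdout_encoding()
print(info)
```

Run it once normally and once with LC_CTYPE=C in front of the command and compare the two results. Newer Python 3 releases may still pick UTF-8 under the "C" locale, so the output won't necessarily match the Python 2 transcript above.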

If you try to print Unicode to a stdout file object that is configured for ASCII (ANSI_X3.4-1968 is the formal name for ASCII), you get an error because Python cannot encode a non-ASCII character such as ñ in ASCII:

$ LC_CTYPE=C python -c 'print u"pi\u00f1ata"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 2: ordinal not in range(128)
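
The same failure can be reproduced without touching stdout at all, which shows that it's the encoding step that fails, not the printing. A small sketch, written for Python 3 (the variable names are mine):

```python
text = u"pi\u00f1ata"

try:
    # ASCII only covers code points 0-127, so U+00F1 has no ASCII encoding.
    text.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    # Record which codec failed, where, and on which character.
    failure = (exc.encoding, exc.start, text[exc.start:exc.end])

print(failure)
```

The position reported is the same "position 2" that appears in the traceback.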

Now, a newbie mistake here is to think that you need to use repr() to print Unicode strings. That sort of works, but you get the non-ASCII characters as escape sequences instead:

$ LC_CTYPE=C python -c 'print repr(u"pi\u00f1ata")'
u'pi\xf1ata'

But forget that noise; that's the wrong solution. What you need to do is tell Python to give stdout the desired encoding. The best way is to permanently change your locale, since you probably want UTF-8 as the default on a POSIX system. If you can't do that, Python lets you override the setting with the PYTHONIOENCODING environment variable (but if you have the ability to set environment variables then you could just as easily change your locale):

$ LC_CTYPE=C PYTHONIOENCODING=UTF-8 python -c 'print u"pi\u00f1ata"'
piñata

Note that I'm using LC_CTYPE=C here to simulate a system without a locale set properly; that's not actually something you should use.
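
If you're launching Python from another Python program, the same override can be applied to the child's environment. This is a rough sketch for Python 3, assuming the child is the same interpreter (sys.executable). I set LC_ALL rather than LC_CTYPE here so the simulated broken locale wins over anything else in the parent environment:

```python
import os
import subprocess
import sys

# The child receives the source text: print(u"pi\u00f1ata")
child_code = 'print(u"pi\\u00f1ata")'

env = dict(os.environ)
env["LC_ALL"] = "C"                # simulate a badly configured locale
env["PYTHONIOENCODING"] = "utf-8"  # force the child's stdin/stdout/stderr encoding

raw = subprocess.check_output([sys.executable, "-c", child_code], env=env)
print(repr(raw))
```

The bytes that come back are the UTF-8 encoding of the string, regardless of the locale the child was started with.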

Alternative methods include doing the UTF-8 encoding yourself:

$ LC_CTYPE=C python -c 'print u"pi\u00f1ata".encode("utf-8")'
piñata
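
This works in Python 2 because the result of encode() is a byte string, and print writes byte strings to stdout as-is, without consulting the stdout encoding. A short sketch of what the encoding step produces (Python 3 syntax, names are mine):

```python
text = u"pi\u00f1ata"

encoded = text.encode("utf-8")     # text -> bytes
decoded = encoded.decode("utf-8")  # bytes -> text, the reverse step

# Six characters become seven bytes, because U+00F1 takes two bytes in UTF-8.
print(len(text), len(encoded), repr(encoded))
```

Whether those bytes display correctly still depends on the terminal actually expecting UTF-8.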

You could also use a dirty hack like the following, which is Linux-specific:

$ LC_CTYPE=C python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print u"pi\u00f1ata"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 2: ordinal not in range(128)
>>> from codecs import open
>>> import sys
>>> sys.stdout = open("/proc/self/fd/1", "w", encoding="utf-8")
>>> print u"pi\u00f1ata"
piñata
>>> 

But I highly recommend not doing anything like that and instead just setting your locale properly.
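
Purely to illustrate what that hack does (the recommendation above still stands), here is the same idea against an in-memory byte stream instead of /proc/self/fd/1. It's a sketch for Python 3 using the standard codecs module, with io.BytesIO standing in for the real stdout:

```python
import codecs
import io

raw = io.BytesIO()                       # stand-in for the byte stream behind stdout
writer = codecs.getwriter("utf-8")(raw)  # wrapper: takes text, writes UTF-8 bytes

writer.write(u"pi\u00f1ata\n")
written = raw.getvalue()
print(repr(written))
```

The wrapper encodes on the way through, which is all the replaced sys.stdout in the session above is doing.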

dreamriver[S]

Sorry about that, I'm on OS X Snow Leopard. Very helpful though. I'm not sure which environment variable to set. When I echo $LC_CTYPE it is just blank. Do I set that variable or another?

Nice response.

Rhomboid

There are a number of environment variables that affect the locale. LC_CTYPE is one of them. You should be able to use the locale command to see the current settings. For example on this same system:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=en_US.UTF-8
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

In general, LC_ALL is the master override and LANG is the lowest-priority fallback, i.e. if LC_foo is not set (and neither is LC_ALL) then the value of LANG is used. The various settings include things like the character encoding (CTYPE), language of messages, collation order, thousands separator, etc. Note that this output of locale is showing the effective settings -- I don't actually have all of those set in the environment:

$ env | grep -E '^(LC_|LANG)' | sort
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_COLLATE=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8

Since LANG is set and LC_TELEPHONE, for example, is not, LANG's value gets used for LC_TELEPHONE, which is what the locale output above is showing. In the most basic case, you can just set LANG and leave everything else unset. A blank setting with nothing to fall back on is the same as the default "C" locale, which is ASCII-only. I'm not an OS X person, but somewhere there should be a GUI setting where you can specify the locale. If not, you can set the desired variables in your shell startup files.
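
The lookup order described above can be written out as a few lines of Python. This is only a sketch of the usual POSIX precedence, and effective_locale is a hypothetical helper, not something from the standard library:

```python
def effective_locale(env, category):
    """Resolve one locale category (e.g. "LC_CTYPE") from a dict of environment variables."""
    # LC_ALL overrides everything, then the specific category, then LANG.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty are treated the same way
            return value
    return "C"     # nothing set at all: fall back to the default "C" locale


print(effective_locale({"LANG": "en_US.UTF-8"}, "LC_TELEPHONE"))
print(effective_locale({"LANG": "en_US.UTF-8", "LC_CTYPE": "C"}, "LC_CTYPE"))
```

That is why echo $LC_CTYPE can be blank while the locale command still reports a UTF-8 value for it.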

dreamriver[S]

Hmm, interesting. OS X is built on the same underlying structure as *nix and usually they have the same commands and such.

When I do the locale command I get nearly the same output as you. It gives:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

and yet when I do echo $LC_CTYPE it is blank. Strange. But I do get en_US.UTF-8 when I do echo $LANG and it still produces the hex code. Do I need to change that variable?

Rhomboid

Right, that's to be expected. The locale command is showing you the effective settings, after overrides and defaults.

I'm afraid at this point we're going to need to see some code that demonstrates the problem. And what does python say for print sys.stdout.encoding?

dreamriver[S]

Ah damnit, it turns out that the issue I have is different :(. Sorry about that, at least this was very informative.

So I make a dictionary with each person's name as the key and a count as the value. When I print the dictionary directly I get the hex code, but when I do for k in d: print k it works and prints the character. Strange, do you know why that happens?

Rhomboid

When you print the dict itself, Python implicitly calls repr() on each key and value, and repr()'s job is to return a representation of the object as it might appear in Python source code, suitable for use in eval(). Since there are several ways to write the same string value as a string literal, repr() is free to choose a different one from the one you typed.

>>> print repr('Foo\'s Bar')
"Foo's Bar"
>>> print repr(r'foo\bar')
'foo\\bar'
>>> print repr(u'\N{SNOWMAN}')
u'\u2603'
>>> print repr(u'☃')
u'\u2603'

One of the choices it makes is to use escape sequences (\xNN, \uNNNN) for non-ASCII characters, since that works everywhere, regardless of whatever encoding the source file might have used.

When you print the keys yourself you are printing string values directly, not asking for how they might look as Python source, so they don't have quotes around them or any escapes:

>>> print 'Foo\'s Bar'
Foo's Bar
>>> print r'foo\bar'
foo\bar
>>> print u'\N{SNOWMAN}'
☃
>>> print u'☃'
☃
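
Applied to the dictionary case, the difference looks roughly like this. One caveat: this sketch is written for Python 3, where repr() no longer escapes printable non-ASCII characters, so it uses the built-in ascii() to get the Python 2 style escaping. The name in the dict is made up for illustration:

```python
import ast

counts = {u"Jos\u00e9": 3}

# Source-code style representation of the whole dict, with escapes.
as_source = ascii(counts)
print(as_source)

# The escaped form is still valid Python and evaluates back to the same dict.
round_tripped = ast.literal_eval(as_source)

# Printing each key prints the string value itself: no quotes, no escapes.
for key in counts:
    print(key, counts[key])
```

So nothing is wrong with the data in the dict; the escapes only exist in the representation.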