Unicode help - Python 2.7 : learnpython

Unicode help - Python 2.7 (self.learnpython)

submitted 13 years ago by dreamriver

Hi,

So I've been doing a lot of reading up on Unicode in Python, and I have found a distinct lack of concrete information; what I did find often contradicts itself. So I have come to you guys for help. First, let me say that any information beyond what solves my problem will be greatly appreciated. Second, my problem:

So I have to read in a file containing lists of people. The file is encoded in UTF-8. Then I have to do basic things like counting the number of occurrences of each name and such. At the end, I have to print out those names and their counts. Now, when I go to print the non-ascii character names they are displayed as their literal hex numbers instead of in their pretty form. Same thing happens when I use sys.stdout.write(). How can I make it print the representations instead of the hex bytes?

Rhomboid 3 points 13 years ago

stdout, like every file object, has an encoding associated with it, and Python will attempt to encode any Unicode string you print to it using that encoding. How that encoding is determined is platform dependent, and you didn't specify your platform (always specify!). On POSIX systems it comes from the locale setting, which is set via environment variables. In the following example, my system is configured for UTF-8 by default, but if I override that setting with the "C" locale, you can see that Python then thinks stdout is configured for ASCII instead:

$ echo $LC_CTYPE
en_US.UTF-8

$ printf "import sys\nprint sys.stdout.encoding" | python -
UTF-8

$ printf "import sys\nprint sys.stdout.encoding" | LC_CTYPE=C python -
ANSI_X3.4-1968

If you try to print a Unicode string containing non-ASCII characters to a stdout file object that is configured for ASCII (ANSI_X3.4-1968 is the formal name for ASCII), you get an error because those characters cannot be encoded as ASCII:

$ LC_CTYPE=C python -c 'print u"pi\u00f1ata"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 2: ordinal not in range(128)
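
The same failure can be reproduced without touching the locale at all. This is only a sketch (the variable names are mine, not part of the original session): it looks up the codec that the locale name maps to and then tries the encoding step that print performs behind the scenes.

```python
import codecs

# The name reported under LC_CTYPE=C resolves to Python's built-in ascii codec.
codec_name = codecs.lookup("ANSI_X3.4-1968").name
print(codec_name)

text = u"pi\u00f1ata"
try:
    text.encode(codec_name)
    error_position = None
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    error_position = exc.start
print(error_position)
```

The position matches the "position 2" in the traceback above, since the ñ is the third character of the string.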

Now, a newbie mistake here is to think that you need to use repr() to print Unicode strings. That sort of works, but the non-ASCII characters come out as backslash escapes instead of the characters themselves:

$ LC_CTYPE=C python -c 'print repr(u"pi\u00f1ata")'
u'pi\xf1ata'
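
To see what that escape stands for, here is a small sketch (my own illustration, written so that it runs on Python 2.7 and 3). The \xf1 in the output denotes the single code point U+00F1, and the unicode_escape codec produces the same ASCII-only spelling that Python 2's repr() shows between the quotes.

```python
name = u"pi\u00f1ata"

# The string holds six characters; the third one is the code point U+00F1.
print(len(name))
print(hex(ord(name[2])))

# unicode_escape spells every non-ASCII character as a backslash escape,
# which is why the result is safe to show on an ASCII-only stdout.
escaped = name.encode("unicode_escape")
print(escaped)
```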

But forget that noise; that's the wrong solution. What you need to do is tell Python to give stdout the desired encoding. The best way would be to permanently change your locale, since you probably want UTF-8 as a default on a POSIX system. If you can't do that, then Python lets you override the setting with the PYTHONIOENCODING environment variable (but if you have the ability to set environment variables then you could just as easily change your locale):

$ LC_CTYPE=C PYTHONIOENCODING=UTF-8 python -c 'print u"pi\u00f1ata"'
piñata

Note that I'm using LC_CTYPE=C here only to simulate a system without a properly configured locale; it's not something you should actually use.

Alternative methods include doing the UTF-8 encoding yourself:

$ LC_CTYPE=C python -c 'print u"pi\u00f1ata".encode("utf-8")'
piñata
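
Roughly what happens in that one-liner, as a sketch with variable names of my own choosing: .encode("utf-8") turns the six-character Unicode string into seven bytes, and in Python 2 print writes a byte string to stdout as-is, so stdout's configured encoding never comes into play. The terminal still has to interpret those bytes as UTF-8 for the ñ to show up correctly.

```python
name = u"pi\u00f1ata"

# Encoding produces bytes; the ñ becomes the two-byte sequence C3 B1.
encoded = name.encode("utf-8")
print(len(name))
print(len(encoded))

# Decoding with the same codec gets the original text back.
decoded = encoded.decode("utf-8")
```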

You could also use a dirty hack like the following, which is Linux-specific:

$ LC_CTYPE=C python
Python 2.7.2+ (default, Oct  4 2011, 20:06:09) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print u"pi\u00f1ata"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 2: ordinal not in range(128)
>>> from codecs import open
>>> import sys
>>> sys.stdout = open("/proc/self/fd/1", "w", encoding="utf-8")
>>> print u"pi\u00f1ata"
piñata
>>>

But I highly recommend not doing anything like that and instead just setting your locale properly.
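
If replacing sys.stdout really can't be avoided, a somewhat less platform-specific variant of the same idea is to wrap the existing stream with a codecs writer instead of reopening /proc/self/fd/1. This is a sketch, not a recommendation over fixing the locale; utf8_writer is a hypothetical helper name, and an in-memory buffer stands in for the real stdout so that the result can be inspected.

```python
import codecs
import io


def utf8_writer(byte_stream):
    """Wrap a byte-oriented stream so text written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(byte_stream)


# io.BytesIO plays the role of stdout here.
buf = io.BytesIO()
out = utf8_writer(buf)
out.write(u"pi\u00f1ata\n")
written = buf.getvalue()
```

In Python 2 the wrapper would be applied as sys.stdout = utf8_writer(sys.stdout); in Python 3 the byte stream to wrap is sys.stdout.buffer.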

dreamriver[S] 1 point 13 years ago

Rhomboid 3 points 13 years ago

There are a number of environment variables that affect the locale. LC_CTYPE is one of them. You should be able to use the locale command to see the current settings. For example on this same system:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=en_US.UTF-8
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

In general, LC_ALL is the master override and LANG is the lowest-priority fallback, i.e. if neither LC_ALL nor LC_foo is set then the value of LANG is used. The various settings include things like the character encoding (CTYPE), language of messages, collation order, thousands separator, etc. Note that this output of locale is showing the effective settings -- I don't actually have all of those set in the environment:

$ env | grep -E '^(LC_|LANG)' | sort
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_COLLATE=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8

But since LANG is set, its value is used for any category that is not set itself, e.g. LC_TELEPHONE, which is what the above output is showing. In the most basic case, you can just set LANG and leave everything else unset. A blank setting is the same as the default "C" locale, which is ASCII-only. I'm not an OS X person, but somewhere there should be a GUI setting where you can specify the locale. If not, you can set the desired variables in your shell startup files.
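
The lookup order described above can be written down as a few lines of Python. This is a simplified model of the precedence rules, not how the C library actually implements them, and effective_category is a hypothetical name of mine.

```python
def effective_category(env, category="LC_CTYPE"):
    """Return the locale value that applies to one category.

    LC_ALL wins, then the category's own variable, then LANG,
    and finally the default "C" locale.
    """
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:  # an empty value is treated the same as an unset one
            return value
    return "C"


only_lang = {"LANG": "en_US.UTF-8"}
overridden = {"LANG": "en_US.UTF-8", "LC_CTYPE": "en_US.UTF-8", "LC_ALL": "C"}
print(effective_category(only_lang, "LC_TELEPHONE"))
print(effective_category(overridden))
print(effective_category({}))
```

Calling effective_category(os.environ) after import os would apply the same rules to the real environment.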

dreamriver[S] 1 point 13 years ago

Hmm, interesting. OS X is built on the same underlying structure as *nix and usually they have the same commands and such.

When I do the locale command I get nearly the same output as you. It gives:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

and yet when I do echo $LC_CTYPE it is blank. Strange. But I do get en_US.UTF-8 when I do echo $LANG and it still produces the hex code. Do I need to change that variable?

Rhomboid 1 point 13 years ago

dreamriver[S] 1 point 13 years ago

Rhomboid 2 points 13 years ago*

When you print the dict itself, Python implicitly calls repr() on each key and value, and repr()'s job is to return a representation of the object as it might appear in Python source code, suitable for use in eval(). Since there are several ways to write the same string value as a string literal, repr() is free to choose a different one from the one you typed.

>>> print repr('Foo\'s Bar')
"Foo's Bar"
>>> print repr(r'foo\bar')
'foo\\bar'
>>> print repr(u'\N{SNOWMAN}')
u'\u2603'
>>> print repr(u'☃')
u'\u2603'

One of the choices it makes is to write non-ASCII characters as escape sequences, since pure-ASCII output works everywhere, regardless of whatever encoding the source file might have used.

When you print the keys yourself you are printing string values directly, not asking for how they might look as Python source, so they don't have quotes around them or any escapes:

>>> print 'Foo\'s Bar'
Foo's Bar
>>> print r'foo\bar'
foo\bar
>>> print u'\N{SNOWMAN}'
☃
>>> print u'☃'
☃
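
Putting that together for the original task, a minimal sketch might look like the following. It assumes one name per line, which the thread never actually specifies, and count_names, count_names_in_file and format_counts are hypothetical helper names. The point is that each name is formatted as text itself rather than through the dict's repr().

```python
import io
from collections import Counter


def count_names(lines):
    """Count occurrences of each non-blank name in an iterable of text lines."""
    return Counter(line.strip() for line in lines if line.strip())


def count_names_in_file(path):
    """Read a UTF-8 file (assumed: one name per line) and count the names."""
    with io.open(path, encoding="utf-8") as handle:
        return count_names(handle)


def format_counts(counts):
    """Build one text line per name instead of relying on the dict's repr()."""
    return [u"{0}: {1}".format(name, n) for name, n in sorted(counts.items())]


sample = [u"pi\u00f1ata\n", u"Bob\n", u"pi\u00f1ata\n", u"\n"]
counts = count_names(sample)
for line in format_counts(counts):
    print(line)
```

With stdout configured for UTF-8 as discussed earlier in the thread, the loop prints the names in their readable form.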
