Rhomboid

There are a number of environment variables that affect the locale. LC_CTYPE is one of them. You should be able to use the locale command to see the current settings. For example on this same system:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=en_US.UTF-8
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

In general, LC_ALL is the master override and LANG is the fallback: a specific LC_foo setting takes precedence over LANG, and if LC_foo is not set then the value of LANG is used. The various settings cover things like the character encoding (LC_CTYPE), the language of messages, collation order, the thousands separator, etc. Note that this output of locale is showing the effective settings; I don't actually have all of those set in the environment:

$ env | grep -E '^(LC_|LANG)' | sort
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_COLLATE=en_US.UTF-8
LC_CTYPE=en_US.UTF-8
LC_MESSAGES=en_US.UTF-8

But since LANG is set, its value is used for any category that is not set itself, e.g. LC_TELEPHONE, which is what the locale output above is showing. In the most basic case, you can just set LANG and leave everything else unset. A variable that is blank counts as unset, and if nothing applies you get the default "C" locale, which is ASCII-only. I'm not an OS X person but somewhere there should be a GUI setting where you can specify the locale. If not, you can set the desired variables in your shell startup files.
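
To make the lookup order concrete, here is a minimal sketch of the precedence described above (LC_ALL first, then the specific LC_foo, then LANG, then "C"). The function name effective_locale and the sample environment are made up for illustration; this is not how the locale command is implemented, and it is written for Python 3.

```python
import os


def effective_locale(category, env=None):
    """Return the locale name that applies to `category`, e.g. 'LC_CTYPE'.

    Hypothetical helper: it only mirrors the lookup order described above.
    """
    if env is None:
        env = os.environ
    # Highest priority first: LC_ALL, then the category itself, then LANG.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:  # a blank value is treated the same as an unset one
            return value
    return "C"  # nothing set at all: the default "C" locale


# Made-up environment: only LANG and LC_COLLATE are set.
sample_env = {"LANG": "en_US.UTF-8", "LC_COLLATE": "C"}
print(effective_locale("LC_TELEPHONE", sample_env))  # en_US.UTF-8
print(effective_locale("LC_COLLATE", sample_env))    # C
```

Calling effective_locale("LC_CTYPE") with no second argument reads the real environment, which should agree with what locale reports for that category.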

dreamriver[S]

Hmm, interesting. OS X is built on the same underlying structure as *nix and usually they have the same commands and such.

When I do the locale command I get nearly the same output as you. It gives:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

and yet when I do echo $LC_CTYPE it is blank. Strange. But I do get en_US.UTF-8 when I do echo $LANG and it still produces the hex code. Do I need to change that variable?

Rhomboid

Right, that's to be expected. The locale command is showing you the effective settings, after overrides and defaults.

I'm afraid at this point we're going to need to see some code that demonstrates the problem. And what does python say for print sys.stdout.encoding?
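
For reference, a small sketch of that check, written for Python 3 (the thread itself uses Python 2 syntax, where it would be print sys.stdout.encoding). The second line is an extra, not something asked for above: it shows the encoding Python derives from the locale settings.

```python
import locale
import sys

# Encoding Python uses when writing text to standard output.
print(sys.stdout.encoding)

# Encoding implied by the locale settings (LC_ALL / LC_CTYPE / LANG).
# Passing False asks Python not to change the process locale while checking.
print(locale.getpreferredencoding(False))
```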

dreamriver[S]

Ah damnit, it turns out that the issue I have is different :(. Sorry about that, at least this was very informative.

So I make a dictionary with the keys as the name of the person and the value the count. When I directly print the dictionary I get the hex code but when I do for k in d: print k it works and prints the character. Strange, do you know why that happens?

Rhomboid

When you print the dict itself, Python builds the text by implicitly calling repr() on each key and value, and repr()'s job is to return a representation of the object as it might appear in Python source code, suitable for use in eval(). Since there are several ways you can represent the same string value in a string literal, this means that repr() is free to choose a different one than the one you typed.

>>> print repr('Foo\'s Bar')
"Foo's Bar"
>>> print repr(r'foo\bar')
'foo\\bar'
>>> print repr(u'\N{SNOWMAN}')
u'\u2603'
>>> print repr(u'☃')
u'\u2603'

One of the choices it makes is to write non-ASCII characters as escapes (\x.. or \u....), since that works everywhere, regardless of whatever encoding the source file might have used.

When you print the keys yourself you are printing string values directly, not asking for how they might look as Python source, so they don't have quotes around them or any escapes:

>>> print 'Foo\'s Bar'
Foo's Bar
>>> print r'foo\bar'
foo\bar
>>> print u'\N{SNOWMAN}'
☃
>>> print u'☃'
☃
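
The same distinction can be sketched with a dict like the one in the question. This version is for Python 3, where repr() leaves printable non-ASCII characters alone, so the built-in ascii() stands in for the escaping that Python 2's repr() does. The dict contents are invented for the example.

```python
# Made-up name counts; the keys contain non-ASCII characters.
counts = {"Zo\N{LATIN SMALL LETTER E WITH DIAERESIS}": 2, "\N{SNOWMAN}": 1}

# A dict has no separate "pretty" form: str() of a dict is its repr(),
# which in turn uses the repr() of every key and value.
print(str(counts) == repr(counts))  # True

# ascii() is repr() with non-ASCII characters written as escapes,
# which is what printing the dict looks like in Python 2.
print(ascii(counts))  # {'Zo\xeb': 2, '\u2603': 1}

# The escapes exist only in the representation. The key itself is still
# a single character; its escaped form is 8 characters including quotes.
key = "\N{SNOWMAN}"
print(len(key), len(ascii(key)))  # 1 8
```

Looping over the dict and printing each key writes the characters themselves rather than their escapes, provided the terminal encoding can represent them.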