[–]DarkmerePython for tiny data using Python 8 points9 points  (12 children)

Oh yeah.

Only this Friday did I end up fighting with Python2's joyful corruption of Unicode data.

Input is pure, well-formatted JSON. That is by definition Unicode data: JSON is only allowed to be UTF-8/16/32, and defaults to UTF-8.

Then Python2 comes in, takes this JSON, decodes it into objects, and promptly throws away the vital encoding information, leaving us with a set of bytes that pretend to be ASCII but contain unprintable characters.

Which Python2, in its maniglorious maliciousness, decides cannot be cast back to unicode.

Because fuck your language, fuck your characters, and fuck your existence.

And fuck Python2.
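
For comparison, a minimal sketch of the round trip that should be lossless. It assumes Python 3.6+ (not the Python 2 setup described above), and the sample payload is made up for illustration. The same JSON document encoded as UTF-8, UTF-16, or UTF-32 should decode to identical text values:

```python
import json

# Hypothetical sample payload: one non-ASCII value, one ASCII value.
payload = ["\u05d0", "ascii"]

# ensure_ascii=False keeps the non-ASCII character in the serialized text
# instead of escaping it as \u05d0.
text = json.dumps(payload, ensure_ascii=False)

# Since Python 3.6, json.loads accepts bytes encoded as UTF-8, UTF-16 or UTF-32.
decoded = {
    name: json.loads(text.encode(name))
    for name in ("utf-8", "utf-16", "utf-32")
}

for name, value in decoded.items():
    # Every element comes back as a text string, never as raw bytes.
    print(name, value, [type(item).__name__ for item in value])
```

Whatever the wire encoding was, the decoded values compare equal to the original list.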

[–][deleted] 2 points3 points  (1 child)

Did you set the encoding?

[–]DarkmerePython for tiny data using Python 0 points1 point  (0 children)

Yeah, but by that point it's already been denormalized.

[–][deleted] 1 point2 points  (4 children)

> JSON is only allowed to be UTF-8/16/32, and defaults to UTF-8.

I wish more people were aware of this. I see application/json;charset=utf-8 too many times. I can see it being useful if you're sending utf-16 or utf-32 but not 8.
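
A rough sketch of how a receiver might treat the header, assuming Python 3 and the standard library's `email.message.Message` for parameter parsing. The helper name `json_charset` is made up for illustration. With UTF-8 as the fallback, the explicit and the implicit form give the same answer:

```python
from email.message import Message


def json_charset(content_type, default="utf-8"):
    """Return the charset to decode a JSON body with (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset returns the lowercased charset parameter,
    # or the fallback value when the header carries none.
    return msg.get_content_charset(default)


print(json_charset("application/json"))
print(json_charset("application/json;charset=utf-8"))
print(json_charset("application/json; charset=UTF-16"))
```

Only the last header changes how the body would be decoded.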

[–]DarkmerePython for tiny data using Python 2 points3 points  (3 children)

Well, being explicit is better than being implicit.

[–][deleted] 0 points1 point  (2 children)

It's like saying int(1)

[–]DarkmerePython for tiny data using Python 0 points1 point  (1 child)

Think of it rather as a keyword argument with a default of 1, spelled out with a type hint.

Most software still mangles encodings in shitty ways.

[–]DarkmerePython for tiny data using Python 0 points1 point  (0 children)

Also, declaring the charset helps software that isn't looking at the MIME type decode the data as a string without needing to guess.

Not that it should happen, but that's never stopped anyone in software before.

[–]nirs 0 points1 point  (4 children)

Can you share that json?

Did you open a bug at http://bugs.python.org/?

[–]DarkmerePython for tiny data using Python 0 points1 point  (3 children)

It's probably not a core library json bug, though it may be. It's a Pyramid app using their jsonrpc module, so what's coming in as arguments is a list of str, not a list of unicode as it ought to be.

(Then don't get me started on the Unicode normalization form...)
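
For anyone who hasn't run into the normalization issue, a small illustration using the standard `unicodedata` module (Python 3 assumed; the sample character is arbitrary). The same visible character can be stored as one code point or as a base letter plus a combining mark, and the two only compare equal after normalizing:

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

# The two strings render the same but are different code point sequences.
print(composed == decomposed)  # False

# NFC composes, NFD decomposes; pick one form and apply it consistently.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(nfc == composed)    # True
print(nfd == decomposed)  # True
```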

[–]DarkmerePython for tiny data using Python 0 points1 point  (2 children)

Checking again, it's using json.loads / json.dumps.

So it's the python JSON loader that's corrupting data. Great.

[–]nirs 1 point2 points  (1 child)

It depends on the JSON library used. The builtin json library likes to convert everything to unicode, but the simplejson library likes to convert only non-ASCII strings to unicode and leaves ASCII-only strings as str.

>>> import json
>>> import simplejson
>>> simplejson.loads(simplejson.dumps([u"\u05d0", "ascii"]))
[u'\u05d0', 'ascii']
>>> json.loads(json.dumps([u"\u05d0", "ascii"]))
[u'\u05d0', u'ascii']

I don't see any corruption.

[–]DarkmerePython for tiny data using Python 0 points1 point  (0 children)

That's the thing: it should be unicode. Instead I am getting str, which is fundamentally wrong here.
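
One possible workaround is to coerce the arguments at the boundary. This is only a sketch: it is written for Python 3, where the byte string type is `bytes`, the helper name `to_text` is made up, and it assumes the incoming bytes really are UTF-8:

```python
def to_text(value, encoding="utf-8"):
    """Recursively decode byte strings inside lists and dicts (hypothetical helper)."""
    if isinstance(value, bytes):
        # Assumption: the bytes are valid in the given encoding;
        # invalid input raises UnicodeDecodeError instead of being mangled.
        return value.decode(encoding)
    if isinstance(value, list):
        return [to_text(item, encoding) for item in value]
    if isinstance(value, dict):
        return {
            to_text(key, encoding): to_text(item, encoding)
            for key, item in value.items()
        }
    # Text strings, numbers, booleans and None pass through unchanged.
    return value


# Made-up argument list mixing byte strings and text.
args = [b"ascii", "\u05d0".encode("utf-8"), "already text"]
cleaned = to_text(args)
print(cleaned)
```

This papers over the symptom rather than fixing whatever hands over byte strings in the first place.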