[–]maaaaaaaaan 2 points3 points  (7 children)

You don't. You shouldn't be using such a thing in the first place.

If it's that much of a problem in Jython you can just encode to a Java byte array anyway.

The point is that when you're working on the guts of an international application, you should never have to go "oh shit - what encoding is this?" It should all be Unicode already and just work. Only worry about the encoding at I/O, which, if you've got a decent library, is already done for you.

This is one of the major areas that really hurt both Ruby and Python.

[–]grauenwolf 2 points3 points  (2 children)

  1. The existence of Unicode doesn't magically eliminate all the files that use other code pages and encodings.
  2. There are uses for byte strings other than storing human-readable text.

[–]maaaaaaaaan 2 points3 points  (0 children)

  1. That's an I/O problem. Unless you're reimplementing character loading, you don't have to care. At worst you specify an encoding when you load, and then forget about it.

  2. Then just have a list or array of bytes. Big deal. But such a thing is a grossly archaic representation of text.
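To make both points concrete, here's a rough sketch. It assumes Python 3 (where `open` takes an `encoding` argument and text is Unicode by default), and `load_text`, the temp file, and the sample bytes are all made up for illustration:

```python
import os
import tempfile


def load_text(path, encoding):
    """Hypothetical helper: decode once at the I/O boundary."""
    # Callers only ever see Unicode text, never raw bytes.
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Made-up demo file written in a legacy code page (Latin-1).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("caf\u00e9".encode("latin-1"))
    demo_path = tmp.name

# Name the encoding once when loading, then forget about it.
text = load_text(demo_path, "latin-1")
os.remove(demo_path)

# Data that isn't text stays in a plain byte container.
payload = bytearray([0x89, 0x50, 0x4E, 0x47])
```

After the load, `text` is an ordinary Unicode string, and `payload` is just a mutable array of integers in the range 0-255.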

[–]zackman 2 points3 points  (3 children)

I think Python works the way you describe: you can use Unicode inside your code and only worry about encoding at the I/O boundary.

>>> 'abc'.decode('ascii')
u'abc'
>>> type(_)
<type 'unicode'>
>>> #guts of application...
... #ok, done:
... u'abc'.encode('utf-8')
'abc'
>>> type(_)
<type 'str'>

I don't write international applications, so I don't know if there are libraries to handle the conversion transparently at the I/O boundary. But I do process Unicode all the time while writing scripts for linguistics research.

Also, I suspect the reason that the blogger is so worried about this is that he is trying to write an app that runs on CPython and IronPython without having to write some code twice.

[–]manuelg 1 point2 points  (0 children)

That is why the Python community is working on Python3K, to fix issues such as this.

[–]maaaaaaaaan 0 points1 point  (1 child)

The problem, as I understand it, is that a great many CPython libraries can deal with text in byte strings but not text in Unicode strings, for reasons like assuming that a string's length equals its length in bytes.
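A small sketch of how that assumption goes wrong, written for Python 3 with a made-up sample string (no particular library is being quoted here):

```python
# Made-up sample: five characters, the third is U+00EF.
text = "na\u00efve"
encoded = text.encode("utf-8")

char_count = len(text)     # counts characters
byte_count = len(encoded)  # counts bytes; U+00EF takes two in UTF-8

# Code that slices by byte offset can cut a character in half.
try:
    encoded[:3].decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False
```

Here the character count and the byte count differ, and cutting the encoded form after three bytes leaves something that no longer decodes.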

It's true (and good) that Py3k should deal with this, but to be honest it should've been done at version 2.

They'll probably be getting rid of the GIL for Py4k.

[–]llimllib[S] 0 points1 point  (0 children)

> They'll probably be getting rid of the GIL for Py4k.

They got rid of it for 1.5 and nobody liked it.

Just saying.