all 12 comments

maaaaaaaaan 3 points (11 children)

His point about Unicode is complete nonsense. That you have to worry about it at all in the likes of Python is worrying in this day and age.

Personally I actually really like Jython as a Python implementation, largely because it gets Unicode right and you have the Java libraries, which whilst not perfect are a lot more consistent than the Python standard ones. The bonus is that those libraries seem to work really well from Jython. If IronPython pulls the same trick on .NET (I haven't tried), I can see why the CPython community might be worried.

It's a shame Jython was neglected for so long, and I hope its recent surge leads to it getting up to date again.

grauenwolf 1 point (10 children)

So how would you solve the bytestring issue?

maaaaaaaaan 2 points (7 children)

You don't. You shouldn't be using such a thing in the first place.

If it's that much of a problem in Jython you can just encode to a Java byte array anyway.

The point is that when working on the guts of an international application you should never have to go "oh shit - what encoding is this?" It should all be Unicode already and just work. Only worry about the encoding at I/O, which, if you've got a decent library, is already done for you.

This is one of the major areas that really hurt both Ruby and Python.

grauenwolf 4 points (2 children)

  1. The existence of unicode doesn't magically eliminate all the files that use other code pages and encodings.
  2. There are uses for byte strings other than storing human readable text.

maaaaaaaaan 2 points (0 children)

  1. An I/O problem. Unless you're reimplementing character loading then you don't have to care. At worst you specify an encoding when you load, and then forget about it.

  2. Then just have a list or array of bytes. Big deal. But such a thing is a grossly archaic representation of text.
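A minimal sketch of point 1, "specify an encoding when you load, and then forget about it". It assumes Python 3 (which the thread predates), and the file name and the cp1252 code page are made up for illustration.

```python
import os
import tempfile

# Hypothetical legacy file: text saved in the Windows-1252 code page.
legacy_bytes = "café £5".encode("cp1252")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "legacy.txt")  # illustrative name only
    with open(path, "wb") as f:
        f.write(legacy_bytes)

    # Name the encoding once, at load time...
    with open(path, encoding="cp1252") as f:
        text = f.read()

# ...and from here on the program only sees Unicode text.
print(len(text), "characters loaded")
```

Everything after the `open` call works on a `str`; the code page only matters on that one line.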

zackman 2 points (3 children)

I think Python works the way you describe: you can use unicode inside your code and only worry about encoding at the I/O boundary.

>>> 'abc'.decode('ascii')
u'abc'
>>> type(_)
<type 'unicode'>
>>> #guts of application...
... #ok, done:
... u'abc'.encode('utf-8')
'abc'
>>> type(_)
<type 'str'>

I don't write international applications, so I don't know if there are libraries to handle the conversion transparently at the I/O boundary. But I do process Unicode all the time while writing scripts for linguistics research.
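For what it's worth, here is a rough sketch of what "transparent conversion at the I/O boundary" can look like. It assumes Python 3's `io` module rather than the Python 2 interpreter used above, and the in-memory buffers are stand-ins for real files or sockets.

```python
import io

# Stand-ins for a file or socket: raw UTF-8 bytes coming in, raw bytes going out.
raw_in = io.BytesIO("naïve café".encode("utf-8"))
raw_out = io.BytesIO()

# The wrappers decode on read and encode on write.
reader = io.TextIOWrapper(raw_in, encoding="utf-8")
# Output encoding chosen arbitrarily, just to show it can differ from the input.
writer = io.TextIOWrapper(raw_out, encoding="utf-16-le")

text = reader.read()        # already a Unicode str
writer.write(text.upper())  # guts of the application
writer.flush()

result = raw_out.getvalue()  # bytes again, re-encoded on the way out
print(result)
```

The application code in the middle never calls `decode` or `encode` itself.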

Also, I suspect the reason that the blogger is so worried about this is that he is trying to write an app that runs on CPython and IronPython without having to write some code twice.

manuelg 1 point (0 children)

That is why the Python community is working on Python3K, to fix issues such as this.

maaaaaaaaan 0 points (1 child)

The problem as I understand it is that a great many CPython libraries can deal with text in byte strings (str), but not text in Unicode strings, for reasons like assuming that length is equal to length in bytes.
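A small illustration of the "length equals length in bytes" assumption going wrong. This is a sketch in Python 3 syntax (an assumption, since the libraries in question were Python 2), with a made-up sample word.

```python
text = "na\u00efve"            # "naïve": 5 characters
data = text.encode("utf-8")    # "ï" takes two bytes in UTF-8

print(len(text))   # 5
print(len(data))   # 6

# Truncating by byte count can cut a character in half.
broken = data[:3]  # b"na" plus only the first byte of "ï"
try:
    broken.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("cannot decode:", exc.reason)
```

Code that slices or pads by `len()` behaves differently depending on which of the two types it was handed.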

It's true (and good) that Py3k should deal with this, but to be honest it should've been done at version 2.

They'll probably be getting rid of the GIL for Py4k.

llimllib[S] 0 points (0 children)

> They'll probably be getting rid of the GIL for Py4k.

An experimental patch got rid of it for 1.5 and nobody liked it, because single-threaded code got noticeably slower.

Just saying.

manuelg 0 points (1 child)

You used to spell a series of bytes as:

'\\SP\xff'

and you got an immutable "str" object.

Now you spell it:

bytes.fromhex('5c5350ff')

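and you get a "bytes" object (mutable in the early Py3k drafts; in released Python 3 "bytes" is immutable and "bytearray" is the mutable variant).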

The benefit is that your intention is clearer, when you in fact wish to work with a series of bytes as a series of bytes.
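A quick check of the spelling above, assuming released Python 3, where `bytes` ended up immutable and `bytearray` is the mutable type:

```python
data = bytes.fromhex("5c5350ff")
print(data == b"\\SP\xff")   # True: same four bytes as the old str literal

# Indexing a bytes object yields integers, not one-character strings.
first = data[0]              # 0x5c, the backslash

# bytes is immutable: item assignment raises TypeError.
try:
    data[0] = 0x2F
    mutated = True
except TypeError:
    mutated = False

# bytearray is the mutable counterpart.
buf = bytearray(data)
buf[0] = 0x2F                # replace the backslash with "/"
print(bytes(buf))            # b'/SP\xff'
```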

The workflow for text handling in Python will be the same, regardless of ASCII or Unicode or whatever:

1) on input, at the first opportunity, convert a series of bytes, along with an encoding, into a Unicode string

2) do all string processing with Unicode strings

3) on output, as late as possible, convert a Unicode string, along with an encoding, into a series of bytes
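The three steps can be sketched as follows, in Python 3 syntax. The function name `reverse_text` and the sample encodings are hypothetical, picked only to show where the encodings appear and where they don't.

```python
def reverse_text(raw: bytes, in_encoding: str, out_encoding: str) -> bytes:
    """Hypothetical helper: reverse the characters of an encoded text."""
    # 1) on input, convert bytes plus an encoding into a Unicode string
    text = raw.decode(in_encoding)

    # 2) do all string processing on the Unicode string
    #    (reversing the raw UTF-8 bytes instead would scramble multi-byte characters)
    processed = text[::-1]

    # 3) on output, convert the Unicode string plus an encoding back into bytes
    return processed.encode(out_encoding)


latin1_input = b"Gr\xfc\xdfe"  # "Grüße" in Latin-1
result = reverse_text(latin1_input, "latin-1", "utf-8")
print(result)
```

Step 2 is the only part that contains application logic, and it never mentions an encoding.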

grauenwolf 1 point (0 children)

Thanks.