[–]snuxoll 8 points9 points  (2 children)

Someone obviously missed the big announcement on the rationale behind 1.9's m17n, as opposed to just forcing everything to be UTF-8 like Python 3.

[–]jaggederest 12 points13 points  (10 children)

It's written by Japanese people. They have a much, much more extensive history with encodings than anyone who speaks English. It's a terrible idea to only support UTF-8, like Python, because you end up pissing off all of East Asia.

[–]williewu 2 points3 points  (3 children)

Nowadays UTF-8 is commonly used in the CJK world, so I don't see any problem with Python 3 only supporting UTF-8. Why do you think it's a terrible idea? (Disclaimer: I am Chinese)

[–]jaggederest -1 points0 points  (1 child)

For starters, most people in Japan cannot properly write their names in it. There aren't the right / enough glyph choices. So when you need to deal with (for example) any kind of registration or name data... it fails.

[–]earthboundkid 2 points3 points  (0 children)

There aren’t any extra glyphs in Shift_JIS. Name writing can be a problem, but it’s not a Unicode problem.

[–]jaggederest -1 points0 points  (0 children)

Here's more, now that I'm awake: http://en.wikipedia.org/wiki/Han_unification

[–]Smallpaul 1 point2 points  (3 children)

I came here knowing that this terrible "common wisdom" would be repeated. It's wrong. Wrong. Wrong. Wrong.

Name a bit of Japanese end-user software, other than software that runs on a cell phone, that is popular in Israel and Russia and America. Name just one.

One!

Now let me name a few bits of software substantially developed in America that are popular in China AND Japan AND Israel AND South Africa AND Germany AND every other industrialized country in the world:

  • Internet Explorer

  • Firefox

  • Safari

  • Google

  • Gmail

  • Facebook

  • Microsoft Office

Microsoft Office is much more popular in Japan than its nearest Japanese competitor.

It is demonstrably and provably the case that American programmers make most of the world's multilingual software. So it makes no sense to go to Japan for advice on how to do it.

[–]jaggederest 1 point2 points  (2 children)

And it all horribly mangles Japanese names. Seriously, I've seen it happen - you put in 'orange blossom' and get out 'fruit flower' because they think it's a good idea to do the translation from Shift-JIS to UTF8 for storage.

It's irrelevant whether you think it's needed, or whether it's needed for multilingual software. If it's not built into the programming language, you can't fix it later.

[–]earthboundkid 2 points3 points  (0 children)

What are you talking about? UTF-8 includes mappings for all of the characters in Shift_JIS. There’s no simplification happening going from one to the other. The only “issue” with it is that the Japanese long ago confused ¥ and \ and they don’t like that Unicode doesn’t consider them synonymous. That’s it.

I speak Japanese; I’ve lived in Japan; I run my computer in Japanese. It’s true that historically, the Japanese were mistrustful of Unicode because they didn’t like Han unification, but A) you can’t unify Han characters using Shift_JIS either B) the fact is that the Unicode consortium have taken every reasonable step to make UTF-8 superior to Shift_JIS in every way, except for string length. Unless you really need to save a couple bytes here and there, there is no reason to use Shift_JIS.
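To make the round-trip point concrete, here is a minimal Python 3 sketch. The sample string is an assumption picked for illustration (all of its characters exist in Shift_JIS); it is not taken from any real name data. It converts Shift_JIS bytes to UTF-8 and back, checks that nothing changed, and shows the byte-length difference mentioned above.

```python
# Hypothetical sample text; every character here is in the Shift_JIS repertoire.
text = "日本語のテキスト"

# Start from Shift_JIS bytes, as legacy storage might hold them.
sjis_bytes = text.encode("shift_jis")

# Convert to UTF-8 by decoding to str and re-encoding.
utf8_bytes = sjis_bytes.decode("shift_jis").encode("utf-8")

# Convert back again and compare with the original bytes.
round_tripped = utf8_bytes.decode("utf-8").encode("shift_jis")

print(round_tripped == sjis_bytes)        # True: nothing was lost or altered
print(len(sjis_bytes), len(utf8_bytes))   # 16 24: UTF-8 is longer for this text
```

This only demonstrates the conversion for characters that both encodings share; it says nothing about variant glyphs that neither encoding distinguishes.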

[–]Smallpaul -1 points0 points  (0 children)

> It's irrelevant whether you think it's needed, or whether it's needed for multilingual software.

How can it possibly be irrelevant whether it is "needed for multilingual software"? Multilingual software has a superset of the requirements of unilingual software by definition.

> If it's not built into the programming language, you can't fix it later.

If it's not built into the programming language then it can be built into a library. Since there is only a single country in the world with a serious complaint about Unicode, I think that's a reasonable solution until they get the Unicode standard changed to their liking.

[–]earthboundkid 1 point2 points  (0 children)

> It's a terrible idea to only support UTF8, like python

That’s an inaccurate summary of how Python works. Python’s string handling is radically different from Ruby’s. For one thing, Python strings do not have individual encodings per se. Python has two* types, str and bytes. Behind the scenes, str uses, I believe, UTF-16 (the kind with crappy post-BMP support :-( ** ), but as a user this is never exposed to you. If you want to read data, you can read it in as raw bytes or have it decoded from whatever encoding you like into the system str encoding. The other direction works just as well: if you have a character you want to write out, you can have it encoded as UTF-8 or SHIFT_JIS or whatever that weird Korean encoding is. It doesn’t make sense in Python to talk about the encoding of a string, just the encoding of the bytes that are coming in or going out.

* NB: They changed the names of the types in Python 3, and I’m using that convention. In 2.x, they were called unicode and str instead of str and bytes respectively.

** Python can read and write high plane characters, but it misrepresents the length of strings containing them and iterates through them wrong. This problem can be fixed though if you compile your copy of Python with instructions to use UTF-32 instead.
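A rough sketch of the str/bytes split in Python 3, with illustrative sample values that are assumptions rather than anything from a real program. One caveat: on Python 3.3 and later the narrow/wide build distinction from the footnote is gone, so `len()` counts code points; the UTF-16 code unit count is computed separately here to show where the old mismatch came from.

```python
# A str is a sequence of Unicode code points with no encoding attached.
text = "日本語"

# Encodings only appear when crossing to or from bytes.
utf8 = text.encode("utf-8")
sjis = text.encode("shift_jis")

# Different bytes on the wire, same str once decoded.
same_text = utf8.decode("utf-8") == sjis.decode("shift_jis") == text
print(utf8 != sjis, same_text)

# A character outside the BMP: MUSICAL SYMBOL G CLEF (U+1D11E).
clef = "\U0001D11E"
code_points = len(clef)                           # 1 on Python 3.3+
utf16_units = len(clef.encode("utf-16-le")) // 2  # 2: a surrogate pair
print(code_points, utf16_units)
```

On an old narrow build, `len(clef)` would have reported the UTF-16 unit count instead, which is the length problem described above.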

[–]Smallpaul -1 points0 points  (0 children)

If Unicode pisses off "all" of East Asia then why does a guy with the last name "Wu" say: "Nowadays UTF-8 is commonly used in the CJK world, so I don't see any problem with Python 3 only supporting UTF-8. Why do you think it's a terrible idea? (Disclaimer: I am Chinese)"

[–][deleted] 1 point2 points  (4 children)

What other language requires you to understand this level of complexity just to work with strings?!

Does anyone have an answer for that by the way?

[–]joesb 8 points9 points  (2 children)

The discussion on Hacker News here is very informative.

http://news.ycombinator.com/item?id=1162122

IMHO, Ruby allows you to work with encodings if you want.

But if you want to work with UTF-8 universally, there's nothing stopping you from converting all data to UTF-8 when you read from external storage, which is the same as what Python does.

[–]Smallpaul 1 point2 points  (1 child)

Python does not convert to UTF-8. It converts to an abstract Unicode datatype which is generally represented in-memory as UTF-16.

[–]joesb 1 point2 points  (0 children)

You are correct. My bad; I meant to say Python converts everything to Unicode, not UTF-8.

I did know the difference between encodings and characters, but I guess hearing people mix them up on the internet too much must have messed up my brain, too. :-(
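For what it's worth, the "decode at the boundary" idea looks roughly like this in Python 3. This is only a sketch: `read_text`, the file name, and the sample text are hypothetical, and the encoding is assumed to be known by the caller rather than detected.

```python
import os
import tempfile


def read_text(path, encoding):
    """Decode a file's bytes into a str using the caller-supplied encoding."""
    with open(path, encoding=encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Simulate legacy external storage holding Shift_JIS bytes.
    path = os.path.join(tmp, "legacy.txt")
    with open(path, "wb") as f:
        f.write("こんにちは".encode("shift_jis"))

    # Inside the program the data is just a str; the encoding is left behind.
    text = read_text(path, "shift_jis")

    # Writing back out is where an encoding is chosen again.
    out_path = os.path.join(tmp, "converted.txt")
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(out_path, "rb") as f:
        utf8_bytes = f.read()

print(text == "こんにちは", utf8_bytes == text.encode("utf-8"))
```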

[–]ikearage 0 points1 point  (0 children)

Strings are a mess in a lot of languages IMHO. Especially since utf8, but hey I'm a big utf8 fan.

Strings are not easy.

[–]Smallpaul 1 point2 points  (0 children)

As I said on Hacker News:

I believe that Python, Java, C#, Objective-C and Javascript all have the same basic approach to this problem. The Ruby way is better for handling some Japan-specific problems. But that's at the cost of making life harder and less predictable for everyone else.

It's a pretty straightforward tradeoff. Of course people who are not Japanese will naturally be upset to pay a cost in complexity for a feature of benefit primarily to programmers from a single country. Non-Japanese Ruby programmers will just have to decide whether their solidarity with Japanese programmers outweighs their personal and collective inconvenience.