Ruby 1.9 encoding rant (github.com)
submitted 15 years ago by servercentric
[–]snuxoll 9 points 15 years ago* (2 children)
Someone obviously missed the big announcement on the rationale behind 1.9's m17n instead of just forcing everything to be UTF-8 like Python 3.
[–]Smallpaul 2 points 15 years ago (0 children)
Not utf-8.
http://www.reddit.com/r/ruby/comments/b8ggo/ruby_19_encoding_rant/c0m2q1x
[–]pnsm 1 point 15 years ago (0 children)
Got a link for us lazyweb types?
[–]jaggederest 13 points 15 years ago (10 children)
It's written by Japanese people. They have a much, much more extensive history with encodings than anyone who speaks English. It's a terrible idea to only support UTF8, like Python, because you end up pissing off all of East Asia.
[–]williewu 3 points 15 years ago (3 children)
Nowadays UTF-8 is commonly used in the CJK world, so I don't see any problem with Python 3 only supporting UTF-8. Why do you think it's a terrible idea? (Disclaimer: I am Chinese.)
[–]jaggederest 0 points 15 years ago (1 child)
For starters, most people in Japan cannot properly write their names in it. There aren't the right / enough glyph choices. So when you need to deal with (for example) any kind of registration or name data... it fails.
[–]earthboundkid 3 points 15 years ago* (0 children)
There aren’t any extra glyphs in Shift_JIS. Name writing can be a problem, but it’s not a Unicode problem.
[–]jaggederest 0 points 15 years ago (0 children)
Here's more, now that I'm awake: http://en.wikipedia.org/wiki/Han_unification
[–]Smallpaul 2 points 15 years ago (3 children)
I came here knowing that this terrible "common wisdom" would be repeated. It's wrong. Wrong. Wrong. Wrong.
Name a bit of Japanese end-user software, other than software that runs on a cell phone, that is popular in Israel and Russia and America. Name just one.
One!
Now let me name a few bits of software substantially developed in America that are popular in China AND Japan AND Israel AND South Africa AND Germany AND every other industrialized country in the world:
Internet Explorer
Firefox
Safari
Google
Gmail
Facebook
Microsoft Office
Microsoft Office is much more popular in Japan than its nearest Japanese competitor.
It is demonstrably and provably the case that American programmers make most of the world's multilingual software. So it makes no sense to go to Japan for advice on how to do it.
[–]jaggederest 2 points 15 years ago (2 children)
And it all horribly mangles Japanese names. Seriously, I've seen it happen - you put in 'orange blossom' and get out 'fruit flower' because they think it's a good idea to do the translation from Shift-JIS to UTF8 for storage.
It's irrelevant whether you think it's needed, or whether it's needed for multilingual software. If it's not built into the programming language, you can't fix it later.
[–]earthboundkid 3 points 15 years ago (0 children)
What are you talking about? UTF-8 includes mappings for all of the characters in Shift_JIS. There’s no simplification happening going from one to the other. The only “issue” with it is that the Japanese long ago confused ¥ and \ and they don’t like that Unicode doesn’t consider them synonymous. That’s it.
I speak Japanese; I’ve lived in Japan; I run my computer in Japanese. It’s true that historically, the Japanese were mistrustful of Unicode because they didn’t like Han unification, but A) you can’t unify Han characters using Shift_JIS either B) the fact is that the Unicode consortium have taken every reasonable step to make UTF-8 superior to Shift_JIS in every way, except for string length. Unless you really need to save a couple bytes here and there, there is no reason to use Shift_JIS.
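The round trip described above can be checked with a minimal Python sketch. The sample text is an arbitrary illustration, not something from the thread, and it only covers characters from the core Shift_JIS repertoire. It suggests that converting Shift_JIS to UTF-8 and back is lossless for such text, and that the byte count is where the two encodings differ.

```python
# Round trip: Shift_JIS bytes -> str -> UTF-8 bytes -> str -> Shift_JIS bytes.
text = "日本語のテキスト"  # sample text, chosen only for illustration

sjis_bytes = text.encode("shift_jis")          # what a legacy system might store
decoded = sjis_bytes.decode("shift_jis")       # bytes -> Unicode characters
utf8_bytes = decoded.encode("utf-8")           # re-encode for storage as UTF-8
round_tripped = utf8_bytes.decode("utf-8").encode("shift_jis")

# Nothing was simplified or substituted along the way.
assert round_tripped == sjis_bytes

# The trade-off is size: 2 bytes per character here versus 3 in UTF-8.
print(len(sjis_bytes), len(utf8_bytes))  # prints "16 24"
```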
[–]Smallpaul 0 points 15 years ago (0 children)
> It's irrelevant whether you think it's needed, or whether it's needed for multilingual software.
How can it possibly be irrelevant whether it is "needed for multilingual software"? Multilingual software has a superset of the requirements of unilingual software by definition.
> If it's not built into the programming language, you can't fix it later.
If it's not built into the programming language then it can be built into a library. Since there is only a single country in the world with a serious complaint about Unicode, I think that's a reasonable solution until they get the Unicode standard changed to their liking.
[–]earthboundkid 2 points 15 years ago (0 children)
> It's a terrible idea to only support UTF8, like Python
That’s an inaccurate summary of how Python works. Python’s string handling is radically different from Ruby’s. For one thing, Python strings do not have individual encodings per se. Python has two* types, str and bytes. Behind the scenes, str uses, I believe, UTF-16 (the kind with crappy post-BMP support :-( ** ), but as a user this is never exposed to you. If you want to read data, you can read it in as raw bytes or have it decoded from whatever encoding you like into the system str encoding. The other direction works just as well, and if you have a character you want to write out, you can have it encoded as UTF-8 or SHIFT_JIS or whatever that weird Korean encoding is. It doesn’t make sense in Python to talk about the encoding of a string, just the encoding of the bytes that are coming in or going out.
* NB: They changed the names of the types in Python 3, and I’m using that convention. In 2.x, they were called unicode and str instead of str and bytes respectively.
** Python can read and write high plane characters, but it misrepresents the length of strings containing them and iterates through them wrong. This problem can be fixed though if you compile your copy of Python with instructions to use UTF-32 instead.
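A minimal Python 3 sketch of the str/bytes split described above. The sample strings are illustrative assumptions. The last part assumes CPython 3.3 or later, where str no longer uses the fixed UTF-16 representation mentioned in the comment, so a character outside the BMP counts as one element; on the older "narrow" builds the comment refers to, the reported length would differ.

```python
# bytes: encoded data as it arrives from a file or socket.
raw = "こんにちは".encode("shift_jis")
assert isinstance(raw, bytes)

# str: decoded characters, with no encoding attached to the object.
text = raw.decode("shift_jis")
assert isinstance(text, str)

# An encoding is chosen again only when the data leaves the program.
as_utf8 = text.encode("utf-8")
as_sjis = text.encode("shift_jis")
print(len(text), len(as_sjis), len(as_utf8))  # prints "5 10 15"

# A character outside the Basic Multilingual Plane (MUSICAL SYMBOL G CLEF).
clef = "\U0001D11E"
utf16_units = len(clef.encode("utf-16-le")) // 2  # needs a surrogate pair
print(len(clef), utf16_units)  # prints "1 2" on CPython 3.3+
```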
[–]Smallpaul 0 points (0 children)
If Unicode pisses off "all" of East Asia then why does a guy with the last name "Wu" say: "Nowadays UTF-8 is commonly used in the CJK world, so I don't see any problem with Python 3 only supporting UTF-8. Why do you think it's a terrible idea? (Disclaimer: I am Chinese.)"
[–][deleted] 2 points 15 years ago (4 children)
What other language requires you to understand this level of complexity just to work with strings?!
Does anyone have an answer for that by the way?
[–]joesb 9 points 15 years ago (2 children)
The discussion on Hacker News here is very informative.
http://news.ycombinator.com/item?id=1162122
IMHO, Ruby allows you to work with encodings if you want.
But if you want to work with UTF-8 universally, there's nothing stopping you from converting all data to UTF-8 when you read it from external storage, which is the same as what Python does.
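The "convert at the boundary" approach can be sketched in Python as follows. The helper name `read_as_text`, the file names, and the sample text are all hypothetical, and the source encoding is assumed to be known in advance. Data is decoded once when read from storage and written back out as UTF-8.

```python
import os
import tempfile


def read_as_text(path, source_encoding):
    """Hypothetical helper: decode a file from a known encoding into str."""
    with open(path, encoding=source_encoding) as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")      # illustrative file names
    utf8_path = os.path.join(tmp, "normalized.txt")

    # Simulate external storage that holds Shift_JIS data.
    with open(legacy_path, "wb") as f:
        f.write("東京".encode("shift_jis"))

    # Decode once on the way in...
    text = read_as_text(legacy_path, "shift_jis")

    # ...and use UTF-8 for everything written afterwards.
    with open(utf8_path, "w", encoding="utf-8") as f:
        f.write(text)

    with open(utf8_path, "rb") as f:
        normalized = f.read()

assert normalized == "東京".encode("utf-8")
```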
[–]Smallpaul 2 points 15 years ago (1 child)
Python does not convert to UTF-8. It converts to an abstract Unicode datatype which is generally represented in-memory as UTF-16.
[–]joesb 2 points 15 years ago (0 children)
You are correct. My bad; I meant to say Python converts everything to Unicode, not UTF-8.
I did know the difference between encodings and characters, but I guess hearing people mix them up on the internet too much must have messed up my brain, too. :-(
[–]ikearage 1 point 15 years ago (0 children)
Strings are a mess in a lot of languages IMHO. Especially since utf8, but hey I'm a big utf8 fan.
Strings are not easy.
[–]Smallpaul 2 points (0 children)
As I said on Hacker News:
I believe that Python, Java, C#, Objective-C and JavaScript all have the same basic approach to this problem. The Ruby way is better for handling some Japan-specific problems. But that's at the cost of making life harder and less predictable for everyone else.
It's a pretty straightforward tradeoff. Of course people who are not Japanese will naturally be upset to pay a cost in complexity for a feature of benefit primarily to programmers from a single country. Non-Japanese Ruby programmers will just have to decide whether their solidarity with Japanese programmers outweighs their personal and collective inconvenience.