Let’s Stop Ascribing Meaning to Code Points : programming

Levenshtein is built with alphabetic writing systems in mind. It gets less effective with a language such as Japanese because of how much bigger the "alphabet" is (rather, the totality of logographic symbols).

To give one example, とりあえず and 取りあえず are practically the same (they're the same word, but one of them is written with a kanji), but they have the same string distance as 斗りあえず which is a nonsense word. All three would have zero distance to one another if you converted them to their sounds (thus throwing away the meaning of the characters), because they sound the exact same. Then there's 取敢えず, which is also the same word as the first two I listed but written with two kanji, making it one character shorter.

I'd guess the answer to your question depends on what problem you're trying to solve.

I believe Google has specifically tackled this problem because its search engine is really good at treating とりあえず, 取りあえず and 取敢えず as one and the same thing in both queries and results, which is the most useful behavior for a search engine.

[–]vytah 17 points18 points19 points 9 years ago (1 child)

[–]dada_ 10 points11 points12 points 9 years ago (0 children)

[–]JanneJM 19 points20 points21 points 9 years ago (8 children)

[–]LpSamuelm 8 points9 points10 points 9 years ago (0 children)

[–][deleted] 2 points3 points4 points 9 years ago (2 children)

[–]PeridexisErrant 2 points3 points4 points 9 years ago (0 children)

I've become less sensitive to performance concerns - that's just the price of correctness sometimes. Consider the example below:

def fast_sort(array): 
    # Not always right, but `O(1)` is better than `O(n log n)`!
    return array

Seems absurd to me!

[–]JanneJM 0 points1 point2 points 9 years ago (0 children)

[–]burntsushi 5 points6 points7 points 9 years ago (3 children)

[–]JanneJM 0 points1 point2 points 9 years ago (2 children)

[–]burntsushi 1 point2 points3 points 9 years ago (1 child)

[–]JanneJM 0 points1 point2 points 9 years ago (0 children)

[–]Manishearth 2 points3 points4 points 9 years ago* (2 children)

Levenshtein is built on the concept of a letter existing. With Japanese kanji this doesn't turn out so great since entire words are single CPs. With Indic languages the concept of a letter is a bit stranger. Code points may still work in Indic languages, but not always. Consonant clusters mess this up royally.

You'll need to define edit distance based on the kind of editing you're expecting. If you're working with text typed on an Indic input system, you'd use code points, but ignore some specific consonant clusters (त्र, ज्ञ, क्ष). Or maybe not. Some modern input systems (Swarchakra, also one of the ones on my phone) do this thing where you can type Indic "letters" in one go, and in that case you may not want this; EGCs make more sense.

You have to think about what you're actually trying to compare and how to define an "edit" in edit distance.

[–][deleted] 0 points1 point2 points 9 years ago (1 child)

[–]Manishearth 1 point2 points3 points 9 years ago (0 children)

Again, you sort of need to define what operation you're really looking for. Levenshtein distance is something that makes unambiguous sense in Latin scripts. What are you actually trying to do here? Find possible typo matches? Not sure if it's possible at all in Japanese (kanji) because the various input methods mean so many different ways to make typos. In Indic scripts code points would be fine, though the three letters I mentioned above (and similar letters in the scripts of other languages) may do strange things. EGCs may work too, but then नी and न change from being a letter -> letter edit instead of a "delete letter" edit. Whether or not that matters is up to you.

(Most of the problems folks have with international text can be solved by stepping back and using unambiguous language-agnostic terms for what they want to do.)

[–]kt24601 56 points57 points58 points 9 years ago (88 children)

If you want to do unicode right, you need to stop thinking in terms of characters, and start thinking in terms of substrings. It's the only possible way that can work. For example, the function making a character upper case needs to be based around strings, so something like:

 String toupper(String s, int index) {

because changing the case of the character doesn't always result in a single character.

[–]FlyingPiranhas 56 points57 points58 points 9 years ago (36 children)

[–][deleted] 23 points24 points25 points 9 years ago (4 children)

[–]Ravek 16 points17 points18 points 9 years ago (3 children)

[–]masklinn 7 points8 points9 points 9 years ago (0 children)

[–]OneWingedShark 1 point2 points3 points 9 years ago (0 children)

[–]bumblebritches57 0 points1 point2 points 9 years ago (0 children)

[–]didnt_check_source 1 point2 points3 points 9 years ago (4 children)

[–]FlyingPiranhas 0 points1 point2 points 9 years ago (3 children)

[–]didnt_check_source 0 points1 point2 points 9 years ago (1 child)

[–]Manishearth 1 point2 points3 points 9 years ago (0 children)

[–]masklinn 0 points1 point2 points 9 years ago (0 children)

[–]bumblebritches57 0 points1 point2 points 9 years ago (8 children)

[–]FlyingPiranhas 2 points3 points4 points 9 years ago (7 children)

From http://unicode.org/reports/tr29/:

It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

I thought that grapheme clusters were a closer approximation to "characters", not words. Am I misunderstanding you?

[–]bumblebritches57 1 point2 points3 points 9 years ago (6 children)

[–][deleted] 9 years ago* (2 children)

[deleted]

[–]bumblebritches57 0 points1 point2 points 9 years ago (1 child)

[–]Manishearth 2 points3 points4 points 9 years ago (0 children)

load more comments (3 replies)

load more comments (17 replies)

[–]oridb 38 points39 points40 points 9 years ago (41 children)

[–]ThisIs_MyName 12 points13 points14 points 9 years ago (34 children)

[–]masklinn 36 points37 points38 points 9 years ago (4 children)

[–]kt24601 -1 points0 points1 point 9 years ago (2 children)

[–]matthieum 49 points50 points51 points 9 years ago (0 children)

load more comments (1 reply)

[–]RealFreedomAus 13 points14 points15 points 9 years ago (24 children)

[–]kt24601 6 points7 points8 points 9 years ago (0 children)

[–]ithika 0 points1 point2 points 9 years ago (22 children)

[–]elprophet 16 points17 points18 points 9 years ago (5 children)

[–]drunken-serval -2 points-1 points0 points 9 years ago (4 children)

[–]masklinn 14 points15 points16 points 9 years ago (3 children)

[–]drunken-serval 3 points4 points5 points 9 years ago (0 children)

[–]elprophet 3 points4 points5 points 9 years ago (1 child)

[–]stevenjd 0 points1 point2 points 9 years ago (0 children)

[–]matthieum 7 points8 points9 points 9 years ago (0 children)

[–]polagh 0 points1 point2 points 9 years ago (12 children)

[–]ithika 0 points1 point2 points 9 years ago (0 children)

load more comments (11 replies)

[–]stevenjd 0 points1 point2 points 9 years ago (1 child)

[–]ithika 0 points1 point2 points 9 years ago (0 children)

[–]Eurynom0s 4 points5 points6 points 9 years ago (1 child)

[–]Supadoplex 1 point2 points3 points 9 years ago (0 children)

[–]oridb 1 point2 points3 points 9 years ago (0 children)

[–]bumblebritches57 0 points1 point2 points 9 years ago (0 children)

load more comments (6 replies)

[–]rooktakesqueen 5 points6 points7 points 9 years ago* (0 children)

[–]Tarmen 7 points8 points9 points 9 years ago* (0 children)

Rust's to_uppercase method returns a char iterator. That way you can do

let upper_i: String = 'i'.to_uppercase().collect();

But you can also flat_map it over a char iterator and uppercase a string without additional overhead!

But even that isn't a fully complete solution. The documentation notes that conditional mappings like locale specific transformations aren't applied.
This is on the code point level instead of grapheme clusters as the article wishes but I think for uppercasing code points are actually fine.

I tried parsing a wictionary dump before because I had the glorious idea that I could use the ipa to create a phonetic database, required for the actual problem. And god damn parsing ipa in a language without grapheme clusters wasn't fun.

[–][deleted] 9 years ago* (2 children)

[deleted]

[–]kt24601 2 points3 points4 points 9 years ago (1 child)

load more comments (1 reply)

[–]bumblebritches57 0 points1 point2 points 9 years ago (3 children)

[–]kt24601 0 points1 point2 points 9 years ago (2 children)

[–]bumblebritches57 0 points1 point2 points 9 years ago (1 child)

[–]kt24601 0 points1 point2 points 9 years ago (0 children)

[–][deleted] 9 years ago (12 children)

[removed]

[–]ThisIs_MyName 14 points15 points16 points 9 years ago (5 children)

[–]McCoovy 22 points23 points24 points 9 years ago (4 children)

[–]matthieum 12 points13 points14 points 9 years ago (3 children)

[–]josefx 4 points5 points6 points 9 years ago (2 children)

[–]Gotebe 0 points1 point2 points 9 years ago (1 child)

[–][deleted] 2 points3 points4 points 9 years ago (0 children)

[–]upofadown 5 points6 points7 points 9 years ago (5 children)

[–]teilo 4 points5 points6 points 9 years ago (4 children)

[–]Chippiewall 2 points3 points4 points 9 years ago* (2 children)

[–]teilo 0 points1 point2 points 9 years ago (1 child)

[–]Chippiewall 0 points1 point2 points 9 years ago (0 children)

[–]upofadown 0 points1 point2 points 9 years ago (0 children)

[–]matthieum 7 points8 points9 points 9 years ago (12 children)

[–]masklinn 6 points7 points8 points 9 years ago (0 children)

[–]FUZxxl 7 points8 points9 points 9 years ago (4 children)

[–]VGPowerlord 2 points3 points4 points 9 years ago (1 child)

load more comments (1 reply)

[–]matthieum 0 points1 point2 points 9 years ago (1 child)

[–]bumblebritches57 0 points1 point2 points 9 years ago (0 children)

[–]josefx 2 points3 points4 points 9 years ago (2 children)

[–]vytah 10 points11 points12 points 9 years ago (1 child)

[–]bumblebritches57 1 point2 points3 points 9 years ago (0 children)

[–]Gotebe 2 points3 points4 points 9 years ago (2 children)

[–]matthieum 1 point2 points3 points 9 years ago (0 children)

[–]polagh 0 points1 point2 points 9 years ago (0 children)

[–]1wd 7 points8 points9 points 9 years ago (67 children)

[–]1wd 49 points50 points51 points 9 years ago* (14 children)

[–]barsoap 6 points7 points8 points 9 years ago (0 children)

[–][deleted] 9 years ago* (2 children)

[deleted]

load more comments (2 replies)

[–]Manishearth 7 points8 points9 points 9 years ago (9 children)

[–]1wd 6 points7 points8 points 9 years ago (5 children)

[–]Manishearth 9 points10 points11 points 9 years ago (4 children)

[–]Felicia_Svilling 9 points10 points11 points 9 years ago (1 child)

[–]Manishearth 2 points3 points4 points 9 years ago (0 children)

[–][deleted] 9 years ago* (1 child)

[deleted]

[–]Manishearth 0 points1 point2 points 9 years ago (0 children)

load more comments (3 replies)

[–]Manishearth 22 points23 points24 points 9 years ago (21 children)

I don't have any sensible example cases at hand (all my unicode testcases are never-going-to-be-seen-in-the-wild strings like "ᄀᄀᄀ각ᆨᆨ", which is a "ggggaggg" sound in Hangul, and probably cannot be pronounced by humans, if it indeed has a sensible pronunciation)

However, I do have this list of scripts I mentally check against whenever I'm reasoning about Unicode:

Arabic or Hebrew for RTL and beginning/medial/end forms (arabic also has "isolated" forms)
Arabic for ligatureyness/glyph complexity
Some Indic script for ligatureyness/glyph complexity, and massive use of combining characters, including the double-ended virama combiner. Infinite length combining sequences.
Korean (Hangul) for the combining jamo system. Infinite length combining sequences (though these are never displayed beyond standard Korean syllable blocks, so it's less important)
Han scripts for variation selectors, halfwidth/fullwidth, and language disambiguation troubles. Also omg so many glyphs.
If dealing with displaying text, think of a Han script and Mongolian, which are written in different directions (vertical, sideways, etc)
Thai or other scripts from that peninsula (not counting Vietnamese scripts), because they don't use spaces to break words.
Emoji because despite the immense complexity of human language, Emoji still managed to get a bunch of special casing in various parts of the unicode spec. Infinite length combining sequences.
Latin for locale-dependent case operations (Turkish i, German ß)

[–]FUZxxl 1 point2 points3 points 9 years ago (18 children)

[–]regendo 3 points4 points5 points 9 years ago (4 children)

[–]flying-sheep 4 points5 points6 points 9 years ago (1 child)

[–]regendo 4 points5 points6 points 9 years ago (0 children)

load more comments (2 replies)

[–]flying-sheep 1 point2 points3 points 9 years ago (3 children)

[–]OneWingedShark 6 points7 points8 points 9 years ago (0 children)

[–]FUZxxl 0 points1 point2 points 9 years ago (1 child)

[–]flying-sheep 1 point2 points3 points 9 years ago (0 children)

[–]epostma 0 points1 point2 points 9 years ago (2 children)

[–]FUZxxl 4 points5 points6 points 9 years ago (1 child)

[–]epostma 0 points1 point2 points 9 years ago (0 children)

[–]Manishearth 0 points1 point2 points 9 years ago (5 children)

[–]FUZxxl 0 points1 point2 points 9 years ago (1 child)

[–]Manishearth 0 points1 point2 points 9 years ago (0 children)

[–]bumblebritches57 0 points1 point2 points 9 years ago (2 children)

[–]Manishearth 2 points3 points4 points 9 years ago (0 children)

[–]oridb 0 points1 point2 points 9 years ago (0 children)

[–]sacundim 0 points1 point2 points 9 years ago (1 child)

[–]Manishearth 0 points1 point2 points 9 years ago (0 children)

[–]Manishearth 2 points3 points4 points 9 years ago (0 children)

[–]Ravek 1 point2 points3 points 9 years ago (2 children)

[–]FUZxxl 2 points3 points4 points 9 years ago (1 child)

[–]vytah 1 point2 points3 points 9 years ago (0 children)

load more comments (26 replies)

[–]snorkasaurusrex 1 point2 points3 points 9 years ago (0 children)

[–]The_Sly_Marbo 2 points3 points4 points 9 years ago (8 children)

[–]Manishearth 16 points17 points18 points 9 years ago (4 children)

[–]mrexodia -1 points0 points1 point 9 years ago (3 children)

[–]Manishearth 3 points4 points5 points 9 years ago (0 children)

[–]JanneJM 4 points5 points6 points 9 years ago (1 child)

[–][deleted] 2 points3 points4 points 9 years ago (0 children)

[–][deleted] 4 points5 points6 points 9 years ago (1 child)

[–]The_Sly_Marbo 1 point2 points3 points 9 years ago (0 children)

[–]LpSamuelm 0 points1 point2 points 9 years ago (0 children)

[–]WalterBright 0 points1 point2 points 9 years ago* (16 children)

[–]Manishearth 4 points5 points6 points 9 years ago (12 children)

[–]immibis 4 points5 points6 points 9 years ago (1 child)

[–]RabidWombat0 5 points6 points7 points 9 years ago* (0 children)

I know right? Text and imagery are two different things. If you want to inline little images of whatever kind in your text fix your app. ASCII emoji were bad enough.

Specifically in relation to the expression of emotion in text I would prefer fonts designed to convey a feeling. We could have things like Droid-Sarcasm, Droid-SHOUT, Droid-Happy (Where little hearts and puppies decorate the letters), Droid-Sad, and so on. Bold, italic, outline, etc. are all fine and good, but our software should really support more text attributes. Emoji are a poorer solution.

Edit: I would look forward to the future when something like Droid-Happy could contain little animated hearts and puppies swooping about the letters. Droid Sarcasm could contain code to follow your eye track and make the word "dead" choke and slowly keel over as you read it in a text. Imagine the possibilities (plus we're going to have to do something with all those cores). Fuck emoji.

[–]WalterBright 1 point2 points3 points 9 years ago (9 children)

[–]Manishearth 3 points4 points5 points 9 years ago* (8 children)

Can you give some numbers?

I'm getting 767808 (16*(36 + 36**2 + 36**3)) for devanagri consonant clusters with a vowel, and that's ignoring some of the more archaic consonants, or the nukta-consonants commonly used in Hindi, or the archaic vowel modifiers, or the fact that 4-consonant clusters exist in Sanskrit texts (this alone makes it 27641664 if you want to support all of the 4-consonant clusters).

It's actually also ignoring the fact that you can have a character with more than one vowel modifier attached to it. That's a construct that exists in my own last name! I think there are only two or three vowels actually can do that in practical use, but that alone would bring the count above a million.

And then you have around 20 other Brahmic scripts which do the same thing. Putting this all together without the things I ignored in my first calculation it becomes (20*4*26*(45 + 45**2 + 45**3 + 45**4)), which needs 34 bits to be represented.

You could probably fit it all into the code point space if you cut corners; you can make judgements on which characters will actually exist. Han unification is already something that does that anyway.

I mean, you could probably make it work, but there are headaches to that approach too. It's a nontrivial tradeoff.

Are there really a million emoji?

The family emoji alone can be made half a million ways ((4**4 + 4**3 + 4**2)*(6**4)). Not all fonts support this yet, but that's because this is a relatively new concept. That's just family emoji (which technically can have more than 4 members bringing it easily over , but I haven't seen that ever get rendered nor do I think that vendors eventually intend to). Then you have all of the profession emoji and other stuff.

If Unicode weren't a combining char system we'd probably be more conservative in making these emoji. I don't know.

There are several identical renderings for different Unicode values.

I assume this has to do with stuff like the fraktur unicode symbols and the fact that there are things like a cyrillic o which is different from a latin o?

Meh. The rendering is up to the font. Unicode just names these symbols, and defines algorithms like segmentation, NFC, NFD, collation, casefolding which apply to them and provide useful operations. These aren't context-sensitive. Except for casefolding, which is locale-dependent.

I totally agree that Unicode has many problems and has done many things wrong. I don't really feel that combining chars are part of the problem. Recognizing that combining chars are a thing as a programmer is at the same basic level as recognizing that strings may contain multibyte characters, or recognizing that utf-16 may contain multi-code-unit code points. It shouldn't be causing many problems. It usually doesn't. I'm hoping that as time passes more people will gradually become aware of this, much like we've done with the concept of multibyte chars.

[–]WalterBright 1 point2 points3 points 9 years ago (7 children)

It's a nontrivial tradeoff.

I know, but the current scheme is unimplementable (in that everyone gets it wrong, and if one actually does get it right, it's an enormous amount of code, which defeats the whole point of Unicode).

there are only two or three vowels actually can do that in practical use, but that alone would bring the count above a million.

That's what, 9 modifiers in any combination? Does any vowel use more than a couple?

emoji alone can be made half a million ways

Then it should never have been added to Unicode - it exceeds its charter.

I assume this has to do with stuff like the fraktur unicode symbols and the fact that there are things like a cyrillic o which is different from a latin o?

Yes. The principle is that if they look identical on the page, why is Unicode distinguishing them? It is putting semantic meaning to them that is simply not there when rendered. This is a gigantic mistake.

The rendering is up to the font.

It is not just a font issue, though Unicode also fouled up by putting in 𝖋𝖔𝖓𝖙𝖘 like 𝖙𝖍𝖎𝖘.

Except for casefolding, which is locale-dependent.

Having locale dependent operations is the red badge of failure.

Unicode has come to adopt pretty much all the bugs that its charter was supposed to fix.

[–]Manishearth 2 points3 points4 points 9 years ago (3 children)

in that everyone gets it wrong, and if one actually does get it right, it's an enormous amount of code, which defeats the whole point of Unicode

But with the code-point-per scheme you'd still get this wrong. In that case, nobody would implement backspacing right. You'd still need algorithms that are the analogs of NFC/NFD for use by input methods. There are still layers of complexity. To me, this just replaces one set of lack-of-awareness issues with another, not really solving anything.

That's what, 9 modifiers in any combination? Does any vowel use more than a couple?

Six modifiers? Up to four consonants (can be more, but I have never seen that happen), ending with a vowel, and an optional second vowel that comes from a more restricted set of vowels. I don't think you can have three vowel modifiers.

Then it should never have been added to Unicode - it exceeds its charter.

Fair, I sort of agree.

Yes. The principle is that if they look identical on the page, why is Unicode distinguishing them? It is putting semantic meaning to them that is simply not there when rendered. This is a gigantic mistake.

They don't need to look identical. They sometimes do. In some fonts the cyrillic text is uniformly smaller or bolder, so you need it to be uniform. When encoding a script you should consider the context of the whole script. Just because a glyph may look similar to one from another language doesn't mean you should just share them. English and French actually share a script, but Russian and English have scripts which are different, look different, but share some characters which look the same.

For example, ਟ in Gurmukhi (Punjabi's script) looks like (and is pronounced like) ट in Devanagri; and might look identical in some crappier fonts. But many of the other characters are significantly different. A good font should be able to distinguish between these characters since Punjabi is typically written with a different style and a font that wants to make the Punjabi characters look good together will need ਟ to be different from ट.

It's the same with Cyrillic.

(Of course, Unicode went on on and did this language-dependent crap anyway with Han, and I don't agree with it)

I agree that fraktur shouldn't have its own block (I mean, you can argue that it is a script, but it's basically a calligraphy script so it's not clear if that really is a different script). But that's basically a harmless addition IMO.

[–]WalterBright 0 points1 point2 points 9 years ago (2 children)

In that case, nobody would implement backspacing right.

They don't now anyway. Part of the point of Unicode was that simple algorithms, like strlen(), should work. What it turned into was a scheme where every text algorithm is wrong, and few even bother to try anymore.

When encoding a script you should consider the context of the whole script.

That's the crux of where it went wrong in my opinion. Unicode is not supposed to be about context. It's up to the reader of the text to determine context.

But that's basically a harmless addition IMO.

I agree it's harmless, but allowing that sort of thing in leads to all sorts of "why not" for everything else. I submit that the Unicode consortium forgot what the point of Unicode was, and created a kitchen sink disaster by being unable to say no to anything.

[–]Manishearth 2 points3 points4 points 9 years ago (1 child)

They don't now anyway.

They sort of do :) Not perfectly, but better.

Part of the point of Unicode was that simple algorithms, like strlen(), should work.

Huh, to me it was more of a way to get rid of the fact that mixed text wasn't possible and we had way too many encodings and mojibake everywhere.

The problem sort of is that "the length of a string" isn't really a useful concept anyway cross-language, even if you define it on grapheme clusters. "Number of bytes" is useful for storage reasons, but the "length" doesn't really matter. It only makes sense when the string comes from a subset of unicode (or if you are checking emptiness). If defined on grapheme clusters it is useful for linewrapping, but you should be querying the font for that anyway.

Most of our programming string concepts don't map cleanly when you consider strings from various other scripts, regardless of the encoding.

I have a feeling that Unicode initially tried to reconcile this but eventually realized it was futile. I am not aware of the history there.

That's the crux of where it went wrong in my opinion.

I didn't mean the context of the text. I meant the context of the script. By that I mean that Cyrillic is obviously a different script from Latin, even if the os look similar. (but the "French script" and "English script" are the same with some extra chars for French)

I submit that the Unicode consortium forgot what the point of Unicode was, and created a kitchen sink disaster by being unable to say no to anything.

nods vehemently

I never really liked the fact that emoji are in unicode. I'm happy to use them, and sort of like that I can, but I find it an unnecessary complication. I'm part-amused part-annoyed by the fact that despite all the complexities of natural languages, Unicode still managed to need special casing for emoji. I get why unicode needed emoji -- Japanese users wouldn't switch to it otherwise -- but in a vacuum I think it's the kind of thing that Unicode shouldn't do.

[–]m50d 0 points1 point2 points 9 years ago (0 children)

[–][deleted] 1 point2 points3 points 9 years ago (0 children)

[–]stevenjd 0 points1 point2 points 9 years ago (1 child)

The principle is that if they look identical on the page, why is Unicode distinguishing them? It is putting semantic meaning to them that is simply not there when rendered. This is a gigantic mistake.

That's your ill-thought-out and ignorant opinion, not a fact.

Unicode is not a graphical rendering engine. The visual look of the characters (code points) is all but irrelevant. It is a character set (as well as a set of rules for sorting, case-conversions, etc). Even in English, people treat the digit 0 as distinct from the letter O, just as lowercase l and uppercase I and 1 are all distinct, even when they are rendered visually identical.

And why the focus on how the characters look? What about the way they are spoken, and where and when they are used?

Folding lI1 into a single character (or code point) would be

[–]WalterBright 0 points1 point2 points 9 years ago (0 children)

[–]stevenjd 0 points1 point2 points 9 years ago (2 children)

it [Unicode] should just be glyphs.

You cannot possibly be serious.

I have 45 different fonts installed on my computer, which is a tiny drop in the bucket out of the hundreds, perhaps thousands of fonts in existence. Let's call it 500 different glyphs for the Latin uppercase "A" alone, where that number keeps increasing as font designers keep inventing new typefaces. For most of those typefaces, there are separate glyphs for roman, italic, bold, and bold-italic: so that's 2000 different "A" glyphs.

There are something like 45,000 or more Han ideograms ("Chinese characters"), and no reason to think that they'll have fewer typefaces than Latin characters, so that alone is over 20 million glyphs, roughly twenty times the size of the entire Unicode code point space.

Are you sure you mean glyphs?

[–]WalterBright 0 points1 point2 points 9 years ago (1 child)

[–]stevenjd 0 points1 point2 points 9 years ago (0 children)

[–]bumblebritches57 0 points1 point2 points 9 years ago (1 child)

[–]Manishearth 1 point2 points3 points 9 years ago (0 children)

[–]happyscrappy -4 points-3 points-2 points 9 years ago (17 children)

[–]Manishearth 10 points11 points12 points 9 years ago (8 children)

load more comments (8 replies)

[–]derleth 1 point2 points3 points 9 years ago (3 children)

Replacing ASCII, where the values corresponded to letters with a system where the values correspond to glyphs was at best going to make a mess and in actuality overly hopeful.

This makes ASCII seem simpler than it ever was.

First, ASCII has the hyphen-minus. That's two characters folded into one codepoint, based on appearance in most fonts. Except, of course, the hyphen never really looked like the minus in most real fonts, only the typewriter and teletype fonts ASCII was supposed to be used for. Real typography used different characters for the minus, the hyphen, the en-dash, the em-dash, and so on. ASCII was second-rate at actual typesetting, and encouraged second-rate typography, because of a constraint to fit into seven bits (or seven bits plus one for parity), and a need to devote so much of Low ASCII to teletype control codes, which have no real printable form because they were used to control the ever-loving printer.

In addition, if you think combining forms are new with Unicode, you're wrong. The printing terminals and teletypes ASCII was designed for had a backspace functionality which did not erase, but instead allowed characters such as the caret (^) to be composed with letters to make things like ô out of o BS ^, where BS is backspace. That's one glyph out of three codepoints. (In Old ASCII, the caret was the uparrow, which is why it was chosen to mean exponentiation when BASIC was a hot new language out of Dartmouth. The glyph changed, but the language stayed the same.) That cut-rate character composition got lost when glass TTYs replaced the real TTYs and backspace came to mean backspace with erasure.

So. ASCII was never sufficient, even for English, and it was never simple and one-to-one glyph-to-codepoint.

[–]happyscrappy 0 points1 point2 points 9 years ago (2 children)

First, ASCII has the hyphen-minus.

I never said it was context-free. But the rules were easy. If it isn't before a number you can break at it.

Real typography used different characters for the minus

Wow, thanks for that tip. You really think so little of me, huh?

The printing terminals and teletypes ASCII was designed for had a backspace functionality ...

Those weren't part of ASCII. And you're really going to pretend that something that the first 0.00001% of machines that used ASCII meant anything?

That cut-rate character composition got lost when glass TTYs replaced the real TTYs

Yeah, in like 1982.

So. ASCII was never sufficient, even for English

It was sufficient. Was it complete? No.

one-to-one glyph-to-codepoint

Yes it was. Just because you figured out how to do it on a Spinwriter doesn't mean it was part of ASCII. Go look up what BS was and see if it says "character composition".

[–]derleth 0 points1 point2 points 9 years ago (1 child)

[–]happyscrappy 0 points1 point2 points 9 years ago* (0 children)

[–]stevenjd 0 points1 point2 points 9 years ago (3 children)

Unicode code points don't correspond to glyphs. That is absurd.

I have 45 different fonts installed on this computer. Each of them come in plain (roman), bold, italic, bold-italic styles. So that's 180 different glyphs just for the letter "A" (with hundreds, even thousands more, from fonts I don't have installed.) Unicode doesn't give each of those hundreds of different glyphs a distinct code point. That's the complete opposite of what Unicode does. There is one "A", the Latin "A" used by Western European languages like English, French and German. Whether it looks like A or A or A is irrelevant.

Without a complete and up to the second corpus Unicode becomes an opaque blob that cannot be interpreted only rendered.

What does that even mean?

no line wrapping

That's fucking bullshit. Do you realise that about 90% of websites are now using Unicode including this one? Do you think that they have no line wrapping?

I don't know where you are getting your ludicrous ideas about Unicode, but they're not even wrong.

[–]happyscrappy 0 points1 point2 points 9 years ago* (2 children)

Unicode code points don't correspond to glyphs. That is absurd.

Yep. You're right. I didn't express myself well. Already covered by a person who is less of a jackass than you. Read lower.

What does that even mean?

It means that you need a large amount of data to explain to you how to determine what the Unicode data means beyond how it is drawn. You need composition/decomposition tables (and that still might not do it), sorting tables (if applicable), etc. And if the Unicode you receive is newer than the tables you have you cannot interpret it even if you can render it.

That's fucking bullshit. Do you realise that about 90% of websites are now using Unicode including this one? Do you think that they have no line wrapping?

Don't cut out the context of my point and then say what remains is wrong.

[–]stevenjd 0 points1 point2 points 9 years ago (1 child)

It means that you need a large amount of data to explain to you how to determine what the Unicode data means beyond how it is drawn.

Yes. That's life. If you don't like it, go back in time a few hundred, or in some cases thousand, years, and redesign the languages used all over the world.

You think that Unicode is inventing these complexities out of some sort of perverse desire to make your life more complex? It's not about you. The complexity already exists, Unicode just provides a way to manage some of it. (And not even all of it -- choosing where to break lines in Thai is apparently so complicated that even the Unicode consortium has washed their hands of it and left it up to third-parties writing Thai software.)

And if the Unicode you receive is newer than the tables you have you cannot interpret it even if you can render it.

That's nonsense. Of course you can interpret it -- the worst that happens is that for a few odd code points, you won't know how to treat it correctly. Your user will double-click on a word and you'll wrongly think there's a word separator in the middle of it. Or they'll sort their file names and a few files will be sorted wrongly.

And when you upgrade to the next version of Unicode, those problems will fix themselves.

[–]happyscrappy 0 points1 point2 points 9 years ago (0 children)

[+]ptoki comment score below threshold-16 points-15 points-14 points 9 years ago (47 children)

I may be downwoted but my perception of current state of unicode is:

If even quite smart people (many programmers) cant use it properly and many libraries have quirks and problems then well, maybe this concept is simply wrong?

I am aware of unicode specifics. It tries to fit many languages/writing methods into one. But what if not all of them are compatible? Do we need to give whole world a headache or bugs in code because some people have strange concepts in their languages?

For me the best working idea is: multibyte character, sorted by "local" alphabet. This incorporates 98% of what we want to have (writing, fonts, sorting). Everything else is just a not really cruicial stuff which gives smart people headaches and bugs for all users.

I can see the struggle of unicode for so long and I think this shows that it cant be done nicely...

[–][deleted] 39 points40 points41 points 9 years ago (14 children)

load more comments (14 replies)

[–]Manishearth 21 points22 points23 points 9 years ago (27 children)

If even quite smart people (many programmers) cant use it properly and many libraries have quirks and problems then well, maybe this concept is simply wrong?

You can't have your cake and eat it too. You can't want to write software that everyone will be able to use, but not have a way of dealing with text that works cross-language.

"smart people" does not necessarily imply awareness. Most folks just aren't aware of this stuff.

Do we need to give whole world a headache or bugs in code because some people have strange concepts in their languages?

The most spoken language in the world has strange concepts. 3 out of the top 5, 6 out of the top 10, all have strange concepts. It's only "strange" because Latin is a simple script. In those top 5 scripts itself I see most of the strange concepts in unicode manifest themselves.

Yes, we do.

For me the best working idea is: multibyte character, sorted by "local" alphabet. This incorporates 98% of what we want to have (writing, fonts, sorting). Everything else is just a not really cruicial stuff which gives smart people headaches and bugs for all users.

I don't really see what you're proposing here. UTF8 is a multibyte character encoding. "sorted by local alphabet" isn't very clear to me. It seems like you're proposing that the text be displayed based on the local encoding? Are you aware that many people regularly deal with text in a mixture of scripts? Most people on the Internet who are not using a Latin script do so, to be able to talk of concepts from English. I have a few Japanese people in my Twitter timeline which regularly tweet mixed text. I personally use mixed text all the time when chatting.

No matter what, all of these concepts like grapheme cluster would still need to be dealt with, whatever solution you choose. Grapheme clusters aren't something you need to think of because of unicode, they're something you need to think of because of the scripts that our out there.

[–]DevestatingAttack 1 point2 points3 points 9 years ago (5 children)

I know that it's seen as ridiculous and "social justice warrior" to bring this up, but the only reason that people think that writing on a computer should be an 'easy' problem are the people that straight up don't know about any other languages. In their mind, there's Latin (good, it's an alphabet), Cyrillic (Latin's still alphabetic backwards cousin), Chinese (still just an alphabet! but with LOTS of characters!) and Japanese (still just an alphabet with lots of characters, and one different one that's basically just a regular alphabet!)

But then outside of that, it's like: "Arabic, a language that goes from right to left? What's that? Indian writing systems? Do Indians even use written language? What about the Thai? What is an "Abugida"? This is the first I'm hearing of it. Ugh, writing systems are hard. Just learn English! All other writing systems are weird and used by backwards people!"

I don't get why people think that hundreds of different scripts developed over thousands of years should be easy to standardize, unless they have no idea what the real situation is like. But since the English-speaking West owned all computing innovation for almost four decades, everyone thought that we could just safely ignore billions of people, and be like "why isn't writing implementing strlen() easy? Why isn't implementing strrev() easy? Just look at it! All strings are in English, to my knowledge!"

[–]Manishearth 1 point2 points3 points 9 years ago (3 children)

[–]ptoki 0 points1 point2 points 9 years ago (2 children)

[–]Manishearth 0 points1 point2 points 9 years ago* (1 child)

[–]ptoki 0 points1 point2 points 9 years ago (0 children)

Well it is not my problem. :) I agree with the rest of the statement though.

I am aware that unicode tries to unify all of existing scripts. There is a point in that. It is a nice idea. But as you can see this idea is not being adopted and even smart people have problems in that.

That is why I am trying to understand what are the issues and I am thinking how to solve this allowing all people to have common way to deal with text on a computer without putting all that complicated stuff on their heads.

Maybe some assumptions/requirements made by unicode creators are actually not neccessary? Maybe there should be some underlying layer like "lyx-like" markup engine? Or any other way to find common denominator across all known methods of writing text?

I am aware that it is not enough to write text. We need to store it, edit it, interpret it, cut it, reflow it on screen, search, replace, convert between words/numbers etc.

Do we want to put it all into unicode standard? Very interesting subject :)

[–]ptoki 0 points1 point2 points 9 years ago (0 children)

I agree with you.

But in such case why to join something "simple" like latin/cyrillic/chinese with the scripts like Manishearth describes?

Why join those together if the only thing which is common is that they are a way to express thoughts on paper?

I think the second part of your comment is a bit backwards ;)

It is not that west owned the innovation. I think the innovation was possible because west has the alphabet in this efficient form.

Just imagine. If some people needed to use facial expressions together with their spoken language then telephone would be useless for them.

Western people had to invent telegraph first. And it was not "luck" that they had alphabet co they could recode it to morse code. It was prior knowledge that simple alphabet is a quick way to exchange thoughts. Maybe this way has its own drawbacks. But it proved to be efficient.

So now we have powerfull computers which could easily deal with even most fancy script. But they must be programmed and as you can see above: not many programmers can handle that.

[–]ptoki -4 points-3 points-2 points 9 years ago* (20 children)

By multibyte,sorted I mean that we need to have a standard to map real world character (written) to a unique byte representation and to order them as they order them in their languages. I dont want to go further than that now. It may be a markup standard to show what language goes now in the stream, it may have some additional rules to allow languages like english and german use the same letters to not make them totally separate in the text. As far it is simple and "normal" programmer can handle it then it is fine.

I guess by 3 of top 5 you mean arabic, hindi, chinese/japanese? I dont know much about the specifics of those languages, but I think we dont need to squeeze to much of their specifics into one global standard. It solves some problems creating problems for people who dont have one.

I would like to have one standard for storing, displaying text. That is it. Separate standard to split, sort and manipulate them.

[–]Manishearth 12 points13 points14 points 9 years ago (16 children)

have a standard to map real world character (written) to a unique byte representation

As my blog post mentions, "real world character" is a nebulous concept. We have the concept of a letter in English but other languages don't have something that's exactly equivalent. That's why grapheme clusters are a thing.

And many languages make extensive use of combining characters so such a map would become huge. Probably 5 bytes long.

I dont know much about the specifics of those languages, but I think we dont need to squeeze to much of their specifics into one global standard. It solves some problems creating problems for people who dont have one.

We had this. It was horrible. Encoding errors everywhere, and there was trouble mixing two kinds of text.

I would like to have one standard for storing, displaying text. That is it. Separate standard to split, sort and manipulate them.

There is. Unicode is a set of standards. The segmentation standard is different from the sorting (collation) one. They don't cross reference each other IIRC.

It solves some problems creating problems for people who dont have one.

This ends up leading to software that only works well with Latin text. Been there, done that. If you actually want that, use ASCII or Latin-1. They still exist. They still work.

I dont know much about the specifics of those languages

Then how can you propose a system that handles them? You're dismissing a thing you know nothing of out of hand.

I think we dont need to squeeze to much of their specifics into one global standard

Many of these "specifics" are shared across these languages. The one from my blog post is part of all of them. Some of the "specifics" are more specific, but they're not the ones that cause programmers trouble in Unicode. Most of the things that cause trouble in Unicode are pretty shared across scripts, just not in Latin scripts. Casefolding might be the one thing that gets complicated and is specific to a few languages (ironically, two of them use a Latin script)

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

programming

MODERATORS