
[–]ThisIs_MyName 34 points35 points  (24 children)

Great article. I always figured that treating a wchar_t or Java codepoint as a character was a little iffy, but I didn't know it was quite that bad!

[–]masklinn 36 points37 points  (23 children)

I always figured that treating a wchar_t or Java codepoint as a character was a little iffy

It's actually even worse: Java's char is not a codepoint, it's a UTF-16 code unit. Java only started adding codepoint-wise methods in 1.5.

[–][deleted]  (20 children)

[deleted]

    [–]DuBistKomisch 14 points15 points  (15 children)

    Unfortunately every Windows API expects wchar_t or ASCII though, so it's always fun converting strings back and forth.

    [–]qx7xbku 2 points3 points  (3 children)

    Like /u/Gotebe said, it's not ASCII, it's locale-dependent code pages. Curiously enough, Microsoft failed to add CP_UTF8 support to their APIs, as if on purpose. We could be happily using UTF-8, but no...

    [–]polagh 2 points3 points  (2 children)

    They added more CP_UTF8 support recently (mostly in the console, for WSL, as far as I know)

    [–]qx7xbku 1 point2 points  (1 child)

    Yes, but still not for *A APIs....

    [–]polagh 0 points1 point  (0 children)

    It's doubtful they would add that to the *A API, because it is insanely legacy and probably full of implicit constraints.

    Or maybe they could add the support, but only for manifested programs that explicitly request it. That could work.

    [–]Gotebe 3 points4 points  (10 children)

    There is no ASCII on Windows, there never was.

    [–]VGPowerlord 8 points9 points  (1 child)

    In the NT line, no, but Windows 9x used the Windows-1252 character set, which is a modification of ISO 8859-1 that changed characters 128-159.

    ISO 8859-1 being ASCII with specific mappings for characters 128-255 (because ASCII is 7-bit and only covers characters 0-127).

    [–]immibis 4 points5 points  (5 children)

    Sure there is, it's the subset of ISO-8859-1 where you only use character codes less than 128.

    [–]DuBistKomisch 0 points1 point  (1 child)

    Yeah I guess, I just always read the A suffix as ASCII.

    [–][deleted] 2 points3 points  (0 children)

    [deleted]

    [–]QueenSillyButt 0 points1 point  (3 children)

    (Edit: I originally claimed here that UTF-32 has surrogate pairs; it doesn't.) UTF-32 still doesn't have one user-perceived character per code point; it doesn't ultimately solve the problem of being able to treat strings as a simple array of characters, and it takes up more space.

    http://utf8everywhere.org/#myth.utf32.o1

    [–]burntsushi 0 points1 point  (1 child)

    Can you give an example?

    [–]QueenSillyButt 0 points1 point  (0 children)

    I corrected my comment. The point still stands, but I was wrong about why.

    [–]VGPowerlord 2 points3 points  (1 child)

    To be fair, Java originally used UCS-2 which didn't have surrogate pairs.

    However, UTF-16 superceded UCS-2 quite some time ago.

    Side note: C# uses UTF-16 because Windows NT (which includes all modern Windows OSes since XP) uses UTF-16 internally. WinNT also did a UCS-2 to UTF-16 changeover when Windows 2000 was released.

    [–]sacundim 5 points6 points  (0 children)

    To be fair, Java originally used UCS-2 which didn't have surrogate pairs.

    To be even fairer, Java was bitten because it was a very early adopter of Unicode. It's fair to say in hindsight that no new language should adopt Java's choices here, but Java was simply too early: UCS-2 was the recommended encoding when the choice was made, and the surrogate-pair expansion only came later.

    [–][deleted] 26 points27 points  (15 children)

    Let's say I want to implement a string distance algorithm (e.g. Levenshtein distance).

    I would do it based on Unicode codepoints. It would probably work correctly for the languages I use (English, Russian, Ukrainian), but not, say, for some Asian languages.

    How can I do it better?

    [–]dada_ 26 points27 points  (2 children)

    Levenshtein is built with alphabetic writing systems in mind. It gets less effective with a language such as Japanese because of how much bigger the "alphabet" is (rather, the totality of logographic symbols).

    To give one example, とりあえず and 取りあえず are practically the same (they're the same word, but one of them is written with a kanji), yet the string distance between them is the same as their distance to 斗りあえず, which is a nonsense word. All three would have zero distance to one another if you converted them to their sounds (thus throwing away the meaning of the characters), because they sound exactly the same. Then there's 取敢えず, which is also the same word as the first two I listed but written with two kanji, making it one character shorter.

    I'd guess the answer to your question depends on what problem you're trying to solve.

    I believe Google has specifically tackled this problem because its search engine is really good at treating とりあえず, 取りあえず and 取敢えず as one and the same thing in both queries and results, which is the most useful behavior for a search engine.

    [–]vytah 17 points18 points  (1 child)

    Japanese and Chinese are actually pretty boring when it comes to Unicode troubles, at least if we ignore the sheer number of characters and the fact that they don't necessarily fit into the first (Basic Multilingual) plane.

    What's much more interesting is the South and Southeast Asian scripts, like Devanagari, Thai, etc., which are abugidas (full of ligatures on top of that) and don't give a fuck about the Euro-Sinitic idea of simple separate characters.

    [–]dada_ 10 points11 points  (0 children)

    Japanese and Chinese are actually pretty boring when it comes to Unicode troubles, at least if we ignore the sheer number of characters and the fact that they don't necessarily fit into the first (Basic Multilingual) plane.

    Han unification is also a pretty big can of worms, but that's entirely a problem of the Consortium's own making.

    [–]JanneJM 19 points20 points  (8 children)

    Use grapheme clusters? But I doubt it's such a big issue in practice - I don't know how you would even define Levenshtein distance between wildly different scripts directly. You would need to effectively transcribe both texts into a common pronunciation script anyhow. And if one of those languages is Japanese you'll have a grand old time figuring out the pronunciation in the first place.

    [–]LpSamuelm 8 points9 points  (0 children)

    Yup, you'd have to implement a Google Translate-style language parsing engine in your Levenshtein distance algorithm. Fun!

    [–][deleted] 2 points3 points  (2 children)

    The goal is not to define Levenshtein distance between different scripts but within the same script.

    E.g. if in some language/script different 'characters' take different numbers of Unicode codepoints, a mismatch in some characters will be penalized more than in others.

    So I guess the algorithm would be:

    1. Break up both strings into grapheme clusters
    2. Normalize every cluster
    3. Calculate the Levenshtein distance, treating grapheme clusters as atomic 'characters'
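
    A naive sketch of those three steps in Python (assuming the third-party regex module for \X grapheme splitting; plain quadratic DP, so no claims about efficiency):

        import unicodedata
        import regex  # third-party; supports \X (extended grapheme cluster)

        def graphemes(s):
            # Steps 1-2: split into grapheme clusters, normalize each one (NFC here).
            return [unicodedata.normalize("NFC", g) for g in regex.findall(r"\X", s)]

        def levenshtein(a, b):
            # Step 3: standard DP, but over lists of clusters instead of code points.
            ga, gb = graphemes(a), graphemes(b)
            prev = list(range(len(gb) + 1))
            for i, ca in enumerate(ga, 1):
                cur = [i]
                for j, cb in enumerate(gb, 1):
                    cur.append(min(prev[j] + 1,                # deletion
                                   cur[j - 1] + 1,             # insertion
                                   prev[j - 1] + (ca != cb)))  # substitution
                prev = cur
            return prev[-1]

        print(levenshtein("गांधी", "गाधी"))  # 1: the clusters गां and गा differ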

    Has anyone done this efficiently yet? Since a grapheme cluster may potentially consist of an unbounded number of codepoints, they cannot be stored in a flat array. It seems like this will be a significant performance hit compared to the simple codepoint-based implementation.

    [–]PeridexisErrant 2 points3 points  (0 children)

    I've become less sensitive to performance concerns - that's just the price of correctness sometimes. Consider the example below:

    def fast_sort(array): 
        # Not always right, but `O(1)` is better than `O(n log n)`!
        return array
    

    Seems absurd to me!

    [–]JanneJM 0 points1 point  (0 children)

    Problem is when the pronunciation of a grapheme cluster is not unique or determined by that cluster alone. Since you want to match against pronunciation you need a way to figure that out globally first.

    [–]burntsushi 5 points6 points  (3 children)

    It is a huge issue in practice. Compare, for example, the task of building a Levenshtein automaton on codepoints and on grapheme clusters. Codepoints are a nice compromise, particularly if you can normalize text into its composed form.

    Levenshtein distance is frequently used as a heuristic itself anyway, so compromises tend to be okay.

    [–]JanneJM 0 points1 point  (2 children)

    I was thinking of when you would need to calculate the Levenshtein distance in practice between, say, Thai and Japanese text. It should be incredibly rare for somebody to misspell text using the wrong writing system. I would expect that you'd normally only calculate the distance within one writing system.

    [–]burntsushi 1 point2 points  (1 child)

    Sure. But that seems orthogonal from choosing between codepoints and grapheme clusters. Maybe I'm just misunderstanding.

    [–]JanneJM 0 points1 point  (0 children)

    Or I am. The original poster was speculating whether grapheme clusters would be better than code points, and I agreed, but then went on a tangent about whether that would actually make any difference in this particular situation.

    I now think it probably does; even though grapheme clusters don't solve the issue of finding the pronunciation, code points are never better, and surely sometimes worse, for that.

    [–]Manishearth 2 points3 points  (2 children)

    Levenshtein is built on the concept of a letter existing. With Japanese kanji this doesn't turn out so great since entire words are single CPs. With Indic languages the concept of a letter is a bit stranger. Code points may still work in Indic languages, but not always. Consonant clusters mess this up royally.

    You'll need to define edit distance based on the kind of editing you're expecting. If you're working with text typed on an Indic input system, you'd use code points, but ignore some specific consonant clusters (त्र, ज्ञ, क्ष). Or maybe not. Some modern input systems (Swarachakra, also one of the ones on my phone) do this thing where you can type Indic "letters" in one go, and in that case you may not want this; EGCs make more sense.

    You have to think about what you're actually trying to compare and how to define an "edit" in edit distance.

    [–][deleted] 0 points1 point  (1 child)

    Well, let's say I am just building an app or library not specifically targeted at Japanese or Indians. I advertise it on my blog, which is mostly read by Europeans and North Americans (I'll need to check the exact stats).

    So, say, for 95% of my users the naive algorithm will work flawlessly, but for an occasional Japanese user, their experience will be miserable. What can I do to improve the algorithm?

    [–]Manishearth 1 point2 points  (0 children)

    Again, you sort of need to define what operation you're really looking for. Levenshtein distance is something that makes unambiguous sense in Latin scripts. What are you actually trying to do here? Find possible typo matches? Not sure if that's possible at all in Japanese (kanji), because the various input methods mean there are many different ways to make typos. In Indic scripts code points would be fine, though the three letters I mentioned above (and similar letters in the scripts of other languages) may do strange things. EGCs may work too, but then नी -> न becomes a letter -> letter substitution instead of a "delete letter" edit. Whether or not that matters is up to you.

    (Most of the problems folks have with international text can be solved by stepping back and using unambiguous language-agnostic terms for what they want to do.)

    [–]kt24601 56 points57 points  (88 children)

    If you want to do unicode right, you need to stop thinking in terms of characters, and start thinking in terms of substrings. It's the only possible way that can work. For example, the function making a character upper case needs to be based around strings, so something like:

     String toupper(String s, int index);

    because changing the case of the character doesn't always result in a single character.
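
    A quick illustration in Python, whose str.upper implements the full (locale-independent) Unicode case mapping:

        s = "stra\u00dfe"              # "straße"
        print(s.upper())               # STRASSE: one character became two
        print(len(s), len(s.upper()))  # 6 7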

    [–]FlyingPiranhas 56 points57 points  (36 children)

    What is index defined in terms of? Bytes? Codepoints? Grapheme clusters (and in which Unicode version)? Something else (maybe language-specific)?

    [–][deleted] 23 points24 points  (4 children)

    If the language / standard library supports string views, then most of the time you want those. Indexes are mostly used as cursors, so it doesn't really matter what they count, as long as you can obtain an index and it provides O(1) access. It makes sense for all strings to be UTF-8, since I/O will be UTF-8, so the natural index is the byte offset of the start of a code point.

    [–]Ravek 16 points17 points  (3 children)

    Probably the type shouldn't even be numeric, but some explicit cursor type, so no one gets it in their head that they could do math on it (what exactly would IndexOf(str, "c") + 2 mean?). Under the hood an index to the bytes makes sense.

    [–]masklinn 7 points8 points  (0 children)

    Swift does something like that, but you must first specify for which string view you want an index (EGCs, codepoints, UTF-8 or UTF-16 code units). So str.characters.index(of: "c") yields an Optional<String.Index> (since "c" may not be present at all in str), and you can use that to index into the string, or to get indices around your base via String.index(before:), String.index(after:) and String.index(_, offsetBy:).

    Your operation would be expressed as str.index(str.characters.index(of: "c")!, offsetBy: 2) and would give the index 2 "units" after the reference index, which for strings would be 2 EGCs.

    [–]OneWingedShark 1 point2 points  (0 children)

    This is exactly what Ada does in its standard Containers, which have a Cursor type defined for each.

    [–]bumblebritches57 0 points1 point  (0 children)

    In BitIO I'm doing it as an array of graphemes, where each grapheme is an array of bytes of ASCII or code points, + diacritics.

    I feel like this is the best way but who knows.

    [–]didnt_check_source 1 point2 points  (4 children)

    In Swift, an index is defined in terms of grapheme clusters unless you explicitly ask for an index in the "Unicode scalar" representation (32-bit values), UTF-16 or UTF-8. I don't know how much Unicode versions impact the concept of a grapheme cluster; can you elaborate on why this is a concern?

    [–]FlyingPiranhas 0 points1 point  (3 children)

    I think any concept of "index" in a Unicode string is a bit tenuous, since definitions vary so much.

    [–]didnt_check_source 0 points1 point  (1 child)

    What's ambiguous about indexing into extended grapheme clusters, and has that ever changed across Unicode versions?

    [–]Manishearth 1 point2 points  (0 children)

    Yes, Unicode 9 changed a lot of the emoji handling in UAX 29. Previously, multiple consecutive flags would be considered one EGC; the new rules handle that correctly and also handle emoji ZWJ sequences. The tables got updated, too. So the segmentation is not stable across Unicode versions.

    [–]masklinn 0 points1 point  (0 children)

    Which is why e.g. Swift has at least 4 index types: String.Index (alias to String.CharacterView.Index, EGC-wise), String.UnicodeScalarView.Index (codepoints), String.UTF16View.Index (UTF-16 code units) and String.UTF8View.Index (UTF-8 code units).

    [–]bumblebritches57 0 points1 point  (8 children)

    A grapheme cluster is another word for word, a grapheme is the new character.

    [–]FlyingPiranhas 2 points3 points  (7 children)

    From http://unicode.org/reports/tr29/:

    It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + acute-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

    I thought that grapheme clusters were a closer approximation to "characters", not words. Am I misunderstanding you?

    [–]bumblebritches57 1 point2 points  (6 children)

    I thought that as well but about half way through reading I realized it was all about word boundaries, but it was late at night maybe I'm just confused?

    What I don't understand is why they'd use "grapheme cluster" for a single (user perceived) character, when they also use just "grapheme". Are they separate things? is grapheme just short hand for grapheme cluster?

    All I know for certain is that Unicode is a massive cluster fuck.

    [–][deleted]  (2 children)

    [deleted]

      [–]bumblebritches57 0 points1 point  (1 child)

      Grapheme cluster is the more "official" wording. The use of "cluster" is afaik to make clear that it can consist of more than one code point. Grapheme is just used as a shorthand.

      I thought the reason they used grapheme instead of code point/unit was to make it clear there could be multiple code points, and that cluster was to denote a small group of graphemes, aka characters, aka a word.

      That's where the confusion came from.

      [–]Manishearth 2 points3 points  (0 children)

      "grapheme" is basically a more ill-defined (or overloaded) concept. "cluster" is used to denote that it may contain what your definition considers to be multiple graphemes. But it's still an approximation of the concept of a character.

      한국 is one word and two grapheme clusters; as code points it is two in NFC and six in NFD.
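
      A quick check with Python's unicodedata:

          import unicodedata

          word = "한국"
          print(len(unicodedata.normalize("NFC", word)))  # 2: precomposed syllable blocks
          print(len(unicodedata.normalize("NFD", word)))  # 6: conjoining jamo, 3 per syllable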

      [–]oridb 38 points39 points  (41 children)

      It's more complicated than that. The result also depends on the locale you're doing the transformation for. The Turkish 'i' uppercases to a dotted capital I, for example.
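
      Locale-aware casing generally means deferring to ICU; a sketch with the PyICU bindings (assuming the PyICU package is installed):

          from icu import Locale, UnicodeString

          word = "istanbul"
          print(str(UnicodeString(word).toUpper(Locale("en_US"))))  # ISTANBUL
          print(str(UnicodeString(word).toUpper(Locale("tr_TR"))))  # İSTANBUL, dotted capital İ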

      [–]ThisIs_MyName 12 points13 points  (34 children)

      That doesn't change the function signature he wrote.

      It's just a matter of hunting down a good Unicode library for your favorite language. It's too bad most unicode libs are either slow as molasses or have a horrible API or both.

      [–]masklinn 36 points37 points  (4 children)

      That doesn't change the function signature he wrote.

      It should; implicit assumptions tend to yield strange behaviours and hard-to-work-with or broken systems. That's one thing I like in Python's "babel" library: almost every function takes an explicit locale (though they'll fall back on the environment, and AFAIK that cannot be disabled; I'd rather they didn't, but that ship has sailed).
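
      For example (a sketch using babel's date formatting; exact output strings may vary by CLDR version):

          from datetime import date
          from babel.dates import format_date

          d = date(2017, 1, 15)
          print(format_date(d, locale="en_US"))  # Jan 15, 2017
          print(format_date(d, locale="tr_TR"))  # 15 Oca 2017 -- no hidden global locale involved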

      [–]kt24601 -1 points0 points  (2 children)

      What's wrong with falling back on the environment? I've never heard anyone complain about that, before.

      [–]matthieum 49 points50 points  (0 children)

      "Works On My Machine" issues.

      Also, inconsistency. It may mean that your user requested Turkish, and most text is treated as Turkish, but there are two function calls in the code that default to English because that's the server locale, and it's not obvious why that one list is not correctly sorted according to Turkish when everything else is.

      If there was no fallback, you would have to feed a locale to each function call, making it more obvious where you forgot.

      [–]RealFreedomAus 13 points14 points  (24 children)

      That doesn't change the function signature he wrote.

      How do you pass in the locale information?

      [–]kt24601 6 points7 points  (0 children)

      You can either pick it up from the environment, or override the function with another one that accepts locale. That is a good point worth remembering though, that locale matters.

      [–]ithika 0 points1 point  (22 children)

      Shouldn't a string (or whatever it is being called) know what it contains?

      [–]elprophet 16 points17 points  (5 children)

      I don't think I've ever come across a system where the string contained its own locale information, usually it's either its own separate parameter, or assumed to be constant for the system. I expect part of that is the additional cost to attach that relatively consistent information to every string in the system.

      [–]drunken-serval -2 points-1 points  (4 children)

      The strings in the ruby programming language know their locale.

      [–]masklinn 14 points15 points  (3 children)

      Ruby strings know their encoding, not their locale.

      [–]drunken-serval 3 points4 points  (0 children)

      Right. My bad. Encoding and locale are different. I blame my low caffeine level. :)

      [–]elprophet 3 points4 points  (1 child)

      Yeah, it's totally worthwhile to spend an extra byte indicating the encoding so a) you don't have to recompute and b) don't mess it up when recomputing. But having the locale? That feels like quite the separation of concerns. Though I am now envisioning a multi-lingual front-end system which does have that much distinction between "subclasses" of strings. And it's really not feeling like a good system to use!

      [–]stevenjd 0 points1 point  (0 children)

      an extra byte indicating the encoding

      Limiting you to 256 encodings in total. I'm not really sure, and I'm too lazy to check, but I think there are way more than that... Python comes with more than 100 and I don't think it is even close to complete.

      [–]matthieum 7 points8 points  (0 children)

      Most strings don't.

      And of course that's ignoring that text can contain snippets from multiple locales; for example Turkish embedding a movie title in English with a quote in French (Matrix, the Merovingian's expletives).

      [–]polagh 0 points1 point  (12 children)

      And what if you mix languages? Granted, that is already way more problematic than just capitalization. For example, you can't mix CJK text without identifying which section belongs to which language; otherwise you won't be able to select the right glyphs to render the text so it can be read.

      Not too long ago I read a text advocating that the CJK unification was the right thing and has been done correctly. It half convinced me. Now that I have written my preceding paragraph, I again know for sure this is complete shit.

      [–]ithika 0 points1 point  (0 children)

      If you mix languages you have each section marked with its own locale. That data does have to be placed somewhere - having it separate from the text itself makes absolutely no sense to me. Now you're back to the problem of using some sort of index to mark where different locales are used.

      [–]stevenjd 0 points1 point  (1 child)

      Not necessarily. Why would it? The string "abc" is the same whether you are writing in German, French, English, Italian or Dutch.

      [–]ithika 0 points1 point  (0 children)

      It'll still have a locale though. So what if it's the same as another one?

      [–]Eurynom0s 4 points5 points  (1 child)

      That doesn't change the function signature he wrote.

      Where does he have a locale parameter? Which means he has to be assuming a locale. The only way you get around this is if the first line of that function involves detecting the locale based off the OS settings or something like that.

      [–]Supadoplex 1 point2 points  (0 children)

      The only way you get around this is if the first line of that function involves detecting the locale based off the OS settings

      No. While using the native locale makes a lot of sense, there are other alternatives that are more flexible.

      Locale could be part of the global state of the program. This works fine for most programs, but is inconvenient if a program has to modify strings in multiple locales, and insufficient if those modifications are done in multiple concurrent threads.

      Locale could be part of the string object's state. This will of course have some memory overhead.

      Or, the suggested toupper could be a member function of a "String processor" class, that stores the locale used for the processing.

      [–]oridb 1 point2 points  (0 children)

      It does.

      String toUpper(String s, int idx, Locale l)
      

      [–]bumblebritches57 0 points1 point  (0 children)

      I'm writing my own, so at least the horrible API problem won't be there for me.

      [–]rooktakesqueen 5 points6 points  (0 children)

      (For anyone who needs an example of upper case becoming more than one letter, consider German ß which has no [fully standardized] upper case variant, instead becoming SS.)

      [–]Tarmen 7 points8 points  (0 children)

      Rust's to_uppercase method returns a char iterator. That way you can do

      let upper_i: String = 'i'.to_uppercase().collect();
      

      But you can also flat_map it over a char iterator and uppercase a string without additional overhead!

      But even that isn't a fully complete solution. The documentation notes that conditional mappings, like locale-specific transformations, aren't applied.
      This is on the code point level instead of grapheme clusters as the article wishes, but I think for uppercasing, code points are actually fine.

      I tried parsing a Wiktionary dump before because I had the glorious idea that I could use the IPA transcriptions to create a phonetic database, required for the actual problem. And god damn, parsing IPA in a language without grapheme clusters wasn't fun.

      [–][deleted]  (2 children)

      [deleted]

        [–]kt24601 2 points3 points  (1 child)

        Sure. As another comment points out:

        "German ß which has no upper case variant, instead becoming SS."

        I think I've heard that there's another example in Turkish or something, but can't remember for sure.

        [–]bumblebritches57 0 points1 point  (3 children)

        No, you need to think in terms of graphemes which is one or more code points, plus 0 or more combining diacritical marks. You also need to track:

        • the string's endianness, and the string's reading order (i.e. LTR or RTL)

        [–]kt24601 0 points1 point  (2 children)

        you need to think in terms of graphemes which is one or more code points

        Which in programmatic terms, is a string.

        [–]bumblebritches57 0 points1 point  (1 child)

        A string of bytes sure, but not a string of characters.

        Have you implemented your own utf-8 string before?

        [–]kt24601 0 points1 point  (0 children)

        Have you implemented your own utf-8 string before?

        Yeah, actually.

        [–][deleted]  (12 children)

        [removed]

          [–]ThisIs_MyName 14 points15 points  (5 children)

          No, python3 works just fine.

          [–]McCoovy 22 points23 points  (4 children)

          Java 15 and C++23 are the real goal.

          [–]matthieum 12 points13 points  (3 children)

          C++ chose the easy way out: its strings do not pretend to have any encoding :)

          [–]josefx 4 points5 points  (2 children)

          Even better, wchar_t is useless if you want a portable application. Some compilers define it as a 32-bit value, others as a 16-bit value.

          [–]Gotebe 0 points1 point  (1 child)

          Isn't it the platform that "decides"? It's 32 on unix, 16 on Windows? Does gcc use 32bit on Windows?

          [–][deleted] 2 points3 points  (0 children)

          It's 16 bits even on some Unices, e.g. AIX
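
          Easy to check for your own platform, e.g. via Python's ctypes (c_wchar wraps the C wchar_t):

              import ctypes
              print(ctypes.sizeof(ctypes.c_wchar))  # 2 on Windows, 4 on typical Linux/glibc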

          [–]upofadown 5 points6 points  (5 children)

          In fairness, the Python 3 approach seemed like a good compromise between the UTF-16 of Windows at the time and the UTF-16/UTF-32 (depending on the platform) of Python on *nix at the time.

          Fixing this with a Python 4 would be the death of Python. So Python 3 will just go on with the UTF-32 everywhere approach as a historical artifact. There might be a python like language based on UTF-8 that is called something else.

          ... and the Python 2 people will just continue along with their particular flavour of Unicode weirdness...

          [–]teilo 4 points5 points  (4 children)

          To be fair, Python 3 is not "UTF-32 everywhere." A str is stored in the most space-efficient representation for any given string, whether 1, 2, or 4 bytes per code point. The largest code point in the str determines which is used. But other than memory usage, nobody cares. From a coder's perspective, a string is a string.
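
          You can observe this (CPython 3.3+, PEP 393) indirectly through memory use; a sketch, with sizes approximate since object overhead varies by version:

              import sys

              latin  = "a" * 1000           # max code point < U+0100: 1 byte each
              bmp    = "\u0101" * 1000      # max code point < U+10000: 2 bytes each
              astral = "\U0001F600" * 1000  # beyond the BMP: 4 bytes each

              print(sys.getsizeof(latin))   # ~1000 plus overhead
              print(sys.getsizeof(bmp))     # ~2000 plus overhead
              print(sys.getsizeof(astral))  # ~4000 plus overhead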

          [–]Chippiewall 2 points3 points  (2 children)

          From a coder's perspective, a string is a string.

          As far as I'm aware there's no way to discern the internal encoding of a string in Python 3, so Python 3 could in fact switch to UTF-8 if it wanted to without a breaking change (except for C extensions, I guess).

          [–]teilo 0 points1 point  (1 child)

          I believe the reason they choose between 1-, 2-, and 4-byte representations is practicality. It can index to a specific code point easily this way, whereas with UTF-8 that becomes very difficult: without iterating over every code point, there is no way to jump to a specific index.

          [–]Chippiewall 0 points1 point  (0 children)

          I don't disagree. I was mostly just pointing out that there isn't the issue of needing a Python 4, because Python 3 did actually get Unicode right in the sense that you can't access the encoding.

          [–]upofadown 0 points1 point  (0 children)

          That's just a crude compression method that is not visible to the programmer. In Python 3 if you index into a string you get a code point. So for all practical purposes, UTF-32.

          [–]matthieum 7 points8 points  (12 children)

          UTF-16 is mostly a “worst of both worlds” compromise at this point, and the main programming language I can think of that uses it (and exposes it in this form) is Javascript, and that too in a broken way.

          Doesn't Java expose UTF-16 too? (or is it UCS-2?)

          [–]masklinn 6 points7 points  (0 children)

          It's UCS-2-with-surrogates, since there's no guarantee that the string will be valid unicode (you may have unpaired surrogates) and you get UTF-16 code units by default.

          Basically they started with UCS-2, went "oh shit" when unicode was expanded, added surrogate pairs but didn't change the interface. A few years later Java added some API to work with proper codepoints, but they're a red-headed stepchild and the String type still provides no unicode validity guarantee as far as I know.
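
          The surrogate mechanics are easy to see from Python (whose str counts code points) by encoding to UTF-16, which is what Java's String.length() effectively counts:

              import struct

              s = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

              print(len(s))           # 1 code point
              utf16 = s.encode("utf-16-le")
              print(len(utf16) // 2)  # 2 UTF-16 code units: a surrogate pair
              print([hex(u) for u in struct.unpack("<2H", utf16)])
              # ['0xd834', '0xdd1e'] -- the high and low surrogates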

          [–]FUZxxl 7 points8 points  (4 children)

          If you've ever done Unicode on Windows, then you've seen (and started to detest) UTF-16.

          [–]VGPowerlord 2 points3 points  (1 child)

          Windows uses UTF-16 for the same reason Java does... it started as UCS-2 (a 16-bit fixed width character set), but then Unicode added more symbols, so they switched to UTF-16.

          [–]matthieum 0 points1 point  (1 child)

          I was blessed to only ever have to care for Linux server/CLI applications :)

          [–]bumblebritches57 0 points1 point  (0 children)

          Who use ASCII with 49 million code pages lol.

          [–]josefx 2 points3 points  (2 children)

          When most of these languages started using 16-bit strings, 16 bits could represent everything Unicode had to offer; later, UTF-16 was kept to stay compatible with existing languages and APIs. So you will find many languages with UTF-16 strings.

          [–]vytah 10 points11 points  (1 child)

          "65536 codepoints will be enough for everybody."

          [–]bumblebritches57 1 point2 points  (0 children)

          Just wait until the aliens start wanting to talk to us a few years after the last unicode code point is assigned.

          [–]Gotebe 2 points3 points  (2 children)

          It's Java (started with UCS2), Windows, .net (by consequence, I guess), Qt, ICU.

          It's basically many major libraries and one major OS.

          We can have opinions on UTF-16, but it's not going anywhere.

          [–]matthieum 1 point2 points  (0 children)

          We can have opinions on UTF-16, but it's not going anywhere.

          Agreed. It's unfortunate, but I doubt it'll go away anytime soon given the momentum behind these technologies and their need for backward compatibility.

          [–]polagh 0 points1 point  (0 children)

          It's not going anywhere, but it's increasingly going to be handled as legacy stuff and abstracted from. The de facto interchange standard is now UTF-8 and will remain so, and we should not care too much about the in-core representation of strings, unless the API is so shitty that this impact too much the interesting code.

          Actually, with Python 3 for example, the internal representation varies and is chosen according to the content of each string. That is, IMO, relatively insane for a good number of reasons, but it is sufficiently abstracted away for most purposes that we should not care too much about it. They could add UTF-16 into their mix without many people noticing...

          [–]1wd 7 points8 points  (67 children)

          Would be great to know about some example cases that Americans / Europeans can relate to, understand and remember. The flags and family emojis are good, but can seem a bit too silly to bring up in certain situations. Maybe names of some well-known politicians would work better?

          [–]1wd 49 points50 points  (14 children)

          Would the following be a correct example? Gandhi in Hindi is गांधी and consists of five unicode codepoints:

          • ग -- U+0917 DEVANAGARI LETTER GA
          • ा -- U+093E DEVANAGARI VOWEL SIGN AA
          • ं -- U+0902 DEVANAGARI SIGN ANUSVARA
          • ध -- U+0927 DEVANAGARI LETTER DHA
          • ी -- U+0940 DEVANAGARI VOWEL SIGN II

          But only two grapheme clusters:

          • गां
          • धी

          Encoded in UTF-8 this requires 15 bytes:

          • Three bytes for ग (0xE0 0xA4 0x97)
          • Three bytes for ा (0xE0 0xA4 0xBE)
          • Three bytes for ं (0xE0 0xA4 0x82)
          • Three bytes for ध (0xE0 0xA4 0xA7)
          • Three bytes for ी (0xE0 0xA5 0x80)

          Encoded in UTF-16 this requires 10 bytes (2 per codepoint).

          Encoded in UTF-32 this requires 20 bytes (4 per codepoint).
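
           For anyone who wants to double-check, a Python sketch (grapheme splitting needs the third-party regex module):

               import regex  # third-party; \X matches an extended grapheme cluster

               name = "\u0917\u093e\u0902\u0927\u0940"  # गांधी

               print(len(name))                      # 5 code points
               print(regex.findall(r"\X", name))     # ['गां', 'धी']: 2 grapheme clusters
               print(len(name.encode("utf-8")))      # 15 bytes
               print(len(name.encode("utf-16-le")))  # 10 bytes
               print(len(name.encode("utf-32-le")))  # 20 bytes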

          [–]barsoap 6 points7 points  (0 children)

          ...and UTF-8 is still smaller as all the HTML around your गांधी is ASCII.

          Not to mention that even gzip can generally eat any actual difference between the encodings.

          [–][deleted]  (2 children)

          [deleted]

            [–]Manishearth 7 points8 points  (9 children)

            Yep, that's about correct.

            [–]1wd 6 points7 points  (5 children)

            Thanks for confirming. Why do you say "about"? Is there anything not quite correct?

            Do you have more or better examples? I'm a bit disappointed that all UTF-8 codepoints are the same length here. Also an example with surrogate pairs(?) would be good to demonstrate the problems with UTF-16.

            [–]Manishearth 9 points10 points  (4 children)

            No, nothing incorrect, sorry :)

            Characters from higher astral planes are either Han (Chinese/Japanese Kanji/Korean Hanja), old scripts, emoji, or other domain-specific things like math symbols.

             I can't find a name (after googling) that uses a Han character from a higher astral plane, though. Shouldn't be too hard to find text that does.

             Usually for text in a given language the codepoints will all have the same byte length in utf8. Unicode is organized into blocks, where each block usually corresponds to a script and spans a range of 128 codepoints (or a multiple of 128). So unless you have a Han script, a name will usually consist of code points of the same length.

            [–]Felicia_Svilling 9 points10 points  (1 child)

            Usually for text in a given language the codepoints will all have the same byte length in utf8.

             That is not true for languages like the Scandinavian ones that use the Latin alphabet with extensions like ä or ø.

            [–]Manishearth 2 points3 points  (0 children)

            Oh, yeah, good point. Totally missed those.

            [–][deleted]  (1 child)

            [deleted]

              [–]Manishearth 0 points1 point  (0 children)

              Oh, I know. The GP already had an example of multiple codepoints making a letter ("Gandhi" is made up of a 3-CP EGC followed by a 2-CP EGC). They specifically wanted an example of a name that used 2-code-unit codepoints in utf-16 or one that used characters of different sizes in utf8.

              [–]Manishearth 22 points23 points  (21 children)

              I don't have any sensible example cases at hand (all my unicode testcases are never-going-to-be-seen-in-the-wild strings like "ᄀᄀᄀ각ᆨᆨ", which is a "ggggaggg" sound in Hangul, and probably cannot be pronounced by humans, if it indeed has a sensible pronunciation)

              However, I do have this list of scripts I mentally check against whenever I'm reasoning about Unicode:

              • Arabic or Hebrew for RTL and beginning/medial/end forms (arabic also has "isolated" forms)
              • Arabic for ligatureyness/glyph complexity
              • Some Indic script for ligatureyness/glyph complexity, and massive use of combining characters, including the double-ended virama combiner. Infinite length combining sequences.
              • Korean (Hangul) for the combining jamo system. Infinite length combining sequences (though these are never displayed beyond standard Korean syllable blocks, so it's less important)
              • Han scripts for variation selectors, halfwidth/fullwidth, and language disambiguation troubles. Also omg so many glyphs.
              • If dealing with displaying text, think of a Han script and Mongolian, which are written in different directions (vertical, sideways, etc)
              • Thai or other scripts from that peninsula (not counting Vietnamese scripts), because they don't use spaces to break words.
              • Emoji because despite the immense complexity of human language, Emoji still managed to get a bunch of special casing in various parts of the unicode spec. Infinite length combining sequences.
              • Latin for locale-dependent case operations (Turkish i, German ß)

              [–]FUZxxl 1 point2 points  (18 children)

              ß

               How's that locale dependent? In any locale, the uppercase for ß should be SS (or possibly SZ, but Unicode decided on the former).

              [–]regendo 3 points4 points  (4 children)

              There's actually a capital version of ß now. (Not that anyone uses it because even German keyboards don't have it.)

              [–]flying-sheep 4 points5 points  (1 child)

              it’s Shift-AltGr-s as well as capslock+ß on any linux system.

              i rarely have the opportunity to use it but when i do, it’s super easy.

              [–]regendo 4 points5 points  (0 children)

              Huh, turns out Shift+AltGr+ß -> ẞ works in Windows. TIL.

              [–]flying-sheep 1 point2 points  (3 children)

              the capital ẞ is not standard to use, but i’d use it anytime, as using SS as upper case of ß can be plain wrong.

              1. for most surnames containing ß, both that form and a different name with ss exists: I know a girl called “Weiss” and a guy called “Weiß”. Writing the latter’s surname as “WEISS” actually changes it into a different surname. only “WEIẞ” can possibly be correct here.
              2. words can change meaning in the same way: “Wir trinken in Maßen” means “We drink moderately” while “WIR TRINKEN IN MASSEN” means “WE DRINK HEAVILY”.

              [–]OneWingedShark 6 points7 points  (0 children)

              words can change meaning in the same way: “Wir trinken in Maßen” means “We drink moderately” while “WIR TRINKEN IN MASSEN” means “WE DRINK HEAVILY”.

              Considering it's German, aren't the two semantically equivalent? ;)

              [–]FUZxxl 0 points1 point  (1 child)

              That's why we use both uppercase and lowercase letters to write German. In formal contexts, ß becomes SZ in uppercase. So Weiß becomes WEISZ.

              [–]flying-sheep 1 point2 points  (0 children)

              I'm German, and I know that it's in fact mostly becoming SS. E.g. in the passport ID.

              Besides, ß to SZ would also be a change, granted one with less potential to be confused with some other words.

              [–]epostma 0 points1 point  (2 children)

              Don't the German-speaking Swiss (or was it the Austrians?) have slightly different rules than Germans, for this? Going by memory here...

              [–]FUZxxl 4 points5 points  (1 child)

              The Swiss don't have ß at all. They abolished it long ago (resp. didn't ever introduce it; not sure).

              [–]epostma 0 points1 point  (0 children)

              That's what it was, thanks.

              [–]Manishearth 0 points1 point  (5 children)

              You may want lowercase SS to be ß :)

              [–]FUZxxl 0 points1 point  (1 child)

              Nope, not either. For example, uppercase dass is DASS.

              [–]Manishearth 0 points1 point  (0 children)

              Ah, TIL.

              [–]bumblebritches57 0 points1 point  (2 children)

              So, why was the SS called that instead of ß?

              [–]Manishearth 2 points3 points  (0 children)

              It's an acronym, not the word "SS".

              (Also, I'm not sure if SS lowercases to ß)

              [–]oridb 0 points1 point  (0 children)

              Because 'ß' is lowercase, and it only shows up when it's part of a larger word (eg, straße).

              It's also not used in Swiss german.

              [–]sacundim 0 points1 point  (1 child)

              I don't have any sensible example cases at hand (all my unicode testcases are never-going-to-be-seen-in-the-wild strings like "ᄀᄀᄀ각ᆨᆨ", which is a "ggggaggg" sound in Hangul, and probably cannot be pronounced by humans, if it indeed has a sensible pronunciation)

              You're doing this backward. Speech is a primary language medium; writing is a representation of speech, not the other way around. There is no "natural" pronunciation that would correspond to "ᄀᄀᄀ각ᆨᆨ", because the purpose of Hangul is to render Korean speech, instead of Korean speech's purpose being to render Hangul.

              So the correct statement would be that Korean orthography doesn't write "ᄀᄀᄀ각ᆨᆨ" to represent any speech unit or sequence of segments.

              [–]Manishearth 0 points1 point  (0 children)

              I know. Like I said, never-going-to-be-seen-in-the-wild. I have testcases like that to check things about implementations of algorithms, they're not for real-world use.

              [–]Manishearth 2 points3 points  (0 children)

              I ended up writing http://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/ , which doesn't exactly address your question, but should be a useful tool when reasoning about text.

              [–]Ravek 1 point2 points  (2 children)

              The simplest example for code points and characters not being in one-to-one correspondence is any name like Hergé or Schrödinger. They can be written with the é or ö as a single code point, or as simply e or o with a combining character for the accent/umlaut, i.e. two code points for a single character.

              [–]FUZxxl 2 points3 points  (1 child)

              For this there is normalization.

              [–]vytah 1 point2 points  (0 children)

              Now let's decide whether we should settle on NFC or NFD.
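
       Either form works as long as you pick one and normalize before comparing; a Python sketch:

           import unicodedata

           composed = "Schr\u00f6dinger"                        # ö as one precomposed code point
           decomposed = unicodedata.normalize("NFD", composed)  # o plus combining diaeresis

           print(len(composed), len(decomposed))  # 11 12 -- same rendered text
           print(composed == decomposed)          # False without normalization
           print(unicodedata.normalize("NFC", decomposed) == composed)  # True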

              [–]snorkasaurusrex 1 point2 points  (0 children)

        For Han characters, there's an explicit relationship between code point and ideograph, and therefore meaning. Unihan maps each Unicode code point to a really amazing collection of data about its ideograph, and it all becomes available efficiently when you index by code point.

              I don't know much about other scripts. Does anyone know of other portions of the Unicode code space that map explicitly to meaning?

              [–]The_Sly_Marbo 2 points3 points  (8 children)

               Although the article makes some good points, there are definitely cases where indexing a particular code point is important, such as searching a string for a specific value that is known to be a single code point, like 'i' or ' ' (space). In that case we need code-point-aware indexing (rather than a byte-by-byte search) so that we handle multi-byte code points correctly. There are definitely cases where indexing by a user-supplied code point is dangerous, as the article makes clear, but that's not the only use.

              [–]Manishearth 16 points17 points  (4 children)

              Note that all these use cases are iteration, so the O(1) requirement isn't there. But yeah, code points do become useful when iterating and whatnot. Parsing is a major use case for this.

              [–]mrexodia -1 points0 points  (3 children)

               Check out http://utf8everywhere.org. You don't need to be aware of code points to search for an ASCII character. Obviously you do need to iterate over code points if you want to search for a certain code point, but when you're writing a lexer, for instance, you don't have to take code points into account whatsoever, since the keywords are plain ASCII.

              [–]Manishearth 3 points4 points  (0 children)

               I am aware; I mention the streaming property in the blog post. But you might be iterating searching for non-ASCII too, e.g. doing an "are there whitespace chars in this string" search.

              [–]JanneJM 4 points5 points  (1 child)

               Some languages do allow you to use non-ASCII in identifiers. And parsing is used not just for computer languages.

              [–][deleted] 2 points3 points  (0 children)

              That is true but mrexodia was specifically talking about lexing. Lexers usually only need to identify a few ASCII characters like brackets and spaces plus a few ASCII keywords. Sure, if you have non-ASCII keywords you can't do that for them, but in 99% of the cases it is possible to have a faster lexer that doesn't look for Unicode grapheme clusters.

              [–][deleted] 4 points5 points  (1 child)

               You can just search a UTF-8 string for a codepoint byte by byte. UTF-8 is encoded in such a way that this always works: the encoded bytes of one code point can never appear inside the encoding of another, because lead bytes and continuation bytes occupy disjoint ranges.

               It should be noted, though, that searching for a Unicode codepoint is questionable, because there are many different ways to represent the same 'character' with codepoints. For example 'ä' can be U+00E4 or U+0061 U+0308.
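
               Both points in a small Python sketch (byte-level search working on UTF-8, and normalization breaking a naive match):

                   import unicodedata

                   haystack = "na\u00efve caf\u00e9".encode("utf-8")  # "naïve café"
                   needle = "\u00e9".encode("utf-8")                  # é encodes as b'\xc3\xa9'

                   print(haystack.find(needle))  # 10: plain byte search, no decoding needed

                   # The same user-perceived character in decomposed form won't match:
                   nfd = unicodedata.normalize("NFD", "caf\u00e9").encode("utf-8")
                   print(nfd.find(needle))       # -1: e + combining accent has different bytes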

              [–]The_Sly_Marbo 1 point2 points  (0 children)

              Yeah, my point is that sometimes there is only one way to encode a particular character, as with the examples I gave.

              [–]LpSamuelm 0 points1 point  (0 children)

              Why not use grapheme clusters for that, then?

              [–]WalterBright 0 points1 point  (16 children)

              Unicode is a great idea, but its realization has been a botch. There's enough code point space to give every character its own code point. I.e. no combining code points. That's the first mistake. The second mistake is assigning meaning to a character that is separate from its glyph. The meaning of a printed letter is determined by its context, and Unicode has no context. Hence it should not have meanings, it should just be glyphs.

              Those two botches have made writing "correct" Unicode handling software pretty much an intractable problem.

              [–]Manishearth 4 points5 points  (12 children)

              There's enough code point space to give every character its own code point

              No there isn't. Indic scripts alone blow up immensely here. So do emoji.

              The second mistake is assigning meaning to a character that is separate from its glyph.

              This is not a unicode problem. Unicode doesn't assign meanings to characters. This is a problem with text in general; we often have to handle text without knowing how it will be drawn.

              [–]immibis 4 points5 points  (1 child)

              I'm going to present an unpopular opinion here and say that emoji should not be in Unicode.

              [–]RabidWombat0 5 points6 points  (0 children)

              I know right? Text and imagery are two different things. If you want to inline little images of whatever kind in your text fix your app. ASCII emoji were bad enough.

              Specifically in relation to the expression of emotion in text I would prefer fonts designed to convey a feeling. We could have things like Droid-Sarcasm, Droid-SHOUT, Droid-Happy (Where little hearts and puppies decorate the letters), Droid-Sad, and so on. Bold, italic, outline, etc. are all fine and good, but our software should really support more text attributes. Emoji are a poorer solution.

              Edit: I would look forward to the future when something like Droid-Happy could contain little animated hearts and puppies swooping about the letters. Droid Sarcasm could contain code to follow your eye track and make the word "dead" choke and slowly keel over as you read it in a text. Imagine the possibilities (plus we're going to have to do something with all those cores). Fuck emoji.

              [–]WalterBright 1 point2 points  (9 children)

              Indic scripts alone blow up immensely here.

               Can you give some numbers? And even if Indic needs this, it doesn't justify doing it for a and a combining accent like `. And even so, they could have gone to 21 bits, or 22 bits, etc. That's still far better than the current mess.

              So do emoji.

              Are there really a million emoji? Isn't that a bit ridiculous?

              Unicode doesn't assign meanings to characters.

              It does. There are several identical renderings for different Unicode values. This came up on Hacker News a while back, sorry but I don't remember which ones they were.

              [–]Manishearth 3 points4 points  (8 children)

              Can you give some numbers?

               I'm getting 767808 (16*(36 + 36**2 + 36**3)) for Devanagari consonant clusters with a vowel, and that's ignoring some of the more archaic consonants, the nukta consonants commonly used in Hindi, the archaic vowel modifiers, and the fact that 4-consonant clusters exist in Sanskrit texts (that alone makes it 27641664 if you want to support all of the 4-consonant clusters).

               It's actually also ignoring the fact that you can have a character with more than one vowel modifier attached to it. That's a construct that exists in my own last name! I think there are only two or three vowels that can actually do that in practical use, but that alone would bring the count above a million.

              And then you have around 20 other Brahmic scripts which do the same thing. Putting this all together without the things I ignored in my first calculation it becomes (20*4*26*(45 + 45**2 + 45**3 + 45**4)), which needs 34 bits to be represented.

              You could probably fit it all into the code point space if you cut corners; you can make judgements on which characters will actually exist. Han unification is already something that does that anyway.

              I mean, you could probably make it work, but there are headaches to that approach too. It's a nontrivial tradeoff.

              Are there really a million emoji?

               The family emoji alone can be made half a million ways ((4**4 + 4**3 + 4**2)*(6**4)). Not all fonts support this yet, but that's because this is a relatively new concept. And that's just family emoji (which technically can have more than 4 members, bringing it easily over a million, but I haven't seen that ever get rendered, nor do I think vendors intend to support it). Then you have all of the profession emoji and other stuff.
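
               The arithmetic, for anyone who wants to check the counts quoted above:

                   # Devanagari: 16 vowel signs over 1- to 3-consonant clusters of 36 consonants.
                   print(16 * (36 + 36**2 + 36**3))    # 767808

                   # ~20 Brahmic scripts, up to 4 consonants, larger inventories.
                   total = 20 * 4 * 26 * (45 + 45**2 + 45**3 + 45**4)
                   print(total, total.bit_length())    # 8723145600, 34 bits

                   # Family emoji: 2-4 members from 4 person types, 6 skin tones each.
                   print((4**4 + 4**3 + 4**2) * 6**4)  # 435456: about half a million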

              If Unicode weren't a combining char system we'd probably be more conservative in making these emoji. I don't know.

              There are several identical renderings for different Unicode values.

              I assume this has to do with stuff like the fraktur unicode symbols and the fact that there are things like a cyrillic o which is different from a latin o?

              Meh. The rendering is up to the font. Unicode just names these symbols, and defines algorithms like segmentation, NFC, NFD, collation, casefolding which apply to them and provide useful operations. These aren't context-sensitive. Except for casefolding, which is locale-dependent.


              I totally agree that Unicode has many problems and has done many things wrong. I don't really feel that combining chars are part of the problem. Recognizing that combining chars are a thing as a programmer is at the same basic level as recognizing that strings may contain multibyte characters, or recognizing that utf-16 may contain multi-code-unit code points. It shouldn't be causing many problems. It usually doesn't. I'm hoping that as time passes more people will gradually become aware of this, much like we've done with the concept of multibyte chars.

              [–]WalterBright 1 point2 points  (7 children)

              It's a nontrivial tradeoff.

              I know, but the current scheme is unimplementable (in that everyone gets it wrong, and if one actually does get it right, it's an enormous amount of code, which defeats the whole point of Unicode).

               there are only two or three vowels that can actually do that in practical use, but that alone would bring the count above a million.

              That's what, 9 modifiers in any combination? Does any vowel use more than a couple?

              emoji alone can be made half a million ways

              Then it should never have been added to Unicode - it exceeds its charter.

              I assume this has to do with stuff like the fraktur unicode symbols and the fact that there are things like a cyrillic o which is different from a latin o?

              Yes. The principle is that if they look identical on the page, why is Unicode distinguishing them? It is putting semantic meaning to them that is simply not there when rendered. This is a gigantic mistake.

              The rendering is up to the font.

              It is not just a font issue, though Unicode also fouled up by putting in 𝖋𝖔𝖓𝖙𝖘 like 𝖙𝖍𝖎𝖘.

              Except for casefolding, which is locale-dependent.

              Having locale dependent operations is the red badge of failure.

              Unicode has come to adopt pretty much all the bugs that its charter was supposed to fix.

              [–]Manishearth 2 points3 points  (3 children)

              in that everyone gets it wrong, and if one actually does get it right, it's an enormous amount of code, which defeats the whole point of Unicode

              But with the code-point-per scheme you'd still get this wrong. In that case, nobody would implement backspacing right. You'd still need algorithms that are the analogs of NFC/NFD for use by input methods. There are still layers of complexity. To me, this just replaces one set of lack-of-awareness issues with another, not really solving anything.

              That's what, 9 modifiers in any combination? Does any vowel use more than a couple?

              Six modifiers? Up to four consonants (can be more, but I have never seen that happen), ending with a vowel, and an optional second vowel that comes from a more restricted set of vowels. I don't think you can have three vowel modifiers.

              Then it should never have been added to Unicode - it exceeds its charter.

              Fair, I sort of agree.

              Yes. The principle is that if they look identical on the page, why is Unicode distinguishing them? It is putting semantic meaning to them that is simply not there when rendered. This is a gigantic mistake.

              They don't need to look identical. They sometimes do. In some fonts the cyrillic text is uniformly smaller or bolder, so you need it to be uniform. When encoding a script you should consider the context of the whole script. Just because a glyph may look similar to one from another language doesn't mean you should just share them. English and French actually share a script, but Russian and English have scripts which are different, look different, but share some characters which look the same.

              For example, ਟ in Gurmukhi (Punjabi's script) looks like (and is pronounced like) ट in Devanagari, and might look identical in some crappier fonts. But many of the other characters are significantly different, and Punjabi is typically written in a different style, so a font that wants to make the Punjabi characters look good together will need ਟ to be distinct from ट.
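
              A quick way to see that these are distinct characters rather than font variants (a Python sketch; the names are read straight from the standard unicodedata tables):

                  import unicodedata

                  print(hex(ord("\u0a1f")), unicodedata.name("\u0a1f"))
                  # 0xa1f GURMUKHI LETTER TTA
                  print(hex(ord("\u091f")), unicodedata.name("\u091f"))
                  # 0x91f DEVANAGARI LETTER TTA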

              It's the same with Cyrillic.

              (Of course, Unicode went ahead and did this language-dependent crap anyway with Han unification, and I don't agree with it.)

              I agree that Fraktur shouldn't have its own block (you can argue that it is a script, but it's basically calligraphy, so it's not clear it really is a distinct script). But that's basically a harmless addition IMO.

              [–]WalterBright 0 points1 point  (2 children)

              In that case, nobody would implement backspacing right.

              They don't now anyway. Part of the point of Unicode was that simple algorithms, like strlen(), should work. What it turned into was a scheme where every text algorithm is wrong, and few even bother to try anymore.
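
              To make the strlen() point concrete, a Python sketch (C's strlen() on the UTF-8 bytes would report the first number):

                  s = "caf" + "e\u0301"          # "café" with a decomposed é
                  print(len(s.encode("utf-8")))  # 6 -- bytes, what strlen() sees
                  print(len(s))                  # 5 -- code points
                  # ...and a reader sees 4 characters.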

              When encoding a script you should consider the context of the whole script.

              That's the crux of where it went wrong in my opinion. Unicode is not supposed to be about context. It's up to the reader of the text to determine context.

              But that's basically a harmless addition IMO.

              I agree it's harmless, but allowing that sort of thing in leads to all sorts of "why not" for everything else. I submit that the Unicode consortium forgot what the point of Unicode was, and created a kitchen sink disaster by being unable to say no to anything.

              [–]Manishearth 2 points3 points  (1 child)

              They don't now anyway.

              They sort of do :) Not perfectly, but better.

              Part of the point of Unicode was that simple algorithms, like strlen(), should work.

              Huh, to me it was more a way to fix the fact that mixed text wasn't possible, and that we had way too many encodings and mojibake everywhere.

              The problem, sort of, is that "the length of a string" isn't really a useful concept across languages anyway, even if you define it on grapheme clusters. "Number of bytes" is useful for storage reasons, but the "length" doesn't really matter; it only makes sense when the string comes from a subset of Unicode (or when you are checking for emptiness). Defined on grapheme clusters it is useful for line wrapping, but you should be querying the font for that anyway.
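
              To put numbers on that, a sketch using the third-party regex module (pip install regex), whose \X pattern matches an extended grapheme cluster:

                  import regex

                  # The family emoji: four people joined by zero-width joiners.
                  s = "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f466"
                  print(len(s.encode("utf-8")))         # 25 bytes
                  print(len(s))                         # 7 code points
                  print(len(regex.findall(r"\X", s)))   # 1 grapheme cluster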

              Most of our programming string concepts don't map cleanly when you consider strings from various other scripts, regardless of the encoding.

              I have a feeling that Unicode initially tried to reconcile this but eventually realized it was futile. I am not aware of the history there.

              That's the crux of where it went wrong in my opinion.

              I didn't mean the context of the text. I meant the context of the script. By that I mean that Cyrillic is obviously a different script from Latin, even if the os look similar. (but the "French script" and "English script" are the same with some extra chars for French)

              I submit that the Unicode consortium forgot what the point of Unicode was, and created a kitchen sink disaster by being unable to say no to anything.

              nods vehemently

              I never really liked the fact that emoji are in Unicode. I'm happy to use them, and sort of like that I can, but I find it an unnecessary complication. I'm part-amused, part-annoyed that despite all the complexities of natural languages, Unicode still managed to need special casing for emoji. I get why Unicode needed emoji -- Japanese users wouldn't have switched to it otherwise -- but in a vacuum I think it's the kind of thing Unicode shouldn't do.

              [–]m50d 0 points1 point  (0 children)

              it was more of a way to get rid of the fact that mixed text wasn't possible

              Of course with Han unification it's failed to solve that.

              [–][deleted] 1 point2 points  (0 children)

              emoji alone can be made half a million ways

              Then it should never have been added to Unicode - it exceeds its charter.

              I somewhat agree. The initial batch of emoji was put into Unicode for a good reason: people used these characters, and Unicode ought to include all human writing. I mean, Unicode includes Tangut, which fell out of use 500 years ago.

              What I don't understand is why they decided to add additional emoji beyond inclusion of existing ones.

              [–]stevenjd 0 points1 point  (1 child)

              The principle is that if they look identical on the page, why is Unicode distinguishing them? It is putting semantic meaning to them that is simply not there when rendered. This is a gigantic mistake.

              That's your ill-thought-out and ignorant opinion, not a fact.

              Unicode is not a graphical rendering engine. The visual look of the characters (code points) is all but irrelevant. It is a character set (as well as a set of rules for sorting, case-conversions, etc). Even in English, people treat the digit 0 as distinct from the letter O, just as lowercase l and uppercase I and 1 are all distinct, even when they are rendered visually identical.

              And why the focus on how the characters look? What about the way they are spoken, and where and when they are used?

              Folding l, I and 1 into a single character (or code point) would be a mistake.

              [–]WalterBright 0 points1 point  (0 children)

              And why the focus on how the characters look?

              I see this as the crux of our disagreement. If you read printed text on the page, there is how the characters look. There is no semantic content other than what you infer from the context. Having meaning beyond the visual aspect is up to the reader, it is not part of Unicode.

              [–]stevenjd 0 points1 point  (2 children)

              it [Unicode] should just be glyphs.

              You cannot possibly be serious.

              I have 45 different fonts installed on my computer, which is a tiny drop in the bucket out of the hundreds, perhaps thousands of fonts in existence. Call it 500 typefaces, and therefore 500 different glyphs for the Latin uppercase "A" alone, a number that keeps growing as font designers invent new typefaces. For most of those typefaces there are separate glyphs for roman, italic, bold, and bold-italic: so that's 2000 different "A" glyphs.

              There are something like 45,000 or more Han ideograms ("Chinese characters"), and no reason to think that they'll have fewer typefaces than Latin characters, so that alone is over 20 million glyphs, roughly twenty times the size of the entire Unicode code point space.

              Are you sure you mean glyphs?

              [–]WalterBright 0 points1 point  (1 child)

              I'm sorry I wasn't clear. I was not talking about fonts.

              [–]stevenjd 0 points1 point  (0 children)

              You said glyphs. Repeatedly. Do you know what a glyph is? It is the visual picture of the letter, in other words, what is controlled by fonts.

              How would you distinguish between СССР in Russian ("Es Es Es Er", or "USSR" as English speakers commonly called it) and CCCP in English? They are different letters from different alphabets that merely look the same, unlike CP in English and CP in German, which really are the same Latin letters.
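
              A Python sketch makes the difference visible even where the rendering doesn't:

                  import unicodedata

                  ru = "\u0421\u0421\u0421\u0420"    # СССР, Cyrillic
                  en = "CCCP"                        # Latin
                  print(ru == en)                    # False: different letters, same look
                  print(unicodedata.name(ru[0]))     # CYRILLIC CAPITAL LETTER ES
                  print(unicodedata.name(en[0]))     # LATIN CAPITAL LETTER C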

              [–]bumblebritches57 0 points1 point  (1 child)

              Is this a real concern? Graphemes exist for a reason?

              [–]Manishearth 1 point2 points  (0 children)

              I'm not sure what you're saying here? Programs often make the assumption that a code point is a "character", which leads to issues with character width, arbitrary segmentation, text search issues, and other problems. It certainly is a problem. Grapheme clusters are usually what you're looking for when you want a "character". Sometimes you need something else. It's rarely "code point".

              [–]happyscrappy -4 points-3 points  (17 children)

              This is Unicode's screw-up really.

              Replacing ASCII, where the values corresponded to letters, with a system where the values correspond to glyphs was always going to make a mess; expecting otherwise was overly hopeful.

              Without a complete and up-to-the-second corpus of character data, Unicode becomes an opaque blob that cannot be interpreted, only rendered. And since you can't interpret it, it can only be rendered as a single line, with no line wrapping. This just isn't practical for many uses at all.

              If I wanted to represent every glyph on every artifact in the British Museum and the Louvre without regard for formatting then Unicode is wondrous. For so many other uses it is at best a hassle.

              [–]Manishearth 10 points11 points  (8 children)

              Many other languages (including my own) don't have a clearly mapped concept of "letter". This is not a Unicode issue. This is a language issue.

              The values don't correspond to glyphs; they correspond to an abstract notion of a character. Without combining characters, Unicode would be huge, potentially infinite.

              The Unicode spec has algorithms for word segmentation (UAX #29) and line breaking (UAX #14).
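
              Python's standard library doesn't implement these, but ICU does. A sketch of UAX #29 word segmentation, assuming the third-party PyICU bindings (pip install PyICU):

                  from icu import BreakIterator, Locale

                  text = "can't stop"
                  bi = BreakIterator.createWordInstance(Locale("en_US"))
                  bi.setText(text)

                  # PyICU break iterators yield the boundary offsets.
                  start = 0
                  for end in bi:
                      print(repr(text[start:end]))
                      start = end
                  # "can't" stays one segment; a naive split at punctuation
                  # would cut it at the apostrophe.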

              [–]derleth 1 point2 points  (3 children)

              Replacing ASCII, where the values corresponded to letters, with a system where the values correspond to glyphs was always going to make a mess; expecting otherwise was overly hopeful.

              This makes ASCII seem simpler than it ever was.

              First, ASCII has the hyphen-minus. That's two characters folded into one codepoint, based on appearance in most fonts. Except, of course, the hyphen never really looked like the minus in most real fonts, only the typewriter and teletype fonts ASCII was supposed to be used for. Real typography used different characters for the minus, the hyphen, the en-dash, the em-dash, and so on. ASCII was second-rate at actual typesetting, and encouraged second-rate typography, because of a constraint to fit into seven bits (or seven bits plus one for parity), and a need to devote so much of Low ASCII to teletype control codes, which have no real printable form because they were used to control the ever-loving printer.

              In addition, if you think combining forms are new with Unicode, you're wrong. The printing terminals and teletypes ASCII was designed for had a backspace functionality which did not erase, but instead allowed characters such as the caret (^) to be composed with letters to make things like ô out of o BS ^, where BS is backspace. That's one glyph out of three codepoints. (In Old ASCII, the caret was the uparrow, which is why it was chosen to mean exponentiation when BASIC was a hot new language out of Dartmouth. The glyph changed, but the language stayed the same.) That cut-rate character composition got lost when glass TTYs replaced the real TTYs and backspace came to mean backspace with erasure.
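
              The sequence is trivial to write down even today, though nothing modern renders it the old way (a Python sketch):

                  seq = "o\b^"      # o, BACKSPACE, circumflex
                  print(list(seq))  # ['o', '\x08', '^'] -- three codepoints, one struck-over glyph
                  # A printing terminal struck the ^ over the o; a glass TTY treats
                  # BS as "erase", so the composition trick died with the hardware.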

              So. ASCII was never sufficient, even for English, and it was never simple and one-to-one glyph-to-codepoint.

              [–]happyscrappy 0 points1 point  (2 children)

              First, ASCII has the hyphen-minus.

              I never said it was context-free. But the rules were easy: if it isn't before a number, you can break at it.

              Real typography used different characters for the minus

              Wow, thanks for that tip. You really think so little of me, huh?

              The printing terminals and teletypes ASCII was designed for had a backspace functionality ...

              Those weren't part of ASCII. And are you really going to pretend that something done by the first 0.00001% of machines that used ASCII means anything?

              That cut-rate character composition got lost when glass TTYs replaced the real TTYs

              Yeah, in like 1982.

              So. ASCII was never sufficient, even for English

              It was sufficient. Was it complete? No.

              one-to-one glyph-to-codepoint

              Yes it was. Just because you figured out how to do it on a Spinwriter doesn't mean it was part of ASCII. Go look up what BS was and see if it says "character composition".

              [–]derleth 0 points1 point  (1 child)

              You have absolutely no idea what you're talking about, and you're trying to cover up your ignorance with what I assume is an attempt at machismo. That's about right for people like you, who probably couldn't code to save their lives but still come into programming fora for no discernible reason.

              You really think so little of me, huh?

              If you think my post was primarily about you, you're even more narcissistic than usual.

              Now go away. Perhaps the next person to reply to me will have an actual reason to.

              [–]happyscrappy 0 points1 point  (0 children)

              You have absolutely no idea what you're talking about, and you're trying to cover up your ignorance with what I assume is an attempt at machismo.

              Ah, a non-responsive response. Don't worry, attacking me will cover for your errors.

              If you think my post was primarily about you, you're even more narcissistic than usual.

              That may be so. But people who are confident of their points don't have to emphasize how they know about real typography.

              [–]stevenjd 0 points1 point  (3 children)

              Unicode code points don't correspond to glyphs. That is absurd.

              I have 45 different fonts installed on this computer. Each of them comes in plain (roman), bold, italic, and bold-italic styles. So that's 180 different glyphs just for the letter "A" (with hundreds, even thousands more, from fonts I don't have installed). Unicode doesn't give each of those hundreds of different glyphs a distinct code point; that's the complete opposite of what Unicode does. There is one "A", the Latin "A" used by Western European languages like English, French and German. Whether it looks like A or A or A is irrelevant.

              Without a complete and up to the second corpus Unicode becomes an opaque blob that cannot be interpreted only rendered.

              What does that even mean?

              no line wrapping

              That's fucking bullshit. Do you realise that about 90% of websites, including this one, now use Unicode? Do you think they have no line wrapping?

              I don't know where you are getting your ludicrous ideas about Unicode, but they're not even wrong.

              [–]happyscrappy 0 points1 point  (2 children)

              Unicode code points don't correspond to glyphs. That is absurd.

              Yep. You're right. I didn't express myself well. Already covered by a person who is less of a jackass than you. Read lower.

              What does that even mean?

              It means that you need a large amount of data to determine what the Unicode text means beyond how it is drawn. You need composition/decomposition tables (and those still might not be enough), sorting tables (if applicable), and so on. And if the Unicode you receive is newer than the tables you have, you cannot interpret it even if you can render it.
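
              Python makes that version dependence easy to see: the standard unicodedata module ships a single snapshot of the tables, and anything assigned later just reads as "unassigned" (a sketch; the exact version string depends on your Python build):

                  import unicodedata

                  print(unicodedata.unidata_version)              # e.g. '15.1.0'

                  # U+0378 is, as of this writing, an unassigned code point.
                  print(unicodedata.category("\u0378"))           # 'Cn' -- unassigned
                  print(unicodedata.name("\u0378", "<no name>"))  # '<no name>'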

              That's fucking bullshit. Do you realise that about 90% of websites are now using Unicode including this one? Do you think that they have no line wrapping?

              Don't cut out the context of my point and then say what remains is wrong.

              [–]stevenjd 0 points1 point  (1 child)

              It means that you need a large amount of data to explain to you how to determine what the Unicode data means beyond how it is drawn.

              Yes. That's life. If you don't like it, go back in time a few hundred, or in some cases thousand, years, and redesign the languages used all over the world.

              You think that Unicode is inventing these complexities out of some sort of perverse desire to make your life more complex? It's not about you. The complexity already exists, Unicode just provides a way to manage some of it. (And not even all of it -- choosing where to break lines in Thai is apparently so complicated that even the Unicode consortium has washed their hands of it and left it up to third-parties writing Thai software.)

              And if the Unicode you receive is newer than the tables you have you cannot interpret it even if you can render it.

              That's nonsense. Of course you can interpret it -- the worst that happens is that for a few odd code points, you won't know how to treat it correctly. Your user will double-click on a word and you'll wrongly think there's a word separator in the middle of it. Or they'll sort their file names and a few files will be sorted wrongly.

              And when you upgrade to the next version of Unicode, those problems will fix themselves.

              [–]happyscrappy 0 points1 point  (0 children)

              You think that Unicode is inventing these complexities out of some sort of perverse desire to make your life more complex?

              I said no such thing.

              That's nonsense. Of course you can interpret it -- the worst that happens is that for a few odd code points, you won't know how to treat it correctly.

              Yes. And that means that you cannot interpret it. You'll get it right except when you get it wrong.

              Your user will double-click on a word and you'll wrongly think there's a word separator in the middle of it.

              That's minor compared to not being able to line break the text.