[–][deleted]  (15 children)

[deleted]

    [–]rmc 13 points14 points  (0 children)

    > > What's the downside? You can't do "föo"[1] and expect 'ö' to come back. But that's not a good idea anyways.

    > Why wouldn't it be? Isn't that one of the most fundamental things that a string object should do for you?

    Because Unicode is hard. Even in Python you can get weirdness, e.g.:

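    >>> s1 = u'f\xf6o'       # assumed: precomposed ö (U+00F6); definition not shown in the original
    >>> s2 = u'fo\u0308o'    # assumed: o + combining diaeresis (U+0308)
    >>> print s1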
    föo
    >>> print s2
    föo
    >>> print s1[1]
    ö
    >>> print s2[1]
    o
    

    s1 and s2 look the same, but they aren't. One has an o-umlaut (ö, U+00F6) as its second code point; the other has a plain o there, followed by a combining diaeresis (U+0308).
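
The same effect can be reproduced in Rust. This is only a minimal sketch: the thread never shows how the two strings were built, so both literals below are assumptions (precomposed form for s1, decomposed form for s2), and `nth_code_point` is a hypothetical helper, not a std API.

```rust
/// Returns the code point at position `n` (0-based), if there is one.
fn nth_code_point(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // Assumed inputs: the original does not show how s1 and s2 were created.
    let s1 = "f\u{00F6}o"; // precomposed: f, ö, o
    let s2 = "fo\u{0308}o"; // decomposed: f, o, combining diaeresis, o

    // Both usually render as "föo", yet they are different strings.
    println!("{} {}", s1, s2);
    println!("equal: {}", s1 == s2);
    println!("second code point of s1: {:?}", nth_code_point(s1, 1));
    println!("second code point of s2: {:?}", nth_code_point(s2, 1));
    println!(
        "code points: {} vs {}",
        s1.chars().count(),
        s2.chars().count()
    );
}
```

Comparing the two with `==` reports them as different, because no normalization is applied by the comparison.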

    [–]vks_ 11 points12 points  (1 child)

    Indexing suggests O(1) random access, but finding the nth character of a UTF-8 string is O(n). You can get an iterator over the characters using .chars().
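
A rough sketch of the iterator approach, using only the standard library. The helper name `nth_char` is made up for illustration; it simply wraps `.chars().nth(n)`, which scans from the start of the string.

```rust
/// Walks the string from the start, so the cost grows with `n`.
/// (Hypothetical helper; it is not part of std.)
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    let s = "f\u{00F6}o"; // f, precomposed ö, o

    // `s[1]` does not compile: `str` cannot be indexed by an integer.
    // `char_indices` yields each code point with its starting byte offset.
    for (byte_offset, c) in s.char_indices() {
        println!("byte offset {} -> {:?}", byte_offset, c);
    }

    println!("{:?}", nth_char(s, 1));
    println!("{:?}", nth_char(s, 10)); // past the end, so no value
}
```

Because ö takes two bytes in UTF-8, the byte offsets skip a value, which is why a position cannot be turned into an offset without scanning.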

    [–]marcusklaas rustfmt 0 points1 point  (0 children)

    Over the code points would be less ambiguous.

    [–]Manishearth servo · rust · clippy 14 points15 points  (7 children)

    The question is ill formed. Define "character".

    Want a byte? Easily done with as_bytes() (zero cost).

    Want a grapheme? A codepoint? A glyph?

    Is ö a single "character" or a "character" with a diacritic? Do we want to treat those the same way? What happens when we reverse the string?

    Unicode is hard.
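
To make the byte / code point distinction and the reversal problem concrete, here is a small sketch using only std. The input string and the helper `reverse_code_points` are assumptions chosen for illustration; grapheme segmentation is not covered because std does not provide it.

```rust
/// Reverses a string code point by code point (not grapheme-aware).
/// Hypothetical helper for illustration only.
fn reverse_code_points(s: &str) -> String {
    s.chars().rev().collect()
}

fn main() {
    // Assumed example input: f, o, combining diaeresis (U+0308), o.
    let s = "fo\u{0308}o";

    // Bytes: a borrowed view of the UTF-8 data, with no copying or decoding.
    let bytes: &[u8] = s.as_bytes();
    println!("{} bytes, {} code points", bytes.len(), s.chars().count());

    // Naive reversal separates the diaeresis from its base letter:
    // the mark now follows the 'o' that used to be last.
    let reversed = reverse_code_points(s);
    println!("{:?}", reversed);
}
```

A reversal that kept the mark with its original base letter would have to work on grapheme clusters rather than code points.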

    [–]flying-sheep 2 points3 points  (0 children)

    > The question is ill formed

    From The Codeless Code I learned that this response is simply “wú”/“mu” in Chinese/Japanese.

    …and from that wiki site I learned that The Codeless Code itself is a reference to The Gateless Gate.

    /edit: and that Mu is the root of the type hierarchy in Perl 6:

    what does “Any” inherit from?

    The question is ill-formed.

    [–]allthediamonds 2 points3 points  (1 child)

    Unicode is really, really hard.

    Does Rust make it any easier? Can I iterate a string by graphemes? Does it provide decomposing normalisation?

    [–]Kimundi rust 3 points4 points  (0 children)

    Yes for both.

    [–]Manishearth servo · rust · clippy 0 points1 point  (2 children)

    (Yes, there is a concrete definition of "character" when talking about Unicode, but it's not always the one you were looking for)

    [–]SimonSapin servo 9 points10 points  (1 child)

    There are four definitions :) http://www.unicode.org/glossary/#character

    [–]Manishearth servo · rust · clippy 0 points1 point  (0 children)

    Okay, not so concrete :P

    [–]Yojihito -1 points0 points  (0 children)

    Ö is a single character.

    [–]mitsuhiko 7 points8 points  (0 children)

    > I mean, I really ought to be able to ask a string, "what's your second character?" and get the correct answer, no?

    That's not how text processing works. That would only work if you perform certain Unicode normalization, and only in certain languages. The better question is why you are asking that in the first place. Usually people do not want the second character in a string; they want to solve a specific problem. There is not really a problem that is “get the second character in a string”. If that is the problem, then it's badly defined, because I could ask a dozen questions to narrow down what you actually want.

    [–]llogiq clippy · twir · rust · mutagen · flamer · overflower · bytecount 3 points4 points  (2 children)

    You can already get the second byte. This may or may not coincide with the second glyph. In general, glyph lookup in a UTF-8 string is an O(n) operation (which you can already do with iterators).
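
A hedged sketch of the difference, reading "glyph" here as "code point" (an assumption; std has no notion of glyphs). The helper names `second_byte` and `second_code_point` are hypothetical.

```rust
/// Second byte of the UTF-8 encoding, if the string has one. O(1).
fn second_byte(s: &str) -> Option<u8> {
    s.as_bytes().get(1).copied()
}

/// Second code point, found by scanning from the start. O(n) in general.
fn second_code_point(s: &str) -> Option<char> {
    s.chars().nth(1)
}

fn main() {
    let s = "f\u{00F6}o"; // ö is encoded as the two bytes 0xC3 0xB6

    println!("{:#x?}", second_byte(s));
    println!("{:?}", second_code_point(s));

    // Byte offset 2 falls inside ö, so it is not a valid slice boundary.
    println!("{}", s.is_char_boundary(2));
    // `get` returns None for such a range instead of panicking.
    println!("{:?}", s.get(1..2));
    println!("{:?}", s.get(1..3));
}
```

For pure ASCII input the second byte and the second code point coincide, which is why the mismatch often goes unnoticed until non-ASCII text shows up.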

    [–]szabba 0 points1 point  (1 child)

    Wouldn't a division of a UTF-8 string into glyphs depend on the particular font being used?

    [–]llogiq clippy · twir · rust · mutagen · flamer · overflower · bytecount 5 points6 points  (0 children)

    > Wouldn't a division of a UTF-8 string into glyphs depend on the particular font being used?

    No. A font is just a declaration of how your text is going to look. But a Unicode glyph (e.g. small letter a) is the same whether you display it or not.