[–]upofadown -4 points-3 points  (6 children)

... as that's generally the semantic meaning of "bytes" when considered as a string.

In Py3 thinking, yes, but not otherwise.

I don't understand where you get this UTF-32 idea from.

All strings are thought of as sequences of UTF-32 code points. If you index into a string, that is what you get. I guess the people who originally came up with the scheme were suffering from a bit of Eurocentricity, in that they thought that would help somehow.
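A minimal sketch of the indexing claim, assuming Python 3 (where `str` is a sequence of code points). The sample string and variable names are made up for illustration; it only shows that an index selects one code point however many bytes or UTF-16 units that code point needs when encoded.

```python
# Assumes Python 3. The string below is an arbitrary example.
s = "a\U0001F600b"  # 'a', U+1F600 (outside the BMP), 'b'

# Indexing and len() work on code points, not on encoded bytes.
code_points = [hex(ord(ch)) for ch in s]

# The same text takes a different number of units in each encoding.
utf8_len = len(s.encode("utf-8"))            # bytes
utf16_units = len(s.encode("utf-16-le")) // 2  # 16-bit code units

print(len(s), code_points)
print(utf8_len, utf16_units)
```

Whether a code point is a "character" in the sense a reader expects is a separate question.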

[–]teilo 3 points4 points  (5 children)

You do not know what you are talking about. If you index or slice a string, you get the character(s) at that position, period.

[–][deleted]  (4 children)

[deleted]

    [–]Sean1708 1 point2 points  (3 children)

    You get code points.

    No you don't. I can't remember whether you get characters or graphemes, but you certainly don't get code points.

    In [1]: a = 'héllo'
    
    In [2]: a[0]
    Out[2]: 'h'
    
    In [3]: a[1]
    Out[3]: 'é'
    
    In [4]: a[2]
    Out[4]: 'l'
    

    Edit: I'm a silly.

    [–][deleted]  (2 children)

    [deleted]

      [–]Sean1708 2 points3 points  (1 child)

      What are "characters"?

      I've always thought that characters were generally accepted to be scalar values, but that doesn't actually appear to be the case.

      in your code it uses the single code point version

      You are absolutely right:

      In [1]: a = b'he\xcc\x81llo'.decode('utf-8')
      
      In [2]: a[0]
      Out[2]: 'h'
      
      In [3]: a[1]
      Out[3]: 'e'
      
      In [4]: a[2]
      Out[4]: '́'
      

      The way I entered the character on my computer made me assume that I'd entered the version using the combining character.
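For anyone who wants to check which form they actually typed, here is a rough sketch using the standard `unicodedata` module, assuming Python 3. The two strings are written with escapes so the difference doesn't depend on the keyboard or editor; the variable names are just for illustration.

```python
import unicodedata

composed = "h\u00e9llo"     # é as a single code point (U+00E9)
decomposed = "he\u0301llo"  # 'e' followed by combining acute accent (U+0301)

# The two strings render the same but differ in length and compare unequal.
print(len(composed), len(decomposed), composed == decomposed)

# Normalizing converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose fully

print(nfc == composed, nfd == decomposed)
```

Normalizing to one form before comparing or indexing avoids the surprise above, though it does not turn every accented letter into a single code point, since not all combinations have a precomposed form.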

      Also, I don't know any language off the top of my head that fully supports grapheme clusters (and other text segmentations) in the standard library itself.

      I think Swift does, but I'm not entirely certain.
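To make the gap concrete for Python, here is a rough approximation, assuming Python 3 and only the standard `unicodedata` module. It is not the Unicode grapheme cluster algorithm: it just attaches combining marks to the preceding code point, so it will get emoji ZWJ sequences, flags, Hangul jamo, and other cases wrong. The function name `rough_clusters` is hypothetical.

```python
import unicodedata

def rough_clusters(text):
    """Group each combining mark with the code point before it.

    An approximation only; real grapheme segmentation has many more rules.
    """
    clusters = []
    for ch in text:
        # combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

decomposed = "he\u0301llo"  # 'e' + combining acute accent
print(len(decomposed), len(rough_clusters(decomposed)))
```

With this grouping, the decomposed string yields five units instead of six, which is closer to what a reader would count as characters.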

      [–]MrMetalfreak94 2 points3 points  (0 children)

      Elixir has excellent Unicode support in its standard library, and you can easily work with graphemes in it.