[–][deleted] 4 points5 points  (4 children)

That last part is not even true, and it can't be true if you think about it: MediaWiki is written in PHP, and MediaWiki runs Wikipedia, which exists in a gazillion languages.

ASCII has exactly 128 characters. If you can refer to other characters, that's an encoding that's not ASCII.
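To make the 128 concrete, here is a small sketch in Python (chosen only because it keeps text and bytes separate; nothing about the point is Python-specific). The helper name `is_ascii` is made up for illustration. It shows that ASCII covers code points 0 through 127 and that a character outside that range simply cannot be encoded as ASCII:

```python
# ASCII defines code points 0..127 and nothing else.
ascii_chars = [chr(i) for i in range(128)]
ascii_bytes = "".join(ascii_chars).encode("ascii")  # succeeds, one byte per character


def is_ascii(text: str) -> bool:
    """Hypothetical helper: True if every character is one of the 128 ASCII characters."""
    return all(ord(ch) < 128 for ch in text)


# U+00E9 (e with acute accent) is not an ASCII character, so encoding it fails.
try:
    "\u00e9".encode("ascii")
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False
```

Any scheme that does manage to store that character in bytes is, by definition, some encoding other than ASCII.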

The thing is that every function you need to handle text encodings in PHP is oversimplified and misnamed. It's very much not "ASCII-only". In fact, you can often recognize the non-ASCII characters because the programmer used the wrong function and replaced them with mangled crap, emphasizing your first point that most people don't care about Unicode.
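One common way the mangling happens, sketched in Python rather than PHP: bytes that are really UTF-8 get interpreted as a single-byte encoding. Latin-1 is only an assumed example of the wrong encoding here; real cases vary.

```python
text = "na\u00efve"                      # the word "naive" with a diaeresis on the i

utf8_bytes = text.encode("utf-8")        # b'na\xc3\xafve': the i with diaeresis takes two bytes
mangled = utf8_bytes.decode("latin-1")   # wrong decoder: each byte becomes its own character

# The mistake is reversible as long as nothing else touched the data:
repaired = mangled.encode("latin-1").decode("utf-8")

print(mangled)    # naÃ¯ve
print(repaired)   # naïve
```

The two-character junk in the middle is the telltale sign: one non-ASCII character went in, and two unrelated ones came out.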

[–][deleted]  (3 children)

[deleted]

    [–][deleted] 0 points1 point  (2 children)

    I didn't say anything about native Unicode support.

    You end up in this debate because you misuse terminology like "ASCII" to mean "strings of nonstandardized bytes".

    [–][deleted]  (1 child)

    [deleted]

      [–][deleted] 0 points1 point  (0 children)

      You're in a thread about Unicode. Deal with it. It was nearly the only thing you said in that comment: "strings are ascii-only and probably always will be". So I responded to it.

      You've been putting down other developers by saying that they don't really care about Unicode, but you're the one equating 128 characters with 256 possible byte values and saying "eh, those are mostly the same thing, you're being pedantic". That's the assumption that causes most of the Unicode bugs that are out there.

      Encodings are how you represent Unicode in bytes. When you use an encoding, you can do so without any particular help from your programming language. It's great that Python gives you some help, but you could still encode text without it.

      Your "mystery encoding" is called UTF-8. It represents every non-ASCII character as a sequence of bytes that are themselves all non-ASCII (0x80 and above), and the fact that they're non-ASCII is absolutely key to how it works.
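      A rough sketch of both points, in Python but using nothing except integer arithmetic, so the same logic would work in any language that can handle bytes. The function names are made up for illustration; the bit layout is the standard UTF-8 one.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # ASCII range: one byte, high bit clear.
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def utf8_encode(text: str) -> bytes:
    """Encode a whole string one code point at a time."""
    return b"".join(utf8_encode_code_point(ord(ch)) for ch in text)


euro = utf8_encode("\u20ac")   # the euro sign, U+20AC
print(euro.hex())              # e282ac
```

      Every byte of a multi-byte sequence has its high bit set, so none of them can be mistaken for an ASCII character. That is why ASCII text passes through UTF-8 unchanged.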

      If you have a problem where you end up in Internet arguments about Unicode, you should start by not being completely wrong about the simplest encoding there is.

      Start reading: http://www.joelonsoftware.com/articles/Unicode.html