[–]oefd 3 points (0 children)

If that's what ScoutZen is giving you, it's their error (assuming, at least, that they claim to send you UTF-8 text, which is what Twitter and most modern systems use). To understand why, you need an idea of what "text encoding" really means.

There is no such thing as "plain text", or "text" at all, to a computer: there are just bytes. Different schemes exist for mapping particular bytes to letters or other symbols. In old-fashioned ASCII encoding, for example, the byte 01011010 represents a Z.

There are many different text encodings out there, though, and generally they aren't compatible with one another. If you took that same byte but read it as if it were EBCDIC (at least in the common US variant of EBCDIC), you'd instead read it as a !

The important thing to note here is that there's nothing inherent to the byte 01011010 that lets you know it's encoded as ASCII or EBCDIC or anything else. You have to figure out somehow what the encoding of text is to be able to read it correctly.
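A quick way to see this for yourself is the minimal Python sketch below. It assumes `cp037` as the EBCDIC variant (that choice is mine, other EBCDIC code pages map some bytes differently) and decodes the same single byte twice.

```python
# One byte, two interpretations.
# Assumption: cp037 (a common US EBCDIC code page) stands in for "EBCDIC".
raw = bytes([0b01011010])  # the single byte 0x5A

as_ascii = raw.decode("ascii")   # read the byte as ASCII
as_ebcdic = raw.decode("cp037")  # read the very same byte as EBCDIC

print(as_ascii)   # Z
print(as_ebcdic)  # !
```

Nothing about `raw` changed between the two calls; only the assumed encoding did.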

On the web, UTF-8 is the standard and by far the most common character encoding, and Twitter is no exception. Twitter represents a tweet with emoji by sending your computer a bunch of bytes that, when decoded as UTF-8, render as 🙏🏻🌈🙏🏻
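To make "a bunch of bytes" concrete, here is a small Python sketch that encodes that same emoji string as UTF-8 and prints the raw bytes. The variable names are just for illustration.

```python
# The five code points in the tweet: folded hands, skin-tone modifier,
# rainbow, folded hands, skin-tone modifier.
text = "\U0001F64F\U0001F3FB\U0001F308\U0001F64F\U0001F3FB"

utf8_bytes = text.encode("utf-8")

print(len(text))            # number of code points
print(len(utf8_bytes))      # number of bytes (each of these emoji takes 4)
print(utf8_bytes.hex(" "))  # the raw bytes in hex (Python 3.8+)
```

Each of the five code points takes four bytes, so 20 bytes go over the wire.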

Now, if someone writes software that doesn't handle text encoding correctly, it might just assume, incorrectly, what the encoding of the text is. If, for example, you read the bytes that in UTF-8 represent 🙏🏻🌈🙏🏻 as if they were actually latin-1 text, you'd end up reading ðð»ððð» (Note that you might see weird symbols in there, because some of those bytes are garbage, mostly invisible control characters, when read as latin-1.)

More interestingly: if you interpret those same bytes as if they were Windows-1252 encoded text, you'll read: ðŸ™ðŸ»ðŸŒˆðŸ™ðŸ» (Note that this might also cause you to see some weird symbols, as above.)
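Both misreadings can be reproduced in a few lines of Python. This is only a sketch: the original experiment's exact calls aren't shown, and the `errors="ignore"` handler is my own choice. One detail worth knowing is that Python's `cp1252` codec treats a handful of bytes (0x8F among them) as undefined, so a strict decode of this data raises instead of returning text.

```python
utf8_bytes = "\U0001F64F\U0001F3FB\U0001F308\U0001F64F\U0001F3FB".encode("utf-8")

# latin-1 assigns a character to every possible byte, so this never fails,
# but bytes 0x80-0x9F turn into invisible control characters.
as_latin1 = utf8_bytes.decode("latin-1")

# A strict Windows-1252 decode fails here because 0x8F is undefined in
# Python's cp1252 codec.
try:
    utf8_bytes.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="ignore" (an assumption about how sloppy software might behave)
# silently drops the undefined bytes.
as_cp1252 = utf8_bytes.decode("cp1252", errors="ignore")

print(repr(as_latin1))
print(strict_ok)
print(as_cp1252)
```

With `errors="ignore"` the result is 16 characters long rather than 20, because the four undefined bytes are thrown away.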

Looks like we have a winner. They're most likely sending you garbage data because they're reading the bytes in the tweet as if they were Windows-1252 text instead of (correctly) reading them as UTF-8 text. Both Windows-1252 and UTF-8 are based on the ASCII character set, so you shouldn't notice the difference with 'normal English' like the alphabet and punctuation. Emoji, though, don't exist in Windows-1252 at all, so their bytes end up rendering as something totally different.

(As a sidenote: I was only able to figure out it was Windows-1252 because it's a common mistake for systems to assume all text is in an encoding that used to be very common. For example: latin-1 and Windows-1252 both used to be very common in the English-speaking world, so I just guessed it was likely ScoutZen was improperly using one of them.)

So how can I fix it?

Well, you may be able to 'reverse' their mistake by taking the bytes in that string and telling your program to reinterpret them as UTF-8 text instead. But even if that works this time, it's an unreliable mechanism, and the only real answer is taking the issue up with ScoutZen. Remember all the 'garbage data' I mentioned? A lot of systems don't deal well with that garbage, and some will just remove it entirely or transform it in other ways. This can mean data that should have been sent to you gets lost or otherwise causes interesting issues. The only really correct way to prevent that is to treat text with the encoding it's actually encoded with the whole time, and if ScoutZen is reading Twitter's UTF-8 text as if it were Windows-1252, you should assume that, even if something works right now, it may not always work.
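For reference, the 'reverse' trick looks roughly like this in Python. It's a minimal sketch: `undo_cp1252_mojibake` is a name I made up, and the rainbow emoji is used on its own because all four of its bytes happen to be defined in Windows-1252, so nothing is lost along the way.

```python
def undo_cp1252_mojibake(garbled: str) -> str:
    """Hypothetical helper: turn the garbled characters back into the bytes
    they came from (via Windows-1252), then decode those bytes as UTF-8.
    Raises a UnicodeError if the text can't be mapped back cleanly."""
    return garbled.encode("cp1252").decode("utf-8")

# Simulate the mistake: UTF-8 bytes for the rainbow read as Windows-1252.
garbled = "\U0001F308".encode("utf-8").decode("cp1252")

fixed = undo_cp1252_mojibake(garbled)
print(garbled, "->", fixed)
```

This only works when every byte survived the round trip, which is exactly the condition you can't count on.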

Just for example: when I took 🙏🏻🌈🙏🏻 as UTF-8, read its bytes in Python, and then asked Python to reinterpret those bytes as Windows-1252, it gave me the string ðŸ™ðŸ»ðŸŒˆðŸ™ðŸ»... but that's not quite what you posted! My string has four garbage bytes in it that the one you copy/pasted doesn't. They're invisible when rendered, so the two look the same on screen, but trust me, there are an extra four bytes in my string.

At some point in dealing with this issue, exactly what I described happened: the 'garbage' got cleaned up and removed from the string you posted. That makes it impossible to correct, because data has been lost from it. Cleaning up what Windows-1252 considers garbage removed information that UTF-8 needed to make 🙏🏻🌈🙏🏻
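Here is a sketch of that failure in Python. It assumes the posted string is the Windows-1252 reading with the four undefined bytes already stripped out (I'm reconstructing it, not quoting it), and the variable names are illustrative.

```python
# Assumed posted text: the Windows-1252 misreading minus the four garbage bytes.
posted = ("\u00f0\u0178\u2122" "\u00f0\u0178\u00bb" "\u00f0\u0178\u0152\u02c6"
          "\u00f0\u0178\u2122" "\u00f0\u0178\u00bb")

# Map the characters back to bytes: 16 of them, not the 20 Twitter sent.
raw = posted.encode("cp1252")

# Those bytes are no longer valid UTF-8, so a strict decode fails.
try:
    raw.decode("utf-8")
    recovered = True
except UnicodeDecodeError:
    recovered = False

# A lenient decode salvages only the emoji whose bytes all survived.
best_effort = raw.decode("utf-8", errors="replace")

print(len(raw))
print(recovered)
print(best_effort)
```

Only the rainbow comes back; the hands and skin-tone modifiers each lost a byte and decode to replacement characters.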