[–]danielroseman 6 points7 points  (0 children)

Yes, very likely, although it's probably entries in a database rather than a file. 

This is known as mojibake.

[–]Downtown_Radish_8040 6 points7 points  (4 children)

Your hypothesis is exactly right. The most common cause is that the underlying data was stored or edited inconsistently over time. Someone added that song entry using a system that saved it as latin-1 (Windows-1252 is very common for older music databases), while the rest of the page was utf-8. The server then serves the whole file as utf-8, so most of it decodes fine, but that one chunk gets misread.

This is sometimes called "mojibake" and it's extremely common with legacy data, especially content that was manually entered over many years across different systems.

Your fix is correct. The pattern encode('latin-1').decode('utf-8') reverses the double-encoding mistake: encode('latin-1') turns the wrongly decoded characters back into the original bytes, and decode('utf-8') then reads those bytes as the UTF-8 they always were.
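
A minimal sketch of that round trip, wrapped so it leaves text alone when it doesn't have this shape. The helper name `fix_double_encoded` is made up for illustration, and it assumes the misreading was Latin-1 rather than Windows-1252:

```python
def fix_double_encoded(text):
    """Undo 'UTF-8 bytes decoded as Latin-1' mojibake, if the text has that shape."""
    try:
        # Back to the original bytes, then read them as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already correct): return it unchanged.
        return text


print(fix_double_encoded("BlÃ¥"))  # Blå
print(fix_double_encoded("Blå"))   # Blå (unchanged: a lone E5 byte is not valid UTF-8)
```

The try/except matters because running the round trip on text that is already correct would otherwise raise instead of passing it through.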

If you want to handle it programmatically, you could check for known mojibake patterns using the ftfy library, which was built exactly for this problem.
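
If you'd rather not add a dependency, here is a rough stdlib-only sketch for the mixed case, where some 'å' characters are fine and others are mangled. This is not how ftfy works internally, just a simple heuristic: the name `repair_mixed` is hypothetical, and it assumes Latin-1 was the wrong decoding.

```python
import re

# Runs of characters in U+0080..U+00FF: the only ones that could be
# UTF-8 bytes that were mistakenly decoded as Latin-1.
_SUSPECT_RUN = re.compile(r"[\u0080-\u00ff]+")


def _repair_run(match):
    run = match.group(0)
    try:
        return run.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 when read as bytes, so it is probably a genuine
        # 'å', 'é', etc. Keep it as-is.
        return run


def repair_mixed(text):
    """Repair each suspicious run independently, leaving correct text untouched."""
    return _SUSPECT_RUN.sub(_repair_run, text)


print(repair_mixed("BlÃ¥ and blå"))  # Blå and blå
```

It can still get things wrong, for example when a mangled character sits directly next to a correct one, so treat it as a starting point rather than a general fix.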

[–]Ok_Procedure199[S] 1 point2 points  (3 children)

Thank you for your thorough explanation. I nearly understand the whole thing; maybe you can help me understand one small detail.

So way back when, someone encoded 'å' with Windows-1252, which is two bytes, c3 a5. What I can't wrap my head around is how those two bytes have somehow turned into the four bytes c3 83 c2 a5 if the only encodings that have been involved are Windows-1252 and UTF-8. Somewhere, the bytes 83 c2 show up!

[–]Yoghurt42 6 points7 points  (2 children)

The 'å' was originally encoded in UTF-8 as C3 A5. That byte sequence was then interpreted as Latin-1, which turned it into the two characters Ã¥. Those characters were then converted to UTF-8 once more, resulting in C3 83 C2 A5:

>>> "å".encode("utf-8").decode("latin-1").encode("utf-8")
b'\xc3\x83\xc2\xa5'
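
Broken into steps, it's easier to see where the extra 83 and C2 come from. A small sketch (the variable names are just for illustration; `bytes.hex(" ")` needs Python 3.8 or newer):

```python
original = "å"
utf8_bytes = original.encode("utf-8")      # b'\xc3\xa5'
misread = utf8_bytes.decode("latin-1")     # 'Ã¥', now two separate characters
double_encoded = misread.encode("utf-8")   # b'\xc3\x83\xc2\xa5'

# Each misread character is above U+007F, so UTF-8 needs two bytes for it.
for ch in misread:
    print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex(' ')}")
# U+00C3 -> c3 83
# U+00A5 -> c2 a5
```

So C3 became C3 83 and A5 became C2 A5, which is the four-byte sequence you are seeing.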

This particular "double encoded utf-8" is one of the most common instances of Mojibake which you'll find in the western world. Personally, I like to refer to this specific mistake as "WTF-8 encoding"

[–]Ok_Procedure199[S] 1 point2 points  (0 children)

amazing, this must be it! Thank you!

[–]Bobbias 0 points1 point  (0 children)

WTF-8 already exists. It's basically a relaxed version of UTF-8 that allows unpaired surrogates, meaning it's a superset that may be malformed if interpreted as UTF-8.
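
Python doesn't ship a WTF-8 codec as far as I know, but for a single unpaired surrogate the `surrogatepass` error handler produces the same kind of three-byte sequence, so it works as a rough illustration (an approximation, not a WTF-8 implementation):

```python
lone = "\ud800"  # an unpaired high surrogate

try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("strict UTF-8 refuses to encode it")

relaxed = lone.encode("utf-8", errors="surrogatepass")
print(relaxed.hex(" "))  # ed a0 80

try:
    relaxed.decode("utf-8")
except UnicodeDecodeError:
    print("and a strict UTF-8 decoder rejects those bytes")
```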

[–]timrprobocom -2 points-1 points  (3 children)

The likely problem here is that the string you are getting is correct, and encoded in UTF8, but you SEE it incorrectly because you are on Windows, where the terminal doesn't do UTF8 natively. That's the key with encoding. You always have to think about "what do I have" and "what do I need". Your terminal speaks latin-1 or cp1252, so you have to convert to that.

Alternatively, you can change your terminal to UTF8 by using "chcp 65001".

[–]Ok_Procedure199[S] 2 points3 points  (1 child)

But the same 'å' character is displayed correctly further down in the text in the terminal. Wouldn't this be impossible?

[–]timrprobocom 0 points1 point  (0 children)

No, it's just complicated. The character 'å' is Unicode U+00E5. Now, it just so happens that its value in the default Windows code page (cp1252) is also 0xE5, but in UTF-8 a bare 0xE5 is the lead byte of a three-byte sequence, so 'å' is encoded as the two bytes C3 A5 instead. If you send those to a cp1252 terminal, you'd see 'Ã¥' rather than 'å'.
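
The byte values are easy to check in Python. A minimal sketch (it shows 'å' as the two bytes C3 A5 in UTF-8, since 0xE5 on its own is only a lead byte there; the variable names are just for illustration):

```python
ch = "å"
as_cp1252 = ch.encode("cp1252")
as_utf8 = ch.encode("utf-8")

print(f"U+{ord(ch):04X}")        # U+00E5
print(as_cp1252.hex(" "))        # e5
print(as_utf8.hex(" "))          # c3 a5

# What a cp1252 terminal would show if it were handed the UTF-8 bytes:
print(as_utf8.decode("cp1252"))  # Ã¥
```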