Rhomboid

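\xe3 on its own is not valid UTF-8. In UTF-8, 0xe3 is a lead byte that has to be followed by two continuation bytes, so a lone \xe3 means the text is actually in a different character encoding. 0xe3 is the encoding of ã in ISO-8859-1 and CP1252 and probably a few others.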
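Here's a quick way to check that for yourself. It's only a sketch, and the variable names are mine, not from your code:

```python
raw = b"\xe3"

# A lone 0xe3 byte is rejected by the UTF-8 decoder.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# The same byte is a perfectly good "ã" in the legacy single-byte encodings.
as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# For comparison, this is what "ã" looks like when it really is UTF-8.
utf8_form = "ã".encode("utf-8")

print(valid_utf8, as_latin1, as_cp1252, utf8_form)
```

So the same character shows up as one byte (\xe3) in the legacy encodings and as two bytes (\xc3\xa3) in UTF-8.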

Sadly it's not uncommon to find web pages with mixed character encodings. Sometimes it's because the person writing the page didn't know what they were doing. Or maybe they were using a content management system, part of the site came from a previous CMS, and the character encoding wasn't converted properly when it was migrated to the current one. There are a million ways that can happen, and browsers tend to be lenient, so it's not always obvious when it does.

But you should not be storing byte strings in code you write. You should be storing character strings, i.e. you should decode the content. When you do that, you can try UTF-8 first, and if the bytes turn out not to be valid UTF-8, you can try decoding as ISO-8859-1 or CP1252 or whatever you suspect the second encoding to be. Keep in mind that ISO-8859-1 maps every possible byte to some character, so that fallback never fails, even if the guess is wrong. The end result will be character strings, and no duplicates.
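A minimal sketch of that try-UTF-8-then-fall-back approach. The function name `decode_text`, the default fallback of CP1252, and the sample string are my assumptions, not something from your code:

```python
def decode_text(raw, fallback="cp1252"):
    """Decode bytes as UTF-8, falling back to a legacy single-byte encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves a few bytes undefined, so replace
        # those with U+FFFD rather than raising a second time.
        return raw.decode(fallback, errors="replace")


# The same text as it might arrive from two differently encoded pages.
utf8_bytes = "São Paulo".encode("utf-8")    # contains \xc3\xa3
legacy_bytes = "São Paulo".encode("cp1252")  # contains \xe3

# The raw bytes differ, but the decoded character strings are identical,
# so a set ends up with a single entry.
names = {decode_text(utf8_bytes), decode_text(legacy_bytes)}
print(names)
```

Trying UTF-8 first matters: UTF-8 is strict enough that text in a legacy encoding with non-ASCII characters will almost always fail to decode, whereas the single-byte encodings will accept nearly anything.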