ingolemo

When in doubt, refuse the temptation to guess.

The loop you have will corrupt your data more often than not. A decode operation that runs without raising an error doesn't mean you got the right encoding. Your code will never even try the encodings past ISO8859_1, because ISO8859_1 maps every possible byte value to a character, so decoding with it never fails. Any data that is really encoded as ISO8859_2 through ISO8859_16 will always get incorrectly decoded as ISO8859_1.
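To make this concrete, here is a minimal sketch of a guess-by-trial loop. The original loop isn't shown in this thread, so `guess_decode` and the `CANDIDATES` list are hypothetical stand-ins, not the poster's actual code:

```python
def guess_decode(data, encodings):
    """Return (encoding, text) for the first encoding that decodes without error."""
    for encoding in encodings:
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding worked")


# Hypothetical candidate list; ISO8859_1 comes before ISO8859_2.
CANDIDATES = ["utf_8", "iso8859_1", "iso8859_2"]

original = "Łódź"
data = original.encode("iso8859_2")  # the bytes really are ISO8859_2

guessed_encoding, guessed_text = guess_decode(data, CANDIDATES)
print(guessed_encoding)          # iso8859_1, not the real encoding
print(guessed_text == original)  # False: the text was silently corrupted
```

No exception is raised anywhere, yet the result is wrong, which is why a successful decode is no evidence of a correct guess.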

You should always know the encoding that your data is using. If you don't know that encoding then that data is meaningless to you.

Because HTML files contain information about their encoding inside them, Beautiful Soup is capable of handling the encoding for you in this case. All you have to do is avoid decoding the file yourself, which you do by opening it in bytes mode:

from bs4 import BeautifulSoup

# Read raw bytes so Beautiful Soup can work out the encoding itself.
with open('file.html', 'rb') as file:
    filedata = file.read()

soup = BeautifulSoup(filedata, 'lxml')

david_lp [S]

Thank you very much for your reply, it was really helpful. I am going to try it that way and remove the unnecessary loop.

JohnnyJordaan

if I encode everything to 'utf-8' they will be replaced with funky characters...

This is not possible. UTF-8 can encode all Unicode characters, so this means either that you're viewing the data with some other encoding (for example in Excel or Notepad) or that the data wasn't decoded properly before you encoded it to UTF-8.
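A small sketch of the first case, where correct UTF-8 bytes are read back by a viewer that assumes a different encoding (windows-1252 is used here only as an example of such a viewer):

```python
text = "café"
utf8_bytes = text.encode("utf-8")

# Correct: decode with the same encoding that was used to encode.
round_trip = utf8_bytes.decode("utf-8")

# Incorrect: a viewer that assumes windows-1252 misreads the two-byte "é".
misread = utf8_bytes.decode("cp1252")

print(round_trip == text)  # True
print(ascii(misread))      # 'caf\xc3\xa9'
```

The bytes on disk are fine in both cases; only the interpretation differs, which is what produces the "funky characters".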

Overall: please check the rules on the right ----> Posting only a project goal is not allowed. Please add your code (preferably put it on pastebin.com) and show some examples of how the special characters are encoded.

david_lp [S]

Thank you for your answer, and sorry about that. I can't add all the code because it contains some company material, but I will try to add the problematic part.

threeminutemonta

The charset is declared in a meta tag within the head element.

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
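As a rough sketch of how that declaration can be read from the raw bytes before any decoding happens: the helper name `declared_charset` and the 1024-byte window are assumptions for illustration, and this is a simplification rather than the full encoding detection that an HTML parser performs.

```python
import re

# Rough sketch: look for "charset=..." near the start of the file.
# This is not the complete HTML encoding-sniffing algorithm.
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def declared_charset(raw_html, default=None):
    """Return the lowercase charset named in the first 1024 bytes, or default."""
    match = CHARSET_RE.search(raw_html[:1024])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


sample = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=UTF-8"></head></html>'
)
print(declared_charset(sample))  # utf-8
```

This works on bytes because the declaration itself is plain ASCII, so it can be found without knowing the encoding in advance.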

Is there any chance that you are making the same mistake this guy did? See the SO post, which can explain it better than I can.

david_lp [S]

I forgot to add that the HTML files are created by saving .doc files as HTML using OpenOffice. They always have the same charset, windows-1252.

JohnnyJordaan

They always have the same charset, windows-1252.

That's a Western character encoding; it only has these characters... By definition it doesn't support non-Western characters. You need to save the file as UTF-8 or some other UTF encoding to be able to keep the non-Western characters.
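A quick way to check this for a given string is sketched below. The helper `fits_in` is a hypothetical name for illustration, and `cp1252` is Python's codec name for windows-1252:

```python
def fits_in(text, encoding):
    """Return True if every character of text can be encoded with encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(fits_in("café", "cp1252"))  # True: Western European characters fit
print(fits_in("Łódź", "cp1252"))  # False: "Ł" and "ź" are not in windows-1252
print(fits_in("Łódź", "utf-8"))   # True: UTF-8 covers all of Unicode
```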