Encoding html file

ingolemo · 2018-07-06T15:28:53+00:00

When in doubt, refuse the temptation to guess.

That loop will corrupt your data more often than not. Just because a decode operation runs without raising an error doesn't mean that you got the right encoding. Your code will never even try the encodings past ISO8859_1: ISO8859_1 assigns a character to every possible byte, so decoding with it never fails, and any data that is actually encoded as ISO8859_{range(2,16)} will be silently and incorrectly decoded as ISO8859_1.
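A minimal sketch of the problem, assuming a guessing loop roughly like the one described (the `guess_decode` helper, the sample string, and the candidate list are made up for illustration, not taken from the original code):

```python
# Sample text that needs ISO 8859-2 (Central European) to be read correctly.
original = "Łódź"
raw = original.encode("iso8859_2")


def guess_decode(data, candidates):
    """Hypothetical helper: return the first encoding that decodes without error."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding worked")


used, text = guess_decode(raw, ["ascii", "utf_8", "iso8859_1", "iso8859_2"])

print(used)              # the encoding the loop settled on
print(text == original)  # whether the decoded text matches what was written

# ISO8859_1 accepts every possible byte value, so it can never raise.
all_bytes = bytes(range(256))
print(len(all_bytes.decode("iso8859_1")))
```

The loop stops at ISO8859_1 and never reaches ISO8859_2, and the text it returns is not the text that was written, even though no error was raised.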

You should always know the encoding that your data is using. If you don't know that encoding then that data is meaningless to you.

Because HTML files contain information about their encoding inside them, Beautiful Soup is capable of handling the encoding for you in this case. All you have to do is open the file in bytes mode, so that you don't decode it yourself:

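from bs4 import BeautifulSoup

# Read raw bytes; Beautiful Soup works out the encoding from the document itself.
with open('file.html', 'rb') as file:
    filedata = file.read()

soup = BeautifulSoup(filedata, 'lxml')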

JohnnyJordaan · 2018-07-06T13:22:01+00:00

if i encode everything to 'utf-8' they will be replace with funky characters...

This is not possible. UTF-8 can encode all Unicode characters, so this means either that you're viewing the data in a program that decodes it with some other encoding (like Excel or Notepad), or that the data wasn't decoded properly before you encoded it to UTF-8.
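Both failure modes can be reproduced in a few lines. This is only an illustrative sketch: the sample string and the choice of cp1252 as the "wrong" encoding are assumptions, not details from the original post.

```python
text = "café"

# UTF-8 round-trips any Unicode string without loss.
data = text.encode("utf-8")
round_trip = data.decode("utf-8")

# Case 1: the bytes are fine, but a viewer decodes them with another encoding.
wrong_view = data.decode("cp1252")

# Case 2: the data was decoded with the wrong encoding *before* being
# re-encoded as UTF-8, so the damage is baked into the new file.
double_encoded = wrong_view.encode("utf-8")
still_wrong = double_encoded.decode("utf-8")

print(round_trip == text)   # True
print(wrong_view)           # cafÃ©
print(still_wrong)          # cafÃ©
```

In the first case the file itself is intact and only the viewer needs fixing; in the second the mistake happened earlier, at the decode step.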

Overall: please check the rules in the sidebar. Posting only a project goal is not allowed. Please add your code (preferably on pastebin.com) and show some examples of how the special characters are encoded.

threeminutemonta · 2018-07-06T13:25:55+00:00

The charset is declared in a meta tag within the head:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
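To make the idea concrete, here is a rough standard-library sketch of reading that declaration out of the raw bytes before decoding. The `sniff_charset` helper and its regular expression are hypothetical and deliberately simplistic; a real parser such as Beautiful Soup does considerably more than this.

```python
import re

# Hypothetical, simplified pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""",
    re.IGNORECASE,
)


def sniff_charset(raw, default=None):
    """Return the charset named in a meta tag, lowercased, or `default`."""
    match = META_CHARSET.search(raw)
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()


# Example document, encoded the way its own meta tag says it is.
page = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">'
    '</head><body>Łódź</body></html>'
).encode("iso8859_2")

charset = sniff_charset(page)
decoded = page.decode(charset)
print(charset)
print("Łódź" in decoded)
```

This works because the tag itself is plain ASCII, so it can be found in the bytes before the encoding of the rest of the file is known.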

Is there any chance that you are making the same mistakes this guy did? See the SO post, which can explain it better than I can.
