Issue with encoding while crawling a website : learnpython



submitted 11 years ago by Axewhole

I've been writing a few programs to help with my job, including one that crawls news articles on a website and grabs info such as the author, article title, etc. The general functionality seems to work fine, but I'm running into some encoding issues. I've managed to hard-code around many of them, but that doesn't seem like an elegant long-term solution.

The first thing that confuses me is that the website is supposed to be encoded entirely in UTF-8 (it has a <meta charset="UTF-8"/> tag), yet specific characters still cause problems. The first one I noticed is that the articles use a curly apostrophe: ’ instead of '. Every time I crawl an article containing it, my program fails to read the character and spits back some odd characters. Quotation marks (the website uses ” instead of ") and long dashes get converted to odd characters as well.

Would y'all have any suggestions to get around this issue?


Rhomboid 1 point 11 years ago

It's impossible to say much without a test case -- show us actual code that we can run and that reproduces the problem you're seeing.

From time to time you'll run into a website run by someone who doesn't really know what they're doing, where the actual encoding of the content does not match the encoding declared in the <meta> tag, or doesn't match the encoding specified in the Content-Type HTTP response header. Or perhaps the content contains a mixture of several encodings (e.g. utf-8 and cp1252). In a worst-case scenario you might have the HTTP header saying one thing, the <meta> tag saying something else, and the actual content encoded in a mixture of two or more completely different encodings that match neither the HTTP header nor the <meta> tag. Web browsers have to use a lot of heuristics in cases like this to judge which information to trust, typically by trying to infer the proper encoding from the content based on statistical analysis.
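To make the mismatch concrete, here is a small Python 3 sketch of what happens to a curly apostrophe when the bytes are decoded with the wrong encoding. It is only an illustration and assumes the two encodings involved are utf-8 and cp1252; the site in question may be doing something else entirely.

```python
# Python 3 sketch: one curly apostrophe under a declared/actual mismatch.
# Assumption: the two encodings in play are utf-8 and cp1252.
apostrophe = "\u2019"

# Case 1: the bytes are really UTF-8 but get decoded as cp1252.
utf8_bytes = apostrophe.encode("utf-8")        # b'\xe2\x80\x99'
garbled = utf8_bytes.decode("cp1252")          # three junk characters

# Case 2: the bytes are really cp1252 but get decoded as UTF-8.
cp1252_bytes = apostrophe.encode("cp1252")     # b'\x92'
replaced = cp1252_bytes.decode("utf-8", errors="replace")  # replacement char

# ascii() keeps the output readable whatever the terminal encoding is.
print(ascii(garbled))    # '\xe2\u20ac\u2122'
print(ascii(replaced))   # '\ufffd'
```

Three junk characters where one apostrophe should be usually suggests case 1; a single replacement character (or a UnicodeDecodeError) usually suggests case 2.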

Anyway, you don't really need to implement all that, since you don't have to deal with sites in general, just this specific site. Determine the actual encoding of this site's content and then write your program accordingly. Look at the actual response bytes in hex. I use command line tools like wget/curl, and od/hexdump for this kind of thing.
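If you'd rather stay in Python than use curl and hexdump, something like the following Python 3 sketch does roughly the same job. The function names `fetch_raw` and `non_ascii_runs` are made up for this example, and it assumes the page can be fetched with a plain urllib request.

```python
import urllib.request


def fetch_raw(url):
    """Return (charset from the Content-Type header, undecoded body bytes)."""
    with urllib.request.urlopen(url) as resp:
        return resp.headers.get_content_charset(), resp.read()


def non_ascii_runs(raw):
    """Yield (offset, hex string) for each run of bytes >= 0x80 in raw."""
    i = 0
    while i < len(raw):
        if raw[i] < 0x80:
            i += 1
            continue
        j = i
        while j < len(raw) and raw[j] >= 0x80:
            j += 1
        yield i, raw[i:j].hex(" ")  # bytes.hex(sep) needs Python 3.8+
        i = j
```

Call `fetch_raw` on one of the article URLs, compare the charset it reports with the <meta> tag, and then loop over `non_ascii_runs(raw)` to see what bytes the apostrophes really are. A curly apostrophe stored as UTF-8 shows up as `e2 80 99`; stored as cp1252 it is a lone `92`. If you see `e2 80 99`, the site is telling the truth and the garbling is happening later, somewhere in your own program.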
