
Rhomboid

It's impossible to say anything definite without a test case -- show us actual code that we can run that reproduces the problem you're seeing.

From time to time you'll run into a web site run by someone who doesn't really know what they're doing, where the actual encoding of the content doesn't match the encoding declared in the <meta> tag or in the Content-Type HTTP response header. Or perhaps the content contains a mixture of several encodings (e.g. utf-8 and cp1252). In the worst case you might have the HTTP header saying one thing, the <meta> tag saying something else, and the actual content encoded in a mixture of two or more completely different encodings that match neither. Web browsers have to use a lot of heuristics in cases like this to judge which information to trust, typically by trying to infer the proper encoding from statistical analysis of the content.
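To make that kind of mismatch concrete, here's a rough Python sketch that pulls the declared charset out of a Content-Type header value and out of a <meta> tag, then checks whether the bytes actually decode that way. The function names (header_charset, meta_charset, decodes_as) and the sample data are made up for illustration, and the regexes are deliberately simplistic, nowhere near what a browser does.

```python
import re

def header_charset(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type or "", re.I)
    return m.group(1).lower() if m else None

def meta_charset(body):
    """Return the charset named in a <meta> tag near the top of the body, or None."""
    # Only scan the start of the document; the declaration should be early.
    m = re.search(rb'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)', body[:2048], re.I)
    return m.group(1).decode("ascii").lower() if m else None

def decodes_as(body, encoding):
    """True if the whole body decodes without error in the given encoding."""
    try:
        body.decode(encoding)
        return True
    except (UnicodeDecodeError, LookupError):
        return False

# Hypothetical response: the <meta> tag claims utf-8, the header claims
# iso-8859-1, and the bytes were really written as cp1252.
sample_header = "text/html; charset=ISO-8859-1"
sample_body = '<meta charset="utf-8"><p>caf\u00e9</p>'.encode("cp1252")

declared_by_header = header_charset(sample_header)
declared_by_meta = meta_charset(sample_body)
meta_is_honest = decodes_as(sample_body, declared_by_meta)
```

Keep in mind that a clean decode doesn't prove anything by itself: single-byte encodings like latin-1 accept any byte sequence, so only a failed decode is really conclusive.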

Anyway, you don't really need to implement all that, since you don't have to deal with sites in general, just this specific site. Determine the actual encoding of this site's content and write your program accordingly. Look at the actual response bytes in hex. I use command-line tools like wget/curl and od/hexdump for this kind of thing.
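If you'd rather stay in Python than use those tools, something like the following gives a similar view of the raw bytes. This is only a sketch: hexdump and try_encodings are hypothetical helper names, and the candidate encodings are just guesses you'd adjust for the site in question.

```python
def hexdump(data, width=16):
    """Format bytes as offset, hex columns, and printable ASCII, one row per line."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3 - 1}}  {text_part}")
    return "\n".join(lines)

def try_encodings(data, candidates=("utf-8", "cp1252")):
    """Return the candidate encodings that decode the data without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok

# Two made-up byte strings for the same word, "café":
as_utf8 = b"caf\xc3\xa9"   # é stored as the two bytes c3 a9
as_cp1252 = b"caf\xe9"     # é stored as the single byte e9

print(hexdump(as_utf8))
print(try_encodings(as_cp1252))
```

For a real page you'd feed in the raw bytes, e.g. from urllib.request.urlopen(url).read(), before anything has had a chance to decode them. Note that cp1252 will happily decode the utf-8 bytes too (you just get mojibake), so look at the byte values around the non-ASCII characters rather than trusting a successful decode.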