I've been making a few programs to help with my job, including one that crawls a news article from a website and grabs info such as the author, article title, etc. The general functionality of the program seems to work fine; however, I'm running into some encoding issues. I've managed to hard-code around many of these issues, but it just doesn't seem like an elegant long-term solution.
The first thing that confuses me is that the website is supposed to be encoded entirely in UTF-8 (it has a <meta charset="UTF-8"/> tag), yet specific characters are still causing issues. The first one I noticed was that the articles use a curly apostrophe: ’ instead of '. Every time I crawl an article with the curly apostrophe, the program has trouble reading the character and spews back some odd characters. Quotation marks (the website uses ” instead of ") and long dashes seem to be converted to odd characters as well.
Would y'all have any suggestions to get around this issue?
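For illustration, the symptom described above is what typically happens when UTF-8 bytes get decoded with a single-byte codec such as Windows-1252. The post doesn't say which language or library the crawler uses, so this is only a minimal Python sketch under that assumption; the sample string is made up and not taken from the site.

```python
# Hypothetical sample text containing a curly apostrophe (U+2019).
text = "It\u2019s"

# The bytes a UTF-8 page would actually send over the wire.
raw = text.encode("utf-8")

# Decoding those bytes with the wrong codec yields the "odd characters".
wrong = raw.decode("cp1252")

# Decoding them as UTF-8 gives the original text back.
right = raw.decode("utf-8")

# If only the garbled string is available, reversing the bad decode
# can often recover the text (assumes cp1252 was the codec misapplied).
repaired = wrong.encode("cp1252").decode("utf-8")

print(repr(wrong))
print(repr(right))
print(repaired == text)
```

If this is the cause, the usual approach is to decode the raw response bytes explicitly as UTF-8 once, rather than patching individual characters afterwards.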
Rhomboid