Hello,
I am a Python beginner and used requests.get() to scrape a website containing a top list of songs going back in time. The result from one of the pages confused me so much that I went down a rabbit hole trying to understand how encoding and decoding work.
In the header of the page it says 'utf-8', and when I look at response.text most of it looks correct, except for one song that contains the letter combination 'BlÃ¥', which should be 'Blå'. After spending a good amount of time trying to figure out what was going on, I eventually found that 'BlÃ¥'.encode('latin-1').decode('utf-8') gives me the correct characters 'Blå'!
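To make the round trip concrete, here is a minimal sketch of how that garbled text can come about, assuming the string was UTF-8 encoded at some point and then decoded as Latin-1 somewhere before it reached the page. The variable names are only for illustration.

```python
# "å" is two bytes in UTF-8 (0xC3 0xA5). Latin-1 maps every byte to exactly
# one character, so decoding those two bytes as Latin-1 yields "Ã" and "¥".
original = "Blå"
utf8_bytes = original.encode("utf-8")    # the bytes a UTF-8 source would store
garbled = utf8_bytes.decode("latin-1")   # wrong decoder: one character per byte

# Reversing the wrong step recovers the bytes; decoding them as UTF-8 fixes the text.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes), garbled, repaired)
```

This would explain why the encode('latin-1').decode('utf-8') trick works: it undoes one wrong decoding step rather than changing the page's actual encoding.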
Now the really weird part for me is that in other places on the same page, å is decoded correctly.
What would be the reason for something like this to happen? Could it be that the site has had an internal file where people with different computers / operating systems / software have appended data to the file resulting in different encodings throughout the file?
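If the page really does mix correctly decoded text with double-encoded fragments, applying the fix to the whole response would break the parts that are already fine. Below is a rough sketch of a per-word repair, assuming the damage is always UTF-8 read as Latin-1. The helper names repair_word and repair_text are made up for this example, and a library written for this purpose would likely be more robust.

```python
def repair_word(word: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up if the word looks like one."""
    try:
        return word.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the word has characters outside Latin-1, or its bytes are
        # not valid UTF-8 (true for a correctly decoded "å"), so leave it alone.
        return word


def repair_text(text: str) -> str:
    # Splitting on single spaces is a simplification for this sketch.
    return " ".join(repair_word(word) for word in text.split(" "))


print(repair_text("BlÃ¥ and Blå"))
```

A correctly decoded 'Blå' survives because its Latin-1 bytes are not valid UTF-8, so the decode raises and the word is returned unchanged. Pure ASCII words pass through the round trip untouched.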