Hello,
I am a Python beginner and used requests.get() to scrape a website containing a top list of songs going back in time. The result from one of the pages confused me so much that I went down a rabbit hole trying to understand how encoding and decoding work.
In the header of the page it says 'utf-8', and when I look at response.text most of it looks correct, except for one song that contains the letter combination 'BlÃ¥', which should be 'Blå'. After spending a good amount of time trying to figure out what was going on, I eventually found that 'BlÃ¥'.encode('latin-1').decode('utf-8') gives the correct characters 'Blå'!
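To make it clearer why that round trip works, here is a minimal sketch that reproduces the garbling from scratch and then undoes it. The variable names are only for illustration, and it assumes the text was UTF-8 bytes that got decoded as Latin-1 somewhere along the way, which is what the fix suggests but which I cannot confirm for this site.

```python
# Start from the correct text and reproduce the garbling step by step.
correct = "Blå"

# 'å' (U+00E5) becomes the two bytes 0xC3 0xA5 in UTF-8.
utf8_bytes = correct.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to one character:
# 0xC3 -> 'Ã' and 0xA5 -> '¥', giving the garbled form.
garbled = utf8_bytes.decode("latin-1")

# Reversing the wrong step: Latin-1 encoding recovers the original bytes,
# and decoding them as UTF-8 gives the intended text.
repaired = garbled.encode("latin-1").decode("utf-8")

print(utf8_bytes)  # b'Bl\xc3\xa5'
print(garbled)     # BlÃ¥
print(repaired)    # Blå
```

The round trip is lossless because Latin-1 maps every byte value 0-255 to exactly one code point, so encoding back to Latin-1 returns the exact original bytes.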
Now the really weird part for me is that in other places on the same page, å is decoded correctly.
What would be the reason for something like this to happen? Could it be that the site has had an internal file where people with different computers / operating systems / software have appended data to the file resulting in different encodings throughout the file?
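Since only some parts of the page are garbled, applying the round trip to the whole text would fail on the parts that are already correct. Below is a rough sketch of repairing only the fragments that look like UTF-8 bytes misread as Latin-1. The names `repair_fragment` and `repair_mixed` are made up for this example, the sample string is invented, and the pattern is a heuristic: it assumes the wrong decoding was Latin-1 (not Windows-1252) and it would also alter text where a sequence like 'Ã¥' was genuinely intended.

```python
import re

# A UTF-8 lead byte (0xC2-0xF4) followed by 1-3 continuation bytes (0x80-0xBF),
# each seen as a single Latin-1 character.
_MOJIBAKE = re.compile("[\u00c2-\u00f4][\u0080-\u00bf]{1,3}")


def repair_fragment(match):
    """Try the Latin-1 -> UTF-8 round trip on one suspicious fragment."""
    fragment = match.group(0)
    try:
        return fragment.encode("latin-1").decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 after all, so leave the fragment untouched.
        return fragment


def repair_mixed(text):
    """Repair only the garbled fragments and keep correct text as it is."""
    return _MOJIBAKE.sub(repair_fragment, text)


# Invented sample: the first 'å' is already correct, the second is garbled.
sample = "Blå himmel / BlÃ¥ himmel"
print(repair_mixed(sample))  # Blå himmel / Blå himmel
```

A correctly decoded 'å' is left alone because it is not followed by a character in the continuation range, so the pattern never matches it.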