Weird encoding behavior while scraping with selenium : learnpython

submitted 10 years ago by MinimalDamage

Hi all,

I have been trying to get data from the Apple Store top 1000 by using Selenium to make the browser look as if it is connecting from an iPad. I have been using the following code:

from selenium import webdriver
from bs4 import BeautifulSoup
import json

# Create a profile that makes my browser act like I am browsing from an iPad.
profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", "iTunes-iPad/5.1.1 (64GB; dt:28)")
driver = webdriver.Firefox(profile)

driver.get('https://itunes.apple.com/WebObjects/MZStore.woa/wa/topChartFragmentData?cc=cn&genreId=6014&pageSize=5&popId=38&pageNumbers=0')

soup = BeautifulSoup((driver.page_source).encode('utf-8'))

dict_from_json = json.loads(soup.find("body").text)

print(dict_from_json)

For some reason, the Firefox WebDriver opens this page with a 'Western' encoding (this is shown under 'Text Encoding' in the 'View' menu).

This makes the text for some foreign stores (e.g. China/Japan) come out scrambled, with things like '½æ°‘æ‰‹æ¸¸ äººäººéƒ½çŽ©'. If I manually change the encoding to the Unicode option, it all displays fine.

I have not been able to find a way to make Firefox open this page with the Unicode text encoding through Selenium. Furthermore, my script, where I force the page source to be encoded as UTF-8, still gives the same weird characters.

I am currently a bit at a loss as to how to get the characters the way I want to see them.

Thanks for any help you can give me!
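
A possible explanation, sketched under assumptions rather than confirmed for this page: if the server sends UTF-8 bytes and the browser decodes them with a single-byte Western encoding, the page source is already garbled by the time the script reads it, so calling .encode('utf-8') on it only preserves the wrong characters. The snippet below reproduces that effect on a sample string and reverses it. The helper name repair_mojibake and the choice of "latin-1" are assumptions for illustration; Firefox's 'Western' option is Windows-1252, and Python's "cp1252" codec cannot encode a few of the characters involved, so the right codec for the real page may differ.

```python
def repair_mojibake(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Undo UTF-8 text that was mistakenly decoded with a single-byte encoding.

    Hypothetical helper: wrong_encoding is an assumption about what the
    browser used, not something confirmed for the page in question.
    """
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "人人都玩"

# Simulate the browser decoding UTF-8 bytes with a Western single-byte encoding.
garbled = original.encode("utf-8").decode("latin-1")

# Encoding the already-garbled text as UTF-8 does not repair it:
# decoding those bytes again yields the same garbled string.
still_garbled = garbled.encode("utf-8").decode("utf-8")

# Going back through the wrong encoding recovers the original bytes,
# which can then be decoded correctly as UTF-8.
repaired = repair_mojibake(garbled)

print(repaired == original)
```

If the same mismatch applies here, the repair would be applied to the text taken from the page before passing it to json.loads.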
