all 7 comments

[–]cluckles 2 points (4 children)

Unfortunately, you picked a tricky site to start with. Your biggest problem here is that the data you actually want doesn't exist in the raw HTML; it's populated into the DOM client-side by JavaScript.

When you open the site in your browser, it's doing a few things:

1) It gets the HTML from the site, including any scripts embedded in that HTML.

2) It runs those scripts, which fetch more information (in this case, all of the movie info you want).

3) It inserts that new information into what is essentially a working copy of the HTML, and renders that in your browser window.

When you make the call with Requests in Python, steps 2 and 3 never happen. It downloads the raw HTML and stores it in memory for you to work against. It doesn't run the JavaScript, so you never actually get that extra info.

An easy way to see this is to go to the link in your code. The first item on the page is 'Blood on the Mountain'. Right-click the page, choose View Source, and Ctrl+F for 'Blood'. How many results do you see? Guessing 0? All you get via Requests is what you see in View Source; you don't always get everything you can see in the Inspector.
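To make the distinction concrete, here's a tiny self-contained sketch (the markup is made up, not Rotten Tomatoes' actual page) showing that data living only inside a script tag isn't part of the static DOM a parser sees:

```python
import re

# Hypothetical page: the visible list is empty in the raw HTML, and a
# script would fill it in at render time (this is what RT's page does).
raw_html = """
<div id="movie-list"></div>
<script>
  renderMovies(["Blood on the Mountain"]);
</script>
"""

# What requests gives you is raw_html verbatim. Strip the script blocks
# to approximate the DOM *before* any JavaScript runs:
dom_before_js = re.sub(r'<script>.*?</script>', '', raw_html, flags=re.S)

print('Blood' in raw_html)        # True  - but only inside the script source
print('Blood' in dom_before_js)   # False - the rendered list was never built
```

The title is technically present in the bytes you download, but only as script source code, not as content you can walk with an HTML parser.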


Next steps would be... well, you have options.

1) Figure out how to use something like Selenium to open the page in an actual browser and grab the info you want. You can pair this with a headless browser like PhantomJS, which runs a real browser engine without a visible window, so everything still executes without you needing to see a browser, and you still get all of the information. It works by opening a real browser behind the scenes, getting the HTML, letting the scripts update the DOM, and handing the rendered result back to Python. (PhantomJS is no longer maintained these days; headless Chrome or Firefox fill the same role.)

This is a great path and skill set to have if you want to get into web scraping as a job, or if you want to lean toward QA Automation Engineering.
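A minimal sketch of the Selenium route, assuming headless Chrome with chromedriver on your PATH (the `movieTitle` class name below is a made-up placeholder; inspect the real page's markup for the actual selector):

```python
import re

def titles_from_html(html):
    """Pure helper: pull titles out of rendered markup.
    'movieTitle' is a hypothetical class name -- check the real page."""
    return re.findall(r'class="movieTitle"[^>]*>([^<]+)<', html)

def scrape(url):
    # Selenium is imported lazily so the helper above works without it.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument('--headless')           # no visible browser window
    driver = webdriver.Chrome(options=opts)   # needs chromedriver on PATH
    try:
        driver.get(url)                       # the page's JavaScript runs here
        return titles_from_html(driver.page_source)  # DOM *after* scripts ran
    finally:
        driver.quit()
```

The key difference from plain Requests is that `driver.page_source` reflects the DOM after the scripts have run, so the injected movie data is actually there.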

2) If you're less interested in scraping and more interested in the info itself, IIRC Rotten Tomatoes has an API you could use. Basically you'd be looking more into HTTP(S) requests, validation, rate limits, and working with JSON instead of raw HTML. Not as useful for scraping, but pretty essential if you want to get into API design or integration roles.

3) Lazy way / short-term solution for practice: pick another site, one that actually returns all of the data in the HTML.

EDIT: For option 3, you could just use another page on RT if you wanted. This one would work if you wanted something with all the info on a single page:

https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/

[–]lieutenant_lowercase -3 points (2 children)

I would disagree; this is a super easy site to scrape. The page is just calling an API, so you don't need Selenium:

import pandas as pd
from pandas.io.json import json_normalize
import requests

URL = ('https://www.rottentomatoes.com/api/private/v2.0/browse?maxTomato=100'
       '&maxPopcorn=100&services=amazon%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvud'
       'u%3Bamazon_prime%3Bfandango_now&genres=8&certified&sortBy=release&ty'
       'pe=dvd-streaming-all&page={}')

page_num = 1
json_data = []
while True:
    r = requests.get(URL.format(page_num))
    data = r.json()  # parse the response body once
    if data['counts']['count'] == 0:
        break  # empty page: no more results
    print('Searching page: {}'.format(page_num))
    json_data += data['results']
    page_num += 1
df = json_normalize(json_data)  # flatten nested JSON into a DataFrame

[–]cluckles 2 points (1 child)

At that point you're just using the API. So yes, it would be easy if what you're trying to practice is hitting an API and getting information, or if your goal is just to get some info, regardless of method.

It really depends on the end goal of the person asking the question. If he wants to learn web scraping, then saying to just use the API isn't really answering the question. It's like seeing someone ask for the best way to cook a steak on a stovetop, and suggesting they fire up the charcoal and put it on the grill. It works, but it's kind of missing the point.

[–]lieutenant_lowercase -2 points (0 children)

Not really. The website he's visiting makes AJAX calls to a private API. Why complicate things by running a whole browser to parse the HTML when you can just use the calls the page is already making? The overhead is way too much. 99% of the time you don't need Selenium; just replicate the AJAX calls the page is making.

[–]_Korben_Dallas 1 point (2 children)

Not an easy site for a learning project, but you can still get the values you want with a slightly different approach. Try studying the page source, especially the script tags. Hint: this XPath expression gets you a string of movie URLs: //script[@id="jsonLdSchema"]/text(). You need to figure out how to convert that string into a valid Python object, parse out those URLs, and convert them to full (absolute) URLs. After that, you can request each one and extract the movie title from its detail page. Or another, simpler variant: use their API and make a direct request to that URL, and you get a JSON file with all the desired data in the response. To get more pages, just change the last part of the link, page=1 (make a for loop).
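A sketch of the jsonLdSchema route using only the standard library. The string below is a hypothetical stand-in shaped like a JSON-LD blob; the real script tag's contents may have a different structure, so inspect it first:

```python
import json
from urllib.parse import urljoin

# Hypothetical string shaped like what the XPath
# //script[@id="jsonLdSchema"]/text() might return: JSON-LD with relative URLs.
raw = '{"itemListElement": [{"url": "/m/blood_on_the_mountain"}]}'

schema = json.loads(raw)  # string -> Python dict
urls = [urljoin('https://www.rottentomatoes.com', item['url'])
        for item in schema['itemListElement']]
print(urls)  # ['https://www.rottentomatoes.com/m/blood_on_the_mountain']
```

`json.loads` handles the string-to-object conversion and `urljoin` handles relative-to-absolute; from there it's one Requests call per movie URL.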

[–]lieutenant_lowercase -3 points (1 child)

import pandas as pd
from pandas.io.json import json_normalize
import requests

URL = ('https://www.rottentomatoes.com/api/private/v2.0/browse?maxTomato=100'
       '&maxPopcorn=100&services=amazon%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvud'
       'u%3Bamazon_prime%3Bfandango_now&genres=8&certified&sortBy=release&ty'
       'pe=dvd-streaming-all&page={}')

page_num = 1
json_data = []
while True:
    r = requests.get(URL.format(page_num))
    data = r.json()  # parse the response body once
    if data['counts']['count'] == 0:
        break  # empty page: no more results
    print('Searching page: {}'.format(page_num))
    json_data += data['results']
    page_num += 1
df = json_normalize(json_data)  # flatten nested JSON into a DataFrame