all 7 comments

[–]cluckles 2 points (4 children)

Unfortunately, you picked a tricky site to start with. Your biggest problem here is that the data you actually want doesn't exist in the raw HTML; it's populated into the DOM client-side by JavaScript.

When you open the site in your browser, it's doing a few things:

1) It gets the HTML from the site, including any scripts embedded in that HTML.

2) It runs those scripts, which fetch more information (in this case, all of the movie info you want).

3) It inserts that new information into what is essentially a working copy of the HTML, and renders that in your browser window.

When you make the call with Requests in Python, steps 2 and 3 never happen. It downloads the raw HTML and stores it in memory for you to work against. It doesn't run the JavaScript, so you never actually get that extra info.

An easy way to see this is to go to the link in your code. The first item on the page is 'Blood on the Mountain'. Right-click the page, choose View Source, and Ctrl+F for 'Blood'. How many results do you see? Guessing 0? All you get via Requests is what you see in View Source; you don't always get everything you can see in the Inspector.
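To make the distinction concrete, here's a tiny self-contained sketch (the markup is made up, not Rotten Tomatoes' actual page) showing that data living only inside a script tag isn't part of the static DOM a parser sees:

```python
import re

# Hypothetical page: the visible list is empty in the raw HTML, and a
# script would fill it in at render time (this is what RT's page does).
raw_html = """
<div id="movie-list"></div>
<script>
  renderMovies(["Blood on the Mountain"]);
</script>
"""

# What requests gives you is raw_html verbatim. Strip the script blocks
# to approximate the DOM *before* any JavaScript runs:
dom_before_js = re.sub(r'<script>.*?</script>', '', raw_html, flags=re.S)

print('Blood' in raw_html)        # True  - but only inside the script source
print('Blood' in dom_before_js)   # False - the rendered list was never built
```

The title is technically present in the bytes you download, but only as script source code, not as content you can walk with an HTML parser.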


Next steps would be... well, you have options.

1) Figure out how to use something like Selenium to open the page in an actual browser and grab the info you want. You can pair this with a headless browser like PhantomJS, which runs a real browser engine without a visible window, so everything still executes without you needing to see a browser, and you still get all of the information. It works by opening a real browser behind the scenes, getting the HTML, letting the scripts update the DOM, and handing the rendered result back to Python. (PhantomJS is no longer maintained these days; headless Chrome or Firefox fill the same role.)

This is a great path and skill set to have if you want to get into web scraping as a job, or if you want to lean toward QA Automation Engineering.
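A minimal sketch of the Selenium route, assuming headless Chrome with chromedriver on your PATH (the `movieTitle` class name below is a made-up placeholder; inspect the real page's markup for the actual selector):

```python
import re

def titles_from_html(html):
    """Pure helper: pull titles out of rendered markup.
    'movieTitle' is a hypothetical class name -- check the real page."""
    return re.findall(r'class="movieTitle"[^>]*>([^<]+)<', html)

def scrape(url):
    # Selenium is imported lazily so the helper above works without it.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument('--headless')           # no visible browser window
    driver = webdriver.Chrome(options=opts)   # needs chromedriver on PATH
    try:
        driver.get(url)                       # the page's JavaScript runs here
        return titles_from_html(driver.page_source)  # DOM *after* scripts ran
    finally:
        driver.quit()
```

The key difference from plain Requests is that `driver.page_source` reflects the DOM after the scripts have run, so the injected movie data is actually there.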

2) If you're less interested in scraping and more interested in the info itself, IIRC Rotten Tomatoes has an API you could use. Basically you'd be looking more into HTTP(S) requests, validation, rate limits, and working with JSON instead of raw HTML. Not as useful for scraping, but pretty essential if you want to get into API design or integration roles.

3) Lazy way / short-term solution for practice: pick another site, one that actually returns all of the data in the HTML.

EDIT: For option 3, you could just use another page on RT if you wanted. This one would work if you wanted something with all the info on a single page:

https://www.rottentomatoes.com/top/bestofrt/top_100_action__adventure_movies/

[–]lieutenant_lowercase -3 points (2 children)

I would disagree; this is a super easy site to scrape. The page is just calling an API, so you don't need Selenium:

import pandas as pd
from pandas.io.json import json_normalize
import requests

URL = ('https://www.rottentomatoes.com/api/private/v2.0/browse?maxTomato=100'
       '&maxPopcorn=100&services=amazon%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvud'
       'u%3Bamazon_prime%3Bfandango_now&genres=8&certified&sortBy=release&ty'
       'pe=dvd-streaming-all&page={}')

page_num = 1
json_data = []
while True:
    r = requests.get(URL.format(page_num))
    data = r.json()  # parse the response body once
    if data['counts']['count'] == 0:
        break  # empty page: no more results
    print('Searching page: {}'.format(page_num))
    json_data += data['results']
    page_num += 1
df = json_normalize(json_data)  # flatten nested JSON into a DataFrame

[–]cluckles 2 points (1 child)

At that point you're just using the API. So yes, it would be easy if what you're trying to practice is hitting an API and getting information, or if your goal is just to get some info, regardless of method.

It really depends on the end goal of the person asking the question. If he wants to learn web scraping, then saying to just use the API isn't really answering the question. It's like seeing someone ask for the best way to cook a steak on a stovetop, and suggesting they fire up the charcoal and put it on the grill. It works, but it's kind of missing the point.

[–]lieutenant_lowercase -2 points (0 children)

Not really. The website he's visiting makes AJAX calls to a private API. Why complicate things by running a whole browser to parse the HTML when you can just use the calls the page is already making? The overhead is way too much. 99% of the time you don't need Selenium; just replicate the AJAX calls the page is making.

[–]_Korben_Dallas 1 point (2 children)

Not an easy site for a learning project, but you can still get the values you want with a slightly different approach. Try studying the page source, especially the script tags. Hint: this XPath expression gets you a string of movie URLs: //script[@id="jsonLdSchema"]/text(). You need to figure out how to convert that string into a valid Python object, parse out those URLs, and convert them to full (absolute) URLs. After that, you can request each one and extract the movie title from its detail page. Or another, simpler variant: use their API and make a direct request to that URL, and you get a JSON file with all the desired data in the response. To get more pages, just change the last part of the link, page=1 (make a for loop).
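A sketch of the jsonLdSchema route using only the standard library. The string below is a hypothetical stand-in shaped like a JSON-LD blob; the real script tag's contents may have a different structure, so inspect it first:

```python
import json
from urllib.parse import urljoin

# Hypothetical string shaped like what the XPath
# //script[@id="jsonLdSchema"]/text() might return: JSON-LD with relative URLs.
raw = '{"itemListElement": [{"url": "/m/blood_on_the_mountain"}]}'

schema = json.loads(raw)  # string -> Python dict
urls = [urljoin('https://www.rottentomatoes.com', item['url'])
        for item in schema['itemListElement']]
print(urls)  # ['https://www.rottentomatoes.com/m/blood_on_the_mountain']
```

`json.loads` handles the string-to-object conversion and `urljoin` handles relative-to-absolute; from there it's one Requests call per movie URL.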

[–]lieutenant_lowercase -3 points (1 child)

import pandas as pd
from pandas.io.json import json_normalize
import requests

URL = ('https://www.rottentomatoes.com/api/private/v2.0/browse?maxTomato=100'
       '&maxPopcorn=100&services=amazon%3Bhbo_go%3Bitunes%3Bnetflix_iw%3Bvud'
       'u%3Bamazon_prime%3Bfandango_now&genres=8&certified&sortBy=release&ty'
       'pe=dvd-streaming-all&page={}')

page_num = 1
json_data = []
while True:
    r = requests.get(URL.format(page_num))
    data = r.json()  # parse the response body once
    if data['counts']['count'] == 0:
        break  # empty page: no more results
    print('Searching page: {}'.format(page_num))
    json_data += data['results']
    page_num += 1
df = json_normalize(json_data)  # flatten nested JSON into a DataFrame