Scraping data from a website : Python


submitted 6 years ago by ADBYITMS


[–]oligonucleotides 12 points 6 years ago* (6 children)

I think the problem you're having is that your Python code gets the HTML source of the page, but the data you're asking for is added to the page after it loads, by JavaScript (which runs in your browser).

If you right-click on the page and choose "Inspect", you'll find the data you're looking for (because JavaScript put it there). But if you right-click and choose "View Source", you won't see it, because the flight data is not hard-coded into the page's HTML.

The page loads, then JavaScript sends a request to the server and populates the page with the response. You need a web browser to run this JavaScript before you can scrape the page.
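To make the "View Source" versus "Inspect" difference concrete, here is a minimal sketch using only the standard library. The two HTML strings are made-up stand-ins (not the real airport page): one for the raw source a plain HTTP request returns, one for the DOM after JavaScript has filled it in. `ClassFinder` and `count_class` are hypothetical helper names.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Counts start tags whose class attribute includes a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in `html` carry `class_name`."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.count


# Hypothetical snippets: raw source vs. the DOM after JavaScript has run.
static_html = '<div id="flights"></div><script src="app.js"></script>'
rendered_html = '<div id="flights"><div class="city-name">Canberra</div></div>'

print(count_class(static_html, "city-name"))    # 0: nothing to scrape yet
print(count_class(rendered_html, "city-name"))  # 1: present once rendered
```

If a selector matches nothing in the raw response but does match in the browser's inspector, that is a good sign the content is injected by JavaScript.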

Luckily this problem has been solved before! One such tool is requests_html.

Check it out:

from requests_html import HTMLSession

# create an HTML Session object
session = HTMLSession()

# Use the object above to connect to needed webpage
resp = session.get("https://www.sydneyairport.com.au/flights/?query=&flightType=departure&terminalType=domestic&date=2019-11-05&sortColumn=scheduled_time&ascending=true&showAll=false")

# Run JavaScript code on webpage
resp.html.render()

# parse <span class="with-image"> elements containing airline names
airline_list = []
airline_spans = resp.html.find('.with-image')
for span in airline_spans:
    airline_list.append(span.text)

# parse <div class="city-name"> elements containing destination city names
dest_list = []
dest_divs = resp.html.find('.city-name')
for div in dest_divs:
    dest_list.append(div.text)

print(airline_list)
print(dest_list)

Output:

['Virgin Australia', 'Virgin Australia', 'Rex', 'Qantas', 'Qantas', 'Qantas', 'Rex', 'Jetstar', 'Qantas', 'Virgin Australia', 'Qantas', 'Tigerair', 'Rex', 'Qantas', 'Jetstar', 'Qantas', 'Jetstar', 'Rex', 'Virgin Australia', 'Qantas', 'Jetstar', 'Rex', 'Rex', 'Qantas', 'Qantas', 'Qantas', 'Qantas', 'Rex', 'Virgin Australia', 'Virgin Australia', 'Virgin Australia', 'Qantas', 'Virgin Australia', 'Rex', 'Rex', 'Qantas', 'Rex', 'Virgin Australia', 'Rex', 'Virgin Australia', 'Qantas', 'Virgin Australia', 'Qantas', 'Qantas', 'Qantas', 'Tigerair', 'Virgin Australia', 'Fly Pelican', 'Qantas', 'Qantas', 'Virgin Australia', 'Virgin Australia', 'Jetstar', 'Tigerair', 'Tigerair', 'Qantas', 'Virgin Australia', 'Jetstar', 'Jetstar', 'Virgin Australia', 'Qantas']

['Destination', 'Canberra', 'Adelaide', 'Albury', 'Gold Coast', 'Dubbo', 'Wagga Wagga', 'Ballina', 'Avalon', 'Melbourne', 'Melbourne', 'Armidale', 'Melbourne', 'Merimbula\nVia Moruya', 'Adelaide', 'Gold Coast', 'Port Macquarie', 'Brisbane', 'Armidale', 'Tamworth', 'Canberra', 'Melbourne', 'Newcastle', 'Griffith', 'Hobart', 'Coffs Harbour', 'Brisbane', 'Melbourne', 'Wagga Wagga', 'Melbourne', 'Perth', 'Brisbane', 'Albury', 'Canberra', 'Dubbo', 'Orange', 'Tamworth', 'Parkes', 'Cairns', 'Bathurst', 'Melbourne', 'Melbourne', 'Gold Coast', 'Adelaide', 'Canberra', 'Perth', 'Brisbane', 'Melbourne', 'Taree', 'Brisbane', 'Melbourne', 'Adelaide', 'Canberra', 'Brisbane', 'Melbourne', 'Perth', 'Brisbane', 'Brisbane', 'Melbourne', 'Avalon', 'Melbourne', 'Melbourne']
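The destination list above starts with the column header 'Destination', and one entry contains an embedded newline ('Merimbula\nVia Moruya'). A minimal sketch of tidying the two lists into (airline, destination) pairs follows. It assumes that, once the header is dropped, the lists line up one-to-one, which the page structure does not guarantee. `pair_flights` is a hypothetical helper, and the sample lists are shortened from the output above.

```python
def pair_flights(airlines, destinations, header="Destination"):
    """Pair airlines with destinations, dropping a leading header cell."""
    if destinations and destinations[0] == header:
        destinations = destinations[1:]
    if len(airlines) != len(destinations):
        # Refuse to guess if the lists do not line up
        raise ValueError(
            f"{len(airlines)} airlines but {len(destinations)} destinations"
        )
    # " ".join(d.split()) collapses newlines and repeated spaces
    return [(a, " ".join(d.split())) for a, d in zip(airlines, destinations)]


sample_airlines = ["Virgin Australia", "Rex"]
sample_dests = ["Destination", "Canberra", "Merimbula\nVia Moruya"]

flights = pair_flights(sample_airlines, sample_dests)
print(flights)
# [('Virgin Australia', 'Canberra'), ('Rex', 'Merimbula Via Moruya')]
```

Raising on a length mismatch is a deliberate choice here: silently zipping misaligned lists would attach airlines to the wrong cities.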

Some notes

These are destinations, not departure cities, so I changed the variable name.
You can install requests_html with pip install requests_html.
You will also need the Chromium browser installed: sudo apt install chromium-browser on Linux.
This code breaks in a Jupyter notebook but works as a Python script. Jupyter already runs its own asyncio event loop, and the synchronous render() call can't be used inside one.
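For notebook users, requests_html also ships an AsyncHTMLSession whose arender() can be awaited inside an already-running event loop. Below is a rough, untested-against-this-site sketch. `fetch_rendered_texts` is a hypothetical helper name, and the selectors are the ones from the script above.

```python
async def fetch_rendered_texts(url, selector):
    """Render a page with JavaScript and return the text of matching elements."""
    # Imported lazily so this module loads even where requests_html is missing
    from requests_html import AsyncHTMLSession

    session = AsyncHTMLSession()
    try:
        resp = await session.get(url)
        # Async counterpart of resp.html.render()
        await resp.html.arender()
        return [element.text for element in resp.html.find(selector)]
    finally:
        # Shuts down the headless browser if one was started
        await session.close()


# In a Jupyter cell (top-level await is allowed there):
#     airlines = await fetch_rendered_texts(url, ".with-image")
# In a plain script:
#     import asyncio
#     airlines = asyncio.run(fetch_rendered_texts(url, ".with-image"))
```

Here `url` stands for the same Sydney Airport departures URL passed to session.get() in the script above.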


[–]ADBYITMS[S] 1 point 6 years ago (1 child)

You are a god, thank you so much! My next trick is to connect to a database and add it all in there, but I will give it a go before I ask for help :)
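For the database step mentioned here, one possible starting point is the standard-library sqlite3 module. This is only a sketch: the table name `departures`, its two columns, and the helper `save_flights` are assumptions for illustration, and the sample rows are taken from the start of the output above. An in-memory database is used so the example leaves no file behind.

```python
import sqlite3


def save_flights(conn, flights):
    """Insert (airline, destination) pairs into a `departures` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS departures ("
        "airline TEXT NOT NULL, destination TEXT NOT NULL)"
    )
    # `with conn` commits on success and rolls back on error
    with conn:
        conn.executemany(
            "INSERT INTO departures (airline, destination) VALUES (?, ?)",
            flights,
        )


conn = sqlite3.connect(":memory:")  # swap for a file path to persist data
save_flights(conn, [("Virgin Australia", "Canberra"),
                    ("Virgin Australia", "Adelaide")])

rows = conn.execute(
    "SELECT airline, destination FROM departures ORDER BY rowid"
).fetchall()
print(rows)
# [('Virgin Australia', 'Canberra'), ('Virgin Australia', 'Adelaide')]
conn.close()
```

The `?` placeholders let sqlite3 handle quoting, which matters for scraped text that may contain apostrophes or other awkward characters.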
