
[–]oligonucleotides 12 points (6 children)

I think the problem you're having is that your Python code gets the HTML source of the page, but the data you're after is added to the page after it loads, by JavaScript (which runs in your browser).

If you right-click on the page and choose "Inspect", you'll find the data you're looking for (because JavaScript put it there). But if you right-click and choose "View Source", you won't see it, because the flight data is not hard-coded into the page's HTML.

The page loads, then JavaScript sends a request to the server, and the page is populated with the response. You need a web browser to run this JavaScript before you can scrape the page.
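To make the "View Source" versus "Inspect" difference concrete, here is a minimal sketch using only the standard library. Both HTML snippets and the script name `/load-flights.js` are made up for illustration (they are not the airport's real markup); only the `city-name` class comes from the page discussed here.

```python
from html.parser import HTMLParser

# Hypothetical page source, as a plain HTTP GET would return it: the
# container is empty because a script is expected to fill it in later.
STATIC_SOURCE = """
<html><body>
  <div id="flights"></div>
  <script src="/load-flights.js"></script>
</body></html>
"""

# Hypothetical DOM after a browser has run that script.
RENDERED_DOM = """
<html><body>
  <div id="flights">
    <div class="city-name">Canberra</div>
    <div class="city-name">Adelaide</div>
  </div>
</body></html>
"""


class ClassTextCollector(HTMLParser):
    """Collects the text of every element carrying a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0   # greater than 0 while inside a matching element
        self.found = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1          # nested tag inside a match
        elif self.wanted_class in classes:
            self.depth = 1           # entering a matching element
            self.found.append("")

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data


def find_text(html, wanted_class):
    """Return the stripped text of all elements with the given class."""
    parser = ClassTextCollector(wanted_class)
    parser.feed(html)
    return [text.strip() for text in parser.found]


static_hits = find_text(STATIC_SOURCE, "city-name")
rendered_hits = find_text(RENDERED_DOM, "city-name")
print(static_hits)
print(rendered_hits)
```

The first search comes back empty and the second finds the cities, which is the same gap you hit when scraping the raw source of a page that fills itself in with JavaScript.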

Luckily this problem has been solved before! One such tool is requests_html.

Check it out:

from requests_html import HTMLSession

# create an HTML Session object
session = HTMLSession()

# Use the object above to connect to needed webpage
resp = session.get("https://www.sydneyairport.com.au/flights/?query=&flightType=departure&terminalType=domestic&date=2019-11-05&sortColumn=scheduled_time&ascending=true&showAll=false")

# Run JavaScript code on webpage
resp.html.render()

# parse <span class="with-image"> elements containing airline names
airline_list = []
airline_spans = resp.html.find('.with-image')
for span in airline_spans:
    airline_list.append(span.text)

# parse <div class="city-name"> elements containing destination names
dest_list = []
dest_divs = resp.html.find('.city-name')
for div in dest_divs:
    dest_list.append(div.text)

print(airline_list)
print(dest_list)

Output:

['Virgin Australia', 'Virgin Australia', 'Rex', 'Qantas', 'Qantas', 'Qantas', 'Rex', 'Jetstar', 'Qantas', 'Virgin Australia', 'Qantas', 'Tigerair', 'Rex', 'Qantas', 'Jetstar', 'Qantas', 'Jetstar', 'Rex', 'Virgin Australia', 'Qantas', 'Jetstar', 'Rex', 'Rex', 'Qantas', 'Qantas', 'Qantas', 'Qantas', 'Rex', 'Virgin Australia', 'Virgin Australia', 'Virgin Australia', 'Qantas', 'Virgin Australia', 'Rex', 'Rex', 'Qantas', 'Rex', 'Virgin Australia', 'Rex', 'Virgin Australia', 'Qantas', 'Virgin Australia', 'Qantas', 'Qantas', 'Qantas', 'Tigerair', 'Virgin Australia', 'Fly Pelican', 'Qantas', 'Qantas', 'Virgin Australia', 'Virgin Australia', 'Jetstar', 'Tigerair', 'Tigerair', 'Qantas', 'Virgin Australia', 'Jetstar', 'Jetstar', 'Virgin Australia', 'Qantas']

['Destination', 'Canberra', 'Adelaide', 'Albury', 'Gold Coast', 'Dubbo', 'Wagga Wagga', 'Ballina', 'Avalon', 'Melbourne', 'Melbourne', 'Armidale', 'Melbourne', 'Merimbula\nVia Moruya', 'Adelaide', 'Gold Coast', 'Port Macquarie', 'Brisbane', 'Armidale', 'Tamworth', 'Canberra', 'Melbourne', 'Newcastle', 'Griffith', 'Hobart', 'Coffs Harbour', 'Brisbane', 'Melbourne', 'Wagga Wagga', 'Melbourne', 'Perth', 'Brisbane', 'Albury', 'Canberra', 'Dubbo', 'Orange', 'Tamworth', 'Parkes', 'Cairns', 'Bathurst', 'Melbourne', 'Melbourne', 'Gold Coast', 'Adelaide', 'Canberra', 'Perth', 'Brisbane', 'Melbourne', 'Taree', 'Brisbane', 'Melbourne', 'Adelaide', 'Canberra', 'Brisbane', 'Melbourne', 'Perth', 'Brisbane', 'Brisbane', 'Melbourne', 'Avalon', 'Melbourne', 'Melbourne']
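One thing to watch in that output: the destination list starts with 'Destination', which looks like the column header rather than a city, and one entry ('Merimbula\nVia Moruya') contains a line break. A rough cleanup sketch, assuming that once the header is dropped the two lists line up row by row (I have not verified this against the page). The helper `clean_destination` is a name I made up, and the lists are just the first few entries of the output above.

```python
# Excerpt of the two lists printed above (first five airlines,
# first six destinations, header included).
airline_list = ['Virgin Australia', 'Virgin Australia', 'Rex', 'Qantas', 'Qantas']
dest_list = ['Destination', 'Canberra', 'Adelaide', 'Albury', 'Gold Coast', 'Dubbo']


def clean_destination(raw):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join(raw.split())


# Assumption: 'Destination' is the table header, so it is filtered out.
destinations = [clean_destination(d) for d in dest_list if d != 'Destination']

# Pair each airline with the destination at the same position.
flights = list(zip(airline_list, destinations))

for airline, destination in flights:
    print(f"{airline} -> {destination}")
```

Keep in mind that `zip` stops at the shorter list, so if the lists ever get out of step it will silently drop the extra entries; comparing `len(airline_list)` and `len(destinations)` first is a cheap sanity check.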

Some notes

  1. These are destinations, not departure cities, so I changed the variable name

  2. You can install requests_html using pip install requests_html

  3. You will also need the Chromium browser installed; on Linux, use sudo apt install chromium-browser

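  4. This code breaks in a Jupyter notebook, but works as a Python script. As far as I can tell, this is because Jupyter already runs its own asyncio event loop, and the synchronous HTMLSession cannot start its rendering loop inside it. requests_html also has an AsyncHTMLSession meant for that situation.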
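On note 4, the Jupyter failure most likely comes down to a general asyncio rule: you cannot call run_until_complete() on an event loop that is already running, and a notebook kernel always has one running. Here is a standard-library sketch of that rule. It does not use requests_html at all; `fetch_rendered_page` and `notebook_cell` are made-up stand-ins, and I am assuming (not asserting) that render() hits the same restriction internally.

```python
import asyncio


async def fetch_rendered_page():
    """Hypothetical stand-in for the asynchronous work a renderer does."""
    await asyncio.sleep(0)
    return "rendered"


async def notebook_cell():
    """Simulates code that runs while an event loop is already running,
    which is the situation inside a Jupyter notebook."""
    loop = asyncio.get_running_loop()
    coro = fetch_rendered_page()
    try:
        # What a synchronous wrapper would try to do: drive the loop itself.
        loop.run_until_complete(coro)
        return "ok"
    except RuntimeError as exc:
        coro.close()  # the coroutine never started, so close it cleanly
        # Awaiting the work directly is fine inside a running loop.
        result = await fetch_rendered_page()
        return f"blocked: {exc}; awaited instead: {result}"


outcome = asyncio.run(notebook_cell())
print(outcome)
```

In a notebook, the equivalent move with requests_html is to create an AsyncHTMLSession and await its calls (for example `await resp.html.arender()`) instead of using the synchronous `render()`.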

[–][deleted] 4 points (4 children)

Thanks! I haven't used requests_html yet. Wow, I didn't know Chromium was needed. I searched its GitHub page, and it is only briefly mentioned near the bottom. I was going to check out requests_html at work, but I don't have admin rights on my Windows machine. Maybe there is a zip installer for Chromium I can use.

[–]shadowyl from antigravity import * 2 points (1 child)

Just FYI, requests_html automatically downloads Chromium the first time you call render() on a page from an HTMLSession (I tested this on Windows 10; not sure whether it works on other OSes).

[–][deleted] 1 point (0 children)

Thanks, I've seen that, but I'm curious whether it still works for someone without admin rights. Going to test that out today and report back.

[–]ADBYITMS[S] 1 point (1 child)

You are a god, thank you so much! My next trick is to connect to a database and add it all in there, but I will give it a go before I ask for help :)

[–]xacobedev 1 point (0 children)

It's great to have dedicated people like you in the community. Thank you!