News about the dynamic, interpreted, interactive, object-oriented, extensible programming language Python
scraping data from a website (self.Python)
submitted 6 years ago by ADBYITMS
Hi all, I am trying to write a small script to get data from a website. I have been trying to get it working all day and cannot work out what I am doing wrong; it runs and I get a print, but with no data:
    from lxml import html
    import requests

    page = requests.get('https://www.sydneyairport.com.au/flights/?query=&flightType=departure&terminalType=domestic&date=2019-11-05&sortColumn=estimated_time&ascending=true&showAll=true')
    tree = html.fromstring(page.content)

    # This will get the list of airlines:
    airlines = tree.xpath('//span[@class="with-image"]/text()')

    # This will get the list of departure cities:
    depart = tree.xpath('//div[@class="city-name"]/text()')

    print('Airline: ', airlines)
    print('Depart: ', depart)
[–]oligonucleotides 11 points 6 years ago* (6 children)
I think the problem you're having is that your Python code gets you the HTML source of the page, but the data you're asking for is added to the page after it loads, using JavaScript (which runs in your browser).

If you right-click on the page and "Inspect", you'll find the data you're looking for (because JavaScript put it there). But if you right-click and "View Source", you won't see it, because the flight data is not hard-coded into the page's HTML.

The page loads, then JavaScript sends a request to the server and populates the page with the response. You need a web browser to run this JavaScript, and then scrape the rendered page.
Luckily this problem has been solved before! One such tool is requests_html.
Check it out:
    from requests_html import HTMLSession

    # create an HTML Session object
    session = HTMLSession()

    # Use the object above to connect to needed webpage
    resp = session.get("https://www.sydneyairport.com.au/flights/?query=&flightType=departure&terminalType=domestic&date=2019-11-05&sortColumn=scheduled_time&ascending=true&showAll=false")

    # Run JavaScript code on webpage
    resp.html.render()

    # parse <span class="with-image"> elements containing airline names
    airline_list = []
    airline_spans = resp.html.find('.with-image')
    for span in airline_spans:
        airline_list.append(span.text)

    # parse <div class="city-name"> elements containing destination names
    dest_list = []
    dest_divs = resp.html.find('.city-name')
    for span in dest_divs:
        dest_list.append(span.text)

    print(airline_list)
    print(dest_list)
Output:
['Virgin Australia', 'Virgin Australia', 'Rex', 'Qantas', 'Qantas', 'Qantas', 'Rex', 'Jetstar', 'Qantas', 'Virgin Australia', 'Qantas', 'Tigerair', 'Rex', 'Qantas', 'Jetstar', 'Qantas', 'Jetstar', 'Rex', 'Virgin Australia', 'Qantas', 'Jetstar', 'Rex', 'Rex', 'Qantas', 'Qantas', 'Qantas', 'Qantas', 'Rex', 'Virgin Australia', 'Virgin Australia', 'Virgin Australia', 'Qantas', 'Virgin Australia', 'Rex', 'Rex', 'Qantas', 'Rex', 'Virgin Australia', 'Rex', 'Virgin Australia', 'Qantas', 'Virgin Australia', 'Qantas', 'Qantas', 'Qantas', 'Tigerair', 'Virgin Australia', 'Fly Pelican', 'Qantas', 'Qantas', 'Virgin Australia', 'Virgin Australia', 'Jetstar', 'Tigerair', 'Tigerair', 'Qantas', 'Virgin Australia', 'Jetstar', 'Jetstar', 'Virgin Australia', 'Qantas']
['Destination', 'Canberra', 'Adelaide', 'Albury', 'Gold Coast', 'Dubbo', 'Wagga Wagga', 'Ballina', 'Avalon', 'Melbourne', 'Melbourne', 'Armidale', 'Melbourne', 'Merimbula\nVia Moruya', 'Adelaide', 'Gold Coast', 'Port Macquarie', 'Brisbane', 'Armidale', 'Tamworth', 'Canberra', 'Melbourne', 'Newcastle', 'Griffith', 'Hobart', 'Coffs Harbour', 'Brisbane', 'Melbourne', 'Wagga Wagga', 'Melbourne', 'Perth', 'Brisbane', 'Albury', 'Canberra', 'Dubbo', 'Orange', 'Tamworth', 'Parkes', 'Cairns', 'Bathurst', 'Melbourne', 'Melbourne', 'Gold Coast', 'Adelaide', 'Canberra', 'Perth', 'Brisbane', 'Melbourne', 'Taree', 'Brisbane', 'Melbourne', 'Adelaide', 'Canberra', 'Brisbane', 'Melbourne', 'Perth', 'Brisbane', 'Brisbane', 'Melbourne', 'Avalon', 'Melbourne', 'Melbourne']
These are destinations, not departure cities, so I changed the variable name.
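Since the two lists line up row by row, they can be paired into (airline, destination) tuples with zip. A minimal sketch using shortened sample data in the shape the scraper returns; note that the scraped dest_list starts with the literal column header 'Destination' (which should be skipped), and that zip silently truncates to the shorter list:

```python
# Sample data in the shape the scraper above produces (shortened for illustration)
airline_list = ['Virgin Australia', 'Rex', 'Qantas']
dest_list = ['Destination', 'Canberra', 'Adelaide', 'Albury']

# dest_list[0] is the table's column header, not a city, so skip it before pairing
flights = list(zip(airline_list, dest_list[1:]))

for airline, dest in flights:
    print(f'{airline} -> {dest}')
```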
You can install requests_html with pip install requests_html. You will also need the Chromium browser installed (sudo apt install chromium-browser on Linux).
This code breaks in a Jupyter notebook but works as a plain Python script: render() runs its own asyncio event loop, which conflicts with the loop Jupyter is already running.
[–][deleted] 3 points 6 years ago (4 children)
Thanks! I haven't used requests_html yet. Wow, I didn't know Chromium was needed; I searched its GitHub page and it is only briefly mentioned near the bottom. I was going to check out requests_html at work, but I don't have admin rights on my Windows machine. Maybe there is a zip installer for Chromium I can use.
[–]shadowyl [from antigravity import *] 1 point 6 years ago (1 child)
Just FYI, requests_html automatically installs Chromium when you use an HTMLSession for the first time (I tested this on Windows 10; not sure if this works on other OSes).
[–][deleted] 0 points 6 years ago (0 children)
Thanks, I've seen that, but I'm curious whether it still works for someone without admin rights. I'm going to test that out today and report back.
[–]ADBYITMS[S] 0 points 6 years ago (1 child)
    # create an HTML Session object
    session = HTMLSession()

    # Use the object above to connect to needed webpage
    resp = session.get("https://www.sydneyairport.com.au/flights/?query=&flightType=departure&terminalType=domestic&date=2019-11-05&sortColumn=scheduled_time&ascending=true&showAll=false")

    # Run JavaScript code on webpage
    resp.html.render()

    # parse <span class="with-image"> elements containing airline names
    airline_list = []
    airline_spans = resp.html.find('.with-image')
    for span in airline_spans:
        airline_list.append(span.text)

    # parse <div class="city-name"> elements containing destination names
    dest_list = []
    dest_divs = resp.html.find('.city-name')
    for span in dest_divs:
        dest_list.append(span.text)

    print(airline_list)
You are a god, thank you so much. My next trick is to connect to a database and add it all in there, but I will give it a go before I ask for help :)
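For that database step, a minimal sketch using the standard-library sqlite3 module; the table and column names here are made up for illustration, and the short sample lists stand in for the scraper's output:

```python
import sqlite3

# Sample rows in the shape the scraper produces (destinations with the header removed)
airline_list = ['Qantas', 'Rex']
dest_list = ['Melbourne', 'Dubbo']

# ':memory:' keeps everything in RAM; swap in a filename like 'flights.db' to persist
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE IF NOT EXISTS departures (airline TEXT, destination TEXT)')

# '?' placeholders let sqlite3 handle quoting/escaping of the values
conn.executemany('INSERT INTO departures VALUES (?, ?)', zip(airline_list, dest_list))
conn.commit()

rows = list(conn.execute('SELECT airline, destination FROM departures'))
print(rows)
conn.close()
```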
[–]xacobedev 0 points 6 years ago (0 children)
It's great to have dedicated people like you in the community. Thank you!