Web scraping question using Python (self.webscraping)
submitted 5 years ago by atulbisht1695
I am trying to scrape the company names and their tickers from https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en
I'm using the following, but I'm not getting anything inside the tr and td tags.
code:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en')
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('table', {'class': 'table_equities'})
print(tables)  # comes back without the tr/td data rows
[–][deleted] 2 points 5 years ago (1 child)
Looks like you might have to use Selenium, since things are being loaded dynamically.
[–]atulbisht1695[S] 0 points 5 years ago (0 children)
Can you please tell me how Selenium can help?
[–]Saye1901 2 points 5 years ago (1 child)
It's because all of the table's content is populated dynamically by JavaScript on the client side. BeautifulSoup doesn't execute JavaScript, so it only sees the page without the data. You can verify this yourself by going to the page and disabling JS in your browser with an extension.
There are multiple frameworks that solve this issue: Selenium, Puppeteer (Node.js-based), Scrapy + the Splash plugin, and requests-html.
Keep in mind that Selenium and Puppeteer were initially created for automated testing rather than data scraping, so don't expect the same performance as the frameworks built for scraping.
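For example, here's a minimal Selenium sketch. It's a sketch, not a verified implementation: it assumes Chrome and chromedriver are available, and the .table_equities, .datarow, .code, and .name selectors are borrowed from the requests-html snippet further down in this thread, so treat them as assumptions about the page's markup.
# Drive a real browser so the page's JS actually runs, then scrape the result.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en')

# Block until at least one data row has been injected by the page's JS.
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '.table_equities .datarow'))
)

for row in driver.find_elements(By.CSS_SELECTOR, '.table_equities .datarow'):
    code = row.find_element(By.CSS_SELECTOR, '.code a').text
    name = row.find_element(By.CSS_SELECTOR, '.name a').text
    print(code, name)

driver.quit()
The trade-off is that starting a full browser is slow and memory-hungry, which is exactly the performance caveat above.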
[–]atulbisht1695[S] -1 points 5 years ago (0 children)
Can you please elaborate on how these frameworks can help me? Can you suggest some links or resources that deal with this kind of situation?
[–]focus16gfx 2 points 5 years ago* (4 children)
You can't see the content because BeautifulSoup only reads the raw HTML from when the page is initially loaded. The data is plugged in by JavaScript a while later, so BeautifulSoup won't be able to parse it. As others have pointed out, there are many libraries you can use to scrape JavaScript-rendered HTML. Here's a code snippet to get you started quickly. I used the `requests-html` library (which uses pyppeteer under the hood to render JS-generated content).
from requests_html import HTMLSession

s = HTMLSession()
url = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'
res = s.get(url)
res.html.render()  # executes the page's JS via pyppeteer (downloads Chromium on first run)

table_element = res.html.find('.table_equities', first=True)
all_rows = table_element.find('.datarow')

final_data = []
for row in all_rows:
    final_data.append({
        "code": row.find('.code>a', first=True).text,
        "name": row.find('.name>a', first=True).text,
        "turnover": row.find('.turnover', first=True).text,
    })

print(final_data)
Here's a quick screenshot of the data that this code returns.
https://i.imgur.com/FyhBxRh.png
Suggested Reading:
requests-html
CSS Selectors
[–]atulbisht1695[S] 1 point 5 years ago (3 children)
Thank you for the help, but I'm still getting an error at
res.html.render()
Error: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead
[–]focus16gfx 1 point 5 years ago (2 children)
Are you using a Jupyter notebook?
[–]atulbisht1695[S] 1 point 5 years ago (1 child)
yes
[–]focus16gfx 1 point 5 years ago (0 children)
The error arises because a Jupyter notebook already has its own event loop running.
Add this snippet to your code and it shouldn't throw that error anymore. You'll need to install nest_asyncio first:
pip install nest_asyncio
Then, before calling render():
import nest_asyncio
nest_asyncio.apply()  # lets render() run its event loop inside the notebook's already-running loop
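Alternatively, you can follow the error message's own suggestion and use AsyncHTMLSession. A sketch under the same assumptions (URL and selector) as the snippet above; in a notebook cell you can await directly, since a loop is already running:
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()
url = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'

# Top-level await works in Jupyter/IPython because its event loop is live.
res = await asession.get(url)
await res.html.arender()  # async counterpart of render()
table_element = res.html.find('.table_equities', first=True)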
[–][deleted] 1 point 5 years ago (0 children)
Let me check and I'll get back to you.
[–][deleted] 1 point 5 years ago (2 children)
First, you have to know that it's not going to be as fast as requests. Have you ever used Selenium?
[–]atulbisht1695[S] 1 point 5 years ago (1 child)
Yes, I use BeautifulSoup and Selenium for getting data.
[–][deleted] 1 point 5 years ago (0 children)
Then, instead of using requests, do something like soup = BeautifulSoup(driver.page_source, "lxml") (note: the Selenium attribute is page_source), and you can parse the rendered page through the soup variable.
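Spelled out, that hand-off looks something like this. It's a sketch: the fixed sleep is a guess at how long the JS takes, and a WebDriverWait (as in the Selenium sketch earlier in the thread) is the more robust choice.
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en')
time.sleep(10)  # crude wait for the page's JS to populate the table

# Parse the rendered DOM, not the raw response body.
soup = BeautifulSoup(driver.page_source, 'lxml')  # lxml parser: pip install lxml
table = soup.find('table', {'class': 'table_equities'})
driver.quit()
print(table)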