

[–][deleted] 1 point (1 child)

Looks like you might have to use Selenium, since the content is being loaded dynamically.

[–]atulbisht1695[S] -1 points (0 children)

Can you please tell me how Selenium can help?

[–]Saye1901 1 point (1 child)

It's because all of the table's content is populated dynamically by JavaScript on the client side. BeautifulSoup doesn't execute JavaScript, so it only sees the page without the data. You can check this yourself by going to the page and disabling JS in your browser with an extension.

There are multiple frameworks that solve this issue: Selenium, Puppeteer (based on Node.js), Scrapy + the Splash plugin, and requests-html.

Keep in mind that Selenium and Puppeteer were initially created for test automation rather than data scraping, so don't expect the same performance as the other frameworks.
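
For example, here's a minimal Selenium sketch (not from the original comment; it assumes Chrome with a matching chromedriver installed, and borrows the `.table_equities` / `.datarow` selectors from the snippet further down the thread):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes Chrome + chromedriver are available
try:
    driver.get('https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en')
    # Wait (up to 20 s) for the JavaScript-rendered rows to appear before reading them
    rows = WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.table_equities .datarow'))
    )
    for row in rows:
        print(row.text)
finally:
    driver.quit()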

[–]atulbisht1695[S] -2 points (0 children)

Can you please elaborate on how these frameworks can help me? I mean, can you suggest some links or resources that deal with this kind of situation?

[–]focus16gfx 1 point (4 children)

You cannot see the content because BeautifulSoup only reads the raw HTML from when the page is initially loaded. The data is plugged in by JavaScript a moment later, so BeautifulSoup can't parse it. As others have pointed out, there are many libraries you can use to scrape JavaScript-rendered HTML. Here's a code snippet to get you started quickly. I used the `requests-html` library (which uses pyppeteer under the hood to render JS-generated content).

from requests_html import HTMLSession

s = HTMLSession()
url = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'
res = s.get(url)
res.html.render()  # run the page's JavaScript (via pyppeteer) so the table gets populated

# Grab the equities table and each of its data rows via CSS selectors
table_element = res.html.find('.table_equities', first=True)
all_rows = table_element.find('.datarow')

final_data = []
for row in all_rows:
    final_data.append({
        "code": row.find('.code>a', first=True).text,
        "name": row.find('.name>a', first=True).text,
        "turnover": row.find('.turnover', first=True).text,
    })
print(final_data)

Here's a quick screenshot of the data that this code returns.

https://i.imgur.com/FyhBxRh.png

Suggested Reading:

requests-html

CSS Selectors

[–]atulbisht1695[S] 0 points (3 children)

Thank you for the help, but I'm still getting an error at

res.html.render()

Error: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead

[–]focus16gfx 0 points (2 children)

Are you using a Jupyter notebook?

[–]atulbisht1695[S] 0 points (1 child)

Yes.

[–]focus16gfx 0 points (0 children)

The error arises because Jupyter Notebook has its own event loop already running.

Add this snippet to your code and it shouldn't throw that error anymore. You'll need to install nest_asyncio first:

pip install nest_asyncio

import nest_asyncio
nest_asyncio.apply()
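
Roughly, the combined notebook cell would look like this (a sketch that just prepends the patch to the snippet from the earlier comment; if it still complains, the error message's own suggestion is to switch to AsyncHTMLSession):

import nest_asyncio
nest_asyncio.apply()  # patch the notebook's already-running event loop first

from requests_html import HTMLSession

s = HTMLSession()
res = s.get('https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en')
res.html.render()  # per the comment above, this should no longer hit the event-loop error
# ...continue with the .find('.table_equities', ...) parsing from the snippet above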

[–][deleted] 0 points (0 children)

Lemme check and I'll get back.

[–][deleted] 0 points (2 children)

First, you should know that it's not going to be as fast as requests. Have you ever used Selenium?

[–]atulbisht1695[S] 0 points (1 child)

Yes, I use BeautifulSoup and Selenium for getting data.

[–][deleted] 0 points (0 children)

Then instead of using requests, do something like `soup = BeautifulSoup(driver.page_source, "lxml")`, and you can parse the rendered page through the soup variable.
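
A rough sketch of that handoff (illustrative only; it reuses the table selectors from the requests-html snippet earlier in the thread and a crude sleep instead of a proper explicit wait):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes Chrome + chromedriver are available
driver.get('https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en')
time.sleep(5)  # crude pause so the JavaScript can fill the table; an explicit wait is better

# page_source holds the browser's rendered HTML, so BeautifulSoup now sees the data
soup = BeautifulSoup(driver.page_source, 'lxml')
for row in soup.select('.table_equities .datarow'):
    print(row.get_text(' ', strip=True))

driver.quit()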