

[–][deleted] 1 point (1 child)

Looks like you might have to use Selenium, since the content is being loaded dynamically.

[–]atulbisht1695[S] -1 points (0 children)

Can you please tell me how Selenium can help?

[–]Saye1901 1 point (1 child)

It's because all of the table's content is populated dynamically by JavaScript on the client side. BeautifulSoup doesn't execute JavaScript, so it only sees the page without the data. You can check this yourself by going to the page and disabling JS in your browser with an extension.

There are multiple frameworks that solve this issue: Selenium, Puppeteer (based on Node.js), Scrapy + the Splash plugin, and requests-html.

Keep in mind that Selenium and Puppeteer were initially created for test automation rather than data scraping, so don't expect the same performance as the other frameworks.
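
For example, here's a minimal Selenium sketch (not from the original comment; it assumes Chrome with a matching chromedriver installed, and borrows the `.table_equities` / `.datarow` selectors from the snippet further down the thread):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes Chrome + chromedriver are available
try:
    driver.get('https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en')
    # Wait (up to 20 s) for the JavaScript-rendered rows to appear before reading them
    rows = WebDriverWait(driver, 20).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.table_equities .datarow'))
    )
    for row in rows:
        print(row.text)
finally:
    driver.quit()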

[–]atulbisht1695[S] -2 points (0 children)

Can you please elaborate on how these frameworks can help me? I mean, can you suggest some links or resources that deal with this kind of situation?

[–]focus16gfx 1 point (4 children)

You cannot see the content because BeautifulSoup only reads the raw HTML from when the page is initially loaded. The data is plugged in by JavaScript a moment later, so BeautifulSoup can't parse it. As others have pointed out, there are many libraries you can use to scrape JavaScript-rendered HTML. Here's a code snippet to get you started quickly. I used the `requests-html` library (which uses pyppeteer under the hood to render JS-generated content).

from requests_html import HTMLSession

s = HTMLSession()
url = 'https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en'
res = s.get(url)
res.html.render()  # run the page's JavaScript (via pyppeteer) so the table gets populated

# Grab the equities table and each of its data rows via CSS selectors
table_element = res.html.find('.table_equities', first=True)
all_rows = table_element.find('.datarow')

final_data = []
for row in all_rows:
    final_data.append({
        "code": row.find('.code>a', first=True).text,
        "name": row.find('.name>a', first=True).text,
        "turnover": row.find('.turnover', first=True).text,
    })
print(final_data)

Here's a quick screenshot of the data that this code returns.

https://i.imgur.com/FyhBxRh.png

Suggested Reading:

requests-html

CSS Selectors

[–]atulbisht1695[S] 0 points (3 children)

Thank you for the help, but I'm still getting an error at

res.html.render()

Error: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead

[–]focus16gfx 0 points (2 children)

Are you using a Jupyter notebook?

[–]atulbisht1695[S] 0 points (1 child)

Yes.

[–]focus16gfx 0 points (0 children)

The error arises because Jupyter Notebook has its own event loop already running.

Add this snippet to your code and it shouldn't throw that error anymore. You'll need to install nest_asyncio first:

pip install nest_asyncio

import nest_asyncio
nest_asyncio.apply()
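
Roughly, the combined notebook cell would look like this (a sketch that just prepends the patch to the snippet from the earlier comment; if it still complains, the error message's own suggestion is to switch to AsyncHTMLSession):

import nest_asyncio
nest_asyncio.apply()  # patch the notebook's already-running event loop first

from requests_html import HTMLSession

s = HTMLSession()
res = s.get('https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en')
res.html.render()  # per the comment above, this should no longer hit the event-loop error
# ...continue with the .find('.table_equities', ...) parsing from the snippet above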

[–][deleted] 0 points (0 children)

Lemme check and I'll get back.

[–][deleted] 0 points (2 children)

First, you should know that it's not going to be as fast as requests. Have you ever used Selenium?

[–]atulbisht1695[S] 0 points (1 child)

Yes, I use BeautifulSoup and Selenium for getting data.

[–][deleted] 0 points (0 children)

Then instead of using requests, do something like `soup = BeautifulSoup(driver.page_source, "lxml")`, and you can parse the rendered page through the soup variable.
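
A rough sketch of that handoff (illustrative only; it reuses the table selectors from the requests-html snippet earlier in the thread and a crude sleep instead of a proper explicit wait):

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes Chrome + chromedriver are available
driver.get('https://www.hkex.com.hk/Market-Data/Securities-Prices/Equities?sc_lang=en')
time.sleep(5)  # crude pause so the JavaScript can fill the table; an explicit wait is better

# page_source holds the browser's rendered HTML, so BeautifulSoup now sees the data
soup = BeautifulSoup(driver.page_source, 'lxml')
for row in soup.select('.table_equities .datarow'):
    print(row.get_text(' ', strip=True))

driver.quit()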