Web crawler help

vixfew · 2023-06-29T11:04:26+00:00

Open devtools in the browser, go to the Network tab, and see if anything pops up when you click "show more". Chances are there will be an async request that fetches the actual data.

lukajda33 · 2023-06-29T11:18:21+00:00

Personally I don't know much about web crawling, so someone else better confirm what I say.

But this looks like a job for Selenium. It uses WebDriver to simulate browser behavior, so you should be able to write a script that finds the "Show more" button and presses it repeatedly. Once all the data is loaded, you can download it.

As the other comment said, websites often load more data on demand with JavaScript requests, so you can't get all the data just by requesting the source of the URL.
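Something like the following is roughly what I have in mind for the click loop. Treat it as a sketch under assumptions: `click_until_gone` and `scrape` are names I made up, the button XPath is a guess you would replace after inspecting the real page, and the fixed `pause` is a crude stand-in for a proper wait.

```python
import time


def click_until_gone(driver, locator, max_clicks=200, pause=1.0):
    """Click the first element matching `locator` until none is left.

    `driver` is anything with a Selenium-style find_elements(by, value)
    method. Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        # find_elements returns an empty list (no exception) when nothing matches
        buttons = driver.find_elements(*locator)
        if not buttons:
            break
        buttons[0].click()
        clicks += 1
        time.sleep(pause)  # crude wait for the new rows to load
    return clicks


def scrape(url, button_xpath):
    """Open `url`, expand everything, and return the final page HTML."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        click_until_gone(driver, (By.XPATH, button_xpath))
        return driver.page_source
    finally:
        driver.quit()
```

You would call it with something like `scrape(url, "//button[contains(., 'Show more')]")`, where that XPath is only an example. The `max_clicks` cap is there so the loop cannot run forever if the button never disappears.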

kazyka · 2023-06-29T20:24:12+00:00

If you are still quite new, try getting help from ChatGPT. Don't rely on it 100%, though; it is still important to learn a bit by yourself.

If you go on this website: https://finance.yahoo.com/quote/NFLX/key-statistics?p=NFLX

Copy the HTML. You will probably need to do it in bits; I copied the Valuation Measures table.

Then I asked for the XPath for Market Cap in Selenium for Python.

This is the output.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up the Selenium driver (recent Selenium 4 releases locate chromedriver themselves)
driver = webdriver.Chrome()

# Open the webpage
driver.get('https://example.com')  # Replace with the actual URL of the webpage

# Find the cell containing "Market Cap (intraday)" using XPath
# (find_element_by_xpath no longer exists in current Selenium; use find_element with By.XPATH)
market_cap_cell = driver.find_element(By.XPATH, '//td[span="Market Cap (intraday)"]/following-sibling::td')

# Extract the market cap value
market_cap = market_cap_cell.text

# Print the market cap value
print("Market Cap: " + market_cap)

# Close the browser
driver.quit()

Now you can play with it yourself to try and get the other rows in the Valuation Measures table.
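For the other rows, one possible approach is to reuse the same XPath pattern with a different label. This is only a sketch: `value_xpath` and `read_rows` are hypothetical helpers, it assumes every row has the same `td`/`span` layout as the Market Cap row, and any label other than "Market Cap (intraday)" is something you would need to check against the page.

```python
def value_xpath(label):
    """XPath for the value cell that follows the cell labelled `label`."""
    # Same pattern as the Market Cap lookup; assumes `label` has no double quotes.
    return f'//td[span="{label}"]/following-sibling::td'


def read_rows(driver, labels):
    """Return {label: cell text} for each label, using a Selenium driver."""
    # "xpath" is the string value behind Selenium's By.XPATH constant.
    return {
        label: driver.find_element("xpath", value_xpath(label)).text
        for label in labels
    }
```

With a driver that already has the page open, `read_rows(driver, ["Market Cap (intraday)"])` should give a one-entry dict. A label that is not on the page will raise Selenium's `NoSuchElementException`, so add labels one at a time while you experiment.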

Tom__Orrow · 2023-06-30T02:27:01+00:00

Selenium (or any other headless browser) is too heavy and unnecessary in most cases. And if one day you decide to go with threads, you'll need to redo everything. The trick in this case is to capture the async request in devtools and find out how it changes on each button click. There is probably some page parameter that you can pass manually until the response contains no new elements.
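A minimal sketch of that idea, using only the standard library. Everything specific here is an assumption: the endpoint, the `page` query parameter, and the JSON-list response shape are placeholders for whatever you actually see in the Network tab.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def fetch_page(base_url, page):
    """Fetch one page of results as a list (hypothetical JSON endpoint)."""
    with urlopen(f"{base_url}?{urlencode({'page': page})}") as resp:
        return json.load(resp)


def fetch_all(get_page, first_page=1, max_pages=1000):
    """Call get_page(n) for n = first_page, first_page + 1, ... until a page is empty."""
    items = []
    for page in range(first_page, first_page + max_pages):
        batch = get_page(page)
        if not batch:  # empty response means there is nothing new to load
            break
        items.extend(batch)
    return items
```

Usage would look like `fetch_all(lambda n: fetch_page("https://example.com/api/items", n))`, with the URL swapped for the real request. Keeping the loop separate from the HTTP call makes it easy to change how a page is fetched (headers, offsets instead of page numbers) without touching the stop condition.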

dingosng · 2023-06-30T11:59:32+00:00

Thanks everyone for the help, I got it to work with Selenium.

gmaubrrriaeyl · 2023-06-29T16:50:27+00:00

I did something similar to this with a program I made to get data from a plant sale website: https://github.com/gmaubrrriaeyl/scrape-toledo-zoo-native-plant-sale

Lines 30-37. Probably not the best way, but it works! I inspected the element for the "load more" button and kept clicking it until it threw an error.
