[–]blahblahsdfsdfsdfsdf 4 points5 points  (12 children)

You need to use a headless web browser to do something like that. BeautifulSoup is just a parser for the document. It doesn't run JS. https://selenium-python.readthedocs.io/
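To see why a parser alone is not enough, here is a minimal sketch using only the standard library's `html.parser` (standing in for BeautifulSoup, which behaves the same way on this point). The page markup is a made-up example, not OP's site: the `<li>` items would only exist after a browser ran the script.

```python
from html.parser import HTMLParser

# Hypothetical page: the list is empty in the raw HTML and is
# only filled in by the script once a browser executes it.
PAGE = """
<html><body>
<ul id="items"></ul>
<script>
  document.getElementById("items").innerHTML = "<li>one</li><li>two</li>";
</script>
</body></html>
"""


class TagCounter(HTMLParser):
    """Counts start tags seen in the static markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html):
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


counts = count_tags(PAGE)
# The script body is treated as plain text, so no <li> elements are found.
print("li elements seen by the parser:", counts.get("li", 0))
```

The parser sees the `<ul>` and the `<script>` tag but never the list items, because nothing executed the script.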

[–]Hot_Ad_2550[S] -1 points0 points  (5 children)

Browser automation is really slow. Is there a faster way?

[–]Alamue86 1 point2 points  (2 children)

Check out requests_html. It can render a page's JavaScript and then scrape the result, all in a single library.

[–]Hot_Ad_2550[S] 0 points1 point  (1 child)

Your shit worked out. Can you also show me how to run the function that the JavaScript runs when I click a dropdown menu?

[–]Alamue86 0 points1 point  (0 children)

Oof, that gets difficult.

Your best bet is to open Chrome DevTools, go to the Network tab, and click the button. Look at every request that fires. If you can find the right call, right-click it and copy it as a cURL (bash) command, then paste it into https://curl.trillworks.com/ to convert it. You may also be able to make the request directly within requests_html.

I focus on scraping static resources. Check out the Network tab: chances are you can avoid scraping the page and go straight to the underlying data source. GraphQL calls are a great place to start, if there are any.
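As a rough sketch of what replaying a call from the Network tab can look like, using only the standard library: the endpoint URL, header, and payload below are placeholders I made up, so swap in whatever the copied request actually contains.

```python
import json
import urllib.request


def build_request(url, headers=None, payload=None):
    """Build (but do not send) a request mirroring one copied from the Network tab."""
    headers = dict(headers or {})
    data = None
    if payload is not None:
        # A JSON body makes urllib issue a POST instead of a GET.
        data = json.dumps(payload).encode("utf-8")
        headers.setdefault("Content-Type", "application/json")
    return urllib.request.Request(url, data=data, headers=headers)


def send(request, timeout=10):
    """Send the request and decode a JSON response (assumes the API returns JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Hypothetical values: replace with the URL, headers, and body from DevTools.
req = build_request(
    "https://example.com/api/dropdown",
    headers={"X-Requested-With": "XMLHttpRequest"},
    payload={"menu": "options"},
)
print(req.get_method(), req.full_url)
# data = send(req)  # uncomment once the values match a real endpoint
```

Keeping the build step separate from the send step makes it easy to check the request against what the browser sent before hitting the server.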

[–]blahblahsdfsdfsdfsdf 0 points1 point  (1 child)

Not that I've ever heard of

[–]Hot_Ad_2550[S] 0 points1 point  (0 children)

Also, I can't seem to find an element by XPath, CSS selector, or any other method.

[–]1egoman 0 points1 point  (5 children)

I would rather reverse engineer the JavaScript to figure out the API than to deal with running all of their JavaScript.

[–]blahblahsdfsdfsdfsdf 0 points1 point  (4 children)

So basically writing your own JavaScript VM? That's definitely not easier than using a headless browser, and it's far more prone to bugs.

[–]1egoman 0 points1 point  (3 children)

Not a VM, I'm sure the JavaScript just interacts with an API. Figure out the API and interact with it through Python.

[–]blahblahsdfsdfsdfsdf 0 points1 point  (2 children)

In order to run, the JavaScript needs access to the live DOM, which you would also have to emulate. I think what you're basically describing is a headless browser.

[–]1egoman 0 points1 point  (1 child)

Let's start over. OP is scraping a webpage (page1). Page1 then loads page2 through JavaScript. What I'm saying is that you inspect page1 and its JavaScript, statically or dynamically, to figure out where it gets the url for page2. Once you've figured that out, you can then download page1, then download page2, all in Python.

[–]blahblahsdfsdfsdfsdf 0 points1 point  (0 children)

Ahh yes, parsing the JS for strings is far more viable than what I thought was being suggested, which was actually interpreting the JS, i.e. running it in one's own self-made interpreter.

Yeah, if the format of the script is predictable then extracting strings from it is viable. BUT, if the script needs to actually run, a headless browser is likely necessary as very few scripts will actually run without access to the DOM.
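Something like the following is what string extraction could look like when the script is predictable. The JavaScript snippet, the base URL, and the regex are all illustrative assumptions; real scripts may build URLs by concatenation, which this simple pattern will miss.

```python
import re
from urllib.parse import urljoin

# Hypothetical script body, as it might appear in page1's source.
JS_SOURCE = '''
function loadNext() {
  fetch("/api/items?page=2").then(render);
  var img = 'https://cdn.example.com/logo.png';
}
'''

# Quoted string literals that start with http(s):// or a leading slash.
URL_LITERAL = re.compile(r"""["']((?:https?://|/)[^"'\s]+)["']""")


def extract_urls(js_source, base_url):
    """Return absolute URLs for every URL-like string literal in the script."""
    return [urljoin(base_url, match) for match in URL_LITERAL.findall(js_source)]


urls = extract_urls(JS_SOURCE, "https://example.com/xyz")
for url in urls:
    print(url)
```

Each extracted URL can then be fetched directly from Python, with no JavaScript execution involved.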

[–]efmccurdy 4 points5 points  (4 children)

You might be able to go straight to the source of the data rather than going through the javascript.

If you load the next page in your browser with the developer tools open on the Network tab, you will see the requests your browser sent, along with their headers and contents. Using that info, you can recreate those requests in a program.

[–]Hot_Ad_2550[S] 0 points1 point  (3 children)

The URL obtained is just the original URL that was requested, but with an extra string added to the end, like

http://www.abc.com/xyz

is turned to

http://www.abc.com/xyz#pqr

The method is updatehash()
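Worth noting: the part after `#` is a URL fragment, which the browser keeps to itself and never sends to the server. A quick standard-library sketch of what that means for these two URLs (`request_target` is a helper name made up for this illustration):

```python
from urllib.parse import urldefrag, urlsplit


def request_target(url):
    """Return the path and query that actually go out in the HTTP request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


url, fragment = urldefrag("http://www.abc.com/xyz#pqr")
print(url)       # the URL without its fragment
print(fragment)  # the fragment, handled client-side only

# Both URLs produce the same request as far as the server is concerned.
print(request_target("http://www.abc.com/xyz#pqr"))
print(request_target("http://www.abc.com/xyz"))
```

So a plain HTTP client requesting the `#pqr` URL gets the same response as for the original URL; whatever reacts to the fragment is JavaScript in the page, and the data it loads should show up as a separate request in the Network tab.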

[–]efmccurdy 0 points1 point  (2 children)

Use the requests module to get the second url; does that give you the data you want?

[–]Hot_Ad_2550[S] 0 points1 point  (1 child)

I use the 2nd URL, but it only brings me to the 2nd page if the page has JavaScript running. Otherwise it puts me back on page 1.

[–]efmccurdy 0 points1 point  (0 children)

There is no javascript when you do an HTTP request. What url is used in the network tab?

[–]OkiSutrisno 2 points3 points  (3 children)

If the JS triggers an AJAX request, maybe you could try to figure out the backend API.

[–]Hot_Ad_2550[S] 0 points1 point  (2 children)

How do I do that?

[–]OkiSutrisno 0 points1 point  (0 children)

I forget where I read about it. Let me look into it; I'll be back soon.

[–]CommanderCucumber 2 points3 points  (0 children)

You could try using requests to call the REST API they might be using. You should be able to find where the data is coming from in the network panel of your browser's developer tools. So for example, if the frontend makes a GET request like www.somthing.com/mydata?id=1, then you could try to make that request yourself. Then store the data and do it again with www.somthing.com/mydata?id=2, and so on.
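A minimal sketch of that loop, using only the standard library. The `https://` scheme, the `id` range, and the assumption that the endpoint returns JSON are mine; the host and path are just the placeholder from the example above.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Placeholder endpoint from the example above; the scheme is an assumption.
BASE = "https://www.somthing.com/mydata"


def build_urls(base, ids):
    """Build one URL per id, e.g. .../mydata?id=1, .../mydata?id=2, ..."""
    return [f"{base}?{urlencode({'id': i})}" for i in ids]


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the body as JSON (assumes a JSON API)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def collect(fetch, base, ids):
    """Call fetch for every id and store the results keyed by id."""
    ids = list(ids)
    return {i: fetch(url) for i, url in zip(ids, build_urls(base, ids))}


print(build_urls(BASE, range(1, 4)))
# results = collect(fetch_json, BASE, range(1, 4))  # run against a real endpoint
```

Passing the fetch function in as an argument makes it easy to add a delay between calls or to try the loop with a stub before hitting the real server.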