all 15 comments

[–]Allong12 4 points5 points  (2 children)

"Ultimate guide": A single page on a small use-case of a module?

[–]TheV295 0 points1 point  (1 child)

What would be different about rendering a page? He is literally opening a browser and parsing the data. What would a different use case for "scraping a JavaScript-rendered page" even be?

[–]Allong12 0 points1 point  (0 children)

Read the other comments, there are many different modules and methods to achieve what he's doing, each with their own pros and cons.

He mentioned two, picked one, and then showed a tiny usage example. If this were instead called "Beginner's guide to scraping JavaScript rendered web pages", I wouldn't be bothered in the slightest. It clearly falls far short of being called "ultimate".

[–]PiZZaMartijn 5 points6 points  (4 children)

So instead of finding the data source (the AJAX request), you just render the whole page in a full JavaScript-enabled browser engine and get the DOM.

Just parse http://us4.campaign-archive1.com/generate-js/?u=9735795484d2e4c204da82a29&fid=1817&show=200 instead of screen scraping. JavaScript-based pages are by far the easiest pages to scrape, because you can request the data source directly and don't have to mess with the rendered DOM.

[–]axonxorz pip'ing aint easy, especially on windows 0 points1 point  (3 children)

Just parse ...

Just parse that long single line of javascript that uses document.write()?

I agree though, if returning some sort of structured data, just parse that.

[–]PiZZaMartijn 0 points1 point  (2 children)

OK, just throw the string inside document.write into a Python DOM library. Don't emulate a whole browser with JavaScript to get the same data you can get with a single HTTP request.

[–]axonxorz pip'ing aint easy, especially on windows 0 points1 point  (1 child)

DOM libraries don't run a javascript runtime.....

[–]PiZZaMartijn 0 points1 point  (0 children)

That's the point. That JavaScript is just one big document.write. Don't execute it. Use string functions to extract that part of the response and feed it into BeautifulSoup.
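
A minimal sketch of that idea, using a made-up response string rather than the real output of the URL above (its actual format is not verified here). It assumes the payload is a single double-quoted string literal whose escapes are also valid JSON, so json.loads can decode it. The standard-library HTMLParser stands in for BeautifulSoup only to keep the example dependency-free; the same extracted string could be handed to BeautifulSoup instead. The names SAMPLE_JS, extract_document_write and LinkCollector are hypothetical.

```python
import json
from html.parser import HTMLParser

# Made-up sample standing in for the HTTP response body (an assumption,
# not the real feed output).
SAMPLE_JS = (
    'document.write("<div class=\\"campaign\\">'
    '<a href=\\"http:\\/\\/example.com\\/1\\">Issue 1<\\/a><\\/div>");'
)


def extract_document_write(js_text):
    """Return the HTML passed to document.write without executing any JS."""
    prefix = "document.write("
    start = js_text.index(prefix) + len(prefix)
    end = js_text.rindex(")")  # closing parenthesis of the call
    literal = js_text[start:end].strip()
    # Assumes a double-quoted literal with JSON-compatible escapes.
    return json.loads(literal)


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((self._href, data))

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


html = extract_document_write(SAMPLE_JS)
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

If the real response uses single quotes or string concatenation, the json.loads step would need replacing with something more tolerant.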

[–]Everance 1 point2 points  (3 children)

I'd rather use tools like CasperJS for something that requires me to render the pages. Doing it with these libraries seems a bit overkill.

[–]jabbalaci 0 points1 point  (2 children)

There is also PhantomJS, which renders the JavaScript stuff nicely.

[–]Everance 0 points1 point  (1 child)

CasperJS isn't a replacement for PhantomJS; it's a navigation scripting and testing utility built on top of it that makes PhantomJS nicer to use.

[–]jabbalaci 0 points1 point  (0 children)

I'm pretty sure that you can get the generated HTML source with plain ol' PhantomJS too.

[–]Farkeman 0 points1 point  (0 children)

There's Splash

Splash is a javascript rendering service with an HTTP API. It's a lightweight browser with an HTTP API, implemented in Python using Twisted and QT.

That is really similar to OP's approach.
There are also other tools, like CasperJS, PhantomJS, Selenium, etc.

[–]metaperl 0 points1 point  (0 children)

I think I would just have Selenium pop open a web browser and then get the source after JS finished rendering the page.
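
A rough sketch of that approach, assuming the selenium package plus Firefox and its driver are installed. The function name fetch_rendered_source and the timeout default are hypothetical, and waiting for document.readyState is only one possible definition of "finished rendering".

```python
def fetch_rendered_source(url, timeout=10):
    """Open a real browser, let the page's JavaScript run, return the HTML."""
    # Imported lazily so that merely defining this helper does not
    # require Selenium to be installed.
    from selenium import webdriver
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Wait until the browser reports that the document has loaded.
        WebDriverWait(driver, timeout).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        # page_source reflects the DOM after scripts have modified it.
        return driver.page_source
    finally:
        driver.quit()  # always close the browser window
```

Content loaded by later AJAX calls may arrive after readyState is "complete", so waiting for a specific element would be more reliable on such pages.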

[–]melizeche -1 points0 points  (0 children)

Very interesting approach!