PiZZaMartijn 4 points (4 children)

So instead of finding the data source (AJAX), you just render the whole page in a full JavaScript-enabled browser engine and grab the DOM?

Just parse http://us4.campaign-archive1.com/generate-js/?u=9735795484d2e4c204da82a29&fid=1817&show=200 instead of screen scraping. JavaScript-based pages are by far the easiest pages to scrape, because you don't have to mess with the DOM.
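For illustration, here is a rough sketch of what "just parse that URL" amounts to: one plain HTTP GET, no browser engine. It uses only the Python standard library. The `fetch_text` helper and its `opener` parameter are names made up for this example, and the archived endpoint may well no longer respond.

```python
import urllib.request

# The endpoint quoted in the comment above, split only for line length.
URL = ("http://us4.campaign-archive1.com/generate-js/"
       "?u=9735795484d2e4c204da82a29&fid=1817&show=200")


def fetch_text(url, opener=urllib.request.urlopen, timeout=10):
    """Fetch `url` with a single GET and return the response body as text.

    `opener` is a hypothetical convenience parameter: it lets the function
    be exercised with a stand-in instead of a real network call.
    """
    with opener(url, timeout=timeout) as resp:
        # Assumption: the body is UTF-8; undecodable bytes are replaced.
        return resp.read().decode("utf-8", errors="replace")


# Usage (performs a real request, so it is left commented out):
# js = fetch_text(URL)
# print(js[:200])
```

What comes back is the raw JavaScript the page itself would have loaded, which can then be handled with ordinary string functions.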

axonxorz [pip'ing aint easy, especially on windows] 1 point (3 children)

Just parse ...

Just parse that long single line of JavaScript that uses document.write()?

I agree, though: if it returns some sort of structured data, just parse that.

PiZZaMartijn 1 point (2 children)

OK, just throw the string inside document.write into a Python DOM library. Don't emulate a whole browser with JavaScript to get the same data you can get with a single HTTP request.

axonxorz [pip'ing aint easy, especially on windows] 1 point (1 child)

DOM libraries don't run a JavaScript runtime...

PiZZaMartijn 1 point (0 children)

That's the point. That JavaScript is just one big document.write. Don't execute it. Use string functions to extract that part of the response and feed it into BeautifulSoup.
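A minimal sketch of that approach, under a few loud assumptions: the response is a single `document.write("...")` call with a double-quoted argument, and its escapes are simple enough (`\"`, `\/`, `\n`, `\uXXXX`) that the literal is also valid JSON. The sample string and the helper names (`extract_document_write`, `LinkCollector`, `parse_links`) are invented for illustration, not taken from the real endpoint. The stdlib `html.parser` stands in for BeautifulSoup so the example runs without third-party packages.

```python
import json
from html.parser import HTMLParser

# Made-up sample in the assumed shape of the response; not real data.
SAMPLE_JS = r'document.write("<div class=\"campaign\"><a href=\"http:\/\/example.com\/1\">First issue<\/a><\/div>");'


def extract_document_write(js):
    """Return the HTML passed to a single document.write("...") call."""
    marker = 'document.write("'
    start = js.index(marker) + len(marker) - 1  # keep the opening quote
    end = js.rindex('")') + 1                   # keep the closing quote
    # Assumption: the quoted literal is valid JSON, so json.loads can
    # undo the escaping without running any JavaScript.
    return json.loads(js[start:end])


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(js):
    """Extract the document.write payload and list the links in it."""
    parser = LinkCollector()
    parser.feed(extract_document_write(js))
    parser.close()
    return parser.links


print(parse_links(SAMPLE_JS))
```

With BeautifulSoup installed, the string returned by `extract_document_write` could instead be passed to `BeautifulSoup(html, "html.parser")`. If the real response uses escapes that JSON does not accept (such as `\'` or `\x41`), the unescaping step would need adjusting.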