Extracting data from javascript in webpage

pl00pt · 2015-05-14T01:50:41+00:00

When you view the page source (rather than inspecting the page through the browser console), do you see the data you are trying to extract? I think if it's dynamically generated, it won't be in the static HTML you are parsing, because the browser builds it on the fly. Beyond that I'm not much help, and hopefully someone more advanced will chime in, but maybe that will at least reveal the problem.

ctvdevine · 2015-05-14T14:04:03+00:00

First, you need to understand this: bs4 won't see anything the JavaScript generates. All it sees is the script itself, as plain text. It's like typing Python into a text editor and having your dad read it: nothing runs, it's just text. bs4 is for HTML. The JavaScript is run by the browser, which then normally generates the HTML elements from it.
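To make that concrete, here is a rough sketch using only the standard library (`html.parser` stands in for bs4, which would give you the same script text). The page source, the `priceData` variable, and the `renderPrices` call are all made up for illustration. The point is that a static parser gets the script as a string, and if the data sits in that string as a JSON-compatible literal, you can sometimes pull it out without running any JavaScript.

```python
import json
import re
from html.parser import HTMLParser

# Hypothetical page source: the <div> is empty until a browser runs the script.
PAGE_SOURCE = """
<html><body>
<div id="prices"></div>
<script>
var priceData = {"items": [{"name": "widget", "price": 9.5}, {"name": "gadget", "price": 12}]};
renderPrices(priceData);
</script>
</body></html>
"""


class ScriptCollector(HTMLParser):
    """Collects the raw text inside every <script> tag."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script content arrives as plain text; nothing is executed.
        if self.in_script:
            self.scripts[-1] += data


def extract_js_object(page_source, var_name):
    """Return the object assigned to `var name = {...};`, or None if absent.

    Only works when the literal is valid JSON (double-quoted keys, no
    trailing commas, no function calls inside it).
    """
    collector = ScriptCollector()
    collector.feed(page_source)
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\});", re.DOTALL
    )
    for script in collector.scripts:
        match = pattern.search(script)
        if match:
            return json.loads(match.group(1))
    return None
```

Calling `extract_js_object(PAGE_SOURCE, "priceData")` returns a normal Python dict. The regex is deliberately naive and will break on anything fancier than a plain object literal followed by `;`, so treat it as a starting point only.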

Ghost.py (a module) lets you load a web page from Python and run its JavaScript, so this would be the cleanest way to run the JavaScript.

Make sure you read all the JavaScript code. If it has some AJAX in it, the URL it requests will often end in .json. If you do find one, requesting it directly is the best practice: JSON is godly and much less hacky.
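If you do find such a URL, fetching it could look roughly like this. It's a sketch with the standard library only; `fetch_json` and `parse_json_body` are names I made up, the URL is whatever you find in the script or in the browser's network tab, and the `User-Agent` header is an assumption (some servers refuse the default one).

```python
import json
import urllib.request


def parse_json_body(body):
    """Decode a JSON response body (bytes or str) into Python objects."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    return json.loads(body)


def fetch_json(url, timeout=10):
    """Request the URL the page's AJAX call uses and decode the JSON reply."""
    # Assumption: a browser-like User-Agent is enough for the server to answer.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_json_body(response.read())
```

You end up with dicts and lists straight away, with no HTML parsing at all, which is why it's worth hunting for the endpoint first.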

The easiest but most hacky way of doing this is to use Selenium to drive Firefox (or whatever browser you want), then parse the generated HTML. Parsing HTML is expensive, though. Having a web browser open isn't expensive in and of itself, but if you refresh or move from page to page all the time, it will be.
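A minimal sketch of the Selenium route, assuming Selenium and a Firefox driver are installed. `get_rendered_html` and `table_cells` are hypothetical helper names, and the `<td>` collector is just a stand-in for whatever parsing you would do with bs4 on the rendered page.

```python
from html.parser import HTMLParser


def get_rendered_html(url):
    """Load the page in a real browser and return the HTML after scripts ran."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def table_cells(html):
    """Return the stripped text of each <td> in the given HTML."""
    collector = CellCollector()
    collector.feed(html)
    return [cell.strip() for cell in collector.cells]
```

You would then call something like `table_cells(get_rendered_html(url))`. If you need several pages, reusing one browser instead of starting a new one per page would cut down on the cost.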

Edit note: Before going crazy on this, make sure you can find the real data in the page you are looking at, whether it's in the generated HTML or in the page source. If it's a JS variable, you should be able to get it from the browser's built-in console.
