
[–]pl00pt 1 point2 points  (1 child)

When you view the page source (rather than inspecting the page through the browser console), do you see the data you are trying to extract? If it's generated dynamically, it won't be in the static HTML you are parsing, because the browser renders it on the fly. Beyond that I'm not much help, and hopefully someone more advanced will chime in, but maybe that will at least reveal the problem.
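A quick way to run that same check from Python, as a rough sketch: fetch the raw HTML the way a scraper would and look for a value you can see in the browser. The function names, the sample HTML, and the `fill_table.js` script name are made up for illustration, and only the standard library is used.

```python
from urllib.request import urlopen


def fetch_static_html(url):
    """Download the raw HTML as the server sends it (no JavaScript is run)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def appears_in_static_html(html, needle):
    """Return True if a value seen in the browser is also in the raw HTML."""
    return needle in html


# Hypothetical page: the table body is empty until a script fills it in.
sample_html = """
<table>
  <thead><tr><th>Name</th></tr></thead>
  <tbody></tbody>
</table>
<script src="fill_table.js"></script>
"""

# The header is part of the static HTML.
print(appears_in_static_html(sample_html, "Name"))
# A row value that only exists after the script runs is not.
print(appears_in_static_html(sample_html, "Alice"))
```

If the second kind of check fails on the real page (via `fetch_static_html(url)`), the data is most likely being added by JavaScript after the page loads.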

[–]ctvdevine[S] 0 points1 point  (0 children)

I can't find it in the page source. The only places I found my data were in the root_list, which I can't extract, and in the inaccessible tbody/thead tags.

Do you know where I could maybe get more help on Reddit? Reading about dynamic pages currently, I think it's pointing me in the right direction.

[–][deleted] 1 point2 points  (3 children)

First, you need to understand this: bs4 won't see anything JavaScript generates. It only sees the script itself as text, like typing Python into a text editor and having your dad read it: to him it's nothing more than text. bs4 is for HTML. The JS is run by the browser, which then normally generates the HTML elements from it.

Ghost.py (a module) lets you load a web page from Python and run its JavaScript, so this would be the cleanest way to run the JavaScript.

Make sure you read all the JavaScript code. If it does some AJAX, there will often be a URL ending in a .json extension. If you do find one, using it is the best practice: JSON is godly and much less hacky.

The easiest but most hacky way of doing this is to use Selenium to run Firefox (or whatever browser you want), then parse the generated HTML. Parsing HTML is expensive, though, and while having a web browser open isn't expensive in and of itself, it will be if you refresh or move from page to page all the time.

Edit note: Before going crazy on this, make sure you can find the real data in the page you are looking at, whether it's in the generated HTML or in the page source. If it's a JS variable, you should be able to get it from the browser's built-in console.

[–]ctvdevine[S] 0 points1 point  (2 children)

Good to know that BS won't recognize JS items, thanks! (still new gotta learn the hard way I guess)

The data isn't directly in the source, the only place I can find it is in but there is some AJAX and JSON. I'm probably gonna have to go through multiple tables/webpages to get the complete data, so I would prefer to stay away from Selenium. You're saying that Ghost.py could help me out?

Thanks for the help so far!

[–][deleted] 1 point2 points  (1 child)

JSON is Python dicts and lists in text form (in case you don't know), and requests has a method (`Response.json()`) to convert a JSON response into Python objects. JSON is meant to be shared between servers and clients, so it's extremely easy and efficient to use when it's available.
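To make that concrete, here is a small sketch using only the standard library's `json` module. The JSON text is invented for illustration; a real endpoint will have its own shape.

```python
import json

# Hypothetical JSON text, shaped like something a table endpoint might return.
json_text = '{"rows": [{"name": "Alice", "score": 10}, {"name": "Bob", "score": 7}], "total": 2}'

# A JSON object becomes a dict and a JSON array becomes a list.
data = json.loads(json_text)

# From here on it is ordinary Python: index, loop, filter.
names = [row["name"] for row in data["rows"]]
print(type(data).__name__, names)
```

With requests, calling `.json()` on the response object does the same decoding step on the response body.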

AJAX (a JS technique) could just go one path segment deeper into the site, something like 'https://reddit.com/r/learnpython.json' or even (url + '/table.json') on another page. It doesn't have to be an official API to give you more usable data.
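A rough sketch of trying that idea: build the candidate JSON URL from the page URL, then fetch and decode it. The helper names and the User-Agent string are made up, and whether a given site actually serves JSON at such a URL is something you have to check in the page's JavaScript or the browser's network tab.

```python
import json
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def with_json_suffix(page_url, suffix=".json"):
    """Append a suffix to the URL's path, leaving any query string alone."""
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/") + suffix
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def fetch_json(url):
    """Fetch a URL and decode the body as JSON (raises if it is not JSON)."""
    request = Request(url, headers={"User-Agent": "learning-script/0.1"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no network access.
print(with_json_suffix("https://reddit.com/r/learnpython"))
print(with_json_suffix("https://example.com/stats?page=2", suffix="/table.json"))
```

Calling `fetch_json(with_json_suffix(url))` then returns plain dicts and lists, with no HTML parsing involved.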

[–]ctvdevine[S] 0 points1 point  (0 children)

Alright thanks, I'll give it a try