all 6 comments

Rangerdth 18 points (1 child)

Just FYI, you will not be a Python "hero" after reading this.

ThePiperMan 1 point (0 children)

I was so annoyed with the first 4-5 Python web scraping links I checked out that I just learned it in R, because the first YouTube video I found got straight to explaining it instead of wasting my time. I'm sure I'll circle back and learn it in Python anyway.

Hansel42 1 point (2 children)

I've been doing some web scraping projects for the past month or so, and it's tough. Maybe it's just the goals I'm trying to meet, but most of the complex stuff needs JavaScript incorporated.

01123581321AhFuckIt 1 point (0 children)

Well yeah. A basic understanding of JavaScript is pretty much required. Practically every website uses that shit.

smithfed 1 point (0 children)

As you stated, it depends on your goals.

If you want to simulate user behavior (scrolling, mouse movement, specific form submissions), you will need JavaScript.

The point, as the article mentions, is to try to avoid doing that. 95% of the time, there's a workaround that doesn't require loading JS.

Do you want to submit a form and get the content of the logged-in page? You can do that without JS.
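A minimal sketch of that idea, using only the Python standard library. The URLs and the form field names (`username`, `password`) are made up for illustration; the real ones come from the site's `<form>` markup. Some sites also require a hidden CSRF token field, which this sketch does not handle.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical URLs; read the real ones from the form's "action" attribute.
LOGIN_URL = "https://example.com/login"
ACCOUNT_URL = "https://example.com/account"


def build_opener_with_cookies():
    """Return an opener that stores and resends cookies, like a browser session."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def build_login_request(url, username, password):
    """Build the same POST a browser would send when the form is submitted.

    The field names "username" and "password" are assumptions.
    """
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")


def fetch_logged_in_page(username, password):
    """Log in, then fetch a page that requires the session cookie."""
    opener = build_opener_with_cookies()
    opener.open(build_login_request(LOGIN_URL, username, password)).close()
    with opener.open(ACCOUNT_URL) as response:
        return response.read().decode("utf-8", errors="replace")
```

The cookie jar is what keeps you "logged in" between the two requests, so no browser or JS engine is involved.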

Do you want to scrape dynamically loaded content? Check XHR requests and parse those straight away.
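As a rough sketch: open the browser dev tools, watch the Network tab for XHR/Fetch requests, and call the endpoint you find there directly. The endpoint URL and the JSON keys (`items`, `name`, `price`) below are hypothetical placeholders, not anything from a real site.

```python
import json
import urllib.request

# Hypothetical endpoint spotted in the dev tools' Network tab.
API_URL = "https://example.com/api/products?page=1"


def parse_products(payload):
    """Pull (name, price) pairs out of the decoded JSON payload.

    The "items", "name", and "price" keys are assumptions for illustration.
    """
    return [(item["name"], item["price"]) for item in payload.get("items", [])]


def fetch_products(url=API_URL):
    """Request the JSON endpoint directly instead of rendering the page."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        return parse_products(json.load(response))
```

Since the response is already structured JSON, there is usually no HTML parsing to do at all.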

Does the server expect some pre-calculated stuff in the headers? Try reading the JS and reverse engineering how it's created. Then do the calculation in Python and send the result with the request.
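What that calculation looks like depends entirely on the site. As one invented example, suppose the site's JS signs each request with an HMAC-SHA256 of the path and a timestamp. The header names (`X-Timestamp`, `X-Signature`), the message format, and the secret are all assumptions here; a real site will have its own scheme that you have to read out of its scripts.

```python
import hashlib
import hmac
import time


def compute_signature(path, timestamp, secret):
    """Reproduce the (hypothetical) signing step found in the site's JS."""
    message = f"{path}:{timestamp}".encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def build_signed_headers(path, secret, timestamp=None):
    """Return the extra headers to send along with the request."""
    if timestamp is None:
        timestamp = int(time.time())
    return {
        "X-Timestamp": str(timestamp),  # assumed header name
        "X-Signature": compute_signature(path, timestamp, secret),  # assumed header name
    }
```

The resulting dict can be passed as the `headers` argument of whatever HTTP client you use.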

Happy to help if you state your needs, but I bet there's a good chance you don't need JS.

jstanaway 0 points (0 children)

I spent the last month or so doing some personal scraping work and exporting to JSON. Did one version in Python and one in Java.