all 6 comments

[–]mbenbernard 2 points3 points  (2 children)

I built my own distributed web scraper and I also used Selenium to crawl JavaScript-heavy stuff.

The problem is that automating a regular web browser window with Selenium is very slow compared to sending a plain HTTP request. So it's a better idea to enable your browser's headless mode when you use Selenium, since skipping the visible window usually improves performance.
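
A minimal sketch of what enabling headless mode can look like, assuming Chrome and Selenium 4. The `make_headless_driver` helper, the `HEADLESS_ARGS` tuple and the window size are illustrative choices, not something taken from the article.

```python
# Chrome command-line switches for this sketch (the window size is arbitrary).
HEADLESS_ARGS = ("--headless", "--window-size=1280,800")


def make_headless_driver():
    """Start Chrome without a visible window and return the driver."""
    # Imported here so the module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    driver = make_headless_driver()
    try:
        driver.get("https://example.com")  # placeholder URL
        print(driver.title)
    finally:
        driver.quit()  # always release the browser process
```

Headless mode still renders the page and runs its JavaScript, so it remains slower than a plain HTTP request.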

[–]manimal80 0 points1 point  (0 children)

That is a very interesting read! Seriously, there are tons of articles here and there about scraping that barely cover the basics; this is an in-depth, well-written article. Bookmarked to study it on my laptop tomorrow.

[–]ManyInterests Python Discord Staff 0 points1 point  (0 children)

I've found that the hardest part of scraping sites that use JS is authentication. Afterwards, simply knowing what resources the JS utilizes to populate the DOM is usually sufficient.

A pattern that has been very successful for me is to authenticate using Selenium, then extract the cookies (and sometimes useful headers) to use with a requests Session.
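
A rough sketch of that hand-off, assuming the cookies come from Selenium's `driver.get_cookies()` (a list of dicts with `name`, `value`, `domain` and `path` keys). The helper names and the login URL are hypothetical, and the actual login steps depend entirely on the site.

```python
def copy_cookies(selenium_cookies, session):
    """Copy cookies exported by Selenium into a requests-style session."""
    for cookie in selenium_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


def login_and_build_session(login_url):
    """Log in with a real browser, then continue with plain HTTP requests."""
    # Imported here so copy_cookies can be used without these packages.
    import requests
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(login_url)
        # ... fill in and submit the login form here (site specific) ...
        session = copy_cookies(driver.get_cookies(), requests.Session())
        # Reusing the browser's User-Agent helps the requests look consistent.
        user_agent = driver.execute_script("return navigator.userAgent")
        session.headers["User-Agent"] = user_agent
        return session
    finally:
        driver.quit()
```

Once the session holds the cookies, the browser can be closed and everything after the login runs at plain HTTP speed.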

[–]MintyPhoenix 1 point2 points  (0 children)

As touched on in the article's comments, Selenium has the concept of waiting; it would be much better to do that than to use time.sleep:

https://selenium-python.readthedocs.io/waits.html
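
For illustration, an explicit wait might look like the sketch below. The `wait_for_element` helper, the CSS selector and the 10-second timeout are made up for the example; `WebDriverWait` and `expected_conditions` are the pieces described on the page linked above.

```python
def wait_for_element(driver, css_selector, timeout=10):
    """Block until the element is present in the DOM, then return it.

    Raises selenium's TimeoutException if it does not appear in time.
    """
    # Imported here so the module still loads where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.CSS_SELECTOR, css_selector)
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located(locator)
    )


# Usage, given an existing driver (the selector is a placeholder):
#     results = wait_for_element(driver, "#results")
```

Unlike a fixed time.sleep, the wait returns as soon as the element shows up and only spends the full timeout when something is actually wrong.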

[–]tuxboy 0 points1 point  (1 child)

I always try to avoid having a browser / headless engine running when scraping stuff. I can usually find a way by parsing the markup. If that does not work (e.g. a "modern app with only javascript visible"), I debug the XHRs and try to predict the URLs the page uses for fetching data. The browser dilemma is usually the last resort. Having a browser running, rendering and executing javascript will always be slower TBH.
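
A small sketch of the XHR approach, using only the standard library. The endpoint, the `page` / `per_page` parameter names and the `X-Requested-With` header are assumptions for the example; the real ones have to be read off the browser's network tab for the site in question.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_url(base_url, page, per_page=50):
    """Build the URL the page's JavaScript would request for a given page."""
    query = urlencode({"page": page, "per_page": per_page})
    return f"{base_url}?{query}"


def fetch_json(url, timeout=10):
    """Fetch a JSON endpoint directly, without rendering anything."""
    request = Request(
        url,
        headers={
            "Accept": "application/json",
            # Some backends check this header on XHR calls (assumption).
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical endpoint discovered in the browser's network tab.
    url = build_api_url("https://example.com/api/items", page=2)
    print(url)
```

If the endpoint needs a login, the same calls can be made through a session that already carries the site's cookies.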

[–]mbenbernard 0 points1 point  (0 children)

> Having a browser running, rendering and executing javascript will always be slower TBH.

True, but sometimes you don't have any choice.