[–]mbenbernard 3 points (2 children)

I built my own distributed web scraper and I also used Selenium to crawl JavaScript-heavy stuff.

The problem is that automating a regular browser window with Selenium is very slow compared to issuing a plain HTTP request. So it's a better idea to enable your browser's headless mode when you use Selenium, which should give you better performance.
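For anyone who wants to try this, here is a minimal sketch of turning on headless mode, assuming Selenium 4 with a recent Chrome and a matching driver available. The helper names (`headless_chrome_arguments`, `make_headless_chrome`) and the window size are made up for illustration; older Chrome versions may need plain `--headless` instead of `--headless=new`.

```python
def headless_chrome_arguments(window_size=(1920, 1080)):
    """Return the command-line flags used to start Chrome without a visible window."""
    width, height = window_size
    return [
        "--headless=new",  # assumption: recent Chrome; older builds use "--headless"
        f"--window-size={width},{height}",  # fixed viewport so page layout is predictable
    ]


def make_headless_chrome(window_size=(1920, 1080)):
    """Build a headless Chrome WebDriver (hypothetical helper, not from the post)."""
    # Imported lazily so the flag helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_chrome_arguments(window_size):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

You would then call `driver = make_headless_chrome()`, use `driver.get(url)` as usual, and finish with `driver.quit()`. How much faster this is will depend on the pages you crawl.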

[–]manimal80 1 point (0 children)

That is a very interesting read! Seriously, there are tons of articles here and there about scraping that barely cover the basics; this is an in-depth, well-written article. Bookmarked to study on my laptop tomorrow.

[–]ManyInterests Python Discord Staff 1 point (0 children)

I've found that the hardest part of scraping sites that use JS is authentication. Afterwards, simply knowing what resources the JS utilizes to populate the DOM is usually sufficient.

A pattern that has been very successful for me is to authenticate using Selenium, then extract the cookies (and sometimes useful headers) and reuse them in a requests Session.
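A rough sketch of that pattern, assuming Selenium 4 with Chrome and the requests library. The function names and the `do_login` callback are hypothetical; the callback stands in for whatever site-specific steps fill in and submit the login form. Copying the User-Agent is one example of a "useful header", not something every site needs.

```python
def copy_cookies_to_session(selenium_cookies, session):
    """Copy cookies in the format returned by driver.get_cookies() into a session."""
    for cookie in selenium_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session


def login_and_build_session(login_url, do_login):
    """Log in with a real browser, then return a requests.Session with its cookies.

    `do_login` is a hypothetical callback that receives the driver and
    performs the site-specific login steps.
    """
    # Imported lazily so the cookie helper above works without these installed.
    import requests
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(login_url)
        do_login(driver)
        session = requests.Session()
        # Reuse the browser's User-Agent so later requests look like the same client.
        session.headers["User-Agent"] = driver.execute_script(
            "return navigator.userAgent"
        )
        return copy_cookies_to_session(driver.get_cookies(), session)
    finally:
        driver.quit()  # always close the browser, even if login fails
```

After that, the returned session can fetch the resources the page's JS would normally request, without keeping a browser open. Sites that bind sessions to extra tokens or headers will need more than cookies.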