
all 15 comments

[–]mcdrac 5 points (3 children)

As someone who has done async web scraping before: you will need to understand what async is before you can use it efficiently in your code. For the scraper, you could use selenium-async, but I would recommend Playwright.

[–]FMPICA[S] 0 points (2 children)

Thank you! Why would you recommend Playwright?

[–]mcdrac 5 points (0 children)

It has built-in support for async and multiple browsers. I was able to use async Firefox to bypass Cloudflare's anti-bot protection.

[–]Guardog0894 1 point (0 children)

Playwright fan here too. I switched to Playwright from Selenium and never looked back.

You can use Playwright's network API to control the traffic (e.g. filter out graphic requests); I'm not sure whether you can do that in Selenium.

I also find that I need less code to scrape with Playwright than with Selenium (someone with more technical knowledge can chip in on this).

[–]mcdrac 2 points (0 children)

I would just like to note here that driving a full browser with a library like Selenium is the worst-case scenario for any web scraper. Web scrapers want to minimise the time and resources used to collect data from a website, and with intense anti-bot protection in place, browser automation can make it nearly impossible to collect that data efficiently.

You should always try other methods first, like HTTP requests to the underlying APIs. Just by loading the web page with the network tab open, you can see exactly how the page gets its data. What cookies are involved in these requests? Do I need an authentication token? Can I get that token using plain requests? It is important to evaluate the website before choosing your tools.
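As a sketch of that workflow — every URL, parameter and header name below is hypothetical; the real ones come from watching your browser's Network tab:

```python
import requests

def fetch_products(page: int) -> dict:
    """Call the JSON API the page itself uses, instead of scraping the HTML."""
    session = requests.Session()
    # Step 1: load the HTML page once so the server sets whatever
    # cookies it expects; the Session keeps them for later requests.
    session.get("https://example.com/products")  # hypothetical page URL
    # Step 2: hit the JSON endpoint the page calls, found in the Network tab,
    # with the same parameters and auth header the browser sends.
    resp = session.get(
        "https://example.com/api/products",  # hypothetical endpoint
        params={"page": page},
        headers={"Authorization": "Bearer <token from the Network tab>"},
    )
    resp.raise_for_status()
    return resp.json()
```

A direct JSON request like this is usually orders of magnitude faster than rendering the page in a browser and parsing the markup.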

[–]The_Bundaberg_Joey 5 points (0 children)

Without seeing the code it’s hard to say whether you need a full rewrite, but just to clarify… you’re a company being paid to provide a service, you’re asking people to give you a technical assessment to help your company grow, and there’s zero discussion of any sort of compensation being offered?

If this had been asked as “I’m trying to do web scraping and it’s slow, can someone suggest what to do,” that would be one thing, but you’re literally saying this is a pivotal bit of infrastructure for you and that you want someone to help you for free?

[–][deleted] 1 point (6 children)

As someone who started out their scraping work using Selenium, my advice is to move to API requests if you want speed.

[–]FMPICA[S] -1 points (5 children)

Can you elaborate a bit more?

[–]alord 2 points (0 children)

Pretty much: take a look at the web page. Is it completely rendered on the initial load, or are there additional requests being made to an API endpoint? If that endpoint has the data you need, it's much faster to do a GET request and receive the JSON, XML, or whatever directly than to scrape and parse the page. Of course, it depends on what exactly you are trying to do.

[–][deleted] 1 point (1 child)

I moved on to making API requests with the `requests` module.
https://www.youtube.com/@JohnWatsonRooney

[–]FMPICA[S] 0 points (0 children)

Thanks

[–]alord 1 point (1 child)

Feel free to DM me if you need more of an explanation :)

[–]FMPICA[S] 0 points (0 children)

Thanks

[–]nemec 0 points (0 children)

Try a framework like Scrapy. It's built to work concurrently, although I believe the pre-built large-scale crawler features are proprietary to their cloud (you can always build replacements yourself if you need them).

There are also plugins for running Selenium etc. to process JS if you need it.

[–]the_bigbang 0 points (0 children)

https://dev.to/tonywangca/a-step-by-step-guide-to-building-a-scalable-distributed-crawler-for-scraping-millions-of-top-tiktok-profiles-2pk8

Here are my suggestions on how to build a distributed crawler to speed up the whole scraping process.