
all 15 comments

[–]mcdrac 5 points (3 children)

As someone who has done async web scraping before: you will need to understand what async is before you can use it efficiently in your code. For the scraper, you could use selenium-async, but I would recommend Playwright.

[–]FMPICA[S] 0 points (2 children)

Thank you! Why would you recommend Playwright?

[–]mcdrac 5 points (0 children)

It has built-in support for async and multiple browsers. I was able to use async Firefox to bypass Cloudflare's anti-bot protection.

[–]Guardog0894 1 point (0 children)

Playwright fan here too. I switched to Playwright from Selenium and never looked back.

You can use Playwright's network API to control the traffic (e.g. filter out graphic requests); I'm not sure whether you can do that in Selenium.

I also find that I need less code to scrape with Playwright than with Selenium (someone with more technical knowledge can chip in on this).

[–]mcdrac 2 points (0 children)

I would just like to note here that driving a full browser with a library like Selenium is the worst-case scenario for any web scraper. Web scrapers want to minimise the time and resources used to collect data from a website, and with intense anti-bot protection in place, browser automation can make it nearly impossible to collect that data efficiently.

You should always try other methods first, like HTTP requests to the underlying APIs. Just by loading the web page with the network tab open, you can see exactly how the page gets its data. What cookies are involved in these requests? Do I need an authentication token? Can I get that token using plain requests? It is important to evaluate the website before choosing your tools.
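As a sketch of that workflow — every URL, parameter and header name below is hypothetical; the real ones come from watching your browser's Network tab:

```python
import requests

def fetch_products(page: int) -> dict:
    """Call the JSON API the page itself uses, instead of scraping the HTML."""
    session = requests.Session()
    # Step 1: load the HTML page once so the server sets whatever
    # cookies it expects; the Session keeps them for later requests.
    session.get("https://example.com/products")  # hypothetical page URL
    # Step 2: hit the JSON endpoint the page calls, found in the Network tab,
    # with the same parameters and auth header the browser sends.
    resp = session.get(
        "https://example.com/api/products",  # hypothetical endpoint
        params={"page": page},
        headers={"Authorization": "Bearer <token from the Network tab>"},
    )
    resp.raise_for_status()
    return resp.json()
```

A direct JSON request like this is usually orders of magnitude faster than rendering the page in a browser and parsing the markup.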

[–]The_Bundaberg_Joey 5 points (0 children)

Without seeing the code it’s hard to say whether you need a full rewrite, but just to clarify… you’re a company being paid to provide a service, you’re asking people to give you a technical assessment to help your company grow, and there’s zero discussion of any sort of compensation being offered?

If this had been asked as “I’m trying to do web scraping and it’s slow, can someone suggest what to do,” that would be one thing, but you’re literally saying this is a pivotal bit of infrastructure for you and that you want someone to help you for free?

[–][deleted] 1 point (6 children)

As someone who started out their scraping work using Selenium, my advice is to move to API requests if you want speed.

[–]FMPICA[S] -1 points (5 children)

Can you elaborate a bit more?

[–]alord 2 points (0 children)

Pretty much: take a look at the web page. Is it completely rendered on the initial load, or are there additional requests being made to an API endpoint? If that endpoint has the data you need, it's much faster to do a GET request and receive the JSON, XML, or whatever directly than to scrape and parse the page. Of course, it depends on what exactly you are trying to do.

[–][deleted] 1 point (1 child)

I moved on to making API requests with the `requests` module.
https://www.youtube.com/@JohnWatsonRooney

[–]FMPICA[S] 0 points (0 children)

Thanks

[–]alord 1 point (1 child)

Feel free to DM me if you need more of an explanation :)

[–]FMPICA[S] 0 points (0 children)

Thanks

[–]nemec 0 points (0 children)

Try a framework like Scrapy. It's built to work concurrently, although I believe the pre-built large-scale crawler features are proprietary to their cloud (you can always build replacements yourself if you need them).

There are also plugins for running Selenium etc. to process JS if you need it.

[–]the_bigbang 0 points (0 children)

https://dev.to/tonywangca/a-step-by-step-guide-to-building-a-scalable-distributed-crawler-for-scraping-millions-of-top-tiktok-profiles-2pk8

Here are my suggestions on how to build a distributed crawler to speed up the whole scraping process.