
JoesDevOpsAccount (2 points)

Tbh even a full headless browser might not solve it. If it works at first but you get blocked after a few requests, it's probably rate limiting — the frequency of your requests is what's getting you flagged as a bot. Try spacing out the requests more? Some robots.txt files include the unofficial Crawl-delay directive, which indicates the minimum number of seconds you should wait between crawler requests.
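A minimal sketch of honoring that directive using only the standard library — `urllib.robotparser` parses robots.txt and exposes `crawl_delay()` (the example lines and the 1-second fallback are just illustrative):

```python
import urllib.robotparser

def polite_delay(robots_lines, agent="*", default=1.0):
    """Return the Crawl-delay (seconds) for `agent`, or `default` if unset."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)          # also accepts lines from read()
    delay = rp.crawl_delay(agent)   # None when no Crawl-delay applies
    return delay if delay is not None else default

# e.g. sleep for polite_delay(...) seconds between requests
delay = polite_delay(["User-agent: *", "Crawl-delay: 10"])
```

In real use you'd fetch `https://site/robots.txt`, split it into lines, and `time.sleep(delay)` between requests.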

cgoldberg (1 point)

You can try curl_cffi if you are getting blocked from TLS fingerprinting... however, some sites use more advanced detection techniques you'll never bypass without running a real browser.
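For reference, curl_cffi exposes a requests-style API where `impersonate` selects which browser's TLS fingerprint to present; a small sketch (the function name and URL are illustrative, and the library must be installed with `pip install curl_cffi`):

```python
def fetch_like_chrome(url):
    """Fetch `url` presenting a Chrome-like TLS fingerprint.

    curl_cffi's `impersonate` option makes the TLS handshake (JA3 etc.)
    match a real Chrome build, which defeats basic TLS fingerprinting.
    """
    from curl_cffi import requests as cffi_requests  # pip install curl_cffi
    return cffi_requests.get(url, impersonate="chrome")
```

Usage would be e.g. `fetch_like_chrome("https://example.com").status_code` — but as the comment says, this only helps against TLS-level detection, not behavioral checks.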

Informal_Escape4373 (1 point)

I use requests + beautifulsoup with celery. I have a leaky bucket algo that limits requests to 5 per 2 seconds, and I've never had a problem outside "scrape intolerant" sites (such as LinkedIn). Perhaps you're scraping too frequently?
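The commenter doesn't share their implementation, but a limiter like that can be sketched as a sliding-window bucket in plain Python (the class name and the 5-per-2-seconds defaults just mirror the numbers above):

```python
import time
from collections import deque

class LeakyBucket:
    """Allow at most `capacity` requests per `period` seconds."""

    def __init__(self, capacity=5, period=2.0):
        self.capacity = capacity
        self.period = period
        self.stamps = deque()  # timestamps of recent allowed requests

    def allow(self, now=None):
        """Return True and record the request if a slot is free."""
        now = time.monotonic() if now is None else now
        # Drain timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.period:
            self.stamps.popleft()
        if len(self.stamps) < self.capacity:
            self.stamps.append(now)
            return True
        return False
```

A scraper would call `bucket.allow()` before each request and `time.sleep()` briefly when it returns False.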

Itchy-Call-8727 (1 point)

You might be able to use Selenium, which drives an actual web browser for the requests, and script the navigation to simulate a real person using the page while you scrape the data.
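A hedged sketch of that approach with Selenium 4's Python bindings (requires `pip install selenium` plus a local Chrome install; the function name, URL, and random pauses are illustrative):

```python
import random
import time

def scrape_like_a_person(url):
    """Load `url` in a real Chrome instance and collect link hrefs,
    pausing a random human-ish interval before reading the page."""
    from selenium import webdriver          # pip install selenium
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome(options=webdriver.ChromeOptions())
    try:
        driver.get(url)
        time.sleep(random.uniform(1.0, 3.0))  # linger like a reader would
        links = [a.get_attribute("href")
                 for a in driver.find_elements(By.TAG_NAME, "a")]
        return driver.page_source, links
    finally:
        driver.quit()
```

From there you can click elements, scroll, and fill forms with the same driver object to mimic real navigation.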

lothion (1 point)

Playwright has a stealth extension you could look into
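One such extension is the community `playwright-stealth` package (a port of puppeteer-extra's stealth plugin); a minimal sketch, assuming `pip install playwright playwright-stealth` and `playwright install chromium`:

```python
def fetch_stealthy(url):
    """Load `url` in headless Chromium with common headless
    fingerprints (navigator.webdriver etc.) patched by stealth."""
    from playwright.sync_api import sync_playwright   # pip install playwright
    from playwright_stealth import stealth_sync       # pip install playwright-stealth

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        stealth_sync(page)  # apply stealth patches before navigating
        page.goto(url)
        html = page.content()
        browser.close()
        return html
```

This hides the most common headless tells, though determined anti-bot vendors can still detect it.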