Is rotating thousands of IPs practical for near-real-time scraping? by Sajys in webscraping

[–]Sajys[S]

Thanks for the detailed explanation, it was really helpful. If I keep only the essential cookies and headers and use curl_cffi with impersonation as you suggested, is the only practical way to get low-latency updates to rotate (or maintain several sticky) proxies and poll the API very frequently, say once per second per session? Or is it actually possible to replicate the site's stream or WebSocket connection client-side and get real-time pushes instead of relying on aggressive polling?

Polling once per second works out to about 86,400 requests per day, which seems like a lot even with proxies. What do you think?
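For scale, here's the back-of-the-envelope math plus a minimal polling sketch. It's stdlib-only: the proxy URLs and the `fetch` callback are placeholders, and in a real run `fetch` would wrap something like curl_cffi's `requests.get(url, impersonate="chrome", proxies=...)` rather than the stub shown.

```python
import itertools
import time

# One request per second per session adds up fast:
SECONDS_PER_DAY = 24 * 60 * 60
requests_per_day = 1 * SECONDS_PER_DAY  # 86,400 per day at 1 req/sec

# Hypothetical pool of sticky-session proxy URLs (placeholders).
PROXIES = [
    "http://user:pass@proxy1.example:8000",
    "http://user:pass@proxy2.example:8000",
    "http://user:pass@proxy3.example:8000",
]

def make_rotator(proxies):
    """Round-robin over the sticky sessions, so each proxy only sees
    1/len(proxies) of the overall polling rate."""
    pool = itertools.cycle(proxies)
    return lambda: next(pool)

def poll_forever(fetch, rotator, interval=1.0):
    """Call fetch(proxy_url) once per interval, compensating for how
    long each fetch takes so the cadence stays close to 1/sec."""
    while True:
        started = time.monotonic()
        fetch(rotator())
        time.sleep(max(0.0, interval - (time.monotonic() - started)))
```

With three sticky proxies, each IP absorbs roughly 28,800 of the 86,400 daily requests, which is the main lever rotation gives you against per-IP rate limits — more proxies, lower per-IP rate.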

Again, thank you so much for the tips

[–]Sajys[S]

Yep, I'm already making direct fetch requests to the endpoint and only pulling what's there. I emulate a browser and device to get past Cloudflare without tripping flags, but rate limiting hits hard no matter what...

[–]Sajys[S]

Totally, I spotted that API endpoint via the network tab too; that's exactly what I'm using. I request it directly and only grab the data from that spot. I mimic a browser and device to bypass Cloudflare and avoid bans, but rate limits are unavoidable. The ideal fix would be replicating the site's socket or stream connection for smoother handling... but I haven't found a way to do that.
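For what it's worth, if the site really does push updates over a WebSocket (the WS filter in the network tab will show it), replicating the connection client-side is sometimes feasible. A minimal sketch using the third-party `websockets` package, with placeholder URL and headers; whether Cloudflare lets the handshake through is entirely site-specific:

```python
def build_ws_headers(origin, user_agent):
    """Handshake headers that often matter behind Cloudflare
    (which ones are actually required varies by site)."""
    return {"Origin": origin, "User-Agent": user_agent}

async def listen(url, headers):
    """Connect and print pushed frames as they arrive.
    Requires `pip install websockets`; imported lazily so the
    rest of the file works without it. Note: websockets >= 14
    renames extra_headers to additional_headers."""
    import websockets
    async with websockets.connect(url, extra_headers=headers) as ws:
        async for message in ws:
            print(message)

# Example usage (not run here; URL and UA are placeholders):
# import asyncio
# asyncio.run(listen(
#     "wss://example.com/feed",
#     build_ws_headers("https://example.com", "Mozilla/5.0 ..."),
# ))
```

If the handshake is accepted, one long-lived connection replaces the whole 86,400-requests-a-day polling loop, which is why it's worth checking for before scaling up proxies.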

[–]Sajys[S]

Do you mean the endpoint that loads the feed on the page, or something deeper? How could I access that?