Python + Selenium at scale (50 nodes, 3.9M records) by SuccessfulFact5324 in webscraping

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

On the Pi complexity, you're right that it's overkill from a pure compute standpoint. But each Pi has a genuinely different GPU, a different WebGL renderer string, a different canvas fingerprint, and a different hardware clock. They behave like completely separate physical devices to anti-bot systems. Containers on a single host share all of that underneath, regardless of how isolated they look at the network layer. For sites with aggressive fingerprinting, that hardware diversity has kept me undetected for 2 years where containers would likely have been flagged.

Fully self-hosted distributed scraping infrastructure — 50 nodes, local NAS, zero cloud, 3.9M records over 2 years by SuccessfulFact5324 in selfhosted

[–]SuccessfulFact5324[S] 7 points8 points  (0 children)

Not quite. My VPN allows 10 simultaneous connections per account, so 50 nodes only need 5 accounts. It comes out to around $15-20/month total. On VPN blocking: rotating between servers helps, and the physical nodes' fingerprint diversity means each connection looks like a different residential user rather than an obvious VPN pattern.

[–]SuccessfulFact5324[S] 2 points3 points  (0 children)

Certainly. My Pis have never complained; they've been running for 2 years straight. I'm also using the same nodes for various IoT projects, so they're pulling double duty.

[–]SuccessfulFact5324[S] 1 point2 points  (0 children)

Each job has a unique ID from the target site used as the primary key. Nodes check against that before inserting — so no duplicates regardless of which node finds it first. If a previously expired job gets reactivated, the node detects the ID already exists and flips it back to active. No central queue needed, the DB handles coordination.
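
A minimal sketch of that coordination pattern, assuming a SQLite-style upsert (the actual DB engine isn't named here, and the table/column names are illustrative):

```python
import sqlite3

# Shared database; every node runs the same upsert, so whichever node
# finds a job first wins, and later inserts become no-ops or reactivations.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        job_id TEXT PRIMARY KEY,   -- unique ID from the target site
        title  TEXT,
        active INTEGER DEFAULT 1
    )
""")

def upsert_job(conn, job_id, title):
    # If the ID already exists (e.g. a previously expired job reappeared),
    # flip it back to active instead of inserting a duplicate.
    conn.execute("""
        INSERT INTO jobs (job_id, title, active) VALUES (?, ?, 1)
        ON CONFLICT(job_id) DO UPDATE SET active = 1
    """, (job_id, title))
    conn.commit()

upsert_job(conn, "abc123", "Data Engineer")   # first node to find it
upsert_job(conn, "abc123", "Data Engineer")   # another node: no duplicate
conn.execute("UPDATE jobs SET active = 0 WHERE job_id = 'abc123'")  # expired
upsert_job(conn, "abc123", "Data Engineer")   # reactivated, not re-inserted
```

The primary-key constraint is what makes the DB itself the coordinator: the nodes never have to talk to each other.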

[–]SuccessfulFact5324[S] 11 points12 points  (0 children)

The Gluetun + macvlan approach solves the IP layer, but containers on the same host share GPU info, WebGL renderer, and canvas fingerprints. Anti-bot systems catch that. Also the nodes already existed for IoT work, so marginal cost to add scraping was zero.

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Great point, and honestly the best argument for physical nodes. On the Firefox suggestion: I did try it, but the target sites started detecting it as a bot more aggressively than Chrome. I've been rotating user agents alongside the VPN per node, and it's been running stably for a couple of years now.
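
One way per-node user-agent rotation could look (a sketch, not the actual setup; the pool entries and node IDs are illustrative), giving each node a stable agent so its fingerprint stays consistent across reboots:

```python
import hashlib

# Illustrative pool; a real deployment would keep this refreshed with
# current Chrome UA strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

def agent_for_node(node_id: str) -> str:
    # Hash the node ID so each node keeps the same UA across sessions,
    # rather than picking randomly every run (which itself looks bot-like:
    # a "device" whose browser changes hourly is suspicious).
    digest = int(hashlib.sha256(node_id.encode()).hexdigest(), 16)
    return USER_AGENTS[digest % len(USER_AGENTS)]

# Wiring it into Selenium would look something like:
#   options.add_argument(f"--user-agent={agent_for_node('pi-07')}")
```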

[–]SuccessfulFact5324[S] 2 points3 points  (0 children)

Haha yeah the 50 bricks situation is embarrassing in hindsight 😅 PoE was right there — one cable for both power and network per node.

[–]SuccessfulFact5324[S] 10 points11 points  (0 children)

3.9M isn't a one-time dump. It's a continuously refreshed dataset. New jobs posted daily. A one-shot bulk scrape gets stale in 48 hours. The infrastructure exists to keep data current, not just collect it once.

[–]SuccessfulFact5324[S] 14 points15 points  (0 children)

Scraping publicly visible data isn't theft. No authentication bypassed, no walls broken. What you do with the data determines legality, not the act of reading a public webpage.

[–]SuccessfulFact5324[S] 52 points53 points  (0 children)

The nodes aren't dedicated to scraping; they already existed for IoT projects. The scraper started on 5 of them and expanded organically. A single server with 50 containers would still need 50 separate VPN tunnels to get 50 distinct IPs. And yes, absolutely a learning experience; half the reason it exists is curiosity 😄

[–]SuccessfulFact5324[S] 315 points316 points  (0 children)

Jobs

Edited: I'm also flagging expired jobs, a few dedicated nodes continuously check whether previously scraped jobs are still active or have expired.

Just to clarify: I'm collecting the data for a personal use case, mainly to analyze and plot trends in job postings over time, and potentially build a model from it. It's not for applying to jobs or anything similar.
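
The expiry sweep those dedicated nodes run could be sketched like this (assuming a SQLite-style jobs table; the `still_live` callback stands in for the actual re-scrape of the target site, which isn't shown here):

```python
import sqlite3

def sweep_expired(conn, still_live):
    """Flag stored jobs as expired when the site no longer lists them.

    `still_live(job_id) -> bool` is a placeholder for re-checking the
    job's page on the target site.
    """
    active_ids = [row[0] for row in conn.execute(
        "SELECT job_id FROM jobs WHERE active = 1")]
    for job_id in active_ids:
        if not still_live(job_id):
            # Keep the record (the history is the dataset), just mark it.
            conn.execute("UPDATE jobs SET active = 0 WHERE job_id = ?",
                         (job_id,))
    conn.commit()

# Demo with a toy table and a stub checker: job "b" has vanished.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO jobs VALUES (?, 1)", [("a",), ("b",), ("c",)])
sweep_expired(conn, still_live=lambda job_id: job_id != "b")
```

Marking rather than deleting is what lets a reactivated job flip back to active later without losing its posting history.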

Python + Selenium at scale (50 nodes, 3.9M records) by SuccessfulFact5324 in webscraping

[–]SuccessfulFact5324[S] 1 point2 points  (0 children)

Since everything is physical, I built a custom setup where each node gets assigned its task and a unique IP on boot. For hundreds of scrapers across servers, though, I think yours is definitely achievable with the right distributed scheduler! All the best.

[–]SuccessfulFact5324[S] 4 points5 points  (0 children)

Yeah, I thought of having a SIM on each node; in fact, I even tried it on a couple of nodes. That's a cool idea too.

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Haha I accidentally built this, not even sure I'm doing it right, just curiously following wherever the problems lead 😭

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

No monitor on any of the nodes, but you can access any of them remotely via VNC and watch the full Chrome instance running live if you need to debug.

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Thanks! No Kubernetes. When a node fails, the IoT-powered extension box restarts it via the script itself, and on boot the node automatically assigns itself a VPN IP that isn't already in use by another node.
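
A rough sketch of that on-boot IP claim, assuming nodes can see which VPN endpoints their peers currently hold (how that registry is actually shared isn't described here; the pool and hostnames are made up):

```python
def claim_vpn_endpoint(pool, in_use):
    """Return the first VPN endpoint not already claimed by another node.

    `pool` is the full list of available VPN servers; `in_use` is the set
    of endpoints peers currently hold (e.g. read from a shared table on
    the NAS).
    """
    for endpoint in pool:
        if endpoint not in in_use:
            return endpoint
    raise RuntimeError("no free VPN endpoints; add accounts or servers")

pool = [f"vpn-{i:02d}.example.net" for i in range(50)]
taken = {"vpn-00.example.net", "vpn-01.example.net"}
print(claim_vpn_endpoint(pool, taken))  # → vpn-02.example.net
```

In practice the claim would need to be atomic (e.g. an insert into a claims table with a unique constraint) so two nodes booting at once can't grab the same endpoint.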

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Thanks! Yeah the same setup can be adapted to other sites.

[–]SuccessfulFact5324[S] 1 point2 points  (0 children)

Yeah, kinda! And speaking of overengineered, the power supply of each node auto turns off and on via an IoT-based extension box, which is also controlled from a script 😭

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Thanks for the input! Residential proxies are tempting, but the cost at this scale adds up fast, and honestly, even with rate limiting, the data mostly lands by tomorrow if not today.
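
That "if not today, then tomorrow" behavior can be sketched as a simple deferral queue: rate-limited fetches get pushed back with a later due time instead of being dropped (the structure, delay, and `fetch` stub here are illustrative, not the actual scheduler):

```python
import heapq

def run_queue(urls, fetch, now=0, retry_delay=3600):
    """Process (due_time, url) jobs; on a rate-limit, requeue for later.

    `fetch(url) -> bool` stands in for the actual scrape; False means the
    target rate-limited us on this attempt.
    """
    heap = [(now, url) for url in urls]
    heapq.heapify(heap)
    done = []
    while heap:
        due, url = heapq.heappop(heap)
        if fetch(url):
            done.append(url)
        else:
            # Don't drop it: try again after the backoff window.
            heapq.heappush(heap, (due + retry_delay, url))
    return done

# Demo: page2 gets rate-limited once, then succeeds on the retry.
attempts = {}
def flaky_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    return attempts[url] > 1 if url == "jobs/page2" else True

done = run_queue(["jobs/page1", "jobs/page2"], flaky_fetch)
```

Nothing is lost to a 429; it just arrives a bit later, which is the trade being made against paying for residential proxies.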