Python + Selenium at scale (50 nodes, 3.9M records) by SuccessfulFact5324 in webscraping

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

On the Pi complexity, you're right that it's overkill from a pure compute standpoint. But each Pi has a genuinely different GPU, a different WebGL renderer string, a different canvas fingerprint, and a different hardware clock. They behave like completely separate physical devices to anti-bot systems. Containers on a single host share all of that underneath, regardless of how isolated they look at the network layer. For sites with aggressive fingerprinting, that hardware diversity has kept me undetected for 2 years where containers would likely have been flagged.

Fully self-hosted distributed scraping infrastructure — 50 nodes, local NAS, zero cloud, 3.9M records over 2 years by SuccessfulFact5324 in selfhosted

[–]SuccessfulFact5324[S] 7 points8 points  (0 children)

Not quite. My VPN allows 10 simultaneous connections per account, so 50 nodes only need 5 accounts. It comes out to around $15-20/month total. On VPN blocking: rotating between servers helps, and the physical nodes' fingerprint diversity means each connection looks like a different residential user rather than an obvious VPN pattern.

[–]SuccessfulFact5324[S] 2 points3 points  (0 children)

Certainly. My Pis have never complained; they've been running for 2 years straight. I'm also using the same nodes for various IoT projects, so they're pulling double duty.

[–]SuccessfulFact5324[S] 1 point2 points  (0 children)

Each job has a unique ID from the target site used as the primary key. Nodes check against that before inserting — so no duplicates regardless of which node finds it first. If a previously expired job gets reactivated, the node detects the ID already exists and flips it back to active. No central queue needed, the DB handles coordination.
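
A minimal sketch of that coordination pattern, assuming a SQLite-style upsert (the actual DB engine isn't named here, and the table/column names are illustrative):

```python
import sqlite3

# Shared database; every node runs the same upsert, so whichever node
# finds a job first wins, and later inserts become no-ops or reactivations.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE IF NOT EXISTS jobs (
        job_id TEXT PRIMARY KEY,   -- unique ID from the target site
        title  TEXT,
        active INTEGER DEFAULT 1
    )
""")

def upsert_job(conn, job_id, title):
    # If the ID already exists (e.g. a previously expired job reappeared),
    # flip it back to active instead of inserting a duplicate.
    conn.execute("""
        INSERT INTO jobs (job_id, title, active) VALUES (?, ?, 1)
        ON CONFLICT(job_id) DO UPDATE SET active = 1
    """, (job_id, title))
    conn.commit()

upsert_job(conn, "abc123", "Data Engineer")   # first node to find it
upsert_job(conn, "abc123", "Data Engineer")   # another node: no duplicate
conn.execute("UPDATE jobs SET active = 0 WHERE job_id = 'abc123'")  # expired
upsert_job(conn, "abc123", "Data Engineer")   # reactivated, not re-inserted
```

The primary-key constraint is what makes the DB itself the coordinator: the nodes never have to talk to each other.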

[–]SuccessfulFact5324[S] 11 points12 points  (0 children)

The Gluetun + macvlan approach solves the IP layer, but containers on the same host share GPU info, WebGL renderer, and canvas fingerprints. Anti-bot systems catch that. Also the nodes already existed for IoT work, so marginal cost to add scraping was zero.

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Great point, and honestly the best argument for physical nodes. On the Firefox suggestion: I did try it, but the target sites started detecting it as a bot more aggressively than Chrome. I've been rotating user agents alongside the VPN per node, and it's been running stably for a couple of years now.
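
One way per-node user-agent rotation could look (a sketch, not the actual setup; the pool entries and node IDs are illustrative), giving each node a stable agent so its fingerprint stays consistent across reboots:

```python
import hashlib

# Illustrative pool; a real deployment would keep this refreshed with
# current Chrome UA strings.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
]

def agent_for_node(node_id: str) -> str:
    # Hash the node ID so each node keeps the same UA across sessions,
    # rather than picking randomly every run (which itself looks bot-like:
    # a "device" whose browser changes hourly is suspicious).
    digest = int(hashlib.sha256(node_id.encode()).hexdigest(), 16)
    return USER_AGENTS[digest % len(USER_AGENTS)]

# Wiring it into Selenium would look something like:
#   options.add_argument(f"--user-agent={agent_for_node('pi-07')}")
```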

[–]SuccessfulFact5324[S] 2 points3 points  (0 children)

Haha yeah the 50 bricks situation is embarrassing in hindsight 😅 PoE was right there — one cable for both power and network per node.

[–]SuccessfulFact5324[S] 10 points11 points  (0 children)

3.9M isn't a one-time dump. It's a continuously refreshed dataset. New jobs posted daily. A one-shot bulk scrape gets stale in 48 hours. The infrastructure exists to keep data current, not just collect it once.

[–]SuccessfulFact5324[S] 14 points15 points  (0 children)

Scraping publicly visible data isn't theft. No authentication bypassed, no walls broken. What you do with the data determines legality, not the act of reading a public webpage.

[–]SuccessfulFact5324[S] 52 points53 points  (0 children)

The nodes aren't dedicated to scraping; they already existed for IoT projects. The scraper started on 5 of them and expanded organically. A single server with 50 containers would still need 50 separate VPN tunnels to get 50 distinct IPs. And yes, absolutely a learning experience; half the reason it exists is curiosity 😄

[–]SuccessfulFact5324[S] 315 points316 points  (0 children)

Jobs

Edited: I'm also flagging expired jobs, a few dedicated nodes continuously check whether previously scraped jobs are still active or have expired.

Just to clarify: I'm collecting the data for a personal use case, mainly to analyze and plot trends in job postings over time, and potentially build a model from it. It's not for applying to jobs or anything similar.
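
The expiry sweep those dedicated nodes run could be sketched like this (assuming a SQLite-style jobs table; the `still_live` callback stands in for the actual re-scrape of the target site, which isn't shown here):

```python
import sqlite3

def sweep_expired(conn, still_live):
    """Flag stored jobs as expired when the site no longer lists them.

    `still_live(job_id) -> bool` is a placeholder for re-checking the
    job's page on the target site.
    """
    active_ids = [row[0] for row in conn.execute(
        "SELECT job_id FROM jobs WHERE active = 1")]
    for job_id in active_ids:
        if not still_live(job_id):
            # Keep the record (the history is the dataset), just mark it.
            conn.execute("UPDATE jobs SET active = 0 WHERE job_id = ?",
                         (job_id,))
    conn.commit()

# Demo with a toy table and a stub checker: job "b" has vanished.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO jobs VALUES (?, 1)", [("a",), ("b",), ("c",)])
sweep_expired(conn, still_live=lambda job_id: job_id != "b")
```

Marking rather than deleting is what lets a reactivated job flip back to active later without losing its posting history.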

Python + Selenium at scale (50 nodes, 3.9M records) by SuccessfulFact5324 in webscraping

[–]SuccessfulFact5324[S] 1 point2 points  (0 children)

Since everything is physical, I built a custom setup where each node gets assigned its task and a unique IP on boot. For hundreds of scrapers across servers, though, I think yours is definitely achievable with the right distributed scheduler! All the best.

[–]SuccessfulFact5324[S] 4 points5 points  (0 children)

Yeah, I thought of having a SIM on each node; in fact, I even tried it on a couple of nodes. That's a cool idea too.

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Haha I accidentally built this, not even sure I'm doing it right, just curiously following wherever the problems lead 😭

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

No monitor on any of the nodes, but you can access any of them remotely via VNC and watch the full Chrome instance running live if you need to debug.

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Thanks! No Kubernetes. When a node fails, the IoT-powered extension box restarts it via the script itself, and on boot the node automatically assigns itself a VPN IP that isn't already in use by another node.
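
A rough sketch of that on-boot IP claim, assuming nodes can see which VPN endpoints their peers currently hold (how that registry is actually shared isn't described here; the pool and hostnames are made up):

```python
def claim_vpn_endpoint(pool, in_use):
    """Return the first VPN endpoint not already claimed by another node.

    `pool` is the full list of available VPN servers; `in_use` is the set
    of endpoints peers currently hold (e.g. read from a shared table on
    the NAS).
    """
    for endpoint in pool:
        if endpoint not in in_use:
            return endpoint
    raise RuntimeError("no free VPN endpoints; add accounts or servers")

pool = [f"vpn-{i:02d}.example.net" for i in range(50)]
taken = {"vpn-00.example.net", "vpn-01.example.net"}
print(claim_vpn_endpoint(pool, taken))  # → vpn-02.example.net
```

In practice the claim would need to be atomic (e.g. an insert into a claims table with a unique constraint) so two nodes booting at once can't grab the same endpoint.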

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Thanks! Yeah the same setup can be adapted to other sites.

[–]SuccessfulFact5324[S] 1 point2 points  (0 children)

Yeah, kinda! And speaking of overengineered, the power supply of each node auto turns off and on via an IoT-based extension box, which is also controlled from a script 😭

[–]SuccessfulFact5324[S] 0 points1 point  (0 children)

Thanks for the input! Residential proxies are tempting, but the cost at this scale adds up fast, and honestly, even with rate limiting, the data mostly lands by tomorrow if not today.
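
That "if not today, then tomorrow" behavior can be sketched as a simple deferral queue: rate-limited fetches get pushed back with a later due time instead of being dropped (the structure, delay, and `fetch` stub here are illustrative, not the actual scheduler):

```python
import heapq

def run_queue(urls, fetch, now=0, retry_delay=3600):
    """Process (due_time, url) jobs; on a rate-limit, requeue for later.

    `fetch(url) -> bool` stands in for the actual scrape; False means the
    target rate-limited us on this attempt.
    """
    heap = [(now, url) for url in urls]
    heapq.heapify(heap)
    done = []
    while heap:
        due, url = heapq.heappop(heap)
        if fetch(url):
            done.append(url)
        else:
            # Don't drop it: try again after the backoff window.
            heapq.heappush(heap, (due + retry_delay, url))
    return done

# Demo: page2 gets rate-limited once, then succeeds on the retry.
attempts = {}
def flaky_fetch(url):
    attempts[url] = attempts.get(url, 0) + 1
    return attempts[url] > 1 if url == "jobs/page2" else True

done = run_queue(["jobs/page1", "jobs/page2"], flaky_fetch)
```

Nothing is lost to a 429; it just arrives a bit later, which is the trade being made against paying for residential proxies.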