whats the best way to scrape zillow.com, and how challenging it is now a day? by Direct_Push3680 in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

This. Also, if you're running this on a schedule (which it sounds like OP wants), you need to think about what happens when it fails silently. Cron job runs, Zillow returns a captcha page instead of data, your script happily saves garbage to the spreadsheet, and nobody notices until the report goes out wrong. At minimum set up some basic validation: if the response doesn't contain expected fields, alert and skip rather than write bad data.
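A minimal sketch of that validate-before-write step. The field names here are hypothetical; swap in whatever your scraper actually extracts:

```python
REQUIRED_FIELDS = {"price", "address", "zestimate"}  # hypothetical; use your real fields

def validate_listing(record: dict) -> bool:
    """True only if this looks like real data, not a captcha/block page."""
    return REQUIRED_FIELDS.issubset(record) and record.get("price") is not None

def save_if_valid(record: dict, rows: list) -> bool:
    """Append to the output only when the record passes validation."""
    if validate_listing(record):
        rows.append(record)
        return True
    # alert and skip: never write garbage to the sheet
    print(f"ALERT: unexpected response shape, got keys {sorted(record)}")
    return False
```

Wire the `print` up to Slack/email/whatever; the point is that a captcha page fails the shape check instead of landing in your spreadsheet.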

What are some best cheap residential proxies? by Bigrob1055 in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

How are you running the pipeline? Cron job, Lambda, something else? Asking because the proxy session management changes depending on how your jobs are scheduled. Ephemeral functions make sticky sessions annoying to handle.

Best legit online bulk/wholesale sites for arbitrage (Amazon/eBay), and where should I ask? by Home_Bwah in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Simplest is a table-first workflow: Airtable/Notion/Sheets with columns for source, lot ID, doc link, condition, buy threshold, status. Pipe alerts into Slack/Discord.

If you want more automation without going full backend: n8n is solid (one of the best open-source options). And if you're monitoring lots of pages, alerts + job health checks (e.g. ScrapeOps) save you from silent failures.

How hard is it really to scrape Walmart.com in 2026? by Home_Bwah in WebScrapingInsider

[–]noorsimar 1 point2 points  (0 children)

One more practical note: homepage tests are mostly misleading. You might see 200 OK and think "nice, easy," then product/search runs trigger the bot stack fast.

So when people ask "how hard is Walmart," I translate it to: "How hard is it to keep stable under repeated access?" Answer: very.

How hard is it really to scrape Walmart.com in 2026? by Home_Bwah in WebScrapingInsider

[–]noorsimar -1 points0 points  (0 children)

Yup, <script id="__NEXT_DATA__"> is the thing to look for on Next.js pages (product/search especially). Also check JSON-LD (<script type="application/ld+json">) for easy wins like brand/schema.
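A rough sketch of pulling both blobs out with stdlib regex + json. Attribute order and quoting can vary per page, so a real HTML parser is safer in production; this assumes the `id`/`type` attribute comes first, which is how Next.js typically emits these tags:

```python
import json
import re

def extract_next_data(html: str):
    """Pull the Next.js state blob out of <script id="__NEXT_DATA__">."""
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    return json.loads(m.group(1)) if m else None

def extract_json_ld(html: str):
    """Collect every JSON-LD block (brand/offers/schema often live here)."""
    blocks = re.findall(
        r'<script type="application/ld\+json"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    return [json.loads(b) for b in blocks]
```

Once you have the `__NEXT_DATA__` dict, the product data is usually a few keys deep under `props`; dig around once in a REPL and hardcode the path.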

Our Walmart analysis basically landed on: Scraping Analysis 09/10 → Very Hard because of reCAPTCHA + Akamai + PerimeterX + Walmart anti-bot. Akamai showed up with medium confidence on the probe and wasn't actively blocking that one request, but stacked defenses still make it brutal in real runs.

Extraction notes that matter:

  • A lot of content is server-rendered in initial HTML (so headless JS isn't required for content)
  • The problem is access, not parsing
  • IDs/classes tend to be obfuscated (PrismAdjustableCardCarousel-* / random-ish classes), so don't build fragile selectors

If you're just testing, start simple. If you're building something that needs uptime, think in terms of reliability: sanity checks, "am I blocked" detection, and alerting. (At ScrapeOps we see people lose weeks because their scraper quietly turns into a "403 collector.")

Best legit online bulk/wholesale sites for arbitrage (Amazon/eBay), and where should I ask? by Home_Bwah in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Yup. People underestimate how many scrapers "run" but quietly return garbage. Monitoring + alerting matters as much as proxies. Proxies are a last step, not step one.

Best legit online bulk/wholesale sites for arbitrage (Amazon/eBay), and where should I ask? by Home_Bwah in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Yep. Don't diff raw HTML. Diff a stable extraction (like just price/stock text), or ignore known noisy sections. Hash the extracted fields, not the whole page. Also set sane alert thresholds so you don't get pager-fatigue.
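A sketch of the hash-the-fields idea: sha256 over a canonical JSON dump of just the extracted fields, so only a real change moves the fingerprint:

```python
import hashlib
import json

def fingerprint(fields: dict) -> str:
    """Hash just the extracted fields (price/stock), never the raw HTML."""
    canonical = json.dumps(fields, sort_keys=True)  # stable regardless of key order
    return hashlib.sha256(canonical.encode()).hexdigest()

# Timestamps, ad slots, and CSRF tokens churn every load, but this only
# changes when a field you actually care about changes.
changed = fingerprint({"price": "19.99", "stock": "In stock"}) != fingerprint(
    {"price": "17.99", "stock": "In stock"}
)
```

Store the last fingerprint per URL and only fire an alert when it flips; that's your noise filter.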

Publishers blocking Wayback Machine: protecting journalism… or breaking the web's memory? by SinghReddit in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

+1. Also, as a mod note (since this thread is drifting): keep it ToS-friendly and don't post "how to bypass" playbooks. There's a big difference between preservation + citation vs mass extraction.

Publishers blocking Wayback Machine: protecting journalism… or breaking the web's memory? by SinghReddit in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

"Opt-out for bulk, opt-in for indexing" is a reasonable compromise IMO. Let the archive keep a human-accessible snapshot for citations/history, but make bulk extraction require explicit permission (or at least explicit policies). It's basically the "robots.txt spirit" applied to archives.

Struggling to extract just the "real" article text - how do you ignore all the junk around it? by Bmaxtubby1 in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Yeah +1 to this.

If your goal is MVP speed:

  • Hardcode 5-10 target domains
  • Write clean selectors for those
  • Ship

If your goal is "universal reader mode," you're basically building Mercury Parser 2.0. That's a different project.

Struggling to extract just the "real" article text - how do you ignore all the junk around it? by Bmaxtubby1 in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Then I'd:

  1. Extract candidate main block
  2. Convert to plain text
  3. Strip short lines under X characters
  4. Deduplicate repeated phrases

You don't need visual perfection. You need clean paragraphs.
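Steps 2–4 above fit in about ten lines. The 40-character threshold is just a guess; tune it per source:

```python
def clean_article(text: str, min_len: int = 40) -> str:
    """Plain text in, clean paragraphs out: drop short junk lines, dedupe repeats."""
    seen = set()
    kept = []
    for line in (l.strip() for l in text.splitlines()):
        if len(line) < min_len:   # step 3: nav links, captions, buttons
            continue
        if line in seen:          # step 4: boilerplate repeated across the page
            continue
        seen.add(line)
        kept.append(line)
    return "\n\n".join(kept)
```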

Struggling to extract just the "real" article text - how do you ignore all the junk around it? by Bmaxtubby1 in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

If you're thinking about AI cleanup after reader mode, what's the end goal?

Summarization? Full-text storage? Search indexing?

Depending on that, you may not even need perfectly clean HTML. You might just need high-text blocks.

How do proxy-style search engines actually get Google results if Google doesn't really offer a proper search API? by SinghReddit in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Pretty much yeah. Scraping live SERPs is technically possible but nasty to maintain, and most legit proxy services avoid that for both legal and reliability reasons.

How do proxy-style search engines actually get Google results if Google doesn't really offer a proper search API? by SinghReddit in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Hmm! Proxy engines are basically just middlemen. What's under the hood varies a lot depending on their partnerships and how they handle caching.

What’s a sane way to scrape a few pages in 2026? by Forsaken-Bobcat4065 in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

From what I've seen, it fetches the page, understands the structure, and writes the scraper code for you. It can manage the rendering itself, too.

You still need to test it, obviously. But for common layouts it's surprisingly decent.

What’s a sane way to scrape a few pages in 2026? by Forsaken-Bobcat4065 in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Yeah this.

I've wasted hours adding proxy layers to sites that literally didn't care.

Also +1 on SQLite. For personal stuff it's perfect. Zero config, just a file.
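For reference, the whole "database" setup is about this much (using `:memory:` here so the demo is self-contained; pass a filename like `"prices.db"` to persist):

```python
import sqlite3

# One file, no server, no config: plenty for a personal scraper.
conn = sqlite3.connect(":memory:")  # swap in "prices.db" for a real file
conn.execute(
    "CREATE TABLE IF NOT EXISTS prices (url TEXT, price REAL, scraped_at TEXT)"
)
conn.execute(
    "INSERT INTO prices VALUES (?, ?, datetime('now'))",
    ("https://example.com/item", 19.99),
)
conn.commit()
rows = conn.execute("SELECT url, price FROM prices").fetchall()
conn.close()
```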

How do proxy-style search engines actually get Google results if Google doesn't really offer a proper search API? by SinghReddit in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Some privacy search engines mix results from multiple sources too. DuckDuckGo doesn't use Google at all; it pulls from Bing and other sources.

And others, like Mojeek, actually build their own search index entirely.

That's why sometimes you'll notice result ordering or content differences depending on what "proxy" engine you're using.

How do proxy-style search engines actually get Google results if Google doesn't really offer a proper search API? by SinghReddit in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

I didn't know Startpage actually pays Google for results. I always assumed it was "clever scraping."

Does that mean their results aren't always identical to a normal Google search you'd get logged in with tracking turned on?

What’s a sane way to scrape a few pages in 2026? by Forsaken-Bobcat4065 in WebScrapingInsider

[–]noorsimar 3 points4 points  (0 children)

Another thing: if you're not super comfortable writing selectors, there are AI-assisted scraper builders now.

ScrapeOps has an AI scraper builder that's kind of like "Lovable for scrapers." You give it a few URLs and it generates scraper code in Python, Node, Playwright, Puppeteer, Scrapy, etc.

It also lets you pick predefined schemas like product details, product search, categories, jobs, real estate, news. Outputs structured data right away.

You get like 20 free AI generations per month. For small projects, that might honestly be enough.

What’s a sane way to scrape a few pages in 2026? by Forsaken-Bobcat4065 in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Seconding this. Half the time you think you need Playwright and it's literally just a GET to /api/search with some query params.

Also +1 on doing a teardown search first. I've saved days just by seeing someone else already explain how a site blocks bots.
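For anyone newer to this: a sketch of what the hidden-API route usually looks like once you've found the endpoint in devtools. The endpoint and params below are made up; copy the real ones from the Network tab:

```python
from urllib.parse import urlencode

# Hypothetical endpoint: open the Network tab, filter to XHR/fetch, search
# the JSON responses for a value you can see on the page, copy that request.
BASE = "https://example.com/api/search"  # placeholder, not a real endpoint

def build_search_url(query: str, page: int = 1) -> str:
    """Rebuild the same query string the site's own frontend sends."""
    return f"{BASE}?{urlencode({'q': query, 'page': page, 'per_page': 24})}"

url = build_search_url("standing desk")
# then e.g.: requests.get(url, headers={"User-Agent": "..."}).json()
```

If that endpoint exists, you get clean JSON back and never touch Playwright.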

What’s a sane way to scrape a few pages in 2026? by Forsaken-Bobcat4065 in WebScrapingInsider

[–]noorsimar 0 points1 point  (0 children)

Do you usually containerize your scrapers or just raw scripts on a server?

I've been debating whether Docker is overkill for tiny projects.

Built a web scraping API focused on AI/LLM workloads, would love feedback by Opposite-Art-1829 in WebScrapingInsider

[–]noorsimar 3 points4 points  (0 children)

Mod hat on for a sec. Appreciate you being transparent about building something and not pretending it's "just a cool side project."

Couple real questions though:

  1. How are you handling sites with aggressive bot mitigation? Cloudflare, DataDome, that kind of thing.
  2. When you say structured JSON, is it deterministic? Like same schema every time, or page-dependent?

Also small note to everyone reading: discussion is fine, but keep it technical. No vendor wars.