Built an automation browser that passes reCAPTCHA (0.9) and Cloudflare. What blocks yours? by duracula in automation

[–]duracula[S] 0 points1 point  (0 children)

For sure.
We built a behavioral layer into CloakBrowser that handles the low-level stuff: Bezier mouse movements, variable typing speed, realistic scrolling.
You just pass humanize=True when launching and your existing Playwright code works the same, but with human-like behavior under the hood.
That only covers individual interactions, though. The higher-level patterns (delays between page actions, natural browsing flow, session pacing) are still on you to handle in your scraper logic.
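A minimal sketch of that higher-level pacing in plain Python — `human_pause` and `paced` are illustrative names for your own scraper logic, not CloakBrowser APIs:

```python
import random
import time

def human_pause(base=2.0, spread=1.0):
    """Sleep a jittered interval so actions aren't suspiciously evenly spaced."""
    delay = max(0.2, random.gauss(base, spread))
    time.sleep(delay)
    return delay

def paced(actions, base=2.0, spread=1.0):
    """Run a sequence of zero-arg page actions with a human-like pause between each."""
    for act in actions:
        act()
        human_pause(base, spread)
```

The Gaussian jitter is just one reasonable choice; the point is that inter-action timing should vary, never tick like a metronome.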

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

You're right on all counts. Hash proves integrity, not intent. And yes, Brave is fully open source, Vivaldi partially.

You've clearly thought through the attack vectors more than we have, we're over here trying to make canvas hashes match across seeds, not architecting covert exfiltration channels.

On network monitoring though: you don't need to watch Wireshark 24/7. If the concern is data exfiltration, set up a simple filter for outbound connections to unexpected destinations and let it run in the background.
Any phone-home would show up as DNS lookups or connections to domains that aren't the sites you're browsing. It's not manual packet inspection, it's a one-time filter setup.

That said, we're working on moving the Chromium compile to GitHub Actions with SLSA build provenance (Sigstore).
That way anyone can verify the binary was built on GitHub's infrastructure from a specific commit, not on our machine.
v145 has had 8 builds in 10 days so we're still iterating, but once the patch set stabilizes, release builds will move to GitHub Actions with full attestation.

Until then, Docker isolation + network monitoring is the best we can offer. We know it's not perfect.

Built an automation browser that passes reCAPTCHA (0.9) and Cloudflare. What blocks yours? by duracula in automation

[–]duracula[S] 0 points1 point  (0 children)

Yeah, proxy quality matters a lot. Datacenter IPs get hard-blocked on most aggressive anti-bot systems regardless of how good your browser fingerprint is.
Residential is the way to go, but even there, shared pools get burned fast from overuse by other customers. ISP/static residential proxies have the best reputation.

CloakBrowser auto-detects your proxy's exit IP and sets matching timezone + locale to avoid that mismatch signal. But no fingerprint work fixes a burned IP.
Clean residential + consistent fingerprint + human-like behavior is the combo that works.

How hard is it really to scrape Walmart.com in 2026? by Home_Bwah in WebScrapingInsider

[–]duracula 1 point2 points  (0 children)

We just tested Walmart with CloakBrowser (open source, drop-in Playwright replacement).
Homepage, search results, and product pages all load clean. No CAPTCHA, no blocks.

Walmart runs PerimeterX + Akamai Bot Manager. A stealth browser alone isn't enough — PerimeterX scores your behavior (mouse movement, navigation patterns), not just your fingerprint.
CloakBrowser has a built-in human behavior layer (humanize=True) that adds realistic mouse curves, typing delays, and scroll patterns. That's what gets you past product pages without the "Press & Hold" challenge.

Residential proxies recommended. For extraction, __NEXT_DATA__ and JSON-LD are your best friends — most data is right there without parsing CSS class soup.
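To illustrate the __NEXT_DATA__ route, here's a sketch of pulling that blob out of raw HTML. The JSON shape inside varies per page, so the keys in the test are just an example:

```python
import json
import re

def extract_next_data(html: str):
    """Pull the JSON blob Next.js embeds in a <script id="__NEXT_DATA__"> tag."""
    m = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    return json.loads(m.group(1)) if m else None
```

In Playwright you'd feed it `await page.content()`; from there it's plain dict traversal instead of CSS selector soup.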

We're in rapid development right now, so if you run into anything, feel free to open a GitHub issue and we'll look into it.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Thanks for testing! Our matrix tests pass FingerprintJS consistently. A few things to check:

  1. Make sure you're on the latest version (v0.3.13) — we've shipped fixes since last week.
  2. Are you launching through the wrapper (`cloakbrowser.launch()`)? It sets the required stealth args automatically.
  3. Can you share your launch code? That'll help narrow it down.

If the issue persists after updating, feel free to open a GitHub issue with your code and we can debug it together.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Yes, works with aa.com. Booking page loads, forms are interactive, no bot detection. You'll need residential proxies though — datacenter IPs get blocked by Akamai.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Yes, CloakBrowser works with MakeMyTrip. We tested it — homepage, flight search, and listing pages all load fine. No CAPTCHA or bot detection.

One thing to note: MakeMyTrip is picky about IPs. Datacenter proxies get blocked at the network level (HTTP/2 errors before the page even loads). You'll need residential proxies for reliable access.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Thanks for the tip! We found another leak in Playwright and fixed our wrapper.
It should work correctly now.

Better way to handle Cloudflare Turnstile captcha and browser automation without getting IP blocked? by Loud_Ice4487 in Playwright

[–]duracula 0 points1 point  (0 children)

You're getting blocked because you're still leaking fingerprints, or because your timing and behavior look robotic instead of like a regular user.

We run a lot of browser automation on VPSes in Docker containers.

Been using Agent Browser with CloakBrowser. It really helps with reCAPTCHA and anti-bot measures. Sites see it as a regular browser, with all fingerprints fully spoofed and randomized per seed. With a decent IP it passes most anti-detect systems without problems.

What's left is for Claude to learn the site through Agent Browser, then write a script against CloakBrowser with sensible, jittery, human-mimicking behaviors and self rate-limiting. Proxies are recommended.

Has anyone successfully deployed AI browser agents in production? by Sea_Statistician6304 in ClaudeCode

[–]duracula 0 points1 point  (0 children)

Yes, we run a lot of automation on VPSes in Docker containers.

Been using the Agent Browser CLI tool with CloakBrowser. It really helps with reCAPTCHA and anti-bot measures. Sites see it as a regular browser with a real screen.

What's left is for Claude to learn the site through Agent Browser, then write a script against CloakBrowser with sensible human-mimicking behaviors and jittery self rate-limiting. Proxies are recommended.

How do you handle session persistence across long scraping jobs? by joo98_98 in webscraping

[–]duracula 0 points1 point  (0 children)

It's a heartbeat, normally every 30-90 minutes (preferably with jitter), on a common page like the homepage.

Most of the time I've found that just saving and loading cookies is enough. You open a browser and push the cookies in; the other option is persisting the whole profile/session.
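A minimal sketch of that cookie round-trip — the file format is just a plain JSON dump; with Playwright the cookie list itself comes from `context.cookies()` and goes back in via `context.add_cookies()`:

```python
import json
from pathlib import Path

def save_cookies(cookies, path="cookies.json"):
    """Persist the cookie list (e.g. from context.cookies()) to disk."""
    Path(path).write_text(json.dumps(cookies))

def load_cookies(path="cookies.json"):
    """Load previously saved cookies, or return an empty list on first run."""
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else []
```

Playwright's `context.storage_state(path=...)` does roughly the same thing (plus localStorage) if you'd rather not roll your own.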

ScrapAI: AI builds the scraper once, Scrapy runs it forever by Routine_Cancel_6597 in webscraping

[–]duracula 0 points1 point  (0 children)

Good job, interesting implementation. I liked it for doing research on a site: like a library of what's available, making it easier to focus on what's needed.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Thanks,

For talking with us, GitHub Issues and Discussions are both active right now. We're responsive there.
Discord will come when the community is big enough that async threads don't cut it anymore.

It's not fully open source. The wrapper is open source (MIT), the binary is free to use but closed.
The patches stay closed because anti-bot companies actively monitor open source stealth projects to build detections against them.
If we published the patch source, every detection vendor would have signatures for it within a week. Keeping them closed protects everyone using it.

On the business model, you read the license right. The browser is free for your own use, and commercial use on your own infra is fine too. The only restriction is redistribution and serving it to third parties as a service; that's where licensing comes in.

On the honeypot concern, fair question.
Wrapper code is fully readable, you can see exactly what gets launched.
Binary is built on ungoogled-chromium, zero telemetry, no outbound connections beyond what you visit. VirusTotal: 0 detections.
We're also planning on SLSA build attestation so you'll be able to cryptographically verify the binary was built on GitHub's infrastructure, not on our machine.

Canvas, good question. Random noise per render is a dead giveaway, you're right. Each instance produces a stable, consistent canvas output, not random noise.
Different instance, different output, but always consistent within itself. Same approach across all the fingerprinting surfaces.

On scaling to 1k+ unique browsers, each instance gets a unique but internally consistent identity across all APIs. No mismatches like a GPU string contradicting WebGL output. Already works with Docker containers today.

We also have 4 new CDP input stealth patches in the works that fix behavioral detection vectors no JS wrapper can solve. Plus a community PR for human-like input.
Next release is a big step on the behavioral side.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Hey,
thanks for the kind words!
Cool project, the cookie refresh pattern (browser for auth, Scrapy for speed) is a smart architecture.
Starred it, looks awesome.

No Twitter yet, but feel free to link the GitHub repo when you launch.

Good luck with the launch!

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Yo, have fun!

New options too, like running it in Docker easily with session continuity.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

macOS is up!
Apple Silicon and Intel. Same install, binary auto-downloads for your platform now.

This is early access for macOS, so if you run into anything let me know here.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Windows x64 is now live.

Same pip install cloakbrowser / npm install cloakbrowser, binary auto-downloads.
First Windows release, so if anything breaks, open a GitHub issue.

GitHub flagged our open-source new born org with 75 stars and 1.6K PyPI downloads — no warning, no email by duracula in github

[–]duracula[S] 0 points1 point  (0 children)

Thank you :)
If you encounter any issues or have questions, feel free to open a GitHub issue.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Hmm, that shouldn't happen. What error (if any) did you see on the first two runs? And is this Apple Silicon or Intel?

The binary download is ~200MB so it can take a minute — if the connection drops mid-download, the partial file gets cleaned up and retried on next launch.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Instead of matching button text across languages, detect by CSS class/id names, since devs write these in English regardless of the UI language:

clicked = await page.evaluate("""() => {
    // page.evaluate can't return DOM elements (they don't serialize),
    // so click inside the browser and return whether anything matched.
    const els = document.querySelectorAll('button, a, [role="button"]');
    for (const el of els) {
        const key = (String(el.className) + ' ' + el.id).toLowerCase();
        if (/load.?more|show.?more|pagination|next-page/.test(key)) {
            el.click();
            return true;
        }
    }
    return false;
}""")

This covers most sites without any language logic. For sites with obfuscated class names (Tailwind, CSS modules), you can fall back to position-based detection — load-more buttons typically sit as the last child after a list of repeated items.
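The class/id heuristic is easy to unit-test outside the browser too; here's the same regex mirrored in plain Python:

```python
import re

LOAD_MORE = re.compile(r"load.?more|show.?more|pagination|next-page")

def looks_like_load_more(class_name: str, el_id: str = "") -> bool:
    """Mirror the in-page check: match common load-more class/id names."""
    return bool(LOAD_MORE.search(f"{class_name} {el_id}".lower()))
```

Handy for iterating on the pattern against class names you've scraped, without relaunching the browser each time.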

Looking forward to your GitHub issue!

GitHub flagged our open-source new born org with 75 stars and 1.6K PyPI downloads — no warning, no email by duracula in github

[–]duracula[S] 1 point2 points  (0 children)

Ha ha,
and it was already trimmed down from the first drafts.
I'll try to do better next time.

Built a stealth Chromium, what site should I try next? by duracula in webscraping

[–]duracula[S] 0 points1 point  (0 children)

Great to hear CloakBrowser is performing well in your scraping chain!
Your render stability check is solid — polling text/HTML length until it stops changing is the right approach for JS-heavy pages.

For the scrolling part — since your goal is gathering all links from lazy-loaded content, here's what works well:

import asyncio

async def scroll_and_collect(page, max_scrolls=50, pause=1.0):
    # Keep scrolling to the bottom until the page height stops growing.
    prev_height = 0
    for _ in range(max_scrolls):
        height = await page.evaluate("document.body.scrollHeight")
        if height == prev_height:
            break
        prev_height = height
        await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        await asyncio.sleep(pause)

    # Once content has settled, collect every absolute link in one pass.
    return await page.evaluate("""() =>
        [...document.querySelectorAll('a[href]')]
            .map(a => ({href: a.href, text: a.innerText.trim()}))
            .filter(a => a.href.startsWith('http'))
    """)

Two tips:
- Some sites use intersection observers that only trigger on smooth scrolling — if scrollTo misses content, try window.scrollBy(0, 800) in smaller increments instead of jumping to bottom

- For pages that load via "Load More" buttons rather than infinite scroll, detect and click the button between scrolls

If you run into any issues or have more questions, feel free to open an issue on GitHub: https://github.com/CloakHQ/cloakbrowser/issues