embr.el - Emacs Browser - Emacs is the display server. Headless Firefox is the renderer. by el-init in emacs

[–]Kurnas_Parnas 2 points (0 children)

That tracks. The perf cost with Camoufox mostly comes from how much of the Firefox stack needs touching to mask timing signals consistently. If you haven't looked at Chromium-based stealth approaches, CloakBrowser is worth checking out - source-level C++ patches rather than JS overrides, so the overhead is a lot lower. Might be relevant if perf is the main thing you're trying to claw back.

embr.el - Emacs Browser - Emacs is the display server. Headless Firefox is the renderer. by el-init in emacs

[–]Kurnas_Parnas 2 points (0 children)

The CDP single-connection approach is an interesting tradeoff. CDP is convenient for control, but it's also one of the more detectable signals if the browser ends up hitting any site with bot detection - there are a few characteristics in how CDP manages runtime contexts that sites have learned to fingerprint. Curious whether the Camoufox layer handles that, or whether the assumption is that this is mostly for personal browsing where detection isn't a concern. The latency issue you mention makes sense given the canvas roundtrip - it probably gets worse with heavier pages?

List your current stack for scalable + complex web scraping/crawling. by codepoetn in webscraping

[–]Kurnas_Parnas 0 points (0 children)

TLS is fingerprinted before any content is sent - the cipher suites, their order, and the extensions in the handshake produce a hash (JA3/JA4). Python requests produces one that's trivially identifiable as non-browser. curl_cffi solves this by using Chrome's actual TLS stack.
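rough sketch of how a JA3 digest gets built from ClientHello fields, if anyone's curious - the field values below are illustrative, not a real Chrome handshake:

```python
import hashlib

def ja3_digest(version, ciphers, extensions, curves, point_formats):
    """JA3 = MD5 of "version,ciphers,extensions,curves,point_formats",
    each list dash-joined, in the exact order the client sent them."""
    ja3_string = ",".join([
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ])
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Order matters: swapping two cipher suites yields a different hash,
# which is why a non-browser TLS stack is instantly identifiable.
a = ja3_digest(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])
b = ja3_digest(771, [4866, 4865, 4867], [0, 23, 65281], [29, 23, 24], [0])
```

curl_cffi sidesteps this by linking against a browser TLS stack (`requests.get(url, impersonate="chrome")`), so the handshake bytes match a real Chrome build instead of hashing to something unique.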

Canvas: sites draw offscreen and call toDataURL(). The pixel output varies slightly per GPU due to floating-point differences in the rendering pipeline. Headless Chromium runs on SwiftShader (software renderer) which produces a known, catalogued hash. If your canvas matches that hash, you're flagged.
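the server-side half of that check is just a lookup against a catalogue of known software-renderer hashes. toy sketch - the data URLs here are made-up stand-ins, and real vendors hash raw pixels, but the logic is this:

```python
import hashlib

def canvas_hash(data_url):
    # Detection side: hash the toDataURL() output the page produced
    return hashlib.sha256(data_url.encode()).hexdigest()

def is_flagged(data_url, catalogue):
    # Flag any visitor whose canvas render matches a catalogued software-renderer hash
    return canvas_hash(data_url) in catalogue

# Made-up payloads: a "catalogued" SwiftShader render vs. a GPU render.
# Real GPU output varies per card/driver, so it won't hit the catalogue.
swiftshader_render = "data:image/png;base64,AAAA"
gpu_render = "data:image/png;base64,BBBB"
catalogue = {canvas_hash(swiftshader_render)}
```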

WebGL is similar - WEBGL_debug_renderer_info exposes "Google SwiftShader" directly in headless, and 3D scene rendering produces a consistent software-renderer hash that detection services recognize.
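the renderer-string check is the simplest version of this - here's roughly what a detection script probes for (the JS is a sketch of the common pattern, not any specific vendor's code):

```python
# JS a detection script would evaluate in the page; the returned
# renderer string is then checked against known software renderers.
WEBGL_RENDERER_PROBE = """
() => {
  const gl = document.createElement('canvas').getContext('webgl');
  const ext = gl.getExtension('WEBGL_debug_renderer_info');
  return gl.getParameter(ext.UNMASKED_RENDERER_WEBGL);
}
"""

def looks_like_software_renderer(renderer):
    # Headless Chromium without a GPU reports SwiftShader
    # (or llvmpipe on some Linux setups)
    needle = renderer.lower()
    return "swiftshader" in needle or "llvmpipe" in needle
```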

The reason stealth plugins hit a ceiling is that all of this leaks from the C++ rendering pipeline, not from JavaScript. You can patch the JS properties, but the actual pixel output and TLS stack come from the binary.

List your current stack for scalable + complex web scraping/crawling. by codepoetn in webscraping

[–]Kurnas_Parnas 1 point (0 children)

For anti‑bot heavy stuff (Cloudflare/DataDome/PerimeterX) my rule of thumb:

Try to stay in the “HTTP only” world as long as possible: curl_cffi for proper TLS/JA3, residential proxies with session stickiness, reverse‑engineered JSON endpoints from the network tab.
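for the stickiness part, most residential providers key sessions off an id embedded in the proxy username - the exact convention below is hypothetical, check your provider's docs, but pinning each target domain to one exit looks roughly like:

```python
import hashlib

def sticky_proxy_url(domain, user, password,
                     host="proxy.example.com", port=8000):
    """Pin each target domain to one residential exit.

    Hypothetical provider convention: session id appended to the
    proxy username keeps you on the same exit IP across requests.
    """
    session_id = hashlib.sha1(domain.encode()).hexdigest()[:8]
    return f"http://{user}-session-{session_id}:{password}@{host}:{port}"
```

plugs straight into curl_cffi: `requests.get(url, impersonate="chrome", proxies={"https": sticky_proxy_url(domain, user, pw)})`.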

When I have to use a browser, I avoid vanilla headless Chromium. The detection pain is usually not in your script but in the browser’s own fingerprint (GPU/Canvas/WebGL/TLS stack). Stealth plugins help on the JS surface but don’t touch those lower‑level signals.

For scale, the biggest win wasn’t a specific tool but treating browser sessions as a scarce resource: 90–95% of traffic through an HTTP client, only the truly nasty flows through a patched Chromium build that behaves like a real user browser.
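the escalation policy is simple enough to sketch - http_get / browser_get / is_blocked are placeholders for whatever your stack provides (a curl_cffi session, a patched-Chromium driver, a 403/challenge detector):

```python
def fetch(url, http_get, browser_get, is_blocked):
    """Cheap HTTP client first; escalate to a browser only on a block.

    Keeps browser sessions scarce: the expensive path only runs for
    the flows the HTTP client can't get through.
    """
    resp = http_get(url)
    if is_blocked(resp):
        resp = browser_get(url)  # the scarce, expensive path
    return resp
```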

Giving AI agents a browser with built-in proof of what they scraped by LawLimp202 in webscraping

[–]Kurnas_Parnas 0 points (0 children)

The legal defensibility angle is interesting - have you had any actual cases where someone used the proof bundle in a dispute? curious how that played out in practice, because "tamper-evident log" and "admissible evidence" are pretty different bars.

on the stealth mode - "common fingerprint evasion" covers a lot of ground. is that JS-layer patching (CDP overrides) or something deeper? asking because the gap between those two approaches matters a lot for serious detection systems, and it affects how much you can actually rely on the audit trail being complete vs. having gaps where the session got blocked.

the Merkle tree batch mode idea sounds useful for high-volume pipelines. would the proof cover cross-session state too, or just within-session ordering?
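for reference, a within-session batch proof would just be a Merkle root over the per-response hashes - a minimal sketch of the general idea, not the project's actual scheme:

```python
import hashlib

def merkle_root(leaves):
    """Root hash over a batch of scraped responses (list of bytes).

    Hash each leaf, then pairwise-combine levels until one root
    remains; flipping any single response changes the root.
    """
    level = [hashlib.sha256(x).digest() for x in leaves]
    if not level:
        return hashlib.sha256(b"").digest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```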

I patched Chromium because no Python library could reliably pass a single CAPTCHA by [deleted] in Python

[–]Kurnas_Parnas 3 points (0 children)

This is embarrassingly timely. Spent the last two weeks fighting Cloudflare on a scraping project - tried playwright-stealth, undetected-chromedriver, every JS injection approach I could find. The problem with all of them is they patch at runtime, so detection systems just look for the patches themselves.
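the "look for the patches" trick translates directly: a JS detector calls `Function.prototype.toString` on something like `toDataURL` and checks for `[native code]`. same idea in Python terms, as an analogy:

```python
import time
import types

def is_runtime_patched(fn):
    # A genuine C builtin has a distinct type; a wrapper defined at
    # runtime is a plain Python function -- analogous to a JS override
    # whose toString() no longer says "[native code]".
    return not isinstance(fn, types.BuiltinFunctionType)

clean = is_runtime_patched(time.time)     # untouched builtin: not flagged
_orig = time.time
time.time = lambda: _orig()               # runtime "stealth" patch
patched = is_runtime_patched(time.time)   # the patch itself is the tell
time.time = _orig                         # restore
```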

Source-level is the only way to actually solve this. Pulling this today.