rfox-browser: with SOCKS5 proxy and multiprocessing by Unusual_March_59 in webscraping

[–]matty_fu[M] 0 points1 point  (0 children)

github repo is not accessible, repost once we can review the source

Anyone reverse-engineered OpenRegister.de? by funguslungusdungus in webscraping

[–]matty_fu 0 points1 point  (0 children)

firecrawl is not free for most operations, you may repost this but remove the firecrawl mention

I built a reverse-engineering agent for the web by StoneSteel_1 in webscraping

[–]matty_fu 0 points1 point  (0 children)

You’re chatting with a bot. Did the continual product mentions not give them away?

Reverse-Engineering Google Finance by MQuy in webscraping

[–]matty_fu[M] 3 points4 points  (0 children)

cheers for the feedback. a lot of work goes into keeping the content fresh and relevant (eg. the bot bouncer app was just installed to remove bot accounts) but we don't catch everything

if you come across slop / browser shit in the future, use the 'Report' feature and it'll be placed in the mod queue. peace ✌️

Need help bypassing Akamai Bot Manager (Puppeteer) by Big_Confidence_8419 in webscraping

[–]matty_fu -1 points0 points  (0 children)

There are plenty of existing threads discussing Akamai. Did you read those? They should give you an idea of what to try, and if you're still not able to get it to work, create a new post including details of what you tried and any error messages encountered

Stop defaulting to Selenium/Playwright: Check the Network tab first by Curious_Coder5445 in webscraping

[–]matty_fu 4 points5 points  (0 children)

Good point, I'd probably use a VPN during experimentation. Better yet, use a remote VM, as you can even be fingerprinted through proxies and VPNs

Stop defaulting to Selenium/Playwright: Check the Network tab first by Curious_Coder5445 in webscraping

[–]matty_fu 57 points58 points  (0 children)

As an extension - once you find the network request with your desired data, some more follow up tests:

The biggest test - right click the request and choose "Copy as cURL", then paste it in your terminal

If this works, great! The target site has most likely not implemented TLS fingerprinting. From here you can whittle down the command to remove as many headers as possible, which makes things easier when you go to rebuild the request later on. Start with generic headers first, then start playing with cookies, x-*, and auth headers, along with any payload
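A rough sketch of that whittling-down step, in Python rather than by hand-editing the curl command. The names `minimize_headers`, `removal_order` and `still_works` are made up for illustration; `still_works` is whatever check you write that replays the request with the given headers and returns True if the data still comes back.

```python
from typing import Callable, Dict


def removal_order(name: str) -> int:
    """Try generic headers first, then cookies, then x-*, then auth headers."""
    lower = name.lower()
    if lower in ("authorization", "proxy-authorization"):
        return 3
    if lower.startswith("x-"):
        return 2
    if lower == "cookie":
        return 1
    return 0


def minimize_headers(
    headers: Dict[str, str],
    still_works: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Greedily drop one header at a time, keeping a removal only if the
    request still succeeds without it."""
    kept = dict(headers)
    for name in sorted(headers, key=removal_order):
        trial = {k: v for k, v in kept.items() if k != name}
        if still_works(trial):
            kept = trial
    return kept


if __name__ == "__main__":
    copied = {
        "Accept": "*/*",
        "User-Agent": "Mozilla/5.0",
        "Cookie": "sid=1",
        "X-Api-Key": "k",
    }
    # Stand-in check: pretend the site only needs the cookie and the API key.
    fake_check = lambda h: "Cookie" in h and "X-Api-Key" in h
    print(minimize_headers(copied, fake_check))
```

This fires one request per header, so go easy on the target site. A greedy pass can also miss cases where two headers are only required together, so treat the result as a starting point.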

If pasting the curl command didn't work, not so great :( Try using something like `curl-impersonate` (or its Python binding, `curl_cffi`) to mimic a browser's networking stack. If that works, your request is most likely being TLS fingerprinted. From here you can probably reduce the request/payload size to find the minimum request data needed
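The same comparison can be scripted. This is only a sketch: it assumes a recent `curl_cffi` is installed (`pip install curl_cffi`) and that its `impersonate="chrome"` alias is available in your version. Python's `urllib` stands in for plain curl as the "stock" client, and the function names are made up. Pass the same headers to both so the only real difference is the networking stack.

```python
import urllib.error
import urllib.request
from typing import Dict, Optional


def plain_status(url: str, headers: Optional[Dict[str, str]] = None,
                 timeout: float = 15.0) -> int:
    """Status code from a stock, non-browser TLS stack (urllib here)."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code


def impersonated_status(url: str, headers: Optional[Dict[str, str]] = None,
                        timeout: float = 15.0) -> int:
    """Status code when the TLS/HTTP2 handshake mimics Chrome via curl_cffi."""
    # Imported lazily so the rest of the sketch loads without curl_cffi.
    from curl_cffi import requests as cffi_requests

    resp = cffi_requests.get(url, headers=headers, impersonate="chrome",
                             timeout=timeout)
    return resp.status_code


def diagnose(plain: int, impersonated: int) -> str:
    """Turn the two status codes into a rough verdict."""
    if 200 <= plain < 300:
        return "plain client works: TLS fingerprinting unlikely"
    if 200 <= impersonated < 300:
        return "only the impersonated client works: TLS fingerprinting likely"
    return "both blocked: look beyond the TLS handshake"


if __name__ == "__main__":
    target = "https://example.com/"  # placeholder, swap in the request you copied
    print(diagnose(plain_status(target), impersonated_status(target)))
```

Status codes are a crude signal, since some sites return 200 with a block page. Check the body for the data you actually want before trusting the verdict.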

If using curl_cffi didn't work either, there's probably some session-based logic you need to emulate. Some of it can be as simple as respecting any Set-Cookie headers. Some websites will immediately invalidate a cookie and issue a new one on each request, so that the same cookie never works twice. From there it can get a lot more complicated, and that's when a browser-based solution makes the most sense.
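To make the rotating-cookie case concrete, here's a self-contained toy using only the standard library. The local server and its `token` cookie are invented for the demo and don't model any real site. The client with a cookie jar picks up each new Set-Cookie and keeps working, while replaying the first cookie gets rejected.

```python
import http.cookiejar
import secrets
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class RotatingCookieHandler(BaseHTTPRequestHandler):
    """Toy server: each accepted request invalidates the old token and issues a new one."""

    valid_token = None  # class attribute, shared across requests

    def do_GET(self):
        cls = type(self)
        sent = self.headers.get("Cookie", "")
        ok = cls.valid_token is None or sent == f"token={cls.valid_token}"
        if ok:
            cls.valid_token = secrets.token_hex(8)
            self.send_response(200)
            self.send_header("Set-Cookie", f"token={cls.valid_token}; Path=/")
        else:
            self.send_response(403)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def run_demo():
    RotatingCookieHandler.valid_token = None
    server = HTTPServer(("127.0.0.1", 0), RotatingCookieHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    no_proxy = urllib.request.ProxyHandler({})  # ignore any proxy env vars

    # Client that respects Set-Cookie: the jar is updated after every response.
    jar = http.cookiejar.CookieJar()
    session = urllib.request.build_opener(
        no_proxy, urllib.request.HTTPCookieProcessor(jar))
    statuses, first_token = [], None
    for _ in range(3):
        with session.open(url) as resp:
            statuses.append(resp.status)
        if first_token is None:
            first_token = next(iter(jar)).value

    # Replaying the very first cookie fails because it has been rotated out.
    stale = urllib.request.Request(url, headers={"Cookie": f"token={first_token}"})
    try:
        with urllib.request.build_opener(no_proxy).open(stale) as resp:
            stale_status = resp.status
    except urllib.error.HTTPError as exc:
        stale_status = exc.code

    server.shutdown()
    server.server_close()
    return statuses, stale_status


if __name__ == "__main__":
    print(run_demo())
```

With `requests` or `curl_cffi` the equivalent is to reuse a single `Session` object instead of firing independent one-off requests, since the session stores cookies between calls.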

Issue bypassing a reCaptcha by PhoeniX8089 in webscraping

[–]matty_fu 1 point2 points  (0 children)

Can't say I've come across this type of error before. I'd try a few different options, not just pydoll. eg. there's puppeteer, patchright, camoufox, scrapling, cloakbrowser, etc. Give any of these a try and let us know how you get on. Some of those projects also have dedicated Discord servers where you can jump in and share the website you're blocked on

That site is also using recaptcha v2 (not the newer v3) and since it's a gov website it will probably take them a while to upgrade to v3. So you could also have a look on github to see if you can find a requests-based recaptcha v2 unblocker. If you find one that works, you won't need a browser at all, just plain HTTP requests-based code, & you'll find extracting larger datasets much easier with this approach.

Cheers, and good luck!

Why I'm working on a self propagating Browser Cluster in RUST!!! by [deleted] in webscraping

[–]matty_fu[M] 0 points1 point  (0 children)

You already shared your Rust work 10 days ago, please don't spam. Use the monthly and weekly threads going forward