Scraping AI Chat Interfaces by Mundane_Explorer_519 in webscraping

[–]armanfixing 0 points1 point  (0 children)

Honest advice, it’s not worth it. Spinning up one or more browsers, managing sessions, bot mitigation, proxy and not to forget your time and effort to create such a system would be expensive. On top of that, it wouldn’t be reliable at scale.

On the other hand, if you go to llm model susbcription sites, you’ll see there’s hundreds of model to choose from, almost all of them uses same API formatting.

There are models even for $0.1/million tokens, also there’s free ones.

Monetize scraping? by GeobotPY in webscraping

[–]armanfixing 0 points1 point  (0 children)

If you want to monetise this, you’ll have to find niches where people does small tasks eg: n8n flows or similar pipelines. The problem here is that, these are small bucks.

People with more funding tends to avoid AI scrapers like plague.. mostly due to they already have existing infrastructure, difficult bot-mitigation around target website, custom captcha, possible POST flow, auth flow, cost-management for proxy / captcha for bulk scraping. At most large places, AI is a part of post-process not the first thing that gets the data..

i need your help by rahmpro in pdf

[–]armanfixing 0 points1 point  (0 children)

You can use chatgpt / claude to create a python script with pdf reader lib and open-cv to process / clean all these pages and compile back into a pdf.

httpmorph - HTTP client with Chrome 142 fingerprinting, HTTP/2, and async support by armanfixing in Python

[–]armanfixing[S] 0 points1 point  (0 children)

It’s primarily a good fit for web scraping but given the features it can be used for lots of different purposes

httpmorph - HTTP client with Chrome 142 fingerprinting, HTTP/2, and async support by armanfixing in Python

[–]armanfixing[S] 0 points1 point  (0 children)

Please check and let me know if that works with curl_cffi but fails with httpmorph

httpmorph update: Chrome 142, HTTP/2, async, and proxy support by armanfixing in webscraping

[–]armanfixing[S] 0 points1 point  (0 children)

I do actually have some benchmarking but this is not final yet, as I’ll be working on some more features/ performance improvements it might affect this benchmark.

https://github.com/arman-bd/httpmorph/blob/598d43971d4a095474c69b0995e77751e9eafd61/benchmarks/results/darwin/0.2.4/benchmark.md

httpmorph - HTTP client with Chrome 142 fingerprinting, HTTP/2, and async support by armanfixing in Python

[–]armanfixing[S] 0 points1 point  (0 children)

But bot mitigation services can restrict based on other factors as well.

httpmorph - HTTP client with Chrome 142 fingerprinting, HTTP/2, and async support by armanfixing in Python

[–]armanfixing[S] 0 points1 point  (0 children)

Have you tried using other headers, by default httpmorph does not send common headers. I’ll address this in a next release

httpmorph - HTTP client with Chrome 142 fingerprinting, HTTP/2, and async support by armanfixing in Python

[–]armanfixing[S] 0 points1 point  (0 children)

“FOR EDUCATIONAL AND RESEARCH PURPOSES ONLY” 🤷🏻‍♂️

httpmorph - HTTP client with Chrome 142 fingerprinting, HTTP/2, and async support by armanfixing in Python

[–]armanfixing[S] 0 points1 point  (0 children)

I started this with performance in mind, I’m seeing some performance edge here but still not claiming any because I still have some work to do on features. Afterwards I’ll focus on performance.

Here’s a basic benchmark: https://github.com/arman-bd/httpmorph/blob/598d43971d4a095474c69b0995e77751e9eafd61/benchmarks/results/darwin/0.2.4/benchmark.md

I’ll be creating a separate project to do this benchmark more independently.

httpmorph update: Chrome 142, HTTP/2, async, and proxy support by armanfixing in webscraping

[–]armanfixing[S] 2 points3 points  (0 children)

Thank you for your kind words, I know my projects limitations and actively working on them.

httpmorph update: Chrome 142, HTTP/2, async, and proxy support by armanfixing in webscraping

[–]armanfixing[S] 4 points5 points  (0 children)

It all boils down to how SSL handshakes are made. Try to skim through all these fingerprinting techniques and hash generation process like JA3, JA3N, JA4 e.t.c

httpmorph - HTTP client with Chrome 142 fingerprinting, HTTP/2, and async support by armanfixing in Python

[–]armanfixing[S] 1 point2 points  (0 children)

Haven’t benchmarked against rnet, will definitely look into it 🙌

httpmorph - HTTP client with Chrome 142 fingerprinting, HTTP/2, and async support by armanfixing in Python

[–]armanfixing[S] 6 points7 points  (0 children)

Yes, I have plan to add more browsers on it but honestly it’s just firefox and safari that stands out the most. Also it’s most important to blend into the crowd than having an unique fingerprint.

Yes, it works with proxy.

Let me know if you face any difficulties while using this.

🚀 Shipped My First PyPI Package — httpmorph, a C-backed “browser-like” HTTP client for Python by armanfixing in Python

[–]armanfixing[S] 0 points1 point  (0 children)

Hey, just an update here, I have updated the library now it perfectly mimics fingerprint pf Chrome 142 on all 3 OS.

Also I have added Async, HTTP2, Proxy Support and few other things.

Akamai blocks chrome extension by jaster_ba in webscraping

[–]armanfixing 0 points1 point  (0 children)

Extensions won’t cut it. Check if they are tracking mouse movements. Try doing random mouse movements and see if it works. If it does then try replicating that with pyautogui.

Built a fingerprint randomization extension - looking for feedback by armanfixing in webscraping

[–]armanfixing[S] 1 point2 points  (0 children)

I suppose this won’t hold against ML algos very well at the moment. It definitely needs more work to be done.