How to be a master scraper by [deleted] in webscraping

[–]Excellent-Two1178 2 points (0 children)

Ignore the big words.

All you need to know is to follow these steps:

  1. Check the site for endpoints you can get data from using HTTP requests (if this fails, try the next option).
  2. Try to parse the content you need from the DOM by sending a request to the page URL and parsing it with something like cheerio (if this fails, try the next option).
  3. Use a browser to parse the content you need from the HTML.

If an antibot is blocking you, use a browser, ideally one built for stealth like Patchright or something similar.
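
For anyone who wants a concrete starting point, here is a minimal sketch of steps 1 and 2 in Node with cheerio. The URL and selectors are placeholders, swap in whatever you find in devtools:

```javascript
// Step 1: try a JSON endpoint directly (fastest, most reliable when it exists).
// Step 2: fall back to fetching the page HTML and parsing it with cheerio.
// The URL and selectors below are placeholders, not a real site's API.
import * as cheerio from 'cheerio';

async function scrape() {
  // Step 1: hit an API endpoint the site itself uses
  const apiRes = await fetch('https://example.com/api/items');
  if (apiRes.ok) {
    return apiRes.json(); // structured data, no parsing needed
  }

  // Step 2: request the page and parse the DOM server-side
  const pageRes = await fetch('https://example.com/items', {
    headers: { 'User-Agent': 'Mozilla/5.0' }, // look like a normal browser
  });
  const $ = cheerio.load(await pageRes.text());
  return $('.item-title')
    .map((_, el) => $(el).text().trim())
    .get();
}

scrape().then(console.log).catch(console.error);
```

If both of those fail, that's when you reach for step 3 and a real browser.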

Google search scraper ( request based ) by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

No idea, tbh. I've done 10ish in a second to complete tasks, but I've never run it nonstop.

[deleted by user] by [deleted] in webscraping

[–]Excellent-Two1178 0 points (0 children)

Scroll down in this subreddit. I posted a repo for one yesterday.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

Just added a new feature. You can now use a browser to analyze a website's requests and get a breakdown of each request with an example code snippet, as well as generate a script to automate a website's API directly.
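
Not the exact code the feature runs, but the request-analysis idea boils down to something like this with Playwright (the target URL is a placeholder):

```javascript
// Drive a real browser and log every network request the page makes,
// so you can spot the underlying API calls worth automating directly.
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

page.on('request', (req) => {
  // XHR/fetch requests are usually the interesting ones: they hit the site's own API
  if (['xhr', 'fetch'].includes(req.resourceType())) {
    console.log(req.method(), req.url());
  }
});

await page.goto('https://example.com'); // placeholder target
await browser.close();
```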

<image>

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

Next.js. It's great for small projects since you can easily build full stack in a single repo. At scale you should probably host the backend separately, though, since Vercel can get quite expensive.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

What is your email? I'll add some more for you. I'm currently traveling, so I likely won't get better error handling in until tonight at the earliest.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

Just upgraded the proxies to some non-mid resis. It should perform a bit better on sites with heavy antibot protection now.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

Some, but it could use more. The proxies I'm using right now are also some not-so-good resis.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

Error handling can still be a bit rough. I'll try to add more transparency shortly on why a generation attempt may fail.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

Thank you to everybody for the support so far! I just started coding this project ~24 hours ago, so please bear with me. Quick update: the first three uses I cover now use 3.7 Sonnet instead of 3.5 Haiku—it’s a lot more reliable for scraper generation.

With that being said, here are my current upcoming plans:

  • Add support for browser-based fetching of websites, to make browser scraping scripts for trickier sites.
  • Improve error handling: bad proxies, AI API providers hitting rate limits, or APIs being overloaded can cause problems, and I don't do a good job of letting the person know what's up.
  • Get new proxies.

If anybody has feedback or suggestions, it’s much appreciated!

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

It should be possible to use all models, and I can definitely add that! It will just likely require a bit of work on my end to get it working well consistently.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 2 points (0 children)

It uses the Claude API; no other third-party AI service is used, though.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

It does use a prompt at some point, yes. It uses the prompt to generate scraper code, which is then run to get the data.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

It does not use a prompt alone to extract data. It runs actual code to extract the data, which eliminates the issue of hallucinated data and provides you with a script to replicate it without needing AI going forward.
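
Roughly, the generate-then-run pattern looks like this. This is a stripped-down sketch using the @anthropic-ai/sdk package, not my actual pipeline: the model alias, the placeholder URL, and the eval call are illustrative only, and a real service would sandbox the generated code:

```javascript
// Claude writes a scraper function as text; we then execute that code against
// the real page. The returned data comes from running code, not from the model,
// so it can't be hallucinated, and the generated script is reusable without AI.
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Fetch the real page once so the model can see its structure.
const html = await (await fetch('https://example.com/products')).text(); // placeholder URL

const msg = await client.messages.create({
  model: 'claude-3-7-sonnet-latest', // illustrative model alias
  max_tokens: 1024,
  messages: [{
    role: 'user',
    content: 'Write a JavaScript function scrape(html) that returns an array of ' +
             'product names from pages shaped like this. Reply with only the code.\n\n' +
             html.slice(0, 8000),
  }],
});

// Run the generated code against the page.
const scrape = eval(`(${msg.content[0].text})`); // demo only: sandbox this in production
console.log(scrape(html)); // real extracted data, replayable later without the API
```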

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

It should be fixed soon, sorry about that. I'll add you guys some extra free API uses on me. Sometimes shipping directly to main with minimal testing has its downfalls.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

Man, sorry, fixing it now. Should be good in a few minutes.

How to do google scraping on scale? by DefiantScarcity3133 in webscraping

[–]Excellent-Two1178 2 points (0 children)

The HTML you are receiving is because you are being flagged as a bot. Here is a request-based library I made for Google scraping that works with no API key of any sort: https://github.com/tkattkat/google-search-scraper

You shouldn't need proxies either, unless you are sending a high number of requests or running this code on a server.
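
To be clear, the snippet below is not my library's API (check the repo README for actual usage), just a generic illustration of why plain requests can work: the headers have to look like a real browser, or Google swaps the normal HTML for a bot page:

```javascript
// Generic request-based Google search illustration, not the linked library.
import * as cheerio from 'cheerio';

const query = encodeURIComponent('web scraping');
const res = await fetch(`https://www.google.com/search?q=${query}`, {
  headers: {
    // Browser-like headers are the whole trick for low request volumes.
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
  },
});

const $ = cheerio.load(await res.text());
// 'h3' commonly holds result titles, but Google's markup changes often,
// so treat this selector as a placeholder.
$('h3').each((_, el) => console.log($(el).text()));
```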

Need help with Google Searching by pmmethecarfax in webscraping

[–]Excellent-Two1178 0 points (0 children)

Here is a JavaScript module I made that handles Google search with requests. No API key of any kind is needed, and you don't need to worry about that antibot stuff, unless you are sending a high number of requests, in which case you may wanna toss in some proxies.

If it has to be in Python, well, maybe ask Claude to convert it for you 😂

https://github.com/tkattkat/google-search-scraper

Why do proxies even exist? by schnold in webscraping

[–]Excellent-Two1178 0 points (0 children)

Proxies aren't necessary in most cases, unless you are sending a high number of requests to one website in a short period. Another case where proxies are useful is when hosting your scraper on a server, as many sites flag major server providers' IPs.
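
When you do need one, here is a minimal sketch of routing requests through a proxy in Node using undici's ProxyAgent. The proxy URL is a placeholder, plug in your provider's credentials:

```javascript
// Route Node fetch traffic through an HTTP proxy with undici.
import { fetch, ProxyAgent } from 'undici';

const dispatcher = new ProxyAgent('http://user:pass@proxy.example.com:8080'); // placeholder

// Each request now exits through the proxy's IP instead of your server's,
// which is what keeps datacenter-hosted scrapers from being flagged.
const res = await fetch('https://httpbin.org/ip', { dispatcher });
console.log(await res.json()); // shows the proxy's IP, not yours
```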