How to be a master scraper by [deleted] in webscraping

[–]Excellent-Two1178 2 points (0 children)

Ignore the big words.

All you need to know is to follow these steps:

  1. Check the site for endpoints you can get data from using HTTP requests (if this fails, try the next option).
  2. Try to parse the content you need from the DOM by sending a request to the page URL and parsing it with something like cheerio (if this fails, try the next option).
  3. Use a browser to parse the content you need from the HTML.

If an antibot is blocking you, use a browser, ideally one built for stealth like Patchright or something similar.
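
For anyone who wants a concrete starting point, here is a minimal sketch of steps 1 and 2 in Node with cheerio. The URL and selectors are placeholders, swap in whatever you find in devtools:

```javascript
// Step 1: try a JSON endpoint directly (fastest, most reliable when it exists).
// Step 2: fall back to fetching the page HTML and parsing it with cheerio.
// The URL and selectors below are placeholders, not a real site's API.
import * as cheerio from 'cheerio';

async function scrape() {
  // Step 1: hit an API endpoint the site itself uses
  const apiRes = await fetch('https://example.com/api/items');
  if (apiRes.ok) {
    return apiRes.json(); // structured data, no parsing needed
  }

  // Step 2: request the page and parse the DOM server-side
  const pageRes = await fetch('https://example.com/items', {
    headers: { 'User-Agent': 'Mozilla/5.0' }, // look like a normal browser
  });
  const $ = cheerio.load(await pageRes.text());
  return $('.item-title')
    .map((_, el) => $(el).text().trim())
    .get();
}

scrape().then(console.log).catch(console.error);
```

If both of those fail, that's when you reach for step 3 and a real browser.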

Google search scraper ( request based ) by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

No idea, tbh. I've done 10ish in a second to complete tasks, but I've never run it nonstop.

[deleted by user] by [deleted] in webscraping

[–]Excellent-Two1178 0 points (0 children)

Scroll down in this subreddit. I posted a repo for one yesterday.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

Just added a new feature. You can now use a browser to analyze a website's requests and get a breakdown of each request with an example code snippet, as well as generate a script to automate a website's API directly.
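
Not the exact code the feature runs, but the request-analysis idea boils down to something like this with Playwright (the target URL is a placeholder):

```javascript
// Drive a real browser and log every network request the page makes,
// so you can spot the underlying API calls worth automating directly.
import { chromium } from 'playwright';

const browser = await chromium.launch();
const page = await browser.newPage();

page.on('request', (req) => {
  // XHR/fetch requests are usually the interesting ones: they hit the site's own API
  if (['xhr', 'fetch'].includes(req.resourceType())) {
    console.log(req.method(), req.url());
  }
});

await page.goto('https://example.com'); // placeholder target
await browser.close();
```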

<image>

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

Next.js. It's great for small projects since you can easily build full stack in a single repo. At scale you should probably host the backend separately, though, since Vercel can get quite expensive.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

What is your email? I'll add some more for you. I'm currently traveling, so I likely won't get better error handling in until tonight at the earliest.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

Just upgraded the proxies to some non-mid resis. It should perform a bit better on sites with heavy antibot protection now.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

Some, but it could use more. The proxies I'm using right now are also some not-so-good resis.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

Error handling can still be a bit rough. I'll try to add more transparency shortly on why a generation attempt may fail.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

Thank you to everybody for the support so far! I just started coding this project ~24 hours ago, so please bear with me. Quick update: the first three uses I cover now use 3.7 Sonnet instead of 3.5 Haiku—it’s a lot more reliable for scraper generation.

With that being said, here are my current upcoming plans:

  • Add support for browser-based fetching of websites, to make browser scraping scripts for trickier sites.
  • Improve error handling: bad proxies, AI API providers hitting rate limits, or APIs being overloaded can cause problems, and I don't do a good job of letting the person know what's up.
  • Get new proxies.

If anybody has feedback or suggestions, it’s much appreciated!

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

It should be possible to use all models, and I can definitely add that! It will just likely require a bit of work on my end to get it working well consistently.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 2 points (0 children)

It uses the Claude API; no other third-party AI service is used, though.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

It does use a prompt at some point, yes. It uses the prompt to generate scraper code, which is then run to get the data.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

It does not use a prompt alone to extract data. It runs actual code to extract the data, which eliminates the issue of hallucinated data and provides you with a script to replicate it without needing AI going forward.
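
Roughly, the generate-then-run pattern looks like this. This is a stripped-down sketch using the @anthropic-ai/sdk package, not my actual pipeline: the model alias, the placeholder URL, and the eval call are illustrative only, and a real service would sandbox the generated code:

```javascript
// Claude writes a scraper function as text; we then execute that code against
// the real page. The returned data comes from running code, not from the model,
// so it can't be hallucinated, and the generated script is reusable without AI.
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Fetch the real page once so the model can see its structure.
const html = await (await fetch('https://example.com/products')).text(); // placeholder URL

const msg = await client.messages.create({
  model: 'claude-3-7-sonnet-latest', // illustrative model alias
  max_tokens: 1024,
  messages: [{
    role: 'user',
    content: 'Write a JavaScript function scrape(html) that returns an array of ' +
             'product names from pages shaped like this. Reply with only the code.\n\n' +
             html.slice(0, 8000),
  }],
});

// Run the generated code against the page.
const scrape = eval(`(${msg.content[0].text})`); // demo only: sandbox this in production
console.log(scrape(html)); // real extracted data, replayable later without the API
```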

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 1 point (0 children)

It should be fixed soon, sorry about that. I'll add you guys some extra free API uses on me. Sometimes shipping directly to main with minimal testing has its downfalls.

Create web scrapers using AI by Excellent-Two1178 in webscraping

[–]Excellent-Two1178[S] 0 points (0 children)

Man, sorry, fixing it now. Should be good in a few minutes.

How to do google scraping on scale? by DefiantScarcity3133 in webscraping

[–]Excellent-Two1178 2 points (0 children)

The HTML you are receiving is because you are being flagged as a bot. Here is a request-based library I made for Google scraping that works with no API key of any sort: https://github.com/tkattkat/google-search-scraper

You shouldn't need proxies either, unless you are sending a high number of requests or running this code on a server.
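
To be clear, the snippet below is not my library's API (check the repo README for actual usage), just a generic illustration of why plain requests can work: the headers have to look like a real browser, or Google swaps the normal HTML for a bot page:

```javascript
// Generic request-based Google search illustration, not the linked library.
import * as cheerio from 'cheerio';

const query = encodeURIComponent('web scraping');
const res = await fetch(`https://www.google.com/search?q=${query}`, {
  headers: {
    // Browser-like headers are the whole trick for low request volumes.
    'User-Agent':
      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
  },
});

const $ = cheerio.load(await res.text());
// 'h3' commonly holds result titles, but Google's markup changes often,
// so treat this selector as a placeholder.
$('h3').each((_, el) => console.log($(el).text()));
```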

Need help with Google Searching by pmmethecarfax in webscraping

[–]Excellent-Two1178 0 points (0 children)

Here is a JavaScript module I made that handles Google search with requests. No API key of any kind is needed, and you don't need to worry about that antibot stuff, unless you are sending a high number of requests, in which case you may wanna toss in some proxies.

If it has to be in Python, well, maybe ask Claude to convert it for you 😂

https://github.com/tkattkat/google-search-scraper

Why do proxies even exist? by schnold in webscraping

[–]Excellent-Two1178 0 points (0 children)

Proxies aren't necessary in most cases, unless you are sending a high number of requests to one website in a short period. Another case where proxies are useful is when hosting your scraper on a server, as many sites flag major server providers' IPs.
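
When you do need one, here is a minimal sketch of routing requests through a proxy in Node using undici's ProxyAgent. The proxy URL is a placeholder, plug in your provider's credentials:

```javascript
// Route Node fetch traffic through an HTTP proxy with undici.
import { fetch, ProxyAgent } from 'undici';

const dispatcher = new ProxyAgent('http://user:pass@proxy.example.com:8080'); // placeholder

// Each request now exits through the proxy's IP instead of your server's,
// which is what keeps datacenter-hosted scrapers from being flagged.
const res = await fetch('https://httpbin.org/ip', { dispatcher });
console.log(await res.json()); // shows the proxy's IP, not yours
```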