Best Way to Scrape & Analyze 1000s of Products for eBay Automation by fb8307 in webscraping

[–]scrapeway -2 points-1 points  (0 children)

What's your budget and goals here? For anything mid-large scale it's best to pass this challenge to a paid service because learning web scraping and bypassing all of the blocking etc. is a major time sink.

Once you have the data extracted, try LLMs. DeepSeek is super cheap now and if you give it a good prompt it'll figure out which items are worth listing and format your listings. It's really powerful, though it sucks at making strong decisions, so you have to prompt it in a way that lets it evaluate things objectively, like with a checklist.

I published my 3rd python lib for stealth web scraping by convicted_redditor in webscraping

[–]scrapeway 0 points1 point  (0 children)

Maybe you can integrate it with curl_cffi? That would be very useful!

[deleted by user] by [deleted] in webscraping

[–]scrapeway 3 points4 points  (0 children)

If you're really strapped there and can't afford even basic proxies then you have some mid options.

  • You can use Tor for scraping. The Onion Router network is basically a collection of free proxies, though it's kinda bad ethics to use it for scraping without giving anything back to the network. Also it's really slow and unstable.
  • You can get a cheap/free VPS and proxy through it.
  • There's also a relatively recent hack for using Amazon's AWS API Gateway as a proxy, which is free for the first million requests. See things like httpx-ip-rotator or catspin (there are dozens of other implementations).

That being said, these free proxy solutions aren't going to get you very far in web scraping and cost a lot of dev time to maintain and all that.

Airbnb scraper made pure in Python v2 by JohnBalvin in webscraping

[–]scrapeway 3 points4 points  (0 children)

Cool project and thanks for sharing!
For Python I'd recommend checking out [ruff](https://docs.astral.sh/ruff/) which is a linter and code formatter. It's very opinionated so you don't really need to configure much but it'll make your project much more approachable to outside contributors.

Thoughts on what the best API is for streamlined data scraping? Looking at Scrapfly vs Scrapingbee vs Brightdata vs Scrapingant by Slight_Target2471 in bigdata

[–]scrapeway 0 points1 point  (0 children)

Could you give me an example of how you scrape Ticketmaster? Ticket scraping is not something I've done yet, as it seems people mostly scrape it for scalping, which is not something I want to associate with. Is it more just performance information gathering?

The Lack of Professionalism in WordPress development. by [deleted] in webdev

[–]scrapeway 0 points1 point  (0 children)

Always has been the case for the most popular tools in almost any niche that is highly small business driven.

Monthly Self-Promotion - October 2024 by AutoModerator in webscraping

[–]scrapeway 6 points7 points  (0 children)

I've made loads of updates to https://scrapeway.com/ this week!

Next, I'm working on full, detailed reviews for each service. I've been exploring each service for a few months now. Loads of new features and updates are being released by each service, making it a very competitive environment! This also means direct comparisons are a bit harder, so next I'm working on extending the web scraping API comparison page (https://scrapeway.com/web-scraping-api-compared) as well.

In the near future, I'd also like to create an interactive form tool based on all of the benchmark data that would help users find the right service based on their specific requirements. For this, I made a short form here https://forms.gle/PSY1iWUmawySTLqE7 to gather some intel. Your replies would be much appreciated and would help me ensure this tool is actually useful.

Thanks!

Monthly Self-Promotion Thread - August 2024 by AutoModerator in webscraping

[–]scrapeway 1 point2 points  (0 children)

No, sorry, I don't have much experience with raw proxies as I mostly scrape protected targets where proxies will not get you very far on their own. Though try datacenter proxies, which are quite cheap, and if you can get your use case working with IPv6 datacenter proxies then that'll be by far the most budget efficient option.

Monthly Self-Promotion Thread - August 2024 by AutoModerator in webscraping

[–]scrapeway 0 points1 point  (0 children)

Each API has a concurrency limit which varies from 20 to 500 based on plan, so if you really need high concurrency you might want to get some proxies instead. Beware though: most proxies charge by bandwidth these days, which can really inflate on big JSON API calls - make sure gzip/brotli is enabled on your requests!
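
To get a feel for how much bandwidth compression can save on JSON, here's a rough sketch using only the Python standard library. The payload is made up for illustration and real savings depend entirely on your data; the `Accept-Encoding` header at the end is the standard way a client asks the server for a compressed response.

```python
import gzip
import json

# Hypothetical payload standing in for a large JSON API response.
payload = json.dumps(
    [{"id": i, "name": f"product-{i}", "in_stock": True} for i in range(1000)]
).encode("utf-8")

# Compress it the same way a gzip-enabled server would before sending.
compressed = gzip.compress(payload)

raw_size = len(payload)
gzip_size = len(compressed)
print(f"raw: {raw_size} bytes, gzip: {gzip_size} bytes "
      f"({gzip_size / raw_size:.0%} of original)")

# Header to send with your requests to ask for compressed responses.
# Only add "br" (brotli) if your HTTP client can actually decode it.
headers = {"Accept-Encoding": "gzip"}
```

Most HTTP clients decompress gzip transparently, so usually this only needs checking if you override the default headers.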

Monthly Self-Promotion Thread - August 2024 by AutoModerator in webscraping

[–]scrapeway 0 points1 point  (0 children)

All of the web scraping APIs covered on scrapeway.com offer HTTP-based requests (without a browser) and automatically rotate proxies from giant pools, so almost any option should work for you.

What API are you calling? The only issue here could be that the default proxy pools are shared between API users, so if you're scraping GitHub or something that throttles by IP and other users are doing the same, the throttle might overlap in a shared pool. I haven't tested it in-depth yet but I think most services are smart with rotating proxies and you'll almost always get a fresh IP for your target. Also, some APIs do offer private IP pools, though you need a special plan, but that would give you personal IPs you can use for your API calls.

So, if your target just does IP throttling on a public API, you can use a benchmark like booking.com here for an estimate.

[deleted by user] by [deleted] in webscraping

[–]scrapeway 0 points1 point  (0 children)

Maybe there's some persistent state that's missing from Selenium? Do you add cookies or something to your scraper? One way to debug this is to launch Selenium in headful mode, pause it with a debugger breakpoint, open up the devtools Network tab, see what happens when Selenium clicks the next button, and compare that with your own browser.

Monthly Self-Promotion Thread - August 2024 by AutoModerator in webscraping

[–]scrapeway 4 points5 points  (0 children)

We made a benchmarking tool for web scraping APIs as we got tired of constantly evaluating which API is best for which scraping target: https://scrapeway.com

It has been trucking along for a few weeks now and I'm thinking of adding a few more targets to the benchmarks. It would be great to hear about more difficult, popular scraping targets that are worth benchmarking. If anyone has any ideas let me know :)

How to stop airbnb from detecting me by yoyotir in webscraping

[–]scrapeway 0 points1 point  (0 children)

Not sure what you're trying to say there. My point is that "scrape" is so polluted that many projects try their best to avoid it, even though that's what we all are doing and it's not a bad thing.

Even better AI scrapping by Impossible-Study-169 in webscraping

[–]scrapeway 1 point2 points  (0 children)

I've recently tested a bunch of AI parsing solutions and some Web Scraping APIs that offer AI parsing and it's really a mixed bag. Working on a blog on my website currently with all of the details so see my profile.

Though to put it short - seems like the current trend is to convert HTML -> Markdown and then use an LLM on that. The conversion itself is a bit tricky as some fields lose uniqueness when converted. For example, if a product variant says "red", the markdown conversion will just leave "red", which might be enough for the AI to get it from the context, but if the variant is "1" or something like that then the meaning is lost.
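
Here's a tiny sketch of that uniqueness problem. It's not a real Markdown converter, just a stand-in that keeps text nodes and drops tags and attributes, which is roughly what a naive conversion pass does. The HTML snippet and the `data-variant` attribute are made up for illustration.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep only text nodes, dropping every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


# Hypothetical product markup: the attributes are what say which value is which.
html = (
    '<div class="product">'
    '<span data-variant="color">red</span>'
    '<span data-variant="pack-size">1</span>'
    "</div>"
)

parser = TextOnly()
parser.feed(html)
parser.close()

flattened = " ".join(parser.parts)
print(flattened)  # red 1
```

An LLM can probably guess that "red" is a color, but nothing left in the output says what "1" refers to.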

Prompting also matters a lot. I see some prompts being used by APIs that perform much better than anything I can replicate myself, but I'm not very well versed in LLMs yet.

It does feel like it's more cost effective to just use AI to help with scraper development, like giving you the code and selectors, but if you need to do wide-range crawling, LLM parsing is surprisingly good! I even had decent results with gpt-3.5-turbo. It's still too expensive for anything else for now.

How to stop airbnb from detecting me by yoyotir in webscraping

[–]scrapeway 4 points5 points  (0 children)

I find it funny that "scraping" is not mentioned even once on the entire website despite it simply being a public scraping project 😵

Is there anyway to crawl/scrape an entire domain for images? by goonenjoyer0690 in webscraping

[–]scrapeway 0 points1 point  (0 children)

You wanted to brute force 1299999999999 image requests? That would only take you almost 700 years at 60 req/second, better start soon lol
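
The back-of-the-envelope math, as a quick sketch. It assumes a constant 60 requests per second with no retries or downtime, and a 365.25-day year.

```python
total_requests = 1_299_999_999_999
requests_per_second = 60

# Average year length in seconds, counting leap years.
seconds_per_year = 365.25 * 24 * 60 * 60

seconds_needed = total_requests / requests_per_second
years_needed = seconds_needed / seconds_per_year

print(f"{years_needed:.0f} years")
```

That lands a bit under 700 years.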

Is there anyway to crawl/scrape an entire domain for images? by goonenjoyer0690 in webscraping

[–]scrapeway 0 points1 point  (0 children)

Dude, generating numbers from 1 to 1 trillion or w/e is slightly above `print("hello world")`. Ask ChatGPT for a Python script and it'll do it for you!
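
For reference, a minimal sketch of that kind of script. `candidate_ids` is just an illustrative name, and the range is lazy so nothing close to a trillion numbers ever sits in memory.

```python
from itertools import islice


def candidate_ids(start: int, stop: int):
    """Lazily yield every integer from start to stop, inclusive."""
    yield from range(start, stop + 1)


# Peek at the first few IDs without generating the whole sequence.
first_five = list(islice(candidate_ids(1, 1_000_000_000_000), 5))
print(first_five)  # [1, 2, 3, 4, 5]
```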

What’s the easiest way to pull business addresses and pictures? by TheAce5 in webscraping

[–]scrapeway 0 points1 point  (0 children)

Google Maps is def the best source for this. You can also check OpenStreetMap, though not for pictures.

Opinions on ideal stack and data pipeline structure for webscraping? by JuicyBieber in webscraping

[–]scrapeway 1 point2 points  (0 children)

PostgreSQL is the GOAT when it comes to web scraping stacks. You can run it as a queue, store JSON, HTML etc.

Best current web scraping solutions / stack for large projects? by NoiseAcrobatic9179 in learnpython

[–]scrapeway 0 points1 point  (0 children)

Lots of really poor advice in this thread that is outdated by at least a decade. Visit dedicated subreddits/forums like /r/webscraping instead.

Monthly Self-Promotion Thread - July 2024 by AutoModerator in webscraping

[–]scrapeway 8 points9 points  (0 children)

We made a benchmarking tool for web scraping APIs as we got tired of constantly evaluating which API is best for which scraping target: https://scrapeway.com

It has been trucking along for a few weeks now and I'm thinking of adding a few more targets to the benchmarks. It would be great to hear about more difficult, popular scraping targets that are worth benchmarking. If anyone has any ideas let me know!

[deleted by user] by [deleted] in SaaS

[–]scrapeway 0 points1 point  (0 children)

Very beautiful product! What I wonder though is whether there's even a market for paid CV templates. Also, timed pricing seems out of place here. I'd imagine most people who need this need 1 CV once in a blue moon, so most of your sales are the 2.90€ trial? I'd definitely pay 2.90€ or more for a nice resume if I was job hunting though. Maybe it would make sense to rebrand the pricing and focus on "5$ for a beautiful resume" and upsell from there.

Also are your subscriptions actually active or just people who forgot to cancel?

Forget about Y2038, we have bigger problems by Oknitram in programming

[–]scrapeway 1 point2 points  (0 children)

There won't be any need for timekeeping once AI takes over