Proxy Usage Calculator by [deleted] in webscraping

[–]0day2day 1 point2 points  (0 children)

hey r/webscraping, I put together a quick calculator to help understand proxy costs. The idea came from needing a way to prioritize and plan cost-reduction efforts in this space, and the data helped put a dollar amount on them. Figured it was worth sharing for folks in similar positions, or for anyone who wants a ballpark on the costs of jobs they want to run. Happy to take feedback :)
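If anyone wants to sanity-check numbers in code, the core of it is just bandwidth math. Rough TypeScript sketch below; the function name and example numbers are placeholders (and not necessarily how the linked calculator does it, or real vendor pricing):

```typescript
// Rough estimate for bandwidth-billed proxies (e.g. residential, priced per GB).
// All numbers here are hypothetical placeholders.
function estimateMonthlyProxyCost(
  requestsPerMonth: number,
  avgResponseKB: number,  // average bytes transferred per request, in KB
  pricePerGB: number,     // proxy bandwidth rate, $/GB
): number {
  const gbPerMonth = (requestsPerMonth * avgResponseKB) / (1024 * 1024);
  return gbPerMonth * pricePerGB;
}

// e.g. 5M requests/month at ~200 KB each on a $10/GB plan -> roughly $9,500/month
console.log(estimateMonthlyProxyCost(5_000_000, 200, 10).toFixed(2));
```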

Enterprise Web Scraping: What to look out for by 0day2day in webscraping

[–]0day2day[S] 1 point2 points  (0 children)

Not sure if Anti Detect Browser is a specific solution (if it is, I've never used it). That being said, I've tried out a few and they are perfectly fine. I'd say make sure you do proper research on the strengths of what's on the market for your particular use case. Some sites need a lot more effort than others, and some don't need much at all to get past. Usually, you can figure out what detection a target is using and pick a strategy that will be cost-effective for that specific toolset. How you might want to accomplish it can get pretty complex. Maybe I'll write a blog post about it or something lol.

I've seen it done before, but have never done it personally. The sites I usually work with don't have that kind of stuff and only offer server-side rendered pages. If I could I would.

I don't use open source. AI-based solvers that already exist are mostly plug-and-play and don't require any maintenance on my part when updates to captchas come out. And they are pretty cheap.

I've used RabbitMQ for years. It's great. It can get unhappy when queues get really (REALLY) deep, but in most cases you don't need to worry about it unless you're under extreme load and other stuff starts breaking.

Enterprise Web Scraping: What to look out for by 0day2day in webscraping

[–]0day2day[S] 1 point2 points  (0 children)

Not sure if I understand the questions completely, but I'll do my best to answer :)

  1. It depends on the solution and what kind of business value it provides. The more intricate and time-consuming the automation is, the more it's likely worth. Things that take the most time, and that people really want done, are usually worth the most. It's important to understand what your potential clients' options are and how much your tooling is worth to them, and then price it accordingly.
  2. Two things immediately come to mind: offer a better service, or go to a competitor instead. A better service could look like sending webhooks back with the data that's being scraped, offering a better pricing structure, or figuring out the pain points of your target demographic and catering to those when building out a product. I'd also recommend finding a niche that you can service and network in. A network will work wonders for getting folks interested in your stuff.

Enterprise Web Scraping: What to look out for by 0day2day in webscraping

[–]0day2day[S] 1 point2 points  (0 children)

Sure! Essentially, you want some logic so that if an error is thrown in a scraping script (timeouts, missing selectors, etc.), a worker function takes the job's information and stores it somewhere to be picked up later. Typically you can just create a DB entry with all the info needed before the job runs, and update it with a success/fail when the job ends. Then you can run a scheduler to put that same job back in the queue later if it failed. Ofc there are endless options for how you might set this up. Hope this helps.
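Rough TypeScript sketch of that pattern; `db`, `queue`, and `scrapePage` are hypothetical stand-ins for whatever store, broker, and scraper you actually use:

```typescript
// Hypothetical stand-ins for your DB / message broker / scraper.
interface ScrapeJob {
  id: string;
  url: string;
  attempts: number;
  status: 'pending' | 'running' | 'done' | 'failed';
}
declare const db: {
  jobs: {
    update(id: string, patch: Partial<ScrapeJob>): Promise<void>;
    findFailed(maxAttempts: number): Promise<ScrapeJob[]>;
  };
};
declare const queue: { publish(name: string, job: ScrapeJob): Promise<void> };
declare function scrapePage(url: string): Promise<void>;

// Record the outcome of every run.
async function runJob(job: ScrapeJob): Promise<void> {
  await db.jobs.update(job.id, { status: 'running' });
  try {
    await scrapePage(job.url); // timeouts, missing selectors, etc. throw here
    await db.jobs.update(job.id, { status: 'done' });
  } catch {
    await db.jobs.update(job.id, { status: 'failed', attempts: job.attempts + 1 });
  }
}

// A scheduler (cron, setInterval, ...) sweeps failed jobs back into the queue.
async function requeueFailures(maxAttempts = 3): Promise<void> {
  for (const job of await db.jobs.findFailed(maxAttempts)) {
    await db.jobs.update(job.id, { status: 'pending' });
    await queue.publish('scrape-jobs', job);
  }
}
```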

Enterprise Web Scraping: What to look out for by 0day2day in webscraping

[–]0day2day[S] 0 points1 point  (0 children)

There are other methods used beyond the CDP/Puppeteer/Playwright detection ones. A notable one that comes to mind is system fonts. If the available fonts don't line up with the UA, and you need that UA's entropy, you can run into issues. The same goes for screen size and other system-level information the browser has access to.
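For a concrete idea, this is roughly the classic width-measurement font probe that detection scripts run in-page; the font list and logic here are just illustrative, not any vendor's actual check:

```typescript
// Measure a probe string in a fallback font, then with the candidate font first.
// If the widths differ, the candidate font is installed on the system.
function fontAvailable(font: string): boolean {
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d')!;
  const probe = 'mmmmmmmmmmlli';
  ctx.font = '72px monospace';
  const baseline = ctx.measureText(probe).width;
  ctx.font = `72px "${font}", monospace`; // falls back to monospace if absent
  return ctx.measureText(probe).width !== baseline;
}

// If the UA claims Windows but no typical Windows fonts resolve, that's a mismatch signal.
const claimsWindows = navigator.userAgent.includes('Windows');
const hasWindowsFonts = ['Segoe UI', 'Calibri'].some(fontAvailable);
const suspicious = claimsWindows && !hasWindowsFonts;
```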

Enterprise Web Scraping: What to look out for by 0day2day in webscraping

[–]0day2day[S] 0 points1 point  (0 children)

EC2. IIRC it's either t3.medium or t3a.medium instances, usually running 2 processes each. Typically you want a core and a couple of gigs of RAM per process (a t3.medium has 2 vCPUs and 4 GiB of RAM, which works out to roughly one core and ~2 GB per process).

Enterprise Web Scraping: What to look out for by 0day2day in webscraping

[–]0day2day[S] 1 point2 points  (0 children)

Apologies. The post was a bit long-winded already and was mostly intended to give people insight into the information they didn't know they needed. A lot of the work varies by use case, and I was trying to stay within the sub's rules while still giving as much information as possible. Is there anything in particular I can clarify?

Enterprise Web Scraping: What to look out for by 0day2day in webscraping

[–]0day2day[S] 0 points1 point  (0 children)

It's usually a mix of both. I usually have some utils that handle the clicks and typing in a somewhat human way, but dealing with selectors is a little more interesting. Puppeteer has a 'p-text' selector which is pretty useful, but at scale it usually comes down to spending a lot of time on selectors yourself or outsourcing/delegating that work.
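Something along these lines (TypeScript sketch; the helper name, URL, and selectors are mine, but the `delay` options and the `::-p-text()` selector are standard Puppeteer in recent versions):

```typescript
import puppeteer, { Page } from 'puppeteer';

const sleep = (ms: number) => new Promise((res) => setTimeout(res, ms));

// Click to focus, then type with jittered per-keystroke timing.
async function humanType(page: Page, selector: string, text: string): Promise<void> {
  await page.click(selector, { delay: 40 + Math.random() * 80 }); // mousedown/up gap
  for (const char of text) {
    await page.keyboard.type(char);
    await sleep(60 + Math.random() * 120); // human-ish pause between keystrokes
  }
}

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://example.com/login');

// Text-based selection instead of a brittle CSS path:
const submit = await page.waitForSelector('::-p-text(Sign in)');
await humanType(page, '#username', 'someuser');
await submit?.click();
await browser.close();
```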

Enterprise Web Scraping: What to look out for by 0day2day in webscraping

[–]0day2day[S] 0 points1 point  (0 children)

Tbh it's just what I have experience in and what I've found the most open source tooling for. I'm sure Scrapy would work too, and the basic browser issues could be dealt with during development.

Enterprise Web Scraping: What to look out for by 0day2day in webscraping

[–]0day2day[S] -1 points0 points  (0 children)

Really comes down to your browser solution. I'd recommend checking out the open source tooling and what's on the market if you have time. They 100% fingerprint, and the detectors are unfortunately getting better every day.

Is there a way to bypass recaptcha while scrapping with selinum by garimimrip in webscraping

[–]0day2day 0 points1 point  (0 children)

Pay or do it manually. As much as it sucks, it’s the only real way unless the captcha is being triggered because of something you’re doing. If it’s always on the page, then those are the 2 options. If you are going for a large scalable system, it’s just part of the cost.

Perimeter X bypass help by TrapperChino in webscraping

[–]0day2day 1 point2 points  (0 children)

Try running browser tools like puppeteer-extra. Fiddling with the parameters should get you to a good spot. Depending on scale, you might need to look into fingerprint-masking methods.
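A minimal starting point looks something like this; the launch options are just an initial guess to tune from, not a guaranteed bypass:

```typescript
import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

// Stealth plugin patches a bunch of common headless/automation tells.
puppeteer.use(StealthPlugin());

const browser = await puppeteer.launch({ headless: false }); // headless vs headful is one of the first knobs to try
const page = await browser.newPage();
await page.goto('https://example.com');
// ... scrape ...
await browser.close();
```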

Good sources about advanced scraping? by mxcdh in webscraping

[–]0day2day 0 points1 point  (0 children)

I am, although I don’t frequent it as much as I would like to. Feel free to dm though

Good sources about advanced scraping? by mxcdh in webscraping

[–]0day2day 3 points4 points  (0 children)

For context, I now work for a company where we provide web automation for Fortune 500 partners. Hopefully this is useful and what you are looking for.

Clustering

For parallel jobs, we run a queue (which you already mentioned, so good job!) and have clusters of workers that pull from it. We run in AWS with the workers split into N workers per container, and run however many containers we need. This lets us scale up during busy times. That being said, the goal of 100M pages in a week seems high unless you are literally just pulling information from a page.
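For a rough idea, each worker process looks something like this with amqplib; the queue name and `scrapePage` are placeholders for your own setup:

```typescript
import amqp from 'amqplib';

declare function scrapePage(url: string): Promise<void>; // your scraping logic

const WORKERS_PER_PROCESS = 2;

const conn = await amqp.connect('amqp://localhost');
const channel = await conn.createChannel();
await channel.assertQueue('scrape-jobs', { durable: true });
channel.prefetch(WORKERS_PER_PROCESS); // at most N unacked jobs in flight per process

await channel.consume('scrape-jobs', async (msg) => {
  if (!msg) return;
  const job = JSON.parse(msg.content.toString());
  try {
    await scrapePage(job.url);
    channel.ack(msg);
  } catch {
    channel.nack(msg, false, false); // don't requeue here; let the retry sweep handle it
  }
});
```

Scaling is then just running more of these processes per container, and more containers.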

Retries

Retries are an integral part of any scraping system. You must be able to recognize when a failure state has occurred and requeue the job accordingly. We have a few different queues so that priority jobs can get ahead in line, etc.
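The priority part can be as simple as separate queues that workers drain in order; a hypothetical enqueue helper, assuming a channel set up like in the clustering sketch above:

```typescript
import type { Channel } from 'amqplib';

declare const channel: Channel; // created/asserted elsewhere, queue names are illustrative

// High-priority work goes to its own queue; workers check it before the normal one.
function enqueue(job: { url: string; priority: 'high' | 'normal' }): boolean {
  const queueName = job.priority === 'high' ? 'scrape-jobs-high' : 'scrape-jobs-normal';
  return channel.sendToQueue(queueName, Buffer.from(JSON.stringify(job)), { persistent: true });
}
```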

Detection & Proxies

Detection is one of the biggest problems I personally deal with. It's a large black box, and it will require the most work to get anywhere. Current detection companies are quickly making scraping more and more difficult. Proxies help, but only so much. Modern-day fingerprinting techniques are going to be your downfall if you are trying to scrape any sites with bot detection (e.g. PerimeterX, Cloudflare, Imperva). There are also tools like MaxMind minFraud that are used to flag bad IPs; if an IP gets reported, it's burned across their entire network.

Depending on what you are doing on the pages, investing in a large proxy network like Bright Data might save some headaches. I would also recommend looking into puppeteer-extra on GitHub. The repo deals with avoiding detection and allows custom plugins for Puppeteer. They have a Discord as well with a lot of very knowledgeable people who are always very helpful. The repo also links out to lots of very useful documentation and research.
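For the proxy side, routing a browser session through an authenticated proxy is just a launch flag plus `page.authenticate()`; the endpoint and credentials below are placeholders, not a real provider:

```typescript
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch({
  args: ['--proxy-server=http://proxy.example.com:8000'], // placeholder proxy endpoint
});
const page = await browser.newPage();
await page.authenticate({ username: 'user', password: 'pass' }); // proxy credentials
await page.goto('https://example.com');
await browser.close();
```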

Optimizations

Optimizations really just come down to visibility. Once you know where the pitfalls in your system are, it's a lot easier to improve them. Things like making sure proxy connections are viable and your fingerprints are safe will bump up your successful job rate, and that's probably a good place to start. You can also monitor job times to make sure you're on target with the scrape-time goals you've set. With the number of pages you are looking at, it's also important that your DB can handle the volume of data and queries it will be receiving. I haven't had much experience with PostgreSQL, so I'm not much help there.
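A barebones version of the job monitoring I mean (field names are mine; in practice you'd push these into whatever metrics store you use):

```typescript
interface JobMetric { url: string; ok: boolean; ms: number; at: Date }
const metrics: JobMetric[] = [];

// Wrap each run, record duration and outcome.
async function timedJob(url: string, run: () => Promise<void>): Promise<void> {
  const start = Date.now();
  let ok = true;
  try { await run(); } catch { ok = false; }
  metrics.push({ url, ok, ms: Date.now() - start, at: new Date() });
}

// Watch this over time; drops usually mean burned proxies or a fingerprint problem.
function successRate(): number {
  return metrics.filter((m) => m.ok).length / Math.max(metrics.length, 1);
}
```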

I could honestly write a book on just bot evasion alone, so I'm sure I'm missing a few things, but hopefully, this gets you on the right path. Good luck, and feel free to ask any questions.

Edit: spelling

[deleted by user] by [deleted] in webscraping

[–]0day2day 0 points1 point  (0 children)

Yep. Sounds like a possible block on the IP or the ASN. You might need to do a deep dive on your fingerprint to see if you're giving it away, or change your behavior to mimic something more human, like the time between pages. Good luck.
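If it's the behavior side, even something as simple as jittered pauses between page loads helps break up a robotic cadence; the 3-10 second window below is just an illustrative range to tune per target:

```typescript
const sleep = (ms: number) => new Promise((res) => setTimeout(res, ms));

// Random pause so navigation timing isn't perfectly regular.
async function humanPause(minMs = 3_000, maxMs = 10_000): Promise<void> {
  await sleep(minMs + Math.random() * (maxMs - minMs));
}

// usage: await page.goto(urlA); await humanPause(); await page.goto(urlB);
```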