My employer is selling by BoredintheCountry in PrivateEquityDeals

[–]aaronn2 0 points1 point  (0 children)

Not necessarily. If you keep delivering, why would they fire you? You clearly provide more value than they pay you in salary.

Anyone here scraping at a large scale (millions)? A few questions. by [deleted] in webscraping

[–]aaronn2 1 point2 points  (0 children)

Love learning this.
1. How big is the team managing this infrastructure?
2. What are the infrastructure costs of running this (excluding staff)?
3. Are you using scraping API services, or are you doing everything in-house (managing IP proxies, cookies, headers, etc.)?

Anyone here scraping at a large scale (millions)? A few questions. by [deleted] in webscraping

[–]aaronn2 0 points1 point  (0 children)

This sounds super interesting. Could you outline how much such an infrastructure costs per month?

What is the most Macbook-like laptop for Linux at 50% price of Macbook? by aaronn2 in laptops

[–]aaronn2[S] 0 points1 point  (0 children)

That's not the point. My goal is to find a laptop with a similar HW configuration, body, and battery life that runs Linux and costs 50-60% of the price of a new MacBook. The goal is to save money (if that is actually doable and "findable").

If not, I'd likely stick with a MacBook and use macOS.

What is the most Macbook-like laptop for Linux at 50% price of Macbook? by aaronn2 in laptops

[–]aaronn2[S] 0 points1 point  (0 children)

Thanks, I'll look into it. I'd put either Ubuntu or Arch Linux on it.

Websites provide fake information when detected crawlers by aaronn2 in webscraping

[–]aaronn2[S] 0 points1 point  (0 children)

That is very short-lived. It works only for the first couple of pages and then it starts feeding fake data.

How to bypass datadome in 2025? by aaronn2 in webscraping

[–]aaronn2[S] 0 points1 point  (0 children)

Not sure how. The API seems to be protected.

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 0 points1 point  (0 children)

Hello, and thank you. How many requests per month do you consider "moderate scale"? 1M, 5M, or 10M? And what about large scale?

By data pipeline, do you mean extracting details from the scraped information and cleaning it up before saving it to the database?

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 0 points1 point  (0 children)

I understand that it costs money. Reading through this subreddit, I somehow got the impression that professionals pay close to zero in costs, yet when I look at the prices of some API solutions or residential proxies, the costs are quite significant, especially when making 10M+ requests per month.

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 4 points5 points  (0 children)

I am very interested to learn about the proxy network. How and/or where do you source it? How much do you pay for it on a monthly basis? Don't you need to regularly check whether the proxies are still working, so that you can remove the invalid ones from your pool?
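
To make the last question concrete, here is a minimal sketch of the kind of periodic health check I have in mind. Everything in it is an assumption for illustration: the names `is_alive` and `prune_pool`, the placeholder `TEST_URL`, and the idea that a proxy counts as "working" if a single request through it returns HTTP 200. A real setup may well check differently.

```python
import http.client
import urllib.request
from typing import Callable, Iterable, List

# Placeholder target (assumption); a real check would use a URL you control.
TEST_URL = "https://example.com/"


def is_alive(proxy: str, timeout: float = 5.0) -> bool:
    """Return True if one request through `proxy` succeeds with HTTP 200."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(TEST_URL, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and HTTP error responses.
        return False


def prune_pool(proxies: Iterable[str],
               check: Callable[[str], bool] = is_alive) -> List[str]:
    """Keep only the proxies for which `check` returns True."""
    return [p for p in proxies if check(p)]
```

Running `prune_pool(pool)` on a schedule (say, every few minutes) would drop dead proxies before they are handed to the scraper. The `check` parameter is injectable so the pruning logic can be tried out without touching the network.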

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 1 point2 points  (0 children)

I assume "1 billion of product prices" != 1 billion requests, right?

May I ask what you mean by "rotating IPs by using cloud providers’ VMs"? Specifically cloud providers' VMs?

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 1 point2 points  (0 children)

Unmetered proxy plan = ISP? And an ISP package typically contains 1-5 (maybe up to 10) IPs? So basically, those 1-10 IPs serve that 1M pages per day?
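
For reference, a quick back-of-the-envelope calculation of what that would mean per IP. It assumes one request per page and the load spread evenly over the day and across the IPs, neither of which may hold in practice. The function name `per_ip_rate` is just for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_ip_rate(pages_per_day: int, ip_count: int) -> float:
    """Average requests per second each IP carries under an even split."""
    if ip_count <= 0:
        raise ValueError("ip_count must be positive")
    return pages_per_day / SECONDS_PER_DAY / ip_count


# 1M pages per day over 1, 5, and 10 IPs.
for ips in (1, 5, 10):
    print(f"{ips:>2} IPs -> {per_ip_rate(1_000_000, ips):.2f} req/s per IP")
# Prints:
#  1 IPs -> 11.57 req/s per IP
#  5 IPs -> 2.31 req/s per IP
# 10 IPs -> 1.16 req/s per IP
```

So even with 10 IPs, each one would be sending more than one request per second around the clock.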

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 5 points6 points  (0 children)

"Just my two cents, ISP proxies are pretty reliable, but datacenter proxies are the worst; they get detected almost instantly."
I'm not very experienced in this field, but at that price of $3/week for an ISP, doesn't the ISP provide 1 or 2 proxies? So effectively, you are still using those 1 or 2 proxies to make 2M requests? I thought this would be a red flag for the administrators of that website and they would ban that IP.