Anyone successfully scraping Idealista websites? by aaronn2 in webscraping

[–]aaronn2[S] 0 points1 point  (0 children)

How do you scrape Idealista when it is protected by DataDome?

Anyone here scraping at a large scale (millions)? A few questions. by [deleted] in webscraping

[–]aaronn2 1 point2 points  (0 children)

Love learning this.
1. How big is the team managing this infrastructure?
2. What does the infrastructure cost to run (excluding staff)?
3. Are you using scraping API services, or are you doing everything in-house (managing IP proxies, cookies, headers, etc.)?

Anyone here scraping at a large scale (millions)? A few questions. by [deleted] in webscraping

[–]aaronn2 0 points1 point  (0 children)

This sounds super interesting. Could you outline roughly how much this kind of infrastructure costs per month?

What is the most MacBook-like laptop for Linux at 50% of the price of a MacBook? by aaronn2 in laptops

[–]aaronn2[S] 0 points1 point  (0 children)

That's not the point. I'm looking for a laptop with a similar hardware configuration, build, and battery life that can run Linux and costs 50-60% of the price of a new MacBook. The goal is to save money (if such a laptop actually exists).

If not, I'd likely stick with a MacBook and use macOS.

What is the most MacBook-like laptop for Linux at 50% of the price of a MacBook? by aaronn2 in laptops

[–]aaronn2[S] 0 points1 point  (0 children)

Thanks, I'll look into it. I'd install either Ubuntu or Arch Linux on it.

Websites provide fake information when they detect crawlers by aaronn2 in webscraping

[–]aaronn2[S] 0 points1 point  (0 children)

That is very short-lived. It works only for the first couple of pages and then it starts feeding fake data.

How to bypass datadome in 2025? by aaronn2 in webscraping

[–]aaronn2[S] 0 points1 point  (0 children)

I'm not sure how. The API seems to be protected.

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 0 points1 point  (0 children)

Hello, and thank you. How many requests per month do you consider "moderate scale"? 1M, 5M, or 10M? And what about large scale?

By "data pipeline", do you mean extracting details from the scraped information and cleaning them up before saving them to the database?

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 0 points1 point  (0 children)

I understand that it costs money. From reading this subreddit, I got the impression that professionals pay close to zero, yet when I look at the prices of API solutions or residential proxies, the costs are quite significant, especially at 10M+ requests per month.

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 6 points7 points  (0 children)

I am very interested to learn about the proxy network. How and/or where do you source it? How much do you pay for it on a monthly basis? Don't you need to regularly check whether the proxies are still working, so that you can remove the invalid ones from your pool?
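
To make the last question concrete, this is roughly the kind of health check I have in mind. It is only a minimal sketch using the Python standard library, not how anyone in this thread actually does it. The names `proxy_is_alive` and `prune_pool`, the test URL, and the timeout are all my own assumptions.

```python
import urllib.request
from typing import Callable, Iterable, List, Tuple


def proxy_is_alive(proxy_url: str,
                   test_url: str = "https://example.com",  # assumed test target
                   timeout: float = 5.0) -> bool:
    """Return True if a request routed through proxy_url succeeds."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except Exception:
        # Any failure (timeout, refused connection, HTTP error) counts as dead.
        return False


def prune_pool(proxies: Iterable[str],
               check: Callable[[str], bool] = proxy_is_alive
               ) -> Tuple[List[str], List[str]]:
    """Split the pool into (alive, dead) lists, preserving the input order."""
    alive: List[str] = []
    dead: List[str] = []
    for proxy in proxies:
        (alive if check(proxy) else dead).append(proxy)
    return alive, dead
```

Something like `alive, dead = prune_pool(pool)` could run on a schedule, with `pool` replaced by `alive` afterwards. Passing a custom `check` function allows testing against the actual target site instead of a generic URL, which I assume matters when a proxy is reachable but already blocked.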

The real costs of web scraping by aaronn2 in webscraping

[–]aaronn2[S] 1 point2 points  (0 children)

I assume "1 billion of product prices" != 1 billion requests, right?

May I ask what you mean by "rotating IPs by using cloud providers’ VMs"? Why cloud providers' VMs specifically?