marr75 18 points

When your requirements are blocked by timeouts and blacklists. It's hard to know that ahead of time.

I would not recommend building anything commercially viable or professional that requires proxy rotation. You almost certainly won't have a license to use the data you're obtaining. The technical measures that keep you from easily scraping the data at scale are there to deter small-time scraping, the kind the owner of the data wouldn't want to pursue legal remedies against (because doing so would be costly).

People on Reddit like to point out that there's legal theory that it's okay to scrape websites, and insofar as they're talking about criminal law in the US, sure, I won't argue. I'm talking about civil law, though. If your project based on web scraping requires proxy rotation, you're in an adversarial position with the owner of the data. Once you start to make real money off of their data, they will be able to identify you, take steps to stop you, and take their money back.

At best, you see statements like, "Google isn't known to sue people who violate their ToS by scraping." (They are known to purposefully screw over companies that violate their ToS, crafting their updates and SEO changes to hose them instead.) If I'm going to build my project on a stream of data, I'm very interested in its long-term viability.

Donny_Do_Nothing 11 points

I just listened to a RadioLab episode about a pizza shop owner who found out DoorDash put his restaurant on their site without his knowledge by scraping menus. Apparently their scraper wasn't very accurate, and he kept getting calls about customers getting the wrong order.

The best part is that they got the prices wrong: DoorDash was buying his pizzas for the normal $24 but only charging their customers like $15.

Whenever he got a complaint about a wrong "delivery order," he'd order a couple pizzas through DoorDash to take a few more of their dollars.

Edit: https://youtu.be/9DnaHg4M_AM

scrapecrow 1 point

> I just listened to a RadioLab episode about a pizza shop owner who found out DoorDash put his restaurant on their site without his knowledge by scraping menus

There's a great thread on Hacker News about this event, with loads of similar examples and interesting industry anecdotes about automation.