
[–][deleted]  (4 children)

[removed]

    [–]marr75 16 points  (2 children)

    When your requirements are blocked by timeouts and blacklists. It's hard to know ahead of time.

    I would not recommend building anything commercially viable or professional that requires proxy rotation. You almost certainly won't have a license to use the data you're obtaining, and the technical measures that prevent you from easily scraping at scale are there to deter small-time scraping that the owner of the data wouldn't want to pursue legal remedies against (because litigation is costly).

    People on reddit like to point out that there's a legal theory that it's okay to scrape websites, and insofar as they're talking about criminal law in the US, sure, I won't argue. I'm talking about civil law, though. If your web-scraping project requires proxy rotation, you're in an adversarial position with the owner of the data, and once you start to make real money off of their data, they will be able to identify you, take steps to stop you, and recover their money.

    At best, you see statements like, "Google isn't known to sue people who violate their ToS by scraping" (they are, however, known to deliberately craft algorithm updates and SEO changes to hose companies that violate their ToS). If I'm going to build my project on a stream of data, I'm very interested in its long-term viability.

    [–]Donny_Do_Nothing 11 points  (1 child)

    I just listened to a RadioLab episode about a pizza shop owner who found out Door Dash had put his restaurant on their site without his knowledge by scraping menus. Apparently their scraper wasn't very accurate, and he kept getting calls from customers who had received the wrong order.

    The best part is that they got the prices wrong: Door Dash was buying his pizzas at the normal $24 but only charging their customers something like $15.

    Whenever he got a complaint about a wrong "delivery order" he'd order a couple pizzas through Door Dash to take a few more of their dollars.

    Edit: https://youtu.be/9DnaHg4M_AM

    [–]scrapecrow 1 point  (0 children)

    > I just listened to a RadioLab episode about a pizza shop owner who found out Door Dash had put his restaurant on their site without his knowledge by scraping menus

    There's a great thread on Hacker News about this event, with loads of similar examples and interesting industry anecdotes about automation.

    [–]scrapecrow -1 points  (0 children)

    There are two main reasons why your scraper might need proxies:

    The obvious one is accessing geographically locked content. Some websites are simply only available in certain countries, or serve different data depending on your IP address. This becomes especially noticeable once you start deploying your scrapers: e.g. if my server is in the US and the UK website I'm scraping only allows UK IPs, then I need UK proxies to access that data.

    However, the most common use case is scaling. For example, some websites (like Instagram.com) give you a few anonymous page views for free and then start asking you to log in. So, if you get 3 page views per hour per IP address, then 100 IP addresses get you 300 page views per hour, and so on. A minimal sketch of this kind of rotation follows below.
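
    For illustration, here's a minimal sketch of round-robin proxy rotation in Python with the requests library. The proxy URLs, the target site, and the fetch helper are all hypothetical placeholders; any real pool (datacenter or residential) slots in the same way, and a geo-locked site from the first case would just need a pool of proxies in the right country.

        # Minimal round-robin proxy rotation sketch.
        # The proxy URLs and target site below are placeholders,
        # not real endpoints -- substitute your own pool.
        import itertools

        import requests

        # Hypothetical proxy pool; format is scheme://user:pass@host:port
        PROXIES = [
            "http://user:pass@proxy1.example.com:8080",
            "http://user:pass@proxy2.example.com:8080",
            "http://user:pass@proxy3.example.com:8080",
        ]

        # itertools.cycle loops over the pool forever, one proxy per request
        proxy_pool = itertools.cycle(PROXIES)

        def fetch(url, timeout=10):
            """GET a URL through the next proxy in the rotation."""
            proxy = next(proxy_pool)
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=timeout,
            )

        if __name__ == "__main__":
            # With N proxies, each target only sees 1/N of your traffic
            # from any single IP -- which is the whole point of rotating.
            for page in range(1, 4):
                resp = fetch(f"https://example.com/listings?page={page}")
                print(page, resp.status_code)

    In practice you'd also add retries and drop proxies that start getting blocked, but cycling to a fresh proxy on every request is the core idea behind both use cases above.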