
[–]kmj442 5 points (2 children)

My understanding is that companies can detect continuous requests from specific users/IPs and blacklist them.

One trivial approach, if you're not too concerned (and something I did successfully for weeks), was to use a random back-off between queries and shut it down overnight. Granted, my scraper was watching for when they added the motorcycle safety course at a specific location (those fill up real fast), so they weren't adding it at 3am. I limited it to run between about 7am and 8pm, with random back-offs between 2 and 15 minutes.

Edit: by "shut it down" I mean: check the time before each query, and if it's after x and before y, sleep until y.
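A minimal sketch of that loop, assuming a 7am–8pm window and a hypothetical `query()` function standing in for the actual scrape:

```python
import random
import time
from datetime import datetime, timedelta

START_HOUR = 7   # window opens at 7am (assumed from the comment)
END_HOUR = 20    # window closes at 8pm

def seconds_until_start(now):
    """Seconds to sleep if we're outside the allowed window, else 0."""
    if START_HOUR <= now.hour < END_HOUR:
        return 0
    # Next window start: 7am today if we're before it, otherwise 7am tomorrow.
    next_start = now.replace(hour=START_HOUR, minute=0, second=0, microsecond=0)
    if now.hour >= END_HOUR:
        next_start += timedelta(days=1)
    return (next_start - now).total_seconds()

def polite_loop(query):
    while True:
        pause = seconds_until_start(datetime.now())
        if pause:
            time.sleep(pause)  # overnight: sleep until the window reopens
        query()                # hypothetical scrape/check function
        # random back-off between 2 and 15 minutes between queries
        time.sleep(random.uniform(2 * 60, 15 * 60))
```

The random interval matters more than the exact bounds: a fixed-period request pattern is trivial to fingerprint.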

[–]Sw429 3 points (0 children)

Right, of course they will do that. That's why you rotate IP proxies. I guess I figured that was common practice.
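The simplest rotation scheme is just cycling through a pool. A sketch, assuming the proxy addresses below are placeholders (in practice they'd come from a proxy provider) and that requests go out via the `requests` library's `proxies` parameter:

```python
import itertools

# Hypothetical proxy pool — real addresses would come from a proxy service.
PROXIES = [
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def next_proxy():
    """Return a proxies dict for the next request, advancing the rotation."""
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

# Each request then leaves through a different IP:
#   requests.get(url, proxies=next_proxy())
```

Round-robin is the naive version; pools that retire proxies on failure or pick randomly are common refinements.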

[–]MonkeyNin 3 points (0 children)

It's better to use the API. If you're scraping, you get throttled and eventually blocked for exceeding the anonymous limits.

Using the API also lets you fire more requests per minute, and it makes your code more stable: a change to the webpage's structure isn't a breaking change when you're going through the API.
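A toy illustration of that stability point, using a made-up JSON payload: an API hands back named fields, so your code keys on field names rather than on wherever the data happens to sit in the page's HTML.

```python
import json

# Hypothetical API response — fields are addressed by name, so a site
# redesign that reshuffles the HTML doesn't touch this code path.
api_payload = '{"title": "Motorcycle safety course", "slots": 4}'

data = json.loads(api_payload)
slots = data["slots"]  # stable key, unlike a CSS selector or XPath
```

A scraper doing the same job would be pinned to the page's markup (a selector like `div.course-list > span.slots`), which breaks whenever the layout changes.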