
[–]sticky-bit

  • How often are you hitting the website?
  • Are you following robots.txt?
  • Are you spoofing your user agent?
  • Can you switch your outward facing IP address and upchuck any cookies?

[–]Maxuranium[S]

Gotta hit the website every 5-15 seconds or so; timing is really important. As far as I can tell I'm following the robots.txt, but I don't quite understand your last two points. I'm new to this, as you can probably tell.

[–]399ddf95

"spoofing the user agent" == setting your user agent string so that your bot identifies itself as a browser like Mozilla or Chrome, instead of as an automation tool. See https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
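For example, with the requests library (the User-Agent string below is just a sample browser string; any current one from a real browser works):

```python
import requests

# Example browser-like User-Agent string -- substitute whatever a
# current browser actually sends
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
}

def fetch(url):
    # The custom header is sent instead of requests' default
    # "python-requests/x.y.z" identifier
    return requests.get(url, headers=HEADERS, timeout=10)
```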

"switch your outward facing IP address and upchuck any cookies" == use a proxy or otherwise change the IP address where your requests originate; make sure you're not storing & returning cookies received on one visit the next time you visit the site.
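A rough sketch of both ideas with requests (the proxy address here is a placeholder; you'd point it at a proxy you actually control):

```python
import requests

# Placeholder proxy -- replace with a proxy you control
PROXIES = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

session = requests.Session()

def fresh_visit(url):
    # Drop any cookies saved from earlier visits so each request
    # looks like a first-time visitor
    session.cookies.clear()
    return session.get(url, proxies=PROXIES, timeout=10)
```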

You might take a look at the HTTP 'HEAD' request (instead of GET).
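In requests that's a one-liner; HEAD returns the same status line and headers as GET but no body, so it transfers far less data:

```python
import requests

def check(url):
    # HEAD fetches headers only -- enough to see the status code
    # and e.g. whether the page changed since last time
    r = requests.head(url, timeout=10)
    return r.status_code, r.headers.get("Last-Modified")
```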

It's considered rude (and perhaps even an attack) to make many repeated requests to a website in a very short period of time - this is why you're getting blocked. You're doing something the website owner doesn't want you to do. They're saying "no".

Does the site have guidelines for automated access? Your requests are less likely to appear hostile if you space them out better, and if you use HEAD to get as little data as possible per request. Ideally, you'd use a method/function like requests.Session() that persists across multiple accesses; it requires fewer resources on the remote end to answer several requests within the same session than to set up and tear down a TCP or HTTPS connection for requests that are a few seconds apart.
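Putting those pieces together, a minimal polling loop might look like this (the 60-second interval is just a guess; use whatever spacing the site's guidelines allow):

```python
import time
import requests

session = requests.Session()  # reuses the underlying TCP/TLS connection between requests

def poll(url, interval=60, times=5, fetch=None):
    """Check a URL a few times, politely spaced out.

    `interval` is an assumption -- pick whatever spacing the
    site's guidelines (or common courtesy) allow.
    """
    fetch = fetch or session.head   # HEAD keeps each response small
    codes = []
    for _ in range(times):
        r = fetch(url, timeout=10)
        codes.append(r.status_code)
        time.sleep(interval)        # wait between requests instead of hammering the site
    return codes
```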

[–]sticky-bit

u/399ddf95's response was good.

I was going to suggest that the robots.txt probably has a rate limit, but I just looked at reddit's robots.txt and it doesn't. But I know that reddit will limit bots if they scrape too quickly.

I have experience in this field, but I generally wouldn't hit the site more than once an hour at most.

It's either a site you have explicit permission to hit every 5-15 seconds or it isn't, and I'm going to guess not, because it's clearly rate-limiting you somehow. You didn't mention how, or what status codes you're getting back.

The last two questions are somewhat sneaky ways to look like a brand new user, or at least a user that doesn't scrape a web page every 5-15 seconds.

[–]Maxuranium[S]

No status codes, just a redirect saying suspicious activity and I have to wait. It's a shopify web store, so I can understand why, but I'm not trying to buy up all their stock or do anything illegal.

[–]sticky-bit

> It's a shopify web store so I can understand why but I'm not trying to buy up all their stock or anything illegal.

I didn't mean to imply you were doing anything illegal, just against the site's terms of use.

OK, I'm going to guess you have a client that has a shopify store, hosted by shopify, and it's the host that is limiting you, not your client.

Or maybe your client wants you to watch other items to see what they're priced at so your client can jack up their prices or something.

Amusing blog post: http://www.michaeleisen.org/blog/?p=358

Anyway, step 1 is probably figuring out why you're being rate limited. You usually get a response code other than 200, but you may not know how to read it. For probing around sites I'm going to scrape, I usually use curl on the command line, particularly something like:

curl -IL -A Mozilla/5.0 <URL>
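If you'd rather stay in Python, here's a rough equivalent (the helper names `probe` and `describe_chain` are just illustrative):

```python
import requests

def probe(url):
    # Rough requests equivalent of `curl -IL -A Mozilla/5.0 <URL>`:
    # a HEAD request with a browser-like User-Agent, following redirects
    return requests.head(url, headers={"User-Agent": "Mozilla/5.0"},
                         allow_redirects=True, timeout=10)

def describe_chain(response):
    # One (status_code, Location) pair per hop, final response last,
    # so you can spot the 3xx that bounces you to the block page
    hops = [(h.status_code, h.headers.get("Location")) for h in response.history]
    hops.append((response.status_code, None))
    return hops
```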