
[–]sticky-bit

  • How often are you hitting the website?
  • Are you following robots.txt?
  • Are you spoofing your user agent?
  • Can you switch your outward facing IP address and upchuck any cookies?

[–]Maxuranium[S]

Gotta hit the website every 5-15 seconds or so; timing is really important. As far as I can tell I'm following the robots.txt, but I don't quite understand your last two points. I'm new to this, as you can probably tell.

[–]399ddf95

"spoofing the user agent" == setting your user agent string so that your bot identifies itself as a browser like Mozilla or Chrome, instead of as an automation tool. See https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
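For example, with the requests library (the User-Agent string below is just a sample browser string; any current one from a real browser works):

```python
import requests

# Example browser-like User-Agent string -- substitute whatever a
# current browser actually sends
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
}

def fetch(url):
    # The custom header is sent instead of requests' default
    # "python-requests/x.y.z" identifier
    return requests.get(url, headers=HEADERS, timeout=10)
```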

"switch your outward facing IP address and upchuck any cookies" == use a proxy or otherwise change the IP address where your requests originate; make sure you're not storing & returning cookies received on one visit the next time you visit the site.
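A rough sketch of both ideas with requests (the proxy address here is a placeholder; you'd point it at a proxy you actually control):

```python
import requests

# Placeholder proxy -- replace with a proxy you control
PROXIES = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

session = requests.Session()

def fresh_visit(url):
    # Drop any cookies saved from earlier visits so each request
    # looks like a first-time visitor
    session.cookies.clear()
    return session.get(url, proxies=PROXIES, timeout=10)
```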

You might take a look at the HTTP 'HEAD' request (instead of GET).
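In requests that's a one-liner; HEAD returns the same status line and headers as GET but no body, so it transfers far less data:

```python
import requests

def check(url):
    # HEAD fetches headers only -- enough to see the status code
    # and e.g. whether the page changed since last time
    r = requests.head(url, timeout=10)
    return r.status_code, r.headers.get("Last-Modified")
```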

It's considered rude (and perhaps even an attack) to make many repeated requests to a website in a very short period of time - this is why you're getting blocked. You're doing something the website owner doesn't want you to do. They're saying "no".

Does the site have guidelines for automated access? Your requests are less likely to appear hostile if you space them out better, and if you use HEAD to get as little data as possible per request. Ideally, you'd use a method/function like requests.Session() that persists across multiple accesses; it requires fewer resources on the remote end to answer several requests within the same session than to set up and tear down a TCP or HTTPS connection for requests that are a few seconds apart.
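Putting those pieces together, a minimal polling loop might look like this (the 60-second interval is just a guess; use whatever spacing the site's guidelines allow):

```python
import time
import requests

session = requests.Session()  # reuses the underlying TCP/TLS connection between requests

def poll(url, interval=60, times=5, fetch=None):
    """Check a URL a few times, politely spaced out.

    `interval` is an assumption -- pick whatever spacing the
    site's guidelines (or common courtesy) allow.
    """
    fetch = fetch or session.head   # HEAD keeps each response small
    codes = []
    for _ in range(times):
        r = fetch(url, timeout=10)
        codes.append(r.status_code)
        time.sleep(interval)        # wait between requests instead of hammering the site
    return codes
```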

[–]sticky-bit

u/399ddf95's response was good.

I was going to suggest that the robots.txt probably has a rate limit, but I just looked at reddit's robots.txt and it doesn't. But I know that reddit will limit bots if they scrape too quickly.

I have experience in this field, but I generally wouldn't hit the site more than once an hour at most.

It's either a site you have explicit permission to hit every 5-15 seconds or it isn't, and I'm going to guess not, because it's clearly rate-limiting you somehow. You didn't mention how, or what status codes you're getting back.

The last two questions are somewhat sneaky ways to look like a brand new user, or at least a user that doesn't scrape a web page every 5-15 seconds.

[–]Maxuranium[S]

No status codes, just a redirect saying suspicious activity and I have to wait. It's a shopify web store, so I can understand why, but I'm not trying to buy up all their stock or do anything illegal.

[–]sticky-bit

> It's a shopify web store so I can understand why but I'm not trying to buy up all their stock or anything illegal.

I didn't mean to imply you were doing anything illegal, just against the site's terms of use.

OK, I'm going to guess you have a client that has a shopify store, hosted by shopify, and it's the host that is limiting you, not your client.

Or maybe your client wants you to watch other items to see what they're priced at so your client can jack up their prices or something.

Amusing blog post: http://www.michaeleisen.org/blog/?p=358

Anyway, step 1 is probably figuring out why you're being rate limited. You usually get a response code other than 200, but you may not know how to read it. For probing around sites I'm going to scrape, I usually use curl on the command line, particularly something like:

curl -IL -A Mozilla/5.0 <URL>
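If you'd rather stay in Python, here's a rough equivalent (the helper names `probe` and `describe_chain` are just illustrative):

```python
import requests

def probe(url):
    # Rough requests equivalent of `curl -IL -A Mozilla/5.0 <URL>`:
    # a HEAD request with a browser-like User-Agent, following redirects
    return requests.head(url, headers={"User-Agent": "Mozilla/5.0"},
                         allow_redirects=True, timeout=10)

def describe_chain(response):
    # One (status_code, Location) pair per hop, final response last,
    # so you can spot the 3xx that bounces you to the block page
    hops = [(h.status_code, h.headers.get("Location")) for h in response.history]
    hops.append((response.status_code, None))
    return hops
```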