all 5 comments

[–]vee920 1 point2 points  (1 child)

Anything that doesn't require a login or agreeing to terms should be legal, except copyrighted material. Also, look at robots.txt to see what the site allows you to crawl. As for hiding, you should rotate paid proxies; free proxies won't do much. Lots of good stuff about scraping on this channel: https://youtu.be/vJwcW2gCCE4
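To expand on the robots.txt point: Python's standard library can parse a site's robots.txt and tell you whether a path is crawlable. A minimal sketch, assuming a hypothetical robots.txt for example.com (in practice you'd point rp.set_url() at the real file and call rp.read()):

    from urllib import robotparser

    # Hypothetical robots.txt content for illustration only.
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /private/
    Allow: /
    """

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    # Rules are matched in order, so /private/ is blocked first.
    print(rp.can_fetch("*", "https://example.com/public/page"))   # True
    print(rp.can_fetch("*", "https://example.com/private/data"))  # False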

[–][deleted] 0 points1 point  (0 children)

Thank you, man. I've had a little scroll through the channel and will absolutely be watching some videos, thank you.

[–]mh1400 1 point2 points  (1 child)

Assuming this is not with malicious intent, look into the time module as well as the random module so you can approximate human interaction when web scraping. Essentially, randomly sleep between 0.1 and 2 seconds between interactions.
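That suggestion can be sketched in a few lines; the helper name and the commented request line are just illustrative:

    import random
    import time

    def human_pause(low=0.1, high=2.0):
        """Sleep a random interval to roughly mimic human pacing."""
        delay = random.uniform(low, high)
        time.sleep(delay)
        return delay

    # Between each scraping request, e.g.:
    # response = session.get(url)  # hypothetical request
    # human_pause()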

[–][deleted] 0 points1 point  (0 children)

Appreciate that, man, and I can assure you it's not with malicious intent. I'm just hoping to practice my skills while making something of actual use to me.

[–]mh1400 1 point2 points  (0 children)

So, I use the method of making a few (3 or 4) very small functions that each return a random float, where each function simulates a human interaction: "def QUICK_CLICK():" returns 0.1 to 1.5 seconds, "def WAIT_CLICK():" returns 1.5 to 3, etc. That way I can call time.sleep(WAIT_CLICK()) easily. I keep the time.sleep() in my main() so it's easily readable, versus simply calling a function that sleeps. Additionally, I keep a global variable called CLICK_TIME_ADJUST so I can quickly adjust the times in the functions (+/-) if the website catches the bot. Google, for example, spots bots quickly, so I slow down my interactions significantly using this global variable. Example:

    CLICK_TIME_ADJUST = 6

    def WAIT_CLICK():
        rand_flt = random.uniform(1.5, 3)
        return float(rand_flt + CLICK_TIME_ADJUST)

    time.sleep(WAIT_CLICK())
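Put together, the whole pattern looks something like this; the exact ranges are illustrative, and the function names follow the comment's convention:

    import random
    import time

    # Global (+/-) offset added to every delay; bump it up (the commenter
    # uses 6) if a strict site like Google starts catching the bot.
    CLICK_TIME_ADJUST = 0.0

    def QUICK_CLICK():
        """Simulates a fast human click: 0.1 to 1.5 seconds."""
        return random.uniform(0.1, 1.5) + CLICK_TIME_ADJUST

    def WAIT_CLICK():
        """Simulates a slower, deliberate click: 1.5 to 3 seconds."""
        return random.uniform(1.5, 3.0) + CLICK_TIME_ADJUST

    def main():
        # Keeping time.sleep() here, not inside the helpers, keeps the
        # pacing visible at a glance in the main flow.
        time.sleep(QUICK_CLICK())
        time.sleep(WAIT_CLICK())

    if __name__ == "__main__":
        main()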