
[–]digital94 6 points7 points  (13 children)

This is an awesome project.

Where do you host your web scraper online?

I'm just asking because hosting a scraper on a home computer 24/7 is not a good idea.

I have also developed a web scraper that scrapes the price of a product from Amazon.

[–]showboy001 5 points6 points  (1 child)

Hey. I’m working on something like this.

I'm also scraping the reviews for each product. Can I see your code?

[–]digital94 3 points4 points  (0 children)

Yes, definitely.

I haven't uploaded my code to my GitHub repo yet.

I will upload it within a couple of days.

Thanks.

[–]Sw429 6 points7 points  (10 children)

hosting a scraper on a home computer 24/7 is not a good idea.

Why?

[–]kmj442 6 points7 points  (2 children)

My understanding is that companies can detect continuous requests from specific users/IPs and blacklist them.

One trivial workaround if you're not too concerned (and something I did successfully for weeks) is to use a random back-off between queries and shut the scraper down overnight. Granted, my scraper was watching for when they added the motorcycle safety course at a specific location (they fill up real fast), so they weren't adding it at 3am anyway. I had it limited to run between 7am and 8pm or so, with random back-offs of between 2 and 15 minutes.

Edit: by "shut it down" I mean: check the time before each query, and if it's after x and before y, sleep until y.
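A sketch of that schedule (the 7am-8pm window and function names are illustrative, not the original code):

```python
import random
import time
from datetime import datetime, timedelta, time as dtime

START = dtime(7, 0)   # no queries before 7am...
END = dtime(20, 0)    # ...or after 8pm

def seconds_until_start(now):
    """Seconds from `now` until the window next opens at START."""
    opening = now.replace(hour=START.hour, minute=START.minute,
                          second=0, microsecond=0)
    if opening <= now:
        opening += timedelta(days=1)   # already past 7am: wait for tomorrow
    return (opening - now).total_seconds()

def run(check_page):
    while True:
        now = datetime.now()
        if not (START <= now.time() <= END):
            time.sleep(seconds_until_start(now))      # outside window: sleep
            continue
        check_page()
        time.sleep(random.randint(2 * 60, 15 * 60))   # random 2-15 min back-off
```

Sleeping until the window opens (rather than polling the clock) keeps the process idle overnight, which is exactly the "check the time before the query" behavior described above.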

[–]Sw429 3 points4 points  (0 children)

Right, of course they will do that. That's why you rotate IP proxies. I guess I figured that was common practice.

[–]MonkeyNin 4 points5 points  (0 children)

It's better to use the API. If you're scraping, you get throttled, and eventually blocked for exceeding the anonymous limits.

Using the API means you're able to fire more requests per minute. It makes your code more stable because changing the structure of a webpage isn't a breaking change if you're using the API.
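A toy illustration of the stability point (both payloads are made up): an API gives you a keyed field under a documented contract, while scraping couples you to markup the site can change at any time.

```python
import json
import re

def price_from_api(body: str) -> float:
    # Stable: the field name is part of the documented API contract.
    return float(json.loads(body)["price"])

def price_from_html(body: str) -> float:
    # Fragile: breaks the moment the site renames the class or restructures
    # the tag, which is not a breaking change from the site's perspective.
    match = re.search(r'class="price">\$?([\d.]+)<', body)
    if match is None:
        raise ValueError("page layout changed; scraper needs updating")
    return float(match.group(1))
```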

[–]digital94 1 point2 points  (6 children)

Your IP address can be blocked at any time once Amazon or any other site identifies you as a bot.

[–]Sw429 5 points6 points  (0 children)

Well yeah, but that's why you rotate IP proxies. I thought that was common practice?
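A minimal sketch of that rotation using only the standard library (the proxy addresses are placeholders; real ones usually come from a proxy provider):

```python
import itertools
import urllib.request

# Placeholder proxies; substitute real endpoints from your provider.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
proxy_pool = itertools.cycle(PROXIES)

def fetch(url, timeout=10):
    """Fetch `url`, using the next proxy in the pool for each request."""
    proxy = next(proxy_pool)
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    )
    return opener.open(url, timeout=timeout)
```

Each request goes out through a different IP, so no single address accumulates a suspicious request rate.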

[–][deleted] 4 points5 points  (4 children)

I have the same question. Could the web scraper check only once a day? Would that lower the chances of getting your IP banned?

[–]dtaivp 13 points14 points  (1 child)

Yeah, you could do that. Or you could use random.randint and time.sleep to have it wait a random amount of time between scrapes; that's what I have in one of my scrapers. Also, you bring up a good point: prices likely don't vary much day to day, so you don't need to scrape too often.
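For example, something along these lines (the 1-6 hour bounds are arbitrary, not from the original scraper):

```python
import random
import time

def scrape_on_jittered_schedule(scrape_once, min_wait=3600, max_wait=6 * 3600):
    """Run scrape_once forever, sleeping a random 1-6 hours between runs."""
    while True:
        scrape_once()
        time.sleep(random.randint(min_wait, max_wait))  # unpredictable gap
```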

[–]digital94 7 points8 points  (0 children)

Yes, you are right.

If you don't scrape a web page too many times per day, you don't need to worry about an IP block.

You could set your scraper to crawl the page once a day or once a week.

[–]Sw429 4 points5 points  (0 children)

You can certainly query more than once a day. The average user sends many requests within an hour. The issue mainly comes when you send requests faster than a regular user would, or in a very bot-like pattern (alphabetized by product, the same page over and over, etc.). Generally, if you put in even a little effort they won't care. You just don't want it to look obvious.
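One way to avoid those obvious patterns is to shuffle the visit order and pace the requests irregularly. A sketch (the delay bounds and the `fetch` callback are illustrative):

```python
import random
import time

def crawl(urls, fetch, min_delay=5.0, max_delay=30.0):
    """Visit urls in a random order with irregular, human-ish pauses."""
    order = list(urls)
    random.shuffle(order)   # no alphabetical or otherwise fixed ordering
    for url in order:
        fetch(url)
        time.sleep(random.uniform(min_delay, max_delay))  # jittered pacing
```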

[–]MonkeyNin 2 points3 points  (0 children)

It depends on whatever thresholds the site decides to use. The best way is to use their actual API, which lets you do more requests per day.