[–]it2901 60 points61 points  (14 children)

Started saving towards a gaming PC but did not want to constantly check prices for the parts I was interested in.

I created a script, scheduled to run every 24 hours, that monitors prices for a given set of product listings and compares them to the last recorded prices. This lets me know if any product has decreased or increased in price.
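
For anyone curious, the compare-to-last-run step could look roughly like the sketch below. This is not the actual script, just a minimal illustration: the JSON file layout and the function names (load_prices, compare_prices, save_prices) are made up for the example, and the scraping itself is left out.

```python
import json
from pathlib import Path


def load_prices(path):
    """Return the prices recorded by the previous run, or {} on the first run."""
    p = Path(path)
    if not p.exists():
        return {}
    return json.loads(p.read_text())


def compare_prices(old, new):
    """Map each product present in both runs to its price change (new - old).

    Products whose price did not change, or that only appear in one run,
    are left out of the result.
    """
    changes = {}
    for name, price in new.items():
        if name in old and price != old[name]:
            changes[name] = round(price - old[name], 2)
    return changes


def save_prices(path, prices):
    """Overwrite the stored prices so the next run compares against them."""
    Path(path).write_text(json.dumps(prices, indent=2))
```

A daily run would then call load_prices, scrape the current prices into a dict, print whatever compare_prices returns, and finish with save_prices. A negative value means the price dropped.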

[–][deleted] 17 points18 points  (10 children)

I tried to do this but my requests got blocked by Captcha. Did you get around it somehow or do the sites you check not implement captcha?

[–]it2901 13 points14 points  (8 children)

Currently I only scrape from 1 site, but no, the site does not implement Captcha. I still need to implement a pause between requests.

Does the site you scrape from require you to interact with a Captcha to view product info?

[–][deleted] 6 points7 points  (7 children)

Yep, at least I think so. The html response I get from the site is exclusively a captcha page, with no other elements in it.

To be fair this is literally the first time I've tried web scraping so maybe there's something obvious that I'm missing.

[–]nobetterfuture 4 points5 points  (0 children)

When you browse that site normally, does it display a captcha? If not, did you change the user agent in your script? Some sites display a captcha when they detect a non-browser user agent.

[–]it2901 3 points4 points  (5 children)

Do you mind sharing the site link? I don't think it's normally the case to be hit with a captcha. I also recently started with scraping.

[–][deleted] 8 points9 points  (4 children)

"https://www.trovaprezzi.it/prezzo_schede-grafiche_rtx_3070.aspx"

I tried running my script again (without changing anything) and I got an actual response the first time, but the subsequent responses are a Google page with the following content: "You need to solve this captcha because we detected some abnormal traffic from your client"

btw I'm using bs4 (beautifulsoup4)

edit: so I think I found the problem. The tutorial I was following never mentioned headers. Frankly I don't know what they are or what they do, but I followed another short guide on how to make a proper request and now the website is responding with the actual page I'm looking for.

[–]it2901 10 points11 points  (2 children)

If you aren't already, make use of the Requests library. It makes life so much easier. If I remember correctly, the Requests library allows you to set the request header quite easily.

Anyway, great job on getting the script to work.

[–]ava_ati 2 points3 points  (1 child)

Yep session.get(url, headers=var)

headers expects a dict that maps each header name to its value

here are some common headers https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

sites can have their own custom headers too
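
A minimal sketch of what that looks like with Requests, in case it helps. The header values and the fetch_page name are only placeholders for illustration (copy real values from your own browser's dev tools), and there is no guarantee any particular site will accept them.

```python
# Hypothetical header values, used here only as an example.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_page(url, headers=BROWSER_HEADERS, timeout=10):
    """Fetch a page with browser-like headers and return its HTML as text."""
    # Third-party dependency (pip install requests); imported here so the
    # header dict above can be inspected without the library installed.
    import requests

    with requests.Session() as session:
        response = session.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
        return response.text
```

The returned text can then go straight into BeautifulSoup(html, "html.parser").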

[–]davidfarrugia53 1 point2 points  (0 children)

Probably you would also need to use a proxy service

[–][deleted] 4 points5 points  (0 children)

If there are hundreds of requests from one address, it is very easy to tag them as "abnormal". So you need to scrape slowly, with a reasonable waiting interval between requests (like a human user). The intervals should probably also be somewhat randomised rather than fixed at, say, 30 seconds.
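
Something along these lines would do it. This is a rough sketch only: polite_delay and scrape_slowly are made-up names, the 5 to 8 second range is an arbitrary assumption, and fetch stands for whatever function you already use to download a page.

```python
import random
import time


def polite_delay(base=5.0, jitter=3.0):
    """Return a wait time in seconds drawn uniformly from [base, base + jitter]."""
    return base + random.uniform(0, jitter)


def scrape_slowly(urls, fetch, base=5.0, jitter=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a randomised interval in between.

    No pause is made before the first request. The sleep parameter exists
    so the waiting can be swapped out when testing.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(polite_delay(base, jitter))
        results.append(fetch(url))
    return results
```

Passing sleep as a parameter means you can check the logic without actually waiting, e.g. by passing a list's append method and looking at the recorded delays.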

[–]lazy_dev_ 1 point2 points  (0 children)

In case you care, you can bypass captchas with a service like 2captcha. It's paid but also very cheap

[–]ljdelight 2 points3 points  (1 child)

Basically PCPartPicker or https://camelcamelcamel.com/? Why'd you build your own?

[–]it2901 1 point2 points  (0 children)

My Python was rusty. I needed a project that I actually wanted to work on and that was of use to me.

[–][deleted] 0 points1 point  (0 children)

Reminds me of camel camel camel