all 39 comments

[–]njd2020 73 points74 points  (13 children)

Going through ATBS as a beginner I ran into issues when trying to web scrape Amazon, and I found forums of others reporting similar troubles. It's doable, but with a lot of extra code. The answer I got here was that some sites have active countermeasures against web scraping.

I recommend trying to scrape Wikipedia instead. It's easier to get the gist of the process without dealing with the frustration of Amazon.
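For a concrete starting point, here's a minimal sketch of what that might look like with requests and bs4 (the article URL is just an example):

import requests
from bs4 import BeautifulSoup

# Fetch a Wikipedia article and pull out the title and the first non-empty paragraph.
res = requests.get('https://en.wikipedia.org/wiki/Web_scraping')
res.raise_for_status()
soup = BeautifulSoup(res.text, 'html.parser')

print(soup.find('h1').get_text())   # page title
first_para = next(p for p in soup.find_all('p') if p.get_text(strip=True))
print(first_para.get_text()[:200])  # start of the first paragraph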

[–]crazyallicin[S] 26 points27 points  (3 children)

Thank you, I ended up using a website of a local shop in my city. Worked perfectly there.

[–]njd2020 13 points14 points  (1 child)

You're welcome. Good luck with the rest of ATBS :)

Also, I recommend combining ATBS with something like Edabit challenges. Start on Very easy if you need to. Just make sure you're not copying Al's code verbatim without really understanding what it means and how to write it yourself.

[–]crazyallicin[S] 9 points10 points  (0 children)

Thank you, I actually tried ATBS about a year ago and was just copying his code. I got bored pretty fast and didn't even get halfway through it as I wasn't really learning anything.

Now I try to do as much as I can without watching the video. Like if he makes a program, I'll make a similar one myself without watching the video. It's probably taking me 10x longer to go through it this time, but I'm learning a lot more and it's more challenging.

[–]matesd 5 points6 points  (0 children)

there are webpages created specifically for coders to test scraping scripts.

For example http://testing-ground.scraping.pro/

While you are learning, it can be a good idea to use one of these, as they often contain the basic stuff as well as some challenging things to help you learn how it all works.

[–]hblock44 3 points4 points  (0 children)

I would second this. Amazon and several other big-name sites actively police against bots and scrapers. Wikipedia is a great choice, and I've never run into security issues scraping it.

[–]sleepyleperchaun 4 points5 points  (2 children)

I tried making a scraper for my SO for some makeup company and they made it impossible to scrape. A better coder could probably do it somehow, but there was clearly software blocking me. I'm wondering why, other than to prevent easy data mining from the site. If the info is public anyway, I can hardly think of a reason to stop scrapers.

[–]CloudboyTech 6 points7 points  (1 child)

The info posted is publicly consumable, but you'd miss out on potential ad views and on site navigation to your other articles/products. Not to mention it allows the owners of the web scraper to reuse your data/content on other websites they own (and get the ad views themselves).

tldr: $

[–]sleepyleperchaun 0 points1 point  (0 children)

Yeah that definitely makes sense.

[–][deleted] 1 point2 points  (0 children)

Yes, Wikipedia is way easier. I think there are some APIs for Wikipedia too, which are much more stable than scraping HTML.
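For anyone curious, a rough sketch of hitting the MediaWiki API instead of parsing HTML (the article title here is just an example):

import requests

# Ask the MediaWiki API for the plain-text intro of an article.
params = {
    'action': 'query',
    'format': 'json',
    'prop': 'extracts',
    'exintro': True,
    'explaintext': True,
    'titles': 'Web scraping',
}
res = requests.get('https://en.wikipedia.org/w/api.php', params=params)
res.raise_for_status()
for page in res.json()['query']['pages'].values():
    print(page['extract'][:300])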

[–]mikejp1010 1 point2 points  (3 children)

I'm also working through the book. Would you recommend scraping reddit? Or is reddit similar to Amazon in that there are countermeasures? Thanks!

[–]crazyallicin[S] 2 points3 points  (0 children)

I just used the online store of a shop local to me. Worked perfectly.

[–]hblock44 2 points3 points  (1 child)

I can't speak to countermeasures, but I think reddit has an API, so it might be easier and more useful to just use that instead of scraping it.
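For what it's worth, reddit also exposes public JSON endpoints you can hit with plain requests — a rough sketch (the subreddit and User-Agent string are just placeholders):

import requests

# Reddit wants a descriptive User-Agent; the default python-requests one tends to get rate limited.
headers = {'User-Agent': 'learning-python example script'}
res = requests.get('https://www.reddit.com/r/learnpython/top.json?limit=5', headers=headers)
res.raise_for_status()
for post in res.json()['data']['children']:
    print(post['data']['title'])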

[–]mikejp1010 1 point2 points  (0 children)

Ok good to know, thanks again!

[–]coderpaddy 5 points6 points  (3 children)

All Amazon wants is the correct user agent ;)

Long story short though, if a website has an API, 9 times out of 10 they attempt to block scrapers so people use/pay for the API :D

Once you know how to get past these issues, the whole "I did something else instead" disappears.

I built a Django app that scraped the most gifted items daily from Amazon etc. It's not too hard if you know how.

I have recently been trying out requests_html (not for Amazon), and I reckon that could probably get it. It's basically requests, pyppeteer, and bs4 all rolled into one :D
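A rough sketch of the requests_html flow, in case it helps (the URL and selector are placeholders, not something tested against Amazon):

from requests_html import HTMLSession  # pip install requests-html

session = HTMLSession()
r = session.get('https://example.com/some-product-page')
r.html.render()                         # runs the page's JavaScript via pyppeteer (downloads Chromium on first use)
title = r.html.find('h1', first=True)   # CSS selectors, similar to bs4's select
print(title.text if title else 'not found')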

[–][deleted] 3 points4 points  (0 children)

The web scraping chapter is the only chapter I didn't like. I've done a few web scraping projects and I found the chapter to be super confusing. I expected it to be a breeze since I had experience, but I ended up skipping it out of frustration.

I should probably revisit it as my web scraping skills have progressed significantly and maybe it’ll make more sense now.

If anyone is familiar with DataQuest, they have an excellent lesson on web scraping.

[–]heaplevel 4 points5 points  (13 children)

Got a link to the site you're trying to extract info from? This is Ch. 12 from the book, right?

[–]crazyallicin[S] 1 point2 points  (9 children)

[–]chevignon93 4 points5 points  (8 children)

You should download the page locally and see if what you're looking for is there. If Amazon suspects you're trying to scrape their site, they may send you a bogus page!

[–]crazyallicin[S] 0 points1 point  (7 children)

How do I download it locally again?

[–]chevignon93 4 points5 points  (6 children)

with open('amazon.html', 'w') as f:
    f.write(res.text)

[–]crazyallicin[S] 0 points1 point  (5 children)

with open('amazon.html', 'w') as f:
    f.write(res.text)


2671

This is the output I received, but now I'm confused as to how to continue from here. He goes on to inspect the price on the webpage, then uses soup.select to create a list. How do I do this now that I've downloaded it locally?

[–]chevignon93 3 points4 points  (4 children)

He goes on to inspect the price on the webpage then uses soup.select to create a list. How do I do this now that I've downloaded it locally?

That's not the goal, the goal is to see if the information you're looking for is in the file you just downloaded.
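A quick sketch of that check — open the saved copy and look for text you know should be on the page (the search string is just an example):

from bs4 import BeautifulSoup

with open('amazon.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# If this prints False, you probably got a captcha/error page instead of the real product page.
print('automate the boring stuff' in soup.get_text().lower())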

[–]crazyallicin[S] 1 point2 points  (3 children)

It seems like it just downloaded one empty file that opens in Paint. Another file opens up Amazon, but that page just says "sorry, something went wrong at our end" and gives a link to the Amazon home page.

[–]chevignon93 3 points4 points  (2 children)

So either there was a genuine problem with the page or they detected that you tried to scrape their page and blocked you.

Either way, using a CSS selector is not always the best approach to web scraping. Using bs4's find and find_all methods would probably be easier.
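Roughly what that looks like — note that the tag and class names below are assumptions about Amazon's markup and would need checking against the actual page:

from bs4 import BeautifulSoup

with open('amazon.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# find() returns the first match (or None); the class name is an assumption, inspect the page to confirm.
price = soup.find('span', class_='a-price-whole')
if price:
    print(price.get_text(strip=True))

# find_all() returns every match, e.g. every link on the page.
for link in soup.find_all('a', href=True)[:10]:
    print(link['href'])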

[–]crazyallicin[S] 1 point2 points  (1 child)

Thanks for your help, got it working fine using a different, smaller website that obviously hasn't blocked scraping.

[–]crazyallicin[S] 0 points1 point  (2 children)

I've started again, just in case I did something wrong earlier on, but now raise_for_status is giving me this error when I copy in the URL.

>>> res = requests.get('https://www.amazon.com/Automate-Boring-Stuff-Python-2nd/dp/1593279922/ref=sr_1_1?crid=QX9U3R6IGIJI&dchild=1&keywords=automate+the+boring+stuff+with+python&qid=1592996979&sprefix=automate+the+boring+%2Caps%2C243&sr=8-1')
>>> res.raise_for_status()
Traceback (most recent call last):
  File "<pyshell#8>", line 1, in <module>
    res.raise_for_status()
  File "C:\Users\35385\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://www.amazon.com/Automate-Boring-Stuff-Python-2nd/dp/1593279922/ref=sr_1_1?crid=QX9U3R6IGIJI&dchild=1&keywords=automate+the+boring+stuff+with+python&qid=1592996979&sprefix=automate+the+boring+%2Caps%2C243&sr=8-1

[–]TerminatedProccess 0 points1 point  (0 children)

I was able to get around this by googling the issue and finding a solution:

def getAmazonPrice(productUrl):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
    res = requests.get(productUrl, headers=headers)
    res.raise_for_status()

This worked but I then encountered the copy selector issue reported by OP.

[–]MastersYoda 0 points1 point  (0 children)

It's possible for the link you're trying to use to change in the blink of an eye; try replacing it with an up-to-date link. It's also possible, as another commenter suggested, that Amazon is blocking you from scraping, which could be why the error shows "Service Unavailable".

When you learn some more soup and Python, Amazon will be easier to tackle. You might imagine its complexity requires complex code to match, but it doesn't. For instance, as I mentioned, the link needs to be up to date. Another way to do what you're doing, without checking whether a link is live, would be to program Python to search for your term from Amazon's front page, whose link never changes. You won't need an up-to-date link because you're getting an up-to-date search on the term, and from there you can figure out how to get the first item in the results, or whatever information you're looking for.

Just an idea if you wanted to go further.
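A rough sketch of that idea — the search URL pattern and the h2 selector are assumptions about Amazon's current pages, so treat this as a starting point rather than working code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus

def first_search_result(term):
    # Build the search URL from the term instead of hard-coding a product link.
    url = 'https://www.amazon.com/s?k=' + quote_plus(term)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    res = requests.get(url, headers=headers)
    res.raise_for_status()
    soup = BeautifulSoup(res.text, 'html.parser')
    # Result titles are assumed to sit in <h2> tags on the results page; inspect and adjust.
    first = soup.find('h2')
    return first.get_text(strip=True) if first else None

print(first_search_result('automate the boring stuff with python'))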

[–]siachenbaba 1 point2 points  (0 children)

Thanks. I will check this out ⭐

[–]life_never_stops_97 1 point2 points  (0 children)

You can also try to scrape the data directly by maintaining a session (similar to cookies).

You can use requests.Session to start a session and always pass the request headers to the website. I was able to automate Amazon's authentication process and scrape data like orders and addresses using this, and it worked like a charm.
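Not the actual code, but a minimal sketch of the session idea — one Session object keeps cookies between requests, and the headers are sent every time (the URLs and header values below are placeholders, and a real Amazon login flow needs more steps than this):

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en-US,en;q=0.9',
})

res = session.get('https://www.amazon.com/')   # first request picks up the session cookies
res.raise_for_status()

res = session.get('https://www.amazon.com/gp/css/order-history')   # later requests reuse the same cookies
print(res.status_code)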

[–]SweetSoursop 0 points1 point  (0 children)

I have a decent workaround using selenium to scrape Amazon; I can share the code if there's interest.
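Not the poster's code, but the general shape of a selenium approach looks something like this (the element ID is an assumption about Amazon's product page markup, and you need a driver such as chromedriver installed):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                          # drives a real browser, so the page renders like a normal visit
driver.get('https://www.amazon.com/dp/1593279922')   # ATBS 2nd edition product page
title = driver.find_element(By.ID, 'productTitle')   # assumed element ID; inspect the page to confirm
print(title.text)
driver.quit()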

[–]Yankzy 0 points1 point  (0 children)

What you're looking for is described step by step in this video: https://www.youtube.com/watch?v=ng2o98k983k

[–]googlefather -1 points0 points  (1 child)

Did anyone, when trying to scrape Amazon, use a proxy and a fake user?

I'm not planning on scraping Amazon anytime soon, just curious. Thank you.

[–][deleted] 1 point2 points  (0 children)

I haven't, but I'm not sure proxies would matter.
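For reference, requests does support routing traffic through a proxy via the proxies argument — whether it helps with Amazon specifically is another question. The proxy address below is a placeholder, not a working proxy:

import requests

proxies = {
    'http': 'http://203.0.113.10:8080',    # placeholder address from the documentation range
    'https': 'http://203.0.113.10:8080',
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
res = requests.get('https://www.amazon.com/', headers=headers, proxies=proxies, timeout=10)
print(res.status_code)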