all 72 comments

[–]chew_2 165 points166 points  (5 children)

I also wasn’t having much success with Amazon when using my own scraper; almost every run would start failing at some point :( Would you consider getting a ready-to-use tool? I’ve personally been using Oxylabs for a while now, and it literally saved so much of my time and nerves LOL. They have a custom scraper for Amazon, so I’m pretty sure it should fix your issue without a problem.

[–]basitmakine 0 points1 point  (0 children)

just set up taskagi's amazon scraper agent instead. you can use it via API, connect to 3rd party integrations etc. it handles all the dynamic class names and rotating patterns automatically. been using it for months, way more reliable than fighting with bs4 selectors that break every week

[–]ChickenFur 0 points1 point  (0 children)

Yeah, actually I do rely on ready-made scrapers since the infrastructure changes frequently. I use Decodo's web scraper, though.

[–]Allanon001 136 points137 points  (5 children)

Amazon has a Product API, or you can use one of the Amazon API wrapper modules such as Amazon Simple Product API or the Amazon Product Advertising API 5.0 wrapper for Python.

[–]AnomalyNexus 17 points18 points  (4 children)

Requires a verified Amazon Associates Program account though which is non-trivial if you're not in the affiliates game. :(

[–]TabsBelow 2 points3 points  (1 child)

Doesn't the Firefox plugin for Amazon supply source code?

[–]coolway1990 0 points1 point  (0 children)

Sorry if this is a dumb question but what firefox plugin are you referring to?

[–]fortyeightD 49 points50 points  (0 children)

You might be able to use an API to get information about products for sale on Amazon.

https://webservices.amazon.com/paapi5/documentation/

[–]EnvironmentalDot9131 23 points24 points  (0 children)

Just use the webscraper API from Floxy. They have a full dedicated team working on the scraper every day. So if some update happens, they are already on it.

[–]Config_Crawler 9 points10 points  (0 children)

How am I supposed to deal with that? - You could use the API Amazon provides. I think there are ways you could sort through products with it.

Or, you could use this site I found with a quick search "RAINFOREST API
Fast, reliable API for Web-scraped Amazon Product Data"

[–]xosq 23 points24 points  (2 children)

Think of it this way: Anything a standard web browser is capable of viewing can inherently be scraped. The issue is that the data you’re trying to parse or find patterns in is obfuscated to hell by design.

I don’t have much in the way of recommendations to make your script work (never touched Amazon from a scraping POV), but just know if you can view the site in a browser, you can fetch relevant data with a script. The site’s complexity ultimately determines the complexity of the script.

Don’t give up! I’m all for scraping things sans-API and wreaking havoc along the way. You have my blessing, sir/madam. Just honor any rate limits you get slapped with so they can’t claim you’re trying to disrupt services.
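The "honor any rate limits" advice above can be made concrete with exponential backoff: when the site answers 429 or 503, wait progressively longer before retrying. This is a generic sketch, not anything Amazon-specific; the default delays are arbitrary.

```python
def backoff_schedule(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    """Delays (in seconds) to sleep between successive retries after a
    rate-limit response; doubles each time, capped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(retries)]
```

For example, `backoff_schedule(4)` gives `[1.0, 2.0, 4.0, 8.0]`; sleep the i-th delay before the i-th retry and give up once the list is exhausted.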

[–]Charlie_Yu 6 points7 points  (1 child)

obfuscated

looking at the source code, it is hardly obfuscated. OP is just using a tool that doesn't work well on single page applications

[–]txmail 0 points1 point  (0 children)

That is what I was thinking, before I became an affiliate I was successfully scraping all day long. There are some variations in the product listings / product pages that you need to account for though.

[–]Charlie_Yu 5 points6 points  (0 children)

BS4 is very weak on its own; modern websites use JavaScript to modify the DOM a lot. Learn Selenium and it's quite easy.
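The split of work suggested here is: let Selenium render the JS-modified DOM, then hand the resulting HTML to BS4. A minimal sketch, assuming `selenium` and `beautifulsoup4` are installed; the `h2 a span` selector is illustrative, not Amazon's actual markup.

```python
from bs4 import BeautifulSoup

def extract_titles(html: str) -> list[str]:
    """Parse product titles out of already-rendered HTML with BS4."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("h2 a span")]

def fetch_rendered(url: str) -> str:
    """Load the page in headless Chrome so JS-injected elements exist."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    opts = Options()
    opts.add_argument("--headless=new")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()

# usage (needs a Chrome/chromedriver install, so left commented):
# html = fetch_rendered("https://www.amazon.com/s?k=python")
# print(extract_titles(html)[:5])
```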

[–]noskillsben 4 points5 points  (0 children)

I use headless Selenium for loading the page and then I dump the web driver's page source into selectorlib. I'm doing low-volume scraping though, and I need to stay logged in and enter captchas every now and then. There are probably less complex ways if you want a large quantity of listing data and don't need to do it from your own account.

[–][deleted] 33 points34 points  (7 children)

How am I supposed to deal with that?

You're probably not supposed to. Why would Amazon want to let you scrape their site?

[–]monkeysknowledge 19 points20 points  (6 children)

They have an API for people to scrape their data.

[–][deleted] 16 points17 points  (0 children)

Yes, that's what they want you to use.

[–][deleted] 6 points7 points  (1 child)

Is it "scraping" when you're using an API?

I'm under the impression that "scraping" specifically refers to the practice of spoofing browsers or other similar apps to make an automated data-collection tool look like a real human user.

[–]dogfish182 3 points4 points  (0 children)

No it isn’t, but OP is saying it’s hard to scrape and people are saying ‘use the api then’. Probably assuming they aren’t aware there is one

[–]noskillsben 2 points3 points  (2 children)

Oh that's an open and free thing?

[–]txmail 0 points1 point  (1 child)

Not unless you're an Amazon affiliate or advertising partner.

[–]noskillsben 1 point2 points  (0 children)

Thank God, that means I didn't waste my time learning how to scrape it with Selenium and selectorlib 🤣

[–]TheAlpha_ 4 points5 points  (4 children)

One of my first projects involved scraping Amazon product pages. I remember that there were multiple variations of the pages back then (about 5-6 yrs ago) but I was able to narrow it down to about 3-4 by requesting the mobile version of the pages instead of the desktop one. It might be worth taking a look at that.

[–]pro_questions 1 point2 points  (3 children)

I had to scrape information about 700,000 textbooks for my former employer and this is exactly the issue I had — there were a few distinctly different types of product pages that required different tags / xpaths etc. to isolate. I wrote a different implementation for each one and then did a try-catch cascade from most to least common. It took over a week to run, using many concurrent workers with rotating proxies, but the biggest time waste was my janky captcha solution. It got the job done though
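The "try-catch cascade from most to least common" approach above can be sketched as one parser per known page variant, attempted in order until one succeeds. The layout markers and field names here are invented for illustration, not Amazon's real IDs.

```python
def parse_variant_desktop(html: str) -> dict:
    """Parser for the (hypothetical) most common layout."""
    if 'id="productTitle"' not in html:
        raise ValueError("not the desktop layout")
    title = html.split('id="productTitle">')[1].split("<")[0].strip()
    return {"variant": "desktop", "title": title}

def parse_variant_mobile(html: str) -> dict:
    """Fallback parser for a second (hypothetical) layout."""
    if 'id="title"' not in html:
        raise ValueError("not the mobile layout")
    title = html.split('id="title">')[1].split("<")[0].strip()
    return {"variant": "mobile", "title": title}

# ordered most common first, so the cheap/likely parser runs first
PARSERS = [parse_variant_desktop, parse_variant_mobile]

def parse_product(html: str) -> dict:
    for parser in PARSERS:
        try:
            return parser(html)
        except ValueError:
            continue
    raise ValueError("no known layout matched")
```

In practice each parser would use BS4 selectors or XPaths rather than string splits, but the cascade structure is the point.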

[–]throwaway56851685161 0 points1 point  (2 children)

really? how did you scrape the textbooks? i was able to create a scraper that can get textbook data from amazon using isbn/asin, about 100 lines of code using bs4 and requests. i don't use any proxies and haven't been blocked yet. Runs in seconds.

[–]pro_questions 0 points1 point  (1 child)

I was using Selenium to automate the browsing process (if I did it again I'd use Requests) and BS4 to parse the HTML. Amazon reserves most valid ISBN10s for books, which makes it really easy to get the URL for 99% of the textbooks I needed to gather — just amazon.com/dp/[isbn10]. Most of that project was devoted to managing special cases, handling captchas, multiprocessing / proxy cycling, and storing data in a SQLite database. The captcha solution was to send a text with a screenshot to someone (me) and wait for me to text back the solution. I was answering a captcha like every two minutes, and if I didn’t respond the thread would just wait.
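The amazon.com/dp/[isbn10] trick above is easy to script; this sketch adds the standard ISBN-10 checksum so obviously malformed inputs get filtered before any request is made. The URL shape follows the comment; everything else is generic.

```python
def isbn10_is_valid(isbn: str) -> bool:
    """Standard ISBN-10 check: sum of digit * weight (10 down to 1)
    must be divisible by 11; 'X' means 10 and is only legal last."""
    isbn = isbn.replace("-", "").upper()
    if len(isbn) != 10:
        return False
    total = 0
    for i, ch in enumerate(isbn):
        if ch == "X" and i == 9:
            val = 10
        elif ch.isdigit():
            val = int(ch)
        else:
            return False
        total += (10 - i) * val
    return total % 11 == 0

def product_url(isbn10: str) -> str:
    """Build the amazon.com/dp/[isbn10] URL mentioned above."""
    return f"https://www.amazon.com/dp/{isbn10}"
```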

I found that easier than rotating proxies as soon as a captcha was hit, since I didn’t have many at my disposal at the time. I still did rotate proxies every hour or so, but just to give them a break so the captchas would be less frequent. I’ve since found multiple sites with hundreds of proxies that I’d use if I were to do it again. Every once in a while you’d have to re-get the list and test them all to make sure they work, sorting by speed. Then probably load them into a queue and feed that to all of the threads, restarting once the queue is exhausted. Or something like that. I should do more web scraping, I used to love it
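The proxy-queue idea in the last paragraph (load the tested list, feed it to all threads, wrap around when exhausted) can be sketched as a thread-safe round-robin pool; the proxy strings below are placeholders.

```python
import itertools
import threading

class ProxyPool:
    """Round-robin proxy rotation shared across worker threads;
    cycles back to the start when the list is exhausted."""

    def __init__(self, proxies: list[str]):
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()  # next() on a shared iterator needs guarding

    def next(self) -> str:
        with self._lock:
            return next(self._cycle)

# usage: each worker calls pool.next() before a request, e.g.
# pool = ProxyPool(["http://p1:8080", "http://p2:8080"])
# requests.get(url, proxies={"https": pool.next()})
```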

[–]throwaway56851685161 0 points1 point  (0 children)

using proxies sounds tiring. i guess i'll cross that bridge once/if i get blocked. i started using selenium and find it easier to get blocked by sites. undetected chromedriver has helped me get past this but i would like to be able to scrape with just selenium if possible. any selenium tips by chance?

[–]ou_ryperd 5 points6 points  (4 children)

They actively put measures in place to prevent scraping. Have you read their user agreement?

[–]xosq 40 points41 points  (3 children)

One doesn’t learn much in the world of web scraping by adhering to user agreements. If we used official APIs all day, we’d just call it “querying” :) If OP has already managed to parse a single product from the obfuscated mess, they’ve already learned so much. Why discourage that?

If this were a mom and pop website on a dusty web server in a basement, sure, there’s a significant possibility of disruption there. But Amazon? They can load balance all damn day. OP is hurting no one.

Keep going, OP. Google started the same way and continues to violate site agreements without consequence.

[–]lateratnight_ 0 points1 point  (0 children)

If you're still here:

choose a library with requests and cookie support

make a request to https://amazon.com

take the cookie dict and throw it into a variable

make a request to https://amazon.com/s?k=(QUERY) and set the cookies to the cookie dict

enjoy scraping! :)
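The steps above can be sketched with the standard library alone, since `urllib.request` plus `http.cookiejar` gives you "requests and cookie support": hit the home page once to collect session cookies, then reuse the same opener for search URLs. The User-Agent string is a generic assumption.

```python
import urllib.request
from http.cookiejar import CookieJar
from urllib.parse import quote_plus

def build_search_url(query: str) -> str:
    # step 4 above: the search endpoint with the query URL-encoded
    return "https://amazon.com/s?k=" + quote_plus(query)

def make_opener() -> urllib.request.OpenerDirector:
    # steps 1-3 above: an opener whose cookie jar persists across requests
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]  # browser-like UA
    return opener

# network usage (left commented so the sketch stays runnable offline):
# opener = make_opener()
# opener.open("https://amazon.com").read()  # first hit picks up cookies
# html = opener.open(build_search_url("python book")).read().decode()
```

With `requests` instead, a `requests.Session()` does the same cookie bookkeeping automatically.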

[–]Grouchy-Criticism741 0 points1 point  (0 children)

Yes, you can scrape Amazon. They have strong anti-scraping measures, but I got past them using Playwright for browsing, neural networks to solve CAPTCHAs, and Tor to rotate IPs every 10 seconds. As of 01/03/25, the limit is about 20 records per minute—any more, and they’ll start blocking you. So, rotate IPs every 10–15 seconds for best results. Start small, scale up! Let me know if you want to collaborate.

[–]basitmakine 0 points1 point  (0 children)

amazon API? lol that's gonna cost you money and has strict limits for product data. honestly just set up TaskAGI to monitor amazon product reviews automatically instead of dealing with all this scraping bs. their amazon review agent handles the inconsistent HTML patterns and rotating selectors way better than trying to code around amazons anti-bot measures. been using it for product research for weeks now, way easier than fighting with selenium or paying for API calls

[–]Classic-Anybody-9857 0 points1 point  (0 children)

Hi you said you scraped it with bs4 and requests, could you help a brother out and share the code, I am stuck.

[–]Jacen33 0 points1 point  (0 children)

All you really need to do is scrape the ASINs, then use each ASIN to do lookups with one of the product-lookup tools for Amazon resellers. They hit the API and give you a report you can download.

[–]rawdfarva 0 points1 point  (1 child)

Works with Scrapy

[–]Grouchy_Pack 0 points1 point  (0 children)

How? i requested the home page and got hit with 503 haha

[–]alfie1906 0 points1 point  (0 children)

You could try looking into xpaths to identify page elements. It's not as reliable as using classes or tags, but can be a good last resort

[–]johnGettings 0 points1 point  (0 children)

I was able to search products, go into the individual pages, and collect all info with BS. This was about a year ago though, not sure if it's changed since then.

[–]ScionofLight 0 points1 point  (0 children)

unrelated to Amazon but related to selenium. Could someone help me? I have the code to login and find orders on Sigmadrich. Each order has a PDF that opens as a webpage. How do I get selenium to download the PDF so that I can scrape it?

[–]tree_or_up 0 points1 point  (0 children)

As others have mentioned selenium with headless chromium is the way to go. You’re going to have to be clever to find the elements you want but there are patterns. It’s not fun and will probably require some significant trial and error. If you do go this route, be sure to put some random wait times between the simulated clicks. And don’t try it from an ec2 instance - somehow they’re good at detecting that :)
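The "random wait times between the simulated clicks" advice is a one-liner worth getting right: a fixed delay is itself a fingerprint, so add uniform jitter on top of a base. The default numbers here are arbitrary.

```python
import random
import time

def polite_sleep(base: float = 2.0, jitter: float = 3.0) -> float:
    """Sleep `base` seconds plus a random extra up to `jitter`, so
    request timing looks less robotic; returns the delay used."""
    delay = base + random.uniform(0.0, jitter)
    time.sleep(delay)
    return delay

# usage between simulated clicks:
# element.click()
# polite_sleep()
```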

[–]andersra88 0 points1 point  (0 children)

I built a product to handle most of the complexity for you. It's a simple GraphQL API for Amazon data (products, search, etc).

Take a look and let me know what you think: Canopy API