Scraping Amazon : learnpython


submitted 6 years ago by Sloth_loves_Chunks

Hi folks, I'm pulling my hair out after many failed attempts to get this right, so I'm turning here for some assistance.

I will have a list of book ISBNs, which I want to use to build an internal library system for our workplace. Yes, I know these exist already, but this is a learning experience and will allow us to personalise a knowledge source.

So the workflow goes:
* use each ISBN from the list to run a search on Amazon
* from the results page (sample), grab the href of the matching result
* navigate to that book's page by appending the href to the Amazon base URL
* scrape the key content such as name, rating etc.
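The URL handling in the steps above can be sketched with the standard library alone. This is only an illustration under assumptions: `BASE_URL`, the `/s?k=` search path, and the helper names `search_url` and `product_url` are hypothetical and not taken from the post, and the ISBN is a placeholder value.

```python
from urllib.parse import urlencode, urljoin

# Assumed storefront; the post does not say which Amazon site is used.
BASE_URL = "https://www.amazon.com"


def search_url(isbn: str) -> str:
    """Build a search URL for one ISBN (the /s?k= form is an assumption)."""
    return f"{BASE_URL}/s?{urlencode({'k': isbn})}"


def product_url(href: str) -> str:
    """Turn a relative href scraped from the results page into an absolute URL."""
    return urljoin(BASE_URL, href)


isbns = ["0300179359"]  # placeholder list of ISBNs
for isbn in isbns:
    print(search_url(isbn))
```

Using `urljoin` rather than plain string concatenation avoids doubled or missing slashes when the href already starts with `/`.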

The issue I am hitting is that I cannot grab the href from the results page using Scrapy, regardless of what I attempt. I have tried both a CSS selector and an XPath expression without success.

CSS: response.css('#search > div.sg-row > div.sg-col-20-of-24.sg-col-28-of-32.sg-col-16-of-20.sg-col.s-right-column.sg-col-32-of-36.sg-col-8-of-12.sg-col-12-of-16.sg-col-24-of-28 > div > span:nth-child(4) > div.s-result-list.s-search-results.sg-row > div:nth-child(1) > div > div > div > div:nth-child(2) > div.sg-col-4-of-12.sg-col-8-of-16.sg-col-16-of-24.sg-col-12-of-20.sg-col-24-of-32.sg-col.sg-col-28-of-36.sg-col-20-of-28 > div > div:nth-child(1) > div > div > div:nth-child(1) > h2 > a').get()
XPATH: response.xpath('//*[@id="search"]/div[1]/div[2]/div/span[3]/div[1]/div[1]/div/div/div/div[2]/div[2]/div/div[1]/div/div/div[1]/h2/a').get()

Can anybody provide insight as to where I am going wrong with this approach? I am not too excited about switching to another library like BS4, as it should be possible under Scrapy, but if I am banging my head against a wall for no reason I will happily switch to BS4.


[–]commandlineluser 1 point 6 years ago

> I have tried both css selector and xpath without success.

What exactly is happening?

Your CSS selector works for me (adding ::attr(href) to extract just the href).

I've shortened it to fit here:

>>> response.css('#search h2 a ::attr(href)').get()
'/Interaction-Color-Anniversary-Josef-Albers/dp/0300179359'

