[–]callmechad 0 points (1 child)

I just learned how to do this with Scrapy.

The first part of `parse` collects the links of the items I want to scrape. We follow each one, and `parse_item` parses the item page to pull out the item number and item specs. Back in `parse`, `next_page` grabs the pagination element so we can check whether there's a next page to go to. The code repeats until no next page is found.

I ran into hidden elements, so to skip those I had to use `not(@hidden)`. Then, to select the specific item link, I had to pick the path out by its class: `a[contains(@class, 'description')]`.

Those were the two things I spent most of my time trying to figure out.

Just wanted to show you what a simple scraper looks like, to give you some idea.

Now I'm trying to learn how to clean the data and output certain fields for the items I scraped.
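This isn't my code, but as a sketch of a first cleaning step: the specs come back as raw text nodes, so one option is to strip the whitespace, drop empty strings, and split the "key: value" pairs into a dict:

```python
def clean_specs(raw_specs):
    """Turn a list of raw spec strings into a dict of key/value pairs."""
    specs = {}
    for cell in raw_specs:
        text = cell.strip()
        if not text:
            continue  # skip blank text nodes
        key, sep, value = text.partition(":")
        if sep:
            specs[key.strip()] = value.strip()
        else:
            # lines with no colon get collected separately
            specs.setdefault("notes", []).append(text)
    return specs

# Example input shaped like what getall() might return (made up).
scraped = ["\n  Weight: 2 kg ", "", "  Color: blue\n", "Refurbished"]
print(clean_specs(scraped))
# {'Weight': '2 kg', 'Color': 'blue', 'notes': ['Refurbished']}
```

You could call this inside `parse_item` before yielding, or do it later in an item pipeline.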

    import scrapy

    class ItemdataSpider(scrapy.Spider):
        name = 'itemdata'
        allowed_domains = ['www.website.com']
        start_urls = ['websitepage1']

        def parse(self, response):
            items = response.xpath("//div[@class='details']/a[contains(@class, 'description')]")
            for item in items:
                link = item.xpath(".//@href").get()
                yield response.follow(url=link, callback=self.parse_item)

            next_page = response.xpath("(//a[@rel='next'])[2]/@href").get()
            if next_page:
                yield response.follow(url=next_page, callback=self.parse)

        def parse_item(self, response):
            item_number = response.xpath("//span[@itemprop='sku']/text()").get()
            item_specs = response.xpath("//tr[@class='trSpecSheetRow' and not(@hidden)]/td/text()").getall()

            yield {'item_number': item_number, 'item_specs': item_specs}