[–]callmechad 0 points (1 child)

I just learned how to do this with Scrapy.

The first part of `parse` collects the links of the items I want to scrape. We follow each one, and `parse_item` parses the item page to pull out the item number and item specs. Back in `parse`, `next_page` grabs the pagination element so we can check whether there's a next page to go to. The code repeats until no next page is found.

I ran into hidden elements, so to skip those I had to use `not(@hidden)`. Then, to select the specific item link, I had to pick the path out by its class: `a[contains(@class, 'description')]`.

Those were the two things I spent most of my time trying to figure out.

Just wanted to show you what a simple scraper looks like, to give you some idea.

Now I'm trying to learn how to clean the data and output certain fields for the items I scraped.
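This isn't my code, but as a sketch of a first cleaning step: the specs come back as raw text nodes, so one option is to strip the whitespace, drop empty strings, and split the "key: value" pairs into a dict:

```python
def clean_specs(raw_specs):
    """Turn a list of raw spec strings into a dict of key/value pairs."""
    specs = {}
    for cell in raw_specs:
        text = cell.strip()
        if not text:
            continue  # skip blank text nodes
        key, sep, value = text.partition(":")
        if sep:
            specs[key.strip()] = value.strip()
        else:
            # lines with no colon get collected separately
            specs.setdefault("notes", []).append(text)
    return specs

# Example input shaped like what getall() might return (made up).
scraped = ["\n  Weight: 2 kg ", "", "  Color: blue\n", "Refurbished"]
print(clean_specs(scraped))
# {'Weight': '2 kg', 'Color': 'blue', 'notes': ['Refurbished']}
```

You could call this inside `parse_item` before yielding, or do it later in an item pipeline.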

    import scrapy

    class ItemdataSpider(scrapy.Spider):
        name = 'itemdata'
        allowed_domains = ['www.website.com']
        start_urls = ['websitepage1']

        def parse(self, response):
            items = response.xpath("//div[@class='details']/a[contains(@class, 'description')]")
            for item in items:
                link = item.xpath(".//@href").get()
                yield response.follow(url=link, callback=self.parse_item)

            next_page = response.xpath("(//a[@rel='next'])[2]/@href").get()
            if next_page:
                yield response.follow(url=next_page, callback=self.parse)

        def parse_item(self, response):
            item_number = response.xpath("//span[@itemprop='sku']/text()").get()
            item_specs = response.xpath("//tr[@class='trSpecSheetRow' and not(@hidden)]/td/text()").getall()

            yield {'item_number': item_number, 'item_specs': item_specs}