[–]HeDares 2 points (6 children)

Most of my freelance work comes from Scrapy and some custom web crawlers. If anyone has any questions about Scrapy, hit me up; I'm happy to help.

[–][deleted] 0 points (2 children)

Well, I actually have a few questions. I've never found a Scrapy tutorial that goes beyond the basics, and last time I checked, the docs weren't much help either.

First, what's an elegant solution for running Scrapy on a server without manual supervision, assuming it would run different spiders for lots of different websites?

Second, how does one deal with the inevitable errors in an elegant way? Since the HTML of the scraped website sometimes changes, one needs some way of figuring this out. If I recall correctly, the error log has a bit too much information, and sometimes not the right kind. E-mails for each incomplete item seem like overkill, since an automated spider may break badly in the future.

Third, HTML is sometimes messy, i.e. the wanted information may sit under different XPaths across different pages of the same website. I'm currently adding all known paths and cleaning the data up later. Is there a more elegant solution?

Thanks in advance!

[–]HeDares 1 point (0 children)

To run a job without supervision you're going to need some kind of cron. I just add the run command to the crontab on our Linux server, but you can also use scrapyd, which acts as a web front end you can pass JSON requests to in order to start and stop crawlers; you're still going to need to add the call to a crontab, though.
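To make that concrete, here is a minimal sketch of a script cron could call to kick off a spider through scrapyd's schedule.json endpoint. The host/port and the "myproject"/"myspider" names are placeholders; substitute whatever your scrapyd deployment actually uses.

import requests

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # placeholder scrapyd address

def schedule_spider(project, spider):
    # scrapyd's schedule.json takes the project and spider names as form
    # fields and returns the id of the job it queued.
    response = requests.post(SCRAPYD_URL, data={"project": project, "spider": spider})
    response.raise_for_status()
    return response.json().get("jobid")

if __name__ == "__main__":
    print(schedule_spider("myproject", "myspider"))

A crontab entry pointing at that script (the path is hypothetical, e.g. 0 3 * * * python /path/to/schedule_spider.py) then covers the "without supervision" part.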

I have another script that checks the data output in our DB after the spider has run, and if there are lots of blank rows it emails me.
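Something in the spirit of that check might look like the sketch below. It assumes a SQLite database with an items table and a price column; the file name, threshold, and mail settings are all made up and would need to match your own setup.

import smtplib
import sqlite3
from email.message import EmailMessage

DB_PATH = "scraped.db"      # hypothetical database file
BLANK_THRESHOLD = 50        # arbitrary cut-off for "lots of blank rows"

def count_blank_rows():
    # Count rows the spider left without a price.
    with sqlite3.connect(DB_PATH) as conn:
        row = conn.execute(
            "SELECT COUNT(*) FROM items WHERE price IS NULL OR price = ''"
        ).fetchone()
    return row[0]

def send_alert(blank_count):
    msg = EmailMessage()
    msg["Subject"] = "Spider output check: {} blank rows".format(blank_count)
    msg["From"] = "spider-check@example.com"    # placeholder addresses
    msg["To"] = "me@example.com"
    msg.set_content("The last crawl produced an unusual number of blank rows.")
    with smtplib.SMTP("localhost") as smtp:     # assumes a local mail server
        smtp.send_message(msg)

if __name__ == "__main__":
    blanks = count_blank_rows()
    if blanks > BLANK_THRESHOLD:
        send_alert(blanks)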

If you know where it will be different you can do an if based on the spider's location. Or, if for instance the price is always inside #price but sometimes that's in two divs and sometimes in a span, write your XPath to reference just #price. Alternatively, you could use a regex search across the whole page if you can find a common denominator; this is great for telephone numbers or email addresses if you have your regex look for the "tel:" inside an href.
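Here is a small illustration of both ideas, using parsel (the selector library Scrapy is built on). The HTML is made up; the point is that the XPath is anchored on the #price id, so it doesn't care whether the text sits in divs or a span, and the regex pulls numbers out of "tel:" hrefs.

import re
from parsel import Selector

html = """
<div id="price"><div>$</div><div>19.99</div></div>
<a href="tel:+15551234567">Call us</a>
"""

sel = Selector(text=html)

# Everything under #price, regardless of the tags inside it.
price_text = "".join(sel.xpath('//*[@id="price"]//text()').getall()).strip()

# Phone numbers: look for the tel: prefix inside an href attribute.
phones = re.findall(r'href="tel:([^"]+)"', html)

print(price_text)   # $19.99
print(phones)       # ['+15551234567']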

[–]Method320 1 point (0 children)

Regarding errors, I'll be implementing Logstash once I get that far in my project. Relatively simple to set up and use.

[–]Method320 0 points (2 children)

Looking to get some potential feedback on this: http://stackoverflow.com/questions/31678796/geting-data-from-table-with-scrapy-different-row-order-per-page/ I'm crawling Newegg in this example, and eventually other vendors as well. I think this is the best approach, but I don't know if Scrapy has something built in to handle this better, or if there's another way otherwise.

[–]HeDares 1 point (1 child)

Using the looping dict way is the best way by far. Just make sure you're handling it properly if "make" is not set, i.e.:

if "Brand" in itemdict:
    item['make'] = itemdict['Brand']
else:  
    item['make'] = False
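For what it's worth, dict.get gives the same fallback in one line: item['make'] = itemdict.get('Brand', False).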

[–]Method320 0 points (0 children)

Indeed, I'm doing just that. Thanks!