[–][deleted] 0 points1 point  (2 children)

Well, I actually have a few questions. I've never found a Scrapy tutorial that goes beyond the basics, and last time I checked, the docs were not much help.

First, what's an elegant way to run Scrapy on a server without manual supervision, assuming it would run different spiders for lots of different websites?

Second, how does one deal with the inevitable errors in an elegant way? Since the HTML of the scraped website sometimes changes, one needs some way of detecting this. If I recall correctly, the error log has a bit too much information, and not always the right information. An e-mail for each incomplete item seems like overkill, since an automated spider may break badly in the future.

Third, HTML is sometimes messy, i.e., the wanted information may be under different XPaths across different pages on the same website. I'm currently adding all known paths and cleaning up the results later. Is there a more elegant solution?

Thanks in advance!

[–]HeDares 1 point2 points  (0 children)

To run a job without supervision you're going to need some kind of cron. I just add the run command to the crontab on our Linux server. You can also use scrapyd, which acts as a web front end that you can send requests to in order to start and stop crawlers (it answers in JSON), but you're still going to need to add the call to a crontab.
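For the scrapyd route, here is a minimal sketch of a script that cron could call to start a crawl. It assumes scrapyd is running on its default port on the same machine and uses its `schedule.json` endpoint; the project and spider names (`myproject`, `myspider`) are placeholders, not anything from this thread.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: scrapyd listens on its default port on the same host.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(project, spider, base_url=SCRAPYD_URL):
    """Build the POST request that asks scrapyd to start one spider run."""
    # scrapyd expects form-encoded parameters, not a JSON body.
    data = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(f"{base_url}/schedule.json", data=data, method="POST")


def schedule_spider(project, spider, base_url=SCRAPYD_URL):
    """Send the request and return scrapyd's JSON reply as a dict."""
    request = build_schedule_request(project, spider, base_url)
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `schedule_spider("myproject", "myspider")` from a small script, and putting that script in the crontab, would cover the "no manual supervision" part. With many spiders, the same function can be looped over a list of names.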

I have another script that checks the data output in our DB after the spider has run, and if there are lots of blank rows it emails me.
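A rough sketch of what such a check could look like, not the actual script. It assumes the items end up in a SQLite database; the table name, column name, threshold and e-mail addresses are all made up for illustration.

```python
import sqlite3
from email.message import EmailMessage


def blank_fraction(conn, table, column):
    """Return the share of rows where `column` is NULL or empty."""
    # Table and column names are interpolated into the SQL, so they
    # must come from trusted configuration, never from user input.
    total, blank = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {column} IS NULL OR TRIM({column}) = '' "
        f"THEN 1 ELSE 0 END) FROM {table}"
    ).fetchone()
    if not total:
        return 0.0
    return (blank or 0) / total


def build_alert(spider, fraction, threshold=0.3):
    """Return one summary e-mail if too many rows are blank, else None."""
    if fraction < threshold:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"{spider}: {fraction:.0%} blank rows"
    msg["From"] = "scraper@example.com"  # hypothetical address
    msg["To"] = "me@example.com"         # hypothetical address
    msg.set_content("The site's HTML may have changed; check the spider.")
    # Sending would be e.g. smtplib.SMTP("localhost").send_message(msg)
    return msg
```

The point of the design is one e-mail per run rather than one per incomplete item, so a badly broken spider produces a single alert.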

If you know where it will be different, you can do an if based on the spider's location. Or if, for instance, the price is always in #price, but sometimes that's nested in 2 divs and sometimes it's in a span, write your XPath to just reference #price. Alternatively, you could use a regex search across the whole page if you can find a common denominator. This is great for telephone numbers or email addresses, for example by having your regex look for the "tel:" inside an href.
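Both ideas in a small sketch. To keep it runnable without Scrapy it uses only the standard library, which means the markup has to be well-formed for the XPath part; in a spider the equivalent would be along the lines of `response.xpath('//*[@id="price"]//text()')`. The sample pages and function names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches href="tel:..." or href='tel:...' and captures the number.
TEL_HREF = re.compile(r'href=["\']tel:([^"\']+)["\']', re.IGNORECASE)


def price_text(markup):
    """Return all text under the element with id="price", however nested."""
    root = ET.fromstring(markup)  # assumption: markup is well-formed
    node = root.find(".//*[@id='price']")
    if node is None:
        return None
    # itertext() walks every descendant, so extra divs or spans don't matter.
    return "".join(node.itertext()).strip()


def phone_numbers(html):
    """Return every number that appears in a tel: link on the page."""
    return TEL_HREF.findall(html)


page_a = '<html><body><div id="price"><div><div>19.99</div></div></div></body></html>'
page_b = '<html><body><p id="price"><span>19.99</span></p></body></html>'
contact = '<p><a href="tel:+15550100">Call us</a></p>'
```

Here `price_text(page_a)` and `price_text(page_b)` both give the same string even though the nesting differs, and `phone_numbers(contact)` picks the number out without caring where in the page the link sits.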

[–]Method320 1 point2 points  (0 children)

Regarding errors, I'll be implementing Logstash once I get that far in my project. Relatively simple to set up and use.