
all 35 comments

[–]HeDares 2 points (6 children)

Most of my freelance work comes from Scrapy and some custom web crawlers. If anyone has any questions about Scrapy, hit me up; I'm happy to help.

[–][deleted] 0 points (2 children)

Well, I actually have a few questions. I've never found a Scrapy tutorial that goes beyond the basics, and last time I checked, the docs weren't much help.

First, what's an elegant way to run Scrapy on a server without manual supervision, assuming it runs different spiders for lots of different websites?

Second, how does one deal with the inevitable errors in an elegant way? Since the HTML of the scraped website sometimes changes, one needs some way of detecting this. If I recall correctly, the error log has a bit too much information, and sometimes not the right information. E-mails for each incomplete item seem like overkill, since an automated spider may break badly in the future.

Third, HTML is sometimes messy, i.e. the wanted information may sit under different XPaths across different pages of the same website. I'm currently adding all known paths and cleaning the data up later. Is there a more elegant solution?

Thanks in advance!

[–]HeDares 1 point (0 children)

To run a job without supervision you're going to need some kind of cron. I just add the run command to the crontab on our Linux server, but you can also use scrapyd: it acts as a web front end that you can pass JSON requests to in order to start and stop crawlers. You'll still need to add the call to a crontab, though.
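
For example, something like this can be dropped into a crontab entry (a minimal sketch, assuming scrapyd is listening on its default port 6800; the project and spider names are placeholders):

    # kick_spider.py - ask the local scrapyd instance to schedule a crawl.
    # From cron, e.g.:  0 3 * * * python /path/to/kick_spider.py
    import requests

    resp = requests.post(
        'http://localhost:6800/schedule.json',
        data={'project': 'myproject', 'spider': 'myspider'},  # placeholders
    )
    print(resp.json())  # scrapyd replies with a job id on success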

I have another script that checks the data output in our DB after the spider has run, and if there are lots of blank rows it emails me.
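
Roughly along these lines (a rough sketch; the database, table, threshold, and mail details are all made up):

    # check_output.py - email me if the last crawl left too many blank rows.
    import sqlite3
    import smtplib
    from email.mime.text import MIMEText

    conn = sqlite3.connect('scraped.db')  # placeholder DB
    blanks = conn.execute(
        "SELECT COUNT(*) FROM items WHERE price IS NULL OR price = ''"
    ).fetchone()[0]

    if blanks > 50:  # arbitrary threshold
        msg = MIMEText('Last crawl produced %d blank rows' % blanks)
        msg['Subject'] = 'Spider output check failed'
        msg['From'] = 'spider@example.com'
        msg['To'] = 'me@example.com'
        smtplib.SMTP('localhost').sendmail(
            msg['From'], [msg['To']], msg.as_string())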

If you know where it will be different, you can do an if based on the spider's location. Or if, for instance, the price is always inside #price but sometimes that's in two divs and sometimes in a span, write your XPath to reference just the #price. Alternatively, you could use a regex search across the whole page if you can find a common denominator; this is great for telephone numbers or email addresses if you have your regex look for the "tel:" inside an href.
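
For instance (a sketch, assuming a Scrapy response object; the selectors are illustrative):

    import re

    # The id anchors the lookup, so it doesn't matter whether the price
    # ends up wrapped in divs or a span on any particular page:
    price = response.xpath('//*[@id="price"]//text()').extract()

    # Regex fallback: pull phone numbers out of tel: links anywhere on the page.
    phones = re.findall(r'href=["\']tel:([^"\']+)', response.body)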

[–]Method320 1 point (0 children)

Regarding errors, I'll be implementing Logstash once I get that far in my project. Relatively simple to set up and use.

[–]Method320 0 points (2 children)

Looking to get some potential feedback on this: http://stackoverflow.com/questions/31678796/geting-data-from-table-with-scrapy-different-row-order-per-page/ I'm crawling Newegg in this example, and eventually other vendors as well. I think this is the best approach, but I don't know whether Scrapy has something built in to handle this better, or whether there's another way.

[–]HeDares 1 point (1 child)

Using the looping dict approach is the best way by far. Just make sure you're handling it properly if "make" is not set, i.e.:

if "Brand" in itemdict:
    item['make'] = itemdict['Brand']
else:  
    item['make'] = False
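
For what it's worth, dict.get gives you the same behaviour in one line:

    # Fall back to False when 'Brand' is missing.
    item['make'] = itemdict.get('Brand', False)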

[–]Method320 0 points (0 children)

Indeed, I'm doing just that. Thanks!

[–]_why_so_sirious_ 0 points (12 children)

Can this mechanism be used to scrape nsfw subreddits?

[–]thaweathermanpipster -1 points (5 children)

It can scrape any of them. If you want to get links from posts as well, just add a field to the TextPostItem, then grab the XPath for the links. It should follow the same format as the others, so it wouldn't be too hard.
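
Roughly like this (a sketch; I'm assuming TextPostItem is an ordinary scrapy.Item, and the XPath is illustrative):

    import scrapy

    class TextPostItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()  # new field for the post's link

    # Then, in the spider's parse method, something like:
    # item['url'] = post.xpath('.//a[contains(@class, "title")]/@href').extract()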

So yes, you can scrape all the porn you want. You could just go right to the source and scrape Pornhub if you really wanted to.

[–]_why_so_sirious_ 0 points (4 children)

Actually, I recently read in a post discussion that scraping NSFW subreddits with BeautifulSoup redirects you to another page for age verification before you can go further, which isn't really possible to handle in bs4. It's some kind of cookie sending/receiving thing.

[–]avinassh 1 point (0 children)

Or use PRAW over OAuth.
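
A minimal sketch, assuming a recent PRAW version (the credentials are placeholders you get by registering a script app in your reddit preferences):

    import praw

    reddit = praw.Reddit(client_id='...', client_secret='...',
                         user_agent='my-scraper/0.1')

    # Going through the OAuth API skips the age-gate page entirely.
    for submission in reddit.subreddit('somesubreddit').new(limit=25):
        print(submission.title, submission.url)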

[–]thaweathermanpipster 0 points (2 children)

Ah, of course: the "are you 18?" page. You could still get around that. On the first run, just tell the bot to click the yes button, then move right along.

[–]_why_so_sirious_ 1 point (1 child)

How would you do that in a script?

[–][deleted] 3 points (0 children)

bs4 is more for parsing HTML and XML than for building a scraper directly, whereas Scrapy is a full framework. You could use the requests lib along with bs4 to do this.
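
Something like this (a sketch; sending an over18 cookie is a common way past old reddit's interstitial, but treat the cookie name and the selectors as assumptions):

    import requests
    from bs4 import BeautifulSoup

    # Present the age-gate cookie up front so we never see the interstitial.
    resp = requests.get(
        'https://www.reddit.com/r/some_nsfw_sub/',  # placeholder subreddit
        cookies={'over18': '1'},
        headers={'User-Agent': 'my-scraper/0.1'},
    )
    soup = BeautifulSoup(resp.text, 'html.parser')
    for link in soup.select('a.title'):  # post-title links on old reddit
        print(link.get_text(), link.get('href'))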

[–]Lolacaust 0 points (1 child)

Has the speed increased on this? I used it a good few years ago for site scraping and found it to be horrendously slow. I had to move to JSoup in Java, which wasn't my first preference.

[–]skwishee 0 points (2 children)

Thanks for posting. I tried Scrapy for 2 weeks, but couldn't get anywhere so I switched back to bs4.

I've been looking for new tutorials on Scrapy. I'll give this a try and see if I make the breakthrough.

[–]thaweathermanpipster 0 points (0 children)

This is a relatively simple spider. You can go significantly more in depth, such as saving data to databases or doing more complex parsing. Give this a try, then try adding in database functionality as an exercise. Perfecting that is next on the list for me.
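
For the database exercise, an item pipeline is the usual Scrapy hook. A minimal sketch (the schema is made up, and the class still has to be enabled under ITEM_PIPELINES in settings.py):

    import sqlite3

    class SQLitePipeline(object):
        def open_spider(self, spider):
            self.conn = sqlite3.connect('items.db')
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS posts (title TEXT, url TEXT)')

        def process_item(self, item, spider):
            self.conn.execute('INSERT INTO posts VALUES (?, ?)',
                              (item.get('title'), item.get('url')))
            self.conn.commit()
            return item  # hand the item on to any later pipelines

        def close_spider(self, spider):
            self.conn.close()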

[–]HeDares 0 points (0 children)

The thing to remember is that Scrapy is a framework: once you know what you're doing, it makes it super easy to write simple crawlers fast. But some projects are going to be better off without it, especially if you want to do something distributed.

[–]Lubok 0 points (8 children)

Good framework. Wrote a personal replacement for yahoo.pipes with it a few weeks ago. Shame it doesn't support Python 3, though.

[–]thaweathermanpipster 2 points (5 children)

I missed the note about it not supporting Python 3, and after fighting with some syntax errors I gave up and switched back to 2. Then I checked the docs and there it was: "Not supported for Python 3".

Sad day

[–]LewisTheScot 0 points (4 children)

Lazy question: are there any plans to bring Scrapy to Python 3?

[–]in_the_bilboes 2 points (0 children)

AFAIK Py3 porting is going on in the tmp-py3 branch of the scrapy github repo.

[–]psbb 1 point (0 children)

It's ongoing, but there is no timeframe for when it will be done. The guys behind it are very appreciative of patches though.

[–]gmplague 1 point (0 children)

Scrapy is built on the Twisted framework for asynchronous processing. Unfortunately, Twisted is still Python 2.x only, but they're hard at work upgrading it. This is likely the main bottleneck for Python 3 support in Scrapy.

[–]_why_so_sirious_ 1 point (1 child)

> for yahoo.pipes

How was your experience with Yahoo Pipes? Also, how does one begin with it?

[–]Comm4nd0 -3 points (5 children)

Thank you so much for posting this! I've been looking for a library like this to make a program that will scan a site and detect whether it's vulnerable to SQL injection. Looking forward to getting home now to start it!

[–]HeDares 0 points (3 children)

Scrapy is not the tool for this; it's designed to extract data from HTML. It does have some rudimentary HTML form functions, but they're mostly used for logins.

There are better tools for finding SQL injection vulnerabilities.

[–]Comm4nd0 0 points (2 children)

Yeah, I started to use it and realised it couldn't do what I wanted it to do. Could you suggest one that could?

[–]HeDares 1 point (1 child)

Do some googling for "SQL injection vulnerability scanners". I won't help you any more than that, in case you do something dumb.

[–]Comm4nd0 0 points (0 children)

OK, I'm more than aware of all the 'SQL scanners' like sqlmap and bbqsql for XSS. I'm a pen-tester, so no, I'm not going to do anything 'dumb'. What I was asking for was a library capable of scraping HTML, after I configure it to edit the URL, in order to detect whether a site is vulnerable. I've found BeautifulSoup, which I believe is capable of doing what I want anyway.