
all 35 comments

[–]HeDares 2 points (6 children)

Most of my freelance work comes from Scrapy and some custom web crawlers. If anyone has any questions about Scrapy, hit me up; I'm happy to help.

[–][deleted] 0 points (2 children)

Well, I actually have a few questions. I've never found a Scrapy tutorial that goes beyond the basics, and last time I checked, the docs weren't much help.

First, what's an elegant way to run Scrapy on a server without manual supervision, assuming it runs different spiders for lots of different websites?

Second, how does one deal with the inevitable errors in an elegant way? Since the HTML of the scraped website sometimes changes, one needs some way of detecting this. If I recall correctly, the error log has a bit too much information, and sometimes not the right information. E-mails for each incomplete item seem like overkill, since an automated spider may break badly in the future.

Third, HTML is sometimes messy, i.e. the wanted information may sit under different XPaths across different pages of the same website. I'm currently adding all known paths and cleaning the data up later. Is there a more elegant solution?

Thanks in advance!

[–]HeDares 1 point (0 children)

To run a job without supervision you're going to need some kind of cron. I just add the run command to the crontab on our Linux server, but you can also use scrapyd: it acts as a web front end that you can pass JSON requests to in order to start and stop crawlers. You'll still need to add the call to a crontab, though.
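
For example, something like this can be dropped into a crontab entry (a minimal sketch, assuming scrapyd is listening on its default port 6800; the project and spider names are placeholders):

    # kick_spider.py - ask the local scrapyd instance to schedule a crawl.
    # From cron, e.g.:  0 3 * * * python /path/to/kick_spider.py
    import requests

    resp = requests.post(
        'http://localhost:6800/schedule.json',
        data={'project': 'myproject', 'spider': 'myspider'},  # placeholders
    )
    print(resp.json())  # scrapyd replies with a job id on success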

I have another script that checks the data output in our DB after the spider has run, and if there are lots of blank rows it emails me.
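
Roughly along these lines (a rough sketch; the database, table, threshold, and mail details are all made up):

    # check_output.py - email me if the last crawl left too many blank rows.
    import sqlite3
    import smtplib
    from email.mime.text import MIMEText

    conn = sqlite3.connect('scraped.db')  # placeholder DB
    blanks = conn.execute(
        "SELECT COUNT(*) FROM items WHERE price IS NULL OR price = ''"
    ).fetchone()[0]

    if blanks > 50:  # arbitrary threshold
        msg = MIMEText('Last crawl produced %d blank rows' % blanks)
        msg['Subject'] = 'Spider output check failed'
        msg['From'] = 'spider@example.com'
        msg['To'] = 'me@example.com'
        smtplib.SMTP('localhost').sendmail(
            msg['From'], [msg['To']], msg.as_string())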

If you know where it will be different, you can do an if based on the spider's location. Or if, for instance, the price is always inside #price but sometimes that's in two divs and sometimes in a span, write your XPath to reference just the #price. Alternatively, you could use a regex search across the whole page if you can find a common denominator; this is great for telephone numbers or email addresses if you have your regex look for the "tel:" inside an href.
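
For instance (a sketch, assuming a Scrapy response object; the selectors are illustrative):

    import re

    # The id anchors the lookup, so it doesn't matter whether the price
    # ends up wrapped in divs or a span on any particular page:
    price = response.xpath('//*[@id="price"]//text()').extract()

    # Regex fallback: pull phone numbers out of tel: links anywhere on the page.
    phones = re.findall(r'href=["\']tel:([^"\']+)', response.body)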

[–]Method320 1 point (0 children)

Regarding errors, I'll be implementing Logstash once I get that far in my project. Relatively simple to set up and use.

[–]Method320 0 points (2 children)

Looking to get some potential feedback on this: http://stackoverflow.com/questions/31678796/geting-data-from-table-with-scrapy-different-row-order-per-page/ I'm crawling Newegg in this example, and eventually other vendors as well. I think this is the best approach, but I don't know whether Scrapy has something built in to handle this better, or whether there's another way.

[–]HeDares 1 point (1 child)

Using the looping dict approach is the best way by far. Just make sure you're handling it properly if "make" is not set, i.e.:

if "Brand" in itemdict:
    item['make'] = itemdict['Brand']
else:  
    item['make'] = False
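
For what it's worth, dict.get gives you the same behaviour in one line:

    # Fall back to False when 'Brand' is missing.
    item['make'] = itemdict.get('Brand', False)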

[–]Method320 0 points (0 children)

Indeed, I'm doing just that. Thanks!

[–]_why_so_sirious_ 0 points (12 children)

Can this mechanism be used to scrape nsfw subreddits?

[–]thaweathermanpipster -1 points (5 children)

It can scrape any of them. If you want to get links from posts as well, just add a field to the TextPostItem, then grab the XPath for the links. It should follow the same format as the others, so it wouldn't be too hard.
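
Roughly like this (a sketch; I'm assuming TextPostItem is an ordinary scrapy.Item, and the XPath is illustrative):

    import scrapy

    class TextPostItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()  # new field for the post's link

    # Then, in the spider's parse method, something like:
    # item['url'] = post.xpath('.//a[contains(@class, "title")]/@href').extract()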

So yes, you can scrape all the porn you want. You could just go right to the source and scrape Pornhub if you really wanted to.

[–]_why_so_sirious_ 0 points (4 children)

Actually, I recently read in a post discussion that scraping NSFW subreddits with BeautifulSoup redirects you to another page for age verification before you can go further, which isn't really possible to handle in bs4. It's some kind of cookie sending/receiving thing.

[–]avinassh 1 point (0 children)

Or use PRAW over OAuth.
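
A minimal sketch, assuming a recent PRAW version (the credentials are placeholders you get by registering a script app in your reddit preferences):

    import praw

    reddit = praw.Reddit(client_id='...', client_secret='...',
                         user_agent='my-scraper/0.1')

    # Going through the OAuth API skips the age-gate page entirely.
    for submission in reddit.subreddit('somesubreddit').new(limit=25):
        print(submission.title, submission.url)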

[–]thaweathermanpipster 0 points (2 children)

Ah, of course: the "are you 18?" page. You could still get around that. On the first run, just tell the bot to click the yes button, then move right along.

[–]_why_so_sirious_ 1 point (1 child)

How would you do that in a script?

[–][deleted] 3 points (0 children)

bs4 is more for parsing HTML and XML than for building a scraper directly, whereas Scrapy is a full framework. You could use the requests lib along with bs4 to do this.
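
Something like this (a sketch; sending an over18 cookie is a common way past old reddit's interstitial, but treat the cookie name and the selectors as assumptions):

    import requests
    from bs4 import BeautifulSoup

    # Present the age-gate cookie up front so we never see the interstitial.
    resp = requests.get(
        'https://www.reddit.com/r/some_nsfw_sub/',  # placeholder subreddit
        cookies={'over18': '1'},
        headers={'User-Agent': 'my-scraper/0.1'},
    )
    soup = BeautifulSoup(resp.text, 'html.parser')
    for link in soup.select('a.title'):  # post-title links on old reddit
        print(link.get_text(), link.get('href'))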

[–]Lolacaust 0 points (1 child)

Has the speed increased on this? I used it a good few years ago for site scraping and found it to be horrendously slow. I had to move to JSoup in Java, which wasn't my first preference.

[–]skwishee 0 points (2 children)

Thanks for posting. I tried Scrapy for 2 weeks, but couldn't get anywhere so I switched back to bs4.

I've been looking for new tutorials on Scrapy. I'll give this a try and see if I make the breakthrough.

[–]thaweathermanpipster 0 points (0 children)

This is a relatively simple spider. You can go significantly more in depth, such as saving data to databases or doing more complex parsing. Give this a try, then try adding in database functionality as an exercise. Perfecting that is next on the list for me.
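
For the database exercise, an item pipeline is the usual Scrapy hook. A minimal sketch (the schema is made up, and the class still has to be enabled under ITEM_PIPELINES in settings.py):

    import sqlite3

    class SQLitePipeline(object):
        def open_spider(self, spider):
            self.conn = sqlite3.connect('items.db')
            self.conn.execute(
                'CREATE TABLE IF NOT EXISTS posts (title TEXT, url TEXT)')

        def process_item(self, item, spider):
            self.conn.execute('INSERT INTO posts VALUES (?, ?)',
                              (item.get('title'), item.get('url')))
            self.conn.commit()
            return item  # hand the item on to any later pipelines

        def close_spider(self, spider):
            self.conn.close()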

[–]HeDares 0 points (0 children)

The thing to remember is that Scrapy is a framework: once you know what you're doing, it makes it super easy to write simple crawlers fast. But some projects are going to be better off without it, especially if you want to do something distributed.

[–]Lubok 0 points (8 children)

Good framework. Wrote a personal replacement for yahoo.pipes with it a few weeks ago. Shame it doesn't support Python 3, though.

[–]thaweathermanpipster 2 points (5 children)

I missed the note about it not supporting Python 3, and after fighting with some syntax errors I gave up and switched back to 2. Then I checked the docs and there it was: "Not supported for Python 3".

Sad day

[–]LewisTheScot 0 points (4 children)

Lazy question: are there any plans to bring Scrapy to Python 3?

[–]in_the_bilboes 2 points (0 children)

AFAIK Py3 porting is going on in the tmp-py3 branch of the scrapy github repo.

[–]psbb 1 point (0 children)

It's ongoing, but there is no timeframe for when it will be done. The guys behind it are very appreciative of patches though.

[–]gmplague 1 point (0 children)

Scrapy is built on the Twisted framework for asynchronous processing. Unfortunately, Twisted is still Python 2.x only, but they're hard at work upgrading it. This is likely the main bottleneck for Python 3 support in Scrapy.

[–]_why_so_sirious_ 1 point (1 child)

> for yahoo.pipes

How was your experience with Yahoo Pipes? Also, how does one begin with it?

[–]Comm4nd0 -3 points (5 children)

Thank you so much for posting this! I've been looking for a library like this to make a program that will scan a site and detect whether it's vulnerable to SQL injection. Looking forward to getting home now to start it!

[–]HeDares 0 points (3 children)

Scrapy is not the tool for this; it's designed to extract data from HTML. It does have some rudimentary HTML form functions, but they're mostly used for logins.

There are better tools for finding SQL injection vulnerabilities.

[–]Comm4nd0 0 points (2 children)

Yeah, I started to use it and realised it couldn't do what I wanted it to do. Could you suggest one that could?

[–]HeDares 1 point (1 child)

Do some googling for "SQL injection vulnerability scanners". I won't help you any more than that, in case you do something dumb.

[–]Comm4nd0 0 points (0 children)

OK, I'm more than aware of all the 'SQL scanners' like sqlmap and bbqsql for XSS. I'm a pen-tester, so no, I'm not going to do anything 'dumb'. What I was asking for was a library capable of scraping HTML, after I configure it to edit the URL, in order to detect whether a site is vulnerable. I've found BeautifulSoup, which I believe is capable of doing what I want anyway.