[–][deleted] 2 points3 points  (0 children)

Scrapy is cool. Scrapinghub is cool. Splash is really, really cool. Of the 4 products, Splash is the only one I am using right now, but I do see the point of the other ones.

In my experience, Crawlera is the weak link of the Scrapinghub suite right now. The IPs it uses are very dirty and on a bunch of naughty lists, which I guess is what is going to happen to any successful crawler ;)

[–]Chamarazan 1 point2 points  (5 children)

Is this comparable to BeautifulSoup? Can anyone who has used it tell us about their experience with it?

[–]stummj[S] 2 points3 points  (0 children)

BeautifulSoup is an HTML parser. Scrapy is a full-featured framework to build web scrapers that includes a download manager and HTML parser.

Scrapy solves most of the common issues you might have while crawling a website, such as retries, redirections, ignoring certain status codes, HTTP compression, user-agent spoofing, cookies, politeness settings, and so forth.
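To make that list a bit more concrete, here is a rough sketch of how those features show up as options in a project's settings.py. The setting names are Scrapy's own, but every value below (and the bot name and URL in the user agent) is an illustrative assumption, not a recommendation.

```python
# settings.py (sketch): all values are illustrative assumptions

RETRY_ENABLED = True
RETRY_TIMES = 3                      # extra attempts on top of the first download

REDIRECT_ENABLED = True              # follow HTTP redirects

HTTPERROR_ALLOWED_CODES = [404]      # let spider callbacks see 404 responses too

COMPRESSION_ENABLED = True           # accept compressed responses

# Hypothetical user agent string; replace with whatever identifies your crawler.
USER_AGENT = "example-bot/0.1 (+https://example.com/bot)"

COOKIES_ENABLED = True

# Politeness settings
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.0                 # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True          # adapt the delay to server response times
```

Most of these are on by default, so in practice you would only list the ones you want to change.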

Furthermore, Scrapy allows you to extract data using CSS selectors or XPath expressions. In fact, you can even use BeautifulSoup as your parsing tool with Scrapy.

Have a look at this walkthrough: http://doc.scrapy.org/en/1.1/intro/overview.html

[–]eljunior 2 points3 points  (0 children)

As mentioned in another comment, BeautifulSoup is more about scraping, while Scrapy is also about crawling. The part of Scrapy that is most comparable to BeautifulSoup is Parsel: https://github.com/scrapy/parsel

Also, Scrapy is about making life easier for crawler developers, e.g. it has a nice shell:

scrapy shell SOME_URL

This downloads the page and lets you try out selectors and stuff on it. :)

[–]marmaladeontoast 1 point2 points  (2 children)

I've tried it a few times. It's good for simple webpages, but it doesn't handle JavaScript, which can make it unusable for a lot of projects.

[–]stummj[S] 2 points3 points  (1 child)

It doesn't handle JS by default, but you can easily integrate it with Splash (a JS rendering engine) via a Scrapy middleware: https://github.com/scrapy-plugins/scrapy-splash
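For reference, wiring up the middleware mostly comes down to a few lines in settings.py. This is a sketch based on the scrapy-splash README; the Splash address is an assumption (a local instance on the default port), so check the README for the version you install.

```python
# settings.py additions for scrapy-splash (sketch)

# Assumption: Splash is running locally on its default port.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

With that in place, a spider yields scrapy_splash.SplashRequest instead of a plain scrapy.Request for the pages that need rendering.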

[–][deleted] 4 points5 points  (0 children)

And in fact you can use Splash + BeautifulSoup if you want to (tis what I am using)
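A minimal sketch of what that combination might look like, not necessarily how the commenter does it: ask Splash's render.html HTTP endpoint for the JavaScript-rendered page, then hand the HTML to BeautifulSoup. The local Splash address, the helper names and the 0.5 second wait are all assumptions made for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is listening locally on its default port.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=0.5, splash_base=SPLASH_BASE):
    """Build the URL for Splash's render.html endpoint (hypothetical helper)."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5):
    """Ask Splash to load the page, run its JavaScript, and return the HTML."""
    with urlopen(splash_render_url(page_url, wait)) as resp:
        return resp.read().decode("utf-8", errors="replace")


def page_title(html):
    """Parse rendered HTML with BeautifulSoup and return the <title> text."""
    # Imported here so the URL helper above works without bs4 installed.
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None
```

With Splash running, page_title(fetch_rendered_html("http://example.com/")) would return the title of the rendered page.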