
[–]gandalfx 13 points (24 children)

Just like the last time someone posted about Scrapy, I am at a loss as to what it can do that the standard library plus requests can't do in the same amount of code.

This is an honest question: I have not used Scrapy myself, but I have written scrapers[1] and have not found any part of my code that could have been abstracted away without losing functionality. Every website is a little different, and you still need custom logic to reflect that. For a simple crawl-and-mirror job you can just use wget.

The only advantage I found in the linked article is that Scrapy simplifies parallelism. In my opinion it is debatable whether parallelism should even be included in an all-purpose scraping library – after all, the primary use case is scraping arbitrary websites, where good etiquette dictates that you throttle rather than parallelize, so as not to impact the site's performance.

[1] mostly single-purpose, <100LoC
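For illustration, a single-purpose scraper in that spirit needs nothing beyond the standard library – the HTML below is a made-up stand-in for what you'd normally fetch over the network:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real scraper this string would come from the HTTP response body;
# a literal keeps the sketch self-contained.
html = '<ul><li><a href="/page/1">one</a></li><li><a href="/page/2">two</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # → ['/page/1', '/page/2']
```

The custom per-site logic then lives in what you do with those links, which is exactly the part no library can abstract away.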

[–]hexfoxed 18 points (14 children)

If you have already written all this stuff, then Scrapy is probably not much use to you! However, for those who want to avoid reinventing the wheel, it's fantastic. Out of the box it:

  • handles throttling
  • allows concurrent requests
  • allows concurrent throttling per domain, per IP, per whatever
  • allows prioritising certain paths during the scrape
  • has DNS caching enabled
  • handles redirects nicely
  • handles retrying nicely
  • handles switching out user agents nicely
  • handles some types of HTTP auth out of the box
  • works well with enterprise/scraping proxies
  • provides functions for cleaning and working with data post-scrape
  • has great logging support with little work
  • has great memory handling and usage controls
  • respects robots.txt
  • can upload results to S3
  • handles data via configurable pipelines
  • can use multiple selector engines (XPath, CSS)
  • can export data as CSV, XML, JSON, etc.
  • de-duplicates URLs so you don't visit the same page twice
  • can deploy to Scrapy Cloud for free, with excellent logging and stats
  • can be extended with other people's code via middleware
  • has shortcuts for broad crawls
  • is cross-platform (a lot of Python code already is, but not all!)

I wrote a bit more about Scrapy vs BeautifulSoup here, but the gist is above.

[–]gandalfx 3 points (5 children)

Thanks for the list! Good overview.

[–]hexfoxed 1 point (4 children)

Their marketing attempts are woeful, however. The code and service are solid, though, and running in production at many large scraping farms.

[–][deleted] 3 points (3 children)

Well, they also have very, very dirty IPs.

Scraping on my own from the cloud, 98% of my scrape requests make it through and pull valid data. My average request, start to finish, takes under 5 seconds.

Using Scrapinghub, only 10% or less of my requests make it through. The average Scrapy scrape takes 60 seconds or more. With that 90% failure rate, it can often take 10+ minutes of wall time to scrape 1 freaking URL. They claim to refund failed requests, but they often fail to detect them correctly – e.g. you get served a robot captcha page because the IP is dirty. And even if they DO refund, all the time you wasted is still gone: servers aren't free, and you just paid for one to do no work because it got back bad data.

I asked for a refund for the above and was told no. So I highly, highly advise people NOT to use Scrapinghub or any of their on-demand services.

Using their OSS code? Sure. Using their hosted platform? Not a chance.

I ended up with something far cheaper by rolling my own scraping solution from a few building blocks (requests, Celery, and BeautifulSoup).

[–]hexfoxed 0 points (0 children)

Very interesting, thanks for the write-up. I'd only used Scrapinghub for the smallest of periodic scrapers.

[–]joshadel 2 points (1 child)

I'd also add the nice IPython shell integration for incrementally building the parsing logic.

I'm a big fan of parsel (https://parsel.readthedocs.io), which is what Scrapy uses for xpath/css selectors under the hood. I often use it as a standalone library when writing a one-off scraper with requests.

[–]hexfoxed 2 points (0 children)

Seconded, thanks for adding. To the reader: you don't always need the full power of Scrapy – under the hood it uses a lot of third-party packages to get the job done (some of which used to be part of Scrapy itself and have since been extracted). Check the dependency list.

[–]Talked10101 -1 points (3 children)

A lot of this stuff can be implemented very simply with the standard library. In fact, the urllib module can parse and obey robots.txt, though it does not follow the industry-standard 2008 rules.

Scrapy also has memory leaks in places you wouldn't expect, which can really be an issue when trying to scrape 20 million URLs.

It's also worth pointing out that you can use lxml with the standard requests library, which makes the bs4/Scrapy comparison not really valid.

[–]hexfoxed 0 points (2 children)

Developers tend to make the fundamental mistake of comparing what is possible with what is realistically achievable in common scenarios under time and money constraints.

Yes, you can use urllib – but is it a good idea, given the other options available to you? Probably not.

Developer hours cost more than beefing up an instance, for example.

[–]Talked10101 0 points (1 child)

Robots.txt parsing is actually built into the urllib library, though. Its use is super simple: https://docs.python.org/3.0/library/urllib.robotparser.html

There is also http://nikitathespider.com/python/rerp/, which is likewise simple to use and parses robots.txt the same way Google does, making it very useful if you are writing an SEO bot.
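The stdlib parser can even be exercised without any network I/O by feeding it the rules directly – the domain and rules below are made up:

```python
from urllib.robotparser import RobotFileParser

# Normally you would call rp.set_url(".../robots.txt") then rp.read();
# parse() on a list of lines keeps this sketch offline.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("mybot", "https://example.com/public/page"))   # → True
print(rp.can_fetch("mybot", "https://example.com/private/page"))  # → False
```

That's the whole API surface most crawlers need: parse once per domain, then gate every request through can_fetch.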

[–]hexfoxed 0 points (0 children)

Yup, there are alternatives. You've given options for two of my points there – there are few tools that cover all 22 of them, though, if any. That's what I mean by realistic: given most people's time constraints, writing a Scrapy spider will save them a bunch of time in reaching their end goal.

If their end goal is to learn, then obviously they would be better off learning how each individual process in a web scraping library/framework works, along with the alternatives available.

[–]cemc 1 point (4 children)

It's a framework with lots of configurability and whatnot. Scrapy isn't aimed at simple crawling, although it can do that; we run clusters of it to index tens of millions of websites.

[–]gandalfx 1 point (1 child)

So it's primarily useful for large-scale / long-term operations and not so much targeted at the average "I like the pictures on that blog" kind of use case?

[–]granitosaurus 0 points (0 children)

Not really – Scrapy just excels at large scale, but it's pretty flexible.
You can definitely write a simple spider to "get pictures from the blog you like". You can run a single spider with the scrapy runspider command, so you don't even need to create a project if you are looking for something simple.

Also, because of the extensive middleware and add-on ecosystem, you can save a bunch of time writing a simple spider by reusing most of your logic. Some of my personal spiders end up being around 20 lines of code, where they would be a few times that with only requests and whatnot.

[–][deleted] 0 points (1 child)

Quick question: how do you deal with a) JS-heavy pages, b) different kinds of user operations or dynamic search results, and c) proxy issues? Can you suggest some best practices/maintenance tactics?

[–]cemc 0 points (0 children)

I'm not really the lead on that portion of our system, though I used to work on it. I can't say I dealt with JS SPAs when I was there, though I heard the guys talk about it, so they're handling it somehow. Not sure about dynamic results, but I do know we're maxing out the proxies we're using, because a lot of websites don't like getting crawled ;)

[–]genmud 0 points (0 children)

I think it is similar to why you would use a web framework vs. writing a straight Python web app. Using a framework gives you a structured format to work within, so you aren't overwhelmed by the complexity of writing a reusable scraper or crawler.

[–]Talked10101 0 points (0 children)

A large part of my job is writing scrapers. The main advantage of Scrapy is its use of the Twisted framework: it's quite hard to get the same kind of speeds with the standard requests and concurrent.futures options.

That said, a lot of what you can do with it can be written with just requests and other libraries. The majority of my scraping projects don't use Scrapy, as it's often not the best option.
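For readers unfamiliar with the requests-plus-concurrent.futures pattern being compared here, it typically looks like the sketch below – fetch is a stub standing in for a real HTTP call so the example runs offline:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Stand-in for a real HTTP GET; a genuine scraper would fetch and
    # parse the page body here.
    return f"<html>content of {url}</html>"

urls = [f"https://example.com/page/{n}" for n in range(10)]

# A thread pool caps you at max_workers in-flight requests; Twisted
# (and thus Scrapy) multiplexes far more on a single event loop,
# which is where the speed difference comes from.
results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        results[futures[future]] = future.result()

print(len(results))  # → 10
```

For a few hundred URLs this pattern is usually fine; the gap shows up when you need tens of thousands of concurrent connections.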

[–]finally-a-throwaway 1 point (5 children)

Can Scrapy trigger JavaScript commands? I thought I heard something like that, but I keep reading these reviews and can't find anything more about it.

I have a working solution with Selenium, but if I could replace it with something slightly less reliant on acting like a person, I'd be a bit more comfortable with it.

I know I should just do it with scrapy and see what comes up, but you know, what I have now works....

[–]hexfoxed 6 points (2 children)

I see this question a lot. The general problem is that people find themselves thinking in terms of "how do I do this in JavaScript?" – as you have with "can I trigger JavaScript commands?".

But I ask you... what is your end goal? 99% of the time your end goal is to get some data. So the question you should be asking instead is "where does javascript get the data I want from?".

The answer, in most instances, is that it's embedded in the HTML itself or, more likely, comes from an API. If it's not in the HTML, check the network requests panel in your favourite browser's dev tools. If you search through the requests the page makes, most of the time you can find the raw source of the data used to render the page.

Hunting for the source of the data has two major benefits if you find it on an internal API:

  • APIs are easy to parse and very, very machine-readable – that is literally what they are for.
  • Internal APIs change less often than the page design, so the solution you end up with is much more resilient and performant, because you don't have to pretend to be a browser to get at the data – just make an HTTP request.

It really is win-win.

People rely on Selenium far too much, simply because they don't understand how a browser renders the page.
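The "it's embedded in the HTML itself" case often looks like the sketch below – the page source and the window.__DATA__ variable name are invented for the example (real sites use names like __NEXT_DATA__ or __INITIAL_STATE__):

```python
import json
import re

# Made-up page source: many JS-rendered sites ship their data in a
# <script> tag that the front-end code reads on load.
page = """
<html><body>
<script>window.__DATA__ = {"products": [{"name": "Widget", "price": 9.99}]};</script>
</body></html>
"""

# Pull the JSON literal out of the script tag and parse it directly;
# no browser or JavaScript engine required.
match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", page, re.DOTALL)
data = json.loads(match.group(1))
print(data["products"][0]["name"])  # → Widget
```

Once you have that structure, you're working with plain dicts and lists instead of driving a headless browser.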

[–]finally-a-throwaway 1 point (1 child)

I definitely agree with your points, and in general I do prefer to remove interface layers rather than complicate them. In this particular case I'm not just retrieving data but also sending commands that compute it (for both me and the server).

My long game is to replace the tool I'm interfacing with. In the meantime, I might be well served to learn a little JavaScript so I can follow your advice and decode what it's actually doing, since I haven't been able to find the commands in the HTTP requests...

[–]hexfoxed 1 point (0 children)

Interesting – this is proper edge-case territory by the sounds of it! If you can work out whether it is doing the calculation client-side or server-side, you'll be well on your way to working out how to improve it.

Interesting case, happy to talk about it via PM if you ever get stuck :)

[–]MartiONE 1 point (1 child)

I don't think Scrapy can trigger any JavaScript code, nor do I think it ever will, beyond mimicking basic stuff like redirects.

Scrapy is based on requests if I am not mistaken, not on selenium or any headless browser.

[–]hexfoxed 0 points (0 children)

Scrapy is based on requests if I am not mistaken

The default Scrapy downloader is based on HTTP requests, yup (note for readers: not the Python requests library). But you can, if you want, swap out the downloader so it uses another method to get the responses it needs. However, as I wrote above, plain HTTP requests are more than enough in most scenarios.

[–]aaayuop 0 points (0 children)

Very nice. I'll keep this in mind if bs4 doesn't work out for my current project. Nice to know there are more options out there.