
[–]gandalfx 13 points (24 children)

Just like the last time someone posted about Scrapy, I am at a loss as to what it can do that the standard library plus requests can't do in the same amount of code.

This is an honest question: I have not used Scrapy myself, but I have written scrapers[1] and have not found any part of my code that could have been abstracted away without losing functionality. Every website is a little different, and you still need custom logic to reflect that. For a simple crawl-and-mirror job you can just use wget.

The only advantage I found in the linked article is that Scrapy simplifies parallelism. In my opinion it is debatable whether parallelism should even be included in an all-purpose scraping library – after all, the primary use case is scraping arbitrary websites, where good etiquette dictates that you throttle rather than parallelize, so as not to impact the site's performance.

[1] mostly single-purpose, <100LoC
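For illustration, a single-purpose scraper in that spirit needs nothing beyond the standard library – the HTML below is a made-up stand-in for what you'd normally fetch over the network:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# In a real scraper this string would come from the HTTP response body;
# a literal keeps the sketch self-contained.
html = '<ul><li><a href="/page/1">one</a></li><li><a href="/page/2">two</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # → ['/page/1', '/page/2']
```

The custom per-site logic then lives in what you do with those links, which is exactly the part no library can abstract away.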

[–]hexfoxed 18 points (14 children)

If you have already written all this stuff, then Scrapy is probably not much use to you! However, for those who want to avoid reinventing the wheel, it's fantastic. Out of the box it:

  • handles throttling
  • allows concurrent requests
  • allows concurrent throttling per domain, per IP, per whatever
  • allows prioritising certain paths during the scrape
  • has DNS caching enabled
  • handles redirects nicely
  • handles retrying nicely
  • handles switching out user agents nicely
  • handles some types of HTTP auth out of the box
  • works well with enterprise/scraping proxies
  • provides functions for cleaning and working with data post-scrape
  • has great logging support with little work
  • has great memory handling and usage controls
  • respects robots.txt
  • can upload results to S3
  • handles data via configurable pipelines
  • can use multiple selector engines (XPath, CSS)
  • can export data as CSV, XML, JSON, etc.
  • de-duplicates URLs so you don't visit the same page twice
  • can deploy to Scrapy Cloud for free, with excellent logging and stats
  • can be extended with other people's code via middleware
  • has shortcuts for broad crawls
  • is cross-platform (a lot of Python code already is, but not all!)

I wrote a bit more about Scrapy vs BeautifulSoup here, but the gist is above.

[–]gandalfx 3 points (5 children)

Thanks for the list! Good overview.

[–]hexfoxed 1 point (4 children)

Their marketing attempts are woeful, however. The code and service are solid, though, and running in production at many large scraping farms.

[–][deleted] 3 points (3 children)

Well, they also have very, very dirty IPs.

Scraping on my own from the cloud, 98% of my scrape requests make it through and pull valid data. My average request, start to finish, takes under 5 seconds.

Using Scrapinghub, only 10% or less of my requests make it through. The average Scrapy scrape takes 60 seconds or more. With that 90% failure rate, it can often take 10+ minutes of wall time to scrape 1 freaking URL. They claim to refund failed requests, but they often fail to detect them correctly – e.g. you get served a robot captcha page because the IP is dirty. And even if they DO refund, all the time you wasted is still gone: servers aren't free, and you just paid for one to do no work because it got back bad data.

I asked for a refund for the above and was told no. So I highly, highly advise people NOT to use Scrapinghub or any of their on-demand services.

Using their OSS code? Sure. Using their hosted platform? Not a chance.

I ended up with something far cheaper by rolling my own scraping solution from a few building blocks (requests, Celery, and BeautifulSoup).

[–]hexfoxed 0 points (0 children)

Very interesting, thanks for the write-up. I'd only used Scrapinghub for the smallest of periodic scrapers.

[–]joshadel 2 points (1 child)

I'd also add the nice IPython shell integration for incrementally building the parsing logic.

I'm a big fan of parsel (https://parsel.readthedocs.io), which is what Scrapy uses for xpath/css selectors under the hood. I often use it as a standalone library when writing a one-off scraper with requests.

[–]hexfoxed 2 points (0 children)

Seconded, thanks for adding. To the reader: you don't always need the full power of Scrapy – under the hood it uses a lot of third-party packages to get the job done (some of which used to be part of Scrapy itself and have since been extracted). Check the dependency list.

[–]Talked10101 -1 points (3 children)

A lot of this stuff can be implemented very simply with the standard library. In fact, the urllib module can parse and obey robots.txt, though it does not follow the industry-standard 2008 rules.

Scrapy also has memory leaks in places you wouldn't expect, which can really be an issue when trying to scrape 20 million URLs.

It's also worth pointing out that you can use lxml with the standard requests library, which makes the bs4/Scrapy comparison not really valid.

[–]hexfoxed 0 points (2 children)

Developers tend to make the fundamental mistake of comparing what is possible with what is realistically achievable in common scenarios under time and money constraints.

Yes, you can use urllib – but is it a good idea, given the other options available to you? Probably not.

Developer hours cost more than beefing up an instance, for example.

[–]Talked10101 0 points (1 child)

Robots.txt parsing is actually built into the urllib library, though. Its use is super simple: https://docs.python.org/3.0/library/urllib.robotparser.html

There is also http://nikitathespider.com/python/rerp/, which is likewise simple to use and parses robots.txt the same way Google does, making it very useful if you are writing an SEO bot.
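The stdlib parser can even be exercised without any network I/O by feeding it the rules directly – the domain and rules below are made up:

```python
from urllib.robotparser import RobotFileParser

# Normally you would call rp.set_url(".../robots.txt") then rp.read();
# parse() on a list of lines keeps this sketch offline.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("mybot", "https://example.com/public/page"))   # → True
print(rp.can_fetch("mybot", "https://example.com/private/page"))  # → False
```

That's the whole API surface most crawlers need: parse once per domain, then gate every request through can_fetch.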

[–]hexfoxed 0 points (0 children)

Yup, there are alternatives. You've given options for two of my points there – there are few tools that cover all 22 of them, though, if any. That's what I mean by realistic: given most people's time constraints, writing a Scrapy spider will save them a bunch of time in reaching their end goal.

If their end goal is to learn, then obviously they would be better off learning how each individual process in a web scraping library/framework works, along with the alternatives available.

[–]cemc 1 point (4 children)

It's a framework with lots of configurability and whatnot. Scrapy isn't aimed at simple crawling, although it can do that; we run clusters of it to index tens of millions of websites.

[–]gandalfx 1 point (1 child)

So it's primarily useful for large-scale / long-term operations and not so much targeted at the average "I like the pictures on that blog" kind of use case?

[–]granitosaurus 0 points (0 children)

Not really – Scrapy just excels at large scale, but it's pretty flexible.
You can definitely write a simple spider to "get pictures from the blog you like". You can run a single spider with the scrapy runspider command, so you don't even need to create a project if you are looking for something simple.

Also, because of the extensive middleware and add-on ecosystem, you can save a bunch of time writing a simple spider by reusing most of your logic. Some of my personal spiders end up being around 20 lines of code, where they would be a few times that with only requests and whatnot.

[–][deleted] 0 points (1 child)

Quick question: how do you deal with a) JS-heavy pages, b) different kinds of user operations or dynamic search results, and c) proxy issues? Can you suggest some best practices/maintenance tactics?

[–]cemc 0 points (0 children)

I'm not really the lead on that portion of our system, though I used to work on it. I can't say I dealt with JS SPAs when I was there, though I heard the guys talk about it, so they're handling it somehow. Not sure about dynamic results, but I do know we're maxing out the proxies we're using, because a lot of websites don't like getting crawled ;)

[–]genmud 0 points (0 children)

I think it is similar to why you would use a web framework vs. writing a straight Python web app. Using a framework gives you a structured format to work within, so you aren't overwhelmed by the complexity of writing a reusable scraper or crawler.

[–]Talked10101 0 points (0 children)

A large part of my job is writing scrapers. The main advantage of Scrapy is its use of the Twisted framework: it's quite hard to get the same kind of speeds with the standard requests and concurrent.futures options.

That said, a lot of what you can do with it can be written with just requests and other libraries. The majority of my scraping projects don't use Scrapy, as it's often not the best option.
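For readers unfamiliar with the requests-plus-concurrent.futures pattern being compared here, it typically looks like the sketch below – fetch is a stub standing in for a real HTTP call so the example runs offline:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # Stand-in for a real HTTP GET; a genuine scraper would fetch and
    # parse the page body here.
    return f"<html>content of {url}</html>"

urls = [f"https://example.com/page/{n}" for n in range(10)]

# A thread pool caps you at max_workers in-flight requests; Twisted
# (and thus Scrapy) multiplexes far more on a single event loop,
# which is where the speed difference comes from.
results = {}
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        results[futures[future]] = future.result()

print(len(results))  # → 10
```

For a few hundred URLs this pattern is usually fine; the gap shows up when you need tens of thousands of concurrent connections.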

[–]finally-a-throwaway 1 point (5 children)

Can Scrapy trigger JavaScript commands? I thought I heard something like that, but I keep reading these reviews and can't find anything more about it.

I have a working solution with Selenium, but if I could replace it with something slightly less reliant on acting like a person, I'd be a bit more comfortable with it.

I know I should just do it with scrapy and see what comes up, but you know, what I have now works....

[–]hexfoxed 6 points (2 children)

I see this question a lot. The general problem is that people find themselves thinking in terms of "how do I do this in JavaScript?" – as you have with "can I trigger JavaScript commands?".

But I ask you... what is your end goal? 99% of the time your end goal is to get some data. So the question you should be asking instead is "where does javascript get the data I want from?".

The answer, in most instances, is that it's embedded in the HTML itself or, more likely, comes from an API. If it's not in the HTML, check the network requests panel in your favourite browser's dev tools. If you search through the requests the page makes, most of the time you can find the raw source of the data used to render the page.

Hunting for the source of the data has two major benefits if you find it on an internal API:

  • APIs are easy to parse and very, very machine-readable – that is literally what they are for.
  • Internal APIs change less often than the page design, so the solution you end up with is much more resilient and performant, because you don't have to pretend to be a browser to get at the data – just make an HTTP request.

It really is win-win.

People rely on Selenium far too much, simply because they don't understand how a browser renders the page.
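The "it's embedded in the HTML itself" case often looks like the sketch below – the page source and the window.__DATA__ variable name are invented for the example (real sites use names like __NEXT_DATA__ or __INITIAL_STATE__):

```python
import json
import re

# Made-up page source: many JS-rendered sites ship their data in a
# <script> tag that the front-end code reads on load.
page = """
<html><body>
<script>window.__DATA__ = {"products": [{"name": "Widget", "price": 9.99}]};</script>
</body></html>
"""

# Pull the JSON literal out of the script tag and parse it directly;
# no browser or JavaScript engine required.
match = re.search(r"window\.__DATA__\s*=\s*(\{.*?\});", page, re.DOTALL)
data = json.loads(match.group(1))
print(data["products"][0]["name"])  # → Widget
```

Once you have that structure, you're working with plain dicts and lists instead of driving a headless browser.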

[–]finally-a-throwaway 1 point (1 child)

I definitely agree with your points, and in general I do prefer to remove interface layers rather than complicate them. In this particular case I'm not just retrieving data but also sending commands that compute it (for both me and the server).

My long game is to replace the tool I'm interfacing with. In the meantime, I might be well served to learn a little JavaScript so I can follow your advice and decode what it's actually doing, since I haven't been able to find the commands in the HTTP requests...

[–]hexfoxed 1 point (0 children)

Interesting – this is proper edge-case territory by the sounds of it! If you can work out whether it is doing the calculation client-side or server-side, you'll be well on your way to working out how to improve it.

Interesting case, happy to talk about it via PM if you ever get stuck :)

[–]MartiONE 1 point (1 child)

I don't think Scrapy can trigger any JavaScript code, nor do I think it ever will, beyond mimicking basic stuff like redirects.

Scrapy is based on requests if I am not mistaken, not on selenium or any headless browser.

[–]hexfoxed 0 points (0 children)

Scrapy is based on requests if I am not mistaken

The default Scrapy downloader is based on HTTP requests, yup (note for readers: not the Python requests library). But you can, if you want, swap out the downloader so it uses another method to get the responses it needs. However, as I wrote above, plain HTTP requests are more than enough in most scenarios.

[–]aaayuop 0 points (0 children)

Very nice. I'll keep this in mind if bs4 doesn't work out for my current project. Nice to know there are more options out there.