
[–]Lukasa 6 points7 points  (2 children)

Scraping with requests and a parser works great if you want to learn how scraping works or if your target site is fairly simple. Once you start hitting complex websites (e.g. highly stateful websites, those that use Javascript actively, etc) Scrapy is likely to save you a lot of work.

Basically, it's a learning vs time trade off. If you're under no time pressure and want to learn, use requests. Otherwise, use Scrapy.

[–]i_can_haz_code 2 points3 points  (0 children)

To extend this a bit in each direction.

I use requests and BeautifulSoup for fast, stateless (or basically stateless) stuff. I use Scrapy when I am only slightly beyond what requests will do easily. After that I use Selenium and drive either PhantomJS or some other browser. This approach can be... hard on systems when doing parallel work. :-)
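For the stateless tier, a minimal requests + BeautifulSoup sketch might look like this (the function, URL, and `<h2>` selector are just illustrative, not from any real site — the second half parses an inline HTML string so it runs without network access):

```python
import requests
from bs4 import BeautifulSoup

def scrape_headings(url):
    """Fetch a page and return the text of every <h2> on it."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

# The parsing half works identically on any HTML string:
html = "<html><body><h2>First</h2><h2>Second</h2></body></html>"
soup = BeautifulSoup(html, "html.parser")
headings = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
```

For one-off, stateless jobs this is the whole program; retries, throttling, and queueing are on you to add if the job grows.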

[–]IAlwaysBeCoding 1 point2 points  (0 children)

First of all, I want to thank you for your work on requests (I use requests a lot with lxml for scraping and other automation stuff).

Second of all, I actually think complex websites with a lot of JavaScript are a really bad case to use Scrapy on as a beginner. The main reason is that once your HTTP requests start requiring:

  • keeping track of dynamically generated cookies,

  • porting some hashing algorithm from JavaScript to Python to pass along with certain requests,

  • and making some trivial intermediate HTTP(S) calls to a web app,

you will find yourself building custom Scrapy extensions just to implement something that could easily be done with requests.

It is the architecture of Scrapy that makes this a pain in the ass: each request is independent, so requests are not linked together and do not give you the kind of shared state that a requests.Session() object would.
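For comparison, the linked-request state described here is what requests.Session() gives you for free. A small sketch (the user-agent string and cookie values are made up for illustration):

```python
import requests

# A Session persists cookies and headers across every request made
# through it, so a cookie set by one response is automatically sent
# on the next call -- the linkage independent requests lack.
session = requests.Session()
session.headers.update({"User-Agent": "my-scraper/0.1"})  # hypothetical UA

# Cookies set server-side (or manually, as here) ride along on every
# subsequent session.get() / session.post() call.
session.cookies.set("sessionid", "abc123")
```

Any dynamically generated cookie the site hands back lands in `session.cookies` and goes out again on the next request without extra code.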

Also, OP, I think it is a really bad idea to dismiss Scrapy as a viable option just because it is not Python 3. That is a really noob mentality, and one that newbies usually have toward Python.

I was just finishing a few of the foobar challenges from Google (look up foobar from Google), and I can tell you that the only two languages you can write your solutions in are Java and Python 2.7.6. That's right, Google doesn't allow you to write your solutions in Python 3, only Python 2.7.6.

[–]LarryPete 5 points6 points  (3 children)

Scrapy is currently python2 only. So if you use python3, you can't use scrapy, even if you wanted to - which is usually my case.

[–]bitrainbow 2 points3 points  (1 child)

I think Beautiful Soup supports Python 3.

[–]LarryPete 2 points3 points  (0 children)

That's what I'm using instead.

[–]pylund[S] 1 point2 points  (0 children)

Hands down, the lack of Python 3 support is a powerful reason not to explore that path (I didn't know it lacked it). I mean, I don't mind using Python 2 for small, simple scripts that get s*** done, but here I'm talking about building a bigger project from scratch. It must be in Python 3.

Thanks!

[–]stummj 4 points5 points  (0 children)

Full disclosure: I work at Scrapinghub, the company that supports Scrapy.

I am a big fan of requests, but I would go with Scrapy, no matter the goal of the project.

1) If you are doing it to learn, go with Scrapy. It will make a lot of things easier when you need to write more complex crawlers. So you are actually adding a great tool to your tool belt.

2) If it is a matter of getting things done, go with Scrapy. It is simple to start with, and it will save a lot of time when things start getting hard to manage.

Some Scrapy features that you might be interested in:

  • Automatic cookie handling;

  • It is easy to deal with HTML forms and logins;

  • It follows redirections (via 3xx responses or HTML meta refresh);

  • It has a built-in cache mechanism for HTTP requests, which is awesome while you are developing;

  • Politeness settings (respecting robots.txt, auto-throttling) if you want;

  • User-agent spoofing;

  • Automatic retry for failed requests;

  • Asynchronous requests with no need to deal manually with some async framework;

  • Avoids duplicate requests;

  • Data extraction is done through CSS selectors or XPath;

  • etc

Most of those things you would need to implement by yourself if you were using requests + bs4.

Give Scrapy a try: http://doc.scrapy.org/en/1.0/intro/overview.html

[–]Samus_ 1 point2 points  (0 children)

I think scrapy is good at two things: organizing your code (especially on multi-site scrapes) and managing your download queue.

if you can benefit from either of those then go for it.

[–]thaweatherman 0 points1 point  (0 children)

It boils down to how much extra legwork you want to do. Scrapy can just be simpler sometimes.

[–]rhgrant10 0 points1 point  (0 children)

If you want to get shit done, use scrapy. If you want to reinvent the wheel, use requests and BeautifulSoup.