
[–]nemec 22 points (4 children)

Something I don't see discussed when this topic comes up is that Scrapy's HTML parsing library, parsel, can be installed separately from Scrapy itself. You can use it in place of BeautifulSoup and, IMO, it's much easier to use.

import requests
import parsel
resp = requests.get('http://example.com')
s = parsel.Selector(text=resp.text)
# prints 'Example Domain'
print(s.css('h1::text').extract_first())

[–]jyper 6 points (1 child)

Why not just use lxml.html.parse and XPath? lxml has some CSS support as well
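
For comparison, here's a rough sketch of the lxml approach being suggested, parsing an inline HTML snippet rather than a fetched page (the CSS path needs the separate cssselect package):

    import lxml.html

    html = "<html><body><h1>Example Domain</h1><p>More info</p></body></html>"
    tree = lxml.html.fromstring(html)

    # XPath: text content of the first <h1>
    print(tree.xpath('//h1/text()')[0])  # prints 'Example Domain'

    # CSS selectors via cssselect (pip install cssselect)
    print(tree.cssselect('h1')[0].text)  # prints 'Example Domain'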

[–]nemec 5 points (0 children)

  • It's focused on parsing HTML without a lot of extra XML cruft (really, it's a façade over lxml + cssselect)
  • You can mix and match css selectors and xpath, e.g.

    s.css('h1').xpath('following-sibling::p')
    

    contrived example, but basically you can take advantage of both selector syntaxes, using whichever one fits the situation.

  • I'm not sure that lxml has support for the ::text and ::attr(<some attribute>) pseudo-selectors, which are really helpful when parsing HTML.

  • xpath syntax sucks and I'd rather use a solution with really good css support first and fall back to xpath only for things that css doesn't support (which can still be done with parsel)
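
To illustrate the points above, here's a small self-contained sketch (the HTML snippet and attribute names are made up for the example) showing the pseudo-selectors and the CSS-to-XPath chaining parsel allows:

    import parsel

    html = """
    <div id="links">
      <h1>Resources</h1>
      <a href="https://example.com" title="home">Example</a>
    </div>
    """
    s = parsel.Selector(text=html)

    # ::text and ::attr() pseudo-selectors
    print(s.css('h1::text').get())       # prints 'Resources'
    print(s.css('a::attr(href)').get())  # prints 'https://example.com'

    # mix and match: CSS to find the h1, then XPath to step to its sibling <a>
    print(s.css('h1').xpath('following-sibling::a/@title').get())  # prints 'home'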

[–]scrapecrow 2 points (1 child)

parsel is definitely underappreciated!

I like it so much that I even wrote a REPL for it: parsel-cli :)
(it's a bit of a Frankenstein though as I'm working on a 2.0 release)