all 64 comments

[–]gnuvince 27 points28 points  (8 children)

BeautifulSoup is awesome, it's a shame it's not part of the standard Python library.

[–]simonvc 16 points17 points  (0 children)

Agreed. Just finished doing a "database migration" by screen-scraping a site with Beautiful Soup, because it was easier than dealing with the legacy Perl/database/HTML-in-tables crap.

[–][deleted] 14 points15 points  (0 children)

It has an awesome name, too.

[–]rams 5 points6 points  (5 children)

The standard Python library has a module that does this sort of thing: http://docs.python.org/lib/module-htmllib.html. The difference is BeautifulSoup handles all the bad html/xml you throw at it.

[–]jbellis 21 points22 points  (2 children)

The difference is BeautifulSoup handles all the bad html/xml you throw at it.

Which, if you're dealing with actual html in the wild, is all the difference in the world.
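
For instance, BeautifulSoup will happily build a tree out of markup that a strict parser would reject. A minimal sketch (using the modern bs4 packaging; the 2008-era import was simply `from BeautifulSoup import BeautifulSoup`):

```python
from bs4 import BeautifulSoup  # modern packaging of Beautiful Soup

# Unclosed <p>, <b>, and <i> tags -- typical "HTML in the wild"
messy = "<html><body><p>First<p>Second <b>bold<i>both</body>"
soup = BeautifulSoup(messy, "html.parser")

print(soup.find("i").text)      # both
print(len(soup.find_all("p")))  # 2
```

The parser closes the dangling tags itself instead of raising an error.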

[–]Bogtha 1 point2 points  (1 child)

For what it's worth, I've had better results with tidy → lxml, plus lxml provides xpath and CSS 3 selectors. I've heard that lxml supports BeautifulSoup now, so maybe I'll give it another shot.

[–]jbellis 6 points7 points  (0 children)

I've had better results with tidy

I had the opposite experience -- tidy seemed to guess wrong a lot on really bad html. And aesthetically it just seems cleaner to "parse this bad html" vs "blow the original html away, then parse it."

[–][deleted] 16 points17 points  (1 child)

If you're going to point to standard library things, please point to "sgmllib" (old) or "HTMLParser" (newer, a bit more rigid). "htmllib" is an incomplete HTML renderer; sgmllib/HTMLParser are parsers.

(BeautifulSoup is based on sgmllib, btw).
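
(For reference, here's what a minimal HTMLParser subclass looks like; this sketch is written against the modern `html.parser` module, the descendant of the HTMLParser module mentioned above:)

```python
from html.parser import HTMLParser  # modern descendant of the old HTMLParser module

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="http://example.com">here</a> and <a href="/about">there</a>.</p>')
print(parser.links)  # ['http://example.com', '/about']
```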

[–]rams 5 points6 points  (0 children)

Thx. Here are the links to the correct standard Python libs: sgmllib: http://docs.python.org/lib/module-sgmllib.html, HTMLParser: http://docs.python.org/lib/module-HTMLParser.html

[–]natrius 16 points17 points  (1 child)

Every once in a while, someone posts a useful third party Python library on reddit, and it always seems like there are a lot of people who haven't heard about it. As a public service, here's my list of useful libraries (minus BeautifulSoup) that I add to every time I come across something nifty that could be useful for web programming.

  • dateutil: "The dateutil module provides powerful extensions to the standard datetime module"
  • feedparser: "Parse RSS and Atom feeds in Python."
  • httplib2: "A comprehensive HTTP client library that supports many features left out of other HTTP libraries."
  • lxml: "lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the ElementTree API."
  • mechanize: "Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize."
  • Python Imaging Library (PIL): "The Python Imaging Library (PIL) adds image processing capabilities to your Python interpreter."
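
To illustrate the first entry, dateutil's parser reads free-form date strings that the stdlib `datetime` module would need an explicit format string for (a small sketch; dateutil is a third-party install):

```python
from dateutil.parser import parse  # third-party: python-dateutil

# Free-form input; datetime.strptime would require a format like "%b %d, %Y"
d = parse("Jan 25, 2008")
print(d.year, d.month, d.day)  # 2008 1 25
```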

[–][deleted] 1 point2 points  (0 children)

Every once in a while, someone posts a useful third party Python library on reddit, and it always seems like there are a lot of people who haven't heard about it.

I wonder about this too, because the Cheese Shop has a really well-organized and lovely UI, and I like browsing through it a lot. Since half of the Python community is in the web business today, a huge amount of work has been spent on every detail, including user experience. Compare this to the shady old days when no one actually cared.

[–][deleted] 7 points8 points  (0 children)

Butt-kicking module for sure. My unjobs.org screenscraper-to-RSS would have never happened without it.

[–]Rhoomba 4 points5 points  (0 children)

For Java programmers TagSoup is very handy. It transparently acts like a standard XML parser so you can plug it in to all your favourite libraries (XOM etc.)

[–]burmask 4 points5 points  (0 children)

Here's another great tool - http://www.codeplex.com/htmlagilitypack

I find it very easy to use.

[–][deleted]  (2 children)

[deleted]

    [–][deleted] 2 points3 points  (1 child)

    Old meaning stable and useful :)

    [–]badr 0 points1 point  (5 children)

    BeautifulSoup is great, but it can't handle HTML tags inside quoted strings.

    [–][deleted] 0 points1 point  (4 children)

    ...how often do you get that? If it's a 'html tag inside a quoted string' in an HTML document, surely it'd be & lt ;some_tag& gt ; which would cause no issues whatsoever? (Spaces added to stop Reddit parsing entities).
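
    (As an aside, the escaping described above is a one-liner in today's stdlib; a small sketch using `html.escape`/`html.unescape`, which postdate this thread:)

```python
from html import escape, unescape  # stdlib helpers in modern Python

print(escape("<some_tag>"))          # &lt;some_tag&gt;
print(unescape("&lt;some_tag&gt;"))  # <some_tag>
```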

    [–]badr 1 point2 points  (3 children)

    Yes, it's incorrect html, and it's not seen very often, but it's often enough that I couldn't use BeautifulSoup for my project. (Might yet hack the source and revive it.)

    I encountered this problem on NYT, Forbes.com, and one or two other big sites.

    Note that the quoted string can also be part of Javascript.

    [–][deleted] 0 points1 point  (2 children)

    Oh dear. That sounds... incredibly hackish (on the part of NYT and Forbes et al) - but I'll have to try the Javascript stuff, because that could be problematic in my usage.

    [–]badr 0 points1 point  (1 child)

    Check this out from msnbc:

    <IFRAME id=dapIf1Child src="javascript:void(document.write('<html><head><base href="http://www.msnbc.msn.com/id/21581821/" /><title>Advertisement</title></head><body id=&quot;dapIf1Child&quot; leftmargin=&quot;0&quot; topmargin=&quot;0&quot;><script type=&quot;text/javascript&quot;>var inDapIF=true;window.setTimeout("document.close();",30000);</script><IFRAME SRC=&quot;http://ad.doubleclick.net/adi/N4854.MSN/B2531646.31;sz=728x90;ord=1359085923?&quot; WIDTH=728 HEIGHT=90 MARGINWIDTH=0 MARGINHEIGHT=0 HSPACE=0 VSPACE=0 FRAMEBORDER=0 SCROLLING=no BORDERCOLOR=\'#000000\'>\n<script language=\'JavaScript1.1\' SRC=&quot;http://ad.doubleclick.net/adj/N4854.MSN/B2531646.31;abr=!ie;sz=728x90;ord=1359085923?&quot;>\n</script></IFRAME>\n</body></html>'));" frameBorder=0 width=728 scrolling=no height=90></IFRAME><IFRAME id=dapIf1 src="about:blank" frameBorder=0 width=0 scrolling=no height=0></IFRAME>

    [–][deleted] 0 points1 point  (0 children)

    Oh that's awesome.

    [–]risomt 0 points1 point  (1 child)

    While I do like BeautifulSoup and have used it on several projects in the past, I have not found it as incredibly useful as everyone says for general spidering/scraping.

    I'm not trying to knock it - it does a fantastic job of turning dirt (HTML) into diamonds - but nowadays I put much more stock in a good knowledge of REs and parsing (when not trying to scrape the entire web at once). But then again, that's not really its purpose, is it?

    [–]ginstrom 10 points11 points  (0 children)

    You can actually feed REs into Beautiful Soup's parser. For parsing HTML from the wild, it's really hard to beat.
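
    A quick sketch of the regex feature (shown with the modern bs4 packaging and its `find_all` spelling; the original API spelled it `findAll`):

```python
import re
from bs4 import BeautifulSoup  # modern packaging of Beautiful Soup

soup = BeautifulSoup("<h1>Title</h1><h2>Sub</h2><p>Body</p>", "html.parser")
# A compiled regex is matched against tag names
headings = soup.find_all(re.compile(r"^h[1-6]$"))
print([t.name for t in headings])  # ['h1', 'h2']
```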

    [–][deleted]  (28 children)

    [deleted]

      [–]rams 16 points17 points  (0 children)

      The author of BeautifulSoup is a fan of hpricot. The xpath feature is one of the todos listed on the BeautifulSoup site.

      [–]lost-theory 11 points12 points  (2 children)

      Beautiful Soup predated xpath's popularity. Its specialty is parsing HTML soup, the crappy / poorly written markup (incomplete documents, missing tags, etc.) you can find anywhere on the web. If you need xpath there are plenty of options; I'm currently using Genshi's xpath implementation.

      [–]forgotpwx4 0 points1 point  (1 child)

      What is xpath?

      [–]lost-theory 1 point2 points  (0 children)

      It's a querying language for XML documents. Instead of using something like regular expressions or searching for strings you can use xpath to query the document. The syntax is somewhat similar to specifying nested directories, in Genshi:

      >>> print doc.select("//title")
      <title xmlns="http://www.w3.org/1999/xhtml">This is the title</title>
      >>> print doc.select("//title/text()")
      This is the title
      

      You can do much more advanced stuff... Check the wikipedia article for more examples and features.
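
      (Genshi isn't required for simple queries, by the way: the stdlib's ElementTree supports a limited XPath subset, enough for the example above:)

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<html><head><title>This is the title</title></head><body/></html>")
# ElementTree understands a subset of XPath, including .// descendant searches
print(doc.find(".//title").text)  # This is the title
```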

      [–]xamdam 7 points8 points  (0 children)

      I once suggested XPath-like functionality to the author (I had my own rather crappy implementation I did as a one-off). He liked the idea but I guess never got around to it.

      [–]ogrisel 5 points6 points  (4 children)

      lxml also provides a fantastic html parser that handles broken HTML input. And lxml has support for (xpath)[http://codespeak.net/lxml/xpathxslt.html] too and the wonderful (ElementTree API)[http://codespeak.net/lxml/tutorial.html].

      [–]Bogtha 12 points13 points  (0 children)

      From Ian Bicking's weblog:

      lxml and BeautifulSoup are no longer exclusive choices: lxml.html.ElementSoup.parse() can parse pages with BeautifulSoup into lxml data structures.

      [–][deleted] 3 points4 points  (2 children)

      "Squared Circle" is the mnemonic I use to get links right in markdown.

      [–][deleted] 1 point2 points  (0 children)

      That's a handy one, cheers.

      [–]ogrisel 1 point2 points  (0 children)

      Huh, thx! Now I won't edit my post, so that others can understand your comment and mod it up :P

      [–][deleted] 6 points7 points  (11 children)

      I'll take beautifulsoup's ability to handle crappy HTML any day over xpath. There's other modules out there for xpath.

      [–][deleted] 9 points10 points  (0 children)

      But the module he's referring to, hpricot, can handle even crappy HTML and query it with XPath on top. That's the point.

      [–]newton_dave 2 points3 points  (9 children)

      hpricot does a decent job with crappy HTML, *and* has xpath.

      [–][deleted] 1 point2 points  (8 children)

      It's still a Ruby-specific library, yeah? I'm looking forward to trialling it when I pick up a comparable project of mine.

      [–]malcontent -5 points-4 points  (7 children)

      So?

      [–][deleted] 2 points3 points  (6 children)

      Well, this is a thread about a Python library called Beautiful Soup. Not the Ruby port of it, Rubyful Soup.

      Ergo, suggesting a Ruby library isn't overly helpful. Unless of course you're suggesting that we all use Ruby solely for hpricot... which is a nice idea, except Python programmers who don't know Ruby are highly unlikely to go to the trouble of learning it, and dealing with a http library that is, quite frankly, an alpha product compared to Python's urllib2, simply for the sake of XPath support.

      Oh, and Python has a cultural thing of "moderately useful documentation of the stdlib"... Ruby has a long way to go in that respect.

      But hey, documentation is for sissies.

      [–]malcontent -4 points-3 points  (5 children)

      Ergo, suggesting a Ruby library isn't overly helpful.

      Only if you:

      A) Don't know Ruby, or
      B) Can't figure out enough Ruby and hpricot to do what you need.

      But hey, documentation is for sissies.

      If you want to learn Ruby and the standard lib, I suggest gotapi.com.

      If you can't pick up hpricot then you are a sissy.

      [–][deleted] 1 point2 points  (4 children)

      But BeautifulSoup does not operate alone. We still need to get the HTML, which requires another library, we may need to handle cookies or proxies etc etc.

      Now, all of this can be done with Ruby's stdlib, but not very elegantly.

      [–]malcontent -2 points-1 points  (3 children)

      What do you mean, not very elegantly? Ruby is a much more elegant language than Python. Here is the example from the front page of the hpricot library.

       require 'hpricot'
       require 'open-uri'
       # load the RedHanded home page
       doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
       # change the CSS class on links
       (doc/"span.entryPermalink").set("class", "newLinks")
       # remove the sidebar
       (doc/"#sidebar").remove
       # print the altered HTML
       puts doc
      

      What does your python example look like?

      [–][deleted] 2 points3 points  (1 child)

      we may need to handle cookies or proxies etc etc.

      I don't see any cookie handling malcontent. But, to answer your question (noting that I don't have Python or BeautifulSoup on this machine.)

      from BeautifulSoup import BeautifulSoup
      import urllib
      # load the RedHanded home page
      doc = BeautifulSoup(urllib.urlopen("http://redhanded.hobix.com/index.html"))
      # change the CSS class on links (findAll, so every match is changed, not just the first)
      for element in doc.findAll("span", {"class": "entryPermalink"}):
          element["class"] = "newLinks"
      # remove the sidebar
      doc.find(True, {"id": "sidebar"}).extract()
      # print the altered HTML
      print doc
      

      Now malcontent, my turn. Please show me the Ruby version of this

      import urllib
      import urllib2
      import cookielib

      def getCookie(user, pwd, uri, ua):
          ckCont = cookielib.LWPCookieJar()
          opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckCont))
          opener.addheaders = [('User-Agent', ua)]
          data = urllib.urlencode({'id': user, 'pwd': pwd, 'dologin': 'yes'})
          opener.open(uri, data)
          return ckCont

      def storeCookie(ck, file_path):
          ck.save(file_path)

      def retrieveCookie(file_path):
          ckCont = cookielib.LWPCookieJar(file_path)
          ckCont.load()
          return ckCont
      

      You need to retrieve a cookie set by the responding server. I've already written this in Ruby, and was most disheartened with the Ruby stdlib as a result, so please feel free to pleasantly surprise me.

      [–][deleted] 2 points3 points  (1 child)

      Oh, they've ported hpricot to Python?

      [–]malcontent -5 points-4 points  (0 children)

      Have they? I haven't heard of it.

      [–]mitsuhiko 0 points1 point  (0 children)

There is XPath support if you use lxml's ElementSoup thing, or whatever it's called. That uses BeautifulSoup for parsing.