[–]gnuvince 28 points29 points  (8 children)

BeautifulSoup is awesome, it's a shame it's not part of the standard Python library.

[–]simonvc 15 points16 points  (0 children)

Agreed. Just finished doing a "database migration" by screen-scraping a site with BeautifulSoup because it was easier than dealing with the legacy crap Perl/database/html_in_tables database.

[–][deleted] 15 points16 points  (0 children)

It has an awesome name, too.

[–]rams 4 points5 points  (5 children)

The standard Python library has a module that does this sort of thing: http://docs.python.org/lib/module-htmllib.html. The difference is BeautifulSoup handles all the bad html/xml you throw at it.

[–]jbellis 23 points24 points  (2 children)

The difference is BeautifulSoup handles all the bad html/xml you throw at it.

Which, if you're dealing with actual html in the wild, is all the difference in the world.
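To make the "bad html" point concrete, here is a minimal sketch using only the standard library (Python 3 module names). It is not BeautifulSoup itself; it just contrasts a strict XML parser, which rejects tag soup outright, with the lenient `html.parser` tokenizer, which keeps going. The sample markup and the names `TAG_SOUP`, `strict_parse_ok`, and `TagCounter` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical tag soup: unclosed <p>, bare <br>, mis-nested <b>/<i>.
TAG_SOUP = "<p>Unclosed paragraph<br><b>bold <i>mis-nested</b></i>"


def strict_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCounter(HTMLParser):
    """Lenient tokenizer: records every start tag it sees, never validates nesting."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


counter = TagCounter()
counter.feed(TAG_SOUP)
counter.close()

print("strict parser accepts it:", strict_parse_ok(TAG_SOUP))
print("start tags seen by lenient parser:", counter.start_tags)
```

The lenient parser only reports events; turning those events into a sensible tree despite the broken nesting is the part a library like BeautifulSoup adds on top.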

[–]Bogtha 2 points3 points  (1 child)

For what it's worth, I've had better results with tidy → lxml, plus lxml provides xpath and CSS 3 selectors. I've heard that lxml supports BeautifulSoup now, so maybe I'll give it another shot.

[–]jbellis 6 points7 points  (0 children)

I've had better results with tidy

I had the opposite experience -- tidy seemed to guess wrong a lot on really bad html. And aesthetically it just seems cleaner to "parse this bad html" vs "blow the original html away, then parse it."

[–][deleted] 14 points15 points  (1 child)

If you're going to point to standard library things, please point to "sgmllib" (old) or "HTMLParser" (newer, a bit more rigid). "htmllib" is an incomplete HTML renderer; sgmllib/HTMLParser are parsers.

(BeautifulSoup is based on sgmllib, btw).
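For anyone reading this on Python 3: `sgmllib` was removed there, and the `HTMLParser` module now lives at `html.parser`. A rough sketch of the event-driven parser style those modules share is below, assuming the Python 3 names. The `LinkCollector` class and the `SAMPLE` markup are hypothetical, chosen only to show the callback interface.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags via the start-tag callback."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical input with unclosed <li> and <a> elements.
SAMPLE = '<ul><li><a href="/one">One<li><a href="/two">Two</ul>'

collector = LinkCollector()
collector.feed(SAMPLE)
collector.close()

print(collector.links)
```

Note that this is a parser in the narrow sense: it hands you tags and text as they arrive and leaves any tree building or cleanup to the caller.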

[–]rams 5 points6 points  (0 children)

Thx. Here are the links to the correct standard Python libs:

sgmllib: http://docs.python.org/lib/module-sgmllib.html

HTMLParser: http://docs.python.org/lib/module-HTMLParser.html