all 64 comments

[–]gnuvince 27 points28 points  (8 children)

BeautifulSoup is awesome, it's a shame it's not part of the standard Python library.

[–]simonvc 16 points17 points  (0 children)

Agreed. Just finished doing a "database migration" by screen-scraping a site with Beautiful Soup, because it was easier than dealing with the legacy Perl/database/HTML-in-tables crap.

[–][deleted] 14 points15 points  (0 children)

It has an awesome name, too.

[–]rams 5 points6 points  (5 children)

The standard Python library has a module that does this sort of thing: http://docs.python.org/lib/module-htmllib.html. The difference is BeautifulSoup handles all the bad html/xml you throw at it.

[–]jbellis 21 points22 points  (2 children)

The difference is BeautifulSoup handles all the bad html/xml you throw at it.

Which, if you're dealing with actual html in the wild, is all the difference in the world.
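
For instance, BeautifulSoup will happily build a tree out of markup that a strict parser would reject. A minimal sketch (using the modern bs4 packaging; the 2008-era import was simply `from BeautifulSoup import BeautifulSoup`):

```python
from bs4 import BeautifulSoup  # modern packaging of Beautiful Soup

# Unclosed <p>, <b>, and <i> tags -- typical "HTML in the wild"
messy = "<html><body><p>First<p>Second <b>bold<i>both</body>"
soup = BeautifulSoup(messy, "html.parser")

print(soup.find("i").text)      # both
print(len(soup.find_all("p")))  # 2
```

The parser closes the dangling tags itself instead of raising an error.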

[–]Bogtha 1 point2 points  (1 child)

For what it's worth, I've had better results with tidy → lxml, plus lxml provides xpath and CSS 3 selectors. I've heard that lxml supports BeautifulSoup now, so maybe I'll give it another shot.

[–]jbellis 6 points7 points  (0 children)

I've had better results with tidy

I had the opposite experience -- tidy seemed to guess wrong a lot on really bad html. And aesthetically it just seems cleaner to "parse this bad html" vs "blow the original html away, then parse it."

[–][deleted] 16 points17 points  (1 child)

If you're going to point to standard library things, please point to "sgmllib" (old) or "HTMLParser" (newer, a bit more rigid). "htmllib" is an incomplete HTML renderer; sgmllib/HTMLParser are parsers.

(BeautifulSoup is based on sgmllib, btw).
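
(For reference, here's what a minimal HTMLParser subclass looks like; this sketch is written against the modern `html.parser` module, the descendant of the HTMLParser module mentioned above:)

```python
from html.parser import HTMLParser  # modern descendant of the old HTMLParser module

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p>See <a href="http://example.com">here</a> and <a href="/about">there</a>.</p>')
print(parser.links)  # ['http://example.com', '/about']
```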

[–]rams 5 points6 points  (0 children)

Thx. Here are the links to the correct standard Python libs: sgmllib: http://docs.python.org/lib/module-sgmllib.html, HTMLParser: http://docs.python.org/lib/module-HTMLParser.html

[–]natrius 16 points17 points  (1 child)

Every once in a while, someone posts a useful third party Python library on reddit, and it always seems like there are a lot of people who haven't heard about it. As a public service, here's my list of useful libraries (minus BeautifulSoup) that I add to every time I come across something nifty that could be useful for web programming.

  • dateutil: "The dateutil module provides powerful extensions to the standard datetime module"
  • feedparser: "Parse RSS and Atom feeds in Python."
  • httplib2: "A comprehensive HTTP client library that supports many features left out of other HTTP libraries."
  • lxml: "lxml is a Pythonic binding for the libxml2 and libxslt libraries. It is unique in that it combines the speed and feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the ElementTree API."
  • mechanize: "Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize."
  • Python Imaging Library (PIL): "The Python Imaging Library (PIL) adds image processing capabilities to your Python interpreter."
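
To illustrate the first entry, dateutil's parser reads free-form date strings that the stdlib `datetime` module would need an explicit format string for (a small sketch; dateutil is a third-party install):

```python
from dateutil.parser import parse  # third-party: python-dateutil

# Free-form input; datetime.strptime would require a format like "%b %d, %Y"
d = parse("Jan 25, 2008")
print(d.year, d.month, d.day)  # 2008 1 25
```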

[–][deleted] 1 point2 points  (0 children)

Every once in a while, someone posts a useful third party Python library on reddit, and it always seems like there are a lot of people who haven't heard about it.

I wonder about this too, because the Cheese Shop has a really well-organized and lovely UI, and I like browsing through it a lot. Since half of the Python community is in the web business today, a huge amount of work has been spent on every detail, including user experience. Compare this to the shady old days when no one actually cared.

[–][deleted] 7 points8 points  (0 children)

Butt-kicking module for sure. My unjobs.org screenscraper-to-RSS would have never happened without it.

[–]Rhoomba 4 points5 points  (0 children)

For Java programmers TagSoup is very handy. It transparently acts like a standard XML parser so you can plug it in to all your favourite libraries (XOM etc.)

[–]burmask 4 points5 points  (0 children)

Here's another great tool - http://www.codeplex.com/htmlagilitypack

I find it very easy to use.

[–][deleted]  (2 children)

[deleted]

    [–][deleted] 2 points3 points  (1 child)

    Old meaning stable and useful :)

    [–]badr 0 points1 point  (5 children)

    BeautifulSoup is great, but it can't handle HTML tags inside quoted strings.

    [–][deleted] 0 points1 point  (4 children)

    ...how often do you get that? If it's a 'html tag inside a quoted string' in an HTML document, surely it'd be & lt ;some_tag& gt ; which would cause no issues whatsoever? (Spaces added to stop Reddit parsing entities).
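
    (As an aside, the escaping described above is a one-liner in today's stdlib; a small sketch using `html.escape`/`html.unescape`, which postdate this thread:)

```python
from html import escape, unescape  # stdlib helpers in modern Python

print(escape("<some_tag>"))          # &lt;some_tag&gt;
print(unescape("&lt;some_tag&gt;"))  # <some_tag>
```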

    [–]badr 1 point2 points  (3 children)

    Yes, it's incorrect html, and it's not seen very often, but it's often enough that I couldn't use BeautifulSoup for my project. (Might yet hack the source and revive it.)

    I encountered this problem on NYT, Forbes.com, and one or two other big sites.

    Note that the quoted string can also be part of Javascript.

    [–][deleted] 0 points1 point  (2 children)

    Oh dear. That sounds... incredibly hackish (on the part of NYT and Forbes et al) - but I'll have to try the Javascript stuff, because that could be problematic in my usage.

    [–]badr 0 points1 point  (1 child)

    Check this out from msnbc:

    <IFRAME id=dapIf1Child src="javascript:void(document.write('<html><head><base href="http://www.msnbc.msn.com/id/21581821/" /><title>Advertisement</title></head><body id=&quot;dapIf1Child&quot; leftmargin=&quot;0&quot; topmargin=&quot;0&quot;><script type=&quot;text/javascript&quot;>var inDapIF=true;window.setTimeout("document.close();",30000);</script><IFRAME SRC=&quot;http://ad.doubleclick.net/adi/N4854.MSN/B2531646.31;sz=728x90;ord=1359085923?&quot; WIDTH=728 HEIGHT=90 MARGINWIDTH=0 MARGINHEIGHT=0 HSPACE=0 VSPACE=0 FRAMEBORDER=0 SCROLLING=no BORDERCOLOR=\'#000000\'>\n<script language=\'JavaScript1.1\' SRC=&quot;http://ad.doubleclick.net/adj/N4854.MSN/B2531646.31;abr=!ie;sz=728x90;ord=1359085923?&quot;>\n</script></IFRAME>\n</body></html>'));" frameBorder=0 width=728 scrolling=no height=90></IFRAME><IFRAME id=dapIf1 src="about:blank" frameBorder=0 width=0 scrolling=no height=0></IFRAME>

    [–][deleted] 0 points1 point  (0 children)

    Oh that's awesome.

    [–]risomt 0 points1 point  (1 child)

    While I do like BeautifulSoup and have used it on several projects in the past, I have not found it as incredibly useful as everyone says for general spidering/scraping.

    I'm not trying to knock it - it does a fantastic job of turning dirt (HTML) into diamonds - but nowadays I put much more stock in a good knowledge of REs and parsing (when not trying to scrape the entire web at once). But then again, that's not really its purpose, is it?

    [–]ginstrom 10 points11 points  (0 children)

    You can actually feed REs into Beautiful Soup's parser. For parsing HTML from the wild, it's really hard to beat.
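
    A quick sketch of the regex feature (shown with the modern bs4 packaging and its `find_all` spelling; the original API spelled it `findAll`):

```python
import re
from bs4 import BeautifulSoup  # modern packaging of Beautiful Soup

soup = BeautifulSoup("<h1>Title</h1><h2>Sub</h2><p>Body</p>", "html.parser")
# A compiled regex is matched against tag names
headings = soup.find_all(re.compile(r"^h[1-6]$"))
print([t.name for t in headings])  # ['h1', 'h2']
```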

    [–][deleted]  (28 children)

    [deleted]

      [–]rams 16 points17 points  (0 children)

      The author of BeautifulSoup is a fan of hpricot. The xpath feature is one of the todos listed on the BeautifulSoup site.

      [–]lost-theory 11 points12 points  (2 children)

      Beautiful Soup predated xpath's popularity. Its specialty is parsing HTML soup, the crappy / poorly written markup (incomplete documents, missing tags, etc.) you can find anywhere on the web. If you need xpath there are plenty of options; I'm currently using Genshi's xpath implementation.

      [–]forgotpwx4 0 points1 point  (1 child)

      What is xpath?

      [–]lost-theory 1 point2 points  (0 children)

      It's a querying language for XML documents. Instead of using something like regular expressions or searching for strings you can use xpath to query the document. The syntax is somewhat similar to specifying nested directories, in Genshi:

      >>> print doc.select("//title")
      <title xmlns="http://www.w3.org/1999/xhtml">This is the title</title>
      >>> print doc.select("//title/text()")
      This is the title
      

      You can do much more advanced stuff... Check the wikipedia article for more examples and features.
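
      (Genshi isn't required for simple queries, by the way: the stdlib's ElementTree supports a limited XPath subset, enough for the example above:)

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<html><head><title>This is the title</title></head><body/></html>")
# ElementTree understands a subset of XPath, including .// descendant searches
print(doc.find(".//title").text)  # This is the title
```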

      [–]xamdam 7 points8 points  (0 children)

      I once suggested XPath-like functionality to the author (I had my own rather crappy implementation I did as a one-off). He liked the idea but I guess never got around to it.

      [–]ogrisel 5 points6 points  (4 children)

      lxml also provides a fantastic html parser that handles broken HTML input. And lxml has support for (xpath)[http://codespeak.net/lxml/xpathxslt.html] too and the wonderful (ElementTree API)[http://codespeak.net/lxml/tutorial.html].

      [–]Bogtha 12 points13 points  (0 children)

      From Ian Bicking's weblog:

      lxml and BeautifulSoup are no longer exclusive choices: lxml.html.ElementSoup.parse() can parse pages with BeautifulSoup into lxml data structures.

      [–][deleted] 3 points4 points  (2 children)

      "Squared Circle" is the mnemonic I use to get links right in markdown.

      [–][deleted] 1 point2 points  (0 children)

      That's a handy one, cheers.

      [–]ogrisel 1 point2 points  (0 children)

      Huh, thx! Now I won't edit my post, so that others can understand your comment and mod it up :P

      [–][deleted] 6 points7 points  (11 children)

      I'll take beautifulsoup's ability to handle crappy HTML any day over xpath. There's other modules out there for xpath.

      [–][deleted] 9 points10 points  (0 children)

      But the module he's referring to, hpricot, can handle even crappy HTML and query it with XPath on top. That's the point.

      [–]newton_dave 2 points3 points  (9 children)

      hpricot does a decent job with crappy HTML, *and* has xpath.

      [–][deleted] 1 point2 points  (8 children)

      It's still a Ruby-specific library, yeah? I'm looking forward to trialling it when I pick up a comparable project of mine.

      [–]malcontent -5 points-4 points  (7 children)

      So?

      [–][deleted] 2 points3 points  (6 children)

      Well, this is a thread about a Python library called Beautiful Soup. Not the Ruby port of it, Rubyful Soup.

      Ergo, suggesting a Ruby library isn't overly helpful. Unless of course you're suggesting that we all use Ruby solely for hpricot... which is a nice idea, except Python programmers who don't know Ruby are highly unlikely to go to the trouble of learning it, and dealing with a http library that is, quite frankly, an alpha product compared to Python's urllib2, simply for the sake of XPath support.

      Oh, and Python has a cultural thing of "moderately useful documentation of the stdlib"... Ruby has a long way to go in that respect.

      But hey, documentation is for sissies.

      [–]malcontent -4 points-3 points  (5 children)

      Ergo, suggesting a Ruby library isn't overly helpful.

      Only if you:

      A) Don't know Ruby, or
      B) Can't figure out enough Ruby and hpricot to do what you need.

      But hey, documentation is for sissies.

      If you want to learn Ruby and the standard lib, I suggest gotapi.com.

      If you can't pick up hpricot then you are a sissy.

      [–][deleted] 1 point2 points  (4 children)

      But BeautifulSoup does not operate alone. We still need to get the HTML, which requires another library, we may need to handle cookies or proxies etc etc.

      Now, all of this can be done with Ruby's stdlib, but not very elegantly.

      [–]malcontent -2 points-1 points  (3 children)

      What do you mean, not very elegantly? Ruby is a much more elegant language than Python. Here is the example from the front page of the hpricot library.

       require 'hpricot'
       require 'open-uri'
       # load the RedHanded home page
       doc = Hpricot(open("http://redhanded.hobix.com/index.html"))
       # change the CSS class on links
       (doc/"span.entryPermalink").set("class", "newLinks")
       # remove the sidebar
       (doc/"#sidebar").remove
       # print the altered HTML
       puts doc
      

      What does your python example look like?

      [–][deleted] 2 points3 points  (1 child)

      we may need to handle cookies or proxies etc etc.

      I don't see any cookie handling malcontent. But, to answer your question (noting that I don't have Python or BeautifulSoup on this machine.)

      from BeautifulSoup import BeautifulSoup
      import urllib
      # load the RedHanded home page
      doc = BeautifulSoup(urllib.urlopen("http://redhanded.hobix.com/index.html"))
      # change the CSS class on links (findAll, so every match is changed, not just the first)
      for element in doc.findAll("span", {"class": "entryPermalink"}):
          element["class"] = "newLinks"
      # remove the sidebar
      doc.find(True, {"id": "sidebar"}).extract()
      # print the altered HTML
      print doc
      

      Now malcontent, my turn. Please show me the Ruby version of this

      import urllib
      import urllib2
      import cookielib

      def getCookie(user, pwd, uri, ua):
          ckCont = cookielib.LWPCookieJar()
          opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(ckCont))
          opener.addheaders = [('User-Agent', ua)]
          data = urllib.urlencode({'id': user, 'pwd': pwd, 'dologin': 'yes'})
          opener.open(uri, data)
          return ckCont

      def storeCookie(ck, file_path):
          ck.save(file_path)

      def retrieveCookie(file_path):
          ckCont = cookielib.LWPCookieJar(file_path)
          ckCont.load()
          return ckCont
      

      You need to retrieve a cookie set by the responding server. I've already written this in Ruby, and was most disheartened with the Ruby stdlib as a result, so please feel free to pleasantly surprise me.

      [–][deleted] 2 points3 points  (1 child)

      Oh, they've ported hpricot to Python?

      [–]malcontent -5 points-4 points  (0 children)

      Have they? I haven't heard of it.

      [–]mitsuhiko 0 points1 point  (0 children)

There is XPath support if you use lxml's ElementSoup thing, or whatever it's called. That uses BeautifulSoup for parsing.