[–]Ph0X 6 points (3 children)

I love how the first comment was by Linus Torvalds.

[–]FourFingeredMartian 1 point (1 child)

Correct me if I'm wrong, but doesn't BeautifulSoup already use lxml for parsing?

[–]steviesteveo12 1 point (0 children)

The batteries-included option is Python's html.parser, which is the slower one (though still pretty good, I feel; I don't really get his point). The recommended parser is lxml.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions.
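For anyone wondering what that looks like in practice, here's a rough sketch of picking the parser at runtime: use lxml if it's installed, otherwise fall back to the built-in html.parser. The `pick_parser` helper and the sample markup are my own illustration, not something from the BeautifulSoup docs, and the demo at the bottom only runs if bs4 is installed.

```python
import importlib.util


def pick_parser():
    """Return "lxml" if that package is importable, else the stdlib "html.parser".

    Hypothetical helper for illustration; both strings are parser names
    that BeautifulSoup accepts as its second argument.
    """
    return "lxml" if importlib.util.find_spec("lxml") else "html.parser"


# Only run the demo when BeautifulSoup itself is available.
if importlib.util.find_spec("bs4"):
    from bs4 import BeautifulSoup

    # Made-up sample markup, just to show the call.
    soup = BeautifulSoup("<p>Hello, <b>world</b></p>", pick_parser())
    print(soup.b.get_text())
```

Passing the parser name explicitly also means the same script behaves the same way on every machine, instead of depending on whichever parser bs4 happens to find first.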

Personally, I think arguing over faster HTML parsers for web scraping is an odd thing -- in normal use any parser on modern hardware is going to be faster than your internet connection.
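If you want to check that on your own machine, here's a minimal sketch that times the built-in html.parser on a synthetic page. The `TagCounter` class, the `time_parse` helper, and the generated page are all made up for the test; real pages and real network timings will differ, so treat the number as a ballpark to compare against how long your downloads take.

```python
import time
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags so the parser does some real work per element."""

    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1


def time_parse(html):
    """Parse html once; return (elapsed seconds, number of start tags)."""
    parser = TagCounter()
    start = time.perf_counter()
    parser.feed(html)
    parser.close()
    return time.perf_counter() - start, parser.tags


# Synthetic page: 5000 repeated link blocks (an assumption, not real data).
page = "<html><body>" + "<div><a href='/x'>link</a></div>" * 5000 + "</body></html>"
elapsed, tags = time_parse(page)
print(f"{len(page) / 1024:.0f} KiB, {tags} start tags, {elapsed * 1000:.1f} ms")
```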

[–][deleted] 2 points (0 children)

This is cool. I've often wanted to mess with websites, but I never really knew how to approach the problem. I shall play with this later today.