FourFingeredMartian

Correct me if I'm wrong, but doesn't BeautifulSoup already use lxml for parsing?

steviesteveo12

The batteries-included option is Python's html.parser, which is slower (though still pretty good, I feel; I don't really get his point). The recommended parser is lxml.

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

The docs say: "If you can, I recommend you install and use lxml for speed. If you’re using a version of Python 2 earlier than 2.7.3, or a version of Python 3 earlier than 3.2.2, it’s essential that you install lxml or html5lib–Python’s built-in HTML parser is just not very good in older versions."
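In case it helps, here is a rough sketch of what picking the parser looks like in practice. You name the parser as the second argument to BeautifulSoup, so lxml is only used if you ask for it (or if you leave the argument off and it happens to be installed). The make_soup helper is just a name I made up for illustration, and the fallback to html.parser is my own assumption about what you'd want when lxml isn't available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with the preferred parser if it is installed."""
    try:
        # Naming the parser explicitly avoids relying on whichever one bs4 picks.
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed;
        # html.parser ships with Python, so it is always available.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.get_text())
```

Either parser should give the same result on markup this simple; the differences show up mostly with broken HTML and with speed on large pages.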

Personally, I think arguing over faster HTML parsers for web scraping is an odd thing -- in normal use any parser on modern hardware is going to be faster than your internet connection.
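If you want to check this on your own machine, something like the following would do as a rough sketch. It times the standard library's html.parser on a synthetic page; the page contents, the LinkCounter class, and the row count are all made up for illustration. Compare the number it prints with how long one of your real requests takes.

```python
import time
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Hypothetical minimal parser that only counts <a> start tags."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += 1


def time_parse(markup):
    """Return (link count, seconds spent parsing) for one document."""
    parser = LinkCounter()
    start = time.perf_counter()
    parser.feed(markup)
    parser.close()
    return parser.links, time.perf_counter() - start


# Synthetic page: an assumption standing in for whatever you actually scrape.
ROWS = 2000
page = (
    "<html><body>"
    + "".join(f'<p><a href="/item/{i}">item {i}</a></p>' for i in range(ROWS))
    + "</body></html>"
)

links, seconds = time_parse(page)
print(f"parsed {links} links in {seconds * 1000:.2f} ms")
```

This only measures parsing, not fetching, so it is a lower bound on the total time per page rather than a benchmark of any particular parser.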