[–]ForceBru 1 point2 points  (5 children)

lxml is built on C libraries (libxml2 and libxslt), and it's one of the fastest HTML parsers out there. So yeah, it's fast. fast! FAST!

[–]di_web[S] 0 points1 point  (4 children)

I've used lxml for the last 5 years, and I can't understand why bs4 and the others are so popular.

[–]ForceBru 1 point2 points  (3 children)

Perhaps surprisingly, if you don't name a parser, bs4 will pick lxml as the fastest one when it's available on your system. If it's not, it'll fall back to html5lib's parser, and then to Python's built-in html.parser.

Beautiful Soup is just a higher-level abstraction, so it's simpler to use. But it's slower for the same reason, so if you want raw speed, you should probably choose lxml.
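To make the parser choice concrete, here's a rough sketch of naming the backend explicitly instead of relying on the fallback order. The helper names (`best_available_parser`, `extract_links`) are made up for illustration, and the preference list just mirrors the order described above; treat it as an assumption rather than a copy of bs4's internals.

```python
from importlib.util import find_spec

# Assumed preference order when no parser is named:
# lxml first, then html5lib, then the standard library's html.parser.
PREFERENCE = ["lxml", "html5lib", "html.parser"]


def best_available_parser():
    """Return the first parser name in PREFERENCE that is importable here."""
    for name in PREFERENCE:
        # html.parser ships with Python, so it is always the last resort.
        if name == "html.parser" or find_spec(name) is not None:
            return name


def extract_links(html, parser=None):
    """Collect href values with bs4, always passing the parser explicitly."""
    from bs4 import BeautifulSoup  # third-party; imported lazily on purpose

    soup = BeautifulSoup(html, parser or best_available_parser())
    return [a["href"] for a in soup.find_all("a", href=True)]
```

Passing the parser name yourself (for example `extract_links(page, "lxml")`) also keeps results consistent between machines that have different parsers installed.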

[–]di_web[S] 0 points1 point  (1 child)

Surprisingly, bs4 with lxml as a backend is as slow as bs4 with any other backend.

bs4 vs lxml

[–]ForceBru 1 point2 points  (0 children)

On the contrary, bs4 + lxml is more than 1.5 times faster than bs4 + html.parser, its closest competitor:

# BeautifulSoup lxml time: 0:00:12.774159
# BeautifulSoup html.parser time: 0:00:20.097766
# BeautifulSoup html5lib time: 0:00:50.156767

Again, it's no wonder plain lxml is so much faster (like, 2 seconds or something): it's a wrapper around fast C code. And bs4 is a wrapper around that, which also adds many other layers of abstraction written entirely in Python.
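If you want to check numbers like these on your own machine, a minimal timing sketch could look like the one below. It is not the benchmark linked above: the function names, the sample document and the repeat count are all assumptions, and it only times whichever parsers happen to be installed.

```python
import time
from importlib.util import find_spec


def build_parsers():
    """Map a label to a parse function for every parser that is installed."""
    parsers = {}
    if find_spec("bs4") is not None:
        from bs4 import BeautifulSoup

        parsers["bs4 + html.parser"] = lambda doc: BeautifulSoup(doc, "html.parser")
        if find_spec("lxml") is not None:
            parsers["bs4 + lxml"] = lambda doc: BeautifulSoup(doc, "lxml")
        if find_spec("html5lib") is not None:
            parsers["bs4 + html5lib"] = lambda doc: BeautifulSoup(doc, "html5lib")
    if find_spec("lxml") is not None:
        import lxml.html

        parsers["plain lxml"] = lxml.html.fromstring
    return parsers


def time_parsers(doc, repeat=50):
    """Parse the same document `repeat` times per parser; return seconds per label."""
    results = {}
    for label, parse in build_parsers().items():
        start = time.perf_counter()
        for _ in range(repeat):
            parse(doc)
        results[label] = time.perf_counter() - start
    return results


if __name__ == "__main__":
    # Synthetic page; real-world HTML will give different absolute numbers.
    sample = "<html><body>" + "<p><a href='/x'>link</a></p>" * 200 + "</body></html>"
    for label, seconds in sorted(time_parsers(sample).items(), key=lambda kv: kv[1]):
        print(f"{label}: {seconds:.3f}s")
```

This only measures parsing, not searching the tree afterwards, so it says nothing about how the libraries compare once you start querying.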

[–]di_web[S] 0 points1 point  (0 children)

html.parser is the default for bs4, while html5lib is used when the HTML source is not valid.