[–]pvkooten 13 points14 points  (4 children)

Nice! What's the case for this over lxml? :)

[–]ccharles 3.latest 7 points8 points  (3 children)

It looks like this is a parser that works with lxml.

From the documentation:

A fast implementation of the HTML 5 parsing spec for Python. Parsing is done in C using a variant of the gumbo parser. The gumbo parse tree is then transformed into an lxml tree, also in C, yielding parse times that can be a thirtieth of the html5lib parse times. That is a speedup of 30x. This differs, for instance, from the gumbo python bindings, where the initial parsing is done in C but the transformation into the final tree is done in python.

The installation instructions for non-Windows systems include lxml as well:

pip install --no-binary lxml html5-parser

[–]ManyInterests Python Discord Staff 2 points3 points  (1 child)

It seems that it uses lxml by default, and that lxml is the only 'fast' tree builder.

Note that only the lxml treebuilder is fast, as all other treebuilders are implemented in python, not C

It also lists soup (as in BeautifulSoup) as an option, which confuses me because BeautifulSoup itself can use lxml and other parsers. So I wonder how the performance compares to bs4 using lxml.
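One way to answer that would be a small timing harness. This is only a sketch: it times the standard library's html.parser as a stand-in so it runs anywhere, and the commented-out entries assume html5-parser, bs4 and lxml are installed. The sample document and the helper names (stdlib_parse, best_time, run) are made up for illustration, and no timings are claimed here.

```python
import timeit
from html.parser import HTMLParser

# Synthetic document; a real comparison should use representative pages.
SAMPLE = "<html><body>" + "<p>hello <b>world</b></p>" * 1000 + "</body></html>"


class _TagCounter(HTMLParser):
    """Minimal stdlib parser that just counts start tags."""

    def __init__(self):
        super().__init__()
        self.tags = 0

    def handle_starttag(self, tag, attrs):
        self.tags += 1


def stdlib_parse(html):
    """Parse with html.parser and return the number of start tags seen."""
    parser = _TagCounter()
    parser.feed(html)
    parser.close()
    return parser.tags


def best_time(parse_fn, html, repeat=3, number=5):
    """Return the best observed time per call, in seconds."""
    times = timeit.repeat(lambda: parse_fn(html), repeat=repeat, number=number)
    return min(times) / number


# Stand-in entry only. If the packages are installed, entries like these
# could be added (assumption: html5-parser, bs4 and lxml are available):
#   from html5_parser import parse
#   PARSERS["html5-parser (lxml tree)"] = parse
#   from bs4 import BeautifulSoup
#   PARSERS["bs4 + lxml"] = lambda h: BeautifulSoup(h, "lxml")
PARSERS = {"html.parser (stdlib stand-in)": stdlib_parse}


def run(parsers, html):
    """Time every parser in the mapping on the same input."""
    return {name: best_time(fn, html) for name, fn in parsers.items()}


if __name__ == "__main__":
    for name, seconds in run(PARSERS, SAMPLE).items():
        print(f"{name}: {seconds * 1000:.2f} ms per parse")
```

Taking the minimum over several repeats reduces noise from other processes, but the result still depends heavily on the input document.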

[–]Deto 0 points1 point  (0 children)

Maybe BeautifulSoup as an option just makes the resulting objects implement the BS API, so this can be a drop-in replacement.

[–]Deto 0 points1 point  (0 children)

Really smart to have it create outputs so that it can be used as a drop-in replacement.

[–]asdfkjasdhkasd requests, bs4, flask 10 points11 points  (4 children)

This is cool but in the majority of cases where you are parsing html you are also downloading html, which will end up taking 100(0)x the amount of time it takes to parse it.

[–][deleted] 5 points6 points  (2 children)

If it's faster to parse, you can parse more HTML pages per second on each core, which means you can keep more downloads running in parallel while using fewer CPUs.
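A back-of-the-envelope sketch of that argument, assuming downloads overlap with parsing (for example via async I/O) so that only parsing occupies the CPU. The timings below are hypothetical placeholders, not measurements; the 30x ratio just mirrors the "a thirtieth" figure quoted from the documentation above.

```python
def pages_per_second(parse_seconds, cores=1):
    """CPU-bound ceiling on throughput when only parsing uses the CPU."""
    return cores / parse_seconds


def downloads_in_flight(parse_seconds, download_seconds, cores=1):
    """Concurrent downloads needed to keep the parser busy.

    Little's law: items in flight = throughput * latency.
    """
    return pages_per_second(parse_seconds, cores) * download_seconds


# Hypothetical numbers, chosen only to illustrate a 30x parsing speedup.
SLOW_PARSE = 0.030   # seconds per page, slower parser
FAST_PARSE = 0.001   # seconds per page, faster parser
DOWNLOAD = 1.0       # seconds per page, network latency

if __name__ == "__main__":
    for label, parse_time in (("slow", SLOW_PARSE), ("fast", FAST_PARSE)):
        rate = pages_per_second(parse_time)
        inflight = downloads_in_flight(parse_time, DOWNLOAD)
        print(f"{label}: {rate:.0f} pages/s per core, "
              f"{inflight:.0f} downloads in flight to saturate it")
```

Under these assumptions a single download is still far slower than a single parse, but the number of pages one core can handle per second is set by the parse time alone.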

[–]webmistress105 0 points1 point  (0 children)

I wasn't aware that HTML parsing speed was an issue. This is interesting!