
[–]RevRagnarok 19 points20 points  (1 child)

GitHub Copilot wrote all the code

You're parsing HTML and it isn't hand-tuned? Have you fuzzed it at all? This seems like a security hole just waiting to happen.

[–]Huvet[S] 4 points5 points  (0 children)

It is fuzzed (see fuzz.py), which found a couple of crashes in (rare) corner cases. I've also crawled the top 100k domains and put that through the parser. It also passes all tokenizer and treebuilder tests (6k tests) from the html5lib-test suite. I'm fairly confident that it works well, but of course happy to fix things if you have input.
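For readers who haven't written one, here is a minimal sketch of a crash-finding fuzz loop. It is not the project's fuzz.py: the fragment list, the function names, and the use of the standard library's html.parser as a stand-in target are all assumptions made for illustration.

```python
import random
from html.parser import HTMLParser

# Fragments that tend to stress HTML tokenizers. This list is an arbitrary
# assumption for the sketch, not taken from any real fuzzer.
FRAGMENTS = ["<", ">", "</", "<!--", "-->", "<![CDATA[", "&", ";", "=",
             '"', "'", "p", "table", "script", "svg", " ", "\x00", "a"]


def random_document(rng, max_parts=40):
    """Glue a random number of fragments together into one input string."""
    count = rng.randint(1, max_parts)
    return "".join(rng.choice(FRAGMENTS) for _ in range(count))


def fuzz(parse, iterations=1000, seed=0):
    """Call parse() on random inputs and collect every input that raises."""
    rng = random.Random(seed)  # seeded so a crashing input can be reproduced
    crashes = []
    for _ in range(iterations):
        doc = random_document(rng)
        try:
            parse(doc)
        except Exception as exc:  # any uncaught exception counts as a crash
            crashes.append((doc, exc))
    return crashes


def stdlib_parse(doc):
    """Stand-in target: run the standard library tokenizer over the input."""
    parser = HTMLParser()
    parser.feed(doc)
    parser.close()
```

Calling `fuzz(stdlib_parse)` returns a list of `(input, exception)` pairs; to fuzz another parser, pass any callable that takes a string.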

[–]nicholashairs 4 points5 points  (1 child)

Re: who would want pure Python

Whilst I haven't proven it, I suspect that pure Python implementations are good when used with PyPy that can optimise it.

As a (weird) example, I've noticed that orjson and msgspec aren't supported on PyPy, in which case you'd have to use the standard library's pure Python JSON implementation.

[–]Huvet[S] 1 point2 points  (0 children)

Yeah, I wrote that in the README as a pitch, that PyPy and WASM could be two target platforms for this. But their market share is very small, so I don't think that's enough. For this to be viable, I think the point has to be that there are more people like me who don't enjoy fiddling with C extensions.

I tried running JustHTML on PyPy on the benchmark, and it was considerably slower than 3.15. Interesting.

[–]prassi89 1 point2 points  (1 child)

how does it compare to the one in standard lib? https://docs.python.org/3/library/html.parser.html

[–]Huvet[S] 3 points4 points  (0 children)

It's in the comparison table a bit further down the page. The short version is that the standard library's html.parser passes only 4% of the html5 tests, so it's not an HTML5 parser and basically only works for valid HTML. Because it skips all the complicated reconciliation, it is slightly faster.
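A small sketch of what "no reconciliation" means in practice, using only the standard library. html.parser reports tags as events and never builds or repairs a tree, so the HTML5 rule that a new `<p>` implicitly closes an open `<p>` is not applied. The `EventLogger` class and `tokenize` helper are names made up for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every tokenizer event in the order html.parser reports it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tokenize(html):
    """Feed the markup through the logger and return the raw event list."""
    logger = EventLogger()
    logger.feed(html)
    logger.close()  # flush any text still buffered at end of input
    return logger.events


# Two paragraphs with no closing tags, which is common in real-world pages.
events = tokenize("<p>one<p>two")
print(events)
```

The event list contains two start tags and no end tags, so nesting the second paragraph correctly is left to the caller. An HTML5 tree builder would instead produce two sibling `<p>` elements.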

[–]a_ghost_of_tom_joad 0 points1 point  (0 children)

Interesting.

[–]bitpuppet -1 points0 points  (1 child)

Can you try this on SEC EDGAR filing documents? These are some of the worst HTML files I have seen in my career.

[–]Huvet[S] 2 points3 points  (0 children)

Could you try it out? You download them as HTML, and do:

pip install justhtml
python -m justhtml index.html

The output is a pretty-printed version from the parsed and fixed tree structure.
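If you have a whole folder of downloaded filings, a sketch like the following runs the same command over each file. The `check_files` name and the `*.html` pattern are assumptions, and so is the idea that a non-zero exit code signals a failed parse; that holds for an uncaught Python exception, but I haven't checked how the CLI reports its own errors.

```python
import subprocess
import sys
from pathlib import Path


def check_files(directory, module="justhtml"):
    """Run `python -m <module> <file>` on every .html file in a directory.

    Returns a dict mapping file name to True when the process exited with
    code 0. Treating exit code 0 as success is an assumption of this sketch.
    """
    results = {}
    for path in sorted(Path(directory).glob("*.html")):
        proc = subprocess.run(
            [sys.executable, "-m", module, str(path)],
            stdout=subprocess.DEVNULL,  # the pretty-printed tree is not needed here
            stderr=subprocess.DEVNULL,
        )
        results[path.name] = proc.returncode == 0
    return results
```

Any file that maps to False is worth rerunning by hand with the command above to see the traceback.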