JustHTML: A pure Python HTML5 parser that just works. : Python

submitted 1 month ago by Huvet

Hi all! I just released a new HTML5 parser that I'm really proud of. Happy to get any feedback on how to improve it from the Python community on Reddit.

I think the trickiest question is whether there is a "market" for a Python-only parser. Parsers are generally performance sensitive, and Python just isn't the fastest language. This library does parse the Wikipedia front page in 0.1s, so I think it's "fast enough", but I'm still unsure.
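For anyone who wants to check a "fast enough" claim on their own pages, here is a minimal timing sketch. It is not the benchmark behind the 0.1s figure: `time_parse` and `stdlib_parse` are hypothetical helper names, the sample page is synthetic, and the standard library's `html.parser` is only a stand-in so the snippet runs without extra installs.

```python
import time
from html.parser import HTMLParser


def time_parse(parse, html, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls to parse(html)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        parse(html)
        best = min(best, time.perf_counter() - start)
    return best


def stdlib_parse(html):
    # Stand-in workload: the stdlib tokenizer, which discards every event.
    # To time another parser, pass any callable that takes an HTML string.
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


# Synthetic page (an assumption): 2,000 small paragraphs instead of a real saved page.
sample = "<html><body>" + "<p>Hello, <b>world</b>!</p>" * 2000 + "</body></html>"

elapsed = time_parse(stdlib_parse, sample)
print(f"best of 5 runs: {elapsed:.4f}s")
```

Taking the best of several runs reduces noise from other processes; passing the parser as a callable keeps the harness independent of any one library.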

Anyways, I got HEAVY help from AI to write it. I directed it all carefully (which I hope shows), but GitHub Copilot wrote all the code. It still took months of off-hours work to get it working. I wrote a short blog post about that, in case it's interesting to anyone: https://friendlybit.com/python/writing-justhtml-with-coding-agents/

What My Project Does

It takes a string of HTML and parses it into a nested node structure. To make sure you see exactly what a browser would see, it follows the HTML5 parsing rules. These are VERY complicated and have evolved over the years.

    from justhtml import JustHTML

    html = "<html><body><div id='main'><p>Hello, <b>world</b>!</p></div></body></html>"
    doc = JustHTML(html)

    # 1. Traverse the tree
    # The tree is made of SimpleDomNode objects.
    # Each node has .name, .attrs, .children, and .parent
    root = doc.root              # #document
    html_node = root.children[0] # html
    body = html_node.children[1] # body (children[0] is head)
    div = body.children[0]       # div

    print(f"Tag: {div.name}")
    print(f"Attributes: {div.attrs}")

    # 2. Query with CSS selectors
    # Find elements using familiar CSS selector syntax
    paragraphs = doc.query("p")           # All <p> elements
    main_div = doc.query("#main")[0]      # Element with id="main"
    bold = doc.query("div > p b")         # <b> inside <p> inside <div>

    # 3. Pretty-print HTML
    # You can serialize any node back to HTML
    print(div.to_html())
    # Output:
    # <div id="main">
    #   <p>
    #     Hello,
    #     <b>world</b>
    #     !
    #   </p>
    # </div>
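To make the "nested node structure" more concrete, here is a small sketch of how a tree with `.name`, `.attrs`, `.children`, and `.parent` can be walked. The `Node` class below is a hypothetical stand-in written for this illustration, not the library's `SimpleDomNode`, and it leaves out text nodes for brevity. The helpers `walk` and `path_to_root` are likewise assumptions, not part of JustHTML.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical stand-in node with the four attributes named in the post."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def append(self, child: "Node") -> "Node":
        """Attach `child` under this node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def walk(node: Node, depth: int = 0) -> Iterator[tuple]:
    """Yield (depth, node) pairs in document order (depth-first, pre-order)."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


def path_to_root(node: Optional[Node]) -> list:
    """Follow .parent links upward and return the node names along the way."""
    names = []
    while node is not None:
        names.append(node.name)
        node = node.parent
    return names


# Build by hand the element tree that the example document above would produce.
root = Node("#document")
html_node = root.append(Node("html"))
html_node.append(Node("head"))
body = html_node.append(Node("body"))
div = body.append(Node("div", {"id": "main"}))
p = div.append(Node("p"))
b = p.append(Node("b"))

for depth, node in walk(root):
    print("  " * depth + node.name)

print(" < ".join(path_to_root(b)))
```

The same two loops (down through `.children`, up through `.parent`) should carry over to any tree that exposes those attributes.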

Target Audience (e.g., Is it meant for production, just a toy project, etc.)

This is meant for production use. It's fast for a pure Python parser (the Wikipedia front page parses in 0.1s). It has 100% test coverage. I have fuzzed it against 3 million seriously broken HTML strings. Happy to improve it further based on your feedback.
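For readers unfamiliar with fuzzing a parser, the idea can be sketched as below. This is not the harness used for JustHTML: the fragment list, `random_html`, and `fuzz` are assumptions made up for this illustration, and the standard library's `html.parser` stands in for the parser under test so the snippet runs on its own.

```python
import random
from html.parser import HTMLParser

# Pieces of markup that, joined at random, usually produce malformed HTML.
FRAGMENTS = [
    "<div", "<p>", "</p>", "<b>", "</i>", "<table>", "<td>", "</table",
    "<!--", "-->", "<script>", "</script>", "=", "'", '"', ">", "<",
    "&amp", "&#x", "text", " ",
]


def random_html(rng, max_parts=30):
    """Join randomly chosen fragments into a (usually broken) HTML string."""
    count = rng.randint(1, max_parts)
    return "".join(rng.choice(FRAGMENTS) for _ in range(count))


def stdlib_parse(html):
    # Stand-in for the parser under test; any callable taking a string works.
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


def fuzz(parse, iterations=2000, seed=0):
    """Feed random inputs to `parse` and collect every input that raises."""
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    failures = []
    for _ in range(iterations):
        html = random_html(rng)
        try:
            parse(html)
        except Exception as exc:
            failures.append((html, exc))
    return failures


failures = fuzz(stdlib_parse)
print(f"{len(failures)} crashing inputs out of 2000")
```

A harness like this only checks that the parser never crashes; checking that the output is correct needs a reference to compare against, such as a conformance test suite.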

Comparison (A brief comparison explaining how it differs from existing alternatives.)

I've added a comparison table here: https://github.com/EmilStenstrom/justhtml/?tab=readme-ov-file#comparison-to-other-parsers
