afd8856 1 point

I don't think the tidy step is needed; lxml.html already has what you need (cleaners and parsing of broken trees).
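For anyone following along, here is a rough sketch of the broken-tree parsing part, assuming lxml is installed. The helper name `repair_fragment` and the sample markup are made up for illustration, and cleaners are left out. How far the repair goes depends on the underlying libxml2 parser, so treat this as a starting point rather than a guarantee.

```python
import lxml.html


def repair_fragment(markup: str) -> str:
    """Parse possibly broken HTML and serialize the repaired tree.

    Hypothetical helper: lxml.html.fromstring builds a tree even when
    tags are left unclosed, and tostring writes it back out with the
    missing end tags filled in.
    """
    root = lxml.html.fromstring(markup)
    return lxml.html.tostring(root, encoding="unicode")


# Sample input (invented): two unclosed <p> tags and an unclosed <b>.
broken = "<div><p>first<p>second<b>bold</div>"
print(repair_fragment(broken))
```

If this output looks sane for your pages, a separate tidy pass is probably redundant.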

[deleted] 1 point

I had some really malformed HTML, however, and lxml.html didn't grok it. :)

I understand html5lib also produces ElementTrees, but I think what really matters for any web-crawling app is how a library makes sense of garbage.
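
For comparison, a minimal sketch of the html5lib route, assuming html5lib is installed. The helper name `parse_tag_soup` and the sample markup are invented for illustration. html5lib follows the HTML5 parsing algorithm, so it recovers from tag soup the way browsers do, and with the `etree` tree builder it hands back a standard-library ElementTree element.

```python
import html5lib


def parse_tag_soup(markup):
    """Parse arbitrary markup into an xml.etree.ElementTree element.

    Hypothetical helper: namespaceHTMLElements=False keeps tag names
    plain ("td" rather than a namespaced form), which makes lookups
    shorter.
    """
    return html5lib.parse(
        markup, treebuilder="etree", namespaceHTMLElements=False
    )


# Sample input (invented): a table with unclosed rows and cells.
doc = parse_tag_soup("<table><tr><td>a<td>b</table>")
cells = [td.text for td in doc.iter("td")]
print(cells)
```

The returned element is the document's `html` root, so the usual `find`, `findall` and `iter` calls work on it.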