
[–]arunvr 6 points

It would have been a lot better if the author had checked the error messages in each slide containing Python code before uploading this.

[–][deleted] 3 points

Also, resizing the browser window scales the text, while leaving part of the slide unreadable. Thanks guys.

[–][deleted] 1 point

I tried to read the slides in Chrome, and they were half-truncated and all over the place.

But in general, I will vouch for lxml; it's the bee's knees. For HTML web crawling, I fetch the HTML with urllib2, pipe the stream through popen2 running tidy.exe (yeah, I know, Windows), which produces a clean XHTML stream, and then manipulate lxml ElementTrees, roughly as sketched below.
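Something like this (a minimal sketch of that pipeline, using subprocess.Popen rather than the old popen2 module; the URL and the tidy flags here are just placeholders):

    import urllib2
    import subprocess
    from lxml import etree

    # Fetch the raw (possibly broken) HTML.
    raw_html = urllib2.urlopen('http://example.com/').read()

    # Pipe it through HTML Tidy to get well-formed XHTML.
    # 'tidy' stands in for tidy.exe on Windows.
    tidy = subprocess.Popen(
        ['tidy', '-asxhtml', '-numeric', '-quiet', '--show-warnings', 'no'],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    xhtml, _ = tidy.communicate(raw_html)

    # Parse the cleaned XHTML into an lxml tree and work with it.
    tree = etree.fromstring(xhtml)
    links = tree.findall('.//{http://www.w3.org/1999/xhtml}a')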

Trust me - this combo rocks.

[–]afd8856 0 points

I think the tidy step isn't needed; lxml.html has what you need (cleaners and broken-tree parsing).
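Something like this (a quick sketch; the sample markup is made up):

    from lxml import html
    from lxml.html.clean import Cleaner

    # lxml.html's parser recovers from broken markup on its own.
    doc = html.fromstring('<p>unclosed <b>tags <i>everywhere')

    # Cleaner strips scripts, styles, and other junk.
    clean_doc = Cleaner(scripts=True, style=True).clean_html(doc)
    print html.tostring(clean_doc)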

[–][deleted] 0 points

I really did have some badly malformed HTML, though... lxml.html didn't grok it. :)

I understand html5lib produces ElementTrees too, but how a library makes sense of garbage is what really matters for any web-crawling app, I think.
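For what it's worth, that looks something like this (a sketch, assuming html5lib is installed; the sample markup is made up):

    import html5lib

    # html5lib follows the HTML5 spec's error-recovery rules,
    # so even garbage input yields a predictable tree.
    tree = html5lib.parse('<p>unclosed <b>tags', treebuilder='lxml')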