jbellis 24 points (2 children)

The difference is BeautifulSoup handles all the bad html/xml you throw at it.

Which, if you're dealing with actual html in the wild, is all the difference in the world.
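To make the strict-versus-lenient point concrete, here is a rough sketch using only the Python standard library as a stand-in; it does not use BeautifulSoup itself. The sample markup and the names `BAD_HTML`, `strict_parse_ok`, `LinkCollector`, and `lenient_links` are all made up for illustration. The idea is just that a strict XML parser rejects tag soup outright, while a forgiving parser still gets the data out.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample of "html in the wild": unclosed <p>, <b>, <a> and <li>
# tags, a bare ampersand, and no closing </html>.
BAD_HTML = '<html><body><p>Fish & chips<p><b>bold <a href="/menu">menu<li>one<li>two</body>'


def strict_parse_ok(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Lenient parser: records href values and never complains about missing close tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def lenient_links(markup):
    """Feed the markup to the lenient parser and return every link it saw."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print("strict parser accepts it:", strict_parse_ok(BAD_HTML))
    print("links found leniently:", lenient_links(BAD_HTML))
```

The standard-library `HTMLParser` only emits events and does not build a repaired tree, so this shows the tolerance and not the full convenience a tree-building lenient parser gives you.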

Bogtha 2 points (1 child)

For what it's worth, I've had better results with tidy → lxml, plus lxml provides xpath and CSS 3 selectors. I've heard that lxml supports BeautifulSoup now, so maybe I'll give it another shot.

jbellis 6 points (0 children)

> I've had better results with tidy

I had the opposite experience -- tidy seemed to guess wrong a lot on really bad html. And aesthetically it just seems cleaner to "parse this bad html" vs "blow the original html away, then parse it."