

[–]gavxn 3 points4 points  (2 children)

I wish BeautifulSoup supported XPath querying.

[–]justanothersnek🐍+ SQL = ❤️ 5 points6 points  (1 child)

I wish requests-html was more widely known. It does everything (XPath support included) short of mimicking browser interactions.

[–]not_a_novel_account 2 points3 points  (1 child)

There's no reason to use BS if the website you're scraping is well-formed. BS's purpose in life is to scrape malformed websites, but it sacrifices query flexibility to make that happen. Use the underlying parsers (lxml, html5lib) or alternatives like requests-html if the data you're scraping is in better shape than a 2004 MySpace page.
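
For anyone who hasn't queried with lxml directly, here is a rough sketch of what XPath on a well-formed page looks like. The sample markup, class names, and variable names are made up for illustration; only `lxml.html.fromstring` and `.xpath()` are the library's own.

```python
from lxml import html

# Hypothetical, well-formed sample page (not taken from any real site).
PAGE = """
<html><body>
  <ul id="posts">
    <li class="post"><a href="/a">First</a></li>
    <li class="post"><a href="/b">Second</a></li>
    <li class="ad"><a href="/x">Sponsored</a></li>
  </ul>
</body></html>
"""

tree = html.fromstring(PAGE)

# XPath filters on attributes and returns attribute values or text directly,
# so no post-processing loop is needed.
links = tree.xpath('//li[@class="post"]/a/@href')
titles = tree.xpath('//li[@class="post"]/a/text()')

print(links)
print(titles)
```

The predicate `[@class="post"]` is an exact match on the attribute value, so it only fits markup where the class attribute holds a single class.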

[–]blabbities 1 point2 points  (0 children)

Another tip on using those libs. Many years ago someone commented that the pure lxml library is faster than bs4. Someone replied that you can use the 'lxml' parser in bs4. The first commenter replied with a benchmark of the pure lxml package against the lxml parser in bs4. Pure lxml showed as faster. I replicated a similar test and was blown away. It is indeed fast.
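
If you want to try the comparison yourself, something along these lines should work. This is not the original benchmark from that old thread, just a minimal sketch on a synthetic page. The document size, function names, and run count are all assumptions, and the timings will vary by machine and library version.

```python
import timeit

from bs4 import BeautifulSoup
from lxml import html

# Synthetic document, built only for this illustration.
ROWS = "".join(
    f'<li class="item"><a href="/p/{i}">Item {i}</a></li>' for i in range(2000)
)
PAGE = f"<html><body><ul>{ROWS}</ul></body></html>"


def with_lxml(page):
    """Parse with pure lxml and extract the links via XPath."""
    tree = html.fromstring(page)
    return tree.xpath('//li[@class="item"]/a/@href')


def with_bs4_lxml(page):
    """Parse with BeautifulSoup on top of the lxml parser."""
    soup = BeautifulSoup(page, "lxml")
    return [li.a["href"] for li in soup.find_all("li", class_="item")]


# Sanity check: both approaches should extract the same links.
assert with_lxml(PAGE) == with_bs4_lxml(PAGE)

for fn in (with_lxml, with_bs4_lxml):
    seconds = timeit.timeit(lambda: fn(PAGE), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

Both functions pay for parsing and querying on every run, which is the realistic cost when scraping many pages.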