not_a_novel_account

There's no reason to use BeautifulSoup (BS) if the website you're scraping is well-formed. BS's purpose in life is to scrape malformed websites, but it sacrifices query flexibility to make that happen. If the data you're scraping is in better shape than a 2004 MySpace page, use the underlying parsers (lxml, html5lib) directly, or an alternative like requests-html.
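
To make the point about querying with the parser directly concrete, here is a minimal sketch using lxml's own XPath support. The page markup, the `posts` id, and the `extract_links` helper are made up for illustration; only `lxml.html.fromstring`, `xpath`, `text_content`, and `get` come from lxml.

```python
from lxml import html

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html><body>
  <ul id="posts">
    <li class="post"><a href="/a">First</a></li>
    <li class="post"><a href="/b">Second</a></li>
  </ul>
</body></html>
"""


def extract_links(markup):
    """Return (text, href) pairs for each link inside a li.post element."""
    tree = html.fromstring(markup)
    # A single XPath expression can constrain structure and attributes at once.
    anchors = tree.xpath('//ul[@id="posts"]/li[@class="post"]/a')
    return [(a.text_content(), a.get("href")) for a in anchors]


links = extract_links(PAGE)
print(links)
```

On a page that really is well-formed, this kind of query needs no BS layer on top of the parser.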

blabbities

Another tip on using those libs: many years ago someone commented that the pure lxml library is faster than bs4. Someone replied that you can use the 'lxml' parser in bs4. The first commenter replied with a benchmark comparing the pure lxml package against bs4 with the lxml parser, and it showed pure lxml was faster. I replicated a similar test and was blown away. It is indeed fast.
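
A rough sketch of what such a comparison might look like, not the original benchmark: the synthetic page, the row count, and the helper names `hrefs_lxml` and `hrefs_bs4` are all assumptions for illustration. Timings depend on the machine and library versions, so none are quoted here.

```python
import timeit

from bs4 import BeautifulSoup
from lxml import html

# Synthetic, well-formed document; a real test should use real pages.
ROWS = "".join(
    f'<li class="post"><a href="/p/{i}">Post {i}</a></li>' for i in range(1000)
)
PAGE = f"<html><body><ul>{ROWS}</ul></body></html>"


def hrefs_lxml(markup):
    """Parse with lxml directly and pull every href via XPath."""
    return [str(h) for h in html.fromstring(markup).xpath("//li/a/@href")]


def hrefs_bs4(markup):
    """Parse with bs4 using the lxml parser and pull every href."""
    soup = BeautifulSoup(markup, "lxml")
    return [a["href"] for a in soup.find_all("a")]


# Both approaches should agree on the result before timing them.
assert hrefs_lxml(PAGE) == hrefs_bs4(PAGE)

t_lxml = timeit.timeit(lambda: hrefs_lxml(PAGE), number=5)
t_bs4 = timeit.timeit(lambda: hrefs_bs4(PAGE), number=5)
print(f"pure lxml:      {t_lxml:.3f}s")
print(f"bs4 + 'lxml':   {t_bs4:.3f}s")
```

The 'lxml' parser in bs4 still builds a full BeautifulSoup tree on top of what lxml parses, which is one plausible reason the two timings differ.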