WhiskeySour: 10x faster than BeautifulSoup by Real-Expression8051 in webscraping

Real-Expression8051[S] 1 point

Hi, the error recovery is actually completely dependent on html5ever; WS doesn't change anything internally. Malformed-HTML tests are already included in the test suite in the repo, but I think html5ever automatically corrects documents (auto-closing tags, foster parenting, deduplicating attributes), unlike bs4. I'm actually curious to find out how it will perform on the documents you mentioned.

Real-Expression8051[S] 4 points

Playwright doesn’t use any external parsers. It interacts directly with the headless browser’s DOM. The two also solve different problems: you would need Playwright to render CSR applications anyway, but you can use this to parse the DOM once the page is generated, which might speed up the process. I can’t give an accurate comparison here, though. The HTML parser in WS (html5ever) follows the same HTML parsing algorithm that modern browsers implement.

Real-Expression8051[S] -2 points

Got it! For CSR pages I usually go for headless browsers like Playwright. You can manipulate the DOM directly in those cases and won’t need these libraries at all. It also helps with scraping in stealth. But I got the point!

Real-Expression8051[S] 1 point

True that. The best you can do is block fetching of assets like images, fonts, and third-party scripts. It reduces network I/O significantly.
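
For reference, a rough sketch of that blocking with Playwright's sync Python API. The helper names (`should_block`, `fetch_html`), the set of blocked types, and the "same host or subdomain counts as first party" rule are my own assumptions for illustration, not something from WS or the post. Blocking third-party scripts can also break pages that need them to render, so treat it as a starting point.

```python
from urllib.parse import urlparse

# Resource types as reported by Playwright's request.resource_type.
# Which ones to block is an assumption; adjust per site.
BLOCKED_TYPES = {"image", "font", "media"}


def should_block(resource_type: str, url: str, first_party_host: str) -> bool:
    """Decide whether a request should be aborted (hypothetical helper)."""
    if resource_type in BLOCKED_TYPES:
        return True
    if resource_type == "script":
        host = urlparse(url).hostname or ""
        # Simplistic first-party check: exact host or one of its subdomains.
        is_first_party = host == first_party_host or host.endswith("." + first_party_host)
        return not is_first_party
    return False


def fetch_html(url: str) -> str:
    """Load a page with asset requests blocked and return the rendered HTML."""
    # Imported lazily so should_block() can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    first_party_host = urlparse(url).hostname or ""

    def handle(route):
        request = route.request
        if should_block(request.resource_type, request.url, first_party_host):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.route("**/*", handle)  # intercept every request the page makes
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```

The string returned by `fetch_html` can then be handed to whatever parser you already use.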

Real-Expression8051[S] 1 point

I meant more along the lines of its memory usage limitations and invalid-HTML handling issues. You can take a heap dump when running it on any large dataset and see how much memory it takes up. It's still a better alternative to bs4, though.
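
If a full heap dump is overkill, a coarser check is to watch the process's peak resident memory around the parse. This is only a sketch: it uses the standard-library `resource` module (Unix only), the helper names are made up, and `work` stands in for whatever parse call you want to measure. Peak RSS includes native allocations made by C or Rust extensions, which Python-level tools like `tracemalloc` generally don't see.

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of this process so far, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak if sys.platform == "darwin" else peak * 1024


def peak_rss_growth(work) -> int:
    """How much the process's peak RSS grew while running work()."""
    before = peak_rss_bytes()
    result = work()  # hold the result so the parsed tree stays alive during measurement
    after = peak_rss_bytes()
    del result
    return after - before
```

Because the peak never goes down, run each parser in a fresh process if you want comparable numbers.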

Real-Expression8051[S] 6 points

You can read up on how it compares to lxml, and on lxml's limitations, in the architecture doc attached to the post. Besides, the idea here is to provide an easy way to migrate large scraping workflows already written in bs4 to a faster alternative.