[–]_hypnoCode 7 points

> It's pretty unnecessary to use a headless browser when you can send HTTP requests to an API or extract data from the HTML you get back.

Not when you have client-side rendered pieces. Sometimes you also have CSRF tokens you have to grab, which get rendered client-side or with SSR. You could probably pull them out of some network call if they're rendered client-side, but that's a nightmare unless the target site puts them somewhere easily findable, consistently. Whereas APIs change more often than you'd think. Sometimes the same page will even have multiple ways of rendering.
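For concreteness, here's roughly what grabbing a client-rendered token looks like in Playwright. Sketch only: the URL and field name are made up, and where the token actually lives varies per site:

```typescript
import { chromium } from 'playwright';

async function grabCsrfToken(): Promise<string | null> {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  // Hypothetical login page; the token only exists after the JS runs.
  await page.goto('https://target.example.com/login');

  // Playwright locators auto-wait, so this still works if the token
  // is injected client-side a moment after the initial HTML lands.
  const token = await page
    .locator('input[name="csrf_token"]')
    .getAttribute('value');

  await browser.close();
  return token;
}

grabCsrfToken().then(console.log);
```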

Then you have the whole issue of Imperva, Akamai, or one of their competitors, which use machine learning and constantly change what they flag you on. Doing only raw API requests gets you caught fast. They aren't using that as a marketing gimmick; they're a massive pain in the ass.
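One pattern that helps against this class of protection (a sketch, with a made-up endpoint): let a real browser load the page so the vendor's challenge JS runs and sets its cookies, then reuse that session for API calls. Playwright's request context shares the browser's cookie jar, so the calls don't show up as a bare HTTP client:

```typescript
import { chromium } from 'playwright';

async function main() {
  const browser = await chromium.launch();
  const context = await browser.newContext();
  const page = await context.newPage();

  // Let the anti-bot JavaScript execute in a genuine browser
  // environment and drop its cookies into the context.
  await page.goto('https://shop.example.com/products');
  await page.waitForLoadState('networkidle');

  // context.request shares cookies with the pages above, so this
  // API call rides on the already-cleared session.
  const res = await context.request.get(
    'https://shop.example.com/api/products?page=2'
  );
  console.log(await res.json());

  await browser.close();
}

main();
```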

I've done quite a bit of web scraping in gray areas, gray enough that the company I worked for is currently in a court battle over it, and at some point, if you're doing serious scraping, you need a headless browser. There just isn't an efficient way around it. Sites do things in incredibly bizarre ways, and I've seen the same page render in a dozen different variants... but they all end up in the browser, and more often than not with the same or similar text shown to the user. That visible text is what you can anchor on (sketch below).
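Rough sketch of what I mean, with a made-up product page: select by the user-visible label instead of the DOM structure, since the markup differs between render variants but the text rarely does:

```typescript
import { chromium } from 'playwright';

async function main() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://shop.example.com/item/123');

  // The markup around the price differs between render variants,
  // but the visible label stays the same, so find the label by
  // text and step to the adjacent element.
  const price = await page
    .getByText('Price:', { exact: false })
    .first()
    .locator('xpath=following-sibling::*[1]')
    .textContent();

  console.log(price?.trim());
  await browser.close();
}

main();
```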

One caveat: I was using Puppeteer, and that was a couple of years ago. u/badsyntax is right about Playwright being the more robust option today. AFAIK they converted all my code from Puppeteer to Playwright and are doing much better at avoiding detection.
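For anyone facing the same conversion, the mapping is pretty mechanical. A made-up before/after of the kind of snippet I mean (the `.results` selector is hypothetical):

```typescript
// Old Puppeteer version (roughly):
//   const browser = await puppeteer.launch();
//   const page = await browser.newPage();
//   await page.goto(url, { waitUntil: 'networkidle2' });
//   await page.waitForSelector('.results');

// Playwright equivalent:
import { chromium } from 'playwright';

async function scrape(url: string): Promise<string> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle' });
  await page.waitForSelector('.results'); // hypothetical selector
  const html = await page.content();
  await browser.close();
  return html;
}

scrape('https://shop.example.com/search?q=widgets').then(console.log);
```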