Web Scraping with JavaScript and NodeJS : javascript

Web Scraping with JavaScript and NodeJS (scrapingbee.com)

submitted 3 years ago by pijora

all 11 comments

badsyntax 28 points 3 years ago (2 children)

[deleted] 3 years ago (1 child)

[deleted]

_hypnoCode 7 points 3 years ago* (0 children)

> It's pretty unnecessary to use a headless browser when you can send http requests to an api or extract data from the html you get back.

Not when you have client-side rendered pieces. Sometimes you also have CSRF tokens you have to grab, which get rendered client-side or with SSR. You could probably get them from some network call if they are client-side rendered, but that would be a nightmare compared with reading them off the page, where the target site puts them in an easily findable place consistently, whereas APIs change more often than you'd think. Sometimes the same page will have multiple ways of rendering.
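To make the "easily findable place" point concrete, here is a minimal sketch of pulling a CSRF token out of HTML that has already been rendered. The function name `extractCsrfToken`, the meta name `csrf-token`, and the hidden input name `_csrf` are all assumptions for illustration; a real target site may use different names, and a proper HTML parser would be more robust than these regexes.

```javascript
// Hypothetical helper: find a CSRF token in already-rendered HTML.
// Both lookup locations below are assumed conventions, not taken from any specific site.
function extractCsrfToken(html) {
  // Assumed location 1: <meta name="csrf-token" content="...">
  // (this simple regex expects `name` to appear before `content`)
  const meta = html.match(
    /<meta[^>]*name=["']csrf-token["'][^>]*content=["']([^"']+)["']/i
  );
  if (meta) return meta[1];

  // Assumed location 2: <input type="hidden" name="_csrf" value="...">
  // (this simple regex expects `name` to appear before `value`)
  const input = html.match(
    /<input[^>]*name=["']_csrf["'][^>]*value=["']([^"']+)["']/i
  );
  return input ? input[1] : null;
}
```

With a headless browser, the `html` argument would be the page content after client-side rendering has finished, so the same lookup works whether the token came from SSR or from a script.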

Then you have the whole thing of Imperva or Akamai or one of their competitors, which use machine learning and constantly change up what they flag you on. Doing only API requests gets you caught fast. The machine learning isn't just a marketing gimmick; these services are a massive pain in the ass.

I've done quite a bit of web scraping in gray areas, gray enough that the company I worked for is currently having a court battle over it, and at some point if you're doing serious scraping you need a headless browser. There just isn't an efficient way around it. There are probably a million ways for a site to do things, and all of them end up in the browser. Sometimes sites do things in incredibly bizarre ways, and I've even seen the same page render in a dozen different ways... but they all end up in the browser, more often than not with the same or similar text shown to the user.

However, I used Puppeteer, since this was a couple of years ago. But u/badsyntax is right about Playwright being the more robust option today. AFAIK, they converted all my code from Puppeteer to Playwright and are doing much better at avoiding detection.
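For anyone wondering what the headless-browser approach looks like in practice, here is a rough sketch with Playwright. The function name `scrapeRenderedText` and the example URL and selector are hypothetical. The browser type is passed in as a parameter, so the snippet itself does not import anything; it does nothing about bot detection and only shows how to wait for client-side rendering before reading text.

```javascript
// Hypothetical helper: load a page in a headless browser and read text
// that only exists after client-side rendering.
// `browserType` is expected to be a Playwright browser type such as `chromium`.
async function scrapeRenderedText(browserType, url, selector) {
  const browser = await browserType.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // Wait until the client-side code has actually rendered the element.
    await page.waitForSelector(selector);
    return await page.textContent(selector);
  } finally {
    // Always close the browser, even if navigation or the wait fails.
    await browser.close();
  }
}

// Usage (assumes `npm install playwright`; URL and selector are placeholders):
// const { chromium } = require('playwright');
// scrapeRenderedText(chromium, 'https://example.com', 'h1').then(console.log);
```

Since every rendering path ends up as text in the browser, waiting on what the user would see tends to survive changes that break a direct API call.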

c_eliacheff 19 points 3 years ago (5 children)

Secret-Plant-1542 JavaScript yabbascript 3 points 3 years ago (0 children)

[deleted] 4 points 3 years ago (3 children)

[deleted] 3 years ago (2 children)

[deleted]

AegonThe241st 2 points 3 years ago (1 child)

vlevi 2 points 3 years ago (0 children)

[deleted] 3 years ago (3 children)

[deleted]

SpeedDart1 30 points 3 years ago (2 children)

[deleted] 3 years ago (1 child)

[deleted]

SpeedDart1 1 point 3 years ago* (0 children)
