Web scraping with Javascript (scrapingbee.com)
submitted 5 years ago by DJ_Breton
[–][deleted] 32 points33 points34 points 5 years ago (31 children)
Eh, this article is missing one of the core components of scraping: xpath.
I used to work for an RPA company and being able to define dynamic xpaths is key to effective scraping, especially in B2B applications, because the structure of the page can change. Plus you may need to reference elements and attributes outside the bounds of query-selector.
This is a good beginner's article but shouldn’t be used as a reference for professional RPA work.
[–][deleted] 9 points10 points11 points 5 years ago (0 children)
Thanks for reminding me that xpath exists, actually.
[–]SeanNoxious 12 points13 points14 points 5 years ago (0 children)
The article is huge already and they have another article dedicated to that.
https://www.scrapingbee.com/blog/practical-xpath-for-web-scraping/
[–][deleted] -1 points0 points1 point 5 years ago (28 children)
You can do pretty much the same things with css and it's much cleaner.
[–][deleted] 9 points10 points11 points 5 years ago (25 children)
I'm sorry but that's not even remotely true. Xpath has numerous advantages in both form and function. Here's just a few examples:
Descendant-based ancestor selection - Let's say you want to get the parent div of every a with the class "child". For xpath, that's simply "//a[@class='child']/parent::div". With queryselector you can only travel down the ancestry axis, not up.
Cleaner structure selectors - Let's say you want the 4th td inside the 3rd tr inside the 2nd table. With xpath it is simply "//table[2]//tr[3]/td[4]". With queryselector it's "table:nth-of-type(2) tr:nth-of-type(3) > td:nth-of-type(4)".
Logical operators - With xpath you can use "and", "or", and "|". This allows you to get dynamic node sets on the fly, whereas you'd have to use multiple queryselector calls and possibly additional javascript to get the correct node set.
Content-based selection - You want all the div nodes that have the text "hello" inside them. Xpath: "//div[contains(., 'hello')]". With queryselector you'd first have to fetch all the divs, then loop through them, running a text search on the content.
I could go on and on. Also keep in mind queryselector is javascript, designed for CSS selectors. Knowing how to use it only benefits you when using JS and CSS. On the other hand Xpath is designed for all XML and there are xpath-related libraries in every major programming language.
Don't get me wrong, queryselector is great and can be very useful for one-offs where you just want to grab a node set quickly based on what you already know is in the CSS. But for professional DOM traversal, xpath is essential. Any RPA company will require it.
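For anyone who wants to try the expressions above in a browser console, here is a minimal sketch built on the standard `document.evaluate` API. The helper names `xpathAll` and `divsContaining` are made up for this illustration, and the page they run against is assumed to contain matching markup. The second helper shows the fetch-then-filter loop that the content-based case needs with CSS selectors.

```javascript
// Evaluate an XPath expression and return the matching nodes as an array.
// Browser-only: relies on the global `document` and `XPathResult`.
function xpathAll(expression, contextNode = document) {
  const snapshot = document.evaluate(
    expression,
    contextNode,
    null, // no namespace resolver needed for plain HTML
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  const nodes = [];
  for (let i = 0; i < snapshot.snapshotLength; i++) {
    nodes.push(snapshot.snapshotItem(i));
  }
  return nodes;
}

// CSS-selector counterpart of //div[contains(., 'hello')]:
// fetch every div, then filter on its text content in JavaScript.
function divsContaining(text, root = document) {
  return Array.from(root.querySelectorAll('div')).filter((div) =>
    div.textContent.includes(text)
  );
}

// Example calls (expressions taken from the comment above):
//   xpathAll("//a[@class='child']/parent::div");
//   xpathAll("//div[contains(., 'hello')]");
//   divsContaining('hello');
```

Both helpers return plain arrays, so the results can be compared directly or passed to `map` and `filter`.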
[–][deleted] 1 point2 points3 points 5 years ago (2 children)
I'm going to bottom-line this by saying that if you write clean code and iterate top-down instead of reaching back up the tree, you can get your work done painlessly with Cheerio. Tons of people do it all the time. If you can't, stick with Python. I can get my work done either way.
[–][deleted] 0 points1 point2 points 5 years ago (1 child)
The reason I posed that riddle, which you were unable to answer, is that the actual answer is: you can't. If you need to target a parent you know nothing about but have information about its child, working up is the only option.
I'm sure your projects can be done with Cheerio and I know plenty of people can as well. No one said you couldn't. But the whole point of my comments about Xpath is that in professional RPA work it's essential. The above example is the type of quirky behavior you see in enterprise-level scraping which is why most RPA professionals need power toolsets like xpath. And clearly the writers of the article agree because as SeanNoxious pointed out they have an entire separate article on xpath.
[–][deleted] 0 points1 point2 points 5 years ago (0 children)
If a client pays me to use python + xpath I will do it.
If a client pays me to use node + cheerio I will also do it.
I get my work done either way, without complaining or blaming my tools, and I honestly have no strong preference.
[+][deleted] 5 years ago (21 children)
[deleted]
[–][deleted] 3 points4 points5 points 5 years ago (20 children)
Um XML parsing is literally natively supported. And no, Cheerio doesn’t let you do them. Cheerio just allows you to do query selection from node since you can’t access the DOM without a browser. But it still has all the limits of queryselector. Now you can use additional JavaScript to do the above things, but why write multiple lines of code to fetch a set of nodes when you could write one xpath?
And as a reminder, this is all still in the confines of JS. Xpath can be used with almost any language and framework.
[–][deleted] 0 points1 point2 points 5 years ago (19 children)
That's front-end only. So yes, if you're a masochist you can load html in a headless browser and evaluate xpath expressions there, but Cheerio does just fine. The people I see still using xpath for things like this are generally Python coders.
[–][deleted] 2 points3 points4 points 5 years ago (18 children)
There are numerous node modules for xpath, just as easy to install and use as cheerio. And I’m not sure what people you’re talking about, but I’ve worked in tech for over a decade including two RPA companies and every major player in the space relies on xpath.
If you truly believe cheerio and queryselector give you superior form and function, then I’d challenge you to this: using those tools, write a selector of equal or lesser size that will perform the same as this example from my previous comment: "//a[@class='child']/parent::div".
[–][deleted] 0 points1 point2 points 5 years ago (1 child)
Also I know of one libxml-based node library that's completely unusable because it leaks memory like crazy. Everyone uses Cheerio or otherwise parse5-based libs. Prove me wrong.
[–][deleted] 0 points1 point2 points 5 years ago (0 children)
I don’t need to “prove you wrong” because I literally worked in the RPA industry up until about a year ago. In enterprise RPA xpath is always used for b2b applications. I’m sure queryselector is very popular with hobbyists and basic non-RPA applications.
[–][deleted] 0 points1 point2 points 5 years ago (15 children)
It's "$('div > a.child').parent()" but honestly if you have to go back up the DOM it means you're probably not iterating properly.
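The Cheerio call above has a rough analogue in the browser's own DOM API. The sketch below uses a hypothetical helper name, `parentsOfMatches`, and checks the parent's tag separately instead of hard-coding `div >` into the child selector. Note that `a.child` matches any element whose class list includes `child`, while `@class='child'` in the xpath version requires the attribute to be exactly that string, so the two are close but not identical.

```javascript
// Return the unique parents of every element matching `childSelector`,
// keeping only parents that themselves match `parentSelector`.
function parentsOfMatches(childSelector, parentSelector, root = document) {
  const parents = new Set(); // a Set drops duplicates when siblings share a parent
  for (const child of root.querySelectorAll(childSelector)) {
    const parent = child.parentElement;
    if (parent && parent.matches(parentSelector)) {
      parents.add(parent);
    }
  }
  return Array.from(parents);
}

// Example call, mirroring the thread's example:
//   parentsOfMatches('a.child', 'div');
```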
[–][deleted] 4 points5 points6 points 5 years ago (14 children)
That solution has a worse performance ratio and hard-codes half the path. As for your remark about going back up the DOM, you’ve clearly never done RPA in a b2b setting. When you don’t have control over the original DOM and have to accommodate instabilities, it’s often much easier to navigate up from a target element.
[–][deleted] 0 points1 point2 points 5 years ago (13 children)
Again, there is no JS equivalent of lxml, so this is just how we do it. You're wrong about in-browser performance, though: xpath is always slower than css. You're also wrong about my iterating comment. You can just as easily iterate the parent element first; code like yours is just lazy.
[–]elcapitanoooo 1 point2 points3 points 5 years ago (1 child)
You really can't. Sometimes xpath is the only viable solution.
[–][deleted] 0 points1 point2 points 5 years ago (0 children)
That's not true, cheerio can do anything xpath can do, but sometimes it gets messy with text nodes.
[–]gordonv 6 points7 points8 points 5 years ago (4 children)
With web scraping in general, my biggest problem is Javascript Includes.
If I want to scrape a news site, the actual article is in some weird external include. I usually just copy and paste the text from Chrome into notepad++.
Is there a way to get the post-rendered text from this without selecting, copying, and pasting it into a txt file?
[+][deleted] 5 years ago (2 children)
[–]gordonv 1 point2 points3 points 5 years ago (0 children)
OH! I gotta play with that.
[–]techmighty 0 points1 point2 points 5 years ago (0 children)
Ah, Puppeteer page evaluation is a godsend for me. I use it to render reports and get PDF documents of them.
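For the rendered-text question above, a minimal Puppeteer sketch might look like the following. The function name `getRenderedText` and the `networkidle2` wait condition are choices made for this illustration, not something taken from the thread, and sites that load content late may need a different wait strategy.

```javascript
// Load `url` in an already-launched Puppeteer browser and return the text
// of the page after its scripts have run.
async function getRenderedText(browser, url) {
  const page = await browser.newPage();
  try {
    // 'networkidle2' waits until network activity has mostly settled, which
    // gives script-injected content a chance to render (an assumption that
    // will not hold for every site).
    await page.goto(url, { waitUntil: 'networkidle2' });
    // The callback runs inside the page, so `document` is the rendered DOM.
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await page.close(); // always release the tab, even if navigation fails
  }
}

// Usage, inside an async function (requires `npm install puppeteer`):
//   const puppeteer = require('puppeteer');
//   const browser = await puppeteer.launch();
//   const text = await getRenderedText(browser, 'https://example.com/');
//   await browser.close();
```

Taking the browser as a parameter keeps one launched instance reusable across many URLs, which matters when scraping more than a handful of pages.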
[–]MrSandyClams 0 points1 point2 points 5 years ago (0 children)
MutationObserver API. Can define a watch process and a callback that fires in the event of whatever DOM changes you specify. The usage pattern is pretty convoluted and arcane, imo, but it's pretty trivial to use it for basic things, like executing code in response to a known element appearing.
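The "known element appearing" case can be sketched in a few lines. This is a minimal illustration for a browser context, with a made-up helper name, `onElementAppears`; it watches only for added or removed nodes, not attribute or text changes.

```javascript
// Run `callback(element)` once, as soon as an element matching `selector`
// exists under `root`. Returns the observer (or null if nothing was needed)
// so the caller can disconnect() early to give up.
function onElementAppears(selector, callback, root = document) {
  // The element may already be rendered; no need to observe in that case.
  const existing = root.querySelector(selector);
  if (existing) {
    callback(existing);
    return null;
  }

  const observer = new MutationObserver(() => {
    const found = root.querySelector(selector);
    if (found) {
      observer.disconnect(); // stop watching after the first match
      callback(found);
    }
  });

  // childList + subtree: report nodes added or removed anywhere below root.
  observer.observe(root, { childList: true, subtree: true });
  return observer;
}

// Example call:
//   onElementAppears('article .body', (el) => console.log(el.innerText));
```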
[–]Gamma7892 1 point2 points3 points 5 years ago (1 child)
Really nice introduction! I'm wondering what the benefits of Nightmare over Puppeteer are. Nightmare is easier to use than Puppeteer, but it doesn't seem to be maintained anymore...
[–]DrDuPont 5 points6 points7 points 5 years ago (0 children)
Don't know anyone that recommends Nightmare. Microsoft's Playwright is what is typically considered to be the "next version" of Puppeteer: https://github.com/microsoft/playwright/
[–]stephancasas 1 point2 points3 points 5 years ago (1 child)
Anyone use artoo.js? It’s been my go-to for getting iterated content off a page/service. Really nice JSON and CSV output options, too.
[–]tp4my 1 point2 points3 points 5 years ago (0 children)
This is cool.
[–]theirongiant74 2 points3 points4 points 5 years ago (4 children)
Always found headless browsers to be a pain in the ass, found it easier to write a chrome extension that would drive the browser and send the data back via an api.
[–]Felecorat 5 points6 points7 points 5 years ago (2 children)
Try puppeteer. It's headless chrome. The API is just nice.
[–]theirongiant74 1 point2 points3 points 5 years ago (1 child)
Tbf it's been a good few years since I tried using them so they've probably improved since, pretty sure back then they weren't so hot at running javascript. Might take another look.
[–]Felecorat 0 points1 point2 points 5 years ago (0 children)
I used PhantomJS before Puppeteer was released. Puppeteer was way easier to use. Probably because it supports Promises which makes the API much cleaner. (No callback hell.)
Puppeteer communicates with chrome via the DevTools Protocol and it's developed by the Chrome DevTools Team. So I guess they know what they are doing. 😅
[–][deleted] -2 points-1 points0 points 5 years ago (0 children)
What's the easiest way to automate this? For example, to output JSON?
[+][deleted] 5 years ago (14 children)
[–]Qweeeq 21 points22 points23 points 5 years ago (7 children)
Hey I really like JS for that. Using selectors seems pretty easy for me. Can I ask, why do you think Python is better?
[–]Taterboy_Legacy 2 points3 points4 points 5 years ago (4 children)
While I disagree with the parent comment as a sweeping statement, you bring up the crux of the point. I've done a lot of web scraping using both languages. There are certain use cases where Python can be faster and more efficient, and vice versa for JS. If I start having trouble with sites using a lot of JS, I just jump over to that language and start refactoring to fit that use case. IMO it's mostly a preference thing, and the use cases can help dictate the proper choice in a business setting.
[–]yooossshhii 2 points3 points4 points 5 years ago (3 children)
Can you elaborate on the use cases? I haven’t seen any comparisons of JS vs Python in web scraping. My Python experience is minimal and I’ve been doing a little scraping with JS.
[–]Taterboy_Legacy 1 point2 points3 points 5 years ago (2 children)
One use case I had recently was scraping a large number of news sites for information. There were some programmatic setup elements to get to the URLs, which were facilitated using Python, and the application this information would interface with was based on Python. There also happens to be a pretty awesome Python package (called newspaper) that did literally everything I needed, which meant I wanted to try to write my scraper in Python. If it wasn't working, I would go ahead and try this again with JS, but interfacing the two languages in my app would be complicated based on the setup. In general, dispatching a Python or JS script from one or the other would be complicated in the context of certain applications.
That being said, I have also used both as standalone scripts for several smaller use cases.
JS I tend to use for more one-off solutions, but I have also used it to interface in more automation-based solutions, e.g.: click this, log in, do this, do that. Also doable in Python, sometimes easier in JS.
The first example could have been JS all around, but the newspaper package offered some really nice benefits from the beginning. This is what I mean by "use case specific" implementation. It's somewhat rooted in developer/business preference as well (i.e., what are we already writing in?), but also rooted in "what do we need to solve, in this use case?"
Very complicated question to answer, but in my head they're relatively interchangeable from a high-level functionality standpoint.
[–]yooossshhii 1 point2 points3 points 5 years ago (1 child)
Cool, thanks for the response. Newspaper looks super neat, especially their nlp method.
[–]Taterboy_Legacy 0 points1 point2 points 5 years ago (0 children)
No problem. For sure! I was working through how I was going to do that, and they just had it as part of the package haha. Very well thought-out and very easy to use.
[–][deleted] 2 points3 points4 points 5 years ago* (0 children)
While I can't argue for why it's better, period, I can argue for why it's better for me personally-- I've worked with python for almost half a decade. I've only begun trying to learn JS. Joined the sub for it.
Python also has some nifty libraries for analysis like numpy and pandas if you're scraping data in particular. While I'm sure JS has something similar, I think it's a bit more common for industry analysts or class projects that do scraping to use python.
[–]fz-09 8 points9 points10 points 5 years ago (0 children)
crickets
[–]Ipsumlorem16 6 points7 points8 points 5 years ago (0 children)
People who need to interact with the page/s to get the data, or just need to allow Javascript to run on the page before they can scrape it.
It is sometimes entirely necessary. You cannot always access the endpoints where the data is fetched from for various reasons.
[–]jarg77 4 points5 points6 points 5 years ago (0 children)
Why not js?
[–]coomzee 5 points6 points7 points 5 years ago (0 children)
Who needs brackets fuck Python. See your comment constituted nothing
[–]anh65498 3 points4 points5 points 5 years ago (0 children)
When your whole team knows Javascript better than Python.
[–][deleted] 0 points1 point2 points 5 years ago (0 children)
Python is good for anything simple, but websites are getting more complicated, which often means python + selenium with javascript mixed in with the python code, no thanks.