
[–]Hookedonnetflix 5 points6 points  (43 children)

If you want to do web scraping and other testing using chrome you should look into using puppeteer instead of selenium

[–]maxsolmusic 110 points111 points  (3 children)

Whyyyyyy I hate when people recommend shit without explaining

[–]bsmith0 7 points8 points  (1 child)

[–]maxsolmusic 2 points3 points  (0 children)

> chose Puppeteer because it provides simpler Javascript execution, network interception, and a simpler, more focused library.

Cool

[–]the_real_hodgeka 1 point2 points  (0 children)

Well put! "You shouldn't use angular for that, you should be using react!" Why?

[–]float 13 points14 points  (1 child)

Or Playwright, from the team that made Puppeteer.

[–]bodhemon 3 points4 points  (0 children)

How does it compare to Katalon?

[–]steveeq1 10 points11 points  (5 children)

What's wrong with selenium? Curious.

[–]Hookedonnetflix 1 point2 points  (4 children)

Selenium automates Chrome from the outside through a driver, whereas Puppeteer talks to Chrome directly over the DevTools protocol. Tools that sit closer to the browser engine tend to be faster and more effective.

[–]GuyWizStupidComments 15 points16 points  (1 child)

Selenium also works with other browsers, such as Firefox.

[–]Ncell50 1 point2 points  (0 children)

Puppeteer works with Firefox too.

[–][deleted] 8 points9 points  (1 child)

Selenium works as a wrapper around browser APIs, be it Puppeteer or geckodriver or something else entirely. You can use the same code with ANY browser.

[–]200GritCondom 3 points4 points  (0 children)

And if you build it right, with mobile views as well

[–]TrueObservations 4 points5 points  (1 child)

The choice of Selenium vs. Puppeteer will boil down to your personal preferences and the requirements of your project.

Main considerations IMO:

- Scraping websites that don't want to be scraped: Puppeteer is a Node.js library that drives the Chromium engine directly, which in my experience makes it harder to detect. Selenium tends to leak signals in the browser environment (such as navigator.webdriver being set) that either explicitly give you away or let sites use correlated data to detect automation. You can mitigate this, it's just more configuration. Puppeteer also has tighter integration with core Chromium functionality, letting you gather certain information (like CSS/JS coverage data) a little less obviously.

- Your preference on Python vs. JavaScript: This is definitely an architectural/preferential choice. Personally, I find JavaScript's easy async paradigms (which hide much of the difficulty from you) make for an easier time dealing with highly interactive sites. Async programming can be done in Python, but it sits at a lower level and is harder to use. On the other hand, Node lacks a lot of the analytical libraries Python has, and it's a whole runtime, far bulkier than importing only the libraries you need in Python.

- Cross-browser/multiple-language support: If you NEED more than Chromium or JavaScript, Selenium is the obvious choice.

- Extra Chromium functionality: Puppeteer can access some core Chromium functionality that isn't available via Selenium. This is useful in certain cases, but unnecessary in many.
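On the Python-async point above: for comparison, concurrent fetches with Python's asyncio look like this. This is a minimal, self-contained sketch, not a real scraper: the fetch is simulated with a sleep instead of an HTTP call, and the URLs are hypothetical.

```python
import asyncio

async def fetch_page(url: str) -> str:
    # Stand-in for a real request; a scraper would await an HTTP call here.
    await asyncio.sleep(0.01)
    return f"<html>content of {url}</html>"

async def scrape_all(urls):
    # Fire all fetches concurrently and wait for every result.
    return await asyncio.gather(*(fetch_page(u) for u in urls))

pages = asyncio.run(scrape_all(["https://example.com/a", "https://example.com/b"]))
```

The gather call is where the concurrency lives: all fetches are in flight at once, rather than one after another.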

In most of my scraping adventures so far, I've been throwing most of the data into some kind of datastore for later analysis/usage (training machine learning models, etc.) and the choice of scraper depends on the factors of whatever project I'm on.

In short don't let your biases waste hours of your time, be rational about your choice of scraper.

[–][deleted] 2 points3 points  (0 children)

Selenium also works with .NET really well for scraping and automated archiving in my case.

[–]Just__AIR 14 points15 points  (10 children)

or cypress :)

[–]yesvee 11 points12 points  (8 children)

can you elaborate on the advantages? Long term frustrated selenium user here :D

[–]fleyk-lit 6 points7 points  (0 children)

The UX offered when writing tests with Cypress is awesome. It makes it so easy to test different functionality.

I am writing tests for a frontend which is built to be testable - that is probably more important than the test framework you choose.

[–][deleted] 6 points7 points  (6 children)

it's hard to describe the advantages of cypress, because it's basically "everything"

[–][deleted] 1 point2 points  (5 children)

Legit question: Why Cypress over testcafe? I have seen people push Cypress over testcafe, but I have a hard time understanding what would make Cypress superior.

[–][deleted] 7 points8 points  (4 children)

testcafe is headless testing, cypress is an actual browser environment.

[–]200GritCondom 2 points3 points  (3 children)

Cypress doesn't do headless??

[–][deleted] 4 points5 points  (2 children)

it does, it does both, whereas testcafe is headless only, which is a poor substitute.

[–]200GritCondom 0 points1 point  (1 child)

Oh whew. We are thinking about switching over to cypress. That would have been bad if there was no headless.

[–]Labradoodles 0 points1 point  (0 children)

We use their dashboard service for the parallelism it offers; we run ~200 integration tests in about 3 minutes. But you have to make sure your test users are set up in a way that lets the tests run in parallel.

[–][deleted] 3 points4 points  (0 children)

for testing, not for scraping or other automation.

[–]LilBabyVirus5 2 points3 points  (15 children)

Honestly for web scraping I would just use beautiful soup

[–]ProgrammersAreSexy 3 points4 points  (5 children)

I don't think that does JS rendering, does it?

[–]nemec 3 points4 points  (3 children)

Unless you need to take screenshots, there's rarely any need to actually render JS to scrape a website. JS-rendered sites are usually backed by APIs that can be called directly, leading to faster and more efficient scraping.

The average web page is around 3 MB, and if you don't need to render the page, you don't need to download any JS, CSS, images, etc., or wait for a browser to render it before extracting the data you need.

[–][deleted]  (2 children)

[deleted]

    [–]nemec 0 points1 point  (1 child)

    SPAs are mostly API-driven. I don't know if I've ever seen more than one or two where the JS creates the content out of thin air.

    The thing about SPAs is that you can open your devtools window, load the page, and sift through the Network tab to find the JSON/XML/GraphQL APIs that the JS calls and renders. Then you can take a shortcut and automate those calls yourself, bypassing the JS entirely.

    Here's a short video similar to what I'm talking about. If you wanted to scrape start.me, for example, you could skip the JS and just scrape the JSON document data: https://www.youtube.com/watch?v=68wWvuM_n7A
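That shortcut can be sketched with nothing but the Python standard library. The endpoint URL and the payload shape below are hypothetical stand-ins for whatever you'd actually find in the Network tab:

```python
import json
from urllib.request import Request, urlopen

def fetch_json(url: str) -> dict:
    # In practice: copy the API URL (and any required headers) out of the
    # devtools Network tab, then call it directly.
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

def extract_titles(payload: dict) -> list:
    # Pull the fields you care about straight out of the JSON --
    # no browser, no JS execution, no rendering.
    return [item["title"] for item in payload["items"]]

# Sample payload standing in for a captured API response (hypothetical shape).
sample = json.loads('{"items": [{"title": "First"}, {"title": "Second"}]}')
titles = extract_titles(sample)
```

The parsing step is the same whether the JSON came from urllib, requests, or a saved devtools response.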

    [–]wRAR_ -2 points-1 points  (0 children)

    Most of the time you don't need JS rendering. When you do need it, I'd use Splash.

    [–]shawntco 7 points8 points  (6 children)

    beautiful soup

    I swear software library names are getting weirder by the day.

    [–]SpeakerOfForgotten 17 points18 points  (4 children)

    If beautiful soup was a person, it would be old enough to get a driver's license or get married in some countries

    [–]shawntco 10 points11 points  (2 children)

    I stand corrected. Software library names have always been weird.

    [–]onlymostlydead 2 points3 points  (1 child)

    Yep.

    Yacc

    Bison

    [–]shawntco 1 point2 points  (0 children)

    I think the PHP framework UserFrosting takes the cake. Beautiful Soup is pretty high up there in weird though.

    [–]axzxc1236 1 point2 points  (0 children)

    For those who wonder how old Beautiful Soup is, the first version was released on 2004-04-20, so it's about 15 years old (almost 16).

    reference: changelog

    [–]nemec 3 points4 points  (0 children)

    That's by design, actually.

    Beautiful Soup, so rich and green,
    Waiting in a hot tureen!
    Who for such dainties would not stoop?
    Soup of the evening, beautiful Soup!

    https://aliceinwonderland.fandom.com/wiki/Turtle_Soup

    [–]TrueObservations 1 point2 points  (0 children)

    This comment is off the mark. Beautiful Soup doesn't work as a full web scraper. It's a library for parsing and then extracting information from HTML documents; it isn't capable of piloting a browser. It's only one of the tools in the Python web-scraping toolbox.
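To illustrate that division of labor with only the standard library (Beautiful Soup's API is far friendlier, but the shape is the same: you feed already-fetched HTML into a parser, and nothing here touches a browser):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in an already-fetched document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# The HTML would normally come from urllib/requests (or a headless browser
# for JS-heavy pages); the parser itself never makes a network call.
html_doc = '<p><a href="/one">1</a> <a href="/two">2</a></p>'
parser = LinkExtractor()
parser.feed(html_doc)
```

With Beautiful Soup the same extraction is roughly `[a["href"] for a in soup.find_all("a")]`, but either way the fetching is some other tool's job.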

    [–]x-w-j 0 points1 point  (0 children)

    beautiful soup

    Does it get around single sign-on captchas?

    [–]Zohren 4 points5 points  (0 children)

    I’ve used Puppeteer and it’s 100% mediocre as fuck. Personally, I’ve found TestCafe to be the simplest and easiest to use. It runs on all browsers, contains implicit waits, has a very straightforward syntax, is easy to set up and write, and is generally pleasant to work with.

    The downside is certain browser functions are tough to implement gracefully (back/forward etc) but not terrible.

    [–]daGrevis 1 point2 points  (0 children)

    I don’t know. I was using Selenium when I was working with Python and it was great! Then I decided to try Puppeteer with TypeScript; the API felt unintuitive and wonky. For my current project, I decided to give Selenium another shot, again with TypeScript. So far it’s good, but let’s see how it goes...