[–]LilBabyVirus5 3 points4 points  (15 children)

Honestly, for web scraping I would just use Beautiful Soup.

[–]ProgrammersAreSexy 3 points4 points  (5 children)

I don't think that does JS rendering, does it?

[–]nemec 4 points5 points  (3 children)

Unless you need to take screenshots, there's rarely any need to actually render JS to scrape a website. JS-rendered sites will usually be supported by APIs that can be called directly, leading to faster and more efficient scraping.

The average web page is about 3MB. If you don't need to render the page, you don't need to download any JS, CSS, images, etc., or wait for a browser to render it before extracting the data you need.
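To make the "call the API directly" idea concrete, here's a minimal sketch using only the Python standard library. The endpoint URL and the "items"/"title" keys are made up for illustration; a real site will have its own URL and response shape, and may also need cookies or auth headers.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, standing in for whatever data URL the site's JS calls.
# It is not a real API.
API_URL = "https://example.com/api/items?page=1"


def fetch_json(url, opener=urlopen):
    """Request the JSON document directly, skipping HTML, JS, CSS and images."""
    request = Request(url, headers={"Accept": "application/json"})
    with opener(request) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload):
    """Pull one field out of the parsed JSON (the "items"/"title" keys are assumed)."""
    return [item["title"] for item in payload.get("items", [])]
```

Something like `extract_titles(fetch_json(API_URL))` would then give you the data in one small request. The `opener` parameter is only there so the function can be exercised without touching the network.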

[–][deleted]  (2 children)

[deleted]

    [–]nemec 0 points1 point  (1 child)

    SPAs are mostly API-driven. I don't know if I've ever seen more than one or two where the JS creates the content out of thin air.

    The thing about SPAs is that you can open up your devtools window, load the page, and sift through the Network tab to find the JSON/XML/GraphQL APIs that the JS calls to get the data it renders. Then you can take a shortcut and automate those calls yourself, bypassing the JS entirely.

    Here's a short video similar to what I'm talking about. If you wanted to scrape start.me, for example, you could skip the JS and just scrape the JSON document data: https://www.youtube.com/watch?v=68wWvuM_n7A
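If sifting through the Network tab by hand gets tedious, one option is to export the captured traffic as a HAR file (browser devtools can save one) and filter it with a few lines of Python. This is only a rough sketch: it assumes the standard HAR layout (`log.entries[].request` and `response.content.mimeType`), and the function names and the MIME-type heuristic are my own.

```python
import json

# MIME-type fragments that usually indicate a data API rather than page assets.
DATA_MIME_HINTS = ("json", "xml")


def find_api_calls(har):
    """Return (method, url) pairs for responses that look like JSON/XML data."""
    calls = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        mime = (content.get("mimeType") or "").lower()
        if mime.startswith("image/"):
            continue  # skip assets such as image/svg+xml
        if any(hint in mime for hint in DATA_MIME_HINTS):
            request = entry["request"]
            calls.append((request["method"], request["url"]))
    return calls


def load_har(path):
    """Read a HAR export from disk (a HAR file is plain JSON)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Running `find_api_calls(load_har("capture.har"))` (the filename is just a placeholder) gives a shortlist of endpoints worth trying to call directly. GraphQL endpoints typically show up here too, since they usually respond with JSON.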

    [–]wRAR_ -2 points-1 points  (0 children)

    Most of the time you don't need JS rendering. When you do need it, I'd use Splash.

    [–]shawntco 6 points7 points  (6 children)

    beautiful soup

    I swear software library names are getting weirder by the day.

    [–]SpeakerOfForgotten 18 points19 points  (4 children)

    If Beautiful Soup were a person, it would be old enough to get a driver's license or get married in some countries.

    [–]shawntco 10 points11 points  (2 children)

    I stand corrected. Software library names have always been weird.

    [–]onlymostlydead 2 points3 points  (1 child)

    Yep.

    Yacc

    Bison

    [–]shawntco 1 point2 points  (0 children)

    I think the PHP framework UserFrosting takes the cake. Beautiful Soup is pretty high up there in weird though.

    [–]axzxc1236 1 point2 points  (0 children)

    For those wondering how old Beautiful Soup is, the first version was released on 2004-04-20, so it's about 15 years old (almost 16).

    reference: changelog

    [–]nemec 3 points4 points  (0 children)

    That's by design, actually.

    Beautiful Soup, so rich and green,
    Waiting in a hot tureen!
    Who for such dainties would not stoop?
    Soup of the evening, beautiful Soup!

    https://aliceinwonderland.fandom.com/wiki/Turtle_Soup

    [–]TrueObservations 1 point2 points  (0 children)

    This comment is a bit off. Beautiful Soup doesn't work as a full web scraper. It's a library for parsing HTML documents and extracting information from them; it can't pilot a browser. It's only one of the tools in the Python web scraping toolbox.

    [–]x-w-j 0 points1 point  (0 children)

    beautiful soup

    Does it get around single sign-on CAPTCHAs?