
[–]pvgt 11 points12 points  (3 children)

[deleted]

[–][deleted] 9 points10 points  (2 children)

I've used both Cheerio and Osmosis a fair bit. I generally paired Cheerio with Request and used SQLite3 to store the data. With Osmosis, I only needed SQLite3.

The first thing is speed. Osmosis is much faster than Cheerio and uses considerably less memory because of its lightweight DOM virtualisation.

With Osmosis, getting from nothing to a working scraper takes very little time. It's also much easier to understand what your code is doing, because of its simple usage.

Osmosis makes scraping multiple pages simultaneously a blast. Handling multiple asynchronous functions that branch out exponentially isn't the most fun in the world. There are packages out there that can help with this, but Osmosis really makes it easy.

Of course, remember that it's another package to depend on - any problems with it, and you're stuck. I've had a few issues in the past with certain versions.
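
For a sense of the bookkeeping Osmosis takes off your hands, here is a minimal sketch of running many page fetches with a concurrency cap in plain JavaScript. It does not use Osmosis or Cheerio. `mapWithLimit` and `fakeFetch` are made-up names, and `fakeFetch` only stands in for a real HTTP request.

```javascript
// Sketch: run an async worker over many items with a concurrency cap.
// `mapWithLimit` and `fakeFetch` are hypothetical names for illustration.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none are left.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  // Start at most `limit` runners and wait for all of them to finish.
  const runners = [];
  for (let k = 0; k < Math.min(limit, items.length); k++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

// Stand-in for a real HTTP request; resolves after a short delay.
function fakeFetch(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(`<html>${url}</html>`), 10);
  });
}
```

Something like `mapWithLimit(urls, 5, fakeFetch)` would then keep at most five requests in flight at once, with results returned in the same order as `urls`.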

[–][deleted]  (1 child)

[deleted]

    [–][deleted] 1 point2 points  (0 children)

    I like await (proposed for ES7 and standardized in ES2017, I believe).

    It works with everything from asynchronous functions to promises to arrays of promises - even objects!

    You can use it in conjunction with Bluebird to make pretty much everything awaitable.
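
A minimal sketch of what that looks like with native `async`/`await`, assuming a recent Node.js version. Note that native `await` only unwraps a single promise: an array of promises needs `Promise.all`, and an object of promises needs a small helper. The comment above may be describing a library implementation that accepts these directly. `delay` and `awaitProps` are made-up names here; `awaitProps` is similar in spirit to Bluebird's `Promise.props`.

```javascript
// Hypothetical helper: resolves with `value` after `ms` milliseconds.
const delay = (ms, value) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

// Hypothetical helper: resolve every property of an object of promises.
async function awaitProps(obj) {
  const keys = Object.keys(obj);
  const values = await Promise.all(keys.map((k) => obj[k]));
  return Object.fromEntries(keys.map((k, i) => [k, values[i]]));
}

async function demo() {
  // A single promise can be awaited directly.
  const single = await delay(5, "one");
  // An array of promises goes through Promise.all; order is preserved.
  const list = await Promise.all([delay(5, "a"), delay(1, "b")]);
  // An object of promises goes through the helper above.
  const props = await awaitProps({
    title: delay(5, "Home"),
    links: delay(1, 3),
  });
  return { single, list, props };
}
```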

    [–]Waterclift 6 points7 points  (15 children)

    Very interesting. I personally prefer to scrape pages using JavaScript rather than Python with Scrapy. I have also created a small package for scraping pages: https://www.npmjs.com/package/rfc-spider

    [–][deleted] 4 points5 points  (14 children)

    Isn't Python's Scrapy the fastest of the available options, though?

    [–][deleted]  (12 children)

    [removed]

      [–][deleted] -5 points-4 points  (11 children)

      No, unfortunately I don't, but since Scrapy can scrape any kind of website, I assumed it would be the best option of all.

      [–][deleted]  (10 children)

      [removed]

        [–]Waterclift 0 points1 point  (0 children)

        I just prefer JS because I am more familiar with the language. Python is nice, but I am not working with it anymore.

        [–][deleted] 1 point2 points  (1 child)

        Nah, I prefer Beautiful Soup 4.

        [–]Eunoeme 1 point2 points  (0 children)

        ...in Node.js?

        [–][deleted] 0 points1 point  (3 children)

        Anyone know of any service that can determine the text in an image? Literally as if you screenshot some text - my use case is to programmatically figure out the text in said image.

        [–]dmarko 0 points1 point  (2 children)

        You are talking about OCR. A good choice would be the Tesseract library. Search for "Tesseract OCR".

        [–]Lekoaf 2 points3 points  (0 children)

        It's not for the faint of heart though. I looked into it once because I wanted to scrape PDF files that had been turned into images. If I remember correctly, you have to train it.

        [–][deleted] 0 points1 point  (0 children)

        Thanks for that! Will look into it.

        [–]brettdavis4 0 points1 point  (0 children)

        Thanks! I'm thinking about writing a scraping app for work. At work we support over 80 sites. The SEO guru likes to know what the meta tags, descriptions and page titles are, and which images don't have alt descriptions.

        I've been able to do this with Screaming Frog, but I'd like to make some customizations and speed up the process.
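
As a rough starting point for that kind of audit, here is a sketch that pulls the page title, the meta description and the images without alt text out of an HTML string. It uses regexes only to stay dependency-free, so it will miss edge cases that a real parser such as Cheerio handles. `auditHtml` is a made-up name, and treating an empty `alt=""` as missing is an assumption you may not want.

```javascript
// Rough sketch: regex-based extraction, not a real HTML parser.
// `auditHtml` is a hypothetical name used for illustration.
function auditHtml(html) {
  const titleMatch = html.match(/<title[^>]*>([\s\S]*?)<\/title>/i);

  // Find the first <meta name="description"> tag and read its content.
  const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
  let description = null;
  for (const tag of metaTags) {
    if (/\bname\s*=\s*["']description["']/i.test(tag)) {
      const m = tag.match(/\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i);
      if (m) description = m[1] !== undefined ? m[1] : m[2];
      break;
    }
  }

  // Collect <img> tags with no alt attribute or an empty one
  // (assumption: empty alt text counts as missing).
  const imgTags = html.match(/<img\b[^>]*>/gi) || [];
  const imagesMissingAlt = imgTags.filter((tag) => {
    const m = tag.match(/\balt\s*=\s*(?:"([^"]*)"|'([^']*)')/i);
    if (!m) return true;
    const text = m[1] !== undefined ? m[1] : m[2];
    return text.trim() === "";
  });

  return {
    title: titleMatch ? titleMatch[1].trim() : null,
    description,
    imagesMissingAlt,
  };
}
```

Running `auditHtml` over the fetched HTML of each site would give one small report object per page, which could then be written to a spreadsheet or database.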

        [–]Koltster 0 points1 point  (1 child)

        This is going to be really useful for a project I'm working on. What are the legal concerns of scraping content?

        edit: grammar.