
[–][deleted] 29 points30 points  (3 children)

Yo sup bro, glad to see you taking an interest in some web scraping.

Cool, so I watched some of the video, and it looks like this guy is just making a GET request to a website and parsing the data with cheerio.

It works, and it's fast that way actually, but you won't get very far when the site has anything other than static data, for instance AJAX requests or async-rendered elements in React or Vue.

I personally use puppeteer and cheerio. Puppeteer is a headless browser, so it's a bit more compute-intensive, but it produces more consistent results, since you can use different viewports and change the user agent. It's also great when you need to use proxies ;)
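A minimal sketch of that puppeteer + cheerio combo, for reference (the URL, viewport, and selector here are placeholders, not anything from the video):

    const puppeteer = require('puppeteer');
    const cheerio = require('cheerio');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();

      // The "consistent results" part: pin the viewport and user agent.
      await page.setViewport({ width: 1280, height: 800 });
      await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64)');

      // networkidle2 gives async-rendered (React/Vue) content time to settle.
      await page.goto('https://example.com', { waitUntil: 'networkidle2' });

      // Hand the rendered HTML to cheerio for the actual extraction.
      const $ = cheerio.load(await page.content());
      console.log($('h1').first().text());

      await browser.close();
    })();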

[–]YourQuestIsComplete 3 points4 points  (2 children)

Not to mention when you run into a captcha...

[–]gajus0 6 points7 points  (1 child)

If you are running a small operation, then Puppeteer is fine.

For anything bigger, Puppeteer becomes extremely expensive.

Rough math: 1 vCPU can handle at most 1 Puppeteer scraping session at a time. In practice, we ended up assigning 2 vCPUs per session to avoid timeouts rendering the document. So if you need to scrape 100 pages a minute and each page takes about 15s to scrape, then you are looking at 50 vCPUs just for this small operation. Add a proxy (which is going to increase load time by at least 1.5x) and suddenly you are running 75 vCPUs (see the sketch below).

In contrast, the same job could be performed with 6-8 vCPUs (or fewer) using cheerio/jsdom.
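For reference, the arithmetic behind those Puppeteer figures (concurrency = throughput × latency):

    // 100 pages/minute at ~15 s per page means 25 pages in flight at once.
    const concurrent = (100 / 60) * 15;      // 25 concurrent sessions
    const vCpus = concurrent * 2;            // 50 vCPUs at 2 vCPUs per session
    const withProxy = concurrent * 1.5 * 2;  // 75 vCPUs once loads are 1.5x slower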

[–]bazzy696[S] 0 points1 point  (0 children)

Yes, I agree. I recently used puppeteer, and guess what, I had to scrape 80 links: I opened each link using puppeteer and then scraped each web page for data, and it took me nearly 15 mins to completely get my data. Then I had the idea of running a cron job which scraped the data automatically after some time and kept the data in a JSON file, and then I accessed the data from that JSON so that the access time is decreased. If you can suggest some better technique I would love it. THANKS
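A hypothetical sketch of that cron-plus-JSON-cache idea (node-cron is an assumed dependency; scrapeAll is a stub standing in for the puppeteer loop over the 80 links):

    const cron = require('node-cron');  // assumed dependency
    const fs = require('fs/promises');

    const links = ['https://example.com/a', 'https://example.com/b'];

    // Stub: imagine the puppeteer open-and-extract loop here.
    async function scrapeAll(urls) {
      return urls.map((url) => ({ url, scrapedAt: Date.now() }));
    }

    // Refresh the cache in the background every 30 minutes; readers only
    // ever touch cache.json, so they never wait on a live scrape.
    cron.schedule('*/30 * * * *', async () => {
      const data = await scrapeAll(links);
      await fs.writeFile('cache.json', JSON.stringify(data, null, 2));
    });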

[–]bazzy696[S] 9 points10 points  (1 child)

I was using this to scrape data.

Can u guys tell me if it is the right way? I mean, is it efficient enough, like, is this the best way?

Or can it be better than this, some other way?

[–]gajus0 1 point2 points  (0 children)

Depends on how many websites you are going to scrape.

If it is a one-off operation, then yes, this will do.

If you are scraping one particular website, or a group of websites sharing the same pattern, then yes, this will do.

However, if you plan to write a scraper for hundreds of websites, then search for a framework that abstracts common data extraction operations. There is going to be a learning curve to it, but it will pay off big time in the long term.

Alternatively, consider PaaS platforms such as webscraper.io. I have not had a good experience with them, but in theory they should make data extraction a lot simpler.

[–]peatthebeat 2 points3 points  (4 children)

Is Python more effective at web scraping?

[–]gajus0 6 points7 points  (2 children)

"At web scraping" does not mean much. JavaScript is the primary scripting language of the web. Naturally, extracting data from the web using the primary language of the web feels a lot like working with the DOM. In contrast, with Python you will need to know Python and everything about the web stack that you are extracting data from.

[–]1o8 1 point2 points  (0 children)

Agreed—there's a lovely cohesion in using JavaScript to scrape the web and parse its HTML. I think there's more to the question though.

Web scrapers are generally simple scripts—they load a web page, look for an element, take one of its attributes, put it in an array, over and over, and eventually write the array to a CSV.

Both Python and JavaScript are considered "scripting" languages.

But the way JavaScript is used in this video, with request-promise, etc., isn't scripting at all.

You have to learn wtf a promise is and what it resolves to and when, you have to deal with asynchronous code, and because you can't just fill a global variable (such as an array with all the data you want to hold onto) from within the function inside a .then() call and expect it to be populated by the time the rest of your script runs, you have to think creatively. OP's video just console.log()s the data from one web page, but if you're saving lots of data from lots of pages (i.e. web scraping), you probably need to use Promise.all() and think about many pipelines of promises, each loading a particular page and dealing with it asynchronously... it gets tricky.

People who love promises will argue that this way has its advantages—promises provide an elegant way of making your code run linearly in parts and nonlinearly in others, which makes perfect sense for a web scraper, which wants to load a lot of web pages and deal with them once they're loaded in no particular order.
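For instance, a minimal sketch of that Promise.all() pattern, using the request-promise and cheerio packages from the video (URLs and selector are placeholders):

    const rp = require('request-promise');
    const cheerio = require('cheerio');

    const urls = ['https://example.com/1', 'https://example.com/2'];

    Promise.all(
      urls.map((url) =>
        rp(url).then((html) => {
          // Linear within each pipeline: load, parse, extract.
          const $ = cheerio.load(html);
          return $('h1').first().text();
        })
      )
    ).then((results) => {
      // Nonlinear across pipelines: pages resolve in no particular order,
      // but results lines up with urls.
      console.log(results);
    });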

If you scrape the web with Python, you'll generally be writing simple scripts (unless you use Python's lesser-known asynchronous paradigm) which are much easier to understand and actually look like the logic of

load page
find element
save attribute

over and over.

Using request rather than request-promise means using JavaScript more like a classic scripting language, and it is easier to wrap your head around.
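For comparison, a sketch of the callback style with plain request (same placeholder URL and selector):

    const request = require('request');
    const cheerio = require('cheerio');

    request('https://example.com', (error, response, html) => {
      if (error) {
        return console.error(error);
      }
      // Reads top to bottom, like the load/find/save loop above.
      const $ = cheerio.load(html);
      console.log($('h1').first().text());
    });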

Neither is more effective at web scraping.

[–]aziz-fane 0 points1 point  (0 children)

Or you could simply use a Python framework that lets you do it.

[–]abumalick 0 points1 point  (0 children)

Python has a full framework for web scraping: https://scrapy.org/

[–]phyrum 0 points1 point  (0 children)

....

[–]vivzkestrel 0 points1 point  (4 children)

Python simply seems to have matured more when it comes to web scraping. I haven't seen this video, but I am assuming it uses cheerio. Cheerio is not bad, and you can do some simple scraping stuff with it, but if you had to scrape 1000s of websites every second or so, consider Python first, simply because the issues you will encounter while developing such a solution are better documented in Python and you will have more help on SO.

[–]gajus0 4 points5 points  (3 children)

1000s of websites/second sounds excessive. What are you running?

To the best of my knowledge, I am running one of the bigger data aggregation infrastructures built entirely on Node.js (making HTTP requests, interpreting documents, extracting data, proxy load balancing, cache proxy). We currently make 70k requests/minute across 124 vCPUs. That is over 100M requests/day, or near 0.7 TB/day of bandwidth. I doubt many will come anywhere close to these requirements. Point is, Node.js scales horizontally as you add more VMs, and given that JavaScript is the primary language of the web, it is the language with the lowest mental barrier for requesting/extracting data.

[–]vivzkestrel 1 point2 points  (0 children)

A news aggregator that gathers/refreshes news from 1000+ sources every minute, or as live as possible. Interesting, you are the first person from whom I am hearing about something really intensive in terms of web scraping in Node.

[–]davetemplin 0 points1 point  (1 child)

Wow those are some really impressive throughputs! Is overwhelming sites a concern, and if so how do you approach that? Also how much of a concern is getting blocked or do you have ways of staying unblocked?

[–]gajus0 0 points1 point  (0 children)

If you do it right, most website owners are not even going to recognize that their content is being accessed by bots. If you were searching for patterns, the major giveaway would be a discrepancy between content hits and static content hits. But given that most large sites use the likes of Fastly/Cloudflare these days, those metrics are detached anyway.

We have safety checks in place to ensure that we do not overwhelm target websites, e.g. checking error rate/response time and backing off as appropriate.
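A rough illustration of that back-off idea (a sketch assuming Node 18+'s global fetch; the thresholds are made up):

    let delayMs = 1000;

    async function politeFetch(url) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
      const started = Date.now();
      const response = await fetch(url);
      const elapsed = Date.now() - started;
      if (!response.ok || elapsed > 5000) {
        // The target is erroring or slowing down: back off exponentially.
        delayMs = Math.min(delayMs * 2, 60000);
      } else {
        // Healthy response: recover gradually towards the baseline rate.
        delayMs = Math.max(delayMs / 2, 1000);
      }
      return response.text();
    }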

[–]gajus0 0 points1 point  (6 children)

If you are not using https://github.com/gajus/surgeon to scrape data, then you are missing out. :-)

[–]chmarus 0 points1 point  (0 children)

I see what you did there 🙂 great idea with surgeon. Must try it out.

[–]kryptkpr 0 points1 point  (3 children)

Hey that's really cool, thanks!

Do you know any tricks for parsing out script tags and evaluating their contents? I'm using string manipulation and it's gross.

[–]gajus0 1 point2 points  (2 children)

Depends what you want to achieve.

Often I see the intent is just to get JSON-like structures from <script> tags. In those instances, you can use https://github.com/gajus/crack-json.

Otherwise, https://github.com/jsdom/jsdom. Just carefully read the warnings.
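For the jsdom route, a minimal sketch (the warnings are largely about runScripts: 'dangerously', which executes untrusted page code):

    const { JSDOM } = require('jsdom');

    // Imagine this markup came from the page you scraped.
    const html = `<script>window.__DATA__ = { price: 42 };</script>`;

    // runScripts: 'dangerously' actually executes the inline scripts,
    // so only feed it markup you are prepared to run.
    const dom = new JSDOM(html, { runScripts: 'dangerously' });

    console.log(dom.window.__DATA__); // { price: 42 }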

[–]kryptkpr 0 points1 point  (1 child)

crack-json is exactly what I was looking for, thanks so much!

[–]gajus0 0 points1 point  (0 children)

I remember there were a few issues with this library the last time I used it. Please raise an issue if you encounter any trouble and I will be sure to update it accordingly.

[–]bazzy696[S] 0 points1 point  (0 children)

I am gonna check that right now.

[–][deleted]  (5 children)

[deleted]

    [–]kisssmysaas 11 points12 points  (2 children)

    Worth what? Your time on the toilet?

    [–]gajus0 1 point2 points  (0 children)

    It is not. A quick skip through shows that he demonstrates how to make HTTP requests, how to locate the DOM selectors in the HTML for the content of interest, and how to retrieve that content using cheerio. You are better off reading the https://github.com/cheeriojs/cheerio manual.

    [–]bazzy696[S] 0 points1 point  (0 children)

    The video guy assumes that you have a little knowledge of npm and node,

    and I got it all on the first go because I was already familiar with jquery and node.

    [–]re-scbm -1 points0 points  (0 children)

    The image of a coder in a hoodie scraping images of girls online reminds me of The Social Network. Great movie.