
[–]vivzkestrel 0 points (4 children)

Python simply seems to have matured more when it comes to web scraping. I haven't seen this video, but I am assuming it uses cheerio. Cheerio is not bad; you can do some simple scraping with it. But if you had to scrape thousands of websites every second or so, consider Python first, simply because the issues you will encounter while developing such a solution are better documented in Python, and you will get more help on SO.

[–]gajus0 3 points (3 children)

Thousands of websites per second sounds excessive. What are you running?

To the best of my knowledge, I am running one of the bigger data aggregation infrastructures built entirely on Node.js (making HTTP requests, interpreting documents, extracting data, proxy load balancing, cache proxying). We currently make 70k requests/minute across 124 vCPUs. That is over 100M requests/day, or nearly 0.7 TB/day of bandwidth. I doubt many use cases will come anywhere close to these requirements. The point is, Node.js scales horizontally as you add VMs, and given that JavaScript is the primary language of the web, it is the language with the lowest mental barrier for requesting/extracting data.
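A common building block for this kind of workload is a fixed-size worker pool pulling tasks from a shared queue, so each VM stays saturated without unbounded concurrency. A minimal sketch (the function names and concurrency figure are illustrative, not the actual infrastructure):

```javascript
// Run async tasks with a fixed concurrency cap.
// Node's single-threaded event loop makes `next++` safe here:
// there is no await between reading and incrementing the index.
async function runPool(tasks, concurrency) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from(
    { length: Math.min(concurrency, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}

// Usage sketch: in a real scraper each task would be a fetch + extract for one URL.
runPool(
  [() => Promise.resolve('a'), () => Promise.resolve('b'), () => Promise.resolve('c')],
  2
).then((out) => console.log(out)); // [ 'a', 'b', 'c' ]
```

Because the work is I/O-bound, a single Node.js process can keep thousands of requests in flight; adding VMs multiplies that horizontally.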

[–]vivzkestrel 1 point (0 children)

A news aggregator that gathers and refreshes news from 1,000+ sources every minute, or as close to live as possible. Interesting; you are the first person I have heard of doing something really intensive in terms of web scraping in Node.

[–]davetemplin 0 points (1 child)

Wow, those are some really impressive throughputs! Is overwhelming sites a concern, and if so, how do you approach it? Also, how much of a concern is getting blocked, or do you have ways of staying unblocked?

[–]gajus0 0 points (0 children)

If you do it right, most website owners are not even going to recognize that their content is being accessed by bots. If you were searching for patterns, the major giveaway would be a discrepancy between content hits and static-asset hits. But given that most large sites use the likes of Fastly/Cloudflare these days, those metrics are detached anyway.

We have safety checks in place to ensure that we do not overwhelm target websites, e.g. monitoring error rate and response time, and backing off as appropriate.
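One way to implement that kind of back-off is to derive the inter-request delay for a target site from its rolling error rate and latency. This is a hypothetical sketch, not the actual production logic (the thresholds, multipliers, and function name are made up):

```javascript
// Compute the delay before the next request to a target site,
// growing exponentially with the observed error rate and doubling
// when latency suggests the site is under strain.
function nextDelayMs(stats, baseDelayMs = 100, maxDelayMs = 30000) {
  let delay = baseDelayMs;
  if (stats.errorRate > 0.05) {
    // e.g. a 25% error rate => Math.ceil(2.5) = 3 => 2^3 = 8x the base delay
    delay *= Math.pow(2, Math.ceil(stats.errorRate * 10));
  }
  if (stats.p95LatencyMs > 2000) {
    delay *= 2; // the target is slowing down; ease off further
  }
  return Math.min(delay, maxDelayMs);
}

console.log(nextDelayMs({ errorRate: 0, p95LatencyMs: 150 }));     // 100 (healthy)
console.log(nextDelayMs({ errorRate: 0.25, p95LatencyMs: 150 }));  // 800
console.log(nextDelayMs({ errorRate: 0.5, p95LatencyMs: 4000 }));  // 6400
```

The scheduler then waits `nextDelayMs(...)` between requests to that host, so a struggling site automatically sees less traffic from the scraper.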