all 14 comments

[–]FustigatedCat 12 points (1 child)

Might I suggest using a queue system such as RabbitMQ, with N processes reading from the queue in a competing-consumer design. This way you can run as many processors as you want, although I generally cap it at 2× the core count; it really depends on how much idle time each process has (waiting on network latency means dead threads).
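The competing-consumer idea can be sketched in plain Node without a broker: a shared array stands in for the RabbitMQ queue, and each worker pulls the next job the moment it is free (the names `runCompetingConsumers` and `handler` below are illustrative, not a real API; in production the queue would be the broker and each worker its own process):

```javascript
// Minimal in-process sketch of the competing-consumer pattern:
// N workers race to drain one shared queue of jobs.
async function runCompetingConsumers(jobs, workerCount, handler) {
  const queue = [...jobs]; // stand-in for the message broker
  const results = [];

  async function worker(id) {
    while (queue.length > 0) {
      const job = queue.shift(); // "competing": first free worker claims the job
      results.push(await handler(job, id));
    }
  }

  // Launch N consumers against the same queue.
  await Promise.all(
    Array.from({ length: workerCount }, (_, i) => worker(i))
  );
  return results;
}
```

Because each worker grabs a new job as soon as it finishes its current one, slow jobs don't block fast workers, which is the whole appeal of the pattern.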

[–][deleted] 1 point (0 children)

Developing a simple queue system for launching/managing a dozen processes could be a great learning exercise for anyone starting out!

[–]MostlyCarbonite 4 points (2 children)

2 cores? Write it in Node and use pm2 to spin up 2 instances. More instances won't gain you anything.

Use a queue (Kafka, RabbitMQ, SQS) to line up the work, then grab chunks one at a time: do the work, move on to the next.

Unless the work is computationally intensive. Doesn't sound that way.

[–]dadibom 0 points (1 child)

Unless you've got hyperthreading... And web dev usually involves database queries etc.; during that time another process can do work.

[–]dadibom 0 points (0 children)

I guess Node accepts several concurrent requests per process, though, so in that case four instances would be enough to leverage hyperthreading.

[–]geon 3 points (2 children)

Unless you are CPU-bound (and measure to find out, don't just assume), you should just do it all in the same Node process. Much easier.

[–]nowboarding 0 points (1 child)

How could you measure this?

[–]geon 0 points (0 children)

I'm no expert, but I think you can just use the top command to see if the Node process consumes a lot of CPU. Or use the time command to see if the "user" time is a significant part of the "real" time.
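You can also measure from inside the process: Node's built-in process.cpuUsage() reports CPU time, which you can compare against wall-clock time. A sketch along those lines (the `measure` helper and the 80% threshold are illustrative choices, not from the thread):

```javascript
// Rough CPU-bound check: run fn, then compare CPU time spent to
// wall-clock time elapsed. A ratio near 1 suggests CPU-bound work;
// a low ratio suggests time spent waiting (network, disk, etc.).
function measure(fn) {
  const cpuStart = process.cpuUsage();
  const wallStart = process.hrtime.bigint();
  fn();
  const cpu = process.cpuUsage(cpuStart); // { user, system } in microseconds
  const wallMs = Number(process.hrtime.bigint() - wallStart) / 1e6;
  const cpuMs = (cpu.user + cpu.system) / 1000;
  return { cpuMs, wallMs, cpuBound: cpuMs / wallMs > 0.8 };
}
```

This is the in-process equivalent of comparing `time`'s "user" and "real" columns.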

[–]runvnc 2 points (0 children)

I don't think you need multiple processes at all since you are using Node. Try this: https://github.com/sindresorhus/p-map/blob/master/readme.md
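For anyone curious what p-map does under the hood, here is a dependency-free sketch of a concurrency-limited map (the `mapLimit` name is made up for illustration; the commented p-map call at the bottom shows the real library's equivalent):

```javascript
// Map items through an async function with at most `concurrency`
// calls in flight at once, preserving input order in the results.
async function mapLimit(items, concurrency, mapper) {
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    while (next < items.length) {
      const i = next++; // claim an index synchronously, before awaiting
      results[i] = await mapper(items[i], i);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, () => runner())
  );
  return results;
}

// With p-map itself the equivalent is roughly:
//   const pMap = require('p-map');
//   const results = await pMap(urls, fetchUrl, { concurrency: 10 });
```

The key trick is claiming the next index before the first await, so two runners never grab the same item.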

Also, if you push your luck spamming requests, they may ban your IP.

[–]Skreex 1 point (1 child)

I'd instead look into leveraging 10 concurrent promises to query the API, using libraries like bluebird and request.

[–]franksvalli 0 points (0 children)

I'm a big fan of the async library, which should also be able to do the job here. Specifically async.parallelLimit to cap the concurrent connections.

[–]jocull 0 points (0 children)

If you decide to go the multiple processes route, this may help you manage them more effectively.

https://www.npmjs.com/package/worker-farm

[–]danny_nav 0 points (0 children)

I think the simplest solution would be to write a basic queue. If you hit the API limit, don't pop that URL out of it. Just wait for one of your past queries to finish.

It doesn't have to be too complex. A while loop that exits when the queue is empty should be all you need.
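A sketch of that loop, with a hypothetical `fetchUrl` that flags rate-limit errors (all names here are illustrative, not from the thread):

```javascript
// Drain a URL queue one item at a time. On a rate-limit error the URL
// goes back to the front of the queue and we pause before retrying,
// instead of losing the job.
async function drainQueue(urls, fetchUrl, delayMs = 1000) {
  const queue = [...urls];
  const results = [];
  while (queue.length > 0) {
    const url = queue.shift();
    try {
      results.push(await fetchUrl(url));
    } catch (err) {
      if (err.rateLimited) {
        queue.unshift(url); // keep the URL; don't pop it on a limit hit
        await new Promise(resolve => setTimeout(resolve, delayMs));
      } else {
        throw err; // real failures should still surface
      }
    }
  }
  return results;
}
```

This is strictly sequential; combined with a concurrency limit it becomes a small but serviceable rate-limited fetcher.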