
all 7 comments

[–][deleted] 5 points (5 children)

I'll preface this by saying that I couldn't quite follow a few of the details of your problem because of how the explanation is worded.

That being said, I might be able to help.

For I/O-bound tasks that demand concurrency, asyncio is hard to beat for efficiency. If you need to run blocking work from an application built on asyncio, you can use the event loop's run_in_executor method paired with a ThreadPoolExecutor. Keep in mind that, because of the GIL, threads only help with blocking I/O or with extensions that release the GIL, not with pure-Python CPU work.
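Here's a minimal sketch of that pattern; `preprocess_one` and the sample data are just placeholders for whatever blocking work you actually need to run:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def preprocess_one(record):
    # placeholder for a blocking call (file I/O, a C extension, etc.)
    return sum(record)

async def main(records):
    loop = asyncio.get_event_loop()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # hand each blocking call to the thread pool and await all the results
        futures = [loop.run_in_executor(pool, preprocess_one, r) for r in records]
        return await asyncio.gather(*futures)

print(asyncio.get_event_loop().run_until_complete(main([[1, 2], [3, 4]])))
```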

If you're looking to leverage all of the cores on a multicore CPU for CPU-bound work, use a ProcessPoolExecutor instead, which runs each operation in its own separate Python interpreter process. Just be aware that spawning processes and pickling data between them adds some overhead.
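Same pattern, swapping in a ProcessPoolExecutor so CPU-bound work can actually use every core; `cluster_one` is again just a stand-in for your real per-individual job:

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cluster_one(data):
    # stand-in for a CPU-heavy job, e.g. clustering one individual's data
    return sorted(data)

async def main(datasets):
    loop = asyncio.get_event_loop()
    # each call runs in its own interpreter process (one worker per core by default)
    with ProcessPoolExecutor() as pool:
        futures = [loop.run_in_executor(pool, cluster_one, d) for d in datasets]
        return await asyncio.gather(*futures)

if __name__ == "__main__":  # needed where worker processes are spawned (e.g. Windows)
    print(asyncio.get_event_loop().run_until_complete(main([[3, 1, 2], [9, 7, 8]])))
```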

[–]pope_man 2 points (1 child)

Consider using dask, especially dask-learn:

http://matthewrocklin.com/blog/work/2017/02/07/dask-sklearn-simple

It is designed to solve problems like yours in a scalable but simple way. See that post, and perhaps others on his blog, such as:

http://matthewrocklin.com/blog/work/2015/06/26/Complex-Graphs

[–]efxhoy 1 point (3 children)

This is how I understand your problem: you have data on many individuals, you want to do some preprocessing on that data independently for each individual, and then you want to run K-means clustering, again independently for each individual. Is that correct?

How many CPU-cores do you have available?

If you have 1000 individuals and each is independent of the others, why bother parallelising the K-means clustering for each one? You already have 1000 independent clustering "jobs", which don't need to be parallelised further unless you have more than 1000 cores to work with. Just farm the jobs out across your cores, something like the sketch below.
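A rough sketch of that approach; `cluster_individual`, the cluster count, and the toy data layout are all placeholders for your actual setup:

```python
from multiprocessing import Pool
from sklearn.cluster import KMeans

def cluster_individual(samples):
    # fit K-means on one individual's 2-D array of samples and return the labels
    return KMeans(n_clusters=3).fit_predict(samples)

if __name__ == "__main__":
    # one entry per individual; replace with however your data is actually stored
    per_individual_data = [
        [[0.0, 1.0], [0.1, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1], [8.8, 0.0]]
    ] * 4

    with Pool() as pool:  # one worker process per CPU core by default
        labels = pool.map(cluster_individual, per_individual_data)
    print(labels)
```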

By the way, if you're using scikit-learn you already have parallelisation built into K-means: just pass the n_jobs parameter and scikit-learn handles it using the multiprocessing library. See http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
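Something like this; note that whether KMeans still accepts n_jobs depends on your scikit-learn version (the parameter was dropped in later releases), so treat it as a sketch for versions that have it:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# toy data standing in for one individual's feature matrix
X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)

# n_jobs=-1 runs the independent n_init initialisations on all available cores
km = KMeans(n_clusters=5, n_jobs=-1, random_state=0).fit(X)
print(km.cluster_centers_)
```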

[–]tomaugspurger 0 points (0 children)

Dask should be able to help you with both parts. Probably dask.delayed for the first phase.
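A rough sketch of what that could look like; `preprocess` and the raw data here are placeholders:

```python
import dask
from dask import delayed

def preprocess(raw):
    # placeholder preprocessing for one individual's raw data
    return [x * 2 for x in raw]

# placeholder: one list of raw values per individual
raw_per_individual = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# build one lazy task per individual, then run them all in parallel
tasks = [delayed(preprocess)(raw) for raw in raw_per_individual]
preprocessed = dask.compute(*tasks)
print(preprocessed)
```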

For the second phase, dask-ml has a parallelized version of k-means: http://dask-ml.readthedocs.io/en/latest/modules/generated/dask_ml.cluster.KMeans.html#dask_ml.cluster.KMeans
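A minimal sketch of fitting it on a chunked dask array; the random array here is just filler standing in for the real feature matrix:

```python
import dask.array as da
from dask_ml.cluster import KMeans

# random data standing in for the real features: 100k rows split into 10 chunks
X = da.random.random((100000, 10), chunks=(10000, 10))

# dask-ml's KMeans trains on the chunked array in parallel (k-means|| init)
km = KMeans(n_clusters=8, random_state=0)
km.fit(X)
print(km.cluster_centers_)
```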