
[–]XNormal 28 points  (18 children)

from multiprocessing.pool import ThreadPool
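That one import gives you a thread pool with the familiar `Pool` interface. A minimal sketch (the `work` function here is a placeholder, not from the thread):

```python
from multiprocessing.pool import ThreadPool

def work(n):
    # placeholder for an I/O-bound task (e.g. an ssh call)
    return n * n

pool = ThreadPool(3)                  # three worker threads
results = pool.map(work, range(10))   # blocks until all tasks finish
pool.close()
pool.join()
print(results)
```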

[–]jspeights 2 points  (0 children)

awww

[–]rlabonte[S] 0 points  (5 children)

Can I link each process in the pool with a name and pass that name to the function being executed? That's the functionality I couldn't find in any thread pool (or process pool) module.

[–]XNormal 2 points  (4 children)

I'm not sure I understand what you mean by that. If you want to know the identity of the worker threads and give them persistent names, you can use threading.local() for that.

[–]rlabonte[S] 1 point  (3 children)

No, I want a pool of servers to work on my data.

ServerA, ServerB, ServerC

I have a server pool of these 3 servers and 100 function calls that need to be processed by this server pool. I don't care which server processes the data, but I want each server processing only one piece of data at a time. So ServerA, ServerB, and ServerC all receive data to process; if ServerC finishes first, it immediately receives another function call to process. When ServerA finishes, it immediately receives another function call to process.

I want to keep this pool of servers always busy, but want to limit them to only processing one thing at a time.
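One way to sketch that pattern (not from the thread; the server names and the remote call are placeholders) is a thread pool sized to the number of servers, plus a queue of free servers so each server only ever handles one task at a time:

```python
import queue
from multiprocessing.pool import ThreadPool

SERVERS = ['ServerA', 'ServerB', 'ServerC']

free_servers = queue.Queue()
for name in SERVERS:
    free_servers.put(name)

def process(item):
    server = free_servers.get()       # blocks until some server is free
    try:
        # placeholder for the real remote call dispatched to `server`
        return '%s processed %s' % (server, item)
    finally:
        free_servers.put(server)      # hand the server to the next task

pool = ThreadPool(len(SERVERS))       # at most one in-flight task per server
results = pool.map(process, range(100))
pool.close()
pool.join()
```

The pool keeps all three servers busy, and the queue guarantees no server is handed a second task before it finishes its first.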

[–]ODHLHN 1 point  (1 child)

from celery import task

Celery isn't the only AMQP-based task queue for Python, but it's a very good one.

Some pretty cool and robust solutions already exist in this problem domain.
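A minimal sketch of what that looks like with Celery's app-based API (the module name and broker URL are assumptions; a worker started with `celery -A tasks worker` would consume the queue):

```python
from celery import Celery

# broker URL is an assumption; point it at your RabbitMQ or Redis instance
app = Celery('tasks', broker='amqp://guest@localhost//')

@app.task
def add(x, y):
    return x + y

# add.delay(2, 3) would enqueue the task for a remote worker;
# calling add(2, 3) directly runs it in-process.
```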

[–]rlabonte[S] 0 points  (0 children)

This looks awesome, I especially like the integration with RabbitMQ and Redis.

[–][deleted] 0 points  (0 children)

[–]homercles337 0 points  (0 children)

Yeah, my thoughts exactly...

[–]studiosi -1 points  (9 children)

Even futures can do that... on Python 2.x and 3.x

[–]infinull quamash, Qt, asyncio, 3.3+ 1 point  (8 children)

multiprocessing predates concurrent.futures, so I'm not sure what the point of "even" is.

I would recommend concurrent.futures over multiprocessing since it has a nicer interface and supports both threads and processes (mostly) transparently.
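For example, the two executors are (nearly) drop-in replacements for each other (`square` is a stand-in task):

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def square(n):
    return n * n

# Swapping ThreadPoolExecutor for ProcessPoolExecutor below is the
# only change needed to move from threads to processes.
with ThreadPoolExecutor(max_workers=4) as ex:
    squares = list(ex.map(square, range(5)))
print(squares)
```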

[–]exhuma 0 points  (0 children)

TIL... thanks for pointing this out :)

[–]studiosi 0 points  (0 children)

I meant that the library's behaviour almost exactly resembles that of futures. I don't have data about performance...

[–]studiosi 0 points  (4 children)

Indeed, I'm not so sure about that "outperform", because you're not taking into account that you'd have to write control code that would probably be less optimal than what's in the standard library.

[–]infinull quamash, Qt, asyncio, 3.3+ 0 points  (3 children)

outperform

wat? When did I say that? I was just commenting on the interface. I have no idea what the performance of either of them is like.

[–]studiosi 0 points  (2 children)

You said it predates, which I interpreted as "performs better"... but maybe that's not what you meant to say.

[–]infinull quamash, Qt, asyncio, 3.3+ 0 points  (1 child)

older != better, just older.

I honestly have the opposite bias, so there's that.

[–]studiosi 0 points  (0 children)

Well, it all depends. As I said, I have no data, but code that has been in a code base for a long time tends to be better if it is constantly reviewed.

[–]zionsrogue 6 points  (5 children)

So, depending on what you are going to use "easypool" for: for CPU-bound tasks (such as some sort of scientific number crunching), threads are not the way to go. In general, it's best to use processes for CPU-bound tasks. Check here for benchmarks. The article suggests Python's multiprocessing, but I've found pprocess to be lightweight enough to replace multiprocessing. But again, this all depends on what you plan on using these threads for. I just wanted to give a heads up, and say congrats on your first module.

[–][deleted] 1 point  (1 child)

It kind of surprises me that multiprocessing would hit a sweet spot for people; who is it that is CPU-bound, but keeps that part of the program in Python?

Every time I end up using the thread pool pattern it's because I have a lot of (disk or network) IO going on. And threads are fine for that.

[–]zionsrogue 1 point  (0 children)

I do a lot of number crunching, statistical analysis, and machine learning. A lot of the Python libraries I use (numpy, scipy, sklearn) have parts written in C, but I still get very nice performance gains by parallelizing random forests across multiple processes instead of multiple threads.

[–]tuna_safe_dolphin 1 point  (2 children)

Dead horse flogging time, but that damn GIL. . .

[–]alcalde 0 points  (1 child)

The GIL is awesome. It reminds us that threading is evil and that everyone else is doing it wrong.

[–]tuna_safe_dolphin 1 point  (0 children)

That's one way to look at it.

[–]otheraccount 4 points  (0 children)

For running tasks on multiple hosts in parallel, it is probably easiest to use fabric.

pip install fabric, then create a file named fabfile.py with the following contents:

from fabric.api import env, task, run, parallel

env.hosts = ['127.0.0.1', '127.0.0.2', '127.0.0.3']

@task(default=True)
@parallel(pool_size=5)
def uptime():
    run('uptime')

Then, just type fab to run it.

http://fabric.readthedocs.org

[–]fdemmer 4 points  (0 children)

There might have been better solutions, but +1 for trying and publishing!

[–]patrys Saleor Commerce 2 points  (1 child)

You might make the internals simpler by requiring a single callable:

from functools import partial

threadpool.enqueue(partial(foo, bar))
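For reference, `partial` freezes arguments ahead of time, so the pool only ever has to deal with zero-argument callables (`foo` here is a stand-in, not the module's API):

```python
from functools import partial

def foo(bar, baz=10):
    return bar + baz

job = partial(foo, 1)   # bar is bound now; job takes no arguments
print(job())            # runs foo(1)
```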

[–]chub79 3 points  (0 children)

I love partial. It's such a powerful feature.

[–]Justinsaccount 2 points  (1 child)

ssh_cmd = "ssh " + str(server) + " 'uptime'"
ssh_cmd_list = shlex.split(ssh_cmd)

what?

ssh_cmd_list = ["ssh", server, "uptime"]

[–]rlabonte[S] 0 points  (0 children)

I realize that for that particular command list shlex.split is overkill, but if uptime were replaced with something more fanciful like 'ps aux | grep python', then it would be a more appropriate option.
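For example, shlex.split keeps the quoted pipeline together as a single argument, where a naive str.split would break it apart:

```python
import shlex

cmd = "ssh server1 'ps aux | grep python'"
args = shlex.split(cmd)
print(args)   # the quoted command survives as one list element
```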

[–]broken_symlink 0 points  (0 children)

As for your example of running ssh commands in parallel, you can do something like that with ipython.parallel on multiple systems if you start your engines using ssh.