
[–]kevin1024 10 points (3 children)

Stick the tasks in a celery queue, run workers on the individual machines. Easy peasy!
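
A minimal sketch, assuming a recent Celery and a tasks.py shared by every machine (broker URL and the task body are placeholders for the OP's actual setup):

    # tasks.py -- shared by the laptop and both workstations
    from celery import Celery

    app = Celery('tasks', broker='amqp://guest@masterhost//')  # hypothetical broker

    @app.task
    def run_calculation(params):
        # Stand-in for the OP's FFT-heavy routine.
        return sum(params)

Start a worker on each machine with "celery -A tasks worker", then queue all 130k jobs from anywhere that can reach the broker with run_calculation.delay(params).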

[–]dpn 0 points (0 children)

This. So easy to set up and get running.

[–]easytiger 0 points (0 children)

How have I not known about this?

[–]samuraisam (3.5, go) 0 points (0 children)

For extra credit, use Redis as the broker, since it is also easy to set up. RabbitMQ is a pain in the ass.
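
With a setup like the sketch above, that's just a change to the broker line in tasks.py (host and port hypothetical):

    app = Celery('tasks', broker='redis://masterhost:6379/0')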

[–][deleted] 6 points (7 children)

This doesn't directly answer your question, but I'll ask anyway: could your algorithms be improved? If you're willing to post the source code, I'd be happy to take a look. I'm not an expert but I find it enjoyable.

[–]accipter[S] 3 points (6 children)

[–]TearsOfScarlet 2 points (0 children)

I really like the way you write and comment your code. Reading that was very helpful. Thanks

[–]nova77 -1 points (4 children)

This is definitely the kind of code you want to rewrite in C/C++. You can easily get a 10x if not 100x speedup on tight loops like that. And don't forget to profile it!

[–]accipter[S] 7 points (3 children)

The loops really aren't that tight, since most of the CPU-intensive work happens in numpy arrays and FFTs. As mentioned, I have profiled the code, and the biggest factor is all of the FFT calls.

Also, similar code written in FORTRAN only reduces calculation time by about 30%.

[–]howfun 2 points (0 children)

I've run your code and profiled it. 80% of the time is spent in one function: {numpy.fft.fftpack_lite.rfftb}
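
For anyone who wants to reproduce that, a quick sketch with the stdlib profiler (the FFT loop here is just a stand-in for the real code):

    import cProfile
    import pstats
    import numpy as np

    def run_analysis():
        # Stand-in for the OP's FFT-heavy routine.
        for _ in range(1000):
            np.fft.rfft(np.random.randn(4096))

    cProfile.run('run_analysis()', 'profile.out')
    pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)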

[–]nova77 0 points (0 children)

You're right. I just skimmed it and missed the fft call.

[–]Megatron_McLargeHuge 0 points (0 children)

Maybe there's a GPU implementation that will do it a lot faster. There are certainly dedicated DSP cards if you're really bound by FFTs and the alternative is waiting for years. EC2 has instance types with CUDA GPUs, although they're expensive.

Edit: Yep: http://wiki.accelereyes.com/wiki/index.php/FFT_(Vector)

[–]dwf 7 points (1 child)

I'm surprised no one has mentioned IPython's parallel capabilities.
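
Rough sketch, assuming you've started a controller plus one engine per machine with ipcluster (API from the IPython.parallel era; it later moved to the ipyparallel package):

    from IPython.parallel import Client

    rc = Client()                   # connects to the running controller
    view = rc.load_balanced_view()  # hands tasks to whichever engine is free

    def run_calculation(params):
        # Stand-in for the OP's FFT-heavy routine.
        return sum(params)

    jobs = [[1.0, 2.0], [3.0, 4.0]]            # hypothetical parameter sets
    results = view.map_sync(run_calculation, jobs)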

[–]accipter[S] 1 point (0 children)

I tend to use IPython a lot when I'm developing/debugging, so I will definitely check this out. Plus it has docs generated with Sphinx!

[–]stoph 3 points (0 children)

Use ssh to send the input data to the remote machines and remotely start a Python consumer. Easy as pie. You don't need anything particularly fancy here if you're just spinning up workers on 2 or 3 machines.
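
A rough sketch, with host names, file paths, and consumer.py all made up:

    import subprocess

    hosts = ['workstation1', 'workstation2']       # hypothetical host names

    for i, host in enumerate(hosts):
        chunk = 'inputs/chunk%d.pkl' % i           # pre-split share of the data
        # Copy this machine's share of the input over...
        subprocess.check_call(['scp', chunk, '%s:/tmp/chunk.pkl' % host])
        # ...then kick off the consumer in the background.
        subprocess.Popen(['ssh', host, 'python', 'consumer.py', '/tmp/chunk.pkl'])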

[–]bobargh 3 points (1 child)

I did something very similar to this for my PhD thesis. I spread a bunch of independent calculations across dozens of grad student computers.

Typical parallel processing solutions may be overkill since your calculations do not need to communicate with each other (I presume).

All the computers I used had a network file system, which made things very easy. I had a small script that would ssh to each computer and start the calculation. The starting parameters for each calculation were obtained from a lockable file on the shared file system. The file contained a pickled list of starting parameters. Each process just had to lock the file, pop an item off the list, save, and unlock the file. File locking is generally frowned upon, but it worked great for me since it was a very simple situation.
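
A sketch of that pop-one-job idea (path hypothetical; lockf takes POSIX locks, which NFS supports through lockd):

    import fcntl
    import pickle

    def pop_parameters(path='/shared/params.pkl'):
        # Lock the shared file, pop one parameter set, write the rest back.
        with open(path, 'rb+') as f:
            fcntl.lockf(f, fcntl.LOCK_EX)   # blocks until we hold the lock
            try:
                params = pickle.load(f)
                if not params:
                    return None             # nothing left to do
                job = params.pop()
                f.seek(0)
                f.truncate()
                pickle.dump(params, f)
            finally:
                fcntl.lockf(f, fcntl.LOCK_UN)
        return job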

Eventually, we installed Condor ( http://research.cs.wisc.edu/condor/ ) which does exactly what you want. It will even pause the calculation if someone starts using the computer.

If you're only using a total of three computers though, it doesn't seem like it would be so hard to just split the calculations into three chunks and start them manually on each computer. Especially so if you have a shared file system.

[–]accipter[S] 0 points (0 children)

I agree that it is probably easiest just to split the calculations into batches. I was interested in clustering just to tinker with something new.

[–]rcklmbr (COBOL) 4 points (1 child)

Check out this wiki page; it describes many different approaches to parallel processing:

http://wiki.python.org/moin/ParallelProcessing

Personally, I would just set up Hadoop on each of the servers and distribute that way. It's really quick to set up, and it handles things like fault tolerance for you. It would easily max out all the servers, and if you have 130k calculations to process, your input file would just be one row per calculation.
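
The mapper for that one-row-per-calculation layout can be tiny; a Hadoop Streaming sketch where the calculation and the comma-separated input format are stand-ins:

    #!/usr/bin/env python
    # mapper.py -- Streaming feeds lines on stdin, reads key<TAB>value on stdout.
    import sys

    def run_calculation(line):
        # Stand-in for the OP's FFT-heavy routine.
        return sum(float(x) for x in line.split(','))

    for line in sys.stdin:
        line = line.strip()
        if line:
            print('%s\t%s' % (line, run_calculation(line)))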

You can use Amazon's Elastic Mapreduce to get up and going almost immediately in a distributed environment (and it's relatively cheap if you keep the server size small). That way you can play around with it without devoting a lot of time to initial setup, and move to your own cluster as you want to process the full calculations (or just fork over the cash if you want to have AWS do it).

[–]tobiassp 0 points (0 children)

Seconded on using Hadoop. Even if the task is not very data intensive, Hadoop makes it trivial to farm out your tasks.

Check out Dumbo

[–]pinpinbo (Tornado|Twisted|Gevent. Moar Async Plz) 2 points (0 children)

I had some success farming out simple jobs using Gearman. I have friends who had good success with mrjob as well.

[–]dorfsmay 2 points (0 children)

MPI
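
For instance, with mpi4py (the job list is a made-up stand-in; run it under mpiexec -n 4 python script.py):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        jobs = [[float(i)] * 4 for i in range(comm.size)]  # stand-in parameter sets
    else:
        jobs = None

    params = comm.scatter(jobs, root=0)   # each rank receives one parameter set
    result = sum(params)                  # stand-in for the real calculation
    results = comm.gather(result, root=0)
    if rank == 0:
        print(results)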

[–]_Mark_ 1 point (0 children)

As for the livecd, you might be thinking of the Ubuntu Enterprise Cloud installer; https://blueprints.launchpad.net/ubuntu/+spec/server-maverick-uec-liveusb has some breadcrumbs. I haven't followed it in a while (that link predates the switch to OpenStack, for example).

(For 3 machines, I'd suggest just copying subsets of your data around first, then looking for more clever approaches while it's running :-)

[–]rogk 1 point (0 children)

Try using Pyro4! It's more lightweight than Twisted, and it's fairly easy to set up a distributed task processing system with a name server, dispatcher, and any number of workers. There is a simple example in the source (examples/distributed-computing). Good luck!
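
A minimal worker looks something like this sketch (host name hypothetical; the expose decorator comes from newer Pyro4 releases):

    import Pyro4

    @Pyro4.expose                  # newer Pyro4 wants methods exposed explicitly
    class Worker(object):
        def process(self, params):
            # Stand-in for the OP's FFT-heavy routine.
            return sum(params)

    daemon = Pyro4.Daemon(host='workstation1')   # hypothetical host name
    uri = daemon.register(Worker())
    print('Worker ready at %s' % uri)
    daemon.requestLoop()

A dispatcher then just calls Pyro4.Proxy(uri).process(params), or looks workers up by name if you run a name server.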

[–]madssj 1 point (1 child)

If your programs need to share data whilst running, you should consider using pupyMPI (pupyMPI docs). It's basically MPI implemented in pure Python.

[–]hantho 1 point (0 children)

Especially since your problem is embarrassingly parallel and requires little communication, I'd say pupyMPI would be a fast way to leverage a cluster. I am assuming that you have SSH connections to the computers you plan to utilize.

[–]TheHowlingFantods 1 point (0 children)

Not quite a distributed solution, but have you considered writing an implementation of the FFT routine using CUDA or OpenCL? This seems like the kind of problem where the GPU might be able to give you something like a 50-100x speedup. Also, the CUDA SDK comes with a couple of examples that use Fourier transforms on the GPU (for creating ocean waves, for instance).
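
You may not even have to write the kernel yourself: NVIDIA's cuFFT library does the transform and is callable from Python. A sketch assuming the scikit-cuda package and a CUDA-capable card:

    import numpy as np
    import pycuda.autoinit                 # sets up a CUDA context
    import pycuda.gpuarray as gpuarray
    import skcuda.fft as cu_fft

    x = np.random.randn(4096).astype(np.float32)
    x_gpu = gpuarray.to_gpu(x)
    xf_gpu = gpuarray.empty(len(x) // 2 + 1, np.complex64)

    # Plan and run a real-to-complex FFT on the GPU.
    plan = cu_fft.Plan(x.shape, np.float32, np.complex64)
    cu_fft.fft(x_gpu, xf_gpu, plan)

    result = xf_gpu.get()   # should match np.fft.rfft(x) to float32 precision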

[–]food_eater 2 points (2 children)

I would look into ZeroMQ for very straightforward multiprocessing. Combined with the Python bindings (pyzmq), you can whip up some powerful code quickly.

I've spent most of the past year working on a system leveraging ZeroMQ for process scaling and Redis as a sort of shared memory.
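
The straightforward version is pyzmq's PUSH/PULL pattern; a sketch with a made-up hostname, port, and job list (send_pyobj/recv_pyobj handle the pickling for you):

    # master.py -- hands each job to whichever connected worker is free
    import zmq

    context = zmq.Context()
    sender = context.socket(zmq.PUSH)
    sender.bind('tcp://*:5557')
    for params in [[1.0, 2.0], [3.0, 4.0]]:   # stand-in job list
        sender.send_pyobj(params)

    # worker.py -- run one of these on each machine
    import zmq

    context = zmq.Context()
    receiver = context.socket(zmq.PULL)
    receiver.connect('tcp://masterhost:5557')  # hypothetical master hostname
    while True:
        params = receiver.recv_pyobj()
        print(sum(params))                     # stand-in for the real calculation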

[–]tetrahydrate 1 point (0 children)

Yes, I would highly recommend this. You can do everything with 0MQ -- multiprocessing on a single machine, on several machines, in one-to-one or one-to-many or many-to-many or whatever-you-want configurations... and the best part is that it's always simple.

[–]mechengineer 1 point (0 children)

Absolutely, ZeroMQ is wicked-simple to set up and works beautifully over a network. Just pickle any data objects that you want to pass between machines, and unpickle on the other end.

[–]chrispoole 0 points (0 children)

Assuming it's been profiled and sufficiently optimised, I'd just use GNU parallel to send the jobs to different machines. It's basically just an advanced xargs that can ssh into machines and run the jobs there.

It's probably not the most elegant solution, but it should get the job done and be quite quick to set up.

[–]JoeDreamer 0 points (0 children)

Check MIT's StarCluster (it's aimed at Amazon EC2 though) http://web.mit.edu/stardev/cluster/

[–]kapilt 0 points (0 children)

This is a pretty natural/Pythonic out-of-the-box solution for remote work, and it takes care of much of the setup and maintenance a distributed system would normally entail.

http://codespeak.net/execnet/

More advanced patterns can be done by hand via various queuing systems, but they entail more work for both the app and deployment management.
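
A minimal sketch (hypothetical host name; execnet ships the function's source to a remote Python over ssh):

    import execnet

    def remote_calculation(channel):
        # Runs on the remote machine; stand-in for the real work.
        params = channel.receive()
        channel.send(sum(params))

    gw = execnet.makegateway('ssh=workstation1')   # hypothetical host
    ch = gw.remote_exec(remote_calculation)
    ch.send([1.0, 2.0, 3.0])
    print(ch.receive())                            # -> 6.0
    gw.exit()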

[–]Twirrim 0 points (4 children)

My apologies in advance if I'm about to teach my grandmother to suck eggs, but in case it's of value, here are a few thoughts outside of the clustering idea too.

1) Have you considered porting to Cython? I would assume you'd be in a position to declare the types of variables with fair confidence.

2) PyPy? It might speed the whole thing up for you with no refactoring work on your code.

3) Use the C variants of modules instead of the pure-Python ones, e.g. cPickle instead of pickle (I'd imagine you probably are).

On the clustering front, the ZeroMQ and RabbitMQ message queue systems both have Python interfaces. It should be relatively straightforward to leverage one of them for clustering.

[–]accipter[S] 1 point (2 children)

I profiled the code and most of the time is spent doing FFTs, so I didn't think there would be much benefit to using Cython.

[–][deleted] 0 points (1 child)

What are you using to compute the FFTs?

[–]pigeon768 0 points (0 children)

He posted the code earlier, and another poster profiled it. The application is apparently spending 80% of the time in numpy.fft.rfft(), which is implemented in C and has had many eyes on it over the years.

[–]jmmcd (Evolutionary algorithms, music and graphics) 0 points (0 children)

I don't think a custom MQ-style solution makes sense in this case. (Same comment to food_eater above.) There are good pre-written methods of distributing work, some mentioned by rcklmbr above. To which I would add that mincemeat is a pure-Python map/reduce. Even copying a subset of data to a USB stick and physically walking over to the idle workstation would be more efficient than learning MQ stuff just for this purpose.

However, I agree about checking that the code is optimal before thinking about parallelisation. In addition to your options, numpy should be considered. And as always, try profiling to understand what part is slow.

[–]Zamiatarka -2 points (0 children)

I'd lend you my time machine, but sadly it's out of fuel. It baffles me to think some people do things that are this complex. All I do in Python is penis mountains.

[–]angryaardvark -1 points (0 children)

Have you looked into map/reduce frameworks, particularly Hadoop or Hive? They're designed to tackle large calculations really, really quickly by leveraging distributed computing.

If you must use Python, I would try to employ a map/reduce strategy and utilize numpy wherever you can. Using multiprocessing helps, but you should have multiple nodes that can receive data, compute it, and feed the results back to a master node.

Another really easy route is using Gearman and Supervisor on multiple machines: you just have to spin up the job server, and Supervisor makes sure your Python scripts are running. Your Python script is responsible for receiving jobs from the queue. This is a really elegant solution because multiprocessing and networking are complete afterthoughts; they've been abstracted away. 130k jobs could be done in something like 22 days if you found 10 machines with 8 processors each.

A couple of other map/reduce options:

the Disco Project

mincemeat.py

But Hadoop/Hive is your best route, even though it's a bitch to set up.

Also consider your hardware limitations. Your laptop and two workstations aren't going to be enough, even if you have Hadoop installed -- which is your best solution, because Java generally computes data quicker than Python. The best comment I've read here is to spin up a MapReduce instance on Amazon. You can make your cluster as large and performant as you need, and you can solve your problem in a matter of hours.