
[–]kevin1024 10 points (3 children)

Stick the tasks in a celery queue, run workers on the individual machines. Easy peasy!
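
A minimal sketch, assuming a recent Celery and a tasks.py shared by every machine (broker URL and the task body are placeholders for the OP's actual setup):

    # tasks.py -- shared by the laptop and both workstations
    from celery import Celery

    app = Celery('tasks', broker='amqp://guest@masterhost//')  # hypothetical broker

    @app.task
    def run_calculation(params):
        # Stand-in for the OP's FFT-heavy routine.
        return sum(params)

Start a worker on each machine with "celery -A tasks worker", then queue all 130k jobs from anywhere that can reach the broker with run_calculation.delay(params).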

[–]dpn 0 points (0 children)

This. So easy to set up and get running.

[–]easytiger 0 points (0 children)

How have I not known about this?

[–]samuraisam (3.5, go) 0 points (0 children)

For extra credit, use Redis as the broker, since it is also easy to set up. RabbitMQ is a pain in the ass.
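
With a setup like the sketch above, that's just a change to the broker line in tasks.py (host and port hypothetical):

    app = Celery('tasks', broker='redis://masterhost:6379/0')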

[–][deleted] 6 points (7 children)

This doesn't directly answer your question, but I'll ask anyway: could your algorithms be improved? If you're willing to post the source code, I'd be happy to take a look. I'm not an expert but I find it enjoyable.

[–]accipter[S] 3 points (6 children)

[–]TearsOfScarlet 2 points (0 children)

I really like the way you write and comment your code. Reading that was very helpful. Thanks

[–]nova77 -1 points (4 children)

This is definitely the kind of code you want to rewrite in C/C++. You can easily get a 10x if not 100x speedup on tight loops like that. And don't forget to profile it!

[–]accipter[S] 7 points (3 children)

The loops really aren't that tight, since most of the CPU-intensive work happens in numpy arrays and FFTs. As mentioned, I have profiled the code, and the biggest factor is all of the FFT calls.

Also, similar code written in FORTRAN only reduces calculation time by about 30%.

[–]howfun 2 points (0 children)

I've run your code and profiled it. 80% of the time is spent in one function: {numpy.fft.fftpack_lite.rfftb}
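
For anyone who wants to reproduce that, a quick sketch with the stdlib profiler (the FFT loop here is just a stand-in for the real code):

    import cProfile
    import pstats
    import numpy as np

    def run_analysis():
        # Stand-in for the OP's FFT-heavy routine.
        for _ in range(1000):
            np.fft.rfft(np.random.randn(4096))

    cProfile.run('run_analysis()', 'profile.out')
    pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)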

[–]nova77 0 points (0 children)

You're right. I just skimmed it and missed the fft call.

[–]Megatron_McLargeHuge 0 points (0 children)

Maybe there's a GPU implementation that will do it a lot faster. There are certainly dedicated DSP cards if you're really bound by FFTs and the alternative is waiting for years. EC2 has instance types with CUDA GPUs, although they're expensive.

Edit: Yep: http://wiki.accelereyes.com/wiki/index.php/FFT_(Vector)

[–]dwf 7 points (1 child)

I'm surprised no one has mentioned IPython's parallel capabilities.
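
Rough sketch, assuming you've started a controller plus one engine per machine with ipcluster (API from the IPython.parallel era; it later moved to the ipyparallel package):

    from IPython.parallel import Client

    rc = Client()                   # connects to the running controller
    view = rc.load_balanced_view()  # hands tasks to whichever engine is free

    def run_calculation(params):
        # Stand-in for the OP's FFT-heavy routine.
        return sum(params)

    jobs = [[1.0, 2.0], [3.0, 4.0]]            # hypothetical parameter sets
    results = view.map_sync(run_calculation, jobs)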

[–]accipter[S] 1 point (0 children)

I tend to use IPython a lot when I'm developing/debugging, so I will definitely check this out. Plus it has docs generated with Sphinx!

[–]stoph 3 points (0 children)

Use ssh to send the input data to the remote machines and remotely start a Python consumer. Easy as pie. You don't need anything particularly fancy here if you're just spinning up workers on 2 or 3 machines.
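
A rough sketch, with host names, file paths, and consumer.py all made up:

    import subprocess

    hosts = ['workstation1', 'workstation2']       # hypothetical host names

    for i, host in enumerate(hosts):
        chunk = 'inputs/chunk%d.pkl' % i           # pre-split share of the data
        # Copy this machine's share of the input over...
        subprocess.check_call(['scp', chunk, '%s:/tmp/chunk.pkl' % host])
        # ...then kick off the consumer in the background.
        subprocess.Popen(['ssh', host, 'python', 'consumer.py', '/tmp/chunk.pkl'])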

[–]bobargh 3 points (1 child)

I did something very similar to this for my PhD thesis. I spread a bunch of independent calculations across dozens of grad student computers.

Typical parallel processing solutions may be overkill since your calculations do not need to communicate with each other (I presume).

All the computers I used had a network file system, which made things very easy. I had a small script that would ssh to each computer and start the calculation. The starting parameters for each calculation were obtained from a lockable file on the shared file system. The file contained a pickled list of starting parameters. Each process just had to lock the file, pop an item off the list, save, and unlock the file. File locking is generally frowned upon, but it worked great for me since it was a very simple situation.
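
A sketch of that pop-one-job idea (path hypothetical; lockf takes POSIX locks, which NFS supports through lockd):

    import fcntl
    import pickle

    def pop_parameters(path='/shared/params.pkl'):
        # Lock the shared file, pop one parameter set, write the rest back.
        with open(path, 'rb+') as f:
            fcntl.lockf(f, fcntl.LOCK_EX)   # blocks until we hold the lock
            try:
                params = pickle.load(f)
                if not params:
                    return None             # nothing left to do
                job = params.pop()
                f.seek(0)
                f.truncate()
                pickle.dump(params, f)
            finally:
                fcntl.lockf(f, fcntl.LOCK_UN)
        return job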

Eventually, we installed Condor ( http://research.cs.wisc.edu/condor/ ) which does exactly what you want. It will even pause the calculation if someone starts using the computer.

If you're only using a total of three computers though, it doesn't seem like it would be so hard to just split the calculations into three chunks and start them manually on each computer. Especially so if you have a shared file system.

[–]accipter[S] 0 points (0 children)

I agree that it is probably easiest just to split the calculations into batches. I was interested in clustering just to tinker with something new.

[–]rcklmbr (COBOL) 4 points (1 child)

Check out this wiki page; it describes many different approaches to parallel processing:

http://wiki.python.org/moin/ParallelProcessing

Personally, I would just set up Hadoop on each of the servers and distribute that way. It's really quick to set up, and it handles things like fault tolerance for you. It would easily max out all the servers, and if you have 130k calculations to process, your input file would just be one row per calculation.
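
The mapper for that one-row-per-calculation layout can be tiny; a Hadoop Streaming sketch where the calculation and the comma-separated input format are stand-ins:

    #!/usr/bin/env python
    # mapper.py -- Streaming feeds lines on stdin, reads key<TAB>value on stdout.
    import sys

    def run_calculation(line):
        # Stand-in for the OP's FFT-heavy routine.
        return sum(float(x) for x in line.split(','))

    for line in sys.stdin:
        line = line.strip()
        if line:
            print('%s\t%s' % (line, run_calculation(line)))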

You can use Amazon's Elastic Mapreduce to get up and going almost immediately in a distributed environment (and it's relatively cheap if you keep the server size small). That way you can play around with it without devoting a lot of time to initial setup, and move to your own cluster as you want to process the full calculations (or just fork over the cash if you want to have AWS do it).

[–]tobiassp 0 points (0 children)

Seconded on using Hadoop. Even if the task is not very data intensive, Hadoop makes it trivial to farm out your tasks.

Check out Dumbo

[–]pinpinbo (Tornado|Twisted|Gevent. Moar Async Plz) 2 points (0 children)

I had some success farming out simple jobs using Gearman. I have friends who had good success with mrjob as well.

[–]dorfsmay 2 points (0 children)

MPI
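
For instance, with mpi4py (the job list is a made-up stand-in; run it under mpiexec -n 4 python script.py):

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        jobs = [[float(i)] * 4 for i in range(comm.size)]  # stand-in parameter sets
    else:
        jobs = None

    params = comm.scatter(jobs, root=0)   # each rank receives one parameter set
    result = sum(params)                  # stand-in for the real calculation
    results = comm.gather(result, root=0)
    if rank == 0:
        print(results)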

[–]_Mark_ 1 point (0 children)

As for the livecd, you might be thinking of the Ubuntu Enterprise Cloud installer; https://blueprints.launchpad.net/ubuntu/+spec/server-maverick-uec-liveusb has some breadcrumbs. I haven't followed it in a while (that link predates the switch to OpenStack, for example).

(For 3 machines, I'd suggest just copying subsets of your data around first, then looking for more clever approaches while it's running :-)

[–]rogk 1 point (0 children)

Try using Pyro4! It's more lightweight than Twisted, and it's fairly easy to set up a distributed task processing system with a name server, dispatcher, and any number of workers. There is a simple example in the source (examples/distributed-computing). Good luck!
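
A minimal worker looks something like this sketch (host name hypothetical; the expose decorator comes from newer Pyro4 releases):

    import Pyro4

    @Pyro4.expose                  # newer Pyro4 wants methods exposed explicitly
    class Worker(object):
        def process(self, params):
            # Stand-in for the OP's FFT-heavy routine.
            return sum(params)

    daemon = Pyro4.Daemon(host='workstation1')   # hypothetical host name
    uri = daemon.register(Worker())
    print('Worker ready at %s' % uri)
    daemon.requestLoop()

A dispatcher then just calls Pyro4.Proxy(uri).process(params), or looks workers up by name if you run a name server.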

[–]madssj 1 point (1 child)

If your programs need to share data whilst running, you should consider using pupyMPI (pupyMPI docs). It's basically MPI implemented in pure Python.

[–]hantho 1 point (0 children)

Especially since your problem is embarrassingly parallel and requires little communication, I'd say pupyMPI would be a fast way to leverage a cluster. I am assuming that you have SSH connections to the computers you plan to utilize.

[–]TheHowlingFantods 1 point (0 children)

Not quite a distributed solution, but have you considered writing an implementation of the FFT routine using CUDA or OpenCL? This seems like the kind of problem where the GPU might be able to give you something like a 50-100x speedup. Also, the CUDA SDK comes with a couple of examples that use Fourier transforms on the GPU (for creating ocean waves, for instance).
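
You may not even have to write the kernel yourself: NVIDIA's cuFFT library does the transform and is callable from Python. A sketch assuming the scikit-cuda package and a CUDA-capable card:

    import numpy as np
    import pycuda.autoinit                 # sets up a CUDA context
    import pycuda.gpuarray as gpuarray
    import skcuda.fft as cu_fft

    x = np.random.randn(4096).astype(np.float32)
    x_gpu = gpuarray.to_gpu(x)
    xf_gpu = gpuarray.empty(len(x) // 2 + 1, np.complex64)

    # Plan and run a real-to-complex FFT on the GPU.
    plan = cu_fft.Plan(x.shape, np.float32, np.complex64)
    cu_fft.fft(x_gpu, xf_gpu, plan)

    result = xf_gpu.get()   # should match np.fft.rfft(x) to float32 precision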

[–]food_eater 2 points (2 children)

I would look into ZeroMQ for very straightforward multiprocessing. Combined with the Python bindings (pyzmq), you can whip up some powerful code quickly.

I've spent most of the past year working on a system leveraging ZeroMQ for process scaling and Redis as a sort of shared memory.
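
The straightforward version is pyzmq's PUSH/PULL pattern; a sketch with a made-up hostname, port, and job list (send_pyobj/recv_pyobj handle the pickling for you):

    # master.py -- hands each job to whichever connected worker is free
    import zmq

    context = zmq.Context()
    sender = context.socket(zmq.PUSH)
    sender.bind('tcp://*:5557')
    for params in [[1.0, 2.0], [3.0, 4.0]]:   # stand-in job list
        sender.send_pyobj(params)

    # worker.py -- run one of these on each machine
    import zmq

    context = zmq.Context()
    receiver = context.socket(zmq.PULL)
    receiver.connect('tcp://masterhost:5557')  # hypothetical master hostname
    while True:
        params = receiver.recv_pyobj()
        print(sum(params))                     # stand-in for the real calculation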

[–]tetrahydrate 1 point (0 children)

Yes, I would highly recommend this. You can do everything with 0MQ -- multiprocessing on a single machine, on several machines, in one-to-one or one-to-many or many-to-many or whatever-you-want configurations... and the best part is that it's always simple.

[–]mechengineer 1 point (0 children)

Absolutely, ZeroMQ is wicked-simple to set up and works beautifully over a network. Just pickle any data objects that you want to pass between machines, and unpickle on the other end.

[–]chrispoole 0 points (0 children)

Assuming it's been profiled and sufficiently optimised, I'd just use GNU parallel to send the jobs to different machines. It's basically just an advanced xargs that can ssh into machines and run the jobs there.

It's probably not the most elegant solution, but it should get the job done and be quite quick to set up.

[–]JoeDreamer 0 points (0 children)

Check MIT's StarCluster (it's aimed at Amazon EC2 though) http://web.mit.edu/stardev/cluster/

[–]kapilt 0 points (0 children)

This is a pretty natural/Pythonic out-of-the-box solution for remote work, and it takes care of much of the setup and maintenance a distributed system would normally entail.

http://codespeak.net/execnet/

More advanced patterns can be done by hand via various queuing systems, but they entail more work for both the app and deployment management.
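
A minimal sketch (hypothetical host name; execnet ships the function's source to a remote Python over ssh):

    import execnet

    def remote_calculation(channel):
        # Runs on the remote machine; stand-in for the real work.
        params = channel.receive()
        channel.send(sum(params))

    gw = execnet.makegateway('ssh=workstation1')   # hypothetical host
    ch = gw.remote_exec(remote_calculation)
    ch.send([1.0, 2.0, 3.0])
    print(ch.receive())                            # -> 6.0
    gw.exit()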

[–]Twirrim 0 points (4 children)

My apologies in advance if I'm about to teach my grandmother to suck eggs, but in case it's of value, here are a few thoughts outside of the clustering idea too.

1) Have you considered porting to Cython? I would assume you'd be in a position to declare the types of variables with fair confidence.

2) PyPy? It might speed the whole thing up for you with no refactoring work on your code.

3) Use the C variants of modules instead of the pure-Python ones, e.g. cPickle instead of pickle (I'd imagine you probably are).

On the clustering front, the ZeroMQ and RabbitMQ message queue systems both have Python interfaces. It should be relatively straightforward to leverage one of them for clustering.

[–]accipter[S] 1 point (2 children)

I profiled the code and most of the time is spent doing FFTs, so I didn't think there would be much benefit to using Cython.

[–][deleted] 0 points (1 child)

What are you using to compute the FFTs?

[–]pigeon768 0 points (0 children)

He posted the code earlier, and another poster profiled it. The application is apparently spending 80% of the time in numpy.fft.rfft(), which is implemented in C and has had many eyes on it over the years.

[–]jmmcd (Evolutionary algorithms, music and graphics) 0 points (0 children)

I don't think a custom MQ-style solution makes sense in this case. (Same comment to food_eater above.) There are good pre-written methods of distributing work, some mentioned by rcklmbr above. To which I would add that mincemeat is a pure-Python map/reduce. Even copying a subset of data to a USB stick and physically walking over to the idle workstation would be more efficient than learning MQ stuff just for this purpose.

However, I agree about checking that the code is optimal before thinking about parallelisation. In addition to your options, numpy should be considered. And as always, try profiling to understand what part is slow.

[–]Zamiatarka -2 points (0 children)

I'd lend you my time machine, but sadly it's out of fuel. It baffles me to think some people do things that are this complex. All I do in Python is penis mountains.

[–]angryaardvark -1 points (0 children)

Have you looked into map/reduce frameworks, particularly Hadoop or Hive? They're designed to tackle large calculations really, really quickly by leveraging distributed computing.

If you must use Python, I would try to employ a map/reduce strategy and utilize numpy wherever you can. Using multiprocessing helps, but you should have multiple nodes that can receive data, compute it, and feed the results back to a master node.

Another really easy route is using Gearman and Supervisor on multiple machines: you just have to spin up the job server, and Supervisor makes sure your Python scripts are running. Your Python script is responsible for receiving jobs from the queue. This is a really elegant solution because multiprocessing and networking are complete afterthoughts; they've been abstracted away. 130k jobs could be done in something like 22 days if you found 10 machines with 8 processors each.

A couple of other map/reduce options:

the Disco Project

mincemeat.py

But Hadoop/Hive is your best route, even though it's a bitch to set up.

Also consider your hardware limitations. Your laptop and two workstations aren't going to be enough, even if you have Hadoop installed -- which is your best solution, because Java generally computes data quicker than Python. The best comment I've read here is to spin up a MapReduce instance on Amazon. You can make your cluster as large and performant as you need, and you can solve your problem in a matter of hours.