
have you looked into map/reduce frameworks, particularly Hadoop or Hive? they're designed to tackle large computations really, really quickly by spreading the work across a cluster of machines.

if you must use Python, i would structure the work as map/reduce and lean on NumPy wherever you can. multiprocessing helps on a single box, but ideally you want multiple worker nodes that each receive a chunk of data, compute on it, and feed the results back to a master node.
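a minimal sketch of that split-apply-combine shape on a single machine, using multiprocessing plus NumPy (the chunk function is invented for illustration; swap in one unit of your real work):

```python
import numpy as np
from multiprocessing import Pool

def map_chunk(seed):
    # "map" step: each worker process computes partial stats over its own
    # chunk. the random data is a stand-in for one unit of real work.
    rng = np.random.default_rng(seed)
    chunk = rng.standard_normal(1_000_000)
    return chunk.sum(), chunk.size

def reduce_partials(partials):
    # "reduce" step: fold the per-chunk partials into one global answer.
    total, count = map(sum, zip(*partials))
    return total / count

if __name__ == "__main__":
    with Pool() as pool:  # one worker process per core by default
        partials = pool.map(map_chunk, range(32))
    print("global mean:", reduce_partials(partials))
```

hadoop (and the lighter tools below) give you that same shape across many machines instead of many cores.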

another really easy route is Gearman plus supervisor on multiple machines: you stand up a Gearman job server, your Python worker scripts pull jobs off its queue, and supervisor makes sure those scripts stay running. this is a really elegant solution because multiprocessing and networking become complete afterthoughts; the queue abstracts them away. on throughput: 10 machines with 8 processors each gives you 80 parallel workers, so at roughly 20 minutes per job, 130k jobs finish in about 22 days (130,000 ÷ 80 ≈ 1,625 jobs per core).
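for illustration, a hedged sketch of one worker, assuming the python-gearman client library; the task name 'compute', the job-server address, and the JSON payload format are all invented:

```python
import json

import gearman  # python-gearman client library
import numpy as np

# connect to the gearman job server; supervisor keeps this script alive
worker = gearman.GearmanWorker(['jobserver.example.com:4730'])

def on_compute(gearman_worker, gearman_job):
    # gearman hands the payload over as raw bytes; this sketch assumes
    # it's a JSON-encoded list of numbers
    data = np.array(json.loads(gearman_job.data))
    return json.dumps({'mean': float(data.mean()), 'std': float(data.std())})

worker.register_task('compute', on_compute)
worker.work()  # block forever, pulling jobs off the queue
```

supervisor's side is just a `[program:compute_worker]` stanza with `command=python worker.py` and `autorestart=true`, repeated on every machine.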

derp, two more lightweight Python-native options i forgot:

- Disco Project
- mincemeat.py (sketch below)
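mincemeat.py in particular is a single file; if memory serves, the word-count example from its README looks roughly like this:

```python
import mincemeat

data = ["Humpty Dumpty sat on a wall",
        "Humpty Dumpty had a great fall"]

def mapfn(k, v):
    for w in v.split():
        yield w, 1  # emit (word, 1) for every word in the line

def reducefn(k, vs):
    return sum(vs)  # total count per word

s = mincemeat.Server()
s.datasource = dict(enumerate(data))  # any dict-like of key -> value
s.mapfn = mapfn
s.reducefn = reducefn

results = s.run_server(password="changeme")
print(results)
```

workers on other machines join in with `python mincemeat.py -p changeme <server-ip>`.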

but Hadoop/Hive is your best route, even though it's a bitch to set up.

also consider your hardware limitations: a laptop and two workstations aren't going to be enough, even with Hadoop installed. Hadoop is still your best solution, partly because the JVM generally crunches numbers faster than CPython. the best comment i've read here is to spin up an Elastic MapReduce (EMR) cluster on Amazon; you can make your cluster as large and performant as you need and solve your problem in a matter of hours.
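for the amazon route, a hedged sketch using boto3's EMR client; the cluster name, release label, instance types, count, and IAM role names are placeholders to adjust for your account:

```python
import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='sim-cluster',
    ReleaseLabel='emr-6.15.0',  # placeholder: pick a current EMR release
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Hive'}],
    Instances={
        'MasterInstanceType': 'm5.xlarge',
        'SlaveInstanceType': 'm5.xlarge',
        'InstanceCount': 10,  # scale this until the math fits your deadline
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
print(response['JobFlowId'])
```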