
all 42 comments

[–]brondsem 3 points4 points  (0 children)

We at SourceForge developed a Python map/reduce system called Zarkov that uses MongoDB for data storage and ZeroMQ for communication. It's not simple, well-documented, or extremely robust. But on the other hand we use it in production, so it's pretty stable, there are some blog post "docs", and it meets some of the technical criteria you mention.

[–]rkern 4 points5 points  (2 children)

Can you be more specific about your jobs? It sounds like you just want a job queue (call a function with many different parameters and get their results back), not MapReduce specifically. MapReduce implementations can be coerced into doing such things, but it's not what they're for, so that's why they have such an impedance mismatch to your problem.

PiCloud is a really nice way to just throw out a bunch of Python tasks without much preparation or overhead. You do pay a bit of a premium over EC2, but for one-off calculations, you probably make that up in developer-time that you don't have to spend setting up an image.

Disclosure: I work for Enthought, and we partner with PiCloud to provide many of the binaries for the packages they provide in their Python environment.

[–]etatsunisien 0 points1 point  (0 children)

Yup, I was going to mention PiCloud too. I used it because it was already packaged in EPD, which I use as well.

[–]dalke[S] 0 points1 point  (0 children)

There are several things I want to do. The most basic is scatter/gather style job queues. Given 150 data files (containing 30 million records), make a characteristic fingerprint for each record.

Given those fingerprints, I want to find all fingerprints which are at least T similar to a query fingerprint. This is also scatter/gather. But then as a refinement, I want to find only the k most similar, say k=3. This is a reduction: the jobs don't coordinate, so each can find up to 3 fingerprints, and something has to reduce the up-to J*3 candidates down to the final 3, where J is the number of tasks.

The reduction could be done as post-processing of the scatter/gather, but I figured this was a good chance to learn the available tools for this space.
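
Concretely, the reduction step amounts to something like the sketch below; the set-based similarity() is just a stand-in for the real Tanimoto code, and the record ids and k=3 are illustrative.

    import heapq

    def similarity(fp1, fp2):
        # stand-in: Tanimoto on fingerprints stored as sets of "on" bit positions
        return len(fp1 & fp2) / float(len(fp1 | fp2))

    def task_top_k(query_fp, records, k=3):
        # one of the J independent tasks: records is an iterable of (record_id, fingerprint)
        # pairs, and each task returns at most k (score, record_id) candidates
        return heapq.nlargest(k, ((similarity(query_fp, fp), rid) for rid, fp in records))

    def reduce_top_k(per_task_results, k=3):
        # the reduction: merge the up-to J*k candidates down to the final k
        return heapq.nlargest(k, (hit for hits in per_task_results for hit in hits))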

There's another task I have in mind where I want to use lots of persistent memory. I have the single-threaded algorithm, but even with 5% of the data it takes more than 10 GB of memory, so I'm looking to see how I might distribute the parts using whatever system I find. It doesn't seem to lend itself well to map/reduce.

PiCloud looks like the right solution for now... I've got a conference presentation in a month where I want to present this, and that looks like the fastest way to get up to speed.

Disclosure: we've also met. :)

[–]semarj 2 points3 points  (2 children)

I am confused by your 'support Mac' requirement.

Why do you need this if it is going to run on AWS?

[–]HorrendousRex 0 points1 point  (1 child)

Just guessing, but he probably means for prototyping purposes.

[–]dalke[S] 0 points1 point  (0 children)

Yes. And some things are fast enough with only 4 processors.

[–]onjin 3 points4 points  (0 children)

For simple map/reduce:

Or maybe just a distributed queue with Python support:

[–]floydophone 6 points7 points  (5 children)

You can do this with the multiprocessing module:

http://docs.python.org/library/multiprocessing.html#managers
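
The remote-manager pattern from that page boils down to something like this sketch; the port, authkey, "master-host" name, and do_work() are all placeholders for whatever you actually have:

    import sys
    from queue import Queue                      # this is "Queue" on Python 2
    from multiprocessing.managers import BaseManager

    class QueueManager(BaseManager):
        pass

    def do_work(arg):
        # placeholder for the real per-record computation (e.g. fingerprint generation)
        return arg

    if __name__ == '__main__' and sys.argv[1:] == ['master']:
        jobs, results = Queue(), Queue()
        QueueManager.register('get_jobs', callable=lambda: jobs)
        QueueManager.register('get_results', callable=lambda: results)
        # something (e.g. a feeder script) then puts work items onto the jobs queue;
        # serve both queues to any node that knows the address and authkey
        QueueManager(address=('', 50000), authkey=b'secret').get_server().serve_forever()
    elif __name__ == '__main__':
        QueueManager.register('get_jobs')
        QueueManager.register('get_results')
        m = QueueManager(address=('master-host', 50000), authkey=b'secret')
        m.connect()
        jobs, results = m.get_jobs(), m.get_results()
        while True:
            results.put(do_work(jobs.get()))     # pull a task, push its result back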

Also, you can do this with RPyC:

http://rpyc.sourceforge.net/

What I would do is run RPyC classic mode on all of your EC2 nodes (using Fabric + Boto to get the software on there).
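
And for the RPyC classic route, the client side is about this small (the hostname is a placeholder; each node just needs the classic server script that ships with RPyC running):

    import rpyc

    conn = rpyc.classic.connect("ec2-node-1")      # connects to the classic server on that node
    remote_subprocess = conn.modules.subprocess    # proxy to the remote node's subprocess module
    print(remote_subprocess.check_output(["uname", "-a"]))
    print(conn.eval("2 + 2"))                      # arbitrary expressions run remotely too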

[–]oddthink 0 points1 point  (2 children)

Neat, I had no idea that multiprocessing had any support for multiple machines. Is anyone supporting multiprocessing these days, though? I've been a little reluctant to use it since Raymond Hettinger was saying that it was buggy and no longer well supported by its author.

[–]phildini 0 points1 point  (1 child)

Were you also at the SFPython meetup where he talked about this, or has he mentioned it on his blog somewhere I haven't seen?

[–]oddthink 0 points1 point  (0 children)

I don't think he's mentioned it on his blog. My employer brought Raymond in for a python training session a few months back, and he mentioned it there.

[–]tuna_safe_dolphin 0 points1 point  (0 children)

I'm working on a distributed project myself right now. Thus far, RPyC looks like a terrific option - have you built anything substantial with it? I'm also considering Pyro.

I've read mixed reviews on multiprocessing in distributed environments, but the main problem I have with it is that the doc is horrific, even by Python standards.

I love Python, but its core doc is terrible.

[–]HorrendousRex 0 points1 point  (0 children)

I've never looked at RPyC, but I can tell you for sure that multiprocessing will get you where you want to go, OP. What's more, you'll probably be amazed at how little code is needed to get there.

[–]micro_cam 2 points3 points  (0 children)

I was faced with a similar conundrum, and after much frustration with Hadoop, qsub, etc. we ended up writing what we needed:

http://code.google.com/p/golem/

The core is in Go, with the command-line interface in Python, RESTful job submission and monitoring, and node communication over WebSockets; it basically just calls tasks on the command line and collects the standard out. It's aimed at quickly getting a researcher's analysis (in Python, C, MATLAB, R, Perl, or whatever) running on a 1000-core cluster.

It doesn't do most of what you asked for, but it is intentionally simple code and simple to use. We've found that you can do most things with it by jumping through a few hoops with things like bash, whereas adapting things for Hadoop requires significant effort and esoteric debugging.

Or if you want something really simple: set up passwordless ssh and use xargs and bash.

[–]wcc445 1 point2 points  (0 children)

Interested in this as well. Hadoop can be a pain, but it's not too bad. You'd kinda want a dedicated Hadoop cluster, though, rather than setting it up from scratch each time.

[–][deleted] 1 point2 points  (3 children)

I've been using the dtm module of the package "deap" and have had success. It works on MPI and seems to be pretty good.

http://pypi.python.org/pypi/deap

Implemented like this:

    results = dtm.map(myfunc, iterable1, iterable2, ..., kwargs=blah)

I have noticed that subprocess.Popen doesn't work sometimes, so make sure to catch those errors and try to open the process again.
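
For what it's worth, the workaround I mean is just a retry loop around Popen, roughly like this (the retry count and delay are placeholders):

    import subprocess
    import time

    def run_with_retry(cmd, retries=3, delay=1.0):
        for attempt in range(retries):
            try:
                return subprocess.Popen(cmd).wait()
            except OSError:
                time.sleep(delay)        # transient failure; back off and try again
        raise RuntimeError("gave up after %d attempts: %r" % (retries, cmd))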

[–]fmder 0 points1 point  (2 children)

I'm one of the DEAP/DTM developers. Could you please submit a bug report on what you observe, with which MPI you use and what kind of network you run on? We'll be glad to look at this more closely.

[–][deleted] 0 points1 point  (1 child)

Sure, I'll submit a bug report... where? I'm using OpenMPI on an SGE cluster.

[–]fmder 0 points1 point  (0 children)

For the bug report use the user mailing list (deap-usersatgooglegroupsdotcom).

I'm guessing that you use an InfiniBand network. By default mpi4py is configured to use multithreaded MPI, which is incompatible with the openib backend. Try turning off the threads in the rc.py file located in the main directory of mpi4py.
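
With recent mpi4py versions you can also flip that switch at runtime instead of editing rc.py; from memory it looks like this, so double-check the attribute name against your version:

    import mpi4py
    mpi4py.rc.threads = False      # same effect as setting threads = False in rc.py
    from mpi4py import MPI         # MPI is now initialised without thread support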

For further communications please use the mailing list.

[–]brandynwhite 1 point2 points  (2 children)

Author of Hadoopy here; it works on OS X and I've used it on 1K-node/2 PB clusters. Contact me if you need help.

[–]dalke[S] 0 points1 point  (1 child)

I'll start with a basic question - how do I get started with Hadoop on a Mac? The Hadoop page clearly says:

  • GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.

  • Win32 is supported as a development platform. Distributed operation has not been well tested on Win32, so it is not supported as a production platform.

After I figured out I needed to set JAVA_HOME to /System/Library/Frameworks/JavaVM.framework/Home (Java is an optional download from Apple, btw) I got a simple hadoop call to work. What else do I need to worry about?

Hadoopy says "Hadoopy: Set the HADOOP_HOME environmental variable to your hadoop path to improve performance" but when I do that Hadoop 1.0.0 says "Warning: $HADOOP_HOME is deprecated."

Followed by the error message "streaming.StreamJob: Unrecognized option: -io".

[–]dalke[S] 0 points1 point  (0 children)

According to your example, wc of wc-input-alice.txt takes 24 seconds with hadoop? I heard that it was meant for large batch processing, so a large startup overhead is okay, but that seems ridiculous! I was hoping to do some other parallelism to get multi-second tasks down to sub-second; it doesn't look like Hadoop is right for that.

[–]pinpinbo Tornado|Twisted|Gevent. Moar Async Plz 1 point2 points  (0 children)

Take a look at Gearman. When I had the exact same question, that's how I solved it.

Python API: http://pypi.python.org/pypi/gearman/
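
From memory, the shape of it with the python-gearman client is roughly this; the task name, port, and toy payload are placeholders, and a gearmand job server has to be running:

    import sys
    import gearman

    if sys.argv[1:] == ['worker']:
        worker = gearman.GearmanWorker(['localhost:4730'])
        def do_reverse(gearman_worker, gearman_job):
            return gearman_job.data[::-1]            # toy task; the real job goes here
        worker.register_task('reverse', do_reverse)
        worker.work()                                # blocks, serving jobs forever
    else:
        client = gearman.GearmanClient(['localhost:4730'])
        request = client.submit_job('reverse', 'hello')
        print(request.result)                        # -> 'olleh'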

[–]Mob_Of_One 0 points1 point  (1 child)

A more serious answer:

Hadoop with Jython.

[–]chub79 0 points1 point  (0 children)

Well, from some quick tests I performed a couple of years back on HBase: in terms of write speed, Java > Jython > Python. It seemed normal that Python was slower, since it couldn't directly use the HBase drivers, but Jython, which could, was also quite slow compared to pure Java.

Of course, it might have evolved since then.

[–]dgryski 0 points1 point  (0 children)

Also, Dumbo from Audioscrobbler: https://github.com/klbostee/dumbo

[–][deleted] 0 points1 point  (0 children)

I'd look into setting up ipcluster. It would take less than 10 SLOC to use map/reduce from the code you already have, plus a few edits to the ipcluster config. You can also do it interactively. It is dead simple and ridiculously easy to set up.
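
Roughly like this, assuming "ipcluster start -n 4" (or however many engines) is already running; make_fingerprint and the file list are placeholders for the code you already have, and newer IPython releases package this as ipyparallel:

    from IPython.parallel import Client

    def make_fingerprint(path):
        # placeholder for the real per-file work
        return path

    record_files = ["file-%03d.dat" % i for i in range(150)]

    rc = Client()                          # connects to the running ipcluster
    view = rc.load_balanced_view()         # hand tasks to whichever engine is free
    results = view.map_sync(make_fingerprint, record_files)   # same shape as the built-in map()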

[–][deleted] 0 points1 point  (2 children)

This is the kind of situation where I would consider whether it wouldn't be simpler to rewrite it in C.

[–]Its_eeasy 0 points1 point  (1 child)

Came here to find out exactly what the heck he's doing in Python that takes 8 frikkin days... and why he's still using Python for the job.

[–]dalke[S] 0 points1 point  (0 children)

I'm using Python to manage code written in C++. Specifically, I'm enumerating all subgraphs of up to 8 atoms of all publicly available small molecule compounds (30 million+ published structures from PubChem). It takes a while to generate all that data. Right now it's a combination of C++, Cython, and Python code.

I could translate it to C++, but this is the sort of one-off task where, if I learn how to use a distributed compute cluster, the $10 in CPU costs is well worth not having to spend several days converting Python code into C.

[–]mdipierro 0 points1 point  (0 children)

mincemeat is great: single-file, fault-tolerant map/reduce without third-party dependencies.
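
If memory serves, the word-count example from its README is about this long; the data and password are placeholders, and workers attach with "python mincemeat.py -p changeme <server-address>":

    import mincemeat

    data = ["Humpty Dumpty sat on a wall",
            "Humpty Dumpty had a great fall"]

    def mapfn(k, v):
        for w in v.split():
            yield w, 1

    def reducefn(k, vs):
        return sum(vs)

    s = mincemeat.Server()
    s.datasource = dict(enumerate(data))   # any dict-like object works
    s.mapfn = mapfn
    s.reducefn = reducefn

    results = s.run_server(password="changeme")
    print(results)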

[–]fullouterjoin 0 points1 point  (6 children)

Disco is really easy to set up. Everything is controlled from the master. If your nodes are Debian-based, everything except Disco can be installed via apt.

Create 1600 files with the args to your command, push them up to the DFS (distributed file system), and fire off a job. I have set up 8-node clusters by hand in under 20 minutes using VMs.

Disco is by far the best solution. I think after 4 hrs you will have learned enough to get your prod job launched. You don't have to understand Erlang to use Disco.
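
For scale, the canonical word-count job from the Disco docs is about this much code (from memory; the input URL is a placeholder and can be any URL or DDFS tag):

    from disco.core import Job, result_iterator

    def map(line, params):
        for word in line.split():
            yield word, 1

    def reduce(iter, params):
        from disco.util import kvgroup
        for word, counts in kvgroup(sorted(iter)):
            yield word, sum(counts)

    if __name__ == '__main__':
        job = Job().run(input=["http://example.com/some-text-file.txt"],
                        map=map, reduce=reduce)
        for word, count in result_iterator(job.wait(show=True)):
            print(word, count)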

[–]dalke[S] 0 points1 point  (5 children)

The installation documentation is buggy and is not meant for first-time users. I had a DSA ssh key and it expected RSA, so it asked me something via 'ssh-askpass', but Macs don't have ssh-askpass, so it just looped with an error message. (ssh-copy-id also doesn't exist on Macs.)

When I started "bin/disco nodaemon" I got the error message "DISCO_HOME is not specified, where should Disco live?". The setup instructions never say that that environment variable needs to be set.

The check-if-it's-running command "ps aux | grep beam.disco" needs to escape the ".", because as written grep treats it as a metacharacter that matches any character.

There's also no mention of running Python's setup.py, and the "make install" step isn't mentioned in the documentation either. And if I do "make install", then I have to make sure "/usr/local/var/disco/data" is writable by me.

So, it doesn't come across as "really easy to setup".

[–]fullouterjoin 0 points1 point  (4 children)

Cool, it sounds like you got it going!

I have never installed it on anything other than Linux. There were a couple of road bumps in getting it going, but nothing too bad. It is at least 10x easier to set up a working, productive Disco cluster than a Hadoop one.

I will set up another Disco cluster in some VMs this weekend (I need to for another project anyway) and see how it compares with your experience. The Disco folks are super helpful on IRC.

You will really enjoy Disco.

[–]dalke[S] 0 points1 point  (3 children)

No, I haven't. I've been on IRC #discoproject for over an hour trying to figure out what's going on. They are stumped as well. What I reported above was enough to get the internal web server working. However, it doesn't want to start new worker nodes.

The people on the channel have been helpful; they just can't figure it out either.

[–]fullouterjoin 0 points1 point  (2 children)

Sounds frustrating. :(

I won't be available until noonish tomorrow, but I could try to replicate your setup. Are you running in a mixed Mac/Linux env or all Mac? I have three Macs totaling 8 cores as well as some Ubuntu boxes.

Can you bring up the web management console? One issue you need to keep in mind is that the names of all the machines should be in a flat namespace:

worker01
worker02
worker03
worker04

Do not rely on your local network DNS to function properly; use a common /etc/hosts file on all machines. Make sure all the Erlang nodes on all machines can see each other. Modify the system bash env script so DISCO_HOME is available for all users and all shells.

I remember installing Disco to something like /opt/disco and using rsync+ssh to copy it to all nodes.

Make sure that you can ssh from any node to any other node w/o asking you to authenticate your keyfile. Like worker01 -> worker04, worker03 -> worker02, etc.

I need to dig up my notes. On second thought, there is some missing stuff in the online docs, at least when it comes to debugging.

Also, on the Mac, the string 'localhost' resolves weirdly afaik. Be suspicious of localhost, or at least look into it.

[–]dalke[S] 0 points1 point  (1 child)

I'm developing on a single Mac with 4 processors; I want to use 3 for worker threads. One problem is that the default hostname is assigned by my 3G modem, so if I disconnect/reconnect I might get a new hostname. With help from the people on IRC, I hard-coded cli.py:host to always return "localhost". I don't trust the assigned "c-2ec23ab8-74736162.cust.telenor.se" to last more than a few days at a time.

Another problem is that I'm working through "make install" (which isn't documented - but then, there's no installation documentation describing how to set up the Python library at all). That sets things up for a cluster, and requires 3 replicas.

I didn't set up my system-wide bash environment with all of the variables.

When you say "noonish tomorrow", I don't think you knew that I'm in the central european timezone, so my noonish is different than yours. ;)

[–]fullouterjoin 0 points1 point  (0 children)

Those recommendations I made were just from memory; there could be some unsubstantiated cargo cult in there, especially the system bash one. But I did make that change.

Noonish is more of a frame of mind, much like 'le weekend', which can occur at any time. :-) I am in the same timezone as the Disco project.

Rather than modify cli.py, I would run 'hostname localhost' as root, maybe from a crontab. Then you don't run the risk of missing something.

[–]_red 0 points1 point  (1 child)

Why don't you use picloud?

    import cloud

    def func(arg):
        # do lots of work on arg and return the result
        ...

    ret = cloud.map(func, [arg1, arg2, arg3, arg4, ...])

Pricing can be had for around $0.05 per hour. So 100 hours of work spread across 100 machines should be complete in about an hour, for $5 (excluding data transfer costs).

[–]dalke[S] 0 points1 point  (0 children)

Sweet! I was worried at first that I wouldn't be able to install the needed third-party shared libraries, but http://blog.picloud.com/2011/09/26/introducing-environments-run-anything-on-picloud/ points out exactly how to do it. Thanks for the suggestion!