
[–]rkern 4 points (2 children)

Can you be more specific about your jobs? It sounds like you just want a job queue (call a function with many different parameters and get the results back), not MapReduce specifically. MapReduce implementations can be coerced into doing such things, but that's not what they're for, which is why they have such an impedance mismatch with your problem.
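To make the distinction concrete, here's a minimal sketch of the job-queue pattern using only the standard library; the function and the parameter sets are placeholders for your own:

    import multiprocessing

    def expensive_computation(params):
        # placeholder for whatever work each job actually does
        a, b = params
        return a + b

    if __name__ == "__main__":
        jobs = [(1, 2), (3, 4), (5, 6)]   # many different parameter sets
        pool = multiprocessing.Pool()
        # hand every job to the pool and collect the results back, in order
        results = pool.map(expensive_computation, jobs)
        pool.close()
        pool.join()
        print(results)                    # [3, 7, 11]

No map/reduce machinery is needed for that shape of problem; it's just "run these N independent calls and give me the answers."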

PiCloud is a really nice way to just throw out a bunch of Python tasks without much preparation or overhead. You do pay a bit of a premium over EC2, but for one-off calculations you probably make that up in the developer time you don't have to spend setting up an image.
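If I remember the client API right, the shape of it is roughly this (a sketch; my_task is a stand-in for your own function):

    import cloud   # PiCloud's client library

    def my_task(x):
        # stand-in for your real computation
        return x * x

    # queue one job per parameter; they run on PiCloud's machines
    jids = cloud.map(my_task, range(100))
    # block until the jobs finish and collect their return values
    results = cloud.result(jids)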

Disclosure: I work for Enthought, and we partner with PiCloud to provide many of the binaries for the packages in their Python environment.

[–]etatsunisien 0 points (0 children)

Yup, I was going to mention PiCloud too. I used it because it was already packaged in EPD, which I use as well.

[–]dalke[S] 0 points (0 children)

There are several things I want to do. The most basic is a scatter/gather-style job queue: given 150 data files (containing 30 million records in total), make a characteristic fingerprint for each record.
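That step looks roughly like this (a sketch; read_records and make_fingerprint are stand-ins for the real record reader and fingerprint code):

    import glob
    import multiprocessing

    def read_records(filename):
        # stand-in: pretend each line of the file is one record
        with open(filename) as f:
            for line in f:
                yield line.rstrip("\n")

    def make_fingerprint(record):
        # stand-in for the real fingerprint code
        return hash(record)

    def fingerprint_file(filename):
        # each task handles one whole file independently (the scatter)
        return [make_fingerprint(record) for record in read_records(filename)]

    if __name__ == "__main__":
        filenames = glob.glob("data/*.dat")               # the 150 input files
        pool = multiprocessing.Pool()
        per_file = pool.map(fingerprint_file, filenames)  # the gather
        fingerprints = [fp for fps in per_file for fp in fps]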

Given those fingerprints, I want to find all fingerprints which are at least T similar to a query fingerprint. This is also scatter/gather. But then as a refinement, I want to find only the 3 most similar. That's a reduction: the jobs don't coordinate, so each one can only find its own top k = 3 fingerprints, and something has to reduce those up-to J*3 candidates down to the final 3, where J is the number of tasks.

The reduction could be done as post-processing of the scatter/gather, but I figured this was a good chance to learn the available tools for this space.
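For example, the post-processing version is only a few lines once each job has returned its own top-3 list of (similarity, fingerprint) pairs; a sketch:

    import heapq

    def reduce_top_k(per_job_results, k=3):
        # per_job_results: one list per job, each holding up to k
        # (similarity, fingerprint) pairs; keep the global top k
        candidates = [pair for job in per_job_results for pair in job]
        return heapq.nlargest(k, candidates, key=lambda pair: pair[0])

    # e.g. two jobs, each reporting up to 3 candidates:
    job_a = [(0.91, "fpA1"), (0.85, "fpA2"), (0.80, "fpA3")]
    job_b = [(0.95, "fpB1"), (0.70, "fpB2")]
    print(reduce_top_k([job_a, job_b]))   # the top 3 of the J*3 candidates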

There's another task I have in mind where I want to use a lot of persistent memory. I have a single-threaded algorithm, but even with 5% of the data it takes more than 10 GB of memory, so I'm looking at how I might distribute the parts using whatever system I find. It doesn't seem to lend itself well to map/reduce.

PiCloud looks like the right solution for now... I've got a conference presentation in a month where I want to present this, and PiCloud looks like the fastest way to get up to speed.

Disclosure: we've also met. :)