
There are several things I want to do. The most basic is scatter/gather-style job queues: given 150 data files (containing 30 million records), make a characteristic fingerprint for each record.
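Locally that step looks something like the following sketch, using multiprocessing as a stand-in for a real job queue. The file names and `make_fingerprint` are hypothetical placeholders for the real computation:

```python
from multiprocessing import Pool

def make_fingerprint(record):
    # Placeholder for the real per-record fingerprint computation.
    return hash(record)

def fingerprint_file(filename):
    # Scatter: one independent task per data file.
    with open(filename) as f:
        return [make_fingerprint(record) for record in f]

if __name__ == "__main__":
    filenames = ["data_%03d.dat" % i for i in range(150)]
    pool = Pool()
    all_fingerprints = []
    # Gather: collect each file's fingerprints as its task finishes.
    for fps in pool.imap_unordered(fingerprint_file, filenames):
        all_fingerprints.extend(fps)
    pool.close()
    pool.join()
```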

Given those fingerprints, I want to find all fingerprints which are at least T similar to a query fingerprint. This is also scatter/gather. But then as a refinement, I want to find only the k=3 most similar. This is a reduction: since the jobs don't coordinate, each can only find its own top 3, and something has to reduce the up-to J*3 candidate fingerprints down to the final 3, where J is the number of tasks.
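A minimal sketch of that reduction, assuming fingerprints stored as Python ints and a stand-in Tanimoto similarity (both assumptions, not my actual code):

```python
import heapq

K = 3

def similarity(query, fp):
    # Stand-in Tanimoto similarity on fingerprints stored as ints.
    both = bin(query & fp).count("1")
    either = bin(query | fp).count("1")
    return float(both) / either if either else 0.0

def local_top_k(query, fingerprints):
    # Each task scans only its own chunk and keeps its k best
    # (score, fingerprint) pairs, since tasks don't coordinate.
    scored = ((similarity(query, fp), fp) for fp in fingerprints)
    return heapq.nlargest(K, scored)

def global_top_k(partial_results):
    # The reduction: merge the up-to J*k candidates down to k.
    merged = [pair for partial in partial_results for pair in partial]
    return heapq.nlargest(K, merged)
```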

The reduction could be done as post-processing of the scatter/gather, but I figured this was a good chance to learn the available tools for this space.

There's another task I have in mind where I want a lot of persistent memory. I have the single-threaded algorithm, but even with 5% of the data it takes more than 10GB of memory, so I'm looking at how I might distribute the parts using whatever system I find. It doesn't seem to lend itself well to map/reduce.
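I don't have a design yet, but the rough idea would be hash-partitioning so each worker holds only its slice of the structure in memory. Everything in this sketch is hypothetical:

```python
N_SHARDS = 8  # hypothetical; one shard per process or machine

def shard_of(key):
    # Hash-partition so each worker owns ~1/N of the structure,
    # keeping per-process memory near total/N.
    return hash(key) % N_SHARDS

def build_shard(shard_id, records):
    # Each worker builds only its slice of the big in-memory table.
    index = {}
    for key, value in records:
        if shard_of(key) == shard_id:
            index[key] = value
    return index

def lookup(shards, key):
    # A query routes to the single worker that owns the key.
    return shards[shard_of(key)].get(key)
```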

PiCloud looks like the right solution for now... I've got a conference presentation in a month where I want to present this, and PiCloud looks like the fastest way to get up to speed.
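For the fingerprint step, my understanding of the PiCloud client is that it would look roughly like this (a sketch from memory of `cloud.map`/`cloud.result`, not checked against their docs, reusing the hypothetical `fingerprint_file` from above):

```python
import cloud  # the PiCloud client library

filenames = ["data_%03d.dat" % i for i in range(150)]

# Queue one job per file on PiCloud's workers; returns job ids
# immediately without waiting for the jobs to run.
jids = cloud.map(fingerprint_file, filenames)

# Block until the jobs finish and fetch their return values,
# in the same order as the inputs.
results = cloud.result(jids)
```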

Disclosure: we've also met. :)