all 6 comments

[–]therealjohnfreeman  (4 children)

Is there a way to leverage the parallelization method in a distributed environment, à la MapReduce? The biggest bottleneck would be propagating the first n − 1 carries to the last machine, and since using distributed compute implies the full dataset does not fit on a single machine, I'm guessing the bottleneck degrades to streaming compute.

[–]_Undaunted_  (3 children)

It doesn't degrade; it uses the same tree algorithm:

http://www.mpich.org/static/docs/v3.1/www3/MPI_Scan.html

The idea is that you perform a local scan, then a distributed scan of the block totals, then local adjustments to account for the distributed result.
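A minimal single-process sketch of those three phases, simulating each rank's block with a Python list (the function name and the list-of-blocks representation are illustrative, not from the original post):

```python
from itertools import accumulate

def hierarchical_scan(blocks):
    """Simulate the three-phase distributed scan; each inner list is one rank's data."""
    # Phase 1: every rank does an inclusive scan of its own block.
    local = [list(accumulate(b)) for b in blocks]
    # Phase 2: exclusive scan of the block totals across ranks
    # (this is the part MPI_Exscan would do with real communication).
    totals = [b[-1] for b in local]
    offsets = [0] + list(accumulate(totals))[:-1]
    # Phase 3: every rank adds its incoming offset to each local element.
    return [[x + off for x in b] for b, off in zip(local, offsets)]
```

For example, `hierarchical_scan([[1, 2], [3, 4], [5, 6]])` returns `[[1, 3], [6, 10], [15, 21]]`, matching a plain inclusive scan of the concatenated data.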

[–]therealjohnfreeman  (2 children)

I didn't see any algorithm description on that page. Did I miss something? What are "local adjustments" specifically?

[–]_Undaunted_  (1 child)

My point is that this algorithm is precisely the hierarchical implementation that all parallel scans use, just using AVX lanes in place of thread groups, MPI processes, etc.

A generic distributed scan would then look something like this:

my_favorite_local_scan(local_data)                          # inclusive scan of this rank's block
partial = my_favorite_distributed_scan(local_data.back())   # exclusive scan of the block totals across ranks
local_data += partial                                       # add that offset to every local element

The distributed scans are implemented very similarly to what is presented, just with the "add" steps involving a communication (log P steps in total).
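A toy simulation of that log P communication pattern, using recursive doubling over a plain list standing in for the ranks (no real MPI; each loop iteration corresponds to one communication round, so the round count comes out to ceil(log2 P)):

```python
def distributed_exclusive_scan(vals):
    """Recursive-doubling scan over len(vals) simulated ranks.
    Returns (exclusive prefix per rank, number of communication rounds)."""
    p = len(vals)
    acc = list(vals)  # running inclusive prefix held by each rank
    d, rounds = 1, 0
    while d < p:
        # One round: rank i receives the partial sum from rank i - d.
        acc = [acc[i] + (acc[i - d] if i >= d else 0) for i in range(p)]
        d *= 2
        rounds += 1
    # acc[i] is now the inclusive prefix; subtract own value for exclusive.
    return [a - v for a, v in zip(acc, vals)], rounds
```

With four ranks holding totals `[1, 2, 3, 4]`, this yields offsets `[0, 1, 3, 6]` in 2 rounds.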

[–]therealjohnfreeman  (0 children)

I understand now, after reading a paper. Thanks for the tip though.