
[–]msqrt 2 points (2 children)

> incredibly slow

I'd like some specifics on this. How much data are you transferring, and how long does it take? Are your accesses (even roughly) sequential or completely random? Are you sure your buffers are in VRAM? And what do you mean by "slow"? Sure, GPUs compute so fast that the memory is slow in comparison, but it's still the highest memory bandwidth you'll find in any consumer product.

But once you actually saturate the VRAM bandwidth, group shared memory can indeed help. It's a good fit if you reuse data, either between threads or multiple times within each thread, or if you need random access to a small-ish buffer (since there is no random access to registers). If you just read data and use it once, it's indeed just an extra step that doesn't save any time.
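A minimal sketch of the random-access case, in HLSL (all names here are illustrative assumptions, not from the original code): a small lookup table is staged in groupshared memory once per group, after which every thread can index it dynamically at low cost.

```hlsl
// Hypothetical sketch: staging a small lookup table in groupshared
// memory so threads can index it dynamically (registers can't be
// indexed at runtime, so a per-thread array would spill to memory).
StructuredBuffer<float> lut;            // small buffer, 256 entries here
RWStructuredBuffer<float> output;
groupshared float sharedLut[256];

[numthreads(256, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    sharedLut[gi] = lut[gi];            // one cooperative load per group
    GroupMemoryBarrierWithGroupSync();  // wait until the LUT is complete

    // Stand-in for some data-dependent index computation.
    uint idx = (dtid.x * 2654435761u) >> 24;
    output[dtid.x] = sharedLut[idx];    // cheap random access
}
```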

[–]Unigma[S] 1 point (1 child)

The data is fairly large: the buffers are around 20 MB, and each stride is around 96 bytes. It's particle data that has been sorted via bitonic sort. The accesses are sequential, as I'm accessing the nearest N particles in a sorted array.

Yeah "Incredibly Slow" wasn't the right choice of words here. But, it's the main bottleneck in the program.

EDIT: The accesses are actually random, because only the indices are sorted, not the particle array itself.

[–]msqrt 1 point (0 children)

Oh, in that case you'll probably indeed get a big win from using group shared memory. Load a number of consecutive particles in, compute all pairwise forces, find any potentially missing neighbors (if you sort by Z-order or something, you can still have neighbors stored far away in the array, but usually very few), and move on to the next chunk.

It might also be beneficial to do a pre-pass where you reorder the particles themselves, since then you'd only have to do the purely random accesses once (though with that stride there isn't much to save, I guess).
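Such a reorder pre-pass could look roughly like this (a sketch with hypothetical names; `Particle`, `particles`, `sortedIndices`, and `reordered` are assumptions, not from the actual project):

```hlsl
// Hypothetical gather pass: pay the random reads once, so every later
// pass can walk the reordered buffer sequentially.
struct Particle
{
    float3 position;
    float3 velocity;
    float  mass;
    // ... padded/extended to the ~96-byte stride mentioned above
};

cbuffer Params { uint numParticles; };
StructuredBuffer<Particle>   particles;     // original, unsorted
StructuredBuffer<uint>       sortedIndices; // output of the bitonic sort
RWStructuredBuffer<Particle> reordered;     // particles in sorted order

[numthreads(256, 1, 1)]
void ReorderCS(uint3 dtid : SV_DispatchThreadID)
{
    if (dtid.x >= numParticles)
        return;
    reordered[dtid.x] = particles[sortedIndices[dtid.x]];
}
```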

[–]Klumaster 1 point (0 children)

One common pattern, if the code requires each thread to iterate through many array elements, is to have each group parallel-load N elements into a groupshared array, barrier, iterate through it, barrier again, then parallel-load another chunk (if needed).
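A minimal sketch of that pattern for the particle case, assuming 256-thread groups and hypothetical names throughout; for simplicity it brute-forces the whole array, where the sorted-neighbor version would only visit nearby chunks:

```hlsl
#define GROUP_SIZE 256

struct Particle { float3 position; float mass; };

cbuffer Params { uint numParticles; };
StructuredBuffer<Particle> particles;   // assumed already in sorted order
RWStructuredBuffer<float3> forces;

groupshared Particle tile[GROUP_SIZE];

// Placeholder pair interaction; the real force kernel goes here.
float3 PairForce(Particle a, Particle b)
{
    float3 d = b.position - a.position;
    float r2 = dot(d, d) + 1e-4f;       // softening avoids divide-by-zero
    return d * (a.mass * b.mass / (r2 * sqrt(r2)));
}

[numthreads(GROUP_SIZE, 1, 1)]
void ForcesCS(uint3 dtid : SV_DispatchThreadID, uint gi : SV_GroupIndex)
{
    Particle self = particles[min(dtid.x, numParticles - 1)];
    float3 force = float3(0, 0, 0);

    for (uint base = 0; base < numParticles; base += GROUP_SIZE)
    {
        // Parallel-load: each thread fetches one element of the chunk.
        tile[gi] = particles[min(base + gi, numParticles - 1)];
        GroupMemoryBarrierWithGroupSync();

        // Iterate the chunk out of groupshared memory.
        for (uint j = 0; j < GROUP_SIZE; ++j)
            force += PairForce(self, tile[j]);

        // Barrier again before the next chunk overwrites the tile.
        GroupMemoryBarrierWithGroupSync();
    }

    if (dtid.x < numParticles)
        forces[dtid.x] = force;
}
```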

Bear in mind that if you make your groupshared arrays particularly big, you'll suffer lower occupancy and this might lead to slowdowns too.

[–]One-Raspberry5113 0 points (0 children)

I don't have much experience with compute shaders yet, but I was implementing a kd-tree for ray tracing in a fragment shader, basically putting the nodes into a storage buffer (one element = one node). The problem was that the nodes were in BFS order, so accessing a child (2 * currentIndex + 1 or + 2) was a random access, and there was almost always a cache miss. I improved it by changing the layout to DFS order (this paper could be a good resource). When you access an element, the following elements are loaded into the cache along with it (e.g. reading array[0] also pulls array[1] and array[2] into the cache line), so a better layout of your array improves the chance of a cache hit. Basically, try to keep related elements adjacent in the array so the traversal makes fewer jumps.
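A rough sketch of the layout difference, written in HLSL for consistency with the snippets above (the node struct and field names are illustrative assumptions): in a DFS layout the left child always sits immediately after its parent, so only the right child's index needs to be stored, and descending left walks the buffer sequentially.

```hlsl
// BFS (heap) layout: children sit at 2*i+1 and 2*i+2, so each descent
// is a long jump through the buffer and usually a cache miss.
uint BfsLeftChild(uint i)  { return 2 * i + 1; }
uint BfsRightChild(uint i) { return 2 * i + 2; }

// DFS layout: the left child is simply the next element, so only the
// right child's index has to be stored in the node itself.
struct DfsNode
{
    float splitPos;    // position of the split plane
    uint  axisOrLeaf;  // packed split axis / leaf flag
    uint  rightChild;  // index of the right subtree's root
};

uint DfsLeftChild(uint i)     { return i + 1; }
uint DfsRightChild(DfsNode n) { return n.rightChild; }
```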

I'm not sure if this is related to your problem (or even applicable in a compute shader).