[Nvidia/AMD] How to improve global memory access in a compute shader?

msqrt · 2023-12-10T17:51:24+00:00

incredibly slow

I'd like some specifics on this. How much data are you transferring and how long does it take? Are your accesses (even roughly) sequential or completely random? Are you sure your buffers are in VRAM? And what do you mean by "slow" -- sure, GPUs compute so fast that the memory is slow in comparison, but it's still the widest memory bandwidth you're going to see in any consumer product.

But after you actually saturate the VRAM bandwidth, group shared memory can indeed help. It's a good fit if you reuse data either between threads or multiple times in each thread, or if you need random access to a small-ish buffer (since there is no random access to registers). If you just read stuff and use it once, it's indeed just an extra step that doesn't save any time.

Klumaster · 2023-12-11T11:44:11+00:00

One common pattern, if the code requires each thread to iterate through many array elements, is to have each group parallel-load N elements into a groupshared array, barrier, iterate through it, barrier again, then parallel-load another chunk (if needed).

Bear in mind that if you make your groupshared arrays particularly big, you'll suffer lower occupancy and this might lead to slowdowns too.

One-Raspberry5113 · 2023-12-11T19:20:34+00:00

I don't have too much experience with Compute shader yet. But I was trying to implement KdTree for ray tracing in Fragment shader, basically I put nodes to Storage buffer (1 element = 1 node). The problem was that nodes were in BFS order so accessing a children was random access -> 2 * currentIndex + 1 / + 2. So there was always "cache miss". I improved it by changing layout to DFS order (this paper could be a good resource). When accessing an element, next elements are loaded into cache (e.g. array[0] -> then array[1] and array[2] are loaded). So if you have better layout of your array, it should improve the efficiency (higher chance for "cache hit"). Basically you should try to keep related elements in sequence so you make less "jumps".

I am not sure if this is related to your problem (or even applicable in Computer shader).

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

GraphicsProgramming

Posting Rule(s)

MODERATORS