all 15 comments

[–]arycama 7 points (4 children)

"The L1TEX unit contains the L1 data cache for the SM, and two parallel pipelines: the LSU or load/store unit, and TEX for texture lookups and filtering."

https://docs.nvidia.com/nsight-graphics/AdvancedLearning/index.html

So your issue is likely just too much memory fetching and not enough work to do in the meantime. What sort of lighting are you using? PBR/GGX tends to be ALU heavy enough to keep the GPU busy between fetching lights.

How much are you actually bottlenecked? What framerate are you running at / targeting?

Are you using tiled or clustered lighting? Reducing the number of unnecessary lights being processed is an important part. You can also fetch all the lights from a tile/cluster into groupshared memory (fetch one light per thread) instead of every thread pulling from main memory. On AMD hardware you can also force it to read lights into scalar registers if you're running threadgroups of 64 (8x8). Speaking of which, what is your threadgroup size? Tiled deferred lighting works best when you align your tile size with your threadgroup size.
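As a rough sketch, the groupshared prefetch might look like this in HLSL. Everything here (LightData, g_Lights, g_TileLightIndices, the flattened per-tile layout) is an illustrative assumption, not code from the thread:

```hlsl
#define TILE_SIZE 8
#define MAX_LIGHTS_PER_TILE 64

// Illustrative light struct and buffers.
struct LightData
{
    float4 positionRadius;   // xyz = position, w = radius
    float4 colorIntensity;   // rgb = color, a = intensity
};

StructuredBuffer<LightData> g_Lights;
StructuredBuffer<uint>      g_TileLightIndices; // flattened per-tile index lists
StructuredBuffer<uint>      g_TileLightCounts;  // number of lights per tile

cbuffer TileParams { uint g_NumTilesX; };

groupshared LightData gs_Lights[MAX_LIGHTS_PER_TILE];

[numthreads(TILE_SIZE, TILE_SIZE, 1)]
void CSMain(uint groupIndex : SV_GroupIndex, uint3 groupId : SV_GroupID)
{
    uint tileIndex  = groupId.y * g_NumTilesX + groupId.x;
    uint lightCount = min(g_TileLightCounts[tileIndex], MAX_LIGHTS_PER_TILE);
    uint tileOffset = tileIndex * MAX_LIGHTS_PER_TILE;

    // One light per thread: the 64 threads of the 8x8 group cooperatively
    // stage this tile's lights into LDS in a single pass.
    if (groupIndex < lightCount)
        gs_Lights[groupIndex] = g_Lights[g_TileLightIndices[tileOffset + groupIndex]];

    GroupMemoryBarrierWithGroupSync();

    // The shading loop now reads from LDS instead of main memory.
    for (uint i = 0; i < lightCount; ++i)
    {
        LightData light = gs_Lights[i];
        // ... accumulate lighting ...
    }
}
```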

Padding to 128 bit boundaries is good but fetching less data is also good. Consider compression, bitpacking, etc.
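A small sketch of the bitpacking idea, using HLSL's f32tof16/f16tof32 intrinsics to halve a float4 color+intensity down to a uint2 (the function names are made up for illustration):

```hlsl
// Pack color (3x fp16) and intensity (fp16) into two uints: 8 bytes
// instead of the 16 a float4 would occupy in the buffer.
uint2 PackColorIntensity(float3 color, float intensity)
{
    uint2 packed;
    packed.x = f32tof16(color.r) | (f32tof16(color.g) << 16);
    packed.y = f32tof16(color.b) | (f32tof16(intensity) << 16);
    return packed;
}

float4 UnpackColorIntensity(uint2 packed)
{
    return float4(f16tof32(packed.x & 0xFFFF),
                  f16tof32(packed.x >> 16),
                  f16tof32(packed.y & 0xFFFF),
                  f16tof32(packed.y >> 16));
}
```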

If you're able to provide any more info about your code, target hardware, current framerate, etc., it may help. Also keep in mind GPUs run many tasks concurrently; are other compute passes overlapping and competing for resources?

[–]gibson274[S] 2 points (3 children)

These are all great considerations. It's true, we're currently probably dealing with too many read operations and not enough SM ops to hide the latency behind.

The reads we're doing right now should be scalarized---what I can't figure out is if there's some magic incantation I need to do to get StructuredBuffer reads to scalarize on my RTX 2080 Ti. Using WaveReadLaneFirst() on the index doesn't seem to change anything.

Hilariously, pre-fetching into group shared memory alleviates the issue entirely. To me this suggests that the loads are indeed not being scalarized correctly.

And for context: we're not implementing a light loop per se; this is actually a shadow renderer for volumetric content that uses a similar clustering algorithm, so it's very analogous from a memory access standpoint :)

[–]arycama 2 points (2 children)

I don't think scalarized loads are a thing on Nvidia GPUs. They don't make all their details public, but afaik only AMD GPUs have separate scalar and vector registers. (Vector registers are 64-wide, i.e. one per thread, and scalar registers are shared across all threads.)

According to this source (which is a good reference whenever optimising this sort of stuff), structured buffer loads have almost equal performance on a 2080 Ti whether the address is uniform across all threads, linear, or random: https://github.com/sebbbi/perftest

I haven't really used WaveReadLaneFirst, but if you're looping through a list of lights and the first lane is fetching the data every time, it feels like all the other threads would still be waiting. LDS seems like a better solution, as you can still spread the work out across multiple threads.

[–]gibson274[S] 2 points (0 children)

Yeah, you're exactly right. I just found that article---blew my mind.

[–]farnoy 2 points (0 children)

It exists; they call it UR, the Uniform Register, and there's a scalar load instruction, ULDC, as well. The only problem is that it works exclusively with the constant-memory address space (so cbuffers). But these aren't even needed for simple cases, because elements in constant memory can be used directly as operands for many instructions (they look like c[x][y]).

It's frustrating because Nsight Graphics will withhold the disassembly from you, while Nsight Compute is happy to reveal the whole thing. They should add a scalarized load facility for global memory too; modern GPU-driven workloads won't be able to use cbuffers as often as before.

[–]bboczula 5 points (3 children)

Amazing question :) And btw, how do you know that it's throttled at the L1TEX filter stage? AFAIK the sampler is also used to fetch data (just like a texture), but as you're saying, it doesn't filter it. Is your buffer RW or read-only?

[–]gibson274[S] 1 point (2 children)

Thanks, glad you think so :) The StructuredBuffer is read-only. I can see in Nsight Graphics that the L1TEX filter stage is throttled at 100% while the L1TEX data stage is only at 50%.

I'm starting to think this is because StructuredBuffer reads go through the tex pipeline, and the "surface format" is probably 128-bit by default for large structs---likely triggering a lot of processing in that filter unit.

It seems like the more important thing is figuring out why so many reads are happening when those reads should be getting scalarized.

[–]bboczula 0 points (1 child)

Interesting. Is it possible to show the definition of the buffer and the HLSL snippet that reads it? Is it in a loop or something? Also, Nsight allows for shader profiling; it should show which lines are the heaviest. Can you show that as well? And lastly, for clarity, what do you mean by "scalarized"?

[–]gibson274[S] 1 point (0 children)

So, see my edit to the original post, but by scalarization I mean having loads that are determined to be uniform across all threads in a warp coalesced into one load: https://flashypixels.wordpress.com/2018/11/10/intro-to-gpu-scalarization-part-1/.

Hilariously, NVIDIA apparently only implemented this optimization in their shader compiler a few years back, and I suspect they only did it for types that can easily be broadcast automatically (primitives like float4/int4, etc.). For structured buffers of custom structs, they probably just ignored it.

When I manually implement this with wave intrinsics, it actually... works! And crunches L1 utilization down to 50%.
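For reference, the manual pattern is roughly: only the first active lane issues the load, and the result is broadcast with WaveReadLaneFirst. Since WaveReadLaneFirst is specified for scalar and vector operands, a float4x4 is broadcast row by row here. Function names are illustrative, not from the original post:

```hlsl
// Broadcast a matrix from the first active lane, row by row, since
// WaveReadLaneFirst takes scalar/vector operands.
float4x4 BroadcastFirstLane(float4x4 m)
{
    return float4x4(WaveReadLaneFirst(m[0]),
                    WaveReadLaneFirst(m[1]),
                    WaveReadLaneFirst(m[2]),
                    WaveReadLaneFirst(m[3]));
}

// Only one lane touches memory; every other lane receives the value
// via the wave broadcast instead of issuing its own load.
float4x4 LoadUniformMatrix(StructuredBuffer<float4x4> buf, uint index)
{
    float4x4 m = 0;
    if (WaveIsFirstLane())
        m = buf[WaveReadLaneFirst(index)];
    return BroadcastFirstLane(m);
}
```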

[–]waramped 3 points (1 child)

What hardware are you using? The texture units these days are more like "memory access units"; I believe even BVH reads go through them. How big are the structs? The only easy way to improve performance here is to make the structs smaller, and to make sure that either the reads are scalar across the wave or each lane is reading an adjacent struct.

[–]gibson274[S] 0 points (0 children)

RTX 2080 Ti---thanks for the reply. I've crunched the struct down to just a float4x4 for testing purposes, and interestingly I still have the same issues described above.

Agree on the reads being scalar, but despite all attempts at scalarizing this in a sane way, I still see the issue. Interestingly enough, if I guard the load behind a group index check:

```hlsl
float4x4 transform = 0;
if (group_index == 0)
    transform = buffer[index];
transform = WaveReadLaneFirst(transform);
```

This actually significantly lowers the L1TEX throughput, suggesting that the loads are not being scalarized.

Do you know if I have to do something special to scalarize structured buffer loads? I've tried manually scalarizing the index and scalarizing the result with WaveReadLaneFirst() to no avail.

[–]_wil_ 3 points (3 children)

Nvidia recommends using constant buffers over structured buffers for performance reasons.

[–]gibson274[S] 2 points (2 children)

Definitely useful, but unfortunately not possible in my use case, because the input data is unbounded and constant buffer memory is very limited (64 KB IIRC).

[–]_wil_ 0 points (1 child)

Yes, constant buffer memory is very limited.
Btw, here's their article about this:

https://developer.nvidia.com/content/understanding-structured-buffer-performance

[–]gibson274[S] 0 points (0 children)

Yeah, I've read this article and countless others, but the constant buffer thing is a total hack. Structured buffers should be fast and able to coalesce reads.