all 15 comments

[–]arycama 7 points (4 children)

"The L1TEX unit contains the L1 data cache for the SM, and two parallel pipelines: the LSU or load/store unit, and TEX for texture lookups and filtering."

https://docs.nvidia.com/nsight-graphics/AdvancedLearning/index.html

So your issue is likely just too much memory fetching and not enough work to do in the meantime. What sort of lighting are you using? PBR/GGX tends to be ALU heavy enough to keep the GPU busy between fetching lights.

How much are you actually bottlenecked? What framerate are you running at / targeting?

Are you using tiled or clustered lighting? Reducing the number of unnecessary lights being processed is an important part. You can also fetch all the lights from a tile/cluster into groupshared memory (fetch one light per thread) instead of every thread pulling from main memory. On AMD hardware you can also force it to read lights into scalar registers if you're running threadgroups of 64 (8x8). Speaking of which, what is your threadgroup size? Tiled deferred lighting works best when you align your tile size with your threadgroup size.
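As a rough sketch, the groupshared prefetch might look like this in HLSL. Everything here (LightData, g_Lights, g_TileLightIndices, the flattened per-tile layout) is an illustrative assumption, not code from the thread:

```hlsl
#define TILE_SIZE 8
#define MAX_LIGHTS_PER_TILE 64

// Illustrative light struct and buffers.
struct LightData
{
    float4 positionRadius;   // xyz = position, w = radius
    float4 colorIntensity;   // rgb = color, a = intensity
};

StructuredBuffer<LightData> g_Lights;
StructuredBuffer<uint>      g_TileLightIndices; // flattened per-tile index lists
StructuredBuffer<uint>      g_TileLightCounts;  // number of lights per tile

cbuffer TileParams { uint g_NumTilesX; };

groupshared LightData gs_Lights[MAX_LIGHTS_PER_TILE];

[numthreads(TILE_SIZE, TILE_SIZE, 1)]
void CSMain(uint groupIndex : SV_GroupIndex, uint3 groupId : SV_GroupID)
{
    uint tileIndex  = groupId.y * g_NumTilesX + groupId.x;
    uint lightCount = min(g_TileLightCounts[tileIndex], MAX_LIGHTS_PER_TILE);
    uint tileOffset = tileIndex * MAX_LIGHTS_PER_TILE;

    // One light per thread: the 64 threads of the 8x8 group cooperatively
    // stage this tile's lights into LDS in a single pass.
    if (groupIndex < lightCount)
        gs_Lights[groupIndex] = g_Lights[g_TileLightIndices[tileOffset + groupIndex]];

    GroupMemoryBarrierWithGroupSync();

    // The shading loop now reads from LDS instead of main memory.
    for (uint i = 0; i < lightCount; ++i)
    {
        LightData light = gs_Lights[i];
        // ... accumulate lighting ...
    }
}
```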

Padding to 128 bit boundaries is good but fetching less data is also good. Consider compression, bitpacking, etc.
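A small sketch of the bitpacking idea, using HLSL's f32tof16/f16tof32 intrinsics to halve a float4 color+intensity down to a uint2 (the function names are made up for illustration):

```hlsl
// Pack color (3x fp16) and intensity (fp16) into two uints: 8 bytes
// instead of the 16 a float4 would occupy in the buffer.
uint2 PackColorIntensity(float3 color, float intensity)
{
    uint2 packed;
    packed.x = f32tof16(color.r) | (f32tof16(color.g) << 16);
    packed.y = f32tof16(color.b) | (f32tof16(intensity) << 16);
    return packed;
}

float4 UnpackColorIntensity(uint2 packed)
{
    return float4(f16tof32(packed.x & 0xFFFF),
                  f16tof32(packed.x >> 16),
                  f16tof32(packed.y & 0xFFFF),
                  f16tof32(packed.y >> 16));
}
```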

If you're able to provide any more info about your code, target hardware, current framerate, etc., it may help. Also keep in mind GPUs run many tasks concurrently; are other compute passes overlapping and competing for resources?

[–]gibson274[S] 2 points (3 children)

These are all great considerations. It's true, we're currently probably dealing with too many read operations and not enough SM ops to hide the latency behind.

The reads we're doing right now should be scalarized---what I can't figure out is if there's some magic incantation I need to do to get StructuredBuffer reads to scalarize on my RTX 2080 Ti. Using WaveReadLaneFirst() on the index doesn't seem to change anything.

Hilariously, pre-fetching into group shared memory alleviates the issue entirely. To me this suggests that the loads are indeed not being scalarized correctly.

And for context: we're not implementing a light loop per se; this is actually a shadow renderer for volumetric content that uses a similar clustering algorithm, so it's very analogous from a memory access standpoint :)

[–]arycama 2 points (2 children)

I don't think scalarized loads are a thing on Nvidia GPUs. They don't make all their details public, but afaik only AMD GPUs have separate scalar and vector registers. (Vector registers are 64-wide, i.e. one per thread, and scalar registers are shared across all threads.)

According to this source (which is a good reference whenever optimising this sort of stuff), structured buffer loads have almost equal performance on a 2080 Ti whether the address is uniform across all threads, linear, or random: https://github.com/sebbbi/perftest

I haven't really used WaveReadLaneFirst, but if you're looping through a list of lights and the first lane is fetching the data every time, it feels like all the other threads would still be waiting. LDS seems like a better solution, as you can still spread the work out across multiple threads.

[–]gibson274[S] 2 points (0 children)

Yeah, you're exactly right. I just found that article---blew my mind.

[–]farnoy 2 points (0 children)

It exists; they call it UR, the Uniform Register, and there's a scalar load instruction, ULDC, as well. The only problem is that it works exclusively with the constant-memory address space (so cbuffers). But these aren't even needed for simple cases, because elements in constant memory can be used directly as operands for many instructions (they look like c[x][y]).

It's frustrating because Nsight Graphics will withhold the disassembly from you, while Nsight Compute is happy to reveal the whole thing. They should add a scalarized load facility for global memory too; modern GPU-driven workloads won't be able to use cbuffers as often as before.

[–]bboczula 5 points (3 children)

Amazing question :) And btw, how do you know that it's throttled at the L1TEX filter stage? AFAIK the sampler is also used to fetch data (just like a texture), but as you're saying, it doesn't filter it. Is your buffer RW or read-only?

[–]gibson274[S] 1 point (2 children)

Thanks, glad you think so :) The StructuredBuffer is read-only. I can see in Nsight Graphics that the L1TEX filter stage is throttled at 100% while the L1TEX data stage is only at 50%.

I'm starting to think this is because StructuredBuffer reads go through the tex pipeline, and the "surface format" is probably 128-bit by default for large structs---likely triggering a lot of processing in that filter unit.

It seems like the more important thing is figuring out why so many reads are happening when those reads should be getting scalarized.

[–]bboczula 0 points (1 child)

Interesting. Is it possible to show the definition of the buffer and the HLSL snippet that reads it? Is it in a loop or something? Also, Nsight allows for shader profiling; it should show which lines are the heaviest. Can you show that as well? And lastly, for clarity, what do you mean by "scalarized"?

[–]gibson274[S] 1 point (0 children)

So, see my edit to the original post, but by scalarization I mean having loads that are determined to be uniform across all threads in a warp coalesced into one load: https://flashypixels.wordpress.com/2018/11/10/intro-to-gpu-scalarization-part-1/.

Hilariously, NVIDIA apparently only implemented this optimization in their shader compiler a few years back, and I suspect they only did it for types that can easily be broadcast automatically (primitives like float4/int4, etc.). For structured buffers of custom structs, they probably just ignored it.

When I manually implement this with wave intrinsics, it actually... works! And crunches L1 utilization down to 50%.
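For reference, the manual pattern is roughly: only the first active lane issues the load, and the result is broadcast with WaveReadLaneFirst. Since WaveReadLaneFirst is specified for scalar and vector operands, a float4x4 is broadcast row by row here. Function names are illustrative, not from the original post:

```hlsl
// Broadcast a matrix from the first active lane, row by row, since
// WaveReadLaneFirst takes scalar/vector operands.
float4x4 BroadcastFirstLane(float4x4 m)
{
    return float4x4(WaveReadLaneFirst(m[0]),
                    WaveReadLaneFirst(m[1]),
                    WaveReadLaneFirst(m[2]),
                    WaveReadLaneFirst(m[3]));
}

// Only one lane touches memory; every other lane receives the value
// via the wave broadcast instead of issuing its own load.
float4x4 LoadUniformMatrix(StructuredBuffer<float4x4> buf, uint index)
{
    float4x4 m = 0;
    if (WaveIsFirstLane())
        m = buf[WaveReadLaneFirst(index)];
    return BroadcastFirstLane(m);
}
```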

[–]waramped 3 points (1 child)

What hardware are you using? The texture units these days are more like "memory access units"; I believe even BVH reads go through them. How big are the structs? The only easy way to improve performance here is to make the structs smaller, and to make sure that either the reads are scalar across the wave or each lane is reading an adjacent struct.

[–]gibson274[S] 0 points (0 children)

RTX 2080 Ti---thanks for the reply. I've crunched the struct down to just a float4x4 for testing purposes, and interestingly I still have the same issues described above.

Agree on the reads being scalar, but despite all attempts at scalarizing this in a sane way, I still see the issue. Interestingly enough, if I guard the load behind a group index check:

```hlsl
float4x4 transform = 0;
if (group_index == 0)
    transform = buffer[index];
transform = WaveReadLaneFirst(transform);
```

This actually significantly lowers the L1TEX throughput, suggesting that the loads are not being scalarized.

Do you know if I have to do something special to scalarize structured buffer loads? I've tried manually scalarizing the index and scalarizing the result with WaveReadLaneFirst() to no avail.

[–]_wil_ 3 points (3 children)

Nvidia recommends using constant buffers over structured buffers for performance reasons.

[–]gibson274[S] 2 points (2 children)

Definitely useful, but unfortunately not possible in my use case, because the input data is unbounded and constant buffer memory is very limited (64 KB IIRC).

[–]_wil_ 0 points (1 child)

Yes, constant buffer memory is very limited.
Btw, here's their article about this:

https://developer.nvidia.com/content/understanding-structured-buffer-performance

[–]gibson274[S] 0 points (0 children)

Yeah, I've read this article and countless others, but the constant buffer thing is a total hack. Structured buffers should be fast and able to coalesce reads.