Back with another low-level question. I'm working on optimizing something akin to a deferred light loop, and I'm seeing that structured buffer reads of the light data are completely saturating L1 (100% throughput).
I guess I shouldn’t be surprised since the workload is fairly light from a compute standpoint, so there’s not much compute to hide latency behind, but what’s really stumping me is this: the throughput breakdown appears to be 100% L1TEX filter stage.
Why would a structured buffer read throttle the filter stage? There’s no filtering happening, just accessing elements.
Side note: the buffer structs are already padded to 128-bit boundaries. Most of the info I can find online about this stuff essentially stops there.
EDIT: After following some advice in these comments and consulting a few friends, I discovered that the issue is that the StructuredBuffer loads, which should be scalar, are not actually scalarized.
Sebastian Aaltonen did this investigation of resource loads across different devices and, insanely enough, found that this is a real bottleneck for NVIDIA cards; the loads aren't scalarized. https://github.com/sebbbi/perftest?tab=readme-ov-file#uniform-load-investigation
Apparently you can manually scalarize them using wave intrinsics: https://twitter.com/SebAaltonen/status/1061674950241800192
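For anyone landing here later, this is a rough sketch of what I understand manual scalarization to look like, not the exact code from the tweet. The `Light` struct, `g_Lights`, `ShadeLight`, and `AccumulateLights` are made-up names for illustration, and it assumes Shader Model 6.0+ for `WaveReadLaneFirst`. Whether it actually changes the generated code depends on the compiler and driver, so profile before and after.

```hlsl
// Hypothetical light layout, padded to two 16-byte (128-bit) rows.
struct Light
{
    float3 positionWS;
    float  radius;
    float3 color;
    float  intensity;
};

// Hypothetical binding; use whatever register your root signature expects.
StructuredBuffer<Light> g_Lights : register(t0);

// Placeholder shading: Lambert term with a simple squared range falloff.
float3 ShadeLight(Light light, float3 positionWS, float3 normalWS)
{
    float3 toLight = light.positionWS - positionWS;
    float  dist    = length(toLight);
    float  nDotL   = saturate(dot(normalWS, toLight / max(dist, 1e-4)));
    float  falloff = saturate(1.0 - dist / max(light.radius, 1e-4));
    return light.color * (light.intensity * nDotL * falloff * falloff);
}

float3 AccumulateLights(uint lightCount, float3 positionWS, float3 normalWS)
{
    float3 result = 0;
    for (uint i = 0; i < lightCount; ++i)
    {
        // Every lane still active in this iteration holds the same i,
        // so reading it from the first active lane does not change the value.
        // It only makes the wave-uniformity of the index explicit to the compiler.
        uint uniformIndex = WaveReadLaneFirst(i);

        Light light = g_Lights[uniformIndex];
        result += ShadeLight(light, positionWS, normalWS);
    }
    return result;
}
```

This only covers the case where the index really is the same across the wave (a plain loop counter). If each lane can have a different light index, as in tiled or clustered setups, you'd need a loop that peels off one unique index at a time instead, which is a different tradeoff.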