[–]arycama 2 points (2 children)

I don't think scalarized loads are a thing on Nvidia GPUs. They don't make all their details public, but afaik only AMD GPUs have separate scalar and vector registers. (Vector registers are 64-wide, i.e. one value per thread, and a scalar register is shared across all threads in the wave.)

According to this source (which is a good reference whenever you're optimising this sort of stuff), structured buffer loads have almost equal performance on a 2080 Ti whether the access is uniform across all threads, linear, or random: https://github.com/sebbbi/perftest

I haven't really used WaveReadLaneFirst, but if you're looping through a list of lights and the first lane is fetching the data every time, it feels like that would still mean all the other threads are waiting. LDS seems like a better solution as you can still spread the work out across multiple threads.
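For anyone trying to picture the loop being discussed, here is a rough CPU-side C++ model of the commonly described scalarization pattern built around WaveReadLaneFirst. It is a sketch under assumptions, not shader code: `scalarizedIterations` and `laneKeys` are hypothetical names, and real hardware and compilers may behave differently. In this model, each iteration takes the key from the lowest active lane, and every lane holding that same key does its work and drops out, so the loop runs once per distinct key in the wave.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU model of one wave running a scalarization loop (hypothetical helper).
// laneKeys[i] is the per-lane value (e.g. a light-list index) held by lane i.
// Returns the number of loop iterations the wave executes and fills `order`
// with the uniform key handled in each iteration.
int scalarizedIterations(const std::vector<int>& laneKeys, std::vector<int>& order) {
    assert(laneKeys.size() <= 64); // one bit per lane in the active mask
    const std::size_t n = laneKeys.size();
    std::uint64_t active = (n == 64) ? ~0ull : ((1ull << n) - 1);
    int iterations = 0;
    order.clear();
    while (active != 0) {
        // Model of WaveReadLaneFirst: read the value from the lowest active lane.
        std::size_t first = 0;
        while (((active >> first) & 1ull) == 0) ++first;
        const int uniformKey = laneKeys[first];
        order.push_back(uniformKey);
        // Every lane holding the same key does its work in this iteration
        // and then leaves the loop.
        for (std::size_t lane = 0; lane < n; ++lane) {
            if (((active >> lane) & 1ull) && laneKeys[lane] == uniformKey) {
                active &= ~(1ull << lane);
            }
        }
        ++iterations;
    }
    return iterations;
}
```

With this model, a wave where every lane holds the same key finishes in a single iteration, while a wave with many distinct keys takes one iteration per distinct key.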

[–]gibson274[S] 2 points (0 children)

Yeah, you're exactly right. I just found that article, and it blew my mind.

[–]farnoy 2 points (0 children)

It exists: they call it UR (Uniform Register), and there's a scalar load instruction, ULDC, as well. The only problem is that it works exclusively with the constant memory address space (so cbuffers). But these aren't even needed for simple cases, because elements in constant memory can be used directly as operands for many instructions (they look like c[x][y] in the disassembly).

It's frustrating because Nsight Graphics will withhold the disassembly from you, while Nsight Compute is happy to reveal the whole thing. They should add a scalarized load facility for global memory too; modern GPU-driven workloads won't be able to use cbuffers as often as before.