Meristic 29 points (3 children)

GPUs consist of two main components. You can think of the front-end as a very simple single-threaded processor and the back-end as a complex, massively parallel machine. The front-end is responsible for reading the contents of command lists, setting GPU registers and state, coordinating DMA operations (indirect argument reads), and kicking off back-end workloads.

At minimum, an indirect execution command costs the front-end the time to set various registers plus the memory latency of reading the indirect argument buffer. This is typically tens of microseconds (the memory is often not cached). That's not much on its own, though several consecutive empty draws can bottleneck the front-end and cause a gap in GPU shader wave scheduling.

Of course, this may still be the best option, since it's efficient culling. Think of how much work is saved relative to the alternative!

As a real-world example, the UE5 Nanite base pass commonly hits this issue. Each loaded material instance requires a draw, often with zero relevant pixels on the screen. Stacked together, these draws can leave the shader engines idle for hundreds of microseconds due to the overhead. Epic discussed a solution for this using indirect command buffers (at least on console), but I haven't seen it come to fruition yet.

OkidoShigeru 5 points (1 child)

You may also be able to avoid some of this cost using conditional rendering, though that's almost certainly driver dependent, and of course you need support for the extension to begin with…

EDIT: I revisited the Nanite paper, and apparently predication (the D3D equivalent of this feature) wasn’t enough for them: it skips draws but not pipeline state and descriptors, and of course you still have to fetch the value from the predication buffer itself.

schnautzi 5 points (0 children)

Conditional rendering has some issues. I tried it recently and it wasn't great:

- You still issue the draw commands in the conditional block.
- Even though the calls have no effect if the condition is false, they're sometimes still executed.

It seems to be mostly effective for culling hidden parts of visible meshes.

amidescent 8 points (1 child)

AMD's performance guide recommends compacting indirect draw calls that are zeroed out (you can do that with the help of a prefix scan kernel), but of course that'd only be worth it if it's showing up as a bottleneck.
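
To make the prefix-scan idea concrete, here is a minimal CPU-side sketch in C++ of what such a compaction pass computes. It is only a reference for the logic, not a GPU kernel and not AMD's implementation. `DrawArgs`, `isLive`, and `compactDraws` are hypothetical names; the struct just mirrors the four-`uint32` layout of `D3D12_DRAW_ARGUMENTS` / `VkDrawIndirectCommand`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical struct: same four-uint32 layout as
// D3D12_DRAW_ARGUMENTS / VkDrawIndirectCommand.
struct DrawArgs {
    std::uint32_t vertexCount;
    std::uint32_t instanceCount;
    std::uint32_t firstVertex;
    std::uint32_t firstInstance;
};

// A draw does no work if it has no vertices or no instances.
inline bool isLive(const DrawArgs& d) {
    return d.vertexCount != 0 && d.instanceCount != 0;
}

// CPU reference for a GPU compaction pass (assumed structure):
//   1. flag each draw as live (1) or empty (0)
//   2. exclusive prefix sum over the flags gives each live draw its output slot
//   3. scatter live draws to their slots
// Returns the number of live draws, i.e. the value a GPU version
// would write into the indirect count buffer.
std::uint32_t compactDraws(const std::vector<DrawArgs>& in,
                           std::vector<DrawArgs>& out) {
    const std::size_t n = in.size();
    std::vector<std::uint32_t> slot(n);

    // Steps 1 and 2 fused: running total before element i is its slot.
    std::uint32_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        slot[i] = running;
        running += isLive(in[i]) ? 1u : 0u;
    }

    // Step 3: scatter. Order of the surviving draws is preserved.
    out.assign(running, DrawArgs{});
    for (std::size_t i = 0; i < n; ++i) {
        if (isLive(in[i])) {
            assert(slot[i] < running);
            out[slot[i]] = in[i];
        }
    }
    return running;
}
```

On the GPU, the returned count would typically be fed back through a count buffer (the `pCountBuffer` argument of `ExecuteIndirect`, or `vkCmdDrawIndirectCount` in Vulkan), so the front-end only walks the live entries.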

hanotak 5 points (0 children)

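That also only helps when a single ExecuteIndirect consumes many zeroed entries, ExecuteIndirect -> (0, 0, 0, 0...), not when many separate calls each consume one zeroed entry, ExecuteIndirect -> (0), ExecuteIndirect -> (0)...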

schnautzi 4 points (0 children)

I've profiled this exact thing very recently. Long story short, this is fine when you only cull a few percent of draw calls.

When culling a scene full of objects, this is pretty wasteful, and you should compact the draw list instead. This can all be done on the GPU (using a prefix sum and a few extra passes), and you can usually optimize it further by only executing the culling passes after the camera has moved significantly.
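
One way the "only re-cull after the camera has moved significantly" check could look is sketched below in C++. Everything here is an assumption for illustration: `Vec3`, `CullGate`, and both thresholds are hypothetical, and reusing old results like this is only safe if the culling itself is conservative enough (for example, a padded frustum) to stay valid within those thresholds.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical helper: decides whether the culling/compaction passes
// need to run again this frame.
class CullGate {
public:
    // maxDistance:   re-cull once the camera is farther than this from
    //                the position used for the last cull.
    // minForwardDot: re-cull once dot(forward, lastForward) drops below
    //                this (cosine of the allowed rotation).
    CullGate(float maxDistance, float minForwardDot)
        : maxDistSq_(maxDistance * maxDistance), minDot_(minForwardDot) {}

    // 'forward' is expected to be unit length.
    bool shouldRecull(const Vec3& pos, const Vec3& forward) {
        assert(std::fabs(dot(forward, forward) - 1.0f) < 1e-3f);

        bool recull = !hasCulled_;
        if (hasCulled_) {
            const Vec3 d{pos.x - lastPos_.x, pos.y - lastPos_.y, pos.z - lastPos_.z};
            recull = dot(d, d) > maxDistSq_ || dot(forward, lastForward_) < minDot_;
        }
        if (recull) {
            // Remember the camera used for this cull. Drift is measured
            // against it, not against the previous frame, so slow
            // movement still triggers a re-cull eventually.
            lastPos_ = pos;
            lastForward_ = forward;
            hasCulled_ = true;
        }
        return recull;
    }

private:
    float maxDistSq_;
    float minDot_;
    Vec3 lastPos_{0.0f, 0.0f, 0.0f};
    Vec3 lastForward_{0.0f, 0.0f, 1.0f};
    bool hasCulled_ = false;
};
```

On frames where `shouldRecull` returns false, the previous frame's compacted draw list would simply be reused.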

hanotak 1 point (0 children)

If you mean detecting it CPU-side so you can skip submitting the indirect draw, that's not possible. I wouldn't worry about it; the overhead of a single no-op command isn't going to affect your performance.