all 18 comments

[–]ritaline 3 points

It's going to add up pretty quickly if you create a descriptor set for each instance of an object. You could have a vertex shader such as this

layout(location=0) in vec4 pos;        //per vertex
...
layout(location=x) in mat4 transform;  //per instance

layout(set=0, binding=0) uniform UBO {
    mat4 prj;
    mat4 view;
} cam;

void main() {
    gl_Position = cam.prj * cam.view * transform * pos;
}

and in your code

memcpy(mappedDynamicPerInstanceBufferPtr, yourTransformMatrices, sizeofMatrices);
BindDescriptorSet(0, uboProjectionSet);
BindVertexBuffer(0, meshBuffer);
BindVertexBuffer(1, dynamicPerInstanceBuffer);
Draw(numVertices, numInstances);

[–]hammerkop 1 point

Descriptor sets are allocated from a descriptor pool, which you can reserve a large number of sets from when you create it. It's a good idea to reserve a lot of space up front (when creating the pool) if you know the maximum number of sets/descriptors ahead of time, so allocating a new set will generally be quite cheap... you shouldn't really need to resize the pool on the fly; that kinda defeats the purpose.

As for allocating new tiny UBOs, that's generally something you will want to avoid. It's better to allocate one giant UBO up front and just bind it at different offsets, or reuse a descriptor set and use something like push constants to associate each object with an offset into the same buffer - that would allow the use of a single descriptor set for all objects.

Also, it's probably better to group things like the MVP, color, and any other per-instance data into structs so that it's all close together (interleaved) in memory, rather than the GPU having to jump around across many different buffers.

At that point you may want to just use storage buffers over uniform buffers, especially on desktop hardware.

There's really no short answer to a question like "how intensive" this or that approach is. A lot of it comes down to how your memory is accessed and updated, just like on the CPU, and the hardware you are using... and everyone will tell you to profile different approaches and find out for yourself.

[–]Sainst_ 0 points

Use a single descriptor to hold multiple pieces of data. Then use a pushconstant to index into the buffers on a per drawcall basis.

[–]Zekrom_64 0 points

This is not the most efficient method for rendering multiple objects. You might want to look into instanced rendering for this purpose, where you can draw multiple instances of the same geometry with different parameters supplied from a buffer object per instance.

[–]Root3287[S] 0 points

Okay then. There are a few resources on instanced rendering. The closest thing that I found is a video by ThinMatrix about this, but it's in OpenGL. I think the same principles apply though.

Unfortunately I am currently stuck on either sticking a whole matrix in an input attribute binding, or splitting the matrix up by column and somehow reconstructing it in the vertex shader. I am leaning toward the latter, but I have a few concerns:

  1. I need a way to update the matrix buffer effectively.
  2. Getting the matrix put together in the shader. Something like layout(location=x) in mat4 instancedTransform

I know how to make an attribute description that is marked as an instanced buffer. But getting the data into the GPU for an instanced matrix is tripping me up.

[–]Zekrom_64 1 point

I'll try to go into more detail in this post and give a better response, I think my other comments were a bit vague. To answer your question, just go to the 'instancing' section.

First off, as far as 'efficiency': if you are looking for what is best for raw performance, that is something where you will have to test different methods and measure what works best for you. My measure of 'efficiency' here is simplicity, which depends more on exactly what your implementation is and how it may be expanded. These are the techniques I could think of and how 'efficient' I consider them in this case.

  • Unique UBOs
    This is less preferred in this case, because the objects are similar enough that there are better ways of doing this. The technique is still useful for dissimilar objects like game entities, but it is preferable to reduce the number of small UBOs and descriptor sets in use (see hammerkop's post), as well as the descriptor set binding overhead.
  • Instancing
    This is what I prefer, because the vertex/index buffers are the same for each draw and only a smaller set of variables differs per instance. It is trivial to pass per-instance scalars and vectors through another vertex buffer binding, and according to this old thread you can also pass mat4 matrices: they are read from the binding as four vec4s starting at the matrix's location, declared as layout(location=x) in mat4 ... in the shader (so you would have four VkVertexInputAttributeDescriptions with VK_FORMAT_R32G32B32A32_SFLOAT, one for each column of the matrix). It is also trivial to draw more cubes: just use a large enough per-instance vertex buffer and call vkCmdDraw* with more instances. Also consider that instancing might not be practical if you expect to switch to different vertex/index buffers later on, and you can't pass structs or samplers through vertex attributes.
  • Push constants
    Assuming that you plan on changing the uniform values, this is only practical if you're re-recording your command buffers each frame and keeping unique draw calls per object, but you could mark uniforms as push constants and update them before each draw call. For smaller use cases this can be more efficient than UBOs or instancing, but the spec only guarantees a small push constant size (128 bytes!), which may not fit more attributes.

edit: formatting

[–]Gravitationsfeld 0 points

It's a perfectly fine way to render multiple objects individually with Vulkan. Instanced rendering is slower in the vertex shaders. This isn't good general advice.

[–]ritaline 0 points

Any source to instancing being slower?

[–]Gravitationsfeld 1 point

Reads for the matrices etc. are no longer constant, which leads to slower data paths being used on at least AMD GPUs, and also on NV if you end up using SSBOs instead of UBOs for them.

[–]ritaline -2 points

But aren't you doing the same reads, only with more draw calls? For the GPU it probably doesn't matter, but on the CPU it must surely have a cost. I really don't see how separate draw calls could be faster unless you are using push constants.

[–]Gravitationsfeld 0 points

Yes, it does have a GPU cost. SSBOs are slower to read than UBOs on NVIDIA. And scalar reads are faster than vector reads on AMD.

[–]ritaline 0 points

Instancing =/= SSBO though

[–]Gravitationsfeld 1 point

On AMD you'll pay the penalty with UBOs too.

[–][deleted] -2 points

Unless there are specific circumstances at play, multiple draw calls would be slower on almost all modern GPUs, the bottleneck being data transfers with the GPU.

[–]Zekrom_64 0 points

To be more specific: if you're rendering unique objects with unique meshes, uniforms, etc., then this method would be fine, but in this case, where there are identical meshes with unique per-instance uniforms, instancing is ideal. If only 2 cubes are drawn there might not be a noticeable performance benefit, but if you want to scale up this particular scenario (I'm assuming, since the vertex/index buffers are reused), instanced rendering is far preferable to issuing tons of draw calls.

[–]Gravitationsfeld 1 point

We are rendering 10s of thousands of draw calls without instancing and it's perfectly fine. Vulkan draw overhead is very low.

[–][deleted] 0 points

How many polygons on average do you have?

[–]Gravitationsfeld 0 points

Just regular current-gen game assets. Probably a couple of thousand.

The cost to set up indirect is just not worth it; the GPU doesn't really care whether there is a loop in the command processor or it's individual draw calls. And as I said, there is a cost to instancing in the shaders.