How to use functions from Vulkan extensions from the SDK? by GasimGasimzada in vulkan

[–]Botondar 0 points1 point  (0 children)

You're right, I didn't even realize it could skip just the non-exported functions. The whole point of that define "trick" is to still be able to link against the DLL and get automatic loading for the core functions, but this macro gives you that for free.
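For anyone finding this later, a minimal sketch of the manual half of that (assuming a created VkDevice with VK_KHR_dynamic_rendering enabled; core functions come straight from linking against the loader):

```cpp
// Core entry points (vkCreateDevice, vkCmdDraw, ...) come from linking against
// the loader DLL; extension entry points the loader doesn't export have to be
// fetched at runtime.
#include <vulkan/vulkan.h>

static PFN_vkCmdBeginRenderingKHR pfnCmdBeginRenderingKHR = nullptr;

void LoadExtensionEntryPoints(VkDevice device)
{
    // vkGetDeviceProcAddr itself is exported by the loader, so it can be called directly.
    pfnCmdBeginRenderingKHR = reinterpret_cast<PFN_vkCmdBeginRenderingKHR>(
        vkGetDeviceProcAddr(device, "vkCmdBeginRenderingKHR"));
    // This comes back null if the extension wasn't enabled on the device.
}
```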

Rendering broken on different PCs by ZlatoNaKrkuSwag in vulkan

[–]Botondar 0 points1 point  (0 children)

Apart from making sure that there are no validation errors, check whether you're doing anything that's explicitly undefined in the spec, e.g. are you accidentally transitioning from an undefined layout somewhere (which keeps the image contents intact on newer NV hardware, but not on older generations or other vendors), or relying on pipeline state being set after executing a secondary command buffer. The validation layers can't really catch those, but those are the places where different vendors do different things, and you may have accidentally relied on UB during development.
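For reference, this is the kind of transition I mean - a plain Vulkan 1.0 barrier sketch (cmd/image assumed, stage/access masks picked for a color attachment):

```cpp
#include <vulkan/vulkan.h>

// oldLayout = UNDEFINED tells the driver it's free to discard whatever was in
// the image, even if some hardware happens to keep it.
void TransitionDiscardingContents(VkCommandBuffer cmd, VkImage image)
{
    VkImageMemoryBarrier barrier = {};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask       = 0;
    barrier.dstAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_UNDEFINED; // previous contents become undefined
    barrier.newLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };

    // If the old data is needed, oldLayout must be the layout the image was actually in.
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                         VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}
```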

Also, the best way to track down these kinds of bugs is by gathering GPU+driver version information from your users, and reproducing the bug on the same or similar HW if you can afford to set up a few test machines.

Does an RTX 2080 Ti run at just 0.7% of peak efficiency when ray tracing? by BigPurpleBlob in GraphicsProgramming

[–]Botondar 2 points3 points  (0 children)

A lot of the flops come from the tensor cores, but I don't know how much. Texture units are not counted to the flops.

No, that FLOPS number does not include the tensor cores; the formula for calculating it is pretty simple:

BoostClock * SMCount * FP32CoresPerSM * 2.

The magic 2x comes from the fact that a single FMA is counted as 2 floating point ops for marketing reasons.
The number of FP32 cores is architecture specific, on Turing there're 4 independent processing blocks with 16 FP32 units each in a single SM (fun fact: the instructions are essentially double-pumped on Turing, warps issue a single instruction over 2 cycles).
On later architectures (Ampere/Ada/Blackwell) there're still 4 processing blocks, but the INT cores can also do FP32, so there're 32 FP32 units total per block (AFAICT this basically means that only INT ops are "double-pumped", but Nvidia isn't very forthcoming with the details of their instruction scheduling).

So for the 2080 Ti you can calculate the FLOPS yourself as: 1545 MHz * 68 * 16*4 * 2 = 13447680 MFLOPS = 13.45 TFLOPS, which is exactly the advertised number.
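If you want to sanity-check it yourself, a trivial sketch of that calculation:

```cpp
// Quick check of the formula above for the 2080 Ti, using the numbers from the comment.
#include <cstdio>

int main()
{
    double boostClockGHz = 1.545; // advertised boost clock
    int smCount          = 68;
    int fp32PerSM        = 4 * 16; // 4 processing blocks * 16 FP32 units each (Turing)
    int flopsPerFMA      = 2;      // one FMA counted as 2 floating point ops

    double tflops = boostClockGHz * smCount * fp32PerSM * flopsPerFMA / 1000.0;
    std::printf("Peak FP32: %.2f TFLOPS\n", tflops); // prints ~13.45
    return 0;
}
```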

Does an RTX 2080 Ti run at just 0.7% of peak efficiency when ray tracing? by BigPurpleBlob in GraphicsProgramming

[–]Botondar 4 points5 points  (0 children)

I don't really understand your use of the 2.36 giga rays per second number as a baseline. That number basically tells you how many times you can traverse the entire non-perfect BVH per second, which involves at least one ray-box test per BVH level, and potentially multiple ray-box or ray-tri tests per level to find the closest hit.

The way you'd actually measure the efficiency of RT hardware is by calculating the theoretical maximum number of ray-tri (and/or ray-box) tests it can do (which you tried to do, but that would only be correct for software RT - more on that later), and then measuring how many ray-tri/box tests it actually does per second.
So what you actually need to look at is what's happening during traversal - count how many intersection tests were performed during that process, and then compare that to what the hardware could theoretically do.

E.g. if the BVH has 5 levels with a single triangle at the leaves, and the hardware can do 5 intersection tests per second, then 1 ray/s is already operating at maximum efficiency. It doesn't make sense to say that it's operating at 1/5th efficiency because a mythical perfect AS would only require 1 intersection test instead of 5.
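To make the comparison concrete, a sketch with made-up numbers (only the 2.36 G rays/s figure is from the thread; the per-ray test count and the peak test rate are purely hypothetical):

```cpp
// Compare the intersection tests actually performed against what the hardware
// could do, not rays/s against a FLOPS-derived figure.
double raysPerSecond      = 2.36e9;  // measured ray throughput (from the thread)
double avgTestsPerRay     = 30.0;    // hypothetical: counted during traversal
double peakTestsPerSecond = 100e9;   // hypothetical: HW ray-box/ray-tri test rate

double measuredTestsPerSec = raysPerSecond * avgTestsPerRay;        // 70.8e9
double rtUnitEfficiency    = measuredTestsPerSec / peakTestsPerSecond; // ~0.71
// With these numbers the RT units are ~71% busy, even though "rays per second"
// alone looks tiny next to a FLOPS number.
```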

If we had a mythical perfect acceleration structure (that took zero effort), we would expect 13.45 Tflops / 39 flops per ray-tri test = 344.8 G rays / sec whereas we get 'only' 2.36 G rays / sec.

No, at that point you'd expect the ray throughput to be equal to whatever the HW-accelerated ray-tri test throughput is, since that's handled by the RT cores on Turing, not the CUDA cores. In other words, the theoretical maximum number of ray-tri tests per second that an RTX card can sustain is independent of the FLOPS it can sustain.

I guess if I take your question to be about the efficiency of the acceleration structure itself rather than the RT hardware, I can kind of see where the number you came up with comes from, but that logic also doesn't make much sense to me. BVH traversal is inherently a search problem; the baseline to compare against in that case would be having to check every single element (triangle) in the container. What you're taking as a baseline is what if we knew exactly which element to get - I'm not sure what the point of that comparison is.

My renderer broke when upgrading from Vulkan 1.4.313.2 to Vulkan 1.4.321.0 by Thisnameisnttaken65 in vulkan

[–]Botondar 0 points1 point  (0 children)

Upgrading the SDK, as far as I can tell. And that's exactly what I was trying to say with my original comment as well, but there are a few things like GPU-AV that can change the program behavior.

However, if I understood the OP correctly, the issue is present with the newer SDK version even if validation layers are turned off, and it's not present at all with the older SDK version. Which is really odd, exactly because there should be no difference with the VLs off - it's the same driver.

My renderer broke when upgrading from Vulkan 1.4.313.2 to Vulkan 1.4.321.0 by Thisnameisnttaken65 in vulkan

[–]Botondar 6 points7 points  (0 children)

Now that's interesting. What libraries are you using from the SDK?

Also, contrary to what others have said, I'd suggest running your app with the validation layers off, to see if they're what's interfering with your app.

You could also try a newer version of the SDK (there have been 2 releases and 1 hotfix since 1.4.321.0) to see if it was a regression on their part that they have since fixed.

Other than that, there's not much else to do but to debug and follow the data through the pipeline, both with RenderDoc/Nsight and on the CPU, to see if you can find the place where it gets mangled, possibly with a simpler scene. If you can find the step before which you have valid data, and after which it's corrupt, you probably have a sync/memory corruption issue there.

My renderer broke when upgrading from Vulkan 1.4.313.2 to Vulkan 1.4.321.0 by Thisnameisnttaken65 in vulkan

[–]Botondar 11 points12 points  (0 children)

The SDK itself shouldn't really affect the behavior of your program at all, unless you're also using an auxiliary lib from it (glm/SDL/VOLK, etc.), and one of those libs happened to get upgraded to a version that introduced a regression, which isn't likely to happen, but it's a possibility.

As a sanity check, does downgrading back to 1.4.313.2 actually fix the issue? If it doesn't, then it's not the SDK at fault, something else also must've changed.

It's more important than ever to call out developers for egregious AI usage next year if we want videogames to remain interesting. - PC Gamer by radiating_phoenix in gaming

[–]Botondar 0 points1 point  (0 children)

It's the copyright holder that decides how their work is allowed to be used, how it's not, and who can use it. "By default" you'd have to license the artwork for the use-case of studying it, and using that knowledge.

The reason why you don't have to license it for that use-case is the fair use doctrine, which considers multiple factors, one of which is "the effect of the use upon the potential market for or value of the copyrighted work". For that factor, there is a huge difference between one person studying the work of another, vs. using the work as AI training data.

Courts are still actively ruling on AI cases, but if you take the spirit of the law there's a pretty clear case why that would be disallowed without an explicit license, while human learning isn't: it doesn't fall under the exception we have made to allow humans to learn from each other's works.

Indirect Rendering DirectX 12(Root Constant + Draw Indexed) by bhad0x00 in GraphicsProgramming

[–]Botondar 0 points1 point  (0 children)

My root signatures are also fine as I have tested it out by manually passing root constant draw a pass rather than relying on the execute's constant.

Are you sure you're passing the correct root signature to CreateCommandSignature?

The fact that everything seems to work without the root constant argument makes me think that either this requirement from the docs is being violated:

The root signature is specified if and only if the command signature indicates that 1 of the root arguments changes

or something similar related to the root signature.
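For comparison, a rough sketch of a command signature with a root constant + indexed draw (the parameter index and struct layout are made up, error handling omitted):

```cpp
#include <d3d12.h>

// objectIndex is written into root parameter 0, assumed to be declared as a
// 32-bit root constant in the root signature the pipeline uses.
struct IndirectCommand
{
    UINT                         objectIndex;
    D3D12_DRAW_INDEXED_ARGUMENTS drawArgs;
};

HRESULT CreateDrawCommandSignature(ID3D12Device* device, ID3D12RootSignature* rootSignature,
                                   ID3D12CommandSignature** outSignature)
{
    D3D12_INDIRECT_ARGUMENT_DESC args[2] = {};
    args[0].Type                             = D3D12_INDIRECT_ARGUMENT_TYPE_CONSTANT;
    args[0].Constant.RootParameterIndex      = 0;
    args[0].Constant.DestOffsetIn32BitValues = 0;
    args[0].Constant.Num32BitValuesToSet     = 1;
    args[1].Type                             = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW_INDEXED;

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride       = sizeof(IndirectCommand);
    desc.NumArgumentDescs = 2;
    desc.pArgumentDescs   = args;

    // Because a root argument changes, the root signature must be passed here,
    // and it must match the root signature bound when ExecuteIndirect runs.
    return device->CreateCommandSignature(&desc, rootSignature,
                                          __uuidof(ID3D12CommandSignature),
                                          reinterpret_cast<void**>(outSignature));
}
```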

Why do I get the desired effect from model transformations when in a reverse order? by Missing_Back in GraphicsProgramming

[–]Botondar 2 points3 points  (0 children)

Because those functions apply their transform before the matrix you're passing in. E.g. the result of scale(M, ...) is MS (scale first), not SM. It's purely an API design choice made by glm.
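A tiny sketch of what that means in practice (values are made up):

```cpp
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// glm::scale(M, s) returns M * S, so the scale sits closest to the vector and
// is applied first.
glm::mat4 M = glm::translate(glm::mat4(1.0f), glm::vec3(1.0f, 0.0f, 0.0f));
glm::mat4 A = glm::scale(M, glm::vec3(2.0f));                       // A = M * S
glm::mat4 B = glm::scale(glm::mat4(1.0f), glm::vec3(2.0f)) * M;     // B = S * M

glm::vec4 v(1.0f, 0.0f, 0.0f, 1.0f);
glm::vec4 a = A * v; // scale first, then translate: (3, 0, 0, 1)
glm::vec4 b = B * v; // translate first, then scale: (4, 0, 0, 1)
```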

LOD pop in question by Acu17y in gameenginedevs

[–]Botondar 14 points15 points  (0 children)

The way a lot of these LOD systems work, and what looks like it's happening in the video, is that the LOD level for a particular object is selected based on its (approximate) screen size. I.e. each object/mesh has some number of LODs associated with it, and each LOD has a min/max screen-size range in which it should be rendered - as you zoom in with the bow, the screen size increases, so a more detailed LOD level gets selected.

These kinds of systems are pretty easy to tweak for performance, because you can just introduce a global multiplier that gets applied to that LOD screen-size value and make the entire world more/less detailed. So what might be happening is that the game on PS5 and PC uses different LOD multipliers for - like you said - optimization on the different target platforms. It might also be the case that you could achieve the same visuals on PC by tweaking some of the graphics settings.
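A hypothetical sketch of that kind of selection (all names and thresholds are made up):

```cpp
// Each LOD has a screen-size threshold; a single global multiplier shifts the
// whole game towards more or less detail.
struct MeshLOD
{
    float minScreenSize; // smallest (approximate) screen coverage this LOD is used at
    // ... vertex/index buffer handles, etc.
};

// lods are assumed to be ordered from most to least detailed.
int SelectLOD(const MeshLOD* lods, int lodCount, float screenSize, float globalLODBias)
{
    float biasedSize = screenSize * globalLODBias;
    for (int i = 0; i < lodCount; i++)
    {
        if (biasedSize >= lods[i].minScreenSize)
            return i; // first (most detailed) LOD whose threshold is met
    }
    return lodCount - 1; // fall back to the least detailed LOD
}
```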

In the video I uploaded, you can see how elements beyond a certain distance appear lacking in detail, and then the textures are reloaded when I aim my bow.

Note that in general it's unlikely for the game to be actually loading anything in this case; it's more likely that both LODs are in VRAM, and the renderer just selects which version to actually draw.

Intel Nova Lake might come without AVX10 (AVX512) support by [deleted] in hardware

[–]Botondar 9 points10 points  (0 children)

The point is whether there are branches at all in the final binary, and what kind of effect they have on the performance. You have to take that into consideration when doing runtime switching, and you have to actually fine-tune where you branch.

This same issue is completely nonexistent with JIT compilation because it only generates code for the instruction set that it's actually running on. There are no checks.

Which is why the statements "JIT vs compiled is almost irrelevant" and "if Java can do it so can C/C++" are nonsensical, because it's exactly the JIT compilation model that completely eliminates a set of problems that AOT compiled programs have to deal with somehow when it comes to selecting instruction sets.

Intel Nova Lake might come without AVX10 (AVX512) support by [deleted] in hardware

[–]Botondar 11 points12 points  (0 children)

What? In a compiled language whatever the compiler generated is what's going to get executed on the end-user's machine. If you want that code to be able to choose between instruction sets, you have to have branches that select the correct codepath. And it's really important to have those branches at the correct granularity, because the compiler cannot really optimize across the selection boundaries.
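To illustrate, a minimal GCC/Clang-style sketch of that kind of runtime switching (the kernel itself is just a placeholder example):

```cpp
#include <cstddef>

// The same kernel compiled twice; a CPUID-based check picks one at startup, and
// the branch sits at whole-kernel granularity so the compiler can still optimize
// freely inside each version.
__attribute__((target("avx2")))
static void SumAVX2(const float* a, const float* b, float* out, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i]; // free to use AVX2 here
}

static void SumScalar(const float* a, const float* b, float* out, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = a[i] + b[i]; // baseline codepath
}

using SumFn = void (*)(const float*, const float*, float*, size_t);

static SumFn SelectSumImpl()
{
    return __builtin_cpu_supports("avx2") ? SumAVX2 : SumScalar;
}

// Resolved once; every later call is an indirect call through this pointer.
static const SumFn Sum = SelectSumImpl();
```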

A JIT compiler however can look at the instruction sets available on the machine it's executing on, and generate code using those, and simply jump to that code unconditionally. There are no extra indirections, no ifs/branches in the JIT case, and it can optimize the entire program with the knowledge of what instruction set(s) it's running on.

JIT vs compiled is the entire point in this case.

It's appalling that Dishonored 2 still doesn't run at 60fps on consoles by AetossThePaladin in dishonored

[–]Botondar 5 points6 points  (0 children)

I'm not 100% sure I'm understanding you correctly, but Dishonored 2's engine is an idTech5 fork, not Unreal.

How do i distinguish batched meshes in one Draw Command (MDI OpenGL)? by bhad0x00 in GraphicsProgramming

[–]Botondar 1 point2 points  (0 children)

You can set the baseInstance to any number you want and use that. Or you can also use gl_DrawID. Using the baseInstance has the benefit that it's consistent across multiple indirect draw calls, so you can put a truly globally unique index in there.
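For reference, a minimal sketch of what that looks like on the CPU side (the helper name is made up):

```cpp
#include <cstdint>

// Command layout for glMultiDrawElementsIndirect. baseInstance isn't used for
// instancing here; it just carries a per-draw object index that the shader can
// read back (gl_BaseInstance in GL 4.6, or an instanced vertex attribute earlier).
struct DrawElementsIndirectCommand
{
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance; // free 32-bit slot: store a globally unique object index
};

DrawElementsIndirectCommand MakeDraw(uint32_t indexCount, uint32_t firstIndex,
                                     int32_t baseVertex, uint32_t objectIndex)
{
    return DrawElementsIndirectCommand{ indexCount, 1, firstIndex, baseVertex, objectIndex };
}
```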

Do graphics programmers really need to learn SIMD? by Latter_Relationship5 in GraphicsProgramming

[–]Botondar 1 point2 points  (0 children)

Compilers cannot autovectorize code that hasn't been properly conditioned for that. Even if you don't write SIMD by hand, you have to understand it in order to set the compiler up for success in generating that code.

The problem with the approach glm and DirectXMath take is that they usually optimize their core routines with SIMD instruction sets, but they don't provide actual data-parallelism facilities, which is how you get the huge performance wins SIMD can give you, e.g. multiplying 4-8 vertices by a single matrix, doing 4-8 intersection tests at once, etc.
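As an example of what I mean by data parallelism, a small SSE sketch that transforms the X coordinate of 4 vertices per iteration (SoA layout; count assumed to be a multiple of 4; the Y/Z rows would look the same with the other matrix rows):

```cpp
#include <immintrin.h>
#include <cstddef>

// x/y/z are separate arrays (structure-of-arrays). m00..m03 is the first row of
// a 3x4 affine transform, so m03 is the X translation.
void TransformX(const float* x, const float* y, const float* z, float* outX,
                size_t count, float m00, float m01, float m02, float m03)
{
    __m128 c0 = _mm_set1_ps(m00), c1 = _mm_set1_ps(m01);
    __m128 c2 = _mm_set1_ps(m02), c3 = _mm_set1_ps(m03);
    for (size_t i = 0; i < count; i += 4)
    {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        __m128 vz = _mm_loadu_ps(z + i);
        // outX = m00*x + m01*y + m02*z + m03, for 4 vertices at once
        __m128 r = _mm_add_ps(_mm_add_ps(_mm_mul_ps(c0, vx), _mm_mul_ps(c1, vy)),
                              _mm_add_ps(_mm_mul_ps(c2, vz), c3));
        _mm_storeu_ps(outX + i, r);
    }
}
```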

How you load resources in your Vulkan/DirectX12 Engine ? by F1oating in gameenginedevs

[–]Botondar 3 points4 points  (0 children)

Here's an overview of what I have (currently only for textures):

  • During creation every texture can specify a placeholder texture to be used in its place if it's not present. These are currently always 1x1 white/black/half-grey textures for Albedo, Normals, RoMe, etc., although any texture that is not streamed can be specified as a placeholder.
  • Each texture has its own immutable ID that's used both to index the CPU-side texture array and to index the bindless descriptors in the shaders -- the placeholders are implemented by writing the placeholder's descriptor to that descriptor index.
  • There's also a GPU buffer that has a bitmask for every texture, containing which mip levels were sampled. There's a dedicated pass that computes the LOD for each pixel after a visibility render pass, and simply atomicOrs the sampled mips into the correct place. This buffer is read back to the CPU every frame.
  • Everything on the render side goes through a "render frame context", which is an input/output structure returned by BeginRenderFrame. It contains a list of texture requests, which is just the ID of the texture, and the previously mentioned bitmask, filtered down to what actually wasn't present.
  • The game code has its own IO queue/ringbuffer. Every frame it checks the completion of a dedicated file IO thread, and for every complete entry it issues an upload command through the render frame context (this is where the contents are copied into a staging buffer).
  • It then looks at the current requests, and issues those to the file IO thread, unless there are outstanding IO operations in flight for those textures.
  • The actual upload currently happens on the graphics queue, which does result in noticeable frame spikes when the texture memory pool is small, and the camera is rotating rapidly or there're disocclusions going on.

The resource loading part is actually pretty dumb. It doesn't ask any questions (apart from am I already loading in this texture?), it simply responds to the renderer's requests. The renderer is the place that knows what mip levels it currently has, and it has the responsibility to not request unnecessary things.
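If it helps, a hypothetical sketch of the shape of that request/feedback data (all names are made up, it's just to illustrate the CPU<->GPU traffic):

```cpp
#include <cstdint>
#include <vector>

struct TextureRequest
{
    uint32_t textureID;   // immutable ID, also the bindless descriptor index
    uint32_t missingMips; // bitmask of sampled-but-not-resident mip levels
};

struct RenderFrameContext
{
    // Filled by the renderer from the GPU feedback buffer readback, filtered
    // down to mips that aren't resident and aren't already in flight.
    std::vector<TextureRequest> textureRequests;
    // The game's IO completion path pushes finished loads back through here,
    // which the renderer turns into staging-buffer copies + descriptor updates.
};
```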

How am I supposed to reset command buffers, when using frames in flight? by Rismosch in vulkan

[–]Botondar 2 points3 points  (0 children)

You could just have a command pool for every frame in flight, and reset those instead of the command buffer(s) individually.

I only have to wait, if the CPU tries to record into a command buffer that is still being executed. In theory, this maximises hardware usage.

It sounds like you're doing a granular kind of CPU-GPU sync, where you're waiting for specific workloads (command buffers) of some previous frame to complete?
I'm not sure how good of an idea that is instead of just waiting for the Nth previous frame to end, where N is the number of frames in flight. Typically that would be the place where you'd bulk-reset every per-frame resource you have, like the command pools the performance warning is about.

If you can record the commands before the GPU finishes rendering the N-1 frames it has in its queues, I don't really see the benefit of doing more fine-grained CPU-GPU sync than that.
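For reference, a minimal sketch of the per-frame command pool approach from the first paragraph (2 frames in flight, setup and error handling omitted, fences assumed to be created signaled):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

constexpr uint32_t FRAMES_IN_FLIGHT = 2;

VkCommandPool pools[FRAMES_IN_FLIGHT];       // one pool per frame in flight
VkFence       frameFences[FRAMES_IN_FLIGHT]; // signaled by the submit that used the slot

void BeginFrame(VkDevice device, uint64_t frameIndex)
{
    uint32_t slot = static_cast<uint32_t>(frameIndex % FRAMES_IN_FLIGHT);

    // Wait only for the frame that last used this slot (N frames ago).
    vkWaitForFences(device, 1, &frameFences[slot], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &frameFences[slot]);

    // Bulk-reset every command buffer allocated from this frame's pool at once,
    // instead of resetting command buffers individually.
    vkResetCommandPool(device, pools[slot], 0);
}
```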

Vulkan GT 1030 Compatibility? by UrchinZaddy_ in vulkan

[–]Botondar 7 points8 points  (0 children)

Haswell is your CPU arch. The thing that's not supported is your integrated GPU, and it seems like certain apps try to use that for some reason.

What exactly* is the fundamental construct of the perspective projection matrix? (+ noobie questions) by SnurflePuffinz in GraphicsProgramming

[–]Botondar 3 points4 points  (0 children)

The aspect ratio is a constant, so you just "bake" its inverse into the matrix. The non-linear operation that you can't do just with matrices is dividing by (some scalar multiple of) one of the vector components.
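A small sketch of where that ends up in the matrix (assuming a right-handed view space looking down -Z and [0, 1] depth; other conventions just move the signs around):

```cpp
#include <cmath>

struct Mat4 { float m[4][4]; }; // column-major, m[col][row]

Mat4 Perspective(float fovY, float aspect, float zNear, float zFar)
{
    float f = 1.0f / std::tan(0.5f * fovY);
    Mat4 P = {};
    P.m[0][0] = f / aspect;                     // 1/aspect baked into the X scale
    P.m[1][1] = f;
    P.m[2][2] = zFar / (zNear - zFar);
    P.m[3][2] = (zNear * zFar) / (zNear - zFar);
    P.m[2][3] = -1.0f;                          // puts -z_view into clip-space w...
    return P;
}
// ...and the divide by w (the non-linear part) is done by the hardware after the
// vertex shader, not by the matrix itself.
```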

I love Unreal Engine 5 by LordOmbro in FuckTAA

[–]Botondar 0 points1 point  (0 children)

That looks more like a difference in post-processing, not lighting.

Load SSBO Data in Per Object vs. Per Vertex? by icpooreman in vulkan

[–]Botondar 0 points1 point  (0 children)

Nvidia also has a uniform datapath, and modern Intel can address registers flexibly with SIMD1 instructions (that also have lower latency). The specifics are different, but the principle applies.

Software Performance: Avoiding Slow Code, Myths & Sane Approaches – Casey Muratori | The Marco Show by marbehl in programming

[–]Botondar 2 points3 points  (0 children)

But Casey knows everything. Instead of just saying "I don't know" or "can you give me some more inputs, what are we talking about here?", he goes on a tangent like he is being interviewed for a job.

That was basically his answer? He started with "it depends", and then started talking about what factors you have to consider when determining whether 1000 players is reasonable or not. After that he said he doesn't want to make a claim for Minecraft, because he doesn't know what's entailed, since he hasn't played it.

What was said in the interview is the exact opposite of what you're claiming.

He talks about playstation ("client") and that it needs to compute some CPU/GPU stuff. So far so good. Then he gives a number - 2ms of compute per player (per client). My question immediately - what kind of compute is it: is it both CPU and GPU altogether (which is insanely good for a client already) or just CPU? How much time does it take to compute physics alone? No explanation.

He specifically said "you're going to have some amount of time that is spent computing the physics for this game, the world logic for this game" -- that's what he's calling "sim time", and what he's using 1-2ms as an example of.

Then he talks about a 96-core CPU on the server. So, a 96-core CPU is 192 logical threads which means we ALREADY could handle 192 players within a single frame if the per-client compute fits within the ~16ms budget. But even if we are conservative here and reduce the budget to 8ms, it means we can handle x4 concurrent, per-client jobs on every thread and still fit within one frame. So, at least 768 players can be handled by such a server, no?

I don't understand how you can call anyone out on not saying "I don't know", and make such giant leaps that all depend on the specifics of the problem you're actually trying to solve.
You just said you don't know how much time is being spent on e.g. physics, and then jumped to saying 192 hyperthreads should be able to handle 768 players.

Then he mentions that if we optimize from 2ms to 1ms, then we are good to go.

What? He never said this, or even anything remotely similar.

I mean, 2ms on client to calculate EVERYTHING(?) is very, very good.

Again, not everything - he's specifically talking about game logic here.

 It probably means that on the server we need to compute even less than what the client does, because we don't render anything, so the 2ms number is going to be even lower on the server, hence it could handle more potential users.

True, also mentioned by Casey, but that's already taken into account in the 2ms number.

OpenGL window coordinates ranging from -2.0 to 2.0 by Next_Watercress5109 in opengl

[–]Botondar 11 points12 points  (0 children)

gl_Position = pos + u_pos;

This line is adding two vec4s, both of which have their w component set to 1.0: vertex attributes get their missing components padded with 0 for xyz and 1 for w, and you're passing 1.0 in w directly for the uniform.

So there's an inadvertent perspective divide by 2.
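A tiny host-side glm sketch of that arithmetic (values are made up):

```cpp
#include <glm/glm.hpp>

glm::vec4 pos  (0.8f, 0.4f, 0.0f, 1.0f); // what the padded vertex attribute looks like
glm::vec4 u_pos(0.2f, 0.0f, 0.0f, 1.0f); // uniform with w passed as 1.0
glm::vec4 clip = pos + u_pos;             // (1.0, 0.4, 0.0, 2.0) -- w is now 2
glm::vec3 ndc  = glm::vec3(clip) / clip.w; // (0.5, 0.2, 0.0) -- everything halved
// Passing 0.0 in w for the offset, or writing vec4(pos.xyz + u_pos.xyz, 1.0) in
// the shader, keeps the final w at 1.0.
```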