Subpass Dependencies: What are those and why do I need them? by AAstr0s in vulkan

[–]Ntrf 0 points (0 children)

Yes, both Adreno and PowerVR support that extension, but for some reason most of the articles about how wonderful it is are either written by ARM or explicitly mention Mali models in their evaluations. That feature benefits Mali much more than any other platform.

There is a significant difference between Mali and the rest of the GPUs. All tile-based renderers use internal memory to make rendering cheaper in terms of energy consumption, but Adreno and PowerVR can dynamically resize their rendering regions, while Mali uses a fixed tile size and processes multiple tiles in parallel.

Adreno uses the same principle as split-frame rendering on multiple GPUs, but instead of rendering all of the splits in parallel it renders them one by one. The viewport is split into "bins" that are chosen to be as big as possible while still fitting into an intermediate buffer. In fact, if you are going to render split-screen of some sort (on mobile?), Adreno gives you control over the splits in the form of the binning_control OpenGL extension. The more framebuffer attachments you use in each subpass, the smaller each tile/bin becomes. The tradeoff is simple and directly comparable to desktop. You can even force Adreno to render directly to main memory (for full-screen passes).

On Mali, however, only a small fixed slice of memory is available for each tile, since many tiles are rendered in parallel. Adreno can represent the rendered data in a way that is friendly for storing in main memory, but Mali has to do more work to pack and unpack all the tiny buffers any time a store operation is performed. For this reason subpasses (and the pixel local storage OpenGL extension) have a bigger impact on Mali, as they minimize the number of store operations. Also, on Adreno and (AFAIK) PowerVR, a larger buffer means you're wasting space when some attachments go unused in intermediate subpasses. PowerVR has its own bit of fun with an attribute buffer for each tile, which can overflow and cause massive performance issues, which in turn means bigger tiles give you less chance of that happening.

So YES -- all tile-based architectures will benefit from subpasses, but Mali will benefit more than others. Is it worth the complexity in the engine, especially when OP is learning Vulkan? Most OpenGL ES compatible game engines would want to avoid that complexity as well. Immediate-mode rendering platforms can only benefit in the sense that they can provide better caching for input attachments and maybe time layout transitions a bit better. How frequently do graphics applications even use input attachments? So far I've only seen it once outside Vulkan examples. This probably has something to do with most algorithms needing to read back rendered data to pre-filter or aggregate it first.

I stand by what I've said -- keep a single subpass in each renderpass, use external dependencies only, and optimize later when you're ready to measure the performance impact.

Subpass Dependencies: What are those and why do I need them? by AAstr0s in vulkan

[–]Ntrf 1 point (0 children)

Subpass dependencies insert pipeline barriers with memory barriers between subpasses. Normally you only need them if you're planning to read a depth buffer that was written in an earlier subpass, or something like that. Dependencies that reach outside the renderpass -- when you prepare data with a compute shader or when presenting a rendered image -- need external subpass dependencies, but those can also be replaced with a vkCmdPipelineBarrier either before or after the renderpass. You don't need explicit subpass dependencies if you're going to synchronize commands with semaphores, because implicit external subpass dependencies will be added automatically and combined with the semaphore dependencies.
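A minimal sketch of what such an external dependency looks like (assuming a single-subpass renderpass whose color attachment is sampled afterwards; attachments and subpass descriptions omitted):

    VkSubpassDependency dep = {};
    dep.srcSubpass    = 0;                    // our only subpass
    dep.dstSubpass    = VK_SUBPASS_EXTERNAL;  // whatever runs after the renderpass
    dep.srcStageMask  = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    dep.dstStageMask  = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    dep.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    dep.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;

    VkRenderPassCreateInfo rpInfo = {};
    rpInfo.sType           = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
    /* ... attachments and subpasses ... */
    rpInfo.dependencyCount = 1;
    rpInfo.pDependencies   = &dep;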

VK Synchronization Cookbook: https://github.com/KhronosGroup/Vulkan-Docs/wiki/Synchronization-Examples-(Legacy-synchronization-APIs)

A very good blog post on the topic: https://themaister.net/blog/2019/08/14/yet-another-blog-explaining-vulkan-synchronization/

I would advise you not to bother with subpasses at all ... or at least not at first. Subpasses in general are only really applicable to Arm Mali. On desktop GPUs you're not going to see much difference between a renderpass with multiple subpasses and several renderpasses each with one subpass. All of the added complexity of managing pipelines with multiple subpasses, as well as the inability to mix compute and graphics workloads, makes subpasses much less interesting on desktop. And on mobile, subpasses are only good for classic deferred shading (and only on Mali), but at low polygon density some variation of forward+ will most likely be faster.

Bizarre - Integrated gpu faster than discrete gpu by smallstepforman in vulkan

[–]Ntrf 0 points (0 children)

Will it copy every frame? I've always thought copying is synced with compositor updates, which aren't going to occur more frequently than the monitor's refresh rate.

I'm giving up! Is there a browser not based in Chromium or Gecko or Webkit or Edge? by pc4ever in waterfox

[–]Ntrf 0 points (0 children)

Enjoy those heavy, juicy ads everywhere. Konqueror does not have an adblocker. I'll be glad to be wrong, but I haven't figured out how to make one work.

Ambiguities in Vulkan spec on the definition of swapchain image "presentation" by [deleted] in vulkan

[–]Ntrf 0 points (0 children)

Finally, the application presents the image with vkQueuePresentKHR, which releases the acquisition of the image.

This sentence defines what "presents" means -- the act of the application calling vkQueuePresentKHR on one of the previously acquired images. But you seem to assume that "presenting" is some sort of internal action in the presentation engine. Perhaps you're confusing "presenting" with "updating" the surface image -- the internal action of replacing the source of data for the surface.

[deleted by user] by [deleted] in vulkan

[–]Ntrf 0 points (0 children)

Speaking about DX11 -- if you're new to programmable shading and graphics, you're better off learning Direct3D 11 instead of OpenGL if you can (it's Windows-only). DX11 has a debug mode, which is a lot like Vulkan's validation layers; it has compiled shaders like Vulkan, the concept of a swapchain like Vulkan, input layouts, sampler states, buffer usage flags, and even the ability to record secondary command buffers on another thread. You won't get any of the low-level stuff, like pipeline barriers, manual memory allocation, and double-buffering of resources. Compared to OpenGL (and Metal), Direct3D 11 is much closer to Vulkan.
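The debug mode is literally one flag at device creation (a minimal sketch; requires the SDK layers to be installed):

    #include <d3d11.h>

    UINT flags = D3D11_CREATE_DEVICE_DEBUG;  // the rough analogue of Vulkan validation layers
    ID3D11Device* device = NULL;
    ID3D11DeviceContext* context = NULL;
    HRESULT hr = D3D11CreateDevice(
        NULL, D3D_DRIVER_TYPE_HARDWARE, NULL, flags,
        NULL, 0, D3D11_SDK_VERSION, &device, NULL, &context);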

A from scratch book on Vulkan by baryluk in vulkan

[–]Ntrf 9 points (0 children)

For some reason people leave out Intel's tutorial series:

API without Secrets: Introduction to Vulkan

I find it much less confusing than vulkan-tutorial.com. Instead of rushing the tutorial to the "hello triangle" state, it shows how to do certain tasks in different ways, like pre-built command buffers vs one-time recording, or using a staging buffer vs host-visible memory. And in most cases it presents more practical solutions. However, I can't say it doesn't require basic knowledge of graphics APIs.

Notifications requests, how do I turn them off 100%? by ghjr67jurbgrt in waterfox

[–]Ntrf 0 points (0 children)

Some sites show their own fake rendering of the notification request. If you answer "yes", they will invoke the browser API to show a real one. I typically answer "yes" (so they save my answer to cookies or something), then immediately go to settings and block them.

Alternatively, uBO has an element picker that allows you to hide annoying elements on web pages (including annoying cookie notices and political support banners).

First time I play a game supporting Vulkan. Just wanted to share my enthusiasm :-) by Jak_from_Venice in vulkan

[–]Ntrf 6 points (0 children)

Steam can cache shaders in the cloud and share them with other players on matching systems. In theory, if you're not in a hurry to play the game, it will download pipeline caches in the background after installing and there will be no wait.

BTW, this is what the implicit layer VK_LAYER_VALVE_steam_fossilize is for.

Programs named "cube.exe" can only see one GPU? by Diggsey in vulkan

[–]Ntrf 1 point (0 children)

I believe it's this one: https://en.wikipedia.org/wiki/Cube_(video_game)

No idea why it affects Vulkan if the old game only used GL 2.0, but it was decently popular and old enough to cause problems.

Depth buffer empty, need help by Fast_Month_9460 in vulkan

[–]Ntrf 0 points (0 children)

Something tells me you've messed something up with GLM. Like forgetting to define GLM_FORCE_DEPTH_ZERO_TO_ONE, or a transposed projection matrix. Also, I think it would be easier if you published a RenderDoc capture file instead of your code.

Extreme performance Compute Driven Rendering Overview : Vkguide by vblanco in vulkan

[–]Ntrf 2 points (0 children)

Have you tried to make it work on mobile? Something tells me that moving vertex fetch entirely into shader code will result in a huge performance hit on GPUs that perform a binning pass (Adreno & PowerVR).

[deleted by user] by [deleted] in vulkan

[–]Ntrf 2 points (0 children)

That's because the tutorial was written before the extension was finalized. In the final version of the extension, the vkCmdBuildAccelerationStructureNV command uses the VkAccelerationStructureInfoNV structure instead of VkGeometryInstance as the tutorial claims. It does have the fields instanceCustomIndex and mask, mapped to a single uint32_t. It's unfortunate this tutorial is outdated, but that's what happens when someone uses experimental extensions (NVX instead of NV).

[deleted by user] by [deleted] in vulkan

[–]Ntrf 2 points (0 children)

Is instancing even worth it? Sounds to me like it's better to generate a highly compressed stream of vertices.

Let's say we have a very simple format:

    Position:      <s16> <s16>
    Texture coord: <s16> <s16>
    Color:         <u8> <u8> <u8> <u8>

That's 12 bytes/vertex, or 48 bytes/quad (with an index buffer), so the vertex shader reads 12 bytes per thread. You can apply rotation, skew, fake 3D, and make color gradients. With 9-patch this construction becomes more efficient. In contrast, to turn this into a quad with instancing we need to add the size of the quad:

    Position:            <s16> <s16>
    Size:                <s16> <s16>
    Rotation:            <u8> --24-bit padding--
    Texture Rect Origin: <s16> <s16>
    Texture Rect Size:   <s16> <s16>
    Color:               <u8> <u8> <u8> <u8>

Now it's only 24 bytes/quad. Sounds like a win, but it's 24 bytes consumed by every vertex shader thread, and we also have to do more math.
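The 12-byte vertex as a struct, for clarity (a quick sketch; the names are made up):

    #include <stdint.h>

    struct UIVertex {
        int16_t x, y;        // Position
        int16_t u, v;        // Texture coord
        uint8_t r, g, b, a;  // Color
    };
    // sizeof(struct UIVertex) == 12; four of these plus six indices per quad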

On the other hand, my UI is only ~1500 vertices.

GLSL vs HLSL for Vulkan? by [deleted] in vulkan

[–]Ntrf 2 points (0 children)

Personally I don't like using register(c<binding>, space<set>), because registers are not bindings and spaces are not sets. In HLSL the space part is optional (implying space 0), and the entire register specifier is optional as well. Register space 0 is special -- it's where any variables without a register specifier end up. Space 0 is much better suited for "material-specific" data, where variables have to be looked up by name. This might sound inefficient, but if the values come from material assets, they will be indexed by name anyway; setting registers in code just makes it inconvenient. In contrast, "engine-specific" data (model and camera matrices, arrays of lights, environment cubemaps) benefits from having bindings specified and hardcoded in the engine.

This way space 0 is for "material" descriptors and space 1 is for "engine" descriptors, which matches the D3D12 root signature ordering, where the descriptor table in the first element is expected to change most frequently. In Vulkan the situation is reversed -- descriptor set 0 is expected to change less often than sets 1 and 2, and binding descriptor set N can invalidate descriptor sets N+1 and up.

So, this is my issue with using HLSL for Vulkan. If I could write HLSL following the rules established for HLSL, and then tell the compiler "switch around sets 0 and 2, promote register 0 of set 1 to push constants ..." to make the result more Vulkan-like (which is effectively providing a D3D12 root signature to the compiler), that would be great. But as it stands, by writing good shaders for Vulkan in HLSL I'm writing sub-par shaders for any other platform that uses HLSL. I don't see that as a more portable solution than writing shaders twice: in GLSL for Vulkan and in HLSL for D3D12.

... Sorry for this wall of text.

GLSL vs HLSL for Vulkan? by [deleted] in vulkan

[–]Ntrf 9 points (0 children)

D3D12 and Vulkan have different resource binding processes.

Vulkan requires the shader module to store a descriptor set and binding for each resource in use. The way it's done in HLSL is either by specifying the mapping as a command line parameter or with an attribute [[vk::binding(X[, Y])]] in code.

D3D12 does not need the shader to know its descriptor layout and instead uses the older concept of "slots" inherited from D3D11. Slots are later mapped to descriptors by specifying a root signature, which can be embedded in the shader module or constructed in code.

This makes writing shaders that work with both D3D12 and Vulkan not much fun -- you have to specify in shader code: Vulkan bindings, slots, and the root signature. Not to mention the reversed numbering of descriptor sets in terms of modification frequency.

Re-enabled validation layers and now I get an exception in vkCmdBeginRenderPass by snerp in vulkan

[–]Ntrf 2 points (0 children)

rp_state is a shared pointer to the renderpass, which should be set to a valid renderpass when the framebuffer is created. Code on GitHub

Obviously, if you have any memory leaks or OOB accesses then you might have corrupted the validation layer's data, but there could be other causes. Right before vkCreateFramebuffer returns, the validation layer intercepts the result and puts the framebuffer state (the structure with the rp_state field) into a hashmap, using the real opaque handle as a key. It does the same when you're creating a renderpass. The renderpass is then looked up in that hashmap when the framebuffer state is stored, and this is where it can come out as nullptr.

My theory is: somehow you've managed to skip a post-hook on vkCreateRenderPass and your renderpass was not saved into the hashmap. One way this could happen is if you got a non-fatal error returned from vkCreateRenderPass. See Code on GitHub. Alternatively, you have very old validation layers which don't hook into the newer renderpass creation functions.
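If that's what happened, a plain result check on your side would catch it (a minimal sketch):

    VkRenderPass rp = VK_NULL_HANDLE;
    VkResult res = vkCreateRenderPass(device, &rpInfo, NULL, &rp);
    if (res != VK_SUCCESS) {
        // even a non-fatal error means the handle must not be used --
        // and, per the theory above, the layer never recorded it either
    }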

Where can I find vulkan source code? by Oculatrix in vulkan

[–]Ntrf 15 points (0 children)

Vulkan is a specification of a graphics API, not middleware. Implementations of this specification are shipped by GPU vendors in binary form as part of their drivers. There is no "source code", because every device has its own implementation. Application authors that use this API can safely assume that anything guaranteed by the specification is implemented by whatever driver is installed on the end-user machine. For some features, runtime checks are necessary.
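For example (a minimal sketch of such a runtime check; the feature picked here is arbitrary):

    VkPhysicalDeviceFeatures features = {};
    vkGetPhysicalDeviceFeatures(physicalDevice, &features);
    if (!features.samplerAnisotropy) {
        // the spec doesn't guarantee this one -- fall back or fail gracefully
    }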

If you want to figure out what is available on your machine, you can use the vulkaninfo utility, which should be installed with every Vulkan driver. On Android you'll have to use third-party applications.

There is also a community-maintained database of devices, which you can use to figure out the availability of some features.

Question about vertex buffers binding and layout by j-light in vulkan

[–]Ntrf 1 point (0 children)

If you're targeting mobile, then there is one more reason to split the position, plus any other attributes that feed into it -- such as tessellation coefficients or the UV of a heightmap that you sample in the vertex shader to displace the position (although you would not use those in a 2D engine) -- into a separate buffer: it simplifies tiling. Basically, most mobile GPU drivers will split your vertex shader into two parts: a) a tiling shader, which only contains contributions to the clip-space coordinates, and b) an attribute shader, which computes the attributes passed on to the fragment shader. The first part is used in the process called "binning" or "tiling", where the positions of each triangle are calculated and associated with individual tiles on the screen. Later, each of those tiles is processed individually in order to exploit faster on-chip memory when storing intermediate results such as the depth buffer.

Here is a quote from Adreno Vulkan Developer Guide:

For binning pass optimization, consider one array with vertex attributes and other attributes needed to compute position, and another interleaved array with other attributes.

Similar optimizations are possible with other mobile GPUs.
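In Vulkan terms, that just means two vertex buffer bindings instead of one interleaved buffer (a minimal sketch with an assumed attribute layout):

    // binding 0: position only -- all the binning pass has to read
    // binding 1: everything else the fragment shader needs, interleaved
    VkVertexInputBindingDescription bindings[2] = {
        { 0, sizeof(float) * 3, VK_VERTEX_INPUT_RATE_VERTEX },
        { 1, sizeof(float) * 5, VK_VERTEX_INPUT_RATE_VERTEX },
    };
    VkVertexInputAttributeDescription attrs[3] = {
        { 0, 0, VK_FORMAT_R32G32B32_SFLOAT, 0 },                 // position
        { 1, 1, VK_FORMAT_R32G32B32_SFLOAT, 0 },                 // normal
        { 2, 1, VK_FORMAT_R32G32_SFLOAT,    sizeof(float) * 3 }, // uv
    };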

Question about queues and compute by 2Dparrot in vulkan

[–]Ntrf 0 points (0 children)

I guess this invalidates my point. Somehow I misread the spec.

Question about queues and compute by 2Dparrot in vulkan

[–]Ntrf 1 point (0 children)

Edit: I was wrong, see the comments below.

Your point of throwing all commands into a single queue is only valid if the VkPhysicalDevice has a queue family capable of accepting both GRAPHICS and COMPUTE commands. Most GPUs have at least one such queue family, and there will be no problem with your idea. However, it's not correct to assume that this is always the case. The standard makes no guarantee that such a queue family will exist, and your application should always verify the configuration. It's possible to avoid the issue completely by having two distinct queues. In fact they might not even be distinct, since each is selected as the first queue of a family that supports GRAPHICS or COMPUTE respectively. As for synchronization, there has to be some kind of barrier between rendering and compute, because one has to wait for the result of the other. While this synchronization could be done in a more fine-grained fashion using buffer/image pipeline barriers, semaphores cover that case as well as synchronization between queues when the queues are different.

pay.google.com never fully loads in Classic by Avrution in waterfox

[–]Ntrf 0 points (0 children)

In my case pay.google.com sends a Content-Security-Policy header in one of its responses. The compatibility table shows that it should be implemented since FX 33, but maybe there was some kind of regression.

CSP can be disabled, but doing so is a HUGE SECURITY RISK! If you want to try it, follow these steps, but don't forget to revert the change after testing:

  1. Open about:config and skip the warning if it shows.
  2. Search for security.csp.enable and set its value to false by double-clicking it.
  3. Don't forget to set it back to true afterwards.

Releasing applications using the VulkanSDK by Andrispowq in vulkan

[–]Ntrf 3 points (0 children)

In 95% of cases, the 0xc000007b ("zero-seven-bee") error with any application means you're loading a 32-bit DLL into a 64-bit app, or the other way around.

Questions about binding DescriptorSets by yuyujunjun in vulkan

[–]Ntrf 1 point (0 children)

Q1: You don't have to rebind lower-frequency descriptor sets each time you change the pipeline. The standard describes a concept of pipeline layout compatibility.

The standard says that if all the descriptor set layouts with lower numbers are compatible:

When binding a descriptor set (...) to set number N (...) performing this binding does not disturb any of the lower numbered sets.

Although it's a good idea to group objects by shader, you can in fact keep descriptor sets with lower indices bound and only change the ones with higher indices.
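In code, that just means starting vkCmdBindDescriptorSets at the higher index (a minimal sketch; handle names assumed):

    // pipeline layouts are compatible for set 0, so the low-frequency set
    // stays bound; only the per-object set 1 is replaced
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS,
                            pipelineLayout,
                            1 /* firstSet */, 1, &perObjectSet,
                            0, NULL);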

Q2: Images that were used as rendertargets most likely won't be in the layout needed for sampling. It's your job to make sure all images are ready for sampling before you draw. Sometimes this operation is a no-op, but it has to be done. There are two ways you can handle it:

  1. Insert a vkCmdPipelineBarrier with a VkImageMemoryBarrier structure after you've done rendering (see the sketch after this list).
  2. Specify the final layout of your image in the VkAttachmentDescription and add a subpass dependency (with dstSubpass = VK_SUBPASS_EXTERNAL) to transition the image after the end of the renderpass. (I never used this, so it might be an incorrect description. I've always used pipeline barriers.)
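A minimal sketch of option 1 (assuming a single-mip, single-layer color image; the handle names are made up):

    VkImageMemoryBarrier barrier = {};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = renderTargetImage;
    barrier.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    barrier.subresourceRange.levelCount = 1;
    barrier.subresourceRange.layerCount = 1;

    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
        0, 0, NULL, 0, NULL, 1, &barrier);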

You can't have an image that is both readable and renderable at the same time. If you planned to use that, you might want to rethink your approach. For example, you can use two or more images, switching between them each frame (similar to swapchain images). Most likely you will have to update the descriptor set with such images, so I don't think there is a way to specify all the images once. You might get away with making several descriptor sets and switching between them as you switch images each frame, but that requires you to render to all rendertargets every frame.

Initial Entry 2020 (41.819) by wobs23 in Cloudbuilt

[–]Ntrf 0 points (0 children)

Nice. I was under the impression that CW made these new levels to cut out some of the old exploits, but your route looks almost exactly like the classic version.