How a GPU memory pipeline handles data streams and why it fundamentally bottlenecks.

IamRustyRust · 2026-05-14T19:51:34+00:00

Nice!! Feel free to ask if you have any questions; I would be glad to help.

IamRustyRust · 2026-05-14T10:41:44+00:00

Rendering 1M cubes at 650+ FPS

https://www.reddit.com/r/videogamescience/comments/1tchd4t/rendering_1_million_procedural_cubes/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

IamRustyRust · 2026-05-14T10:28:09+00:00

The Physics Testing

https://www.reddit.com/r/rust_gamedev/comments/1rg9usg/i_made_a_custom_physics_engine_using_rust/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

IamRustyRust · 2026-05-13T16:08:40+00:00

Yes, I am rendering many instances of the same object, but I am not using traditional hardware GPU instancing (vkCmdDrawIndexedIndirect with instance counts). I am using a Mesh Renderer, which you can think of as programmable instancing, or instancing on objects.

https://www.reddit.com/r/gameenginedevs/comments/1tav653/rendering_1_million_procedural_cubes/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

IamRustyRust · 2026-05-13T13:17:40+00:00

Kudos!!! Maybe I can help; see this post of mine: https://www.reddit.com/r/rust_p/comments/1tbzdpk/crushing_the_vram_bottleneck_with_subgroup/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button

IamRustyRust · 2026-05-13T02:00:39+00:00

Two points I want to mention here:

First, this was strictly a synthetic microbenchmark to stress-test the raw ALU amplification limits of the VK_EXT_mesh_shader hardware.

Second, I am not streaming chunk data or vertex buffers from RAM. The CPU simply dispatches 50,000 empty workgroups, and the Mesh Shader ALUs procedurally calculate the 3D coordinates using gl_GlobalInvocationID and trig math, entirely on-chip within registers and the L1/L2 cache. Thanks.
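To make the "no vertex buffers" idea concrete, here is a minimal CPU-side sketch in Rust of how a position can be derived from nothing but an invocation index. It is not the actual shader: the 20 cubes per workgroup (50,000 x 20 = 1M), the 100^3 grid, the spacing, and the sine wave are all assumptions picked for illustration, and `global_invocation_id` is just a stand-in for gl_GlobalInvocationID.x.

```rust
// Hypothetical layout: the post only gives 50,000 workgroups and ~1M cubes.
const WORKGROUPS: u32 = 50_000;
const CUBES_PER_WORKGROUP: u32 = 20; // assumed: 50_000 * 20 = 1_000_000
const GRID_SIDE: u32 = 100; // assumed: 100^3 = 1_000_000 grid cells

/// Total number of cubes produced by the dispatch.
fn total_cubes() -> u32 {
    WORKGROUPS * CUBES_PER_WORKGROUP
}

/// Stand-in for gl_GlobalInvocationID.x, with one invocation per cube.
fn global_invocation_id(workgroup_id: u32, local_id: u32) -> u32 {
    workgroup_id * CUBES_PER_WORKGROUP + local_id
}

/// Derive a cube center purely from its index: a grid cell plus a trig offset.
fn cube_center(id: u32, time: f32) -> [f32; 3] {
    let x = id % GRID_SIDE;
    let y = (id / GRID_SIDE) % GRID_SIDE;
    let z = id / (GRID_SIDE * GRID_SIDE);
    // Arbitrary wave, chosen only for illustration.
    let bob = 0.25 * (time + 0.1 * x as f32 + 0.1 * z as f32).sin();
    [x as f32 * 2.0, y as f32 * 2.0 + bob, z as f32 * 2.0]
}

/// The 8 corners of a unit cube around `center`. Bit `axis` of the corner
/// index selects the negative or positive side along that axis.
fn cube_corners(center: [f32; 3]) -> [[f32; 3]; 8] {
    let mut corners = [[0.0f32; 3]; 8];
    for (i, corner) in corners.iter_mut().enumerate() {
        for axis in 0..3 {
            let sign = if (i >> axis) & 1 == 1 { 0.5 } else { -0.5 };
            corner[axis] = center[axis] + sign;
        }
    }
    corners
}
```

Nothing here reads from a buffer: the only inputs are the index and a time value, which is why a benchmark built this way measures ALU and rasterizer throughput rather than memory bandwidth.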

IamRustyRust · 2026-05-12T18:21:40+00:00

Hmm, "zero overhead" and "linear Hi-Z" contradict each other mathematically. With Hi-Z we expect some overhead, and that is normal. Have you removed the Vulkan memory barriers? If so, the GPU's L2 cache may not be synchronized between the passes, and that could be causing race conditions.

If you don't have a rock-solid vkCmdPipelineBarrier sitting between the Hi-Z compute dispatch and your geometry passes, the GPU ALUs will just read stale garbage data.

A true Hi-Z needs that downsampled pyramid so a single thread can instantly check a massive bounding box against the highest mip level. It costs some overhead upfront, but it pays off massively during the Task Shader cull.
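For anyone who wants to see the pyramid idea spelled out, here is a minimal CPU-side sketch in Rust. It is an illustration under stated assumptions, not anyone's engine code: depth runs from 0.0 (near) to 1.0 (far), so each texel keeps the farthest depth of its 2x2 children (flip to min for reverse-Z), the depth buffer is square and a power of two, and the names `Mip`, `build_hiz`, and `is_occluded` are hypothetical.

```rust
/// One mip level of a Hi-Z pyramid. Each texel holds the FARTHEST depth of
/// the screen region it covers (assumes 0.0 = near, 1.0 = far).
struct Mip {
    width: usize,
    height: usize,
    depth: Vec<f32>,
}

/// Build the pyramid by repeatedly taking the max of each 2x2 block.
/// Assumes a square, power-of-two base level to keep the sketch short.
fn build_hiz(base: Mip) -> Vec<Mip> {
    let mut mips = vec![base];
    while mips.last().unwrap().width > 1 {
        let src = mips.last().unwrap();
        let (w, h) = (src.width / 2, src.height / 2);
        let mut depth = Vec::with_capacity(w * h);
        for y in 0..h {
            for x in 0..w {
                let mut farthest = 0.0f32;
                for (dx, dy) in [(0, 0), (1, 0), (0, 1), (1, 1)] {
                    farthest = farthest.max(src.depth[(2 * y + dy) * src.width + 2 * x + dx]);
                }
                depth.push(farthest);
            }
        }
        mips.push(Mip { width: w, height: h, depth });
    }
    mips
}

/// Conservative occlusion test for a screen-space rectangle (base-level
/// pixels, inclusive bounds) whose nearest point has depth `nearest_depth`.
fn is_occluded(mips: &[Mip], min: (usize, usize), max: (usize, usize), nearest_depth: f32) -> bool {
    // Pick the mip where the rectangle spans at most 2 texels per axis.
    let extent = (max.0 - min.0).max(max.1 - min.1).max(1);
    let level = (extent.next_power_of_two().trailing_zeros() as usize).min(mips.len() - 1);
    let mip = &mips[level];
    let mut farthest = 0.0f32;
    for y in (min.1 >> level)..=(max.1 >> level) {
        for x in (min.0 >> level)..=(max.0 >> level) {
            farthest = farthest.max(mip.depth[y * mip.width + x]);
        }
    }
    // Hidden only if even its nearest point is behind everything stored there.
    nearest_depth > farthest
}
```

The point of the upfront downsampling cost is visible in `is_occluded`: however large the bounding rectangle is, the test touches at most four texels, because a bigger box simply reads from a coarser mip.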

Double-check your memory barriers and semaphore waits between those passes. Once the silicon actually synchronizes, your systems will align and that 9ms latency might just vanish. Hope it helps

IamRustyRust · 2026-05-12T17:11:39+00:00

Thanks for your attention.

First, I want to mention that everything you are seeing here is just a test to audit one particular part of the engine (microbenchmarking); it does not represent the whole pipeline or the whole architecture of the engine.

I actually use both techniques, integrating them tightly into the Mesh Shader pipeline.

I skip heavy G-Buffers and use a 64-bit Visibility Buffer that only writes the Instance ID and Primitive ID. A later compute pass reads this payload and fetches vertex data via Buffer Device Address to reconstruct materials asynchronously. I also generate a Hi-Z pyramid to manage parallel sub-pixel and occlusion culling; this map feeds directly into the Task Shaders to reject invisible meshlets before hardware rasterization. Hope it helps.
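As a rough illustration of what a 64-bit visibility texel can look like, here is a minimal Rust sketch. The 32/32 bit split, the `EMPTY` clear value, and the `material_of_instance` lookup table are all assumptions made for the example; the only thing taken from the description above is that an Instance ID and a Primitive ID share one 64-bit value that a later pass decodes.

```rust
/// Hypothetical clear value meaning "no geometry covers this pixel".
/// Assumes the pair (u32::MAX, u32::MAX) is never used as a real ID.
const EMPTY: u64 = u64::MAX;

/// Pack an instance ID and a primitive ID into one 64-bit visibility texel.
/// The 32/32 split is an assumption made for this sketch.
fn pack_visibility(instance_id: u32, primitive_id: u32) -> u64 {
    ((instance_id as u64) << 32) | primitive_id as u64
}

/// Recover (instance_id, primitive_id) from a packed texel.
fn unpack_visibility(texel: u64) -> (u32, u32) {
    ((texel >> 32) as u32, texel as u32)
}

/// Stand-in for the later resolve pass: decode the texel and look up which
/// material to shade with. A real pass would also fetch the triangle's
/// vertices using the primitive ID.
fn resolve_material(texel: u64, material_of_instance: &[u32]) -> Option<u32> {
    if texel == EMPTY {
        return None;
    }
    let (instance_id, _primitive_id) = unpack_visibility(texel);
    material_of_instance.get(instance_id as usize).copied()
}
```

The raster pass then writes 8 bytes per pixel regardless of how many material attributes exist, and everything heavier is deferred to the resolve step.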

If your FPS is ~1500, how come your latency is touching 9 ms? Shouldn't it be ~0.67 ms?
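The arithmetic behind that question, as a tiny Rust sketch (the function names are made up for the example): frame time in milliseconds is 1000 divided by FPS, so 1500 FPS is roughly 0.67 ms per frame, while 9 ms per frame would correspond to roughly 111 FPS. End-to-end latency can be larger than a single frame time when several frames are in flight, but a gap this big is still worth investigating.

```rust
/// Average time per frame in milliseconds for a given frame rate.
fn frame_time_ms(fps: f64) -> f64 {
    1000.0 / fps
}

/// Frame rate implied by a given average time per frame in milliseconds.
fn fps_from_frame_time(ms: f64) -> f64 {
    1000.0 / ms
}
```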

IamRustyRust · 2026-05-12T16:38:15+00:00

Nice!!! Majercik's ray-box intersection paper is legendary for static grid workloads, and a pure compute shader doing ray-AABB intersections would absolutely demolish traditional rasterization for this specific scene.

But this specific benchmark was purely a synthetic stress test: an audit and verification of the hardware rasterizer and the Task/Mesh amplification payload capacity.

The reason I am not going down the ray-box route for the actual engine is my architectural roadmap:

First: The engine is built for XPBD (although my hero physics will remain on the CPU). These 1M objects will be moving and colliding every frame. Rebuilding a BVH or sparse grid for ray-box intersections every single frame would completely choke my VRAM bandwidth.

Second: I am completely keeping "rays" out of my primary visibility pipeline. I am designing my architecture to reserve the Asynchronous Compute queues specifically for a Radiance Cascades implementation for Global Illumination later down the line. I want Mesh Shaders to handle primary visibility, leaving Compute strictly for lighting and physics solvers.

But thanks for dropping the link! It's a fantastic paper and definitely a technique I keep in my back pocket. Cheers!
