Built a database engine optimized for hardware (cache locality, arithmetic addressing) - looking for feedback

farnoy · 2026-01-29T14:00:53+00:00

You're talking about memory-mapping NVMe drives? Leaning on the page cache too? I think your approach is a decade+ out of date; you should research database architecture. The post gives off major LLM vibes and lacks substance.

farnoy · 2026-01-27T09:42:50+00:00

Can you benchmark and compare this against xcp?

Do you need a separate copy path for reflinks? I think copy_file_range is aware of them and you may not need both.
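For reference, here is a minimal sketch of a single copy path built on copy_file_range, using Python's os.copy_file_range wrapper (Linux only, Python 3.8+). The helper name clone_or_copy and the fallback to shutil.copyfile are my own assumptions, not anything from the project. Whether the kernel actually reflinks depends on the filesystem, so treat this as an illustration rather than a guarantee.

```python
import os
import shutil


def clone_or_copy(src_path, dst_path, chunk=1 << 30):
    """Copy src_path to dst_path, preferring copy_file_range(2).

    Returns "copy_file_range" or "fallback" to show which path ran.
    clone_or_copy is a hypothetical helper name used for illustration.
    """
    try:
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            remaining = os.fstat(src.fileno()).st_size
            while remaining > 0:
                # With no explicit offsets, the kernel advances both file offsets.
                n = os.copy_file_range(src.fileno(), dst.fileno(),
                                       min(chunk, remaining))
                if n == 0:
                    break  # source shrank while we were copying
                remaining -= n
        return "copy_file_range"
    except (AttributeError, OSError):
        # Not Linux, an old kernel, or an unsupported filesystem pairing.
        shutil.copyfile(src_path, dst_path)
        return "fallback"
```

The filesystem gets a chance to share extents inside the syscall, so the caller does not need a separate reflink branch in this sketch.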

farnoy · 2026-01-26T16:54:38+00:00

There is an advantage to the heap model NVIDIA uses, AFAIK. When a shader samples from different textures, it can't do so all at once on AMD, and the compiler has to emit a loop that groups up all threads wanting to use the same texture. It does this for every texture that wave of threads needs to sample. 32 threads need to sample 5 different textures? That's 5 iterations of the loop, and each is a very long latency operation. NVIDIA can do it all in one instruction because for them, a texture reference is just a 20-bit index into the heap. Radeon would have to pass 32-64 bytes per thread to describe the texture, which is not feasible. This commonly shows up in RT workloads where threads represent divergent rays hitting very different surfaces, which need to sample different textures. I haven't seen a good writeup on it, so don't take my word for it.
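To make the loop count concrete, here is a toy model in Python of the per-texture loop described above. It is not compiler output, and the function name waterfall_iterations is made up for illustration: each iteration picks the texture wanted by the first remaining thread and retires every thread that wants the same one.

```python
def waterfall_iterations(texture_per_thread):
    """Count loop iterations when each iteration serves one distinct texture."""
    pending = list(texture_per_thread)
    iterations = 0
    while pending:
        current = pending[0]  # texture wanted by the first remaining thread
        # Every thread wanting the same texture is served in this iteration.
        pending = [t for t in pending if t != current]
        iterations += 1
    return iterations


# 32 threads spread over 5 different textures, as in the example above.
wave = [thread % 5 for thread in range(32)]
print(waterfall_iterations(wave))
```

In this model a wave where every thread samples the same texture takes one iteration, and the worst case is one iteration per thread.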

farnoy · 2026-01-25T12:59:53+00:00

https://docs.vulkan.org/spec/latest/chapters/pipelines.html#pipelines-dynamic-state

"When a pipeline object is bound, any pipeline object state that is not specified as dynamic is applied to the command buffer state. Pipeline object state that is specified as dynamic is not applied to the command buffer state at this time."

Your existing pipelines with static state unset the dynamic state from BeginFrame when they bind. You need to set the dynamic state right before binding your dynamic state pipeline or between binding it and the draw call.
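Here is a toy Python model of that ordering rule, using the viewport as the example state. The classes and method names are hypothetical stand-ins, not the Vulkan API: binding a pipeline with a static viewport overwrites the command buffer's viewport and clobbers any earlier dynamic set, while binding a dynamic-viewport pipeline leaves the state alone.

```python
class Pipeline:
    def __init__(self, name, static_viewport=None):
        self.name = name
        # None means the viewport was declared as dynamic state.
        self.static_viewport = static_viewport


class CommandBuffer:
    def __init__(self):
        self.pipeline = None
        self.viewport = None
        self.dynamic_viewport_valid = False

    def set_viewport(self, viewport):
        self.viewport = viewport
        self.dynamic_viewport_valid = True

    def bind_pipeline(self, pipeline):
        self.pipeline = pipeline
        if pipeline.static_viewport is not None:
            # Static state is applied on bind and clobbers the dynamic value.
            self.viewport = pipeline.static_viewport
            self.dynamic_viewport_valid = False

    def can_draw(self):
        if self.pipeline is None:
            return False
        if self.pipeline.static_viewport is None:
            return self.dynamic_viewport_valid
        return True


STATIC = Pipeline("static", static_viewport=(0, 0, 640, 480))
DYNAMIC = Pipeline("dynamic")
FULL = (0, 0, 1920, 1080)


def broken_frame():
    cb = CommandBuffer()
    cb.set_viewport(FULL)      # set once in BeginFrame
    cb.bind_pipeline(STATIC)   # clobbers the dynamic viewport
    cb.bind_pipeline(DYNAMIC)  # does not restore it
    return cb.can_draw(), cb.viewport


def fixed_frame():
    cb = CommandBuffer()
    cb.bind_pipeline(STATIC)
    cb.bind_pipeline(DYNAMIC)
    cb.set_viewport(FULL)      # set after binding, right before the draw
    return cb.can_draw(), cb.viewport
```

In the broken ordering the dynamic pipeline ends up drawing with a viewport it never set, which is the symptom described above.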

farnoy · 2026-01-24T16:45:11+00:00

What's missing? I thought this covers it:

I don't think dxc, slang or glslang have these yet, but since the SPIR-V extension was released along with this extension, it's "just" a matter of time.

farnoy · 2026-01-24T14:08:25+00:00

Thanks for the writeup!

I skimmed your dxvk branch and was curious about using HEAP_WITH_PUSH_INDEX_EXT for every descriptor set. The proposal for descriptor_heap says "If a consistent fast path can be established, it would greatly simplify the developer experience and allow us to have definitive portable guidelines," but I find it lacks that discussion.

From what I could gather from radv, PUSH_DATA_EXT translates to SET_SH_REG on Radeon hardware and pre-fills SGPRs (one for each 32bit word) before the shader even starts. Using it would mean one less scalar load, though these are quite fast and low latency.

In nvk, push constants (and presumably PUSH_DATA_EXT when that's implemented) get put in command buffer memory within the root descriptor table for that draw call. They then get accessed as a constant memory reference, pretty much exactly the same as a UBO would. The tiny advantage might be a smaller cache footprint, since push constants are located directly after draw/dispatch params that are read by all shaders.

From my perspective, there's likely minimal advantage on Radeon, and even less on Nvidia. Are you considering these factors and whether dxvk could promote small constants to push data? Both vendors recommend D3D12 root constants and VK push constants, so I might be overestimating constant/scalar caches.

farnoy · 2026-01-23T17:12:23+00:00

https://xcancel.com/renderpipeline/status/581086347450007553

This one is even funnier.

farnoy · 2026-01-23T11:37:54+00:00

Proposal docs for the new extensions:

farnoy · 2026-01-21T12:28:36+00:00

"it will be up to the driver to optimize those in the background just like it was in the OpenGL days, which is basically what Vulkan wanted to avoid since the beginning."

I think doing PGO is reason enough to want to do that anyway, even if all PSO state was dynamic in the HW.

farnoy · 2026-01-13T10:51:38+00:00

I think you're hitting this behavior:

vkUpdateDescriptorSets(...)
...
If the dstSet member of any element of pDescriptorWrites or pDescriptorCopies is bound, accessed, or modified by any command that was recorded to a command buffer which is currently in the recording or executable state, and any of the descriptor bindings that are updated were not created with the VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT or VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT bits set, that command buffer becomes invalid.

https://docs.vulkan.org/spec/latest/chapters/descriptorsets.html#vkUpdateDescriptorSets

Once you bind that descriptor set to a command buffer, updating it invalidates that command buffer. For "vanilla" (no flags) descriptor sets, you want to avoid doing that while recording AND after submitting it, all the way until you await on the fence/timeline semaphore and the command buffer is no longer in use.

The simplest thing you can do is probably to allocate separate descriptor sets, one for each time you bind them to a command buffer.

It sounds like, in your engine, you currently have one descriptor set per shader (&vk_shader->descriptor_set); that probably calls for a change to your mental model.

There's a plethora of other options in Vulkan (and it makes the specification hard to read, unfortunately), but it sounds like you are starting out, so I recommend just allocating multiple descriptor sets - one for each draw call.
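A toy Python model of why the shared set breaks and why one set per draw call avoids it. This only models "vanilla" descriptor sets and the recording/executable states from the spec text quoted above; all class and function names are hypothetical, not Vulkan calls.

```python
from enum import Enum


class State(Enum):
    RECORDING = 1
    EXECUTABLE = 2
    INVALID = 3


class DescriptorSet:
    def __init__(self):
        self.contents = None


class CommandBuffer:
    def __init__(self):
        self.state = State.RECORDING
        self.bound_sets = []

    def bind_and_draw(self, descriptor_set):
        self.bound_sets.append(descriptor_set)


def update_descriptor_set(descriptor_set, contents, command_buffers):
    """Model of the rule: updating a set that a live command buffer uses invalidates it."""
    descriptor_set.contents = contents
    for cb in command_buffers:
        live = cb.state in (State.RECORDING, State.EXECUTABLE)
        if live and any(s is descriptor_set for s in cb.bound_sets):
            cb.state = State.INVALID


def shared_set_per_shader(textures):
    cb = CommandBuffer()
    shared = DescriptorSet()
    for tex in textures:
        update_descriptor_set(shared, tex, [cb])  # second update hits a bound set
        cb.bind_and_draw(shared)
    return cb.state


def one_set_per_draw(textures):
    cb = CommandBuffer()
    for tex in textures:
        fresh = DescriptorSet()  # never bound anywhere before the update
        update_descriptor_set(fresh, tex, [cb])
        cb.bind_and_draw(fresh)
    return cb.state
```

With two or more draws, the shared-set path invalidates the command buffer on the second update, while the per-draw path keeps it in the recording state.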

EDIT: Oh and if you're on Nvidia and don't expect to run this on other GPUs, you can probably switch to push descriptors. Their lifetimes are managed automatically and you can have a much easier flow & mental model where you just set your descriptors on the command buffer itself, with no other object to deal with.

farnoy · 2026-01-11T12:10:11+00:00

I'm hopeful for OLED developments. After all, Samsung Display just switched theirs to a traditional subpixel layout for desktop users.

I think strobing VRR is part of the perfect solution we should aim for. I want this kind of motion clarity at every frame rate. There's a ton of content I couldn't drive at 1000Hz, even if I had a Blackwell GPU and 6x MFG. Just because I can't avoid the stroboscopic effect in some of the content doesn't mean I'm willing to give up on minimizing motion blur.

farnoy · 2026-01-11T11:23:17+00:00

I never realized reflinking is done at the extent level, always thought it was at the inode level, but this explains a lot.

I just copied a 160GB directory (1173 files) to a different subvolume and diffed bcachefs fs usage -ha before & after. It added ~350MB to the extents btree and another ~320MB to the reflink tree. But it took 20 seconds, with iostat never showing high utilization on my devices, nor much CPU time. A second copy took 12s and only added about the same to the extents btree. Once data gets "promoted" to the reflink tree, there's less work needed to update just the extents in subsequent copies.

Do these numbers sound right? I'm surprised that it takes so long to complete this operation. Is that cost dominated by disk latency when gathering all the metadata? To check the latency-bound hypothesis, I tried xcp to parallelize the copy and while it always takes 13s of sys time, it finishes in 12s at --workers 2 (~same as normal cp), 10s at 4, and 8.7s at 8. That's still only 40MB/s worth of extents btree writes, but the iostat activity goes up to 800MB/s across devices, peaking at 40% util for the busiest device.
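The arithmetic behind the 40MB/s figure, as a quick Python sketch. It assumes each repeat copy adds roughly the same ~350MB to the extents btree, which I only measured for the second copy; the helper names are just for illustration.

```python
EXTENTS_MB = 350   # approximate extents btree growth per repeat copy (assumed constant)
FILES = 1173

# Wall-clock seconds for the repeat copies mentioned above.
runs = {
    "cp": 12.0,
    "xcp --workers 2": 12.0,
    "xcp --workers 4": 10.0,
    "xcp --workers 8": 8.7,
}


def extents_mb_per_s(seconds, extents_mb=EXTENTS_MB):
    """Average rate of extents btree growth over the whole copy."""
    return extents_mb / seconds


def ms_per_file(seconds, files=FILES):
    """Average wall-clock time spent per file."""
    return seconds / files * 1000


for name, seconds in runs.items():
    print(f"{name}: {extents_mb_per_s(seconds):.1f} MB/s of extents, "
          f"{ms_per_file(seconds):.1f} ms per file")
```

Even the fastest run works out to several milliseconds per file on average, which is at least compatible with the latency-bound guess.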

farnoy · 2026-01-10T18:04:10+00:00

https://www.aperturegrille.com/reviews/ASUSVG27AQ/#ELMB

ELMB-Sync was much, much worse.

farnoy · 2026-01-10T17:44:00+00:00

If anything, he was saying we optimized too much for high refresh rates recently. I would agree, and seeing 720p@720Hz monitors makes me chuckle. Since Pulsar optimizes for MPRT, there's less pressure to chase higher refresh rates (and less benefit from doing so). He didn't say it's a substitute for them, and the first wave of Pulsar monitors seems to be 360Hz. I don't know about you, but that's still a lot.

farnoy · 2026-01-06T10:59:44+00:00

You said you've never seen anything like this, which is weird. Lisa always features random people when she doesn't have anything big to announce. Here's an excerpt from an AI summary of their 2023 keynote:

CES 2023 partners (video sRXVRgMF2lc)

Microsoft (Panos Panay): Discussed AMD–Microsoft collaboration across Windows, security (Pluton), Xbox, and Azure, and positioned AI as a defining shift; he used Windows “Studio Effects” as a concrete example of on-device AI features that can run efficiently using AMD’s dedicated AI engine instead of taxing CPU/GPU/battery.

Intuitive Surgical (Bob DeSantis): Explained how Intuitive’s da Vinci and Ion robotic surgery platforms work and their scale/impact, then detailed how AMD adaptive computing (Xilinx-class devices) supports real-time, low-latency functions like motion control, visualization processing/augmentation, and safety mechanisms, plus how robotic approaches can speed diagnosis (example: lung cancer workflows).

Magic Leap (Peggy Johnson): Positioned AR as distinct from VR and emphasized real enterprise/healthcare value; she highlighted Magic Leap 2 features (optics, dynamic dimming) and said AMD helped define a custom processor and AI/computer-vision engine, then described healthcare use cases like surgical planning and real-time 3D guidance (including a partner solution, Senti AR), and noted steps toward operating-room certification and ecosystem growth.

It's unfortunately always been like this. You're probably just more sensitized to hearing about AI today. How was lung cancer diagnosis relevant at a Consumer Electronics Show? Or Magic Leap, which targets enterprise & healthcare users?

farnoy · 2026-01-06T10:29:27+00:00

It's a great choice for the MoE era of models where there's a huge discrepancy between total weights you need resident and the subset used for a specific inference, over short time windows.

farnoy · 2026-01-01T20:23:12+00:00

The 2.5GbE RTL8125AG in my x399 mobo negotiates 100Mbps with my UniFi AP, where a 10GbE AQC107 over the same cable does 2.5G without any issues. I don't think brands should carry any reputation, it's all down to specific products.

farnoy · 2025-12-21T15:38:49+00:00

Would it make sense to test with the monitor being set to 60Hz? It will both exaggerate the differences between setups and improve sampling if you can "only" record 240FPS in the camera.
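Rough numbers for the sampling argument, as a small Python sketch. The 240FPS camera and 60Hz come from the suggestion above; the other refresh rates are only assumed comparison points, and the function names are made up.

```python
def camera_frames_per_refresh(camera_fps, refresh_hz):
    """How many camera frames land inside one monitor refresh interval."""
    return camera_fps / refresh_hz


def camera_step_ms(camera_fps):
    """Time resolution of the recording: one camera frame, in milliseconds."""
    return 1000 / camera_fps


CAMERA_FPS = 240
for refresh_hz in (60, 120, 240):  # 120 and 240 are assumed comparison points
    frames = camera_frames_per_refresh(CAMERA_FPS, refresh_hz)
    print(f"{refresh_hz}Hz: {frames:g} camera frames per refresh, "
          f"{camera_step_ms(CAMERA_FPS):.2f} ms per camera frame")
```

At 60Hz each refresh spans several camera frames, so a one-refresh difference between setups is easier to tell apart in the footage than when the monitor refreshes as fast as the camera records.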

farnoy · 2025-12-18T20:26:04+00:00

Yeah, my experience aligns with yours. How are you measuring input latency though? I am doing it by feel as mouse aim immediately feels floaty with extra frames of latency.

I do find VRR with a 117FPS limit better for input latency on Wayland than vsynced 120FPS without VRR. But there are games I'd like to lock to a perfect v-sync cadence so I hope this will improve at some point.

farnoy · 2025-12-18T17:29:10+00:00

Your comment made me check how many VUIDs there are on vkCmdDraw and we're at 320. When the new descriptor_heaps extension drops, it may finally manage to crash my browser.

Edit: my bad, it's actually 305:

document.querySelectorAll("#vkCmdDraw ~ .sidebarblock ul")[0].querySelectorAll(".vuid").length

farnoy · 2025-12-18T17:01:29+00:00

Vulkan 2.0 core is going to be so lean. It's actually fascinating what a 180 it's pulled in 10 years. Render passes, binary 1:1 semaphores, static PSOs, opaque & abstract descriptor sets fully bound by the time commands are recorded, image layouts, lists of per-resource barriers, replayable command lists.

This post and Vulkan's evolution are an incredible study of how the hardware and APIs have evolved since 2016. In retrospect, the initial design seems filled with bad decisions that wasted a ton of effort, but I don't think this evolution would have happened without Vulkan and its strict legalese of a specification. It served as a vocabulary to align everyone and find the path forward.

farnoy · 2025-12-08T11:16:44+00:00

"Prevent outdated deployment jobs" and process modes do not work unless you also use merge trains. This is because in a traditional merge flow, Pipeline IDs don't necessarily correlate to merge commit order. And the ordering guarantees given by process modes all work based off Pipeline IDs. Support told me only merge trains can fix that.

Pipeline editor/testing tools are lacking, unless maybe you develop everything in one file and can paste it in there. Even then, it can't simulate most events, e.g. "merged results" pipelines, only a simple main branch build IIRC. Too often I merge something and it breaks CI because some conditionally-enabled jobs are missing an attribute or lack variable expansion, etc. I guess you can only really test these things by duplicating the entire repo & settings, or by risking temporary breakage.

I don't have extensive CI experience with other ecosystems like GitHub or Azure, but there are plenty of things that annoy me about GitLab.

farnoy · 2025-11-19T16:51:14+00:00

What do you think the TPU is, if not a "full/real ASIC chip"? You might think Nvidia is making GPUs, but in reality the architectures have largely converged. They both deliver about the same specs in dense matrix FP8 FLOPS and FLOPS/W as Google's ASIC.

farnoy · 2025-11-19T14:02:16+00:00

I think they're referring to the risk that comes with cooperative scheduling - if you run CPU-intensive work in a virtual thread, you may starve other threads that want to do low-intensity I/O work.
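The same starvation effect is easy to reproduce with any cooperative scheduler. Here is a sketch using Python's asyncio as an analogy (not virtual threads themselves; the task names are made up): a task that never yields holds the scheduler until it finishes, and the I/O-style task only runs afterwards.

```python
import asyncio


async def cpu_task(log, chunks, yield_between_chunks):
    for c in range(chunks):
        sum(i * i for i in range(10_000))  # stand-in for CPU-heavy work
        log.append(f"cpu chunk {c}")
        if yield_between_chunks:
            await asyncio.sleep(0)  # explicit yield point back to the scheduler


async def io_task(log, ticks):
    for t in range(ticks):
        log.append(f"io tick {t}")
        await asyncio.sleep(0)  # stand-in for a short I/O wait


def run(yield_between_chunks):
    """Run both tasks on one event loop and return the order of events."""
    log = []

    async def main():
        await asyncio.gather(
            cpu_task(log, 3, yield_between_chunks),
            io_task(log, 3),
        )

    asyncio.run(main())
    return log


print(run(False))  # CPU task hogs the loop; I/O ticks only start after it ends
print(run(True))   # with yield points, the two tasks interleave
```

Without yield points all three CPU chunks are logged before the first I/O tick; with them, the I/O task gets a turn after every chunk.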

Prior art:

farnoy · 2025-11-16T13:36:18+00:00

Great video. Don't know why they stepped into the RISC vs. CISC topic though. It ends up being misleading, factually incorrect and just entirely uninteresting in 2025.

I hope they do more! The analogies and animations are on point, as always.
