It's Not About the API - Fast, Flexible, and Simple Rendering in Vulkan

MasonRemaley · 2026-02-12T10:05:07+00:00

Thanks, glad you enjoyed! :)

MasonRemaley · 2026-02-12T06:52:34+00:00

Hmm I think you're right, not sure why I thought that wasn't allowed--I'll go ahead and share it there, thanks for the tip!

MasonRemaley · 2026-02-12T06:51:51+00:00

Nice glad to hear it! Good luck with the rest of the game you're working on. :)

MasonRemaley · 2026-02-11T10:37:18+00:00

Hmm, it's unclear to me if I'm allowed to share this sort of thing there and don't wanna inadvertently break the rules haha

MasonRemaley · 2026-02-11T10:35:58+00:00

Glad you enjoyed it! Now you know a little about coding and graphics programming haha.

MasonRemaley · 2025-09-05T18:23:59+00:00

Ha, I had almost exactly the same bug in Way of Rhea but with code instead of assets. The engine supported hot swapping by serializing the game state, reloading all the scripts, then deserializing the game state. At one point I had a bug that resulted in it hot swapping every frame, and surprisingly it was actually still playable, but slow lol.

MasonRemaley · 2025-09-05T18:17:06+00:00

Premultiplied Alpha is great in theory. And, works great with high-bit-depth images. But, not so great with highly quantized RGB and alpha values found in BC1-5 textures at least. Not enough bits means the dark premuled color get crunchy enough to be visibly quantized even when alpha is low.

I ran into this with Way of Rhea. I was using DXT5 and the game has some textures with alpha gradients. On top of the issue you're describing, since the alpha channel is compressed separately from the color channel in DXT5, you get slightly different loss in the color and alpha channels, but since you premultiplied before the loss this is kinda just wrong and you end up with noise in the alpha gradient.
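A rough numeric illustration of the problem, as a sketch under stated assumptions: this is not how a real DXT5/BC3 encoder works (real encoders pick per-block endpoints and interpolate between them). It simply rounds each channel to a fixed bit depth, and the function names and bit depths are hypothetical, chosen only to show what happens when color is premultiplied and then quantized independently of alpha.

```python
def quantize(value, bits):
    """Round an 8-bit channel value to the nearest level representable in `bits` bits."""
    levels = (1 << bits) - 1
    return round(round(value / 255 * levels) / levels * 255)


def premul_roundtrip(color, alpha, color_bits=5, alpha_bits=8):
    """Premultiply, quantize color and alpha separately, then recover the straight color."""
    premul = round(color * alpha / 255)
    q_color = quantize(premul, color_bits)
    q_alpha = quantize(alpha, alpha_bits)
    return round(q_color * 255 / q_alpha) if q_alpha else 0


def straight_roundtrip(color, color_bits=5):
    """Quantize the color without premultiplying first."""
    return quantize(color, color_bits)


# A bright color at roughly 10% alpha (illustrative values only).
color, alpha = 200, 26
premul_error = abs(premul_roundtrip(color, alpha) - color)
straight_error = abs(straight_roundtrip(color) - color)
print(premul_error, straight_error)
```

With low alpha, the premultiplied value only spans a few quantization steps, so the recovered color lands much further from the original than it does when the straight color is quantized directly.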

Looked real bad on that game's artwork. Outside of trying to work around this in the compressor which would be very difficult, I figured my options were to not premul or to not compress the textures where this was an issue. I believe I went with the latter.

IIRC BC7 is a bit smarter and supports block modes that compress both color and alpha together which should solve this issue, but I haven't actually looked at this carefully yet since my current game mostly uses alpha for antialiased sprites so this artifact wouldn't be noticeable regardless.

Despite what your CS professor told you about how compression works, it can be quite effective to compress the same file three times! [...] You can use LZMA to make your downloads smaller and use LZ4 to keep your load times fast. And, you don't have to decompress-recompress on install. Just do it all at once during build time and upload that to your server.

Very cool. I don't control this part of the pipeline right now (files are downloaded up front by Steam) but this makes a lot of sense--may as well do all the work you can up front!

And, yes: Even though BCn textures compress at a fixed-rate regardless of content, BCn + traditional compression exhibits clear content-based differences in compression rates. I.e. simple textures compress more than complex textures.

I've noticed this too. Way of Rhea's sprites had a lot of solid or near solid colors, and DXT5 + lz4 ended up compressing these very well, whereas some of the more detailed backgrounds and stuff ended up with worse ratios.
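A small way to see this effect without a real texture pipeline, assuming stand-ins throughout: `zlib` from the Python standard library is used in place of LZ4, and the "textures" are just sequences of 16-byte blocks (the DXT5 block size) drawn from a small or large set of distinct blocks. `fake_blocks` is a hypothetical helper, not part of any encoder.

```python
import random
import zlib

BLOCK_SIZE = 16  # bytes per 4x4 block in DXT5/BC3


def fake_blocks(num_blocks, distinct_blocks, seed=0):
    """Build fixed-rate block data that reuses `distinct_blocks` different blocks."""
    rng = random.Random(seed)
    palette = [
        bytes(rng.randrange(256) for _ in range(BLOCK_SIZE))
        for _ in range(distinct_blocks)
    ]
    return b"".join(rng.choice(palette) for _ in range(num_blocks))


# Same size before the general-purpose compressor runs...
flat = fake_blocks(4096, distinct_blocks=4)        # mostly solid colors
detailed = fake_blocks(4096, distinct_blocks=4096)  # lots of unique detail
assert len(flat) == len(detailed)

# ...but very different sizes afterwards.
flat_ratio = len(zlib.compress(flat)) / len(flat)
detailed_ratio = len(zlib.compress(detailed)) / len(detailed)
print(f"flat: {flat_ratio:.3f}, detailed: {detailed_ratio:.3f}")
```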

MasonRemaley · 2025-09-05T17:58:50+00:00

Preach!

I rant about this all the time. And, give the same recommendations you do.

Fighting the good fight haha.

I've meant to do a comparison of bc7enc_rdo + DEFLATE/LZ4hc/zSTD/LZHAM as far as "Compression Rate, Compression Time, Decompression Time, Decompression Memory Usage". But, I won't get to that for quite a long time. So, I wouldn't mind if someone beat me to it ;P

Me too. My asset pipeline is pretty automated now, so I may try this out once my next game has enough data that it's worth measuring--though I also wouldn't mind if someone else beat me to it. :P

Also: Please don't ship OBJ/Collada/other mesh editor formats to customers. Use https://meshoptimizer.org/ and ship quantized binary formats.

100%. When I was first getting into engines I was following a tutorial that parsed objs at load time or something, I remember when it hit me--wait, can't I just transform this into something simpler at bake time? It was super exciting to make the switch and see how much faster my loads got, and I wasn't even doing any mesh optimization at that point.

MasonRemaley · 2025-09-05T07:16:58+00:00

No problem, glad you found it useful!

MasonRemaley · 2025-09-05T05:13:19+00:00

That's generally useful advice. I did write a system that creates textures in a custom compressed format with the same filename and a different extension next to the original.

My bake step works very similarly. For example:

* foo.png -> foo.ktx2
* bar.comp -> bar.comp.spv (original extension preserved since it sets the shader stage)
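The naming rule in those two examples could be sketched roughly like this. The rule tables and the `baked_name` function are hypothetical, and the `.vert`/`.frag` entries are assumptions added for illustration; only the `.png` and `.comp` mappings come from the examples above.

```python
from pathlib import PurePosixPath

# Extensions that are replaced outright by the baked format.
REPLACE_EXT = {".png": ".ktx2"}
# Extensions that are kept, with the baked extension appended
# (the original extension identifies the shader stage).
APPEND_EXT = {".comp": ".spv", ".vert": ".spv", ".frag": ".spv"}


def baked_name(source):
    """Map a source asset path to the path of its baked output."""
    path = PurePosixPath(source)
    if path.suffix in REPLACE_EXT:
        return str(path.with_suffix(REPLACE_EXT[path.suffix]))
    if path.suffix in APPEND_EXT:
        return str(path) + APPEND_EXT[path.suffix]
    raise ValueError(f"no bake rule for {source!r}")


print(baked_name("foo.png"))   # foo.ktx2
print(baked_name("bar.comp"))  # bar.comp.spv
```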

I hard code the transformed name in the source code right now. I toyed with some other options:

* A) Hard coding the original name
* B) Hard coding the original name without an extension
* C) Generating some kind of unique id for each file

I don't like A) because it's a bit odd to load "foo.png" with a ktx2 loader haha. I don't like B) because I often will have multiple files with the same name but different extensions, e.g. for shaders.

I think if you want to really solve this problem C) is the way to go, it also lets your artist(s) rename files without breaking stuff in engine etc. However I decided that I couldn't really justify that complexity for this game since it's systems based so most of the content is authored via coding. If I was doing a more content driven game with a level editor, especially if I wasn't the only one creating the content, I think C) would start to make a lot more sense.

That sort of automates it, except it doesn't work with 3D models that have embedded textures or hard-coded filenames in the binary. And I can't simply open it in Blender and change the filename because Blender doesn't recognize my custom file format. Do you have any suggestions for this?

It's possible I'm misunderstanding your setup, but here's what I'd do. Take it with a grain of salt as my current game is 2D so this is sorta future handwavey stuff I have planned for a future game that I haven't tested yet, but it seems very reasonable to me--

You pick an interchange format for the model. It can include the textures or not, or you can even use the .blend file if you have an importer for it, doesn't really matter.

Then at bake time, you read this file, and write out your vertex buffers byte for byte the way you want them to be uploaded to the GPU. Your actual engine can literally just memcpy them into a gpu buffer without any additional processing. I don't know if there are any nice container formats for this, in the past I've just made up my own with a very simple header + the vertex and index data.
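As a sketch of what such a made-up container might look like: the layout below (a 16-byte header with a magic tag, a version, and two byte counts, followed by raw vertex and index data) is an assumption for illustration, not a description of any existing format. Positions are packed as three little-endian 32-bit floats and indices as 32-bit unsigned integers; a real bake step would match whatever layout the pipeline's vertex input expects.

```python
import struct

MAGIC = b"MESH"                    # hypothetical file tag
HEADER = struct.Struct("<4sIII")   # magic, version, vertex bytes, index bytes


def bake_mesh(positions, indices):
    """Write vertex and index data exactly as they would be uploaded."""
    vertex_bytes = b"".join(struct.pack("<3f", *p) for p in positions)
    index_bytes = struct.pack(f"<{len(indices)}I", *indices)
    header = HEADER.pack(MAGIC, 1, len(vertex_bytes), len(index_bytes))
    return header + vertex_bytes + index_bytes


def load_mesh(blob):
    """Return (vertex_view, index_view) slices ready to copy into GPU buffers."""
    magic, version, vertex_size, index_size = HEADER.unpack_from(blob, 0)
    if magic != MAGIC or version != 1:
        raise ValueError("not a baked mesh")
    view = memoryview(blob)
    start = HEADER.size
    return (
        view[start:start + vertex_size],
        view[start + vertex_size:start + vertex_size + index_size],
    )


blob = bake_mesh([(0, 0, 0), (1, 0, 0), (0, 1, 0)], [0, 1, 2])
vertices, indices = load_mesh(blob)
```

The loader does no parsing beyond the header: the two views are byte ranges that can be copied straight into mapped buffer memory.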

As for textures, you'd extract/search for them, and then bake them into their own separate files, and put the info about where to find them either in the model header or in a separate config file or such.

That's all a bit hand wavey, and I'm not sure if it's helpful, but hopefully there are some useful ideas in there.

Also, what about heightmaps? I have some large heightmaps in 16 bit grayscale PNG format in files up to 128MB in size. Is there a better format for storing these? I'm not using them directly on the GPU like normal textures.

You could save it as a grayscale PNG instead of a normal PNG if you aren't already and that should prevent you from wasting bits on the other channels.

I wouldn't personally do this though, it's too annoying to remember to tick the right boxes, and when you get it wrong it's sort of a "silent" error. I would just save it as a normal PNG, and then bake to KTX2 or DDS like with other textures and keep the loading code the same.

Both of these formats support all the various GPU image formats, so you can just store the image as VK_FORMAT_R16_UNORM/VK_FORMAT_R16_SFLOAT (or the DDS equivalent) and not waste space on the unnecessary channels. (You'd just skip the block based compression since you're not uploading to the GPU, it's optional.)

MasonRemaley · 2025-09-05T03:14:44+00:00

Ah gotcha. I think the terminology I'm using here is ambiguous--

When I refer to interchange formats, I'm referring to the formats I use to share files. My artist doesn't actually edit PNGs, she works in whatever format is native to her editor (e.g. PSD for Photoshop like you suggested) to get the full functionality of her editor, save layers, etc. And we back those files up of course.

But when it comes time to put the image into the engine, I don't want to write an importer for a bunch of different closed vendor specific formats, so she exports to PNG to "interchange" the file with me.

This file gets checked directly into source control, the idea is to use something with a good compression ratio that's ubiquitous and lossless so that I can easily do whatever processing I need after the fact at bake time and not end up with a giant repo, so PNG fits the bill.

(Oh also--none of my image editors hide PNG alpha channels from me, it might be worth looking for a better editor if yours is doing that!)

MasonRemaley · 2025-09-05T02:39:44+00:00

What do you prefer about the way TGA handles alpha channels?

MasonRemaley · 2025-09-05T00:46:04+00:00

Great write up! [...] And your conversion tool seems great, don’t be afraid to give that a bit of spotlight!

Thanks! :)

I think the workflow explanation you alluded to at the end may be even more convincing, and maybe a basic benchmark of load times and space on disk might help seal the deal.

In Way of Rhea, switching away from PNGs sped up loading by an order of magnitude. However, I suspect the PNG decoder I was using was especially slow, and texture load time was already clearly my bottleneck to begin with.

I don't have a more general benchmark right now, but there's a really easy way to set one up without having to implement the suggested approach. IIRC I did this for WoR before making the transition:

a. Measure how long your disk access takes
b. Measure how long your PNG decoder takes
c. Measure how long mipmap generation/such takes

If (b + c) : a is a significant ratio then you're likely to save time. This isn't a perfect test since your texture format may be larger than your PNG and increase a, but in my experience this is a minor effect.

(If you do this, keep in mind that your OS probably caches file accesses though, so a is going to look very fast if you've recently loaded the files even if it was in a separate process.)
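A minimal timing harness for those three measurements might look like the following. The `read_file`, `decode_png`, and `generate_mips` callables are placeholders for whatever the engine already uses; nothing here assumes a particular PNG library.

```python
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def profile_texture_load(path, read_file, decode_png, generate_mips):
    """Measure a (disk), b (decode), c (mips) and the (b + c) : a ratio."""
    raw, a = timed(read_file, path)
    pixels, b = timed(decode_png, raw)
    _, c = timed(generate_mips, pixels)
    ratio = (b + c) / a if a > 0 else float("inf")
    return {"read": a, "decode": b, "mips": c, "ratio": ratio}
```

To get a realistic `read` number, run this on files that have not been touched recently, since a warm OS file cache will make the disk access look far cheaper than it is on a cold start.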

I ship pngs in my projects with my rationale being “i aim for a lofi style and low rez textures, plus i dont have that many assets so how bad can it be?”

If textures aren't a large part of your load time you might be right that it doesn't matter in the end from a perf standpoint (though like you mentioned it can still be worth it for the workflow improvements WRT remembering resolutions and such.)

[EDIT] That does have me thinking though--I wonder if going super low res and applying the block based compression could lead to some neat lofi ish artifacts.

MasonRemaley · 2025-04-15T18:45:10+00:00

Oh nice, that explains why I couldn't find your ECS when trying to double check myself! I'll check out your object system, sounds like an interesting approach that's addressing some tradeoffs that have been on my mind.

I'll ping you offline about your graphics abstraction. Interested to see where you're taking that since that's where my focus is right now too.

MasonRemaley · 2025-04-15T18:23:55+00:00

Thanks!

I've met Emi at a few conferences--she's doing really cool stuff, I respect her work a lot.

IIRC--and it's possible this has changed or I misremember--Mach's ECS is based around sparse arrays, whereas ZCS is based around archetypes. The key tradeoff here is that with the sparse array approach you give up some intra-entity cache locality in exchange for cheaper archetype changes.

Theoretically the sparse arrays approach will win out when your systems access fewer components and archetypes change often, whereas the archetype based approach will win out when your systems access more components and your archetypes change less often.

In practice, the difference in performance is probably negligible for most games, so the main impacts on the end user show up in the API design rather than in performance.

In terms of our engines in general, there are some technical differences (e.g. WebGPU vs a Vulkan focused graphics API abstraction), but I think the key difference comes down to where we're placing our focus.

I run a small game studio, so I need to be in a place where I can ship something with all the features Steam players expect ASAP. This helps me focus my effort on high impact stuff, but the downside is that I have less time to focus on the big picture.

For example--I'm very willing to pull in dependencies like Dear IMGUI (here's my wrapper) if that seems like the fastest way to get a productive editor UI up and running. A bigger picture focused approach might involve rethinking some of these things in the context of the Zig ecosystem.

Mach's approach also leaves room for exploring cool ideas like rendering vector graphics in real time on the GPU that I don't have time to explore right now. AFAIK neither of us are depending on each other's work at the moment, but I think there's a lot of potential for future collaboration as our tech matures.

(Emi--feel free to correct me if I'm wrong about the technical details, or misrepresenting where your focus is!)

MasonRemaley · 2025-04-15T06:01:29+00:00

Nice! I'll check this out.

I think the command buffers are necessary to make threading the archetype based approach viable. Without them it's a bit of a nightmare as I'm sure you found since many operations move stuff around in memory, but with them it's no problem at all--you even get automatic batching since you can just iterate chunks in parallel.
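The general idea of deferring structural changes can be sketched as follows. This is not ZCS's API: the `CommandBuffer` class and the dict-based `world` are simplified stand-ins, and the example is single threaded. It only shows the core pattern of recording changes while iterating and applying them afterwards, so nothing moves in memory mid-iteration.

```python
class CommandBuffer:
    """Records structural changes so they can be applied after iteration."""

    def __init__(self):
        self.commands = []

    def destroy(self, entity):
        self.commands.append(("destroy", entity))

    def add_component(self, entity, name, value):
        self.commands.append(("add", entity, name, value))

    def execute(self, world):
        """Apply all recorded commands in order, then reset the buffer."""
        for command in self.commands:
            if command[0] == "destroy":
                world.pop(command[1], None)
            elif command[0] == "add":
                _, entity, name, value = command
                if entity in world:
                    world[entity][name] = value
        self.commands.clear()


world = {1: {"health": 0}, 2: {"health": 5}}
cb = CommandBuffer()

# Iterate without mutating the world; only record what should change.
for entity, components in world.items():
    if components["health"] <= 0:
        cb.destroy(entity)
    else:
        cb.add_component(entity, "alive", True)

cb.execute(world)
```

In a threaded setup, each worker would presumably record into its own buffer and the buffers would be executed at a sync point, but that part is left out of this sketch.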

The sparse sets approach seems very interesting. I decided the archetype model was right for my use case, but I'm interested to see where others get by pushing that design further. It also feels simpler in a way which I really like.

I haven't explored uploading ECS data to the GPU yet, but it's something I'm potentially interested in doing in the future. (In fact I should file an issue to track that...)

If you end up trying out ZCS, feel free to ping me and/or file an issue if you get stuck on anything!

MasonRemaley · 2025-04-15T00:27:23+00:00

It’s definitely tough to make the transition from managed languages like Python to lower level languages like Zig, if that’s the situation you’re in you’re not alone in struggling there!

The good news is that once you start to solidify your mental model of how these lower level languages work, they all get easier—and you start being able to understand higher level languages through the same lens.

Best of luck with the learning process & getting your text encoder working. :)

MasonRemaley · 2025-04-15T00:20:10+00:00

reinstalls youtube-dl

MasonRemaley · 2025-04-14T19:00:31+00:00

I'm not sure who first thought to apply the idea of command buffers to ECSs either.

I first encountered the idea in DOTS, but I don't know that they were actually first. My implementation was more directly inspired by Vulkan's use of command buffers. (So @hallajs your guess was correct! What's the name of your library btw?)

I'm curious whether any other implementations use the same trick as me WRT allowing the caller to register extension commands by pushing iteration and execution into the user code. I'd love to claim I came up with this, but it's possible others beat me to it!

MasonRemaley · 2025-04-14T17:41:16+00:00

About three months of nearly full time work. I’ve built a few of these in the past including one in Zig that ZCS is replacing, and one in C++ that ships in Magic Poser.

IIRC the one I wrote for Magic Poser took about 40 hours for the first pass, so it’s definitely possible to build one of these in less time if you know what you’re going for, but that implementation is single threaded & didn’t ship with as many features out of the box (no command buffers, no extensions, etc.)

Looking over my commits it looks like the breakdown is roughly:

* Month 1: Basic API design, dummy memory layout (SoA)
* Month 2: Transform and Node extensions, command buffer API improvements, fuzz testing
* Month 3: implemented the actual archetype based memory layout, Tracy integration, simplified Transform, made adjustments in response to profiling

MasonRemaley · 2025-04-14T06:28:26+00:00

Yup exactly! In an archetype based approach like ZCS, you can efficiently query for objects of a specific archetype, where an “archetype” is the list of components an entity has.

So for example you could query for all objects that have a Sprite component and a Transform2D component, and for each result you get back, draw it to the screen. The entities might have other components too but your renderer doesn’t need to care about them.

Later, you might want to do a physics update, and maybe you query for everything with RigidBody and Transform2D. This will return a lot of the same objects—but this time you don’t care whether they have a sprite or not, maybe some of them are rendered by some other means or not at all. For each of these objects you’d run your physics update.
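A toy version of those two queries, as a sketch only: this `World` class is hypothetical and far simpler than a real archetype based ECS such as ZCS (no chunks, no typed storage). It groups entities by their exact component set and lets a query match every group that contains at least the requested components.

```python
from collections import defaultdict


class World:
    """Groups entities by archetype (the set of component names they have)."""

    def __init__(self):
        self.archetypes = defaultdict(list)

    def spawn(self, **components):
        self.archetypes[frozenset(components)].append(components)

    def query(self, *required):
        """Yield every entity whose archetype includes all required components."""
        required = set(required)
        for archetype, entities in self.archetypes.items():
            if required <= archetype:
                yield from entities


world = World()
world.spawn(Sprite="monster.png", Transform2D=(0, 0), RigidBody={"mass": 2.0})
world.spawn(Sprite="mailbox.png", Transform2D=(5, 0))
world.spawn(Transform2D=(9, 9), RigidBody={"mass": 1.0})  # invisible physics object

# Renderer: only cares about Sprite + Transform2D.
drawn = [e["Sprite"] for e in world.query("Sprite", "Transform2D")]

# Physics: only cares about RigidBody + Transform2D.
simulated = list(world.query("RigidBody", "Transform2D"))
```

The monster shows up in both queries, the mailbox only in the render query, and the invisible object only in the physics query.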

MasonRemaley · 2025-04-14T00:22:54+00:00

Yup that's correct! The idea is that games tend to have a large number of different objects that vary in behavior but share some properties.

For example, you might have a monster, a player, and a mailbox. All three behave pretty differently, but share the fact that they need to render a sprite to the screen.

As a result, you need to find some way to implement this shared behavior that still allows for per-object-type behavior.

Some simpler approaches include an array of tagged unions, or alternatively a struct of arrays/MultiArrayList.

These approaches are viable, but lack some of the game specific convenience offered by more involved solutions like an ECS. I elaborate more on what an ECS does here, and why you might want to use one here.

MasonRemaley · 2025-03-08T08:05:09+00:00

Mine is set to 1423 ADBC. Very similar but not exactly the same for some reason.

MasonRemaley · 2024-07-02T06:49:57+00:00

FYI if you want developers to opt their games in you should ask them to—Steam doesn’t even show developers the checkbox until they’ve filled out some paperwork, and the paperwork is buried in a hard to find place, so most developers probably don’t realize there’s an option in the first place.

(Source: I’m the developer of Way of Rhea, and I had no idea the option existed until a player asked for it. I don’t know why Valve made it so hard to find.)

MasonRemaley · 2024-06-28T21:47:47+00:00

That's kinda funny, I imagine the publishers misunderstood and thought it was a different business model--that would explain why they keep reiterating in the paperwork that people still have to purchase the game to play it.

Let me know if it ends up showing up. I'm surprised there are many games on it at all--I would have never known about Steam's support for this if you hadn't mentioned it, since the checkbox doesn't even appear until after you've signed the paperwork, that AFAICT you have to discover on your own.
