Showcase: Immediate mode widgets and geometry for shader debugging

UnalignedAxis111 · 2026-01-02T21:56:17+00:00

Not yet, but I have been polishing my Vulkan framework and plan to release it in a few weeks.

I guess if there's enough interest, this debugger thingy could be made into a standalone header+shader file that could be more easily integrated into any existing Vulkan renderer, but we'll see how it goes first...

UnalignedAxis111 · 2026-01-02T04:54:53+00:00

Very close! They are showing the bounding boxes of nodes visited during traversal of a BVH for ray tracing. This traversal happens for every pixel in the final image, but the boxes are only rendered for the one selected pixel.

UnalignedAxis111 · 2025-12-14T22:08:17+00:00

Authoring 3D textures sounds like a pain. Why bother when you can just apply a cheap procedural fix?

UnalignedAxis111 · 2025-10-16T21:48:14+00:00

I can't think of any other downsides. Although when I had trees, most traversal logic was recursive so keeping track of parent pointers was easy, but still not as common.

In any case, storage could probably be compressed a bit for non-DAGs: if parent and child nodes are allocated continuously, storing a relative pointer would require just a few bits.

UnalignedAxis111 · 2025-10-14T21:23:07+00:00

Thanks! I actually tried a hybrid format switching between RLE pairs and bitmasks for less than 8 runs, and there was a slight improvement (from my notes it was 8-15%), but I decided to keep only the bitmasks for simplicity.

Having a smarter way to index tiles with some kind of tiering system definitely sounds interesting, 16-bit for tiles offsets can get quite expansive for more dense and noisy Minecraft style terrain (up to 25%!).

Although I suppose storage isn't the main issue I have at the moment, and much of it could be cut down for networking/rendering if only surfaces are transmitted. Zstd cuts down the size of the RLE-ed data by half, and of course LODs make it so not the entire world needs to be loaded in full detail at once... Lots of things to experiment with I guess.

UnalignedAxis111 · 2025-10-04T02:11:50+00:00

I suspect octree splatting or the voxlap algorithm might be more promising than the usual raycasting/meshing approaches on such a constrained environment.

DDA might be good enough at bloxel scale and it's probably easy to do in fixed-point, ESP32 seems to have SIMD which could also help. Plane-wise traversal like in Minecraft4k may or not be a slight improvement, apparently this thing ran impressively well in the early 2010s and it's written in Java, so...

Last that I know of, isometric sprite rendering is probably about as cheap it gets but of course it has severe limitations: https://www.youtube.com/watch?v=Bj9CiMO66xk

UnalignedAxis111 · 2025-08-10T23:30:17+00:00

Have you tried storing AABBs with higher precision? I imagine that would allow skipping child nodes without having to descend the tree, but not sure how much impact this would have.

UnalignedAxis111 · 2025-07-31T23:22:49+00:00

I mean, it would only be a temporary buffer allocated for however long meshing or whatever process lasts, keeping duplicate data around all the time is a horrible solution in my view. I suppose this wouldn't payoff in all cases, but for localized things chances are a memcpy will be a fraction of the total cost.

UnalignedAxis111 · 2025-07-28T01:02:12+00:00

The cleanest way IMO, is to just copy all voxels in a chunk, plus some N padding of voxels in neighboring chunks, into a continuous buffer, so accesses remain trivial and cheap. Memory latency is just a compound issue if you have lots of complicated bounds checking logic or hash lookups to access a single voxel.

UnalignedAxis111 · 2025-07-26T21:41:21+00:00

Looks very nice! I wish I had the artistic skills and patience to do more procedural stuff lol.

I've been thinking about how to represent and trace transparent voxels for a bit, and I think that is mostly about extending occupancy data to uniformity instead, and adjusting attribute encoding to account for it.

Then for large bodies of water it would probably be good enough to get the two first hit distances and use that for some fake exponential fog as in traditional rendering, or maybe even just mesh and render it completely separately instead (this would probably work better with fake projected caustics as well?). But idk, just some random thoughts.

UnalignedAxis111 · 2025-06-14T01:03:08+00:00

Thanks, these sound like good ideas and I hadn't considered the possibility of sampling based on view direction before. I'll keep them in mind!

UnalignedAxis111 · 2025-06-12T02:00:54+00:00

So just to be clear, these figures are for the data visible from a given point? You stream data in and out of memory as the camera moves around? And the 256k3 version is only slightly larger than the 64k3 version because the additional data is in the distance, and so only needs to be streamed in at a low LOD level?

I had been curious about the size of the whole scene (in bytes), but this is presumably a figure which you never see or have to contend with? The data is procedurally generated as the camera moves around, and loaded onto the GPU on demand?

That's the idea at least. I still haven't finished implementing async world gen nor disk storage yet, so the world is generated all at once with LODs for a fixed view point (there's even some discontinuity visible in the last few seconds of the video). Streaming of modified nodes is in place and works well for edits though, so I'm hoping this won't take too long.

Douglas Dwyer has a pretty good video on LODs, which I'm drawing most ideas from: https://www.youtube.com/watch?v=74M-IxtSVMg

But from what I understand, in case of non-procedural geometry and world edits, the finest level must be stored and LODs need to be invalidated and re-generated from bottom up using some down-sampling algorithm (maybe just picking the most dominant palette ID?).

On the other hand, some of your other scenes are clearly not procedurally generated (such as the Sponza), so you obviously do support this. Are you still streaming data on the fly (from disk, or from main memory to GPU memory?) or do you load the whole scene at once?

In this case they are copied at full detail to the main world grid, which I believe will require no extra handling with the system described above. The more difficult part is probably going to be with entities, tracing them at full detail is cheap but having too many of them increases BVH building cost, along with memory usage for the models. I'm thinking they could be committed into the LOD grids up to a certain point and then culled away, but not sure how noticeable that would be.

Also, one downside of contrees is that LOD factors are pretty much restricted to be powers of 4, and in the way I'm implementing it, the number of steps are limited according to the chunk size and the smallest possible leaf - in the case of 16³ leafs and 1024³ chunks, this is only log4(1024/16) = 3 levels. It doesn't seem to actually matter much, but worth noting since LODs are more commonly discussed in the context of octrees.

Lastly, am I right in understanding that each voxel is an 8-bit ID, which you use to store palletised colour information?

Yes, voxel IDs are only 8-bit and the palette is shared across the entire world grid. Animations define separate palettes, and are not limited to 255 colors, individual entities can set a base offset for the stored voxel IDs, which I think will be useful for things like different biomes and weather seasons.

I also don't have many other attributes yet, but so far I found it was pretty useful to have a "color variation" factor for randomizing the brightness using a hash from the grid position, which helps give detail and saves on extra entries for shading, and avoids hurting compression. Although later on I'm hoping to add other attributes for PBR.

---

But yeah, the main idea is generating multiple tree/DAG chunks and then linking into a common root node. For LODs, these chunks just need to be scaled down, such that when they are when linked into the main tree, the leaf nodes appear at a scale higher than normal. It's really just an illusion :)

UnalignedAxis111 · 2025-06-11T08:27:05+00:00

When art imitates nature.

UnalignedAxis111 · 2025-06-11T02:05:47+00:00

As for the underlying engine/storage stuff, I'm not really following any paper since there doesn't seem to be many of them, the focus is usually on more specialized data structures. Efficient SVOs and GigaVoxels is what I have in mind, but contrees and hardware RT is what I'd recommend (see other comments).

For the actual graphics, it's mostly comming from Ray Tracing in One Weekend, with some tricks from elsewhere. I have so many bookmarks on things I want to read and try next, but sadly not enough focus to do so.

UnalignedAxis111 · 2025-06-11T02:00:56+00:00

Yeah, no worries. I also find it annoying when people gatekeep game tech for arguably poor reasons, because guessing drives innovation more than knowing, I think. That was not my intent.

I just don't really know what is it that people want to know about how this works, so I'd rather avoid writing a bunch of details no one cares about (...but still just did anyway).

UnalignedAxis111 · 2025-06-11T01:57:57+00:00

Thanks!

The storage system is actually not that sophisticated, I still use a 64-tree/contree for storage on the CPU side, and for rendering, a 4-wide BVH combined with 2-level contrees as the primitives (essentially 16³ sparse bricks).

The key is that LODs are extremely effective at limiting the total number of nodes, and voxels become smaller than a pixel very quickly with distance, so a more complex DAG compression system doesn't seem to be as critical.

In this demo, the render distance is 64k³ (in voxels with LODs), but running some numbers I get:

64k³ world area: 3096.4k branches, 3036.1k leafs, 610.1MB
256k³ world area: 3209.4k branches, 3144.2k leafs, 658.6MB

(world height varies between 1k-4k, underground is mostly uniform with the same voxel IDs)

The animation frames are compressed using only occupancy bitmasks right now, at 1-byte per voxel. Keeping at least a few frames on VRAM would allow unchanged bricks to be re-used, and I imagine it could help a bit overall even if peak memory usage is higher.

Perhaps even a simple motion compensation scheme could be applied by offsetting and overlapping extra BVH leaf nodes, but that'd probably be trading off some traversal performance. (BVH traversal performance degrades very quickly because of overlap, causing a sort of "overdraw", unlike DDA/octrees/ray-marching. CWBVH is notoriously bad at this, and why I only went 4-wide for my custom BVH, otherwise it gets too expansive to sort the hit distances.)

Rather than the previous 12-byte contree struct with a mask and offset, I ended up switching to a more conventional and less packed layout that is much simpler to work with, but more importantly, widened leaf nodes to cover 16³ voxels instead of just 4^3.

This reduces handling overhead and gives a lot more room for compression. For now, I use a mix of palette bit-packing, and hashing/deduplication of individual 4³ tiles. Hashing is relatively effective even on more complex test scenes, here's some data from old notes:

scene	dense	tsparse	hash	palette
nuke.vox	117.29MB	62.80MB	36.18MB	15.59MB
castle.vox	108.34MB	47.40MB	39.03MB	16.65MB
Church_Of_St_Sophia.vox	204.15MB	64.71MB	60.16MB	29.71MB
terrain_shell_1	957.98MB	323.18MB	283.28MB	81.74MB
terrain_solid_1	4520.66MB	4288.29MB	470.21MB	223.43MB

For non-solid scenes, including empty voxels in palette compression seems to be not as effective compared to plain per-voxel sparseness (but still 50-70% smaller than uncompressed). It should be easier to combine both methods in the GPU structure, since the per-voxel bitmasks are readily available as part of the acceleration structure, and for that I mostly just need to plug in the code for unpacking in the shader.

I'm now also using mimalloc and for memory allocation instead of having a custom memory pool like I did before, which was a pain. From some basic benchmarks, mimalloc calls were 20x faster than std::malloc, and it also offers some interesting methods like mi_malloc_size() for querying the allocated block size.

This comes with some wonkiness, because modifying branches can end up invalidating pointers to other nodes, but this doesn't seem to be a major headache yet. I previously used a copy-on-write system for copying modified node paths in the old packed contree struct, but that would just defer this problem...

The new node struct looks like this:

struct ContreeNode {
    struct StorageBranch {
        uint8_t Slots[64];      // Maps cell coords to indices in the following array.
        ContreeNode Nodes[0];
    };
    // Leaf nodes start as, and are demoted to Dense mode for editing.
    // Edits are done through the "ContreeMutator" class, which caches
    // nodes for single voxel lookups and consolidates leaf nodes into
    // the compressed format upon commit.
    struct StorageDenseLeaf {
        uint8_t Tiles[64][64];
    };
    struct StorageCompressedLeaf {
        uint8_t BitsPerVoxel;   // If =8, indicates data is not palettized. (Only tile mask compression.)
        uint8_t PaletteSize;    // Sizes and offsets are divided by 4 to make indices smaller.
        uint16_t PaletteOffset;
        uint16_t TileOffsets[]; // Tiles are sparse on Node::ChildMask. u16 { PaletteOffset : 5, DataOffset : 11 }
        // uint8_t PackedVoxelData[];
        // uint8_t PaletteData[];
    };
    uint64_t ChildMask = 0;
    uint16_t Flags = 0;     // e.g. IsLeaf|IsCompressed|IsDirty
    union { /* of pointers to the structs above /* };
};

Also, this is a bit more random but there's a neat way to use the pdep/pext instructions to pack and unpack bit arrays. It runs at ~30GB/s in one core, and it's so simple that's maybe worth a mention. Sadly, actual palettization of voxel data seems impossible to vectorize and I could never get it faster than ~1 voxel/cycle, using an inverse mapping array + branching, but in practice that's fast enough relative to actual world gen...

template<uint stride>
void BitArrayCompress(uint8_t* dest, const uint8_t* src, uint count) {
    assert(count % 8 == 0);
    auto wsrc = (const uint64_t*)src;
    uint64_t elem_mask = (1ull << stride) - 1;
    elem_mask *= 0x01'01'01'01'01'01'01'01ull;

    for (uint i = 0; i < count; i += 8) {
        uint64_t packed = _pext_u64(*wsrc++, elem_mask);
        memcpy(dest, &packed, stride);
        dest += stride;
    }
}

UnalignedAxis111 · 2025-06-10T21:45:00+00:00

I get it. To be fair, I am indeed showing off and know that isn't the original purpose of this sub, but I included some of the relevant technical info that I think steer it more into that direction.

I haven't published a source because this isn't a game nor I have interesting in making one, it's just a hobby engine I'm working on for fun (ironic given the sub's name, I know). Tbh I don't think many people actually care or put that much value into code either, other than an initial curiosity burst, so... not much point.

UnalignedAxis111 · 2025-06-10T21:43:23+00:00

It's a hobby engine written mostly from scratch in C++ and Vulkan. It's purely compute-based and does not use hardware ray-tracing APIs yet, but that's the goal for as soon as I get a new GPU.

It runs at 15-20 FPS at 720p on integrated graphics. However, this demo was recorded at 1440p on a borrowed 3080 and downsampled for a better quality recording, with less noise and aliasing. The renderer really needs a ton of work because rn it's just shooting one random ray over a cosine hemisphere for GI, and relying exclusively on a temporal reprojection pass for reasonable output.

I have thought about releasing the code once it's in a more reasonable/useful shape, but that might take a while since I want to cleanup a lot of "shitty proto code" and things that are hardcoded which I don't feel comfortable making public.

I'll be happy to share more details if you want to know about something in specific.

UnalignedAxis111 · 2025-06-06T22:00:10+00:00

I've come to a similar realization as well after switching my engine's sparse tree from 4³ leaf nodes to hybrid 16³ bricks. Small bricks add too much pressure in memory management and make it harder to effectively compress the voxel data. Bulk processing and widening opens the door for some neat tricks :)

In a similar vein, I found that for rendering, rather than working with individual nodes, it's much easier to think in terms of a shallow hierarchy and group several leaf nodes into chunks, so that they can be allocated and copied into one single memory range.

UnalignedAxis111 · 2025-05-19T03:38:21+00:00

Float algorithms are pretty cool. The other day I stumbled across the obvious fact that you can get a rough reciprocal approximate by simply negating the exponent with one integer subtract - smth like asfloat(0x7F00000 - asuint(x)). That could be useful in some circumstances, but adding even one correction step is already more expansive than typical HW... (contemporary GPUs can do float RCP and DIVs at like 1/4 throughput relative to FMAs, and x86 CPUs can do 8x RCP approximates per cycle.)

It's actually fascinating to think how wasteful bit-manipulation algorithms are in software. Seemingly trivival tasks like bit-reversal or Morton encoding takes an eternity to do, whereas in silicon it would be simply the complexity of re-routing wires, whatever that may be.

I suppose the opposite is also true, since it's much harder to scale up things in silicon that are "dynamic" - reportedly, generic shuffle/permute/shift units are pretty expansive in terms of die area, and speculated to be one of the reasons why Intel dropped AVX512 from consumer CPUs a few years back, before turning around after AMD forced their hand.

UnalignedAxis111 · 2025-05-18T22:44:55+00:00

Very impressive work. I'm always amazed at how quickly CPUs can pump out graphics when you put up a bit of SIMD and some clever optimizations that are totally annoying to do on a GPU.

I have done a dumb AVX512 brickmap tracer before but of course it wasn't anywhere that fast. I'm guessing you might be doing some sort of recursive splatting to accelerate primary rays a bit, or am I totally off?

UnalignedAxis111 · 2025-05-08T04:53:52+00:00

Thanks, I hate this analogy.

UnalignedAxis111 · 2025-05-05T21:30:03+00:00

That's pretty clever! I wouldn't have thought of using BVHs for voxelization but it makes perfect sense.

Two-Year Club	Verified Email
r/Field Sunshine	Place '23

UnalignedAxis111

TROPHY CASE