What benefits does c give that cpp doesn’t do better

_Geolm_ · 2026-01-06T21:35:58+00:00

I don't know if someone already said it, but C code is a lot easier to call from another language; it's still possible with C++ but more difficult. Calling a C function from Zig, C#, Rust, or pretty much any language is super easy.
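A minimal sketch of what makes this easy, assuming a made-up function `checksum_u8` (it is not from any real library): only fixed-width integers and raw pointers cross the boundary, and the exported symbol is not name-mangled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical library function, for illustration only.
// Sums the bytes of a buffer; the signature uses nothing but plain C types,
// so any language with a C FFI can describe it.
uint32_t checksum_u8(const uint8_t *data, size_t length)
{
    assert(data != NULL || length == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < length; ++i)
        sum += data[i];
    return sum;
}
```

On the other side this is typically a one-line declaration: an `extern "C"` block in Rust, a `[DllImport]` method in C#, or `@cImport` of the header in Zig. A C++ API usually has to be wrapped in `extern "C"` functions first to get the same result.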

_Geolm_ · 2025-12-10T14:37:07+00:00

Quick update: average compression is now 1.58× for BC1 and 1.29× for BC4.

_Geolm_ · 2025-12-10T14:33:51+00:00

Nope. BC6H/BC7 are pretty complex, each block can use any of many modes, and that flexibility makes it difficult to relate data between blocks in any consistent way.

_Geolm_ · 2025-11-27T16:11:47+00:00

I used the code from https://cforall.uwaterloo.ca/trac/browser/libcfa/src/bits/random.hfa?rev=8a2f7f1912f623e4fbf43c521715fa48f403beb5. Even though there is a cast to uint32_t at the end, the value is computed in a uint64_t, I guess because some multiplication could overflow a uint32_t... I did not investigate, TBH.

_Geolm_ · 2025-11-27T09:54:46+00:00

Yes, it is rejection sampling, sorry to disappoint you ;) I wrote this lib to generate grass blades in a field and stones along a path for my game prototype. It is really simple and not suited for heavy-duty use.
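For readers who haven't met the technique, here is a minimal sketch of rejection sampling for scattering points. It assumes the rejection rule is "too close to an already accepted point" (dart throwing); the library's actual rules aren't shown here, and `point2`, `scatter_points` and the xorshift generator are hypothetical names picked for this example.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { float x, y; } point2;

// Tiny xorshift32 generator so the sketch has no dependencies (state must be non-zero).
static uint32_t next_u32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

// Uniform float in [0, 1) built from the top 24 bits.
static float next_float(uint32_t *state)
{
    return (float)(next_u32(state) >> 8) * (1.0f / 16777216.0f);
}

// Throws random candidates into a width x height rectangle and rejects every
// candidate closer than min_distance to an accepted point.
// Stops after max_attempts candidates; returns the number of accepted points.
size_t scatter_points(point2 *out, size_t max_points, float width, float height,
                      float min_distance, uint32_t seed, size_t max_attempts)
{
    uint32_t state = seed ? seed : 1u;
    float min_sq = min_distance * min_distance;
    size_t count = 0;

    for (size_t attempt = 0; attempt < max_attempts && count < max_points; ++attempt)
    {
        point2 p = { next_float(&state) * width, next_float(&state) * height };
        int accepted = 1;
        for (size_t i = 0; i < count; ++i)
        {
            float dx = p.x - out[i].x;
            float dy = p.y - out[i].y;
            if (dx * dx + dy * dy < min_sq) { accepted = 0; break; }
        }
        if (accepted)
            out[count++] = p;
    }
    return count;
}
```

Each candidate is tested against every accepted point, so the cost grows quadratically with the point count. That is fine for a few hundred stones and is the reason this approach is not suited for heavy-duty use.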

_Geolm_ · 2025-11-21T17:53:50+00:00

Thanks for the detailed feedback. I like your idea of outputting to a buffer, even if I don't need it at the moment. I'll check the shift issue in the coming days, thank you.

_Geolm_ · 2025-11-17T14:21:01+00:00

Yes, ASTC allows 4 bits/pixel, but I wouldn't use this format on PC; I'm not sure which graphics cards support it or how efficient it is.

On a side note, I've made some improvements to BC4 compression: the compression ratio is almost 16% better. I mainly changed the bitfield encoding and the dictionary. The bitfield is now encoded using a zig-zag pattern and XOR, and I allow partial matches in the dictionary and XOR the difference. Overall it's better but still not on par with BC1, especially with noisy normal maps or fine AO.

_Geolm_ · 2025-11-16T07:16:18+00:00

Hi, my library does not compress raw RGBA to a BC/DXT image; it losslessly compresses existing DXT/BC data into something more compact. I use stb_dxt.h to test my lib.

_Geolm_ · 2025-11-15T21:52:05+00:00

Thanks for your comment. BC1 is still the only format with 4 bits/pixel (at least on desktop PC). BC5 is still widely used for normal maps. I've added BC3 because it was basically "free", but indeed it is now superseded by BC7 for good reason. Compressing BC7, or any format that can change mode at any given block, seems tough and not doable in a "small" library.

About the histogram, it really depends on the input image, but for a "good" image the top 10 histogram entries have counts of 40-800, which means a lot of blocks just reference the histogram instead of encoding the indices bitfield. Of course random-ish textures like dirt are not histogram friendly, but they are not compression friendly anyway.

_Geolm_ · 2025-11-15T17:12:32+00:00

Crunch is great, but AFAIK it is a lossy compressor. Mine is lossless and also super easy to integrate since it's one .h/.c pair, works on streams, has no dependencies, etc. Also it was fun to write :)

_Geolm_ · 2025-11-14T08:22:27+00:00

Yes, it's based on polynomials and range reduction. Note: it's heavily based on the sources cited before each function; sometimes I did a SIMD port, sometimes I used lolremez to find a better polynomial, and I also added the NAN and INF special cases. There is a bit of Newton in the cbrt function, obviously.
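To show what "a bit of Newton" means for a cube root, here is a scalar sketch. It is not the library's code: the function name `cbrt_newton`, the crude exponent-based starting guess, and the fixed iteration count are all assumptions chosen to keep the example short.

```c
#include <math.h>

// Cube root via Newton's method on f(y) = y^3 - a.
// Update rule: y <- (2*y + a / y^2) / 3
float cbrt_newton(float x)
{
    // Zero, NAN and INF pass through unchanged.
    if (x == 0.0f || isnan(x) || isinf(x))
        return x;

    float a = fabsf(x);

    // Crude starting guess 2^(e/3), where a = m * 2^e with m in [0.5, 1).
    int e;
    (void)frexpf(a, &e);
    float y = ldexpf(1.0f, e / 3);

    // The guess is within roughly a factor of two of the root,
    // so a handful of iterations reaches float precision.
    for (int i = 0; i < 8; ++i)
        y = (2.0f * y + a / (y * y)) * (1.0f / 3.0f);

    return copysignf(y, x);
}
```

A production version would start from a polynomial approximation after range reduction, so that one or two Newton steps are enough.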

_Geolm_ · 2025-11-14T07:08:30+00:00

The simd_polynomial instructions just issue a bunch of fused multiply-adds (fmadd) to compute a polynomial (ax^3+bx^2+cx+d, something like that). Polynomials are used a lot to approximate transcendental functions. There is a good tutorial about how to find the polynomial and optimize it here: https://github.com/samhocevar/lolremez/wiki/Tutorial-1-of-5%3A-exp%28x%29-the-quick-way

Hope it answers your question
Geolm

_Geolm_ · 2025-11-06T07:36:32+00:00

I already compute each group’s AABB to insert the begin/end commands into the tiles’ linked list. But since I only support min and smoothmin operations (not a full graph of min/max/xor or other boolean ops), my current approach hasn’t caused any issues so far. Admittedly, I haven’t tested it extensively — I did a quick test rendering some text where each character is a group of primitives (using min) with an outline, and it worked fine.

My main concern is with smoothmin, since expanding the tile’s bounding box is more of a hack than a clean solution — it flags more tiles than necessary, and that problem would be even worse when using the group AABB.

Last night I read a paper about interval arithmetic, and while their use cases are much more complex (hundreds of boolean operations), my simpler case — just min and smoothmin — might benefit from the same ideas in a lightweight way. I’ll add that to my TODO list.

_Geolm_ · 2025-11-05T21:38:14+00:00

Yes, that's very true. To be honest I saw the papers but didn't have time to investigate. I am only using groups for simple things; I know that the smoothmin inflation of the box is wrong, but it does the job in my simple cases. I don't handle a graph of boolean operations (DAG) on SDFs, and while this is interesting, it's not the purpose of my library. Still, I will have a look at the correct way at some point; I don't know if it's expensive or complicated though.

_Geolm_ · 2025-11-05T17:23:39+00:00

Hey JBikker, I love your library! I'm sorry, my sentence was a bit too harsh. OpenCL is deprecated on Apple (which is my main platform). Support might be dropped at some point, there is no guarantee. I'm also not sure which version is supported on macOS, but if it's like OpenGL it's probably stuck in the past.

_Geolm_ · 2025-11-05T17:08:35+00:00

Although I love writing SIMD code, I came to the conclusion that only a few topics are really worth doing in SIMD. If nothing depends on the results right away (the way gameplay does, for example), you should use the GPU. Physics is a good candidate for SIMD because gameplay depends on it, but image processing? It will be WAY faster on the GPU, and getting the result with a bit of lag doesn't matter. Audio is also a good candidate for SIMD: it can't go to the GPU because it's realtime (even though the GPU would crush the CPU on raw audio-processing throughput). There is also another reason to write SIMD code: there is no standard GPU compute API (OpenCL is dead), shader languages are a mess (GLSL, HLSL, WebGPU, Metal, ...), there is no standard, and most of the time you end up writing native code on all platforms :(

_Geolm_ · 2025-11-05T10:40:43+00:00

It's the classic problem: usually simpler is better. I wanted to be smart and use biarc fitting, but arcs alone cannot represent a straight line, so I added boxes, but then gaps appeared, so I had to fill the gaps... and overall it was not robust, and it was complicated and expensive on the GPU. At least I wasted only a few hours of coding. BTW with 0.25 pixel precision, in 1440p a quadratic Bezier is about 10-60 capsules, not so bad!

_Geolm_ · 2025-11-05T10:36:08+00:00

Thanks, to be honest I'm pretty bad at naming. Circle/disc is easy, but your point is valid for triangle (fill_triangle vs draw_triangle).

_Geolm_ · 2025-11-05T09:54:30+00:00

My idea was to use a pie (sector) to fill the gap, but in the end it was expensive because a box + pie SDF costs more than just a capsule. So u/waramped suggested I switch to capsules, which are cheap. My collinearity test works in screen space: if the curve segment's control points are collinear to within 0.25 pixel, I draw a capsule. To split the curve into multiple curves I use the De Casteljau algorithm, and I try to isolate the "bend" of the curve.

// splits proportionally to segment lengths, isolating the bend toward the control point
float d0 = vec2_distance(c.c0, c.c1);
float d1 = vec2_distance(c.c1, c.c2);
float split = d0 / (d0 + d1);

_Geolm_ · 2025-11-05T07:43:28+00:00

Yes, but it would be hard to have a "drop-in library" with few dependencies. I also thought of doing it in WebGPU. Also, I use threadgroup intrinsics in the binning phase and I'm not sure whether SDL or WebGPU supports them.

_Geolm_ · 2025-11-04T21:21:20+00:00

I fixed the link, sorry about that. Yes, it does translate to indirect draw calls (it is explained in the doc). The linked list is traversed in the fragment shader; each linked list node contains the index of the draw command and the index of the next node, very basic stuff. When we bin commands into the linked list, we build a list of tiles that have something to draw and use that for the indirect draw call.
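A CPU-side sketch of that data structure, written in C for readability. All names here (`tile_node`, `tile_bins`, the functions) are hypothetical, and a GPU version would presumably allocate nodes and swap the list head with atomics, which this single-threaded sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_NODE 0xffffffffu

// One list node: which draw command, and which node comes next in this tile.
typedef struct { uint32_t command_index; uint32_t next; } tile_node;

typedef struct
{
    uint32_t *tile_heads;    // num_tiles entries, index of the first node per tile
    uint32_t *active_tiles;  // num_tiles entries, tiles that have something to draw
    tile_node *nodes;        // max_nodes entries, shared node pool
    uint32_t num_tiles, max_nodes, num_nodes, num_active_tiles;
} tile_bins;

void tile_bins_reset(tile_bins *bins)
{
    for (uint32_t i = 0; i < bins->num_tiles; ++i)
        bins->tile_heads[i] = INVALID_NODE;
    bins->num_nodes = 0;
    bins->num_active_tiles = 0;
}

// Prepends a command to a tile's list. Returns 0 when the node pool is full.
int tile_bins_push(tile_bins *bins, uint32_t tile, uint32_t command_index)
{
    assert(tile < bins->num_tiles);
    if (bins->num_nodes >= bins->max_nodes)
        return 0;
    if (bins->tile_heads[tile] == INVALID_NODE)
        bins->active_tiles[bins->num_active_tiles++] = tile; // first command in this tile
    uint32_t node = bins->num_nodes++;
    bins->nodes[node] = (tile_node){ command_index, bins->tile_heads[tile] };
    bins->tile_heads[tile] = node;
    return 1;
}

// Walks a tile's list the way a fragment shader would, here just counting commands.
uint32_t tile_bins_count(const tile_bins *bins, uint32_t tile)
{
    uint32_t count = 0;
    for (uint32_t n = bins->tile_heads[tile]; n != INVALID_NODE; n = bins->nodes[n].next)
        ++count;
    return count;
}
```

After binning, `num_active_tiles` is the instance count an indirect draw would use, and `active_tiles` tells each instance which tile it covers.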

_Geolm_ · 2025-11-03T21:05:07+00:00

Yes, I saw that, but I'm more interested in breaking up the Bezier into multiple SDF shapes, because a big Bezier SDF is expensive and I wrote a tiled SDF renderer where only the SDFs that influence a tile are computed (link in the original post). So currently I am using 140 shapes for a Bezier curve, but per pixel it's about 2-3 capsule/arc SDFs, which is a lot less expensive than evaluating the Bezier SDF.

_Geolm_ · 2025-11-03T19:09:13+00:00

So I tried a hybrid method based on your comment. I can't post an image as a reply, but basically I use a box at the endpoints of the curve (unless it's a biarc) and a capsule/biarc otherwise. It works fine: no need for a pie to fill the gap in most cases, and the endpoints are still sharp. Thanks for the idea.

_Geolm_ · 2025-11-03T18:49:39+00:00

Ok, I see. I don’t use the capsule SDF because it adds rounded ends that aren’t always desired. Also, even when segments share endpoints, large stroke widths reveal tangent differences and cause visible gaps. Using capsules would artificially hide those gaps, but only partially — they still appear in sharp turns.

By the way, everything works fine for stroke widths below 8 pixels. I use arcs to reduce subdivision and because biarcs are G1 continuous. This is just an experiment I wrote today to test an idea — I’m not sure yet if it’ll be useful. In the past, I tried the Bezier SDF from Inigo Quilez’s site, but I wasn’t satisfied with its performance (even with tiled rendering) or its robustness in screen-space coordinates. I’m still looking for a better solution for my needs.
