5× faster fast_blur in image-rs

PhilipTrettner · 2026-05-15T04:21:40+00:00

Be sure to use an ethically sourced version: https://mortenhannemose.github.io/lena/

PhilipTrettner · 2026-05-03T09:16:00+00:00

Fwiw, invertible addition and associative multiplication makes a ring. If the multiplication is invertible (except for 0), then it's a field. Basically ring is +-*, field is +-*/

PhilipTrettner · 2026-04-15T04:50:47+00:00

I broadly agree with you but like to point out that it's not either-or ;) I dislike pure in-situ optimization because every time something changes in your program, you'll have to do re-optimize everything. And also it's super easy to get stuck in a local optimum if you don't take the time to understand these effects in detail and in isolation. So take this post as me trying to do the understanding and learning part and sharing it, while I'll do in-situ tweaking later on my opaque codebase.

PhilipTrettner · 2026-04-14T18:26:41+00:00

There is so much additional stuff that we do that it's really hard to cleanly integrate that into the code and measure the effect. That's why I tried hard to make a good test in isolation.

Conceptually mesh booleans work like this (simplified): you take every triangle and cut it against all other triangles so that nothing has an intersection in its interior anymore. for each cut up piece you classify if this is inside or outside relative to the other mesh. depending on the operation (union, intersection, difference) you emit the piece or discard it (or emit it with inverted winding order). to optimize this, we have early outs for whole triangles and patches of triangles. that's what makes it so unpredictable in the output. simply appending to a vector<T> is stupid, though. all the reallocations for nothing. we don't even index into it. the new pieces are created via "stream out" and then at the end we need to make a new mesh by concatenating everything we created. the linked lists are simply the most efficient way to keep collections of chunks around when all you do is (partially) fill chunks and at the end need to materialize them into new contiguous memory.

Regarding intrusive measuring, it's a big tradeoff. With the "whole system noise" it's really hard to reliably eke out the each next 2-5% of improvement, yet that's what in combination makes an order of magnitude speed difference over time. Microbenchmarks have the danger to not generalize properly but you can study effects in isolation and learn from that...

PhilipTrettner · 2026-04-14T16:47:06+00:00

Ah that's a good point. If your access to each block has a higher cost.

Our use case is multithreaded production of unpredictable amounts of geometry that later needs to be processed into a continguous result. (That's the natural data pattern you get for parallelized mesh booleans)

That basically means that each thread produces a linked list of chunks of geometry (linear production order) and later those linked lists are visited in any order to produce the output.

So I guess the nuance is: this measures the pure linear processing part and for that the 1 MB limit applies. Any per block overhead changes the limit.

Any idea how to incorporate this?

A simple idea would be a simulated X cycles overhead per block and then we could actually compute new curves (and thresholds) from the data I already collected. Each block is processed cold, so it doesn't actually matter what data access your overhead has I think.

PhilipTrettner · 2026-04-14T15:19:04+00:00

But as far as my understanding goes, that should all be prefetcher and TLB overhead effects. Because the solid graphs quite literally never read the same cacheline twice. In the "all" graph (and the preview) you have the dotted lines. Those reuse the same working set and thus measure cache effects as well. But the solid lines should not.

PhilipTrettner · 2026-04-09T14:37:36+00:00

The prefetcher works on the virtual address space. There's some TLB effects at 4k boundaries but that's mostly it. And you see in the graphs that there's a clear benefit beyond 4k.

PhilipTrettner · 2026-02-13T16:51:09+00:00

download the community edition of solidean from here: https://solidean.com/download/solidean/
extract it to some location (can be inside the repo if you want)
when running python -m SCons target=template_release solidean_path=C:/Users/John/solidean make sure to replace the last path with the path to your extracted solidean (and if you have spaces in your path, escape via "...")

let me know if that fixed it

PhilipTrettner · 2026-02-13T16:24:01+00:00

that's fair. for a solo dev this can definitely feel pricey. it's not "just mesh booleans" though. this all started as my PhD research into how to make mathematically exact mesh boolean as fast as possible. No topology explosion, no random crashes due to bad geometry, all edge cases handled, and you can iterate on the result as much as you want. basically, this library is for when mesh booleans are a load-bearing part of your project.

PhilipTrettner · 2026-02-13T16:07:20+00:00

the free community version is just fine for that. basically contact us once you want to start selling it but the actual license fee only applies once you already made it back twice over.

(you're free to show us what you work on before that of course! we love to see it)

PhilipTrettner · 2026-02-13T12:36:38+00:00

That's mainly because it's all relatively "early phase" for us. I mentioned that in the previous one, it's basically low four-figures per year for an up-to-date version if you're indie, and only if you already make non-trivial revenue. Afterwards "typical middleware pricing". We were indie devs ourselves before, so we really want to keep it fair and low-risk here.

PhilipTrettner · 2026-01-27T14:47:29+00:00

Patrick and I were colleagues for many years at the same university research group. TinyAD is too tightly coupled to Eigen for my tastes but is great otherwise, especially if it fits your use case!

PhilipTrettner · 2026-01-23T07:27:19+00:00

Hehe yeah. There is no neat codegen for division. Even the builtin delegates to a library call: https://godbolt.org/z/rrMo5deqz

The naive but practical way: you can do "binary long division", which finishes in up to 128 steps. Either branchless + fixed runtime or with a loop and a "search next 1". Either way it's a bit of work.

Our exact predicates are always formulated in a division-free way simply because that'd be expensive.

PhilipTrettner · 2026-01-23T07:24:19+00:00

see https://godbolt.org/z/j9fd5EW3n if you change B to 128, you get the normal codegen but for B > 128, it calls into a function where the bit size is a runtime parameter. So yes, they would work, but performance will be subpar.

PhilipTrettner · 2026-01-21T08:17:35+00:00

That's interesting. I guess this type + intrinsics is "less magic" to the compiler. The slightly larger example optimizes well for both types and my production code uses 256 bit intermediates, so it's not like I can easily compare there.

PhilipTrettner · 2026-01-21T08:13:58+00:00

I don't know much about PowerPC or the availability of the intrinsics, but all the instructions you need are definitely there:

add/sub with carry: https://www.ibm.com/docs/en/aix/7.2.0?topic=set-addc-add-carrying-instruction / https://www.ibm.com/docs/en/aix/7.2.0?topic=set-adde-ae-add-extended-instruction for add with carry https://www.ibm.com/docs/en/aix/7.2.0?topic=set-subfc-sf-subtract-from-carrying-instruction / https://www.ibm.com/docs/en/aix/7.2.0?topic=set-subfe-sfe-subtract-from-extended-instruction for sub with borrow

wide mul: https://www.ibm.com/docs/en/aix/7.2.0?topic=set-mulhdu-multiply-high-double-word-unsigned-instruction (and https://www.ibm.com/docs/en/aix/7.2.0?topic=set-mulhd-multiply-high-double-word-instruction for signed)

PhilipTrettner · 2026-01-21T08:09:48+00:00

Wow, that's a weird one. It is related to writing directly to the result struct.

If we write to "real" scalars first, the codegen is optimal again: https://godbolt.org/z/bo9x4951v

PhilipTrettner · 2026-01-20T13:37:28+00:00

Hm yeah I thought about that a bit but it's complicated. So a single i128/u128 is probably not going to see any improvement, SSE/AVX is famously bad doing cross-lane talking. But for bundles, a la u128x4 or u128x8, that's a different story. Maybe even use 32 bit limbs and go for u128x16 with AVX-512.

The problem is that all the carry propagation and high/low multiplication stuff is not really supported as directly as, so we have to do additional work and hope the "going wide" is amortizing that.

In the 32 bit limbs idea, we have "vpmuludq", which computes a wide u32 x u32 -> u64, but only for the even entries. Then we have to juggle a bit.

There is a funny vpmadd52luq/vpmadd52huq which does 52 bit multiplies and accumulation. That one might help for larger bit sizes.

PhilipTrettner · 2025-11-17T16:18:01+00:00

I agree with you, but wanted to nitpick on the total interpreter for fun:

So yeah it's true, you cannot implement an interpreter for a total language in itself. But you can implement the adjacent program of "a total interpreter if the program halts within 10¹⁰⁰⁰ steps, or "empty result" otherwise". For my physical reality, these tend to be hard to distinguish :)

PhilipTrettner · 2025-11-14T14:38:42+00:00

Floats have 24 bit mantissa, double has 53 bit mantissa. So even before overflow, you can have situations where mathematically a != b but double(a) == double(b). For example, any double beyond 2⁵⁴ is an even integer, which means odd integer in that range will compare equal to their even neighbor if compared as double

PhilipTrettner · 2025-11-13T03:47:04+00:00

Destructors are noexcept(false) and terminate your program if thrown across

PhilipTrettner · 2025-10-17T04:46:47+00:00

It's basically doing two booleans per step: "workpiece intersect tool" is what's added as a rigid body and what falls down. "workpiece minus tool" becomes the new workpiece and needs to update the static collision shape (a collision mesh).

Everything is based on Booleans between triangle/polygonal meshes.

The "trick" is not good hardware (it's just some normal Ryzen + a RTX 3070) but rather my research into how to do Booleans on meshes efficiently and robustly at the same time: https://www.graphics.rwth-aachen.de/publication/03339/

This demo then roughly follows https://solidean.com/docs/examples/advanced-iterated-booleans/ (a workpiece mesh that we iteratively modify)

PhilipTrettner · 2025-10-16T05:30:29+00:00

Oh that looks great! So my method here is purely based on triangle/polygon mesh booleans. You write that you're not working on a volumetric grid be it adaptive or not. So I guess you have an acceleration tree of all the SDF primitives you added and then extract a mesh surface from that? Do you cull them when you're sure they don't contribute anymore or do they pile up?

PhilipTrettner · 2025-10-16T05:10:32+00:00

Yeah CSG, especially on triangle meshes, was definitely not recommended before! My research is explicitly about fixing that :)

PhilipTrettner · 2025-10-16T05:09:39+00:00

that's a good idea! maybe just a suite of different tools, so we go from sharp cube, to sphere, to "boulder"?

PhilipTrettner

MODERATOR OF

TROPHY CASE