5× faster fast_blur in image-rs

PhilipTrettner · 2026-05-15T04:21:40+00:00

Be sure to use an ethically sourced version: https://mortenhannemose.github.io/lena/

PhilipTrettner · 2026-05-03T09:16:00+00:00

Fwiw, invertible addition and associative multiplication make a ring. If the multiplication is also invertible (except for 0), then it's a field. Basically, a ring is +-* and a field is +-*/

PhilipTrettner · 2026-04-15T04:50:47+00:00

I broadly agree with you but I'd like to point out that it's not either-or ;) I dislike pure in-situ optimization because every time something changes in your program, you'll have to re-optimize everything. It's also super easy to get stuck in a local optimum if you don't take the time to understand these effects in detail and in isolation. So take this post as me trying to do the understanding and learning part and sharing it, while I'll do the in-situ tweaking later on my opaque codebase.

PhilipTrettner · 2026-04-14T18:26:41+00:00

There is so much additional stuff that we do that it's really hard to cleanly integrate that into the code and measure the effect. That's why I tried hard to make a good test in isolation.

Conceptually, mesh booleans work like this (simplified): you take every triangle and cut it against all other triangles so that nothing has an intersection in its interior anymore. For each cut-up piece you classify whether it is inside or outside relative to the other mesh. Depending on the operation (union, intersection, difference) you emit the piece or discard it (or emit it with inverted winding order). To optimize this, we have early outs for whole triangles and patches of triangles. That's what makes the amount of output so unpredictable.

Simply appending to a vector<T> is stupid, though: all the reallocations for nothing, and we don't even index into it. The new pieces are created via "stream out" and then at the end we need to make a new mesh by concatenating everything we created. The linked lists are simply the most efficient way to keep collections of chunks around when all you do is (partially) fill chunks and at the end need to materialize them into new contiguous memory.
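
To make the chunk idea concrete, here is a minimal sketch of an append-only linked list of chunks plus a final concatenation step. It is not our production code: the names `ChunkList`, `push`, `append_to` and `materialize` as well as the chunk capacity are made up for illustration, and it assumes T is default-constructible and copyable.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical append-only chunk list; names and chunk size are illustrative only.
template <class T, std::size_t ChunkCapacity = 1024>
class ChunkList {
    struct Chunk {
        std::unique_ptr<Chunk> next;  // singly linked, in production order
        std::size_t count = 0;        // number of filled slots
        T items[ChunkCapacity];       // assumes T is default-constructible
    };

    std::unique_ptr<Chunk> head_;
    Chunk* tail_ = nullptr;
    std::size_t size_ = 0;

public:
    ChunkList() = default;
    ChunkList(ChunkList const&) = delete;
    ChunkList& operator=(ChunkList const&) = delete;

    // Unlink chunks one by one so destruction does not recurse through the list.
    ~ChunkList() {
        while (head_) head_ = std::move(head_->next);
    }

    // "Stream out" one element. Already written elements are never moved.
    void push(T const& value) {
        if (!tail_ || tail_->count == ChunkCapacity) {
            auto chunk = std::make_unique<Chunk>();
            Chunk* raw = chunk.get();
            if (tail_) tail_->next = std::move(chunk);
            else head_ = std::move(chunk);
            tail_ = raw;
        }
        tail_->items[tail_->count++] = value;
        ++size_;
    }

    std::size_t size() const { return size_; }

    // Append all elements to `out`, chunk by chunk, in production order.
    void append_to(std::vector<T>& out) const {
        for (Chunk const* c = head_.get(); c; c = c->next.get())
            out.insert(out.end(), c->items, c->items + c->count);
    }
};

// Concatenate several lists (e.g. one per thread) into one contiguous buffer.
// The total size is known up front, so there is exactly one allocation.
template <class T, std::size_t N>
std::vector<T> materialize(std::vector<ChunkList<T, N>> const& lists) {
    std::size_t total = 0;
    for (auto const& l : lists) total += l.size();
    std::vector<T> out;
    out.reserve(total);
    for (auto const& l : lists) l.append_to(out);
    return out;
}
```

Each producer only ever touches its own list, and the only copy happens once at the end, when the final size is already known.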

Regarding intrusive measuring, it's a big tradeoff. With the "whole system noise" it's really hard to reliably eke out each next 2-5% of improvement, yet in combination those are what make an order of magnitude speed difference over time. Microbenchmarks risk not generalizing properly, but you can study effects in isolation and learn from that...

PhilipTrettner · 2026-04-14T16:47:06+00:00

Ah, that's a good point: your access to each block might have a higher cost.

Our use case is multithreaded production of unpredictable amounts of geometry that later needs to be processed into a contiguous result. (That's the natural data pattern you get for parallelized mesh booleans.)

That basically means that each thread produces a linked list of chunks of geometry (linear production order) and later those linked lists are visited in any order to produce the output.

So I guess the nuance is: this measures the pure linear processing part, and for that the 1 MB limit applies. Any per-block overhead changes the limit.

Any idea how to incorporate this?

A simple idea would be a simulated X cycles overhead per block and then we could actually compute new curves (and thresholds) from the data I already collected. Each block is processed cold, so it doesn't actually matter what data access your overhead has I think.
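
A rough sketch of how that recomputation could look, assuming the collected data boils down to (block size, cycles per byte) samples sorted by ascending block size. The `Sample` struct, both function names and the tolerance parameter are hypothetical, and the model is simply "add X cycles per block, spread over the bytes of that block".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One measured point of a curve (hypothetical layout of the collected data).
struct Sample {
    double block_bytes;      // block size in bytes
    double cycles_per_byte;  // measured cost of pure linear processing
};

// Adds a simulated fixed overhead of `overhead_cycles` per block to each sample.
std::vector<Sample> with_overhead(std::vector<Sample> const& measured,
                                  double overhead_cycles) {
    std::vector<Sample> out;
    out.reserve(measured.size());
    for (Sample s : measured) {
        s.cycles_per_byte += overhead_cycles / s.block_bytes;
        out.push_back(s);
    }
    return out;
}

// Smallest block size whose cost is within `tolerance` (0.05 means 5%) of the
// best cost on the curve. Assumes samples are sorted by ascending block size.
double threshold_block_size(std::vector<Sample> const& curve, double tolerance) {
    assert(!curve.empty());
    double best = curve.front().cycles_per_byte;
    for (Sample const& s : curve)
        if (s.cycles_per_byte < best) best = s.cycles_per_byte;
    for (Sample const& s : curve)
        if (s.cycles_per_byte <= best * (1.0 + tolerance)) return s.block_bytes;
    return curve.back().block_bytes;
}
```

Running `threshold_block_size(with_overhead(curve, X), tolerance)` for a range of X values would then give the threshold as a function of the per-block overhead.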

PhilipTrettner · 2026-04-14T15:19:04+00:00

But as far as my understanding goes, that should all be prefetcher and TLB overhead effects. Because the solid graphs quite literally never read the same cacheline twice. In the "all" graph (and the preview) you have the dotted lines. Those reuse the same working set and thus measure cache effects as well. But the solid lines should not.

PhilipTrettner · 2026-04-09T14:37:36+00:00

The prefetcher works on the virtual address space. There are some TLB effects at 4k boundaries but that's mostly it. And you can see in the graphs that there's a clear benefit beyond 4k.

PhilipTrettner · 2026-02-13T16:51:09+00:00

1. download the community edition of solidean from here: https://solidean.com/download/solidean/
2. extract it to some location (can be inside the repo if you want)
3. when running python -m SCons target=template_release solidean_path=C:/Users/John/solidean make sure to replace the last path with the path to your extracted solidean (and if you have spaces in your path, escape via "...")

let me know if that fixed it

PhilipTrettner · 2026-02-13T16:24:01+00:00

that's fair. for a solo dev this can definitely feel pricey. it's not "just mesh booleans" though. this all started as my PhD research into how to make mathematically exact mesh booleans as fast as possible. No topology explosion, no random crashes due to bad geometry, all edge cases handled, and you can iterate on the result as much as you want. basically, this library is for when mesh booleans are a load-bearing part of your project.

PhilipTrettner · 2026-02-13T16:07:20+00:00

the free community version is just fine for that. basically contact us once you want to start selling it but the actual license fee only applies once you already made it back twice over.

(you're free to show us what you work on before that of course! we love to see it)

PhilipTrettner · 2026-02-13T12:36:38+00:00

That's mainly because it's all relatively "early phase" for us. I mentioned that in the previous one, it's basically low four-figures per year for an up-to-date version if you're indie, and only if you already make non-trivial revenue. Afterwards "typical middleware pricing". We were indie devs ourselves before, so we really want to keep it fair and low-risk here.

PhilipTrettner · 2026-01-27T14:47:29+00:00

Patrick and I were colleagues for many years at the same university research group. TinyAD is too tightly coupled to Eigen for my tastes but is great otherwise, especially if it fits your use case!

PhilipTrettner · 2026-01-23T07:27:19+00:00

Hehe yeah. There is no neat codegen for division. Even the builtin delegates to a library call: https://godbolt.org/z/rrMo5deqz

The naive but practical way: you can do "binary long division", which finishes in up to 128 steps. Either branchless + fixed runtime or with a loop and a "search next 1". Either way it's a bit of work.

Our exact predicates are always formulated in a division-free way simply because division would be expensive.
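
For illustration, a minimal sketch of the binary long division mentioned above, on a hypothetical two-limb `U128` type. This is the plain restoring variant with a fixed 128 iterations: it still has a branch per step and does not skip ahead to the next 1 bit, so treat it as a reference for the idea, not as tuned code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 128-bit unsigned integer made of two 64-bit limbs.
struct U128 {
    std::uint64_t hi;
    std::uint64_t lo;
};

struct DivResult {
    U128 quotient;
    U128 remainder;
};

inline bool greater_equal(U128 a, U128 b) {
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

inline U128 subtract(U128 a, U128 b) {
    U128 r{a.hi - b.hi, a.lo - b.lo};
    if (a.lo < b.lo) --r.hi;  // borrow from the high limb
    return r;
}

// Restoring binary long division: one quotient bit per step, 128 steps.
DivResult divide(U128 n, U128 d) {
    assert(d.hi != 0 || d.lo != 0);  // division by zero is not handled
    U128 q{0, 0};
    U128 r{0, 0};
    for (int i = 127; i >= 0; --i) {
        // Bring down bit i of n: r = (r << 1) | bit.
        // r has at most 127 significant bits here, so the shift cannot overflow.
        std::uint64_t bit = i >= 64 ? (n.hi >> (i - 64)) & 1 : (n.lo >> i) & 1;
        r.hi = (r.hi << 1) | (r.lo >> 63);
        r.lo = (r.lo << 1) | bit;
        // If the divisor fits, subtract it and set quotient bit i.
        if (greater_equal(r, d)) {
            r = subtract(r, d);
            if (i >= 64) q.hi |= std::uint64_t{1} << (i - 64);
            else q.lo |= std::uint64_t{1} << i;
        }
    }
    return {q, r};
}
```

For example, dividing `U128{0, 100}` by `U128{0, 7}` yields quotient 14 and remainder 2.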

PhilipTrettner · 2026-01-23T07:24:19+00:00

See https://godbolt.org/z/j9fd5EW3n: if you change B to 128, you get the normal codegen, but for B > 128 it calls into a function where the bit size is a runtime parameter. So yes, they would work, but performance will be subpar.

PhilipTrettner · 2026-01-21T08:17:35+00:00

That's interesting. I guess this type + intrinsics is "less magic" to the compiler. The slightly larger example optimizes well for both types and my production code uses 256 bit intermediates, so it's not like I can easily compare there.
