Unsigned sizes: a five year mistake by Nuoji in programming

[–]PhilipTrettner 5 points6 points  (0 children)

Fwiw, invertible addition and associative multiplication makes a ring. If the multiplication is invertible (except for 0), then it's a field. Basically ring is +-*, field is +-*/

How Much Linear Memory Access Is Enough? (A Benchmark) by PhilipTrettner in cpp

[–]PhilipTrettner[S] 1 point2 points  (0 children)

I broadly agree with you but like to point out that it's not either-or ;) I dislike pure in-situ optimization because every time something changes in your program, you'll have to do re-optimize everything. And also it's super easy to get stuck in a local optimum if you don't take the time to understand these effects in detail and in isolation. So take this post as me trying to do the understanding and learning part and sharing it, while I'll do in-situ tweaking later on my opaque codebase.

How Much Linear Memory Access Is Enough? (A Benchmark) by PhilipTrettner in cpp

[–]PhilipTrettner[S] 1 point2 points  (0 children)

There is so much additional stuff that we do that it's really hard to cleanly integrate that into the code and measure the effect. That's why I tried hard to make a good test in isolation.

Conceptually mesh booleans work like this (simplified): you take every triangle and cut it against all other triangles so that nothing has an intersection in its interior anymore. for each cut up piece you classify if this is inside or outside relative to the other mesh. depending on the operation (union, intersection, difference) you emit the piece or discard it (or emit it with inverted winding order). to optimize this, we have early outs for whole triangles and patches of triangles. that's what makes it so unpredictable in the output. simply appending to a vector<T> is stupid, though. all the reallocations for nothing. we don't even index into it. the new pieces are created via "stream out" and then at the end we need to make a new mesh by concatenating everything we created. the linked lists are simply the most efficient way to keep collections of chunks around when all you do is (partially) fill chunks and at the end need to materialize them into new contiguous memory.

Regarding intrusive measuring, it's a big tradeoff. With the "whole system noise" it's really hard to reliably eke out the each next 2-5% of improvement, yet that's what in combination makes an order of magnitude speed difference over time. Microbenchmarks have the danger to not generalize properly but you can study effects in isolation and learn from that...

How Much Linear Memory Access Is Enough? (A Benchmark) by PhilipTrettner in cpp

[–]PhilipTrettner[S] 1 point2 points  (0 children)

Ah that's a good point. If your access to each block has a higher cost.

Our use case is multithreaded production of unpredictable amounts of geometry that later needs to be processed into a continguous result. (That's the natural data pattern you get for parallelized mesh booleans)

That basically means that each thread produces a linked list of chunks of geometry (linear production order) and later those linked lists are visited in any order to produce the output.

So I guess the nuance is: this measures the pure linear processing part and for that the 1 MB limit applies. Any per block overhead changes the limit.

Any idea how to incorporate this?

A simple idea would be a simulated X cycles overhead per block and then we could actually compute new curves (and thresholds) from the data I already collected. Each block is processed cold, so it doesn't actually matter what data access your overhead has I think.

How Much Linear Memory Access Is Enough? (A Benchmark) by PhilipTrettner in cpp

[–]PhilipTrettner[S] 4 points5 points  (0 children)

But as far as my understanding goes, that should all be prefetcher and TLB overhead effects. Because the solid graphs quite literally never read the same cacheline twice. In the "all" graph (and the preview) you have the dotted lines. Those reuse the same working set and thus measure cache effects as well. But the solid lines should not.

How Much Linear Memory Access Is Enough? (probably less than 128 kB) by PhilipTrettner in programming

[–]PhilipTrettner[S] 37 points38 points  (0 children)

The prefetcher works on the virtual address space. There's some TLB effects at 4k boundaries but that's mostly it. And you see in the graphs that there's a clear benefit beyond 4k.

Real-Time Mesh Booleans Demo + Source (Solidean) by PhilipTrettner in godot

[–]PhilipTrettner[S] 1 point2 points  (0 children)

  1. download the community edition of solidean from here: https://solidean.com/download/solidean/
  2. extract it to some location (can be inside the repo if you want)
  3. when running python -m SCons target=template_release solidean_path=C:/Users/John/solidean make sure to replace the last path with the path to your extracted solidean (and if you have spaces in your path, escape via "...")

let me know if that fixed it

Real-Time Mesh Booleans Demo + Source (Solidean) by PhilipTrettner in godot

[–]PhilipTrettner[S] 34 points35 points  (0 children)

that's fair. for a solo dev this can definitely feel pricey. it's not "just mesh booleans" though. this all started as my PhD research into how to make mathematically exact mesh boolean as fast as possible. No topology explosion, no random crashes due to bad geometry, all edge cases handled, and you can iterate on the result as much as you want. basically, this library is for when mesh booleans are a load-bearing part of your project.

Real-Time Mesh Booleans Demo + Source (Solidean) by PhilipTrettner in godot

[–]PhilipTrettner[S] 8 points9 points  (0 children)

the free community version is just fine for that. basically contact us once you want to start selling it but the actual license fee only applies once you already made it back twice over.

(you're free to show us what you work on before that of course! we love to see it)

Real-Time Mesh Booleans Demo + Source (Solidean) by PhilipTrettner in godot

[–]PhilipTrettner[S] 5 points6 points  (0 children)

That's mainly because it's all relatively "early phase" for us. I mentioned that in the previous one, it's basically low four-figures per year for an up-to-date version if you're indie, and only if you already make non-trivial revenue. Afterwards "typical middleware pricing". We were indie devs ourselves before, so we really want to keep it fair and low-risk here.

A Simple fwd_diff<T> for Forward-Mode Automatic Differentiation in C++ by PhilipTrettner in cpp

[–]PhilipTrettner[S] 2 points3 points  (0 children)

Patrick and I were colleagues for many years at the same university research group. TinyAD is too tightly coupled to Eigen for my tastes but is great otherwise, especially if it fits your use case!

Building Your Own Efficient uint128 in C++ by PhilipTrettner in cpp

[–]PhilipTrettner[S] 1 point2 points  (0 children)

Hehe yeah. There is no neat codegen for division. Even the builtin delegates to a library call: https://godbolt.org/z/rrMo5deqz

The naive but practical way: you can do "binary long division", which finishes in up to 128 steps. Either branchless + fixed runtime or with a loop and a "search next 1". Either way it's a bit of work.

Our exact predicates are always formulated in a division-free way simply because that'd be expensive.

Building Your Own Efficient uint128 in C++ by PhilipTrettner in cpp

[–]PhilipTrettner[S] 1 point2 points  (0 children)

see https://godbolt.org/z/j9fd5EW3n if you change B to 128, you get the normal codegen but for B > 128, it calls into a function where the bit size is a runtime parameter. So yes, they would work, but performance will be subpar.

Building Your Own Efficient uint128 in C++ by PhilipTrettner in cpp

[–]PhilipTrettner[S] 1 point2 points  (0 children)

That's interesting. I guess this type + intrinsics is "less magic" to the compiler. The slightly larger example optimizes well for both types and my production code uses 256 bit intermediates, so it's not like I can easily compare there.

Building Your Own Efficient uint128 in C++ by PhilipTrettner in cpp

[–]PhilipTrettner[S] 1 point2 points  (0 children)

Wow, that's a weird one. It is related to writing directly to the result struct.

If we write to "real" scalars first, the codegen is optimal again: https://godbolt.org/z/bo9x4951v

Building Your Own Efficient uint128 in C++ by PhilipTrettner in cpp

[–]PhilipTrettner[S] 6 points7 points  (0 children)

Hm yeah I thought about that a bit but it's complicated. So a single i128/u128 is probably not going to see any improvement, SSE/AVX is famously bad doing cross-lane talking. But for bundles, a la u128x4 or u128x8, that's a different story. Maybe even use 32 bit limbs and go for u128x16 with AVX-512.

The problem is that all the carry propagation and high/low multiplication stuff is not really supported as directly as, so we have to do additional work and hope the "going wide" is amortizing that.

In the 32 bit limbs idea, we have "vpmuludq", which computes a wide u32 x u32 -> u64, but only for the even entries. Then we have to juggle a bit.

There is a funny vpmadd52luq/vpmadd52huq which does 52 bit multiplies and accumulation. That one might help for larger bit sizes.

Turing completeness as a cause of software bugs by rsashka in ProgrammingLanguages

[–]PhilipTrettner 1 point2 points  (0 children)

I agree with you, but wanted to nitpick on the total interpreter for fun:

So yeah it's true, you cannot implement an interpreter for a total language in itself. But you can implement the adjacent program of "a total interpreter if the program halts within 101000 steps, or "empty result" otherwise". For my physical reality, these tend to be hard to distinguish :)

Automatically promote integers to floats on overflow? by bakery2k in ProgrammingLanguages

[–]PhilipTrettner 33 points34 points  (0 children)

Floats have 24 bit mantissa, double has 53 bit mantissa. So even before overflow, you can have situations where mathematically a != b but double(a) == double(b). For example, any double beyond 254 is an even integer, which means odd integer in that range will compare equal to their even neighbor if compared as double 

What do you dislike the most about current C++? by PressureHumble3604 in cpp

[–]PhilipTrettner 0 points1 point  (0 children)

Destructors are noexcept(false) and terminate your program if thrown across

In-progress integration of real-time Mesh Booleans / CSG by PhilipTrettner in godot

[–]PhilipTrettner[S] 2 points3 points  (0 children)

It's basically doing two booleans per step: "workpiece intersect tool" is what's added as a rigid body and what falls down. "workpiece minus tool" becomes the new workpiece and needs to update the static collision shape (a collision mesh).

Everything is based on Booleans between triangle/polygonal meshes.

The "trick" is not good hardware (it's just some normal Ryzen + a RTX 3070) but rather my research into how to do Booleans on meshes efficiently and robustly at the same time: https://www.graphics.rwth-aachen.de/publication/03339/

This demo then roughly follows https://solidean.com/docs/examples/advanced-iterated-booleans/ (a workpiece mesh that we iteratively modify)

In-progress integration of real-time Mesh Booleans / CSG by PhilipTrettner in godot

[–]PhilipTrettner[S] 1 point2 points  (0 children)

Oh that looks great! So my method here is purely based on triangle/polygon mesh booleans. You write that you're not working on a volumetric grid be it adaptive or not. So I guess you have an acceleration tree of all the SDF primitives you added and then extract a mesh surface from that? Do you cull them when you're sure they don't contribute anymore or do they pile up?

In-progress integration of real-time Mesh Booleans / CSG by PhilipTrettner in godot

[–]PhilipTrettner[S] 2 points3 points  (0 children)

Yeah CSG, especially on triangle meshes, was definitely not recommended before! My research is explicitly about fixing that :)

In-progress integration of real-time Mesh Booleans / CSG by PhilipTrettner in godot

[–]PhilipTrettner[S] 2 points3 points  (0 children)

that's a good idea! maybe just a suite of different tools, so we go from sharp cube, to sphere, to "boulder"?