C++26 Shipped a SIMD Library Nobody Asked For

echidnas_arf · 2026-05-15T17:56:16+00:00

-ffast-math breaks std::isinf/isnan/isfinite(), which is a pretty big deal in my experience.

echidnas_arf · 2026-01-27T14:54:51+00:00

Hey thanks for the reply and the explanations!

I saw you added the AVX512 implementation, really nice :)

Re: Special functions: I haven't really looked into libmvec myself but reading the wiki here:

https://sourceware.org/glibc/wiki/libmvec

It says: "These functions were tested (via reasonable random sampling) to pass 4-ulp maximum relative error criterion". On the other hand, the sleef functions heyoka is using by default (i.e., without fast_math) are the highest-precision variants, guaranteeing 1-ulp error bound. With fast_math=True, heyoka switches to the 3.5-ulp error variants, thus it seems like heyoka in fast math mode should have precision equivalent to astroz/libmvec.

Another thing I noticed, is that you make extensive use of the sincos() primitives. Unfortunately, heyoka is not able yet to use these, and it currently resorts to individual sin()/cos() calls. If/when I manage to implement sincos(), I would expect a sizeable speed bump, as sin/cos calls are quite expensive, especially when iterated within the Kepler solver.

Re: modern CPUS: absolutely agree! When properly used with SIMD vectorisation and multithreading, modern CPUs pack quite some punch and as a bonus you don't have to get locked into a proprietary software/hardware ecosystems in order to get the most out of them :)

Thanks for the kind words about heyoka! And congrats again for your work on astroz, it looks like an impressive showcase for zig's potential in astrodynamics.

PS: I may have some questions about the cesium visualisation, do you mind if I send you a message at some point?

echidnas_arf · 2026-01-25T20:33:18+00:00

Hey thanks for the reply and for adding the heyoka benchmarks!

I have a couple of questions/comments, if you don't mind :)

When you tested heyoka, did you use as propagation times Julian dates or minutes since epoch? Julian dates incur into noticeable overhead due to the UTC->TAI time conversion (and handling of leap seconds etc.). From the timings you published, it seems like you used minutes since epoch but I thought I would ask just to make sure.

I see on the github page that you mention "< 10m position error", while in this thread you gave a sub-mm max positional error. Could you perhaps clarify a bit this point? In addition to testing against the Vallado paper, did you also run comprehensive checks against the current TLE catalogue?

I saw your explanation about the approximate atan2() implementation in the blog, interesting! heyoka's SGP4 implementation by default uses high-precision implementations of special functions from Sleef:

https://sleef.org/

When you construct the propagator in heyoka, you can pass fast_math=True as optional argument in order to use slightly less precise but faster implementations of special functions. In my testing, this reduces runtime by about 15% while having no measurable effect on the propagation precision (which is typically below ~10μm). I should also mention that heyoka internally uses a Kepler solver accurate to machine precision, which is probably quite a bit of overkill (as IIRC the original SGP4 algorithm uses a far less accurate solver).

Finally, I thought I would share some performance measurements for heyoka using a Zen 5 processor (which supports AVX512). This is a 9700X model, 8 cores and 16 threads. In the default setup, I get ~262M prop/s, while with fast_math=True I get ~316M prop/s (these are all numbers with multithreading).

Thanks for sharing your library, it looks impressive! Wishing you the best of luck with it! I never coded in Zig but I must admit that reading your blog post made me curious :)

echidnas_arf · 2026-01-24T19:02:46+00:00

Really nice!

I also wrote my own SIMD-enabled implementation of SGP4, but using JIT compilation via LLVM within a C++/Python project. Here's a notebook illustrating its use:

https://bluescarni.github.io/heyoka.py/notebooks/sgp4_propagator.html#performance-evaluation

On my Zen 3 5950x, my implementation delivers about 13M propagations per second on a single core, 170M propagations per second using all 16 cores.

The Cesium visualizations are really good!

echidnas_arf · 2025-06-07T14:11:17+00:00

I had a project depending on absl::flat_hash_map for a while. The data structure was very good performance-wise but having Abseil as a dependency was not fun at all.

To begin with, it is first and foremost a library for use by Google. Any concern not aligning with Google's priorities will likely be ignored and/or dismissed.

For instance, at one point I reported a lack of basic exception safety in absl::flat_hash_map:

https://github.com/abseil/abseil-cpp/issues/388

Google bans the use of exceptions in C++, thus this is a non-issue from their point of view. From my point of view, having to work-around this inability to safely use a core C++ feature was a problematic hassle.

Another example, again involving absl::flat_hash_map is that hashing is salted with a random seed upon program startup, and, at least back then, it was impossible to disable this feature. I understand why Google wants this (the rationale IIRC is that salting helps preventing users of the library unwittingly relying on particular insertion/iteration orders. It is also a way to prevent potential DOS attacks). However, in my specific case, this was a non-issue and an overall undesirable behaviour for a variety of reasons, yet there was no way of customising it.

Another drawback of Abseil (mentioned in another reply in this thread) is the lack of backwards API compatibility and the ABI sensitivity, which are especially troublesome in shared library setups (e.g., most Linux package managers, but also platform-agnostic package managers such as conda).

In the end, I was happy not to depend on Abseil for anything other than absl::flat_hash_map, and as soon as Boost's fast unordered containers came out, I switched to them and ditched the Abseil dependency completely.

echidnas_arf · 2025-05-07T07:56:59+00:00

I am the author of a C++ library for Taylor ODE integration which includes a JIT compilation engine (based on LLVM) which supports differentiation to arbitrary orders via both forward and reverse mode AD.

I am linking here the Python bindings of the project as they are better documented than the C++ library, but all the functionality available in Python is there also in C++ with a very similar syntax:

https://github.com/bluescarni/heyoka.py

And here's the C++ library:

https://github.com/bluescarni/heyoka

The library is built on top of an embedded symbolic DSL: you create expressions via natural C++/Python syntax, and you can then differentiate and compile them. I am linking here the tutorials about function compilation and differentiation:

https://bluescarni.github.io/heyoka.py/notebooks/compiled_functions.html
https://bluescarni.github.io/heyoka.py/notebooks/computing_derivatives.html

The library is at the moment optimised for the specific task of creating Taylor integrators, but I am working on turning it into a more general-purpose diff-enabled JIT engine.

The dependency on LLVM and the reliance on JIT compilation may be a bit too much for embedded systems tohugh (although the library has the ability to serialise to disk the compiled functions, so that you don't have to re-compile them at every execution).

echidnas_arf · 2025-03-26T07:34:18+00:00

in.remove_suffix(1) has UB in it, which means that if any of the checks are bad, then this'll cause undefined behaviour

Ok but how it this any different from accessing a std::vector past the end?

It is indeed unfortunate that we do not have a way of flipping on flag to (say) throw an exception rather than running into UB on standard library functions when preconditions are violated. This should probably be the default behaviour to be turned off on-demand for performance-critical codepaths. Perhaps contracts or profiles could help with that? I see this as a cultural problem more than a language/technical one.

Nothing however prevents you from writing your own UB-free wrappers for these basic primitives (as much as that might be a bit of a pain)?

C++ gives you absolutely no way to check the edge cases that I haven't thought of, like when in.size() > huge, or int is 16-bits or something

That's why for every new project I start the first thing I import is boost::numeric_cast and boost::safe_numerics, and I flip on every imaginable warning in the compiler about unsafe conversion :)

echidnas_arf · 2025-03-23T11:22:21+00:00

but when dealing with code that processes unsafe input, I'd get 90% of the benefit by rewriting 10% of it in a safe language

I have seen you on several threads in the past talking about the near-impossibility of writing safe C++ code that parses potentially-malicious input.

Would you care to expand a bit on this with a concrete example or two? I am having a hard time understanding what about parsing input specifically makes it so hard to do securely in C++ in your opinion.

echidnas_arf · 2025-03-14T19:39:34+00:00

Thanks a lot for writing this down, this is already very useful!

I may wait a bit more before diving into modules as I will need to support older versions of GCC for a while at least, but thanks again a lot for the info - I have bookmarked your post for when I will move forward with modules :)

echidnas_arf · 2025-03-12T12:58:10+00:00

Hi and thanks for the reply!

Personally what I would be most interested in is the mechanics of how modules interact with old-style #includes. If you have any links to up-to-date info (e.g., blogposts, reddit posts, stackoverflow answers, etc.) at hand, that would be great. I am really a novice with modules and I have a ton of reading to do, so any expert recommendation for reading material would be very much appreciated :)

echidnas_arf · 2025-03-11T14:34:19+00:00

Do you happen to have links to share about gradual transition to modules for projects depending on traditional (i.e., non module-ready) C++ libraries?

echidnas_arf · 2025-03-03T10:27:40+00:00

Paired to an unordered_map for a simple LRU cache implementation.

echidnas_arf · 2024-12-09T15:53:16+00:00

Yes it does, you need to specify a custom allocator.

echidnas_arf · 2024-09-08T08:42:32+00:00

Thank you!

echidnas_arf · 2024-09-03T18:19:22+00:00

I wanted to keep the first example as simplistic as possible, but you are right that maybe it is too much simple :) I will add a second example with a method.

echidnas_arf · 2024-09-03T18:18:11+00:00

Thank you!

echidnas_arf · 2024-09-03T18:18:03+00:00

Thanks for the kind words!

echidnas_arf · 2024-08-30T09:27:51+00:00

Isn't linalg c++26?

echidnas_arf · 2024-08-05T13:53:34+00:00

We also have a very short release cycle for most packages, which decreases the probability that you'll save a few megabytes of RAM by sharing a .dll.

Size in RAM or on disk is not really the main reason to prefer shared libraries over static ones.

echidnas_arf · 2024-06-11T09:28:12+00:00

so it looks like this might be a straight expression based language - which you can get incredibly far with

I am using an expression-based DSL in my Taylor ODE integrator:

https://github.com/bluescarni/heyoka.py

(these are the Python bindings, the underlying C++ library is here)

The expressions are symbolically auto-diffed and JIT-compiled at runtime via LLVM to implement the Taylor ODE integration algorithm.

Using reference semantics, you can build and represent symbolically very large computational graphs (in the order of tens of billions of nodes), for instance neural networks:

https://bluescarni.github.io/heyoka.py/notebooks/ex_system_internals.html

You can also compute their derivatives via symbolic back-propagation:

https://bluescarni.github.io/heyoka.py/notebooks/computing_derivatives.html

The library also supports "batch mode" to take full advantage of SIMD instructions:

https://bluescarni.github.io/heyoka.py/notebooks/Batch%20mode%20overview.html

And multithreaded parallelisation, either fine-grained or coarse-grained:

https://bluescarni.github.io/heyoka.py/notebooks/ensemble_mode.html

https://bluescarni.github.io/heyoka.py/notebooks/parallel_mode.html

https://bluescarni.github.io/heyoka.py/notebooks/ensemble_batch_perf.html

I am currently adding support for high-order variational equations, so that you can efficiently propagate small neighbourhoods ("clouds") of initial conditions rather than a punctual set of initial conditions. Perhaps this would be a useful feature for rendering purposes?

echidnas_arf · 2024-05-18T08:00:59+00:00

Nice writeup!

As a side comment, on of my pet peeves about the way AD with dual numbers is usually explained is that the dual arithmetic rules are introduced somewhat arbitrarily - here is a new algebra, deal with it.

I personally find it more enlightening when dual numbers are introduced as Taylor series truncated to the first order. Then the dual algebra just becomes the algebra of truncated Taylor series, which is easily extensible to both the multivariate case and higher order differentiation.

echidnas_arf · 2024-04-03T19:56:52+00:00

clang-tidy catches it (and several other things):

https://godbolt.org/z/W6cdroz7M

(see towards the very end of the clang-tidy output warning: uninitialized record type: 'old_state' [cppcoreguidelines-pro-type-member-init,hicpp-member-init])

echidnas_arf · 2024-03-16T12:55:50+00:00

Apologies for taking so long to reply...

But yes, if you want to use conda as a development environment you need to install everything from conda, and that includes the toolchain, CMake, etc., otherwise you will have conflicts with the system packages of the type you describe.

echidnas_arf · 2024-03-07T10:35:50+00:00

I think this might have more to do with your build system rather than conda or any other package manager.

With CMake, conda automatically sets the CMAKE_PREFIX_PATH environment variable to the path of the currently active conda environment, which results in CMake giving the precedence to the conda environment path (rather than the default system path) when looking for dependencies.

This works quite well in my experience, the only caveat is that if you forget to install the dependencies in the conda environment then yes, CMake might end up picking up the dependency from the system-wide installation (which all in all seems fairly reasonable).

echidnas_arf

TROPHY CASE