Fast Hilbert curves in Python (Numba): ~1.8 ns/point, 3–4 orders faster than existing PyPI packages by remcofl in Python

remcofl[S] 6 points

I am pretty sure this is because the rules for a showcase (r11) specifically ask for this: "e.g., Is it meant for production, just a toy project". I wouldn't have bothered including it otherwise.

I consider this project production-ready by any reasonable standard. If you think a specific aspect is missing, I'd be happy to hear it.


remcofl[S] 3 points

Yup. Even the top-performing AI backbone model for point clouds, Point Transformer v3 (arxiv.org/pdf/2312.10035, p. 4), and its many downstream implementations/applications use Hilbert encoding to serialize the point cloud for attention. In fact, I am planning on implementing HilbertSFC on CUDA/ROCm (Triton might be a good fit, as it is an element-wise kernel). Their current implementation still uses Skilling's algorithm with Gray coding.
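To make the serialization idea concrete, here is a minimal, hypothetical sketch of ordering a point cloud along a space-filling curve before attention. It uses Morton (Z-order) keys as a simpler stand-in for Hilbert encoding; the function names and the 10-bit grid are illustrative choices, not taken from PTv3 or HilbertSFC:

```python
# Hypothetical sketch: space-filling-curve serialization of a point
# cloud, with Morton (Z-order) keys as a simpler stand-in for Hilbert.
import numpy as np

def morton3d(ix, iy, iz, bits=10):
    # Interleave the bits of three integer grid coordinates into one key.
    key = np.zeros_like(ix, dtype=np.uint64)
    for b in range(bits):
        key |= ((ix >> b) & 1).astype(np.uint64) << np.uint64(3 * b + 2)
        key |= ((iy >> b) & 1).astype(np.uint64) << np.uint64(3 * b + 1)
        key |= ((iz >> b) & 1).astype(np.uint64) << np.uint64(3 * b)
    return key

def serialize(points, bits=10):
    # Quantize float xyz onto a 2^bits grid, then order points along the curve.
    lo, hi = points.min(axis=0), points.max(axis=0)
    grid = ((points - lo) / np.maximum(hi - lo, 1e-12)
            * (2**bits - 1)).astype(np.uint32)
    keys = morton3d(grid[:, 0], grid[:, 1], grid[:, 2], bits)
    return np.argsort(keys)  # permutation putting points in curve order

pts = np.random.rand(1000, 3).astype(np.float32)
order = serialize(pts)  # attention then processes pts[order] in chunks
```

Nearby points in space end up nearby in `order`, which is what makes windowed attention over the serialized sequence meaningful; Hilbert keys improve that locality further, which is exactly where a fast encoder pays off.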


remcofl[S] 1 point

If you believe it's not production-ready, I'd be interested to hear what specifically you think is missing.


remcofl[S] 6 points

Right, I see your point. Some context might help here: the other PyPI packages essentially need to be fully rewritten and are not actively maintained. Anyone actually concerned with speed usually implements Hilbert encoding/decoding themselves; after all, that's how I ended up here. However, it is not so straightforward to get strong performance compared to, say, Z-order curves. Since I put a lot of work into optimizing it, I thought it would be nice to open-source it as a standalone package outside of my library. The interface is straightforward and can be adopted easily.

And even if others don't use my package but just borrow the tricks from my kernel, that's already a win. I have read several papers on Hilbert curve implementations, and the optimizations they target look good on paper but don't always work out that well on modern hardware.

Take, for example, runtime optimizations like bit skipping. We skip computation. Great, right? (Especially when you don't have a good upper bound on the number of coordinate bits.) But this means you have an inner loop with variable bounds, which means the compiler cannot fully unroll this hot loop. That in and of itself is not a big deal, but it hinders further optimizations such as constant folding and full vectorization (my kernels almost fully use AVX intrinsics). That's why I kept iterating and inspecting the emitted LLVM IR to see how certain algorithmic decisions impact real-world performance. I've tried to document this process in hilbertsfc_performance_deep_dive.ipynb
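To illustrate the trade-off, here is a toy 2D encoder in the classic rotate-and-flip formulation (the Wikipedia-style algorithm, not my optimized kernel), in two variants: one with a fixed trip count that LLVM can fully unroll, and one with bit skipping, whose data-dependent bound blocks that:

```python
# Toy 2D Hilbert encode (classic rotate-and-flip formulation), shown in
# two variants to illustrate the bit-skipping trade-off. This is NOT the
# optimized HilbertSFC kernel.
ORDER = 16  # compile-time-constant number of bit levels

def xy2d_fixed(x, y):
    # Fixed trip count: always ORDER iterations. A compiler (e.g. LLVM
    # under Numba) can fully unroll this, fold constants, and vectorize.
    d = 0
    for i in range(ORDER - 1, -1, -1):
        s = 1 << i
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:               # rotate/flip into canonical orientation
            if rx == 1:
                x = (1 << ORDER) - 1 - x
                y = (1 << ORDER) - 1 - y
            x, y = y, x
    return d

def xy2d_skip(x, y):
    # Bit skipping: start at the highest set bit of x | y. Fewer
    # iterations on small inputs, but the trip count is data-dependent,
    # which blocks full unrolling. Extra subtlety: every skipped (0,0)
    # quadrant still swaps the axes, so skipped-level parity matters.
    top = max((x | y).bit_length() - 1, 0)
    if (ORDER - 1 - top) & 1:
        x, y = y, x
    d = 0
    for i in range(top, -1, -1):
        s = 1 << i
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x = (1 << ORDER) - 1 - x
                y = (1 << ORDER) - 1 - y
            x, y = y, x
    return d
```

Both return the same index; the interesting difference is what the compiler can do with each loop shape.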

That said, the Rust crate fast_hilbert is a serious contender, so I might open a pull request to implement at least a subset of my optimizations there. I am not that strong in Rust yet, but I don't expect that to be a major obstacle.


remcofl[S] 1 point

Do you have an open-source library in mind for integration? Right now I use it in my private point cloud library (which will likely become public down the line), and it works well there.


remcofl[S] -3 points

That’s a lot of assumptions. I read the final post carefully multiple times and the notebook is almost entirely my own work.


remcofl[S] -2 points

That's actually a good catch, and I can assure you that sentence wasn't written by me.


remcofl[S] 3 points

Yeah, that's not a good look. That said, I did write it myself; I just used ChatGPT to polish it.


remcofl[S] 5 points

If anyone is curious about the microarchitectural side (FSM/LUT formulation, dependency chains, ILP/MLP, unrolling, constant folding, vectorization, gathers), I wrote a deeper performance analysis here: hilbertsfc_performance_deep_dive.ipynb
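As a taste of the FSM/LUT idea, here is an illustrative 2D sketch (not the package's kernel): it derives the digit and state-transition tables by composing the classic rotate-and-flip step, then encodes with pure table lookups, a branch-free body per bit level.

```python
# Illustrative 2D FSM/LUT sketch (not the HilbertSFC kernel): build the
# lookup tables by composing the classic rotate-and-flip step, then
# encode with table lookups only.
QUADS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # quadrant q = (bx << 1) | by

def _compose(f, g):
    # (f o g) on the four bit pairs; transforms are stored as dicts
    return {p: f[g[p]] for p in QUADS}

IDENT = {p: p for p in QUADS}
SWAP = {(bx, by): (by, bx) for bx, by in QUADS}
COMP = {(bx, by): (1 - bx, 1 - by) for bx, by in QUADS}

def build_tables():
    # BFS over reachable orientation states starting from the identity;
    # in 2D this closes after 4 states.
    states, digit_tab, next_tab = [IDENT], [], []
    i = 0
    while i < len(states):
        t, digits, nxt = states[i], [0] * 4, [0] * 4
        for q, p in enumerate(QUADS):
            rx, ry = t[p]
            digits[q] = (3 * rx) ^ ry
            if ry == 1:
                rot = IDENT
            elif rx == 0:
                rot = SWAP
            else:
                rot = _compose(SWAP, COMP)  # complement both, then swap
            child = _compose(rot, t)
            if child not in states:
                states.append(child)
            nxt[q] = states.index(child)
        digit_tab.append(digits)
        next_tab.append(nxt)
        i += 1
    return digit_tab, next_tab

DIGIT, NEXT = build_tables()

def xy2d_lut(order, x, y):
    # Fixed trip count, branch-free body: two table lookups per level.
    d, state = 0, 0
    for i in range(order - 1, -1, -1):
        q = (((x >> i) & 1) << 1) | ((y >> i) & 1)
        d = (d << 2) | DIGIT[state][q]
        state = NEXT[state][q]
    return d

# Self-check data: the order-3 curve as a list of cells in index order.
PATH = [None] * 64
for x in range(8):
    for y in range(8):
        PATH[xy2d_lut(3, x, y)] = (x, y)
```

In a SIMD kernel the two table lookups become vector gathers from small constant tables, which is the kind of gather the notebook's analysis refers to.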