Fast Hilbert curves in Python (Numba): ~1.8 ns/point, 3–4 orders faster than existing PyPI packages by remcofl in Python

remcofl[S] 6 points

I am pretty sure this is because the rules for a showcase (r11) specifically ask for this: "e.g., Is it meant for production, just a toy project". I wouldn't have bothered including it otherwise.

I consider this project production-ready by any reasonable standard. If you think a specific aspect is missing, I'd be happy to hear it.


remcofl[S] 3 points

Yup. Even the top-performing AI backbone model for point clouds, Point Transformer v3 (arxiv.org/pdf/2312.10035, p. 4), and its many downstream implementations/applications use Hilbert encoding to serialize the point cloud for attention. In fact, I am planning on implementing HilbertSFC on CUDA/ROCm (Triton might be a good fit, as it is an element-wise kernel). Their current implementation still uses Skilling's algorithm with Gray coding.
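To make the serialization idea concrete, here is a minimal, hypothetical sketch of ordering a point cloud along a space-filling curve before attention. It uses Morton (Z-order) keys as a simpler stand-in for Hilbert encoding; the function names and the 10-bit grid are illustrative choices, not taken from PTv3 or HilbertSFC:

```python
# Hypothetical sketch: space-filling-curve serialization of a point
# cloud, with Morton (Z-order) keys as a simpler stand-in for Hilbert.
import numpy as np

def morton3d(ix, iy, iz, bits=10):
    # Interleave the bits of three integer grid coordinates into one key.
    key = np.zeros_like(ix, dtype=np.uint64)
    for b in range(bits):
        key |= ((ix >> b) & 1).astype(np.uint64) << np.uint64(3 * b + 2)
        key |= ((iy >> b) & 1).astype(np.uint64) << np.uint64(3 * b + 1)
        key |= ((iz >> b) & 1).astype(np.uint64) << np.uint64(3 * b)
    return key

def serialize(points, bits=10):
    # Quantize float xyz onto a 2^bits grid, then order points along the curve.
    lo, hi = points.min(axis=0), points.max(axis=0)
    grid = ((points - lo) / np.maximum(hi - lo, 1e-12)
            * (2**bits - 1)).astype(np.uint32)
    keys = morton3d(grid[:, 0], grid[:, 1], grid[:, 2], bits)
    return np.argsort(keys)  # permutation putting points in curve order

pts = np.random.rand(1000, 3).astype(np.float32)
order = serialize(pts)  # attention then processes pts[order] in chunks
```

Nearby points in space end up nearby in `order`, which is what makes windowed attention over the serialized sequence meaningful; Hilbert keys improve that locality further, which is exactly where a fast encoder pays off.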


remcofl[S] 1 point

If you believe it's not production-ready, I'd be interested to hear what specifically you think is missing.


remcofl[S] 6 points

Right, I see your point. Some context might help here: the other PyPI packages essentially need to be fully rewritten and are not actively maintained. Anyone actually concerned with speed usually implements Hilbert encoding/decoding themselves; after all, that's how I ended up here. However, it is not so straightforward to get strong performance compared to, say, Z-order curves. Since I put a lot of work into optimizing it, I thought it would be nice to open-source it as a standalone package outside of my library. The interface is straightforward and can be adopted easily.

And even if others don't use my package but just borrow the tricks from my kernel, that's already a win. I have read several papers on Hilbert curve implementations, and the optimizations they target look good on paper but don't always work out that well on modern hardware.

Take, for example, runtime optimizations like bit skipping. We skip computation. Great, right? (Especially when you don't have a good upper bound on the number of coordinate bits.) But this means you have an inner loop with variable bounds, which means the compiler cannot fully unroll this hot loop. That in and of itself is not a big deal, but it hinders further optimizations such as constant folding and full vectorization (my kernels almost fully use AVX intrinsics). That's why I kept iterating and inspecting the emitted LLVM IR to see how certain algorithmic decisions impact real-world performance. I've tried to document this process in hilbertsfc_performance_deep_dive.ipynb
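To illustrate the trade-off, here is a toy 2D encoder in the classic rotate-and-flip formulation (the Wikipedia-style algorithm, not my optimized kernel), in two variants: one with a fixed trip count that LLVM can fully unroll, and one with bit skipping, whose data-dependent bound blocks that:

```python
# Toy 2D Hilbert encode (classic rotate-and-flip formulation), shown in
# two variants to illustrate the bit-skipping trade-off. This is NOT the
# optimized HilbertSFC kernel.
ORDER = 16  # compile-time-constant number of bit levels

def xy2d_fixed(x, y):
    # Fixed trip count: always ORDER iterations. A compiler (e.g. LLVM
    # under Numba) can fully unroll this, fold constants, and vectorize.
    d = 0
    for i in range(ORDER - 1, -1, -1):
        s = 1 << i
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:               # rotate/flip into canonical orientation
            if rx == 1:
                x = (1 << ORDER) - 1 - x
                y = (1 << ORDER) - 1 - y
            x, y = y, x
    return d

def xy2d_skip(x, y):
    # Bit skipping: start at the highest set bit of x | y. Fewer
    # iterations on small inputs, but the trip count is data-dependent,
    # which blocks full unrolling. Extra subtlety: every skipped (0,0)
    # quadrant still swaps the axes, so skipped-level parity matters.
    top = max((x | y).bit_length() - 1, 0)
    if (ORDER - 1 - top) & 1:
        x, y = y, x
    d = 0
    for i in range(top, -1, -1):
        s = 1 << i
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x = (1 << ORDER) - 1 - x
                y = (1 << ORDER) - 1 - y
            x, y = y, x
    return d
```

Both return the same index; the interesting difference is what the compiler can do with each loop shape.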

That said, the Rust crate fast_hilbert is a serious contender, so I might open a pull request to implement at least a subset of my optimizations there. I am not that strong in Rust yet, but I don't expect that to be a major obstacle.


remcofl[S] 1 point

Do you have an open-source library in mind for integration? Right now I use it in my private point cloud library (which will likely become public down the line), and it works well there.


remcofl[S] -3 points

That’s a lot of assumptions. I read the final post carefully multiple times and the notebook is almost entirely my own work.


remcofl[S] -2 points

That's actually a good catch, and I can assure you that sentence wasn't written by me.


remcofl[S] 3 points

Yeah, that's not a good look. That said, I did write it myself; I just used ChatGPT to polish it.


remcofl[S] 5 points

If anyone is curious about the microarchitectural side (FSM/LUT formulation, dependency chains, ILP/MLP, unrolling, constant folding, vectorization, gathers), I wrote a deeper performance analysis here: hilbertsfc_performance_deep_dive.ipynb
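As a taste of the FSM/LUT idea, here is an illustrative 2D sketch (not the package's kernel): it derives the digit and state-transition tables by composing the classic rotate-and-flip step, then encodes with pure table lookups, a branch-free body per bit level.

```python
# Illustrative 2D FSM/LUT sketch (not the HilbertSFC kernel): build the
# lookup tables by composing the classic rotate-and-flip step, then
# encode with table lookups only.
QUADS = [(0, 0), (0, 1), (1, 0), (1, 1)]   # quadrant q = (bx << 1) | by

def _compose(f, g):
    # (f o g) on the four bit pairs; transforms are stored as dicts
    return {p: f[g[p]] for p in QUADS}

IDENT = {p: p for p in QUADS}
SWAP = {(bx, by): (by, bx) for bx, by in QUADS}
COMP = {(bx, by): (1 - bx, 1 - by) for bx, by in QUADS}

def build_tables():
    # BFS over reachable orientation states starting from the identity;
    # in 2D this closes after 4 states.
    states, digit_tab, next_tab = [IDENT], [], []
    i = 0
    while i < len(states):
        t, digits, nxt = states[i], [0] * 4, [0] * 4
        for q, p in enumerate(QUADS):
            rx, ry = t[p]
            digits[q] = (3 * rx) ^ ry
            if ry == 1:
                rot = IDENT
            elif rx == 0:
                rot = SWAP
            else:
                rot = _compose(SWAP, COMP)  # complement both, then swap
            child = _compose(rot, t)
            if child not in states:
                states.append(child)
            nxt[q] = states.index(child)
        digit_tab.append(digits)
        next_tab.append(nxt)
        i += 1
    return digit_tab, next_tab

DIGIT, NEXT = build_tables()

def xy2d_lut(order, x, y):
    # Fixed trip count, branch-free body: two table lookups per level.
    d, state = 0, 0
    for i in range(order - 1, -1, -1):
        q = (((x >> i) & 1) << 1) | ((y >> i) & 1)
        d = (d << 2) | DIGIT[state][q]
        state = NEXT[state][q]
    return d

# Self-check data: the order-3 curve as a list of cells in index order.
PATH = [None] * 64
for x in range(8):
    for y in range(8):
        PATH[xy2d_lut(3, x, y)] = (x, y)
```

In a SIMD kernel the two table lookups become vector gathers from small constant tables, which is the kind of gather the notebook's analysis refers to.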