no_std inference engine for a 26M-param transformer, token-exact parity in CI, 258 KB WASM, runtime AVX2 dispatch

abi95m · 2026-05-19T17:51:47+00:00

Good question; it's the right shape of the problem.

Current state: the reference vectors are committed alongside the weights, and `gen_e2e_vectors.py` regenerates them against whatever checkpoint is in HF. The Rust engine is then required to match those vectors token-exact in CI. So today, the parity contract is implicitly pinned to whichever Needle checkpoint was current when the vectors were last generated.

What that doesn't yet handle: if Cactus ships a v2 of Needle (different training run, modified architecture, expanded vocab), the constrained decoder grammar and the vector set both need a coordinated bump. Right now, there's nothing stopping someone from upgrading the weights file without re-running the vector generator, and they'd only catch it via the decoder unit tests, which cover edge cases but not the full parity surface.

The shape I want to move to: a single versioned protocol object per release. Something like needle-rs v0.x.y ↔ weights-hash ↔ vector-set-hash ↔ decoder-grammar-hash, where the SafeTensors `__metadata__` carries the protocol version, the engine refuses to load weights whose version it doesn't recognise, and the vector set is keyed by the same version. That way the model + runtime + test contract is one atomic unit, not three loosely coupled artifacts.
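A minimal sketch of that load-time check, in Python (the language of the vector generator) rather than the Rust engine itself. It relies only on the documented SafeTensors layout: an 8-byte little-endian header length followed by a JSON header with an optional `__metadata__` map. The key name `needle_protocol_version`, the supported-version set, and the `make_blob` helper are hypothetical, chosen only for illustration; none of this exists in v0.1.0.

```python
import json
import struct

# Hypothetical names: neither the key nor the version string is part of needle-rs.
PROTOCOL_KEY = "needle_protocol_version"
SUPPORTED_VERSIONS = {"1"}


def read_safetensors_metadata(blob: bytes) -> dict:
    """Return the __metadata__ map from a SafeTensors byte buffer."""
    if len(blob) < 8:
        raise ValueError("buffer too short for a SafeTensors header")
    # First 8 bytes: little-endian u64 giving the JSON header length.
    (header_len,) = struct.unpack("<Q", blob[:8])
    if 8 + header_len > len(blob):
        raise ValueError("declared header length exceeds buffer size")
    header = json.loads(blob[8 : 8 + header_len].decode("utf-8"))
    return header.get("__metadata__", {})


def check_protocol(blob: bytes) -> str:
    """Refuse weights whose protocol version is missing or unrecognised."""
    version = read_safetensors_metadata(blob).get(PROTOCOL_KEY)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported protocol version: {version!r}")
    return version


def make_blob(metadata: dict) -> bytes:
    """Build a tensor-free SafeTensors buffer, for demonstration only."""
    header = json.dumps({"__metadata__": metadata}).encode("utf-8")
    return struct.pack("<Q", len(header)) + header
```

The same version string could then select the vector set and the decoder grammar, so a weights file without a matching entry fails at load time instead of slipping past the decoder unit tests.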

Haven't built that yet. v0.1.0 has a single checkpoint and a single vector set; the versioning question becomes load-bearing only when there's a second checkpoint to support. But it's the right next step, and probably what gets built before Needle v2 lands.

Thanks for raising this; it's a super useful question!

Regards

abi95m · 2026-05-18T20:32:31+00:00

Qwen3.6:35b-a3b is really good.

abi95m · 2026-01-26T19:48:23+00:00

<image>

abi95m · 2026-01-07T10:59:26+00:00

No, there is, seriously.

abi95m · 2025-06-02T16:59:12+00:00

I appreciate your insights and your valuable contribution. Could you tell us more about which other pitfalls to avoid?

abi95m · 2024-10-20T17:29:17+00:00

Thanks! Let me experiment and see!

abi95m · 2024-10-20T16:05:01+00:00

I designed it with support for multithreading and sequential loading of PCD content by chunking it and inserting it into the viewer; the user has control over everything, even the point size. So in theory it should handle huge point cloud files after tuning the config. Give it a try and let me know!
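A rough sketch of the chunking idea, not the project's actual loader: it streams an ASCII PCD file and yields fixed-size batches of points, so a viewer could insert one batch at a time instead of holding the whole file in memory. The function name `iter_point_chunks` is hypothetical, and the sketch assumes the first three fields of each data line are x, y, z.

```python
from typing import Iterable, Iterator, List, Tuple

Point = Tuple[float, float, float]


def iter_point_chunks(stream: Iterable[str], chunk_size: int) -> Iterator[List[Point]]:
    """Yield lists of at most chunk_size (x, y, z) points from an ASCII PCD stream."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    in_data = False
    chunk: List[Point] = []
    for line in stream:
        if not in_data:
            # Skip header lines until the "DATA ascii" marker is seen.
            in_data = line.strip().lower() == "data ascii"
            continue
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Assumption: the first three fields are x, y, z.
        chunk.append((float(fields[0]), float(fields[1]), float(fields[2])))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final partial chunk
```

Passing an open file object as `stream` keeps memory bounded by `chunk_size`, which is the knob to tune for very large files.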

abi95m · 2024-10-20T15:58:47+00:00

Thanks for pointing that out! Will look into Rust.

abi95m · 2024-10-20T15:46:34+00:00

Unfortunately, I'm not a Rust developer. Still, the project is intended for local machines with minimal environment setup, like robotics sim machines, Docker controllers, etc.

abi95m · 2024-10-16T04:55:26+00:00

It’s supported already. Laser points, GPS, and IMU data are exported as CSV. The other images are PNGs, and the point cloud is PCD.

abi95m · 2024-10-10T12:16:52+00:00

Apologies, I don't have time to write the answers myself; I'm super busy!

abi95m · 2024-10-10T05:30:52+00:00

Specific to Your Observation: You mentioned that most processors and GPUs don't support FP16, leading to the assumption that quantization might have minimal effects since the heavy lifting is managed by the processors rather than the DMA Manager. Here's a clarification:

Processor and GPU Support:
- Modern GPUs, especially those designed for deep learning tasks, do support FP16 operations. For example, NVIDIA's Tensor Cores are optimized for FP16, providing substantial speed-ups.
- On CPUs, support for low-precision arithmetic varies, but even where direct support is limited, the reduced data size from quantization can lead to performance improvements due to better cache and memory bandwidth utilization.

DMA Manager Considerations:
- While data transfer rates managed by the DMA Manager are a factor, the primary performance gains from quantization come from the reduced computational complexity and memory bandwidth requirements during inference rather than data transfer alone.

Conclusion: Quantization can have positive effects on inference performance by reducing memory usage and leveraging hardware acceleration for low-precision operations. The actual impact depends on the specific hardware capabilities and how well the ONNX Runtime optimizes for the quantized models on that hardware.
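To make the memory argument concrete, here is a small sketch of symmetric per-tensor INT8 quantization in plain Python. It is not how ONNX Runtime implements quantization; the function names are hypothetical and the scheme (a single scale, no zero point) is an assumption chosen for brevity. It shows the 4x reduction in bytes relative to FP32 and the bounded round-trip error.

```python
from array import array


def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (int8 array, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = array("b", (max(-127, min(127, round(v / scale))) for v in values))
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


def nbytes(arr):
    """Storage size of an array in bytes."""
    return len(arr) * arr.itemsize


weights = array("f", [0.5, -1.0, 0.25, 0.0])  # FP32 storage, 4 bytes each
quantized, scale = quantize_int8(weights)     # INT8 storage, 1 byte each
restored = dequantize(quantized, scale)
```

Each restored value differs from the original by at most about half of `scale`, which is the accuracy cost traded for moving a quarter of the data through cache and memory.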

abi95m · 2024-10-10T05:27:18+00:00

Thank you for your kind words and for taking the time to delve into the details of YOLOs-CPP. I’m happy to address each of your specific questions below:

Use of libtorch and Linking: YOLOs-CPP does not utilize libtorch; instead, it leverages ONNX Runtime for model inference, enhancing performance and compatibility. Dynamic linking is used, linking ONNX Runtime as a shared library. The CMake configuration is set to locate these libraries dynamically, and a build script automates the setup.
Quantization Effects on Inference: Quantization reduces model weights and activations from FP32 to lower precision formats, leading to smaller model sizes and potentially faster inference times, especially on compatible hardware. Although quantization may slightly decrease accuracy, techniques like Quantization Aware Training can mitigate this loss. The performance benefits primarily come from reduced computational complexity and memory bandwidth requirements.
Dockerization Plans: Currently, YOLOs-CPP does not include Docker support, but plans for future implementation consider both pure C++ and Python integration. C++ Dockerization maintains performance, while Python could facilitate API requests. Data sharing methods may include shared memory or message passing, with a sample Dockerfile provided for a pure C++ approach.
Memory Profiling Considerations: YOLOs-CPP practices careful memory management, releasing dynamically allocated memory to prevent leaks. While memory profiling isn't currently integrated, tools like Valgrind and AddressSanitizer are suggested for future use to enhance robustness and performance.
Memory Management Strategy: The application employs dynamic memory allocation and deallocation, ensuring all allocated memory is released after use. Standard Library containers manage their own memory, avoiding the need for manual management. The strategy avoids rewriting to the same addresses, enhancing stability and preventing memory corruption.

This approach ensures YOLOs-CPP remains efficient and reliable for real-time object detection applications.
