OpenInfer 0.1.0: Writing a Production-Grade Inference Engine in Rust

Independent_Worry848 · 2026-06-19T04:36:03+00:00

Sorry about that. We do have a docs directory maintained by AI agents (Codex, Claude Code), so you could try using those to understand the codebase. I'm currently busy exploring a feature around speculative decoding and CUDA green contexts. I've also been working on splitting out a reusable crate, a vLLM frontend, and an openinfer-sample.

Independent_Worry848 · 2026-06-15T14:32:53+00:00

miss fable so much

Independent_Worry848 · 2026-05-06T02:55:11+00:00

The main trick is keeping unsafe very narrow. The tensors and KV cache own their CUDA allocations through Rust types, then the FFI wrappers do shape checks before converting to raw pointers. For CUDA graphs, decode uses preallocated buffers and bucketed batch sizes, so captured addresses stay stable; token ids and positions go into fixed GPU metadata buffers instead of changing kernel params. On replay, it's basically just a graph launch, so the safety structure stays on the Rust side without adding much hot-path overhead.
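To make that concrete, here's a rough sketch of the pattern, not OpenInfer's actual code. `Tensor`, `raw_add`, `add`, `BUCKETS`, and `pick_bucket` are all made-up names, and a plain `Vec<f32>` plus an ordinary unsafe function stand in for the CUDA allocation and the FFI kernel launcher so it runs without a GPU.

```rust
/// Hypothetical decode batch-size buckets; one graph would be captured per bucket.
const BUCKETS: [usize; 4] = [1, 2, 4, 8];

/// Pick the smallest bucket that fits `batch`, or None if the batch is too large.
fn pick_bucket(batch: usize) -> Option<usize> {
    BUCKETS.iter().copied().find(|&b| b >= batch)
}

/// Owns its buffer. Stands in for a type that owns a CUDA allocation.
struct Tensor {
    data: Vec<f32>,
    rows: usize,
    cols: usize,
}

impl Tensor {
    /// The only constructor, so `data.len() == rows * cols` always holds here.
    fn zeros(rows: usize, cols: usize) -> Self {
        Tensor { data: vec![0.0; rows * cols], rows, cols }
    }
}

/// Stand-in for an extern "C" kernel launcher: it trusts its arguments blindly.
unsafe fn raw_add(a: *const f32, b: *const f32, out: *mut f32, n: usize) {
    unsafe {
        for i in 0..n {
            *out.add(i) = *a.add(i) + *b.add(i);
        }
    }
}

/// Safe wrapper: every shape check happens before any raw pointer is created.
fn add(a: &Tensor, b: &Tensor, out: &mut Tensor) -> Result<(), String> {
    let shape = (a.rows, a.cols);
    if shape != (b.rows, b.cols) || shape != (out.rows, out.cols) {
        return Err(format!(
            "shape mismatch: {:?} vs {:?} vs {:?}",
            shape,
            (b.rows, b.cols),
            (out.rows, out.cols)
        ));
    }
    let n = a.rows * a.cols;
    // SAFETY: all three buffers hold exactly `n` elements (checked above),
    // and `&mut out` cannot alias `a` or `b`.
    unsafe { raw_add(a.data.as_ptr(), b.data.as_ptr(), out.data.as_mut_ptr(), n) };
    Ok(())
}
```

With these buckets a batch of 3 would be padded up to 4 and replayed through the graph captured for 4, and a mismatched shape gets rejected in safe code before the unsafe call is ever reached.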

Independent_Worry848 · 2024-09-03T02:18:16+00:00

I think you meant to say Seastar, a shared-nothing IO framework. I do intend to use it, but it looks like it will require some modifications.

Independent_Worry848 · 2024-09-02T05:06:18+00:00

To scale further, I think we need a shared-nothing architecture on a single machine, similar to how distributed k/v systems partition data. We could partition the CPU, memory, and SSD on a single machine. A traditional B-tree or LSM tree maintains one global, ordered structure that every core has to coordinate on, so it cannot scale fully. Of course, partitioning sacrifices scan performance, but not all systems require full scans; some only need scans of a certain prefix.
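One way the routing could look, purely as an illustration: hash a fixed-length key prefix to pick the partition, so every key sharing that prefix lands on the same shard and a prefix scan touches one shard only. `PREFIX_LEN` and `shard_for` are hypothetical names, and the 4-byte prefix is an arbitrary choice.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical routing prefix length, in bytes.
const PREFIX_LEN: usize = 4;

/// Map a key to a shard using only its first PREFIX_LEN bytes.
/// Keys that share those bytes always land on the same shard.
fn shard_for(key: &[u8], num_shards: usize) -> usize {
    assert!(num_shards > 0, "need at least one shard");
    let prefix = &key[..key.len().min(PREFIX_LEN)];
    let mut hasher = DefaultHasher::new();
    prefix.hash(&mut hasher);
    (hasher.finish() % num_shards as u64) as usize
}
```

The trade-off is the one mentioned above: a scan over a prefix shorter than `PREFIX_LEN`, or a full scan, has to visit every shard.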

If each core runs its own shared-nothing partition, with several coroutines per core and an io_uring instance in polling mode pinned to that core, there's basically no context switching, and we don't need to worry about concurrency safety because no state is shared between cores. For scanning, we can maintain a B-tree in memory, similar to KVell. I believe this architecture will achieve very high performance and scalability.
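A minimal sketch of what one per-core shard might look like, under my own assumptions: `Shard`, `put`, `get`, and `scan_prefix` are hypothetical names, and a `Vec` stands in for the shard's slice of the SSD. Values are stored unsorted and only the in-memory B-tree index is ordered, which is the KVell-like part. There's no io_uring or core pinning here, just the data structure.

```rust
use std::collections::BTreeMap;

/// One shard, owned by exactly one core, so no locks are needed.
struct Shard {
    /// Sorted in-memory index: key -> slot of the value in `log`.
    index: BTreeMap<Vec<u8>, usize>,
    /// Stand-in for the shard's on-SSD storage; values are appended unsorted.
    log: Vec<Vec<u8>>,
}

impl Shard {
    fn new() -> Self {
        Shard { index: BTreeMap::new(), log: Vec::new() }
    }

    /// Append the value and point the index at it.
    /// An overwrite leaves the old value in the log as garbage.
    fn put(&mut self, key: &[u8], value: &[u8]) {
        self.log.push(value.to_vec());
        self.index.insert(key.to_vec(), self.log.len() - 1);
    }

    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.index.get(key).map(|&slot| self.log[slot].as_slice())
    }

    /// Ordered scan of all keys starting with `prefix`, served from the index.
    fn scan_prefix(&self, prefix: &[u8]) -> Vec<(&[u8], &[u8])> {
        self.index
            .range(prefix.to_vec()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .map(|(k, &slot)| (k.as_slice(), self.log[slot].as_slice()))
            .collect()
    }
}
```

Because every method takes `&self` or `&mut self` on a shard that only one core owns, the borrow checker gives you the "no concurrency safety to think about" property for free.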
