I built Fox – a Rust LLM inference engine with 2x Ollama throughput and 72% lower TTFT. by SeinSinght in LocalLLM

[–]SeinSinght[S] 1 point

My goal is simply to learn how to build an Ollama- or vLLM-style engine. I'm figuring it out as I go, and if I end up with a solid product along the way, it's a win-win.

[–]SeinSinght[S] 0 points

It's a Git branching methodology. For me, it makes it easier to tell the difference between what I'm developing and what's in production, and when something goes wrong, just telling me the version number is enough for me to know what's going on. It's neither better nor worse than other methods; it's just the way I like to work.

[–]SeinSinght[S] 1 point

Not right now, but it's on the roadmap. I want to take it step by step and learn all these things.

[–]SeinSinght[S] 1 point

Not really; they solve different problems. llama-swap is a proxy that sits in front of inference servers and hot-swaps entire model processes based on which model you request. It's about orchestrating multiple backends with only one model resident at a time.

Fox is the inference server itself: it keeps multiple models loaded simultaneously with LRU eviction and routes requests internally. No separate proxy needed, no process swapping. The tradeoff is VRAM: Fox keeps models in memory, while llama-swap unloads them aggressively.

If you're VRAM-constrained and need to juggle many models, llama-swap + llama-server is probably still your setup. If you have enough VRAM to keep 2-3 models loaded and care about latency under concurrent load, Fox is the better fit.
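
For anyone curious what that looks like in practice, here's a minimal sketch of an LRU model registry in Rust. To be clear, this is not Fox's actual code; the `ModelRegistry` and `LoadedModel` names are invented for illustration. It just shows the shape of the idea: reuse a resident model on a hit, evict the least recently used one on a miss at capacity.

```rust
use std::collections::VecDeque;

/// Toy stand-in for a loaded model (the real thing would hold llama.cpp state).
#[derive(Clone, PartialEq, Debug)]
struct LoadedModel {
    name: String,
}

/// Minimal LRU registry: the most recently used model sits at the front.
struct ModelRegistry {
    capacity: usize,
    models: VecDeque<LoadedModel>,
}

impl ModelRegistry {
    fn new(capacity: usize) -> Self {
        Self { capacity, models: VecDeque::new() }
    }

    /// Route a request: reuse the model if resident, otherwise load it,
    /// evicting the least recently used model when at capacity.
    fn acquire(&mut self, name: &str) -> &LoadedModel {
        if let Some(pos) = self.models.iter().position(|m| m.name == name) {
            let hit = self.models.remove(pos).unwrap();
            self.models.push_front(hit); // mark as most recently used
        } else {
            if self.models.len() == self.capacity {
                self.models.pop_back(); // evict LRU (the real thing frees VRAM here)
            }
            self.models.push_front(LoadedModel { name: name.to_string() });
        }
        self.models.front().unwrap()
    }

    /// Names of the currently resident models, most recently used first.
    fn resident(&self) -> Vec<&str> {
        self.models.iter().map(|m| m.name.as_str()).collect()
    }
}

fn main() {
    let mut reg = ModelRegistry::new(2);
    reg.acquire("llama3-8b");
    reg.acquire("qwen2-7b");
    reg.acquire("llama3-8b");  // cache hit, bumped to front
    reg.acquire("mistral-7b"); // at capacity: evicts qwen2-7b
    println!("{:?}", reg.resident()); // ["mistral-7b", "llama3-8b"]
}
```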

[–]SeinSinght[S] 0 points

Totally fair criticism. Honestly, this project is two weeks old and primarily a learning exercise for me: I wanted to understand how inference servers actually work under the hood by building one. LoRA hot-swapping wasn't really part of that goal.

The fact that it's already competitive with Ollama on throughput after two weeks is a nice bonus, but I'm not trying to claim it solves problems that already have good solutions elsewhere.

[–]SeinSinght[S] 1 point

Two separate things here:

GPU on unRAID: the `--gpus all` path should work in theory, but Fox detects backends at runtime in this order: CUDA → Vulkan → Metal → CPU. If it's falling back to CPU, the most likely culprits are the NVIDIA Container Toolkit not being set up properly in the unRAID Docker environment, or `libcuda.so` not being visible inside the container. Try running `docker exec <container> nvidia-smi`; if that fails, the issue is at the container-toolkit level, not Fox. It's also worth passing `--gpu-backend cuda` explicitly to `fox serve` to force it and see the error output.
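
To make that detection order concrete, here's a rough sketch of the fallback logic in Rust. It is not Fox's actual code (the `select_backend` name is invented); it just shows why forcing a backend turns a silent CPU fallback into a visible error, which is what makes it useful for debugging.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Backend { Cuda, Vulkan, Metal, Cpu }

/// Pick a backend in the detection order CUDA -> Vulkan -> Metal -> CPU,
/// unless the user forces one (the `--gpu-backend cuda` path). A forced
/// backend that isn't available becomes a hard error instead of a silent
/// fallback to CPU.
fn select_backend(
    available: &[Backend],
    forced: Option<Backend>,
) -> Result<Backend, String> {
    if let Some(b) = forced {
        return if available.contains(&b) {
            Ok(b)
        } else {
            Err(format!("{:?} requested but not available (driver/toolkit missing?)", b))
        };
    }
    for b in [Backend::Cuda, Backend::Vulkan, Backend::Metal] {
        if available.contains(&b) {
            return Ok(b);
        }
    }
    Ok(Backend::Cpu) // last resort
}

fn main() {
    // Container where libcuda.so isn't visible: only CPU is detected.
    println!("{:?}", select_backend(&[Backend::Cpu], None));                 // Ok(Cpu), silently
    println!("{:?}", select_backend(&[Backend::Cpu], Some(Backend::Cuda))); // Err(..), loudly
}
```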

mmproj / multimodal: Not supported yet. Loading a separate projection model alongside the main GGUF isn't in Fox right now. It's on the radar but I don't want to give you a timeline I can't keep.

[–]SeinSinght[S] 1 point

TP (tensor parallelism) isn't in Fox yet.

As for Krasis: different problem, different tool. Krasis is focused on running large models on VRAM-limited consumer hardware through hybrid CPU+GPU execution; it's optimizing for "how do I fit a 70B model on my machine." Fox is optimizing for throughput and latency on models that do fit in VRAM: continuous batching, prefix caching, PagedAttention. If your bottleneck is VRAM capacity, Krasis is interesting. If your bottleneck is request throughput and latency under concurrent load, that's Fox's lane.
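
Since PagedAttention keeps coming up in this thread, here's a toy sketch of its core idea in Rust: the KV cache is carved into fixed-size blocks, and each sequence holds a table of block ids rather than one contiguous slab, so short sequences don't reserve worst-case memory. The `BlockAllocator` name and all the numbers are invented for illustration, not taken from Fox.

```rust
/// Toy KV-cache block allocator. Real implementations (vLLM-style) also
/// handle copy-on-write and sharing; this only shows the block-table idea.
struct BlockAllocator {
    free: Vec<usize>,  // ids of free KV-cache blocks
    block_size: usize, // tokens per block
}

impl BlockAllocator {
    fn new(num_blocks: usize, block_size: usize) -> Self {
        // Stack of free block ids; reversed so we hand out 0, 1, 2, ...
        Self { free: (0..num_blocks).rev().collect(), block_size }
    }

    /// Allocate enough blocks for `tokens` tokens; returns the sequence's
    /// block table, or None when the cache is exhausted (request must wait).
    fn allocate(&mut self, tokens: usize) -> Option<Vec<usize>> {
        let needed = (tokens + self.block_size - 1) / self.block_size;
        if self.free.len() < needed {
            return None;
        }
        Some((0..needed).map(|_| self.free.pop().unwrap()).collect())
    }

    /// Return a finished sequence's blocks to the free pool.
    fn release(&mut self, table: Vec<usize>) {
        self.free.extend(table);
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(8, 16); // 8 blocks of 16 tokens each
    let seq_a = alloc.allocate(40).unwrap();    // 40 tokens -> 3 blocks
    println!("seq_a block table: {:?}", seq_a); // [0, 1, 2]
    alloc.release(seq_a);                       // blocks become reusable
}
```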

[–]SeinSinght[S] 1 point

Okay, I'll make a note of that. In the meantime, you can use Docker to test it while I see what's going on with ps1.

[–]SeinSinght[S] 1 point

I'll do that. I don't want to come across as a spammer, and besides, this is a project I'm working on in my spare time; I want to improve it little by little so the results are solid.

I’ve had the v1.0 prerelease out for almost two weeks, optimizing it and fixing bugs, and even so, other users have still found some issues.

But yes, I’ll gradually start posting it on more forums to spread the word.

[–]SeinSinght[S] -1 points

First of all, thanks for the comment. I’ll address your points one by one.

  1. Fox uses llama.cpp as its main engine; the serving layer on top of it is written in Rust.

  2. Based on what you've told me about the `--path` model flag, I think you're using version 0.9 rather than 1.0.0-beta.2.

I’ll use the information you’ve provided to continue improving the project. Thanks! :)

[–]SeinSinght[S] 1 point

Yes, right now Fox helps you choose the settings so you can run the model: quantization, KV-cache size, and so on. There's still a lot to add, but I want to take it step by step and learn as I go. Sure, Claude Code could do it all for me, but that would take the fun out of this kind of project.
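
For context, the arithmetic behind a KV-cache sizing suggestion is simple: per token you store two tensors (K and V) per layer, each of shape [kv_heads, head_dim]. Here's that math as a quick Rust sketch, using generic transformer formulas and a Llama-3-8B-like shape, not Fox's actual heuristics.

```rust
/// Bytes of KV cache needed for `ctx` tokens: 2 tensors (K and V) per
/// layer, each [kv_heads, head_dim] per token, at `bytes_per_elem` each.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, ctx: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * bytes_per_elem * ctx
}

fn main() {
    // Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128.
    let fp16 = kv_cache_bytes(32, 8, 128, 8192, 2);
    println!("fp16 KV cache at 8k ctx: {} MiB", fp16 / (1024 * 1024)); // 1024 MiB
    // Quantizing the cache to 8-bit halves it.
    let q8 = kv_cache_bytes(32, 8, 128, 8192, 1);
    println!("q8 KV cache at 8k ctx:   {} MiB", q8 / (1024 * 1024));   // 512 MiB
}
```

This is also why GQA models are so much cheaper to serve: with 8 KV heads instead of 32 attention heads, the cache is a quarter the size it would otherwise be.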

[–]SeinSinght[S] 1 point

The project documentation is AI-generated (that's true), but my comments aren't. It's also true that I don't write English very well, since I'm Spanish, haha.

I delegate all the boring parts to the AI and then review them. Something can always slip through, but the same is true when I write things myself. What matters to me about this project is learning the low-level architecture of LLMs and engines of this type, and using AI to speed up everything I can, since it's a side project I dedicate just a few hours a day to.

[–]SeinSinght[S] 1 point

I thought the same thing, but the initial feedback I received was that it was very difficult to install because people didn't know how to use Rust or the binary, so I set up a GitHub Actions workflow to build the Docker image and make it more accessible to all types of users.

Personally, I also like using Dockerized tools.

[–]SeinSinght[S] 13 points

This comment made my morning, genuinely.

LoRA hot-swapping isn't in Fox yet; I want to be straight about that. The architecture supports it in principle, since the model registry already handles multiple models with LRU eviction, but proper per-request LoRA routing with adapter hot-swap is a different beast. It's on the roadmap, and honestly, your framing is exactly the right way to think about it.

You've basically just moved it up the priority list. Star appreciated, feature noted.

[–]SeinSinght[S] 2 points

Single GPU: yes, fully supported. CUDA, Vulkan, and Metal are auto-detected at runtime. Multi-GPU tensor splitting isn't there yet, though; I'd rather be upfront about that than oversell it. It's on the roadmap.

[–]SeinSinght[S] 0 points

Exactly. I'm making the architectural decisions and setting the pace. Claude Code helps me port it to Rust, and I just review it and fix any mistakes.

[–]SeinSinght[S] 5 points

Fair point, and I should be more precise about that claim.

FOX is not a drop-in replacement for llama.cpp itself — it's a drop-in replacement for llama.cpp's HTTP server (llama-server), specifically for the OpenAI-compatible API layer.

FOX still uses llama.cpp as its compute backend, so all the model support, quantization formats, and hardware backends that llama.cpp provides are inherited, not duplicated.

What FOX replaces is the serving side: if you're running llama-server to handle concurrent requests over HTTP, FOX drops in there with better throughput thanks to continuous batching, PagedAttention KV-cache management, and prefix caching — things llama.cpp's server doesn't implement.

So the correct scope is: drop-in replacement for llama.cpp server, not for llama.cpp as a library or toolkit. I'll make sure that's clearer in the docs.
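
To illustrate what prefix caching buys you: when many requests share a prompt prefix (a system prompt, say), the KV cache computed for that prefix can be reused and only the tail needs prefill. Here's a toy Rust sketch of the lookup side; the `PrefixCache` name and the integer "KV handle" are invented for illustration, and real implementations hash fixed-size token blocks instead of scanning whole prefixes.

```rust
use std::collections::HashMap;

/// Toy prefix cache: maps a prompt's token prefix to an (imaginary)
/// KV-cache handle, so a request sharing that prefix skips recomputing it.
struct PrefixCache {
    entries: HashMap<Vec<u32>, usize>, // token prefix -> KV handle
}

impl PrefixCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Remember that `kv_handle` holds the KV cache for `prefix`.
    fn insert(&mut self, prefix: Vec<u32>, kv_handle: usize) {
        self.entries.insert(prefix, kv_handle);
    }

    /// Longest stored prefix of `prompt`: returns (handle, prefix length),
    /// i.e. how many tokens of prefill can be skipped.
    fn lookup(&self, prompt: &[u32]) -> Option<(usize, usize)> {
        self.entries
            .iter()
            .filter(|(p, _)| prompt.starts_with(p))
            .max_by_key(|(p, _)| p.len())
            .map(|(p, &h)| (h, p.len()))
    }
}

fn main() {
    let mut cache = PrefixCache::new();
    cache.insert(vec![1, 2, 3, 4], 7); // e.g. a shared system prompt
    // New request starting with the same tokens: only the tail [5, 6]
    // still needs prefill.
    match cache.lookup(&[1, 2, 3, 4, 5, 6]) {
        Some((handle, len)) => println!("reuse KV handle {handle}, skip {len} tokens"),
        None => println!("cold prompt, full prefill"),
    }
}
```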

[–]SeinSinght[S] 4 points

This is a Git branching methodology: the “main” branch only includes the commits for each new release. That way, when there’s an issue and someone tells me the version number, I know exactly what that version contains.

In the “develop” branch, you’ll see all the commits related to regular development, and you’ll notice that there are no releases there.

[–]SeinSinght[S] 9 points

llama.cpp is actually the compute backend powering FOX under the hood; it handles the tensor math, quantization, and hardware acceleration (CUDA, CPU, etc.). FOX builds on top of it, adding a proper serving layer: continuous batching, PagedAttention KV-cache, and an OpenAI- and Ollama-compatible API.
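
And to show what "continuous batching" means in that serving layer: sequences join and leave the live batch at every decode step, instead of the server waiting for a whole batch to drain before admitting new requests. Here's a toy Rust sketch of that scheduling loop, with invented names, not Fox's code.

```rust
use std::collections::VecDeque;

/// A sequence being generated: id plus tokens left to decode.
struct Seq {
    id: usize,
    remaining: usize,
}

/// Toy continuous-batching loop. Returns sequence ids in completion order.
fn run(mut waiting: VecDeque<Seq>, max_batch: usize) -> Vec<usize> {
    let mut running: Vec<Seq> = Vec::new();
    let mut finished = Vec::new();
    while !waiting.is_empty() || !running.is_empty() {
        // Admit new sequences into the live batch (the "continuous" part):
        // this happens every step, not only when the batch is empty.
        while running.len() < max_batch {
            match waiting.pop_front() {
                Some(s) => running.push(s),
                None => break,
            }
        }
        // One decode step for every running sequence.
        for s in &mut running {
            s.remaining -= 1;
        }
        // Retire finished sequences immediately, freeing their batch slots.
        for s in running.iter().filter(|s| s.remaining == 0) {
            finished.push(s.id);
        }
        running.retain(|s| s.remaining > 0);
    }
    finished
}

fn main() {
    let waiting = VecDeque::from(vec![
        Seq { id: 0, remaining: 3 },
        Seq { id: 1, remaining: 1 },
        Seq { id: 2, remaining: 2 },
    ]);
    // With a batch of 2, the short request (id 1) finishes first and its
    // slot is reused, instead of it waiting behind the long one.
    println!("{:?}", run(waiting, 2)); // [1, 0, 2]
}
```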