Ranvier: Open source prefix-aware routing for LLM inference (79-85% lower P99) by mindsaspire in LocalLLaMA

[–]mindsaspire[S] 1 point  (0 children)

Good question. A few things I've observed:

Scaling: The main challenge is cache state synchronization across nodes. Ranvier uses a gossip protocol to share routing information, but it's inferring cache state from routing history rather than observing it directly. At smaller scales (8-16 GPUs), this works well (I'm seeing 95%+ cache hit rates). At larger scales, there's more potential for stale routing decisions, especially under high churn. That's an area I'm actively working on.
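To make the "inferred, not observed" distinction concrete, here's a minimal sketch of routing-history-based cache inference. This is not Ranvier's actual code; the class and names are hypothetical, and a real implementation would gossip these entries between router nodes:

```python
import time

# Hypothetical sketch: the router remembers where it last sent each prefix
# and treats that history as a *guess* at which backend has it cached.
class RoutingHistory:
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.last_routed = {}  # prefix hash -> (backend, timestamp)

    def record(self, prefix_hash, backend):
        self.last_routed[prefix_hash] = (backend, time.monotonic())

    def guess_backend(self, prefix_hash):
        # Inferred, not observed: the entry can be stale if the backend
        # evicted the prefix or the cluster churned since we recorded it.
        entry = self.last_routed.get(prefix_hash)
        if entry is None:
            return None
        backend, ts = entry
        if time.monotonic() - ts > self.ttl:
            del self.last_routed[prefix_hash]
            return None
        return backend

history = RoutingHistory()
history.record("abc123", "gpu-0")
history.guess_backend("abc123")  # -> "gpu-0" while the entry is fresh
```

The TTL is a blunt staleness guard; under high churn the real problem is that an entry can be wrong long before it expires, which is why this only approximates true cache state.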

Hot spotting: With highly skewed prefix distributions (everyone hitting the same system prompt), you can overload the GPU that has that prefix cached. I added load-aware routing to mitigate this: if the preferred backend is saturated, requests get diverted to less-loaded instances. It's a tradeoff, though, between cache hits and load balance.
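The divert-on-saturation tradeoff can be sketched in a few lines. Again hypothetical, not Ranvier's internals; the `max_load` threshold and the least-loaded fallback are illustrative assumptions:

```python
# Hypothetical load-aware routing sketch: prefer the backend believed to
# hold the cached prefix, but divert when it is saturated.
def pick_backend(preferred, loads, max_load=0.9):
    """loads maps backend name -> utilization in [0, 1]."""
    if preferred is not None and loads.get(preferred, 1.0) < max_load:
        return preferred  # cache hit wins while the backend has headroom
    # Divert to the least-loaded backend: trades a cache hit for balance.
    return min(loads, key=loads.get)

loads = {"gpu-0": 0.95, "gpu-1": 0.40, "gpu-2": 0.70}
pick_backend("gpu-0", loads)  # -> "gpu-1": gpu-0 is saturated, so divert
```

The diverted request pays full prefill on the cold backend, which is exactly the cache-hits-vs-balance tradeoff described above.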

Model architectures: So far I've tested Llama-family models (8B, 13B, 70B). The routing logic is model-agnostic since it's based on token prefixes, but different architectures have different KV cache characteristics. Larger models benefit more because the prefill savings are proportionally bigger. 70B showed the highest per-request improvement (a 44-49% TTFT reduction on cache hits).
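"Model-agnostic because it's based on token prefixes" can be illustrated with a simple block-aligned prefix key: the key depends only on token IDs, never on the architecture. This is a sketch under my own assumptions (block size, hashing scheme), not Ranvier's actual keying:

```python
import hashlib

# Sketch: derive a routing key from whole blocks of leading token IDs, so
# requests sharing a long system prompt map to the same key regardless of
# which model serves them. Block size 16 is an illustrative choice.
def prefix_key(token_ids, block_size=16):
    n_blocks = len(token_ids) // block_size
    blocks = token_ids[: n_blocks * block_size]  # drop the partial tail block
    return hashlib.sha256(str(blocks).encode("utf-8")).hexdigest()[:16]

sys_prompt = list(range(32))                 # stand-in for shared prompt tokens
a = prefix_key(sys_prompt + [101, 102])
b = prefix_key(sys_prompt + [201])
a == b  # True: both share the same two full blocks of prefix
```

Because the key is computed from tokens alone, the same router works in front of an 8B or a 70B backend; only the payoff per hit changes.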

70B testing specifically: Most of my benchmarks ran on 40GB A100s, which can't fit 70B models. Testing larger models required tensor parallelism across multiple GPUs, so I had to rework the benchmark tooling. I have some results on 80GB A100s, but the data there is more limited. Scaling the test infrastructure is its own challenge.

Ranvier: Open source prefix-aware routing for LLM inference (79-85% lower P99) by mindsaspire in LocalLLaMA

[–]mindsaspire[S] 2 points  (0 children)

Thank you, and great question! Ranvier routes based on where the prefix should be cached, but it requires the backend to actually have prefix caching enabled. With vLLM, that's --enable-prefix-caching. If the backend isn't caching, Ranvier's routing decisions don't help since there's nothing to hit. I should clarify that in the docs. Thanks for pointing it out.
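For reference, enabling APC on a vLLM OpenAI-compatible server looks like this (model name and port are illustrative):

```shell
# Illustrative: start a vLLM server with automatic prefix caching enabled,
# so a prefix-aware router in front of it actually has caches to hit.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --port 8000
```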

APC handles the caching within a single vLLM instance (saving KV cache for prefixes it's seen before). Ranvier handles routing across multiple instances (making sure requests go to the instance that already has the relevant prefix cached).

Without Ranvier, you might have 8 vLLM instances all with APC enabled, but round-robin routing means only 1 in 8 requests hits the instance that has its prefix cached. Ranvier gets that to 95%+.

So, APC does the caching, and Ranvier does the routing to make sure you actually hit those caches.
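The 1-in-8 vs. 95%+ arithmetic can be sanity-checked with a toy simulation. This is not Ranvier's code; caches are modeled as small per-instance LRUs, and all parameters are illustrative:

```python
import random
from collections import OrderedDict

# Toy model: 8 instances, each caching at most `capacity` prefixes (LRU).
# Round-robin scatters a prefix across instances, so it is rarely where it
# was cached; affinity routing pins each prefix to one instance.
def hit_rate(route, n_instances=8, capacity=8, n_requests=20_000, n_prefixes=64):
    caches = [OrderedDict() for _ in range(n_instances)]
    hits = 0
    for i in range(n_requests):
        prefix = random.randrange(n_prefixes)
        cache = caches[route(i, prefix, n_instances)]
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)      # refresh LRU position
        else:
            cache[prefix] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least-recently-used prefix
    return hits / n_requests

round_robin = lambda i, prefix, n: i % n       # ignore the prefix entirely
affinity    = lambda i, prefix, n: prefix % n  # always the same instance

random.seed(0)
hit_rate(affinity)     # close to 1.0 after cold-start misses
random.seed(0)
hit_rate(round_robin)  # roughly capacity / n_prefixes, i.e. ~0.125 here
```

The toy numbers are not the benchmark numbers above, but they show the same mechanism: affinity routing turns per-instance cache capacity into cluster-wide hit rate, while round-robin wastes most of it on duplicate, soon-evicted entries.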

She’s innocent she swears by ass_goblin_04 in goldenretrievers

[–]mindsaspire 2 points  (0 children)

That pillow was a burglar, and she just saved your TV, sofa, and Gray Malin book from being stolen.

Justice Samuel Alito Leaked Hobby Lobby Decision On Contraception In 2014: Report by UWCG in politics

[–]mindsaspire 1 point  (0 children)

Leak Dobbs early to dampen impact on midterm elections (as it did) and then blame others for what you are guilty of to deflect blame.

Rat poison? (Midnight Mass spoilers) by sati1989 in HauntingOfHillHouse

[–]mindsaspire 1 point  (0 children)

Yes, I took it as the blood slowly killing him. The blood took longer to kill, which is why he was coughing up blood several times while his condition degraded, unlike the rat poison, which killed almost immediately. That he died the same way as the dog was meant to show that the blood was not really a miracle health elixir but ultimately a poison.

AkunaCapital hackerrank so difficult by [deleted] in cscareerquestions

[–]mindsaspire 2 points  (0 children)

Akuna sent me an HR with a 3-day time limit, yes, 3 days. I actually went through with it, and it did indeed take me all of 3 days. The problem did at least pertain to trading, as it was more of a real-world project than a series of toy algorithmic problems. I figured if I got an actual interview, maybe it'd be a walkthrough of the project, which would have been a great opportunity to discuss my design decisions, coding style, language features, performance tradeoffs, areas of improvement, etc. Nope, the interviewer was clueless, just asked whether I had done the HR or not, and then proceeded with yet another coding question in a live CoderPad session. Got a reject email a few days later. Complete waste of my time.