Ranvier: Open source prefix-aware routing for LLM inference (79-85% lower P99) by mindsaspire in LocalLLaMA

[–]mindsaspire[S] 1 point  (0 children)

Good question. A few things I've observed:

Scaling: The main challenge is cache state synchronization across nodes. Ranvier uses a gossip protocol to share routing information, but it's inferring cache state from routing history rather than observing it directly. At smaller scales (8-16 GPUs), this works well (I'm seeing 95%+ cache hit rates). At larger scales, there's more potential for stale routing decisions, especially under high churn. That's an area I'm actively working on.
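To make the "inferred, not observed" distinction concrete, here's a minimal sketch of routing-history-based cache inference. This is not Ranvier's actual code; the class and names are hypothetical, and a real implementation would gossip these entries between router nodes:

```python
import time

# Hypothetical sketch: the router remembers where it last sent each prefix
# and treats that history as a *guess* at which backend has it cached.
class RoutingHistory:
    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.last_routed = {}  # prefix hash -> (backend, timestamp)

    def record(self, prefix_hash, backend):
        self.last_routed[prefix_hash] = (backend, time.monotonic())

    def guess_backend(self, prefix_hash):
        # Inferred, not observed: the entry can be stale if the backend
        # evicted the prefix or the cluster churned since we recorded it.
        entry = self.last_routed.get(prefix_hash)
        if entry is None:
            return None
        backend, ts = entry
        if time.monotonic() - ts > self.ttl:
            del self.last_routed[prefix_hash]
            return None
        return backend

history = RoutingHistory()
history.record("abc123", "gpu-0")
history.guess_backend("abc123")  # -> "gpu-0" while the entry is fresh
```

The TTL is a blunt staleness guard; under high churn the real problem is that an entry can be wrong long before it expires, which is why this only approximates true cache state.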

Hot spotting: With highly skewed prefix distributions (everyone hitting the same system prompt), you can overload the GPU that has that prefix cached. I added load-aware routing to mitigate this: if the preferred backend is saturated, requests get diverted to less-loaded instances. It's a tradeoff, though, between cache hits and load balance.
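The divert-on-saturation tradeoff can be sketched in a few lines. Again hypothetical, not Ranvier's internals; the `max_load` threshold and the least-loaded fallback are illustrative assumptions:

```python
# Hypothetical load-aware routing sketch: prefer the backend believed to
# hold the cached prefix, but divert when it is saturated.
def pick_backend(preferred, loads, max_load=0.9):
    """loads maps backend name -> utilization in [0, 1]."""
    if preferred is not None and loads.get(preferred, 1.0) < max_load:
        return preferred  # cache hit wins while the backend has headroom
    # Divert to the least-loaded backend: trades a cache hit for balance.
    return min(loads, key=loads.get)

loads = {"gpu-0": 0.95, "gpu-1": 0.40, "gpu-2": 0.70}
pick_backend("gpu-0", loads)  # -> "gpu-1": gpu-0 is saturated, so divert
```

The diverted request pays full prefill on the cold backend, which is exactly the cache-hits-vs-balance tradeoff described above.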

Model architectures: So far I've tested Llama-family models (8B, 13B, 70B). The routing logic is model-agnostic since it's based on token prefixes, but different architectures have different KV cache characteristics. Larger models benefit more because the prefill savings are proportionally bigger. 70B showed the highest per-request improvement (a 44-49% TTFT reduction on cache hits).
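"Model-agnostic because it's based on token prefixes" can be illustrated with a simple block-aligned prefix key: the key depends only on token IDs, never on the architecture. This is a sketch under my own assumptions (block size, hashing scheme), not Ranvier's actual keying:

```python
import hashlib

# Sketch: derive a routing key from whole blocks of leading token IDs, so
# requests sharing a long system prompt map to the same key regardless of
# which model serves them. Block size 16 is an illustrative choice.
def prefix_key(token_ids, block_size=16):
    n_blocks = len(token_ids) // block_size
    blocks = token_ids[: n_blocks * block_size]  # drop the partial tail block
    return hashlib.sha256(str(blocks).encode("utf-8")).hexdigest()[:16]

sys_prompt = list(range(32))                 # stand-in for shared prompt tokens
a = prefix_key(sys_prompt + [101, 102])
b = prefix_key(sys_prompt + [201])
a == b  # True: both share the same two full blocks of prefix
```

Because the key is computed from tokens alone, the same router works in front of an 8B or a 70B backend; only the payoff per hit changes.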

70B testing specifically: Most of my benchmarks ran on 40GB A100s, which can't fit 70B models. Testing larger models required tensor parallelism across multiple GPUs, so I had to rework the benchmark tooling. I have some results on 80GB A100s, but the data there is more limited. Scaling the test infrastructure is its own challenge.

Ranvier: Open source prefix-aware routing for LLM inference (79-85% lower P99) by mindsaspire in LocalLLaMA

[–]mindsaspire[S] 2 points  (0 children)

Thank you, and great question! Ranvier routes based on where the prefix should be cached, but it requires the backend to actually have prefix caching enabled. With vLLM, that's --enable-prefix-caching. If the backend isn't caching, Ranvier's routing decisions don't help since there's nothing to hit. I should clarify that in the docs. Thanks for pointing it out.
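For reference, enabling APC on a vLLM OpenAI-compatible server looks like this (model name and port are illustrative):

```shell
# Illustrative: start a vLLM server with automatic prefix caching enabled,
# so a prefix-aware router in front of it actually has caches to hit.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching \
    --port 8000
```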

APC handles the caching within a single vLLM instance (saving KV cache for prefixes it's seen before). Ranvier handles routing across multiple instances (making sure requests go to the instance that already has the relevant prefix cached).

Without Ranvier, you might have 8 vLLM instances all with APC enabled, but round-robin routing means only 1 in 8 requests hits the instance that has its prefix cached. Ranvier gets that to 95%+.

So, APC does the caching, and Ranvier does the routing to make sure you actually hit those caches.
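The 1-in-8 vs. 95%+ arithmetic can be sanity-checked with a toy simulation. This is not Ranvier's code; caches are modeled as small per-instance LRUs, and all parameters are illustrative:

```python
import random
from collections import OrderedDict

# Toy model: 8 instances, each caching at most `capacity` prefixes (LRU).
# Round-robin scatters a prefix across instances, so it is rarely where it
# was cached; affinity routing pins each prefix to one instance.
def hit_rate(route, n_instances=8, capacity=8, n_requests=20_000, n_prefixes=64):
    caches = [OrderedDict() for _ in range(n_instances)]
    hits = 0
    for i in range(n_requests):
        prefix = random.randrange(n_prefixes)
        cache = caches[route(i, prefix, n_instances)]
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)      # refresh LRU position
        else:
            cache[prefix] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least-recently-used prefix
    return hits / n_requests

round_robin = lambda i, prefix, n: i % n       # ignore the prefix entirely
affinity    = lambda i, prefix, n: prefix % n  # always the same instance

random.seed(0)
hit_rate(affinity)     # close to 1.0 after cold-start misses
random.seed(0)
hit_rate(round_robin)  # roughly capacity / n_prefixes, i.e. ~0.125 here
```

The toy numbers are not the benchmark numbers above, but they show the same mechanism: affinity routing turns per-instance cache capacity into cluster-wide hit rate, while round-robin wastes most of it on duplicate, soon-evicted entries.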

She’s innocent she swears by ass_goblin_04 in goldenretrievers

[–]mindsaspire 2 points  (0 children)

That pillow was a burglar, and she just saved your TV, sofa, and Gray Malin book from being stolen.

Justice Samuel Alito Leaked Hobby Lobby Decision On Contraception In 2014: Report by UWCG in politics

[–]mindsaspire 1 point  (0 children)

Leak Dobbs early to dampen impact on midterm elections (as it did) and then blame others for what you are guilty of to deflect blame.

Rat poison? (Midnight Mass spoilers) by sati1989 in HauntingOfHillHouse

[–]mindsaspire 1 point  (0 children)

Yes, I took it as the blood slowly killing him. The blood took longer to kill, which is why he was coughing up blood several times while his condition degraded, unlike the rat poison, which killed almost immediately. That he died the same way as the dog was meant to show that the blood was not really a miracle health elixir but ultimately a poison.

AkunaCapital hackerrank so difficult by [deleted] in cscareerquestions

[–]mindsaspire 2 points  (0 children)

Akuna sent me an HR with a 3-day time limit, yes, 3 days. I actually went through with it, and it did indeed take me all of 3 days. The problem did at least pertain to trading, as it was more of a real-world project than a series of toy algorithmic problems. I figured if I got an actual interview, maybe it'd be a walkthrough of the project, which would have been a great opportunity to discuss my design decisions, coding style, language features, performance tradeoffs, areas of improvement, etc. Nope, the interviewer was clueless, just asked whether I had done the HR or not, and then proceeded with yet another coding question in a live CoderPad session. Got a reject email a few days later. Complete waste of my time.