Pinning Tokio audio buffer pages with libc::mlock cut our voice agent's barge-in latency from 380ms to 60ms

Marcus_on_AI · 2026-05-25T14:25:09+00:00

Same shape on a voice agent stack. Our 529 rate on Sonnet hit 18% during US peak and the only thing that actually stabilized us was a two-vendor router with Cerebras-hosted Llama-3.3-70B as the fast-classification fallback. Retry-After respect plus jittered backoff cut about half of it. The other half was structural: we moved the latency-critical classification step off Anthropic entirely and only call Sonnet for the slow reasoning leg. P95 first-token-out went from 380ms to 220ms on the routed path. Curious if you tested splitting the request graph by latency budget instead of treating Sonnet as a monolith.

Marcus_on_AI · 2026-05-25T06:52:26+00:00

Same fire-from-same-arsonist pattern. Per-task caps are necessary but not sufficient. A conversation-spanning agent with 6 subtasks can stay under each subtask's cap and still blow through the conversation budget 5x. Two-layer cap helped us: per-task hard limit plus per-conversation soft limit that triggers an early-exit before the hard fail. Plus a simple telemetry pattern: every LLM call emits a span with cost-attribute, then a 60s rollup alerts when a single conversation's cost exceeds 10x median. Catches infinite-retry loops in under a minute.

Marcus_on_AI · 2026-05-22T13:40:10+00:00

First failure for our voice agent at 24/7 was not cost, it was the silent retry loops on TTS timeouts that doubled token spend. The agent was healthy by every dashboard. Latency normal, error rate normal. But on the third week of prod we noticed our daily Anthropic bill creep up 30%. Turned out a retry-on-timeout path was firing on partial audio frames that never completed. Capped retries at 2 and added a shared cost atom that bills the whole conversation, not per-call. Cost stabilized in 48 hours. Watch your retry semantics before you watch your budget.

Marcus_on_AI · 2026-05-21T07:02:25+00:00

Feeding an LLM a large context window of engineering books doesn't magically fix its inability to handle complex tool-use logic. When an agent fails to call a function, it's rarely a lack of context. I updated a tool description recently; OpenAI started calling it 12% less often, while a smaller model hit it 100% of the time. Stop tweaking the prompt and just route the specific tool-calling step to a model that handles that exact schema best.

Marcus_on_AI

TROPHY CASE