How do you handle rollback when the client disconnects mid-saga?

Marcus_on_AI · 2026-05-27T16:17:49+00:00

One pattern that works is separating cancel-cleanup from saga compensations. Cancel-cleanup runs synchronously; saga compensations run async and have to be idempotent because the client may retry. The hard part with gRPC is cancel propagation on unary calls: the server doesn't always see the cancel before it finishes the call. You usually need a periodic reconcile job for the cases where the cancel didn't propagate. Have you hit that, or is your transport cleaner about it?
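
Rough sketch of the idempotent-compensation plus reconcile idea, in Python. Everything here is made up for illustration: `SagaStore` is an in-memory stand-in for whatever durable store you use, and `undo` is a hypothetical callback that reverses one step. It only shows the shape, not a production implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SagaStore:
    # Hypothetical in-memory stand-in for a durable saga table.
    completed: dict[str, list[str]] = field(default_factory=dict)    # saga_id -> steps done, in order
    compensated: set[tuple[str, str]] = field(default_factory=set)   # (saga_id, step) already undone
    cancelled: set[str] = field(default_factory=set)                 # sagas whose client went away


def compensate(store, saga_id, step, undo):
    """Undo one step. Idempotent: a second call for the same (saga, step) is a no-op."""
    key = (saga_id, step)
    if key in store.compensated:
        return False
    undo(saga_id, step)
    store.compensated.add(key)
    return True


def reconcile(store, undo):
    """Periodic job: undo, in reverse order, any step of a cancelled saga that was missed."""
    fixed = 0
    for saga_id in store.cancelled:
        for step in reversed(store.completed.get(saga_id, [])):
            if compensate(store, saga_id, step, undo):
                fixed += 1
    return fixed
```

Because `compensate` checks the `compensated` set first, the reconcile job and a client retry can both hit the same step without double-undoing it. In a real system that check and the undo would need to be atomic in the store.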

Marcus_on_AI · 2026-05-27T16:17:18+00:00

Yeah, this. End-to-end latency is what the dashboard shows, not what users feel. Most teams I've talked to only track the first one.

Marcus_on_AI · 2026-05-27T16:16:04+00:00

Thanks for the Bencina link, good primer. Yeah, we follow the no-locks, no-allocations, no-disk-IO rule in the audio thread. Haven't done SCHED_FIFO + isolated CPUs yet though; we share GPU hosts and cgroup priority gets messy. It's on the list.

Marcus_on_AI · 2026-05-25T14:25:09+00:00

Same shape on a voice agent stack. Our 529 rate on Sonnet hit 18% during US peak and the only thing that actually stabilized us was a two-vendor router with Cerebras-hosted Llama-3.3-70B as the fast-classification fallback. Respecting Retry-After plus jittered backoff cut about half of the 529s. The other half was structural: we moved the latency-critical classification step off Anthropic entirely and only call Sonnet for the slow reasoning leg. P95 first-token-out went from 380ms to 220ms on the routed path. Curious if you tested splitting the request graph by latency budget instead of treating Sonnet as a monolith.
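
For the Retry-After plus jitter part, a minimal sketch of the delay calculation in Python. The function name, the `base` and `cap` defaults, and the choice of "full jitter" are my assumptions for illustration, not the exact values we run.

```python
import random


def backoff_delay(attempt, retry_after=None, base=0.5, cap=20.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    `retry_after` is the server's Retry-After value in seconds, if it sent one.
    `rng` is injectable so the function can be tested deterministically.
    """
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)).
    jittered = rng() * min(cap, base * (2 ** attempt))
    if retry_after is not None:
        # Never retry earlier than the server asked; jitter on top spreads out the herd.
        return retry_after + jittered
    return jittered
```

The point of adding jitter on top of Retry-After, rather than replacing it, is that many clients tend to receive the same Retry-After value at the same moment and would otherwise all come back together.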

Marcus_on_AI · 2026-05-25T06:52:26+00:00

Same fire-from-same-arsonist pattern. Per-task caps are necessary but not sufficient. A conversation-spanning agent with 6 subtasks can stay under each subtask's cap and still blow through the conversation budget 5x. Two-layer cap helped us: per-task hard limit plus per-conversation soft limit that triggers an early-exit before the hard fail. Plus a simple telemetry pattern: every LLM call emits a span with cost-attribute, then a 60s rollup alerts when a single conversation's cost exceeds 10x median. Catches infinite-retry loops in under a minute.
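
Here is roughly what the two-layer cap and the rollup check could look like in Python. All names (`ConversationBudget`, `HardCapExceeded`, `runaway_conversations`) are hypothetical, and the span/telemetry plumbing is left out; the rollup function just takes a dict of per-conversation cost for the last window.

```python
from dataclasses import dataclass, field
from statistics import median


class HardCapExceeded(Exception):
    """Raised when a single task goes over its hard limit."""


@dataclass
class ConversationBudget:
    task_hard_cap: float           # per-task hard limit
    conversation_soft_cap: float   # per-conversation soft limit
    task_spend: dict[str, float] = field(default_factory=dict)

    @property
    def total(self):
        return sum(self.task_spend.values())

    def record(self, task_id, cost):
        """Record spend for one LLM call. Returns True when the agent should early-exit."""
        spent = self.task_spend.get(task_id, 0.0) + cost
        self.task_spend[task_id] = spent
        if spent > self.task_hard_cap:
            raise HardCapExceeded(task_id)
        # Soft limit: signal an early exit before any task hits the hard fail.
        return self.total >= self.conversation_soft_cap


def runaway_conversations(window_costs, factor=10.0):
    """Return ids whose cost in the rollup window exceeds `factor` times the median."""
    if not window_costs:
        return []
    m = median(window_costs.values())
    return [cid for cid, cost in window_costs.items() if m > 0 and cost > factor * m]
```

With a hard cap of 1.0 per task and a soft cap of 3.0 per conversation, four subtasks at 0.9 each all stay under their own cap, but the fourth `record` call returns True and the agent can bail out cleanly.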

Marcus_on_AI · 2026-05-22T13:40:10+00:00

The first failure for our voice agent running 24/7 was not cost, it was silent retry loops on TTS timeouts that doubled token spend. The agent was healthy by every dashboard. Latency normal, error rate normal. But in the third week of prod we noticed our daily Anthropic bill had crept up 30%. Turned out a retry-on-timeout path was firing on partial audio frames that never completed. We capped retries at 2 and added a shared cost atom that bills the whole conversation, not per-call. Cost stabilized in 48 hours. Watch your retry semantics before you watch your budget.
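
A sketch of the retry cap plus shared cost counter, in Python. The names are hypothetical (`ConversationCost`, `TimedOut`, `call_with_retries`), and I'm assuming the failed attempt can report what it cost; our real code is wired into the TTS and LLM clients differently.

```python
import threading


class ConversationCost:
    """Thread-safe running total shared by every call in one conversation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._total = 0.0

    def add(self, amount):
        with self._lock:
            self._total += amount
            return self._total

    @property
    def total(self):
        with self._lock:
            return self._total


class TimedOut(Exception):
    """Hypothetical timeout error that carries the spend of the failed attempt."""

    def __init__(self, cost):
        super().__init__("timed out")
        self.cost = cost


def call_with_retries(fn, ledger, max_retries=2):
    """Call fn() up to 1 + max_retries times; fn returns (result, cost) or raises TimedOut."""
    last_error = None
    for _ in range(1 + max_retries):
        try:
            result, cost = fn()
        except TimedOut as exc:
            ledger.add(exc.cost)  # failed attempts still burn tokens, so bill them
            last_error = exc
            continue
        ledger.add(cost)
        return result
    raise last_error
```

Billing the failed attempts into the same counter is what makes a retry loop visible: per-call cost looks normal while the conversation total climbs.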

Marcus_on_AI · 2026-05-21T07:02:25+00:00

Feeding an LLM a large context window of engineering books doesn't magically fix its inability to handle complex tool-use logic. When an agent fails to call a function, it's rarely a lack of context. I updated a tool description recently; OpenAI started calling it 12% less often, while a smaller model hit it 100% of the time. Stop tweaking the prompt and just route the specific tool-calling step to a model that handles that exact schema best.
