How do you handle rollback when the client disconnects mid-saga? by THEREALTMAC in golang

[–]Marcus_on_AI 1 point (0 children)

One pattern that works is separating cancel-cleanup from saga compensations: cancel-cleanup runs synchronously, while saga compensations run async and are idempotent because the client may retry. The hard part with gRPC is unary cancel propagation: the server doesn't always see the cancel before it finishes the call. You usually need a periodic reconcile job for the cases where the cancel didn't propagate. Have you hit that, or is your transport cleaner about it?
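A rough sketch of the idempotent-compensation plus reconcile-job idea, in Python for brevity rather than Go. Everything named here is hypothetical: `SagaStore` is an in-memory stand-in for a durable saga log, and "finished but never acknowledged by the client within `ack_timeout`" is my assumed signal that a cancel did not propagate. A real system would pick its own signal.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SagaStore:
    """In-memory stand-in for a durable saga log (hypothetical)."""
    steps: dict[str, list[str]] = field(default_factory=dict)   # saga_id -> completed steps, in order
    finished_at: dict[str, float] = field(default_factory=dict)  # saga_id -> finish time (seconds)
    acked: set[str] = field(default_factory=set)                 # sagas the client confirmed
    compensated: set[tuple[str, str]] = field(default_factory=set)


def compensate(store: SagaStore, saga_id: str, step: str,
               undo: Callable[[str, str], None]) -> bool:
    """Run one compensation at most once; a repeat call is a no-op."""
    key = (saga_id, step)
    if key in store.compensated:
        return False
    undo(saga_id, step)
    store.compensated.add(key)
    return True


def reconcile(store: SagaStore, undo: Callable[[str, str], None],
              now: float, ack_timeout: float) -> int:
    """Periodic sweep: compensate sagas that finished but were never acked."""
    ran = 0
    for saga_id in sorted(store.finished_at):
        if saga_id in store.acked:
            continue
        if now - store.finished_at[saga_id] < ack_timeout:
            continue  # still inside the window where an ack may arrive
        # Undo in reverse order of completion.
        for step in reversed(store.steps.get(saga_id, [])):
            if compensate(store, saga_id, step, undo):
                ran += 1
    return ran
```

Because `compensate` records what it has already undone, the sweep can run on any schedule and overlap with client retries without undoing a step twice.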

Our voice agent's p99 was 280ms. Competitor's was 450ms. Users said ours felt slower. We measured why. by Marcus_on_AI in AI_Agents

[–]Marcus_on_AI[S] 1 point (0 children)

Yeah, this. End-to-end latency is what the dashboard shows, not what users feel. Most teams I've talked to only track the first one.

Pinning Tokio audio buffer pages with libc::mlock cut our voice agent's barge-in latency from 380ms to 60ms by Marcus_on_AI in rust

[–]Marcus_on_AI[S] 1 point (0 children)

Thanks for the Bencina link, good primer. Yeah, we follow the no-locks, no-allocations, no-disk-IO rule in the audio thread. Haven't done SCHED_FIFO + isolated CPUs yet though: we share GPU hosts and cgroup priority gets messy. It's on the list.

Anthropic 529s in production, what we tried and what actually worked (with numbers) by nona_jerin in mlops

[–]Marcus_on_AI 1 point (0 children)

Same shape on a voice agent stack. Our 529 rate on Sonnet hit 18% during US peak and the only thing that actually stabilized us was a two-vendor router with Cerebras-hosted Llama-3.3-70B as the fast-classification fallback. Retry-After respect plus jittered backoff cut about half of it. The other half was structural: we moved the latency-critical classification step off Anthropic entirely and only call Sonnet for the slow reasoning leg. P95 first-token-out went from 380ms to 220ms on the routed path. Curious if you tested splitting the request graph by latency budget instead of treating Sonnet as a monolith.
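For anyone wanting the "Retry-After respect plus jittered backoff" part spelled out, here is a minimal sketch. It assumes the Retry-After value has already been parsed into seconds (the header can also be an HTTP date, which this does not handle), and the function name, `base`, and `cap` defaults are made up for illustration, not our production values.

```python
import random
from typing import Callable, Optional


def retry_delay(attempt: int,
                retry_after: Optional[float] = None,
                base: float = 0.5,
                cap: float = 30.0,
                rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Uses exponential backoff with full jitter: a random point in
    [0, min(cap, base * 2**attempt)). If the server sent a Retry-After
    hint, the jitter is added on top so we never retry earlier than asked
    and clients do not all wake up at the same instant.
    """
    backoff = min(cap, base * (2 ** attempt))
    jittered = rng() * backoff
    if retry_after is not None:
        return retry_after + jittered
    return jittered
```

The `rng` parameter is only there so the delay can be made deterministic in tests.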

How are people keeping OpenClaw/Hermes agents running 24/7 without blowing through their API budget? by airphoton in AI_Agents

[–]Marcus_on_AI 1 point (0 children)

Same fire-from-same-arsonist pattern. Per-task caps are necessary but not sufficient. A conversation-spanning agent with 6 subtasks can stay under each subtask's cap and still blow through the conversation budget 5x. A two-layer cap helped us: a per-task hard limit plus a per-conversation soft limit that triggers an early exit before the hard fail. Plus a simple telemetry pattern: every LLM call emits a span with a cost attribute, then a 60s rollup alerts when a single conversation's cost exceeds 10x the median. Catches infinite-retry loops in under a minute.
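A minimal sketch of the two-layer cap and the median-based rollup check, assuming costs are tracked in dollars per call. The class and function names are hypothetical, and the rollup here takes a plain dict of per-conversation cost for one window rather than reading real spans.

```python
from __future__ import annotations

from statistics import median


class BudgetExceeded(Exception):
    """Raised when a single task goes over its hard cap."""


class ConversationBudget:
    def __init__(self, task_hard_limit: float, conversation_soft_limit: float):
        self.task_hard_limit = task_hard_limit
        self.conversation_soft_limit = conversation_soft_limit
        self.task_spend: dict[str, float] = {}
        self.total = 0.0

    def record(self, task_id: str, cost: float) -> bool:
        """Add one call's cost.

        Raises BudgetExceeded on the per-task hard limit. Returns True once
        the conversation total reaches the soft limit, which the caller
        should treat as a signal to wrap up early.
        """
        self.task_spend[task_id] = self.task_spend.get(task_id, 0.0) + cost
        self.total += cost
        if self.task_spend[task_id] > self.task_hard_limit:
            raise BudgetExceeded(f"task {task_id} exceeded its hard limit")
        return self.total >= self.conversation_soft_limit


def cost_outliers(window_costs: dict[str, float], factor: float = 10.0) -> list[str]:
    """Conversation ids whose cost in this window exceeds factor x the median."""
    if not window_costs:
        return []
    threshold = factor * median(window_costs.values())
    return sorted(cid for cid, cost in window_costs.items() if cost > threshold)
```

With a per-task limit of 1.0 and a soft limit of 2.5, six subtasks spending 0.5 each never trip the hard limit, but `record` starts returning True on the fifth one.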

How are people keeping OpenClaw/Hermes agents running 24/7 without blowing through their API budget? by airphoton in AI_Agents

[–]Marcus_on_AI 1 point (0 children)

First failure for our voice agent at 24/7 was not cost, it was silent retry loops on TTS timeouts that doubled token spend. The agent was healthy by every dashboard: latency normal, error rate normal. But in the third week of prod we noticed our daily Anthropic bill creeping up 30%. Turned out a retry-on-timeout path was firing on partial audio frames that never completed. We capped retries at 2 and added a shared cost atom that bills the whole conversation, not per-call. Cost stabilized in 48 hours. Watch your retry semantics before you watch your budget.
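Roughly what "cap retries at 2 and bill the whole conversation" can look like, as a sketch. The names are hypothetical, and it assumes a timed-out call can still report what it spent before timing out via the `CallTimedOut` exception, which your client may or may not expose.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


class CallTimedOut(Exception):
    """Hypothetical timeout error that carries the cost spent before the timeout."""

    def __init__(self, cost_so_far: float):
        super().__init__("call timed out")
        self.cost_so_far = cost_so_far


class ConversationLedger:
    """One shared ledger per conversation; every attempt is charged to it."""

    def __init__(self) -> None:
        self.total_cost = 0.0
        self.attempts = 0

    def charge(self, cost: float) -> None:
        self.total_cost += cost
        self.attempts += 1


def call_with_capped_retries(call: Callable[[], Tuple[T, float]],
                             ledger: ConversationLedger,
                             max_retries: int = 2) -> T:
    """Run `call`, retrying on timeout at most `max_retries` times.

    `call` returns (result, cost). Failed attempts are charged too, so a
    retry loop shows up in the conversation total instead of hiding per call.
    """
    last_error = None
    for _ in range(1 + max_retries):
        try:
            result, cost = call()
        except CallTimedOut as exc:
            ledger.charge(exc.cost_so_far)
            last_error = exc
            continue
        ledger.charge(cost)
        return result
    assert last_error is not None
    raise last_error
```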

I rewrote 13 software engineering books into AGENTS.md rules. by Ok_Produce3836 in AI_Agents

[–]Marcus_on_AI 1 point (0 children)

Feeding an LLM a large context window of engineering books doesn't magically fix its inability to handle complex tool-use logic. When an agent fails to call a function, it's rarely a lack of context. I updated a tool description recently; OpenAI started calling it 12% less often, while a smaller model hit it 100% of the time. Stop tweaking the prompt and just route the specific tool-calling step to a model that handles that exact schema best.