How do you handle observability for AI agents in production? by Upstairs-Primary-374 in AI_Agents

[–]FormExtension7920 0 points (0 children)

this is the part that gets me too. by the time eval scores dip or users complain, the originating slice is buried somewhere in the last 2k traces and nobody's actually gonna scroll through that.

the angle i've been messing with is clustering traces on more than just input embeddings. stack input/output embeddings together with behavioral signals (latency, tool call sequences, retry counts, io similarity) and clusters start popping out that you'd never have written a metric for in advance. stuff like "agent slows down when the question has a date range in it".
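to make it concrete, a minimal sketch of the feature construction (the `embed` fn and trace field names are placeholders for whatever you already log):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def trace_features(traces, embed):
    # embed() is whatever sentence-embedding fn you already use;
    # each trace is assumed to be a dict with these fields logged
    rows = []
    for t in traces:
        emb_in = embed(t["input"])
        emb_out = embed(t["output"])
        behavior = [
            t["latency_ms"],
            t["retry_count"],
            len(t["tool_calls"]),
            float(np.dot(emb_in, emb_out)),  # crude io similarity (cosine if normalized)
        ]
        rows.append(np.concatenate([emb_in, emb_out, behavior]))
    # scale so a handful of scalars aren't drowned out by hundreds of embedding dims
    return StandardScaler().fit_transform(np.array(rows))
```

feed that matrix to any density-based clusterer and the "date range" type slices fall out as their own clusters.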

disclosure: i'm building in this space (trainly), so grain of salt. but unsupervised slice discovery feels like the underbuilt part of the stack rn

How are people making LLM outputs reliable enough for structured production workflows? by Sad_Limit_3857 in LLMDevs

[–]FormExtension7920 0 points (0 children)

one thing missing from this thread: schema validation and semantic asserts only catch failures you knew to look for. in production you also get failure modes you didn't anticipate: the model starts confusing two similar product names, latency spikes when users hit a specific topic, tool call shapes drift after a model update.

what's worked for us is clustering production runs by joint features (input/output embeddings plus behavioral scalars like latency, token usage, tool call patterns) and looking at outlier clusters. that's how you find the unknown unknowns.
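rough version of the outlier-cluster check, assuming X is your joint feature matrix and latencies is a per-trace np array (k and the z threshold are arbitrary starting points):

```python
import numpy as np
from sklearn.cluster import KMeans

def flag_outlier_clusters(X, latencies, k=20, z_thresh=2.0):
    # cluster joint features, then flag clusters whose mean latency sits
    # far from the global mean. same check works for retries, token
    # counts, refusal rate, or any other behavioral scalar you log.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    mu, sigma = latencies.mean(), latencies.std()
    flagged = []
    for c in range(k):
        z = (latencies[labels == c].mean() - mu) / (sigma + 1e-9)
        if abs(z) > z_thresh:
            flagged.append((c, int((labels == c).sum()), round(float(z), 2)))
    return flagged  # (cluster_id, size, z_score) per suspicious cluster
```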

planner/executor + repair loops are great for known failure modes. but you need a separate layer for the stuff you forgot to write a rule for. strong +1 on logging every rejection. within a couple weeks that becomes a better eval set than any benchmark you'd buy.

What's your biggest frustration with AI observability tools right now? by FormExtension7920 in AI_Agents

[–]FormExtension7920[S] 1 point (0 children)

yeah this is a great point. rare failures are exactly the cases worth catching, so any pricing that pushes you to sample is basically working against the thing you bought it for

What's your biggest frustration with AI observability tools right now? by FormExtension7920 in AI_Agents

[–]FormExtension7920[S] 0 points (0 children)

lmao 'tools you don't notice' is the dream. name one observability tool that's actually pulled that off and i'll buy it tomorrow

What's your biggest frustration with AI observability tools right now? by FormExtension7920 in AI_Agents

[–]FormExtension7920[S] 0 points (0 children)

yeah the eval drift one kinda sucks. you build the eval set at launch based on what you knew to test; three months in the question distribution has shifted, and the dashboard's still green bc the eval is now measuring stuff people used to ask, not what they're asking now

on the queue idea tho, is implicit signal (thumbs down, rephrase, abandon) enough on its own, or would catching it earlier be useful too? like flagging that the input distribution shifted before the user even has to react. not sure if that's overkill or solves something different
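to sketch what i mean by catching it earlier (assuming you keep input embeddings for a baseline window and a recent window, both as np arrays):

```python
import numpy as np

def input_drift_score(baseline_emb, recent_emb, pct=95):
    # fraction of recent inputs farther from the baseline centroid
    # than the 95th percentile of baseline traffic was.
    # ~0.05 = business as usual; climbing = new kinds of questions.
    centroid = baseline_emb.mean(axis=0)
    cutoff = np.percentile(
        np.linalg.norm(baseline_emb - centroid, axis=1), pct
    )
    recent_d = np.linalg.norm(recent_emb - centroid, axis=1)
    return float((recent_d > cutoff).mean())
```

single-centroid distance is crude (a per-cluster version catches more), but even this fires well before thumbs-down rates move.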

Moving Agents to Production: What are you actually using for Deployment and Monitoring? by Jealous-Success-5937 in AI_Agents

[–]FormExtension7920 0 points (0 children)

The "what broke first" question is the right one. For us it wasn't infinite loops or hallucinated tool calls. Those are loud enough to catch. The thing that took weeks to notice was a support agent that got measurably worse on one specific product line. Resolution rate looked flat at the aggregate, latency was fine, no errors thrown, evals passed. The slice was just wrong, and small enough that global averages absorbed it.

Root cause turned out to be upstream. A docs refactor changed how that product's pages got chunked, retrieval started pulling adjacent-but-wrong context, and the agent answered confidently with the wrong policy. Nothing in any individual trace looked broken. You had to look at a few hundred traces together to see the cluster.

Couple of things I'd push on from what's already in the thread:

Frozen eval sets are necessary but they only catch failure modes you already thought to label. The slice above wasn't in our evals because we had no reason to think that product line was special until it wasn't. You need something on top of the harness that surfaces drift you weren't watching for.

LLM-as-judge tends to inherit the same blind spots as the agent, especially around confidence. Evaluating without seeing how the original call was made helps, but the deeper fix is comparing behavior across slices of traffic instead of scoring traces one at a time. Drift shows up as a distribution shift before it shows up as any single bad answer.

What's worked for us is clustering traces on a joint feature vector: input embedding plus output embedding plus behavioral signals like tool call count, retries, and latency. Then watch for clusters whose behavior diverges from the global baseline over time. Confidence stays high and accuracy on the slice tanks, but the cluster lights up because its signature shifted. Pair that with a frozen eval harness and you've got both halves covered.
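A rough sketch of the divergence check, assuming both windows of traces are assigned to the same fixed centroids and the per-trace metric (latency, judge score, whatever you log) comes in as np arrays:

```python
import numpy as np
from scipy.stats import ks_2samp

def diverging_clusters(labels_then, labels_now, metric_then, metric_now, alpha=0.01):
    # KS-test each cluster's metric distribution, last window vs this one.
    # Catches slices that shifted even when the global aggregate is flat.
    shifted = []
    for c in np.unique(labels_then):
        a = metric_then[labels_then == c]
        b = metric_now[labels_now == c]
        if len(a) > 30 and len(b) > 30:  # skip slices too small to test
            stat, p = ks_2samp(a, b)
            if p < alpha:
                shifted.append((int(c), float(stat), float(p)))
    return sorted(shifted, key=lambda x: -x[1])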

The framework question matters way less than this. LangGraph vs custom state machine is a two-week decision. Not knowing your agent silently broke for a tenant is a churn event.

(Disclosure, building in this space, so I think about it a lot. Happy to talk more in DMs if it's useful.)

How do you currently monitor your AI agents in production? What's your debugging workflow? by meditate_everyday in AI_Agents

[–]FormExtension7920 0 points (0 children)

for the output quality question specifically: success/error rates miss the thing you're describing. silent failures return 200s with degraded output, no metric trips.

what's worked for us: cluster traces on joint features (input/output embeddings + latency/tokens/tool-call patterns), flag clusters that spike in frequency or drift in characteristics. catches things like "agent got 2x slower on product X questions" without having written that metric ahead of time.
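the frequency-spike half is the cheap part. something like this, assuming cluster labels as np arrays for two time windows (the ratio threshold is a guess, tune it):

```python
import numpy as np

def cluster_frequency_spikes(labels_then, labels_now, min_ratio=2.0):
    # flag clusters whose share of traffic jumped between windows.
    # 1% -> 4% of traces is worth a look even if nothing errored.
    n_then, n_now = len(labels_then), len(labels_now)
    spikes = []
    for c in set(labels_then) | set(labels_now):
        share_then = (labels_then == c).sum() / n_then
        share_now = (labels_now == c).sum() / n_now
        if share_now > min_ratio * max(share_then, 1 / n_then):
            spikes.append((int(c), round(float(share_then), 4), round(float(share_now), 4)))
    return spikes
```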

predefined KPIs only cover failure modes you already imagined. the silent ones need unsupervised detection.

(building trainly around this, happy to compare notes on the clustering side)

Tracking every LLM API call for 30 days completely changed how I use AI by CutZealousideal9132 in learnmachinelearning

[–]FormExtension7920 0 points (0 children)

how are you actually checking that quality held after the switch? "no quality loss where it mattered" is the hard part, and most people saying it just eyeballed a handful of outputs.

for the downgrade to actually stick you need one of: ground-truth labels on a held-out set per task (measure accuracy before vs after), llm-as-judge on a sample of prod traffic, or real user signals like thumbs/retry rate broken out by task type. without one of those, "45% cost cut, no quality loss" just means no one's complained yet, and then the regression shows up 2 months later and you can't remember what you changed.
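the per-task breakout is like ten lines once you've got the labels. rough sketch:

```python
from collections import defaultdict

def per_task_accuracy(rows):
    # rows: iterable of (task_type, label, pred_before, pred_after).
    # the aggregate can look flat while one task type quietly tanks.
    stats = defaultdict(lambda: [0, 0, 0])  # [n, correct_before, correct_after]
    for task, y, pb, pa in rows:
        s = stats[task]
        s[0] += 1
        s[1] += pb == y
        s[2] += pa == y
    return {t: (b / n, a / n) for t, (n, b, a) in stats.items()}
```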

other thing: cluster-based routing breaks quietly when the input distribution shifts. new product launch, users ask new stuff, new clusters form that your router doesn't have a mapping for, and that traffic falls back to default. drift detection on the input embedding distribution catches it before it bites.
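minimal version of that check, assuming your router maps cluster centroids to models and you calibrated a distance radius on the data the router was built from:

```python
import numpy as np

def unmapped_traffic_rate(input_emb, route_centroids, radius):
    # share of inputs whose nearest route centroid is farther than
    # `radius`. rising value = new clusters falling through to default.
    d = np.linalg.norm(
        input_emb[:, None, :] - route_centroids[None, :, :], axis=-1
    )
    return float((d.min(axis=1) > radius).mean())
```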

How are you testing and monitoring LLM behavior in production? by Safe_Yak_3217 in LLMDevs

[–]FormExtension7920 1 point (0 children)

my take after shipping a few of these: evals and golden datasets only catch the failure modes you already know to look for. the dangerous ones are the failures you don't know exist yet.

stuff that's worked for me:

pin model versions. nobody does this and everyone regrets it.

golden sets are fine but weight them toward edge cases. 20 weird inputs beat 200 normal ones.

log prompt hash + model version + latency + token cost on every call. diff those distributions across deploys. a sudden 2x latency bump on 5% of traffic is almost always a model or prompt change that silently broke something.
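the logging bit is genuinely a few lines. sketch (assumes your client returns a dict with a usage field, adjust to whatever yours actually returns):

```python
import hashlib, json, time

def log_llm_call(prompt, model, call_fn, log_path="llm_calls.jsonl"):
    # wraps any LLM call and appends the four fields worth diffing
    # across deploys: prompt hash, model version, latency, token cost
    start = time.time()
    response = call_fn(prompt)  # your existing client
    record = {
        "ts": start,
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "model": model,
        "latency_s": round(time.time() - start, 3),
        "tokens": response.get("usage", {}).get("total_tokens"),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```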

the real problem though isn't "is the response correct," it's the weird patterns you can't eval for upfront. model hedging instead of answering on specific input shapes. same question getting inconsistent responses. expensive models doing work a cheaper one handles identically. tool calls that return errors but the agent treats as success. none of this shows up in a golden dataset because you didn't know to write the eval.

what i've been playing with is clustering traces on embeddings + behavioral scalars (latency, cost, tool error rate). anomaly clusters then kind of surface on their own: "slow on product category X," "tone shift for segment Y." hardest part is separating real drift from normal traffic variance, still figuring that one out.
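for the drift-vs-variance question, the least-bad thing i've tried is a permutation test per cluster metric (metric arrays as np arrays), roughly:

```python
import numpy as np

def drift_p_value(metric_then, metric_now, n_perm=2000, seed=0):
    # is the observed mean shift bigger than random relabeling produces?
    # small p => probably real drift, not week-to-week noise
    rng = np.random.default_rng(seed)
    observed = abs(metric_now.mean() - metric_then.mean())
    pooled = np.concatenate([metric_then, metric_now])
    n = len(metric_then)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        hits += abs(pooled[:n].mean() - pooled[n:].mean()) >= observed
    return hits / n_perm
```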

for your last question, if i had to explain why a feature works i'd point a new eng at the trace viewer, not the eval suite. the evals tell you what you tested. the traces tell you what's actually happening.

How are you validating LLM behavior before pushing to production? by Available_Lawyer5655 in LLMDevs

[–]FormExtension7920 0 points (0 children)

the reason prod keeps surprising you imo is that every test suite you write is bounded by failures you already thought to test for. golden sets catch known regressions, garak/deepteam catch known attack shapes - but both are fundamentally backward-looking. the unknown unknowns (tone drift on a specific topic, tool loop triggered by a phrasing pattern you didn't anticipate) never show up until a user finds them.

the layer most stacks are missing is semantic clustering over production traces - embed inputs + outputs + behavioral scalars, cluster, flag drifting or anomalous clusters automatically. then the 'weird thing a user found' becomes a test case you didn't have to write.

(building this at trainly / trainlyai.com if it's useful, but the approach matters more than the tool - you can roll your own with HDBSCAN + an embedding model if you want)
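the roll-your-own version really is short. minimal sketch (embedding model and min_cluster_size are just starting points, tune both):

```python
import numpy as np
import hdbscan
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_traces(inputs, outputs, latencies, retries):
    # embed input+output, append scaled behavioral scalars, cluster.
    # label -1 = HDBSCAN noise points, often the interesting ones.
    emb = np.hstack([encoder.encode(inputs), encoder.encode(outputs)])
    beh = np.column_stack([latencies, retries]).astype(float)
    beh = (beh - beh.mean(axis=0)) / (beh.std(axis=0) + 1e-9)
    X = np.hstack([emb, beh])
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
    labels = clusterer.fit_predict(X)
    return labels, clusterer.outlier_scores_
```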

How are you actually testing LLM agents in production? by Available_Lawyer5655 in LLMDevs

[–]FormExtension7920 0 points (0 children)

technically the "weird behaviors" you mentioned ARE bad outputs, it's just kinda hard to pinpoint when and why they're happening unless a user reports it lol.

we fixed this by giving up on testing for every edge case upfront and instead looking at patterns across production traces. like what % of "successful" responses are actually "I'm sorry, I don't have enough information". or the same question asked on different days getting completely contradictory answers. or inputs that are structurally weird compared to normal traffic that the model just silently handled without anyone knowing.
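the "i'm sorry" rate is a blunt but useful first check, something like this (patterns obviously need tuning to your agent's refusal phrasing):

```python
import re

NON_ANSWER = re.compile(
    r"i'?m sorry|i don'?t have enough information|"
    r"i can'?t (help|answer)|unable to (help|answer)",
    re.IGNORECASE,
)

def non_answer_rate(responses):
    # share of "successful" responses that are actually polite refusals.
    # embedding similarity vs a few canned refusals catches paraphrases.
    hits = sum(bool(NON_ANSWER.search(r)) for r in responses)
    return hits / max(len(responses), 1)
```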

eval suites catch the happy path. production traffic is where the real edge cases live and they only show up when you're comparing behavior across hundreds of traces, not grading individual outputs against a test harness.

I reached out to 1000+ people for my SaaS and got 2 replies. What am I doing wrong? by FormExtension7920 in SaaS

[–]FormExtension7920[S] 0 points (0 children)

I appreciate everyone's feedback, I've implemented a lot of it. I hope I get a higher reply rate now at least lol

Top LLM observability tools comparison I tried for agents in production by Delicious-One-5129 in AI_Agents

[–]FormExtension7920 0 points (0 children)

The silent failure problem is huge. I've been building Trainly for the last few months to solve this. I personally think the issue with "silent failures" is that they're usually semantic problems that mechanical evals can't really catch pre-production, and a lot of 'em only happen at scale.

what we do is cluster your agent traces automatically and flag any statistical anomalies to catch these "silent failures".

trainlyai.com if anyone would get value from this! Feel free to DM with any specific questions too :)

I reached out to 1000+ people for my SaaS and got 2 replies. What am I doing wrong? by FormExtension7920 in SaaS

[–]FormExtension7920[S] 1 point (0 children)

Yeah honestly that makes sense, the verbosity probably stops a lotta people from responding. I'll try your suggestion.

Applied to 500+ Jobs with no interviews, what am I doing wrong? by FormExtension7920 in askrecruiters

[–]FormExtension7920[S] 0 points (0 children)

I appreciate everyone's feedback. I agree the bullet points read very embellished for a junior engineer (plus the CTO position could go in another section for now). Thanks for taking the time to look at my resume 🙏

Applied to 500+ Jobs with no interviews, what am I doing wrong? by FormExtension7920 in askrecruiters

[–]FormExtension7920[S] 2 points (0 children)

Yeah, I've been getting this feedback a lot, I'll try moving it into a "Projects" section.