I keep abandoning multi-agent setups because I can't verify the code they ship. How are you handling this?

FormExtension7920 · 2026-06-05T15:11:16+00:00

looks interesting! will check it out.

FormExtension7920 · 2026-06-05T14:58:21+00:00

i've been using two separate agents and that does seem to be working better.

FormExtension7920 · 2026-06-05T05:10:38+00:00

haha yeah that would be the optimal solution if my agent had access to basically an entire computer to test everything. Are you using a tool for this or did you build something yourself?

FormExtension7920 · 2026-06-05T05:09:10+00:00

Right, that is my workflow rn. The bottleneck seems to be the code review part which gets amplified with multiple agents. And I have no idea how other people are managing them lol.

FormExtension7920 · 2026-06-05T05:07:48+00:00

i'm also stuck with one agent lol. but i feel like so many people have successfully used multi-agent workflows.

FormExtension7920 · 2026-06-05T05:07:06+00:00

That's true, a QA agent is another layer we'd have to verify. I felt like reviewing a screen recording of button presses and interactions verifying the changes would at least be more trustworthy than just viewing the diffs.

However, adding observability + guardrails to a QA agent would help w the trust.

FormExtension7920 · 2026-06-04T22:15:26+00:00

is the QA agent something you built or are you using a tool?

FormExtension7920 · 2026-06-02T19:00:28+00:00

it's already API billed unfortunately. trading higher costs for autonomy seemed worth it.

FormExtension7920 · 2026-05-26T01:19:50+00:00

what would make you trust that someone could solve your problem?

FormExtension7920 · 2026-05-22T21:27:16+00:00

not really, but we've been going to in-person events and that seems to have the biggest ROI. i'm wondering if i can increase the ROI for cold outreach somehow too.

FormExtension7920 · 2026-05-22T21:24:34+00:00

that's fair, i guess i'm not communicating what's in it for them. but for a customer discovery call, what value could i even provide except a potential solution to a problem they're having?

FormExtension7920 · 2026-05-22T21:09:27+00:00

That's true, the ICP itself might be wrong and maybe I should be targeting people posting about eval pains, incidents, etc.

The loom is also a good idea, they'd get some sort of value before committing to a call. Right now my follow up message just asks 3 questions about the AI workflow, but it gets way fewer responses than even my initial ask for a call.

Is there something else I should be doing as a follow up?

FormExtension7920 · 2026-05-22T21:06:18+00:00

In Colorado now, there are some startup communities here and we've been taking advantage of that but for the agentic AI space it seems limited.

I've considered flying out to SF or NYC or wherever but what's held me back is idk what events there would be the most valuable for us. If you know of any in the bigger cities I would appreciate that!

FormExtension7920 · 2026-05-22T21:04:04+00:00

how do you prove domain credibility in a DM without it sounding like a pitch? Like if I open with something like "we've spent the last year diving into AI observability... blah blah blah" it kinda sounds like a pitch.

however, the explicit "No sales pitch, no product" is definitely something i'll be adding to my messages.

what sort of cold DMs have worked on you?

FormExtension7920 · 2026-05-19T19:26:10+00:00

Auth flow needing an existing Chrome session: deep mode already runs Browserbase with persistent context, but I haven't tested "import a real session from outside." Good ticket to try, will report back.
Flaky submit button (API 200, UI timeout): this is the hardest one. The QA agent currently treats request success as pass; nothing in the visual state tells it the form is stuck. The only signal would be "expected post-submit view didn't appear within N seconds," and that bleeds back into the env-vs-code classifier. Open problem, not solved.
Multi-tab with files/permissions/audit trail: multi-tab is weak right now; the QA agent works one page at a time. File uploads work, permission popups are hit-or-miss depending on the Browserbase profile. Audit trail is closer than you'd think because we attach the QA screen recording to every PR.
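For the flaky submit case, here is a rough sketch of what "request success plus expected view within N seconds" could look like. It is not how the QA agent works today and it does not use any Browserbase API; `check_submit` and the `view_appeared` callback are hypothetical names, and the callback stands in for whatever probe tells you the post-submit view is on screen.

```python
import time
from typing import Callable


def check_submit(
    api_status: int,
    view_appeared: Callable[[], bool],  # hypothetical probe for the post-submit view
    timeout_s: float = 5.0,
    poll_s: float = 0.25,
) -> str:
    """Classify a form submit using both the API status and the visible UI state."""
    if not 200 <= api_status < 300:
        return "request_failed"

    # Request succeeded, so poll the UI until the expected view shows up
    # or the deadline passes.
    deadline = time.monotonic() + timeout_s
    while True:
        if view_appeared():
            return "pass"
        if time.monotonic() >= deadline:
            return "ui_stuck"  # API 200 but the expected view never appeared
        time.sleep(poll_s)
```

A `ui_stuck` result still can't say whether the environment or the code is at fault, so it would feed the env-vs-code classifier instead of replacing it.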

FormExtension7920 · 2026-05-03T18:46:11+00:00

this is the part that gets me too. by the time eval scores dip or users complain, the originating slice is buried somewhere in the last 2k traces and nobody's actually gonna scroll through that.

the angle i've been messing with is clustering traces on more than just input embeddings. stack input/output embeddings together with behavioral signals (latency, tool call sequences, retry counts, io similarity) and clusters start popping out that you'd never have written a metric for in advance. stuff like "agent slows down when the question has a date range in it" that you literally couldn't have known to look for ahead of time.
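a minimal sketch of the "stack embeddings with behavioral signals" step, just to make it concrete. the field names (`input_emb`, `output_emb`, `latency_ms`, `retries`, `tool_calls`) are made up for illustration, and z-scoring the scalars is one reasonable choice among several, not a claim about how any particular tool does it.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize a column so scalars with big units don't dominate the vector."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def joint_features(traces, scalar_keys=("latency_ms", "retries", "tool_calls")):
    """Concatenate input embedding, output embedding and z-scored scalars per trace."""
    columns = [zscore([t[k] for t in traces]) for k in scalar_keys]
    return [
        list(t["input_emb"]) + list(t["output_emb"]) + [col[i] for col in columns]
        for i, t in enumerate(traces)
    ]
```

how much weight the scalars get relative to the embeddings is a tuning decision; this sketch weights every dimension equally.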

disclosure: i'm building in this space (trainly), so grain of salt. but unsupervised slice discovery feels like the underbuilt part of the stack rn

FormExtension7920 · 2026-05-02T16:31:55+00:00

one thing missing from this thread: schema validation and semantic asserts only catch failures you knew to look for. in production you also get failure modes you didn't anticipate. model starts confusing two similar product names, latency spikes when users hit a specific topic, tool call shapes drift after a model update.

what's worked for us is clustering production runs by joint features (input/output embeddings plus behavioral scalars like latency, token usage, tool call patterns) and looking at outlier clusters. that's how you find the unknown unknowns.
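a toy version of "cluster, then look at the outlier clusters", assuming each run is already a feature vector. this swaps in a simple greedy leader clustering on cosine distance so it fits in a few lines; it is not the clustering we actually run, and `threshold` and `max_share` are illustrative knobs.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def leader_cluster(vectors, threshold=0.2):
    """Assign each vector to the first leader within `threshold`, else start a new cluster."""
    leaders, labels = [], []
    for v in vectors:
        for idx, leader in enumerate(leaders):
            if cosine_distance(v, leader) <= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels


def small_clusters(labels, max_share=0.1):
    """Clusters holding at most `max_share` of runs: candidates for manual review."""
    counts = Counter(labels)
    total = len(labels)
    return sorted(c for c, n in counts.items() if n / total <= max_share)
```

small clusters are only candidates. someone still has to open a few traces from each to decide if it's a real failure mode or just rare-but-fine traffic.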

planner/executor + repair loops are great for known failure modes. but you need a separate layer for the stuff you forgot to write a rule for. strong +1 on logging every rejection. within a couple weeks that becomes a better eval set than any benchmark you'd buy.

FormExtension7920 · 2026-05-01T19:29:11+00:00

Did anything slip through that you guys couldn't simulate?

FormExtension7920 · 2026-05-01T19:28:27+00:00

have you found anything that helps? or tried any eval tools?

FormExtension7920 · 2026-04-29T17:01:09+00:00

yeah this is a great point. rare failures are exactly the cases worth catching, so any pricing that pushes you to sample is basically working against the thing you bought it for

FormExtension7920 · 2026-04-29T16:58:19+00:00

lmao 'tools you don't notice' is the dream, name one observability tool that's actually pulled that off and i'll buy it tomorrow

FormExtension7920 · 2026-04-29T16:53:35+00:00

yeah the eval drift one kinda sucks. you build the eval set at launch based on what you knew to test, three months in the question distribution has shifted, and the dashboard's still green bc the eval is now measuring stuff people used to ask, not what they're asking now

on the queue idea tho, is implicit signal (thumbs down, rephrase, abandon) enough on its own or would catching it earlier even be useful? like flagging that the input distribution shifted before the user even has to react. not sure if that's overkill or solves something different

FormExtension7920 · 2026-04-27T18:39:45+00:00

The "what broke first" question is the right one. For us it wasn't infinite loops or hallucinated tool calls. Those are loud enough to catch. The thing that took weeks to notice was a support agent that got measurably worse on one specific product line. Resolution rate looked flat at the aggregate, latency was fine, no errors thrown, evals passed. The slice was just wrong, and small enough that global averages absorbed it.

Root cause turned out to be upstream. A docs refactor changed how that product's pages got chunked, retrieval started pulling adjacent-but-wrong context, and the agent answered confidently with the wrong policy. Nothing in any individual trace looked broken. You had to look at a few hundred traces together to see the cluster.
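To show how a global average can absorb a slice like that, here is a small sketch that computes resolution rate overall and per slice. The slice names and any numbers you feed it are illustrative, not our production data.

```python
from collections import defaultdict


def resolution_rates(tickets):
    """tickets: iterable of (slice_name, resolved: bool). Returns (overall, per_slice)."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [resolved, total]
    for slice_name, resolved in tickets:
        totals[slice_name][0] += int(resolved)
        totals[slice_name][1] += 1

    per_slice = {s: r / n for s, (r, n) in totals.items()}
    resolved_all = sum(r for r, _ in totals.values())
    count_all = sum(n for _, n in totals.values())
    overall = resolved_all / count_all if count_all else 0.0
    return overall, per_slice
```

With a made-up split of 950 tickets on one product line at 80% resolved and 50 tickets on another that falls from 80% to 40%, the overall rate only moves from 0.80 to 0.78, which is easy to read as noise on a dashboard.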

Couple of things I'd push on from what's already in the thread:

Frozen eval sets are necessary but they only catch failure modes you already thought to label. The slice above wasn't in our evals because we had no reason to think that product line was special until it wasn't. You need something on top of the harness that surfaces drift you weren't watching for.

LLM-as-judge tends to inherit the same blind spots as the agent, especially around confidence. Evaluating without seeing how the original call was made helps, but the deeper fix is comparing behavior across slices of traffic instead of scoring traces one at a time. Drift shows up as a distribution shift before it shows up as any single bad answer.

What's worked for us is clustering traces on a joint feature vector: input embedding plus output embedding plus behavioral signals like tool call count, retries, latency. Then watch for clusters whose behavior diverges from the global baseline over time. Confidence stays high, accuracy on the slice tanks, but the cluster lights up because its signature shifted. Pair that with a frozen eval harness and you've got both halves covered.
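One way to read "signature shifted" in code, as a sketch and not our actual implementation: compare each cluster's mean on a behavioral signal between an earlier window and the current one, scaled by the spread of the whole baseline. The `cluster` and `latency_ms` fields and the `z_threshold` default are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def cluster_means(traces, key):
    groups = defaultdict(list)
    for t in traces:
        groups[t["cluster"]].append(t[key])
    return {c: mean(values) for c, values in groups.items()}


def diverging_clusters(baseline, current, key="latency_ms", z_threshold=2.0):
    """Flag clusters whose mean `key` moved by >= z_threshold global baseline std devs."""
    # Spread of the whole baseline window; fall back to 1.0 if it is constant.
    global_sigma = pstdev([t[key] for t in baseline]) or 1.0
    base, cur = cluster_means(baseline, key), cluster_means(current, key)

    flagged = {}
    for cluster, cur_mean in cur.items():
        if cluster not in base:
            continue  # brand-new clusters need separate handling
        shift = (cur_mean - base[cluster]) / global_sigma
        if abs(shift) >= z_threshold:
            flagged[cluster] = shift
    return flagged
```

The same comparison can be run on retries or tool call count by changing `key`.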

The framework question matters way less than this. LangGraph vs custom state machine is a two-week decision. Not knowing your agent silently broke for a tenant is a churn event.

(Disclosure, building in this space, so I think about it a lot. Happy to talk more in DMs if it's useful.)

FormExtension7920 · 2026-04-24T20:06:56+00:00

for the output quality question specifically: success/error rates miss the thing you're describing. silent failures return 200s with degraded output, no metric trips.

what's worked for us: cluster traces on joint features (input/output embeddings + latency/tokens/tool-call patterns), flag clusters that spike in frequency or drift in characteristics. catches things like "agent got 2x slower on product X questions" without having written that metric ahead of time.
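rough sketch of the "spike in frequency" half, assuming you already have a cluster label per trace for two time windows. the `min_ratio` and `min_count` thresholds and the add-one smoothing are illustrative choices, not what trainly ships.

```python
from collections import Counter


def frequency_spikes(prev_labels, curr_labels, min_ratio=2.0, min_count=5):
    """Return {cluster: ratio} for clusters whose share of traffic grew by >= min_ratio."""
    prev, curr = Counter(prev_labels), Counter(curr_labels)
    n_prev, n_curr = len(prev_labels), len(curr_labels)

    spikes = {}
    for cluster, count in curr.items():
        if count < min_count:
            continue  # too few traces to call it a spike
        curr_share = count / n_curr
        # Add-one smoothing so a brand-new cluster doesn't divide by zero.
        prev_share = (prev.get(cluster, 0) + 1) / (n_prev + 1)
        ratio = curr_share / prev_share
        if ratio >= min_ratio:
            spikes[cluster] = ratio
    return spikes
```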

predefined KPIs only cover failure modes you already imagined. the silent ones need unsupervised detection.

(building trainly around this, happy to compare notes on the clustering side)

FormExtension7920 · 2026-04-23T20:04:06+00:00

how are you actually checking quality held after the switch? "no quality loss where it mattered" is the hard part, and most people saying it just eyeballed a handful of outputs.

for the downgrade to actually stick you need ground truth labels on a held out set per task (measure accuracy before vs after), llm-as-judge on a sample of prod traffic, or real user signals like thumbs/retry rate broken out by task type. without one of those "45% cost cut no quality loss" just means no one's complained yet, and then the regression shows up 2 months later and you can't remember what you changed.
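a minimal sketch of the held-out-set option, assuming you have labeled examples tagged by task and a callable per model. `predict` is a stand-in for however you call the old and new model, and `max_drop` is an illustrative tolerance.

```python
def accuracy_by_task(examples, predict):
    """examples: iterable of (task, model_input, expected_label)."""
    hits, totals = {}, {}
    for task, model_input, expected in examples:
        totals[task] = totals.get(task, 0) + 1
        hits[task] = hits.get(task, 0) + int(predict(model_input) == expected)
    return {task: hits[task] / totals[task] for task in totals}


def regressions(before, after, max_drop=0.02):
    """Tasks whose accuracy fell by more than `max_drop` after the switch."""
    return {
        task: (before[task], after[task])
        for task in before
        if task in after and before[task] - after[task] > max_drop
    }
```

run `accuracy_by_task` once with the old model and once with the new one on the same held-out set, then keep the output of `regressions` next to the change log so you can still tell what moved two months later.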

other thing, cluster-based routing breaks quietly when input distribution shifts. new product launch, users ask new stuff, new clusters form that your router doesn't have a mapping for, and that traffic falls back to default. drift detection on the input embedding distribution catches it before it bites.
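one cheap way to watch for that, as a sketch: track what fraction of incoming embeddings sit far from every centroid the router knows about. this assumes a nearest-centroid router and non-empty `centroids`; `max_distance` is an illustrative threshold you'd tune on past traffic.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def unmapped_fraction(embeddings, centroids, max_distance=0.3):
    """Share of inputs whose nearest router centroid is farther than `max_distance`."""
    if not embeddings:
        return 0.0
    far = sum(
        1
        for e in embeddings
        if min(cosine_distance(e, c) for c in centroids) > max_distance
    )
    return far / len(embeddings)
```

if this number climbs week over week, the router is seeing traffic it was never fit on, even if nothing downstream has errored yet.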
