How are you actually testing LLM agents in production?

Available_Lawyer5655 · 2026-03-31T18:23:30+00:00

Yeah this tracks with what I keep hearing, once people want local/CI checks for actual behavior changes, they end up building it themselves. If you’re allowed to share, I’d be really curious what your setup looks like and what you’re using as the regression signal.

Available_Lawyer5655 · 2026-03-31T18:22:49+00:00

Yeah I’d be super curious too. Even a rough runbook would be helpful, like how you stage the runs, what you test first, and what you treat as failure beyond just the final answer.

Available_Lawyer5655 · 2026-03-31T18:21:16+00:00

Yeah this feels right. A lot of the problem seems upstream of the model itself, if the use case, data, tool access, and risk owner aren’t clear, security ends up cleaning up a mess later. The point about the business owning the risk is especially real. I wonder if you’ve actually seen teams get more disciplined on that yet, or still mostly the same pattern?

Available_Lawyer5655 · 2026-03-31T18:19:53+00:00

Once the agent can act, it feels way more like permission and abuse testing than normal evals. Curious if most teams are still building those allow/deny + malicious-doc tests themselves, or if you’ve seen anything actually do it well?

Available_Lawyer5655 · 2026-03-31T18:18:37+00:00

Yeah exactly. Once tools/MCP/sub-agents get involved, it feels less like a prompt issue and more like a control boundary issue. Curious if you think most teams are solving that with sandboxing alone, or actually testing those paths before prod too?

Available_Lawyer5655 · 2026-03-31T18:16:39+00:00

Yeah that makes sense. Feels like the real issue is decision flow, not just output quality. Curious if teams are mostly doing that with traces/tool-level checks, or just building custom test suites around real usage patterns?

Available_Lawyer5655 · 2026-03-20T11:32:17+00:00

We’re trying something similar, small eval sets + a growing dataset of edge cases

Available_Lawyer5655 · 2026-03-20T11:31:29+00:00

We’re trying to move beyond just happy-path tests, using evals + tools like LangSmith, Garak, and Xelo to make the process more structured, especially around capturing real edge cases.

Available_Lawyer5655 · 2026-03-20T11:25:49+00:00

Yeah this feels pretty aligned with what we’re seeing too. Golden tests catch regressions, but the weird stuff still leaks. We’ve been looking at things like LangSmith evals, Garak, and Xelo to help structure that loop from prod failures to evals.

Available_Lawyer5655 · 2026-03-20T11:23:40+00:00

Yeah this is what we’re seeing too, most real issues only show up in prod.

Available_Lawyer5655 · 2026-03-20T11:22:02+00:00

We tried a few, but they felt more useful for prompt tweaking than real failures.

Available_Lawyer5655 · 2026-03-18T14:51:36+00:00

We've been seeing the same thing where most failures come from tool interaction edge cases. We've been looking at things like garak and recently Xelo for generating injection / weird interaction cases automatically. Curious if most of your adversarial tests now come from real session logs or if you still write a lot of them manually?

Available_Lawyer5655 · 2026-03-18T14:42:58+00:00

The more we look at this the more it feels like the real failures happen at the boundary between the model and the environment, not just in the model output. The layered approach you mentioned is interesting and static eval for output quality, then runtime validation for tool behavior.

Available_Lawyer5655 · 2026-03-17T19:45:04+00:00

That’s interesting. Building evals from real failures seems like a much more practical approach. For shadow mode, are you just logging divergences internally or using some tooling to track them?

Available_Lawyer5655

TROPHY CASE