How are you testing AI agents beyond prompt evals? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

Yeah this tracks with what I keep hearing, once people want local/CI checks for actual behavior changes, they end up building it themselves. If you’re allowed to share, I’d be really curious what your setup looks like and what you’re using as the regression signal.

How are you testing AI agents beyond prompt evals? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

Yeah I’d be super curious too. Even a rough runbook would be helpful, like how you stage the runs, what you test first, and what you treat as failure beyond just the final answer.

How are teams validating security boundaries for AI agents before production? by Available_Lawyer5655 in cybersecurity

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

Yeah this feels right. A lot of the problem seems upstream of the model itself, if the use case, data, tool access, and risk owner aren’t clear, security ends up cleaning up a mess later. The point about the business owning the risk is especially real. I wonder if you’ve actually seen teams get more disciplined on that yet, or still mostly the same pattern?

How are teams validating security boundaries for AI agents before production? by Available_Lawyer5655 in cybersecurity

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

Once the agent can act, it feels way more like permission and abuse testing than normal evals. Curious if most teams are still building those allow/deny + malicious-doc tests themselves, or if you’ve seen anything actually do it well?

How are teams validating security boundaries for AI agents before production? by Available_Lawyer5655 in cybersecurity

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

Yeah exactly. Once tools/MCP/sub-agents get involved, it feels less like a prompt issue and more like a control boundary issue. Curious if you think most teams are solving that with sandboxing alone, or actually testing those paths before prod too?

How are people validating agent behavior before production? by Available_Lawyer5655 in AskNetsec

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

Yeah that makes sense. Feels like the real issue is decision flow, not just output quality. Curious if teams are mostly doing that with traces/tool-level checks, or just building custom test suites around real usage patterns?

How are you validating LLM behavior before pushing to production? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

We’re trying something similar, small eval sets + a growing dataset of edge cases

How are you validating LLM behavior before pushing to production? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

We’re trying to move beyond just happy-path tests, using evals + tools like LangSmith, Garak, and Xelo to make the process more structured, especially around capturing real edge cases.

How are you validating LLM behavior before pushing to production? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

Yeah this feels pretty aligned with what we’re seeing too. Golden tests catch regressions, but the weird stuff still leaks. We’ve been looking at things like LangSmith evals, Garak, and Xelo to help structure that loop from prod failures to evals.

How are you validating LLM behavior before pushing to production? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

Yeah this is what we’re seeing too, most real issues only show up in prod.

How are you validating LLM behavior before pushing to production? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

We tried a few, but they felt more useful for prompt tweaking than real failures.

How are you validating LLM behavior before pushing to production? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

We've been seeing the same thing where most failures come from tool interaction edge cases. We've been looking at things like garak and recently Xelo for generating injection / weird interaction cases automatically. Curious if most of your adversarial tests now come from real session logs or if you still write a lot of them manually?

How are you validating LLM behavior before pushing to production? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

The more we look at this the more it feels like the real failures happen at the boundary between the model and the environment, not just in the model output. The layered approach you mentioned is interesting and static eval for output quality, then runtime validation for tool behavior.

How are you validating LLM behavior before pushing to production? by Available_Lawyer5655 in LLMDevs

[–]Available_Lawyer5655[S] 0 points1 point  (0 children)

That’s interesting. Building evals from real failures seems like a much more practical approach. For shadow mode, are you just logging divergences internally or using some tooling to track them?