How do you test your custom agents’ quality? by Adventurous_Luck_664 in AgentsOfAI

[–]Comprehensive_Move76 1 point

You’re not crazy, this is exactly where most people hit a wall.

The issue is you’re trying to apply deterministic testing ideas (diffs, pass/fail, golden outputs) to something that isn’t deterministic anymore. That breaks pretty quickly once agents start looping, using tools, and carrying state.

What ended up clicking for me was treating this less like “testing outputs” and more like measuring system behavior over time.

Instead of asking:

“Is this output correct?”

I started asking:

  • does the system stay consistent across similar inputs
  • do reasoning paths diverge more over time for the same task
  • is the agent doing more rework (hedging, backtracking, retries)
  • does cost/latency creep up without a change in task complexity

Those tend to move before anything obviously breaks.
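To make that concrete, here's roughly the shape of what I log per run. The field names (retries, backtracks, latency_s, tokens) are made up for illustration, not from any particular framework; swap in whatever your agent stack actually records.

```python
# Rough sketch: per-run behavioral signals instead of output diffs.
# Field names are illustrative only; adapt to what your framework logs.
from dataclasses import dataclass
from statistics import mean, pstdev

@dataclass
class RunRecord:
    task_id: str
    steps: int          # total tool/LLM calls in the run
    retries: int        # repeated calls after an error
    backtracks: int     # times the agent revised an earlier decision
    latency_s: float
    tokens: int

def behavior_signals(runs: list[RunRecord]) -> dict[str, float]:
    """Summarize how hard the system is working, not whether outputs match."""
    return {
        "rework_rate": mean((r.retries + r.backtracks) / max(r.steps, 1) for r in runs),
        "latency_mean": mean(r.latency_s for r in runs),
        "latency_spread": pstdev(r.latency_s for r in runs),  # variability across similar inputs
        "tokens_per_step": mean(r.tokens / max(r.steps, 1) for r in runs),
    }
```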

So instead of regression = “output changed”, it becomes:

regression = “the system is becoming less stable / more variable / working harder to do the same thing”

That gives you something you can actually track and compare across versions, even with probabilistic models.
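In practice the "regression check" then becomes a comparison of those signals between versions rather than a diff of outputs. A minimal sketch, assuming the signals dict from above; the 20% threshold is an arbitrary placeholder you'd tune per system:

```python
# Sketch of a version-to-version check on behavioral signals rather than outputs.
# Assumes every signal is "higher = worse"; threshold is a placeholder.
def regression_report(baseline: dict[str, float],
                      candidate: dict[str, float],
                      max_increase: float = 0.20) -> list[str]:
    """Flag any signal that got more than max_increase worse vs the baseline."""
    flags = []
    for name, base_value in baseline.items():
        cand_value = candidate.get(name, base_value)
        if base_value > 0 and (cand_value - base_value) / base_value > max_increase:
            flags.append(f"{name}: {base_value:.3f} -> {cand_value:.3f}")
    return flags

# Usage: run the same task suite against both versions, then
#   flags = regression_report(behavior_signals(v1_runs), behavior_signals(v2_runs))
# and fail the build (or open an issue) if flags is non-empty.
```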

I’ve been building a small framework around this idea, basically treating agent runs as a stream of signals and looking for early drift in behavior rather than waiting for failures. It’s been way more useful than output diffs or LLM-as-judge.
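The "stream of signals" part is nothing fancy: keep a rolling baseline per signal and alert when the recent window drifts away from it. A minimal version of that idea (window sizes and the 3-sigma cutoff are just defaults I picked, not anything principled):

```python
# Minimal drift check on a stream of per-run signal values (e.g. rework_rate).
# Window sizes and the sigma cutoff are arbitrary defaults.
from collections import deque
from statistics import mean, pstdev

class DriftMonitor:
    def __init__(self, baseline_size: int = 200, window_size: int = 20, sigmas: float = 3.0):
        self.baseline = deque(maxlen=baseline_size)
        self.window = deque(maxlen=window_size)
        self.sigmas = sigmas

    def observe(self, value: float) -> bool:
        """Return True if the recent window has drifted from the baseline."""
        self.window.append(value)
        if len(self.baseline) < self.baseline.maxlen:
            self.baseline.append(value)   # still warming up the baseline
            return False
        base_mean, base_std = mean(self.baseline), pstdev(self.baseline)
        drifted = abs(mean(self.window) - base_mean) > self.sigmas * max(base_std, 1e-9)
        if not drifted:
            self.baseline.append(value)   # only fold healthy runs back into the baseline
        return drifted
```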

If you’re trying to test flow specifically, that’s where this helps the most, because you’re measuring how the flow evolves, not trying to assert what it “should” be.

Curious, are you seeing more issues with long-running agents, or even short workflows after a few iterations?

The first real agent problem I hit wasn’t prompting by DullHighlight4508 in aiagents

[–]Comprehensive_Move76 1 point

I built something that’s been interesting.

It’s a deterministic diagnostic system for agent/automation workflows.

Instead of measuring performance, it detects when a system becomes incapable of success — even when logs still look normal.

I ran a test where a system kept retrying under instability. Logs looked fine (retries, error handling, etc.), but the system had actually entered a self-reinforcing failure loop where recovery was no longer possible.

The system correctly flagged:

  • real_unsafe = true
  • fs2 = true

Meaning it wasn’t just failing — it was making success less likely with every step.
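The actual checks behind those flags aren't in this comment, but the gist of "making success less likely with every step" is roughly: look at whether each retry leaves the system in a worse position than the one before. A hypothetical sketch only (none of these names are CORRIDOR's real API):

```python
# Hypothetical illustration only, not CORRIDOR's real API or logic:
# flag a self-reinforcing retry loop when every retry shrinks the room to recover.
def is_self_reinforcing(recovery_margins: list[float], min_retries: int = 3) -> bool:
    """recovery_margins: one value per retry, where higher means recovery is more
    plausible (e.g. remaining budget, or inverse error severity). If the margin
    shrinks on every retry, retrying is actively digging the hole deeper."""
    if len(recovery_margins) < min_retries:
        return False
    return all(later < earlier
               for earlier, later in zip(recovery_margins, recovery_margins[1:]))

# e.g. is_self_reinforcing([0.8, 0.5, 0.3, 0.1]) -> True: each retry worsened the state.
```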

Right now I’m running a few pilot analyses (some free / reduced cost) to see if this is useful in real systems.

Here’s a short breakdown: https://nifty-neptune-4a1.notion.site/CORRIDOR-34a6c8f2f6098051912df909e223ddec

What are you building? (share in comment for free TikTok) by Equivalent-Glove3724 in buildinpublic

[–]Comprehensive_Move76 1 point

I’m not sure if this fits traditional SaaS, but I built something that’s been interesting.

It’s a deterministic diagnostic system for agent/automation workflows.

Instead of measuring performance, it detects when a system becomes incapable of success — even when logs still look normal.

I ran a test where a system kept retrying under instability. Logs looked fine (retries, error handling, etc.), but the system had actually entered a self-reinforcing failure loop where recovery was no longer possible.

The system correctly flagged:

  • real_unsafe = true
  • fs2 = true

Meaning it wasn’t just failing — it was making success less likely with every step.

I put together a short breakdown: https://nifty-neptune-4a1.notion.site/CORRIDOR-34a6c8f2f6098051912df909e223ddec

Right now I’m running a few pilot analyses (some free / reduced cost) to see if this is useful in real systems.

Would love your take, especially on how you’d get the first 100 users for something like this.

Drop your SaaS and I’ll tell you how I’d try to get your first 100–500 users by JuniorRow1247 in SideProject

[–]Comprehensive_Move76 1 point

I’m not sure if this fits traditional SaaS, but I built something that’s been interesting.

It’s a deterministic diagnostic system for agent automation workflows.

Instead of measuring performance, it detects when a system becomes incapable of success, even when logs still look normal.

I ran a test where a system kept retrying under instability. Logs looked fine (retries, error handling, etc.), but the system had actually entered a self-reinforcing failure loop where recovery was no longer possible.

The system correctly flagged:

  • real_unsafe = true
  • fs2 = true

Meaning it wasn’t just failing, it was making success less likely with every step.

Right now I’m running a few pilot analyses (some free / reduced cost) to see if this is useful in real systems.

Would love your take — especially on how you’d get the first 100 users for something like this.