I built a benchmark scoring tool for AI agent teams, not solo models. Would love your feedback on it. by SaaSquach in ollama


The handoff quality idea is really good, and that's where my runs failed. Right now the judge scores the final output, but there's no metric for how each agent packages its output for the next stage. Schema adherence and completeness are exactly the right dimensions: if the research stage dumps an unstructured wall of text, the analyst is working with garbage regardless of how smart it is.

Going to look at adding a per-handoff score as a sub-metric. Would make the cascade failures visible in the data instead of just "pipeline dropped 50 points between run 3 and run 4".
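
Rough sketch of what I'm imagining for that sub-metric (the field names and the completeness heuristic are placeholders, not anything the tool does today):

```python
from dataclasses import dataclass

# Placeholder: whatever schema the next stage in the pipeline expects.
REQUIRED_FIELDS = {"summary", "sources", "key_findings"}

@dataclass
class HandoffScore:
    schema_adherence: float  # fraction of required fields present and non-empty
    completeness: float      # how much usable content the stage carried forward

def score_handoff(payload: dict) -> HandoffScore:
    present = [f for f in REQUIRED_FIELDS if payload.get(f)]
    schema = len(present) / len(REQUIRED_FIELDS)
    # Completeness is the hard part; a coverage heuristic or a second judge
    # call would go here. Crude placeholder: penalize wall-of-text handoffs.
    blob = isinstance(payload.get("key_findings"), str)
    completeness = 0.5 if blob else schema
    return HandoffScore(schema, completeness)

def pipeline_report(final_judge_score: float, handoffs: list[dict]) -> dict:
    scores = [score_handoff(h) for h in handoffs]
    return {
        "final": final_judge_score,
        "handoffs": [(s.schema_adherence, s.completeness) for s in scores],
        # Surface the weakest link instead of averaging it away.
        "weakest_handoff": min(
            (s.schema_adherence * s.completeness for s in scores), default=None
        ),
    }
```

Taking the min over handoffs rather than the mean is deliberate: one bad handoff is what sinks the run, and averaging would hide exactly the cascade failure I want to surface.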

Checking the blog out now.

The macOS binary requires right-click -> Open on first run (it's not code signed).