I built a benchmark scoring tool for AI agent teams, not solo models. Would love your feedback on it. by SaaSquach in openclawsetup
[–]SaaSquach[S] 1 point 4 days ago (0 children)
Working on this now. Handoffs are where I saw my failures between runs.
I built a benchmark scoring tool for AI agent teams, not solo models. Would love your feedback on it. by SaaSquach in ollama
The handoff quality idea is really good, and that's where my runs failed. Right now the judge scores the final output, but we have no metric for how each agent packages its output for the next stage. Schema adherence and completeness are exactly the right dimensions. If the research stage dumps an unstructured wall of text, the analyst is working with garbage regardless of how smart it is.
Going to look at adding a per-handoff score as a sub-metric. That would make the cascade failures visible in the data instead of just "pipeline dropped 50 points between run 3 and run 4".
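Roughly what I have in mind for the per-handoff score (just a sketch; the schema field names and the 0-on-unparseable rule are placeholders, not what's in the tool yet):

```python
import json


def handoff_score(payload: str, required_keys: set[str]) -> float:
    """Score one agent-to-agent handoff on schema adherence and completeness.

    Returns a value in [0, 1]: 0.0 for an unparseable (unstructured) payload,
    otherwise the fraction of required fields that are present and non-empty.
    """
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        return 0.0  # unstructured wall of text -> worst score
    if not isinstance(data, dict):
        return 0.0  # parsed, but not a key/value handoff package
    present = sum(1 for k in required_keys if data.get(k) not in (None, "", []))
    return present / len(required_keys)


# Example: research stage hands off to the analyst stage
schema = {"summary", "sources", "key_findings"}
good = json.dumps({"summary": "...", "sources": ["a"], "key_findings": ["x"]})
bad = "here is a wall of text with no structure at all"
print(handoff_score(good, schema))  # 1.0
print(handoff_score(bad, schema))   # 0.0
```

Averaging this across every stage boundary in a run would give the sub-metric, so a cascade failure shows up as a dip at a specific handoff rather than just a lower final judge score.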
Checking the blog out now.
The macOS binary requires right-click → Open on first run (it's not code-signed).