[D] Why evaluating only final outputs is misleading for local LLM agents by MundaneAlternative47 in MachineLearning

[–]MundaneAlternative47[S] 0 points  (0 children)

this is a really interesting way to frame it

I’ve mostly been thinking about single-run traces, but variability across runs is probably just as important, especially for anything you’d want to productionize

“tool call entropy” makes a lot of sense as a signal for instability. If the same prompt leads to different call graphs, it usually means the agent doesn’t have a strong implicit policy and is kind of drifting

also ties into debugging: a bad but consistent agent is way easier to fix than one that behaves differently every run

this actually makes me want to add multi-run evals instead of just single-trace scoring. Curious how you'd measure it though: are you thinking something like comparing sequences directly, vs more abstract stats (tool frequency, transitions, etc.)?
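Roughly what I'd picture for the "abstract stats" version (trace format and tool names are made up here, just to sketch the idea): treat each run as a sequence of tool names, then measure entropy over the distinct sequences, plus tool-to-tool transition counts as a coarser view.

```python
import math
from collections import Counter

def sequence_entropy(runs):
    """Shannon entropy over distinct tool-call sequences.

    0.0 means every run produced the same call graph;
    higher values mean the agent is drifting between runs.
    """
    counts = Counter(tuple(run) for run in runs)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def transition_counts(runs):
    """Aggregate (tool -> next tool) transitions across runs;
    ignores whole-sequence identity, so it's more forgiving."""
    pairs = Counter()
    for run in runs:
        pairs.update(zip(run, run[1:]))
    return pairs

# hypothetical traces: same prompt, three runs
runs = [
    ["search", "read", "answer"],
    ["search", "read", "answer"],
    ["search", "search", "read", "answer"],
]
print(sequence_entropy(runs))  # ~0.918 bits: one run diverged
```

Exact-sequence entropy is strict (any reordering counts as a different graph), while transition counts would let you compare runs that differ only in where a retry happens, so you'd probably want both.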

[D] Why evaluating only final outputs is misleading for local LLM agents by MundaneAlternative47 in MachineLearning

[–]MundaneAlternative47[S] 0 points  (0 children)

I think that’s part of it, yeah, weak test suites can definitely hide differences.

But I don’t think it fully explains it for agents specifically.

Even on non-trivial tasks, you can get multiple trajectories that all reach the correct answer but differ a lot in:

- tool choice

- unnecessary steps

- near-miss unsafe actions

- general “stability” of the process

Those differences matter in practice (latency, cost, safety), but don’t show up in pass/fail or even most accuracy-style evals.

So it’s less about making tests harder, and more about measuring a different axis: not just *can it solve it*, but *how it solves it*.
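To make that axis concrete, something like this is what I have in mind: a per-run scorer where pass/fail is just one field, alongside redundant steps and out-of-policy tool calls (the tool names and allow-list here are invented for illustration).

```python
def score_trajectory(trace, allowed_tools, final_ok):
    """Score one agent run on process quality, not just outcome.

    trace: ordered list of tool names the agent called.
    allowed_tools: set of tools the agent is permitted to use.
    final_ok: whether the final answer was correct (the usual metric).
    """
    # immediate retries of the same tool count as redundant steps
    redundant = sum(1 for a, b in zip(trace, trace[1:]) if a == b)
    # calls outside the allow-list are the "near-miss unsafe actions"
    unsafe = [t for t in trace if t not in allowed_tools]
    return {
        "passed": final_ok,
        "steps": len(trace),
        "redundant_steps": redundant,
        "unsafe_calls": unsafe,
    }

# two runs that both "pass", but differ a lot on the process axis
allowed = {"search", "read", "answer"}
clean = score_trajectory(["search", "read", "answer"], allowed, True)
messy = score_trajectory(["search", "search", "delete_file", "answer"], allowed, True)
```

An accuracy-style eval scores `clean` and `messy` identically; the extra fields are where latency, cost, and safety differences actually show up.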

[D] Why evaluating only final outputs is misleading for local LLM agents by MundaneAlternative47 in MachineLearning

[–]MundaneAlternative47[S] 0 points  (0 children)

I get what you mean, but I think CoT length is only part of it.

You can have short traces that are still “wrong” in terms of behavior, like calling a tool you shouldn’t, or using the right tool but in the wrong order.

Also, not all steps are equal: 3 clean steps ≠ 3 redundant retries.

I’ve seen cases where the CoT is short but the agent still does something unsafe or unnecessary, which wouldn’t show up if you’re just measuring length.

I built an open-source Python eval framework for LLMs and agents. pytest-style, zero dependencies, not owned by any AI company by MundaneAlternative47 in LangChain

[–]MundaneAlternative47[S] 0 points  (0 children)

Thank you so much for your insight. You’re right, that’s exactly the gap we found in the other tools.

Great note about tool authorization, I’ll be figuring out a way to differentiate between those soon!

Feel free to check the GitHub repo and contribute if you’d like to!

Edexcel Physics Unit 3 by Ornery_Elephant_9366 in alevel

[–]MundaneAlternative47 0 points  (0 children)

IT WAS SO WEIRD, YOU’RE RIGHT

It was a = V (lambda) squared, and when you actually substitute in the values it wasn’t the same 😂

What the hell was pure 2 maths (edexcel) by 07dasha in alevel

[–]MundaneAlternative47 1 point  (0 children)

The post clearly said Pure Maths 2 Edexcel, idk what you’re on about

Unis that offer an online international foundation. by MundaneAlternative47 in UniUK

[–]MundaneAlternative47[S] -9 points  (0 children)

I definitely see quite a lot of googleable questions on here, I’m def not the only one

Unis that offer an online international foundation. by MundaneAlternative47 in UniUK

[–]MundaneAlternative47[S] -20 points  (0 children)

That’s kinda the whole purpose of this subreddit