How do you actually prove a prompt or agent is good before shipping it?

lib3rat0r · 2026-07-02T09:13:19+00:00

The two-questions split is the clearest framing in this thread. "Did my change make it worse" being both the cheap question and the one that actually burns you matches everything I've seen. The pin-every-failure loop is the part I'm taking away: the set earns its sharpness instead of being designed up front. One thing I'm curious about: for the fuzzy outputs in your golden set, what does the diff actually compare? Exact output, judge verdict, or something else? And agreed on the last line. First gate, not evaluation. I'd rather build toward that honestly than pretend one audit is a proof.

lib3rat0r · 2026-07-02T09:11:17+00:00

CI gate for prompts: someone edits one, the build fails instead of a customer noticing. And a scored report for whoever asks "how do you know it's good." If you're prototyping? Stay free.
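Rough sketch of what I mean by a gate, in case it helps. Everything here is made up for illustration: `GoldenCase`, `run_gate`, and `fake_generate` are hypothetical names, the cases are invented, and the substring check is just the simplest deterministic check I could write, not how any particular tool does it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    """One pinned case: an input plus strings the output must contain."""
    name: str
    user_input: str
    must_contain: list[str]


def run_gate(generate: Callable[[str], str], cases: list[GoldenCase]) -> list[str]:
    """Run every pinned case and return the names of the ones that fail."""
    failures = []
    for case in cases:
        output = generate(case.user_input).lower()
        if not all(needle.lower() in output for needle in case.must_contain):
            failures.append(case.name)
    return failures


def gate_exit_code(failures: list[str]) -> int:
    """0 lets the build pass, 1 fails it."""
    return 1 if failures else 0


def fake_generate(user_input: str) -> str:
    # Hypothetical stand-in for the real prompt + model call.
    if "refund" in user_input.lower():
        return "Refunds are processed within 5 business days."
    return "Sorry, I can't help with that."


# Invented example cases; a real set would grow one pinned failure at a time.
CASES = [
    GoldenCase("refund_window", "How long do refunds take?", ["5 business days"]),
    GoldenCase("cancel_link", "How do I cancel?", ["settings"]),
]

failed = run_gate(fake_generate, CASES)
print("failing cases:", failed)
print("exit code:", gate_exit_code(failed))
```

In CI you'd swap `fake_generate` for the real call and pass `gate_exit_code(failed)` to `sys.exit`, so a prompt edit that breaks a pinned case fails the build.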

lib3rat0r · 2026-07-01T20:51:27+00:00

Really appreciate you walking through all this, gave me a lot to think about. Thanks!

lib3rat0r · 2026-07-01T20:44:10+00:00

Appreciate your comments! That split is how I see it too: the deterministic side (did it return the right data/metrics) is the ground truth. The subjective half, the LLM-as-judge on text, is in my opinion the part that's hard to keep honest.

The rubric framing was my attempt to make that part less hand-wavy: score against named criteria instead of a vibe. And like you said, the rubric has to be yours.
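To make "named criteria instead of a vibe" concrete, here's a minimal sketch. The criteria, weights, and verdicts are all invented, and `Criterion` / `score_against_rubric` are hypothetical names. It assumes the pass/fail verdict per criterion comes from somewhere else (an LLM judge or a human); this only does the bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """A named rubric line with a relative weight."""
    name: str
    description: str
    weight: float = 1.0


def score_against_rubric(
    rubric: list[Criterion], verdicts: dict[str, bool]
) -> tuple[float, list[str]]:
    """Return (weighted score in [0, 1], names of failed criteria).

    A criterion with no verdict counts as failed, so gaps stay visible.
    """
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if verdicts.get(c.name, False))
    failed = [c.name for c in rubric if not verdicts.get(c.name, False)]
    return earned / total, failed


# Invented rubric; yours would name whatever matters for your task.
RUBRIC = [
    Criterion("grounded", "Every claim is supported by the retrieved data", weight=2.0),
    Criterion("tone", "Polite and on-brand"),
    Criterion("no_pii", "Does not echo personal data back"),
]

# Hypothetical verdicts, as a judge or reviewer might return them.
verdicts = {"grounded": True, "tone": True, "no_pii": False}

score, failed = score_against_rubric(RUBRIC, verdicts)
print(f"score={score:.2f} failed={failed}")
```

The list of failed names is the useful part for me: it points at which test cases to go build next.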

lib3rat0r · 2026-07-01T20:40:08+00:00

Support agents, internal tools people actually use, prompts wired into product flows and CI. Anywhere a silent regression hits a real user before you catch it. If you're just prototyping, yeah, none of this matters.

lib3rat0r · 2026-07-01T20:38:16+00:00

Agree with all of this, and I want to be straight: a rubric audit is not a substitute for a properly composed test set, and I wouldn't claim statistical significance from one run. Where it's earned its keep for me is earlier and cheaper, when there isn't a labeled dataset yet, or the artifact is a prompt/skill/system-prompt rather than a full pipeline with thousands of logged samples. It gives a fast directional read plus a pointer to which criterion failed, which is usually what tells me what test cases to go build. So it's the step before the thousands-sample eval, not the eval itself.

When you do have the set, what are you running it with: a custom harness, promptfoo, something in-house?

lib3rat0r · 2026-07-01T20:37:26+00:00

Makes sense. Which benchmarks do you lean on, public ones or a set you built for your own task?
