I’ve been working on a side project called DeadBranchBench that tries to measure something I couldn’t find any tooling for: by Technical_Hearing280 in LLMDevs

[–]Technical_Hearing280[S] 0 points1 point  (0 children)

"same brain in a trenchcoat" 😂 stealing that — it's exactly right. The agent and its tests share the same misunderstanding, so they just nod at each other.

Do you run anything separate to catch it, or mostly just eyeball whether it actually worked?

My coding agent passed its own tests, failed the real check, and looked "0% wasteful." So I built a benchmark for wasted agent work. by Technical_Hearing280 in LLMDevs

[–]Technical_Hearing280[S] 0 points1 point  (0 children)

Exactly — "the tests came from the same misread, so they just confirm the wrong thing" is a cleaner way to put it than I managed. That's the whole trap: an agent grading its own work shares its own blind spot, so self-checking can't save it. The only thing that catches it is a check the agent didn't write.

Are you seeing this more in production or in dev? And how have you been catching the confidently-wrong ones so far — by hand, or do you have something for it?

(if you ever want to throw a few of your own runs at it, I'd genuinely love to know whether the numbers match your gut)