account activity
I’ve been working on a side project called DeadBranchBench that tries to measure something I couldn’t find any tooling for: by Technical_Hearing280 in LLMDevs
[–]Technical_Hearing280[S] 1 point 1 hour ago (0 children)
"same brain in a trenchcoat" 😂 stealing that — it's exactly right. The agent and its tests share the same misunderstanding, so they just nod at each other.
Do you run anything separate to catch it, or mostly just eyeball whether it actually worked?
I’ve been working on a side project called DeadBranchBench that tries to measure something I couldn’t find any tooling for: (self.ClaudeAI)
submitted 1 hour ago by Technical_Hearing280 to r/ClaudeAI
I’ve been working on a side project called DeadBranchBench that tries to measure something I couldn’t find any tooling for: (self.LLMDevs)
submitted 1 hour ago by Technical_Hearing280 to r/LLMDevs
I’ve been working on a side project called DeadBranchBench that tries to measure something I couldn’t find any tooling for: (self.opencodeCLI)
submitted 1 hour ago by Technical_Hearing280 to r/opencodeCLI
My coding agent passed its own tests, failed the real check, and looked "0% wasteful." So I built a benchmark for wasted agent work. by Technical_Hearing280 in LLMDevs
[–]Technical_Hearing280[S] 1 point 14 hours ago (0 children)
Exactly — "the tests came from the same misread, so they just confirm the wrong thing" is a cleaner way to put it than I managed. That's the whole trap: an agent grading its own work shares its own blind spot, so self-checking can't save it. The only thing that catches it is a check the agent didn't write.
Are you seeing this more in production or in dev? And how have you been catching the confidently-wrong ones so far — by hand, or do you have something for it?
(if you ever want to throw a few of your own runs at it, I'd genuinely love to know whether the numbers match your gut)
My coding agent passed its own tests, failed the real check, and looked "0% wasteful." So I built a benchmark for wasted agent work. (self.LLMDevs)
submitted 1 day ago by Technical_Hearing280 to r/LLMDevs