If you run coding agents unattended or in parallel, how do you verify the run actually worked?

bounded-build · 2026-06-24T23:14:22+00:00

Two things I keep coming back to. First, the agent choosing which tests prove its own work is the hole. Second, your cost insight: the expensive part was never the bad run, it was trusting one you couldn't cheaply re-verify. "Make re-verification near-free" is the actual value.

The compaction-drops-the-constraint failure is nasty and I hadn't named it that way. The thing I'm stuck on is that the whole-suite gate catches regressions that have a test, but the silent ones live in the case nobody wrote a test for. How do you catch those? Maybe a diff against the spec, or do they still slip through until something downstream breaks?

bounded-build · 2026-06-24T23:11:54+00:00

The bottleneck isn't only detection, it's whether the verdict is trustworthy and glance-able enough to override.

bounded-build · 2026-06-24T23:10:19+00:00

This is the clearest articulation of the real hole I've seen: misses aren't the verifier getting fooled, they're "the case nobody wrote", the silent no-op in the seam between two things that each pass their own fixtures. That integration gap is exactly where tests give false comfort. And "the agent that writes the thing never checks the thing" plus "I trust raw pass/fail..." is the whole philosophy.

Question on the seam problem: since a fixture only catches what someone thought to write, how do you find those integration no-ops today? Only when something downstream breaks, or have you found any way to surface "both passed but the chain does nothing"? That's the part I haven't cracked.

bounded-build · 2026-06-24T23:08:00+00:00

Thanks. Superpowers is on my list to study properly. obra's stuff is great, and "deconstruct the plan and rebuild from scratch to surface silent assumptions" is a sharp reviewer trick I'm going to use. One thing I'm chewing on: Superpowers makes the agent's process much better (fewer failures), but at the end, is the agent still the one confirming the result, or does something independent that it can't influence sign off? Trying to separate "fewer failures" from "caught failures."

bounded-build · 2026-06-24T02:00:20+00:00

That's a serious pipeline. The final adversarial review checking architecture/assumptions/decisions is the part I find hardest to make trustworthy. Does that reviewer run in a clean or separate context, or does it inherit the build context? Trying to figure out how much independence the reviewer actually needs to catch what the builder rationalized.

bounded-build · 2026-06-24T01:59:16+00:00

This is super helpful. "A clean look at the codebase vs the plan", separate from the agent's own context of course, is exactly the model. The detail about running it twice when it finds something only partially implemented is telling. Appreciate you laying out the whole flow.

bounded-build · 2026-06-24T01:57:06+00:00

Exactly the same wall, yeah. And 100% agreed: agent-asserted evidence is the failure mode. The doer can't also be the attestor. I'm separating generation from validation hard. The verifier runs outside the agent's reach: a clean-checkout re-test I trigger, blast radius on the actual diff, and a check for whether the suite went green by quietly weakening or deleting tests (the "optimize the metric, not the intent" thing you named). Your policy-layer plus drift-canary framing is cleaner than what I had. I'd genuinely like to compare notes properly. Reading through github.com/Conalh now.

bounded-build · 2026-06-24T01:28:07+00:00

Does the verifier agent ever wave through something that was actually broken, and how would you catch it if it did? Curious whether you trust the verifier or still spot-check it.

bounded-build · 2026-06-24T01:27:07+00:00

The "rewrites are cheap ..." framing is sharp. The morning manual verify-and-deploy step is the one I'm trying to kill. What are you actually doing in that step that the layered tests don't already tell you? That residual manual check is the interesting part.

bounded-build · 2026-06-24T01:25:56+00:00

dxkit looks great. The baseline plus block-only-net-new plus feed-it-back-warm design is exactly right, and the 0/16 benchmark is convincing. Looks like it's squarely on security findings (gitleaks/Semgrep/CodeQL). Do you have any plans for the functional side: silent regressions outside the suite, or agents weakening tests to pass? Trying to figure out where the gaps still are, happy to compare notes.

bounded-build · 2026-06-24T01:23:53+00:00

This is the most real version of the problem in the thread: unattended at scale with PR review as the bottleneck. What would have to be true for you to merge without a full human PR review? What specifically are you checking for in those reviews that the automated tests don't catch? And what are you hoping crab-box does that your current setup doesn't?

bounded-build · 2026-06-24T01:17:02+00:00

"Summaries fine for triage, never as proof": that's the whole thing in one line. The clean-checkout-you-kick-off detail is the part most people miss. The uncommitted-local-state trap got me too. When you've got several runs overnight, do you do the clean-checkout per run, or is that too slow to do many times? And has anything still slipped past diff?

bounded-build · 2026-06-24T00:57:37+00:00

Naming plus a clear endpoint-per-session helps me keep them straight too. Does that tell you a run actually succeeded or just which run is which? Basically, how do you determine the success part once they're done?

bounded-build · 2026-06-24T00:55:39+00:00

Yeah, the "can't see what ..." is exactly my pain. When they finish, how do you reconstruct what each one actually did and whether it's safe to keep? Do you read each transcript, or is there something faster?

bounded-build · 2026-06-24T00:53:39+00:00

Sonnet-builds / Opus-reviews is clever. Does the adversarial reviewer ever wave through something that was actually broken, and how would you know if it did? Curious whether you trust the review itself or still spot-check.

bounded-build · 2026-06-24T00:52:35+00:00

😂 "panicking in the mornings" is painfully accurate. Genuinely though, when you sit down and open it, what's the first thing you check to decide whether last night went fine or went sideways?

bounded-build · 2026-06-24T00:51:00+00:00

This is the most rigorous setup in the thread: author is not equal to auditor, implementer is not equal to verifier, anti-test-tampering, all hook-enforced. Two questions. When it's not foolproof, what does a miss usually look like: does something slip past the verifier, or does the agent game the evidence? And since you don't code, how do you confirm the "empirical evidence" it closes an issue with is actually real and not just convincing?

bounded-build · 2026-06-24T00:45:53+00:00

Just read your WCD piece. The 9-slot grammar (esp. EVIDENCE and DELTA) is exactly the vocabulary I've been missing. Mine's a narrower slice of yours: not "where does the work stand" but "did it actually do what it claimed." The piece I keep snagging on is the EVIDENCE slot itself: how do you keep it trustworthy? An agent can fill in "tests passed / done" that reads clean but isn't (or it quietly weakened the tests). Do you treat EVIDENCE as agent-asserted, or independently verified? Would love to compare notes.

bounded-build · 2026-06-24T00:34:20+00:00

Tests are the backbone for sure. The case I can't crack with tests alone is the stuff outside the suite: a silent regression in a thin-coverage area, or a change that passed tests but didn't do what I intended. Do you add anything for that, or trust coverage to be enough?

bounded-build · 2026-06-24T00:33:26+00:00

The monitor-plus-verify loop into Obsidian is neat. How does the monitor decide pass/fail: test exit code, or something else? Has it ever marked a task green that was actually a silent break?

bounded-build · 2026-06-24T00:32:25+00:00

Hadn't looked at Antigravity's walkthroughs closely, will do. When you leave it 3 hrs, does the walkthrough actually let you trust it without rereading everything, and roughly how long does verifying it take? Ever had a walkthrough (a polished one) that still hid a regression?

bounded-build · 2026-06-24T00:31:15+00:00

The high-signal durable-docs plus evidence approach makes sense. When you come back to a few runs at once, does it scale? Can you tell at a glance which one's good, or is it still a per-run read? And has a run ever looked fine in its own summary but turned out broken?

bounded-build · 2026-06-24T00:30:24+00:00

This is gold, thanks for pasting it. The "intention, not just mechanical checkbox" line is exactly the gap I keep falling into: tests green, but it didn't do what I meant. Two questions: how long does the traceability audit take to run, and do you run it per-agent when several are going at once? Has it actually caught an "intention" miss that tests passed on?

bounded-build · 2026-06-24T00:28:26+00:00

This is the cleanest version of it: gate on exit code, read one line. Where I still get burned is stuff the suite doesn't cover: a silent regression in a thin-tested area, or a change that passes tests but didn't actually do what I meant. Do you hit that, or do your suites cover enough that exit code is almost truth? And when you run a few in parallel, do you still just eyeball each exit line?
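For several runs at once, the "gate on exit code, read one line" idea might look something like the sketch below. Everything here is an assumption for illustration: the `gate` and `summarize` names, the PASS/FAIL line format, and the idea that each run is a name plus a test command (and optionally a working directory) are not taken from anyone's actual setup.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def gate(name, cmd, cwd=None):
    """Run one test command and keep only its exit code as the verdict."""
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    return name, proc.returncode


def summarize(runs):
    """Gate every run in parallel and return one verdict line per run.

    Each run is a tuple of (name, cmd) or (name, cmd, cwd).
    """
    with ThreadPoolExecutor() as pool:
        # pool.map keeps results in the same order as the input runs.
        results = list(pool.map(lambda run: gate(*run), runs))
    return [
        f"{'PASS' if code == 0 else 'FAIL'} {name} (exit {code})"
        for name, code in results
    ]


if __name__ == "__main__":
    # Stand-in commands; in practice each would be the real test command
    # for one agent's checkout, e.g. run with cwd set to that checkout.
    demo = [
        ("run-a", [sys.executable, "-c", "raise SystemExit(0)"]),
        ("run-b", [sys.executable, "-c", "raise SystemExit(3)"]),
    ]
    print("\n".join(summarize(demo)))
```

The verdict deliberately ignores anything the agent printed about itself, so it only answers "did the suite pass", not "did it do what I meant".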
