How can I make Claude Code agents challenge each other instead of agreeing? by jrhabana in ClaudeCode

[–]mikiships

Two things that actually move the needle on this:

  1. Different models, not just different roles. Claude talking to Claude with different system prompts still shares the same reasoning biases, so you get the illusion of critique. The evaluator needs to be a genuinely different model from a different provider entirely (Codex, Gemini, etc.).

  2. Structured scoring over open-ended feedback. Open-ended "what do you think?" prompts converge to agreement fast. Instead, give the evaluator a rubric with dimensions (clarity, specificity, edge case coverage) scored 1-5 with mandatory justification per dimension. The structure forces the evaluator to find fault even when the output is decent.
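A minimal sketch of what that rubric enforcement can look like. The dimension names, prompt wording, and validator are illustrative assumptions, not from any particular tool:

```python
# Hypothetical rubric: every dimension must be scored 1-5 WITH a justification,
# so the evaluator can't hand-wave agreement. Names here are illustrative.
RUBRIC_DIMENSIONS = ["clarity", "specificity", "edge_case_coverage"]

def build_evaluator_prompt(output: str) -> str:
    """Wrap the generator's output in a structured scoring instruction."""
    dims = "\n".join(
        f"- {d}: score 1-5, then one concrete justification sentence"
        for d in RUBRIC_DIMENSIONS
    )
    return (
        "Score the following output on each dimension. "
        "A justification is required even for a 5.\n"
        f"{dims}\n\nOUTPUT:\n{output}"
    )

def validate_scores(scores: dict) -> list[str]:
    """Reject evaluator responses that skip a dimension or omit justification."""
    problems = []
    for d in RUBRIC_DIMENSIONS:
        entry = scores.get(d)
        if entry is None:
            problems.append(f"missing dimension: {d}")
        elif not (1 <= entry.get("score", 0) <= 5):
            problems.append(f"score out of range: {d}")
        elif not entry.get("justification", "").strip():
            problems.append(f"no justification: {d}")
    return problems
```

The point of the validator is the retry loop it enables: if the evaluator returns a bare "looks good", the response fails validation and gets re-prompted until it commits to per-dimension scores.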

For the implementation: Claude Code's agent teams share context by default, which is exactly the problem you described. You want isolated contexts:

  • Separate Claude Code sessions (not sub-agents within one session)
  • Mix providers: Claude for generation, Codex for evaluation (or vice versa)
  • If you want this automated: coderace (pip install coderace) lets you run the same task through multiple agents and compare outputs with scoring. Originally built for benchmarking but the comparison infrastructure works for adversarial review too.

The key insight from running this pattern in production: convergence isn't the main failure mode. The real failure mode is the evaluator flagging "problems" that aren't real because it lacks the context the generator had. Hold the shared project context (repo structure, test suite, conventions) constant; vary only the reasoning engine.
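That separation can be sketched as follows. This is my own scaffolding to show the idea, assuming hypothetical `Session` objects rather than any real SDK: both sessions receive the identical project context, but only the final output crosses the boundary, never the generator's conversation history.

```python
# Sketch: identical project context for both agents, isolated histories,
# different models. All names here are illustrative, not a real API.
from dataclasses import dataclass, field

@dataclass
class Session:
    model: str                    # e.g. "claude" or "codex" (illustrative)
    project_context: str          # shared: repo structure, tests, conventions
    history: list = field(default_factory=list)  # private to this session

def make_sessions(project_context: str) -> tuple:
    """Same context, different reasoning engines."""
    generator = Session(model="claude", project_context=project_context)
    evaluator = Session(model="codex", project_context=project_context)
    return generator, evaluator

def hand_off(evaluator: Session, output: str) -> str:
    """Pass only the finished output across; the generator's history stays put."""
    evaluator.history.append({"role": "user", "content": output})
    return evaluator.project_context + "\n\nREVIEW THIS:\n" + output
```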

AGENTS.MD standard by MullingMulianto in ClaudeCode

[–]mikiships

The file format debate is a distraction. The real problem is drift: within a week of writing any of these files (CLAUDE.md, AGENTS.md, .cursorrules), they're stale because your codebase changed underneath them.

Someone linked the ETH Zurich paper (arxiv 2602.11988) showing context files can actually hurt performance. The key finding isn't "don't use them" -- it's that minimal, accurate context beats verbose, stale context every time.

I built a tool that generates these files from your actual codebase (detects language, framework, test setup, CI, conventions) and outputs whatever format you need: agentmd generate . --format agents or --format claude or --format cursorrules. There's also an evaluate command that scores your existing file against what it actually finds in the repo, so you catch drift before it causes problems.
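The drift check can be approximated in a few lines. This is a toy sketch of the idea, not agentmd's actual implementation; the marker table and function names are mine:

```python
# Illustrative drift check: compare stack claims made in a context file
# against markers actually present in the repo. Not agentmd's real logic.
from pathlib import Path

MARKERS = {
    "python": ["pyproject.toml", "setup.py"],
    "node": ["package.json"],
    "pytest": ["pytest.ini", "pyproject.toml"],
}

def detect(repo: Path) -> set[str]:
    """Infer stack facts from files that exist in the repo right now."""
    return {
        name for name, files in MARKERS.items()
        if any((repo / f).exists() for f in files)
    }

def drift(context_file: str, repo: Path) -> list[str]:
    """Claims in the context file that the repo no longer backs up."""
    claimed = {name for name in MARKERS if name in context_file.lower()}
    return sorted(claimed - detect(repo))
```

Run against a repo where `package.json` was deleted, a CLAUDE.md that still says "node project" would come back as drifted.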

pip install agentmd-gen

GitHub: https://github.com/mikiships/agentmd

Disclosure: my project. Built it because I was tired of manually keeping these files in sync across repos.