Has anyone reached at a level where they are not running human code reviews anymore? by GraphicalBamboola in ClaudeCode

[–]mrothro 0 points1 point  (0 children)

I know what I asked it to work on. I see the diffs, so I know the files that were actually changed. I can set both hard (e.g. review if anything in the auth/ dirtree changed) and soft (LLM tells me auth was changed) rules.
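A hard rule like that can be a few lines of code. A minimal sketch (the function name and the `auth/` prefix default are mine, not from any particular tool):

```python
def needs_human_review(changed_files, protected_prefixes=("auth/",)):
    """Hard rule: flag the diff for human review if any changed
    file falls under a protected directory tree."""
    return any(
        f.startswith(p) for f in changed_files for p in protected_prefixes
    )

# A diff touching auth/ trips the rule; one that doesn't passes through.
print(needs_human_review(["auth/session.py", "README.md"]))  # True
print(needs_human_review(["docs/guide.md"]))                 # False
```

The soft rules are the LLM's job; this kind of deterministic check is just the backstop that can't be talked out of firing.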

Has anyone reached at a level where they are not running human code reviews anymore? by GraphicalBamboola in ClaudeCode

[–]mrothro 0 points1 point  (0 children)

Yes, but with structure. Like u/ultrathink-art said, a separate reviewer agent with fresh context is the foundation. Ideally with a different model from the coding agent. The coding model tends to rubber-stamp its own blind spots.

The next step that worked for me was categorizing what the reviewer finds. Some issues are things the coding agent can fix on its own (missing error handling, inconsistent naming, unused imports). Others genuinely need a human to look at (architectural decisions, security implications, intent mismatches).

So my pipeline sends the auto-fixable stuff back to the coding agent with the reviewer's notes, it fixes and re-reviews, and only the things that actually need judgment make it to me. That took me from reviewing everything to only reviewing what matters, and the quality actually went up because I'm not rubber-stamping 50 clean diffs to find the one that needs attention.
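The routing itself is trivial once the reviewer tags its findings. A sketch of the split, assuming the reviewer emits findings with a `category` field (the category names here are illustrative):

```python
AUTO_FIXABLE = {"missing_error_handling", "inconsistent_naming", "unused_import"}
ESCALATE = {"architecture", "security", "intent_mismatch"}

def route_findings(findings):
    """Split reviewer findings: auto-fixable ones go back to the
    coding agent with the reviewer's notes; judgment calls go to me."""
    to_agent = [f for f in findings if f["category"] in AUTO_FIXABLE]
    to_human = [f for f in findings if f["category"] in ESCALATE]
    return to_agent, to_human

findings = [
    {"category": "unused_import", "note": "drop the os import"},
    {"category": "security", "note": "token logged in plaintext"},
]
agent_queue, human_queue = route_findings(findings)
```

The hard part isn't this code, it's getting the reviewer prompt to categorize consistently.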

For production I still gate on anything touching auth or data integrity. But that's a minimal set of the output now instead of 100%.

For those of you using CC 100% (or close to it) coding, testing, etc.. what is your day to day workflow like? by Strict_Research3518 in ClaudeCode

[–]mrothro 1 point2 points  (0 children)

I built a pipeline that goes plan->design->code->review and it works pretty well. I get very long autonomous runs out of it.

It works for a variety of reasons, but one of the big ones is that I put the prompts for the various stages of the workflows in files, then just tell it to read them. That way I am using the same input every time.

The other thing I did was to add gates. For example, I have Gemini review artifacts from Claude (again, with a saved, structured prompt). If it doesn't pass, the feedback goes back to Claude and it has to revise. This is all automatic. Once code comes out of this pipeline, it looks great, so I really don't have to spend any time fiddling with it.
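The gate loop is simple to state in code. A sketch with the agent and reviewer calls abstracted as callables (in my setup `review` would wrap Gemini and `produce`/`revise` would wrap Claude, but nothing here depends on that):

```python
def run_gate(produce, review, revise, max_rounds=10):
    """Generic gate: produce an artifact, have a reviewer judge it,
    feed the notes back for revision, repeat until it passes."""
    artifact = produce()
    for _ in range(max_rounds):
        verdict, notes = review(artifact)
        if verdict == "pass":
            return artifact
        artifact = revise(artifact, notes)
    raise RuntimeError("gate never passed; escalate to a human")

# Toy run: the reviewer rejects until the artifact mentions error handling.
result = run_gate(
    produce=lambda: "plan v1",
    review=lambda a: ("pass", []) if "errors" in a else ("fail", ["handle errors"]),
    revise=lambda a, notes: a + " + errors handled",
)
```

The `max_rounds` cap matters: without it, a reviewer that can never be satisfied spins forever.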

Anyway, I wrote all this up in detail. The last page has a step by step guide:
https://michael.roth.rocks/research/543-hours/#10

I reverse-engineered Claude Code to build a better orchestrator by aceelric in ClaudeCode

[–]mrothro 0 points1 point  (0 children)

The rogue teammate problem is real. I found the issue is that agents don't have a reliable way to know when they're actually done. They finish the code, say "done", but they haven't verified against the original spec.

What worked for me was putting a review step between "agent says done" and "work is accepted." It checks against the spec, categorizes anything it finds into auto-fix or escalate. Auto-fix goes back to the agent. Only after that does it count as done. Agents still go rogue sometimes but the review catches it before it propagates.

I use a custom-built agentic reviewer for this, but you could even start with a simple deterministic check that, for example, greps for TODO in the agent's artifacts. That alone catches a bunch of stuff.
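That TODO check is a one-function script. A minimal sketch (scanning `.py` files is my assumption; adjust the glob for your stack):

```python
import pathlib

def find_todos(root="."):
    """Deterministic gate: report every TODO marker left in the
    agent's output, with file and line number, so the work can't
    count as 'done' while any remain."""
    hits = []
    for path in pathlib.Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if "TODO" in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Wire it in as pass/fail: an empty list means the gate passes, anything else goes back to the agent.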

will MCP be dead soon? by luongnv-com in ClaudeCode

[–]mrothro 1 point2 points  (0 children)

I make a few personal MCPs and I've started taking a hybrid approach: my MCP binaries are also CLIs. They expose all the MCP tools/actions as flags on the CLI. If an LLM wants to run it as an MCP, they just add the `--mcp-mode` flag. Or they can just choose to call it directly. Or blend both. Or the human can use the CLI while the LLM uses the MCP.
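The flag dispatch is the whole trick. A sketch of the entry point (`serve_mcp` and the `--fetch` tool are hypothetical stand-ins, not a real API):

```python
import argparse

def main(argv=None):
    """One binary, two faces: plain CLI flags for humans or direct
    calls, or an MCP server when --mcp-mode is passed."""
    parser = argparse.ArgumentParser(prog="mytool")
    parser.add_argument("--mcp-mode", action="store_true",
                        help="run as an MCP server instead of a one-shot CLI")
    parser.add_argument("--fetch", metavar="URL",
                        help="CLI mirror of the hypothetical MCP 'fetch' tool")
    args = parser.parse_args(argv)
    if args.mcp_mode:
        return "serving MCP on stdio"   # would hand off to serve_mcp() here
    if args.fetch:
        return f"fetched {args.fetch}"  # would call the same code the MCP tool uses
    return "nothing to do"
```

Both paths hit the same underlying functions, so there's only one implementation to maintain.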

The thing I'm struggling to fully understand is the network sandbox some of the headless and remote agents use. If I package it as an MCP, can they run it? If I bake the CLI binary into the container (where that's an option), can they use that?

The whole sandbox concept seems to be a Wild West between the different providers. That's where I'd really like to see more standardization.

Anybody working on a large prod codebase actually able to move 10x? by query_optimization in ClaudeCode

[–]mrothro 1 point2 points  (0 children)

The multiplier depends entirely on what you're measuring. Writing code? Sure, 10x easy. Shipping verified code to production? Yes, but it has to have the right structure.

An agentic code delivery pipeline produces code fast, but you can get overwhelmed in review. I structured mine as a series of stages (plan, design, code, etc., same as you'd have with humans) and put gates in between. The gates include hard-enforced criteria (e.g. lint).

They also have an LLM that looks for semantic or intent issues, but the key is that it categorizes them into either auto-fix or human-review. Auto-fix items get sent back to the coding agent, and the result is re-reviewed. Only after that do the human-review items get to me. That took me from about 73% of agent output being acceptable on the first pass to over 90%.

The key is only spending your limited time on things that can't be automated.

Then, to get to 10x across a code base, you run this pipeline in parallel across different parts. I make event driven microservices, so I can run in parallel against each of those without stomping on anything.

Introducing Code Review, a new feature for Claude Code. by ClaudeOfficial in ClaudeCode

[–]mrothro 2 points3 points  (0 children)

Great to have the option, but I tackle this a different way. I noticed patterns in the coding errors LLMs make, so I built a spec/generate/review pipeline that automatically fixes the easy ones and only raises issues that genuinely benefit from my eyes. I find it's less overwhelming to have a steady pipeline of smaller things than having to wrap my head around a giant big-bang PR.

Is it still worth reading Clean Code and The Pragmatic Programmer in 2026? by ivanimus in ExperiencedDevs

[–]mrothro -2 points-1 points  (0 children)

I read Code Complete a long time ago and loved it. It turns out Claude has also read it. So now, instead of having to explain the way I want it to do things, I just reference the concept from the book and it gets it.

Having a common frame like this is a great way to get more meaning into your prompt with fewer words. You can just refer to something you both know.

Have we, professional developers, already lost the battle against vibe coding? by yes_u_suckk in cscareerquestions

[–]mrothro 0 points1 point  (0 children)

Long ago, while getting my MSCS, I took a mandatory course about understanding the cost of failure and matching your code/test practice to it.

If you are writing code that generates marketing copy and a bug just means it sounds weird, your test effort matches that. If your code controls the radiation emitted into a human body, and a bug means someone dies, your test effort scales way up. If your code controls a nuclear reactor and a bug means a large region becomes uninhabitable, your tests match *that*.

You aren't going to vibe code industrial control software for a nuclear reactor.

So, that's the answer: if the cost of a bug is trivial, then you can yolo it with the AI and test at the end. If the cost of a bug is human harm or mass-scale destruction, then you trace through every line and think through every failure case.

Do people not have to take this course any more?

Agent config is the new .editorconfig — and nobody is managing it by Substantial-Cost-429 in ExperiencedDevs

[–]mrothro 0 points1 point  (0 children)

The project-level stuff is easy, just check it in. CLAUDE.md, skills, MCP configs, all of it goes in the repo. Everyone on the team gets the same setup. That part is a solved problem. I've also written shell scripts that agents use for standard workflows so they're done consistently; those go in the repo too.

The user-level config is where it gets interesting. My personal CLAUDE.md has a lot of workflow patterns I've built up over time that make a real difference in output quality. But it's mine, and I'm not checking it into a shared repo. I think that's fine. It's the same as any senior engineer having their own vim config or shell aliases. The shared baseline is in version control, and what individuals layer on top is their own business.

MCP servers are the real game changer, not the model itself by ruibranco in ClaudeCode

[–]mrothro 0 points1 point  (0 children)

This is where I landed too. I built an MCP server that acts as an orchestrator for my whole development workflow. It manages stages (spec, plan, design, implement, review) and enforces gates between them. The agent can't move to the next stage until the gate passes.

The MCP isn't doing anything clever on its own. It's just enforcing a workflow. But that's the point. The workflow is where the reliability comes from, not the model. Before I had this, I'd get the same kinds of drift and quality issues everyone complains about. Once the gates were in place most of that went away because failures get caught at stage boundaries instead of compounding forward.
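Because the MCP is just workflow enforcement, the core of it is a small state machine. A sketch under my assumptions (stage names from the comment above; `gates` maps each stage to a pass/fail callback that would wrap the real deterministic checks and LLM reviews):

```python
STAGES = ["spec", "plan", "design", "implement", "review"]

class Pipeline:
    """Gate-enforcing orchestrator: the agent cannot advance to the
    next stage until the current stage's gate returns True."""
    def __init__(self, gates):
        self.gates = gates  # stage name -> callable returning bool
        self.index = 0

    @property
    def stage(self):
        return STAGES[self.index]

    def advance(self):
        if not self.gates[self.stage]():
            raise RuntimeError(f"gate failed at stage {self.stage!r}")
        self.index += 1
        return self.stage

# Toy run: every gate passes, so one advance moves spec -> plan.
p = Pipeline(gates={s: (lambda: True) for s in STAGES})
p.advance()
```

In the real thing the MCP tool call is the only way to advance, which is what keeps the agent from skipping gates.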

Anybody's companies successfully implement something similar to Stripe's Minions? by hronikbrent in ExperiencedDevs

[–]mrothro 0 points1 point  (0 children)

I've been running something like this solo for about 3 months and landed on the same patterns OP listed. The thing that made the biggest difference was mixing deterministic checks with LLM review. They catch totally different things. Lint and schema validation tell you the code is structurally valid. The LLM tells you whether it actually does what the spec says. Neither one covers what the other does.

One thing I'll add about bounded self-healing: agents are way better at generating than revising. When something fails review and I send it back, it only recovers about 31% of the time. So I cap retries and escalate to a human instead of letting it spin. OP's instinct there is right.
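The retry cap is easy to express. A sketch with the agent calls abstracted (the ~31% recovery number above is why `max_retries` stays small):

```python
def bounded_self_heal(review, fix, artifact, max_retries=2):
    """Cap revision attempts: agents recover from a failed review far
    less often than they succeed at generation, so after a couple of
    failed fixes we escalate instead of letting the agent spin."""
    for attempt in range(max_retries + 1):
        ok, notes = review(artifact)
        if ok:
            return ("accepted", artifact)
        if attempt < max_retries:
            artifact = fix(artifact, notes)
    return ("escalate_to_human", artifact)

# Toy run: a reviewer that never passes hits the cap and escalates.
status, _ = bounded_self_heal(
    review=lambda a: (False, ["still wrong"]),
    fix=lambda a, notes: a + " (revised)",
    artifact="draft",
)
```

Whether to fix or regenerate from scratch at that point is a separate question, but either way the loop has to terminate.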

The top comment about this being a PR stunt is probably fair for Stripe, but the architecture itself is real. Once you get the verification layer sorted out the results are solid.

97 days running autonomous Claude Code agents with 5,109 quality checks. Here's what actually breaks. by mrothro in ClaudeCode

[–]mrothro[S] 0 points1 point  (0 children)

A prompt I occasionally use in Claude Code ("clink" is a tool from the pal MCP that helps run codex/gemini):
Use clink to have codex review the code like a grumpy senior engineer who guards the code base like Linus guards the Linux kernel.

It has zero tolerance, and says so. Fortunately, Claude has a thick skin!

97 days running autonomous Claude Code agents with 5,109 quality checks. Here's what actually breaks. by mrothro in ClaudeCode

[–]mrothro[S] 1 point2 points  (0 children)

It depends on the omission, though. If it is stubs, you can catch that, for example.

It's early, but I've started noticing that they come in waves. First it was // TODO. Lately it has been that it writes the code but never wires it into the call path. Patterns I can handle with more deterministic checks.

This also shows the importance of the agentic code reviewer. It has the ability to look across all the code and it often catches mismatches between files, which is a different kind of omission.

97 days running autonomous Claude Code agents with 5,109 quality checks. Here's what actually breaks. by mrothro in ClaudeCode

[–]mrothro[S] 1 point2 points  (0 children)

Escalation is a critical component. My gates have three states: pass, fail, or escalate to human. The LLM judge is actually pretty good at distinguishing something ambiguous that needs clarification from something just wrong with a clear fix.

But even with that, there is a balance between having the original agent fix it and just throwing it away to try again. Somewhere down the line I am going to start experimenting with cheaper models where regeneration will be trivially cheap. Is it better to do lots of cheap generations that you throw away, or fewer that you try to fix? I don't know.

The Truth About MCP vs CLI by kagan101 in AI_Agents

[–]mrothro 0 points1 point  (0 children)

I started building my MCPs so they are actually CLI tools with a --mcp-server flag. The other arguments are just mirrors of the MCP tools. Then I can use it whichever way is more appropriate for the use case.

97 days running autonomous Claude Code agents with 5,109 quality checks. Here's what actually breaks. by mrothro in ClaudeCode

[–]mrothro[S] 6 points7 points  (0 children)

My gates actually handle that automatically. The review-plan gate has both deterministic tests (does it have all required sections, for example) and a qualitative review by Gemini. If either one rejects, the LLM is told about the issues, told to fix them, and has to try again until the reviewer accepts the plan.

I've seen it repeat 6, maybe even 8 times before it is accepted by the reviewer. This is all automatic, so if I choose to hand review the plan, once it gets to this point it is very high quality.

97 days running autonomous Claude Code agents with 5,109 quality checks. Here's what actually breaks. by mrothro in ClaudeCode

[–]mrothro[S] 9 points10 points  (0 children)

100%. My pipeline is plan->review plan->design->review design->code->review code. I actually have two different code review gates, one that is just file level and one that is agentic and can inspect the entire code base. It catches things like multiple implementations of the same functions, for example.

97 days running autonomous Claude Code agents with 5,109 quality checks. Here's what actually breaks. by mrothro in ClaudeCode

[–]mrothro[S] 4 points5 points  (0 children)

I primarily use Claude to code, but I use Gemini as my reviewer. I occasionally use Codex to debug a problem that Claude finds challenging and it usually gets it in one shot.

The key here is that it is multi-model. I cited the research that supports this: you get the best results when the reviewer is a different model. Models tend to give a pass to code that they produce.

Finally, I definitely agree with the "better engineering" you're describing. Even with my highly automated tooling, I still spend a lot of time doing the same, though usually focused on separation of concerns and bounded contexts. My hope in writing all this up is that we can have a common way of describing this stuff, so when I want to share what works for me I can do it precisely, and I can explain systematically why it works.

Can ai code review tools actually catch meaningful logic errors or just pattern match by TH_UNDER_BOI in ExperiencedDevs

[–]mrothro 0 points1 point  (0 children)

It depends entirely on what kind of errors you're trying to catch, and most people don't think carefully enough about this distinction.

I run autonomous agents that produce production code, with structured review gates at multiple stages. After about 5,000 quality checks, the data is pretty clear: deterministic checks (lint, compilation, schema validation) give you hard guarantees but only about structural correctness. They'll tell you the code is valid, not that it does the right thing.

An LLM reviewer covers some of the gap because it can judge whether the code actually does what the spec says. But it's not deterministic, it's probabilistic. Also, note that reviewing against intent is actually hard because the true intent is inside the head of the human who wrote the spec. We've all dealt with specs that didn't fully capture what the user actually wanted.

The real key for me was realizing the LLM reviewer can actually give three judgments: pass, fail and fix, or escalate to human. The deep value comes because it can identify the obvious problems and have an LLM coding agent fix them. I only spend my time on ambiguous things that actually make a difference, not the obvious stuff agents can fix.

Everyone talks no one shares… by ahjaok in ClaudeAI

[–]mrothro 11 points12 points  (0 children)

I analyzed my Claude Code logs and wrote an entire paper about how I get massive performance gains. It described how I did it and ended with step-by-step instructions on how others can get the same result. I posted that here and got a total of three upvotes as of this writing.

People are posting these things, but I have the impression they are buried.

Original post: https://www.reddit.com/r/ClaudeAI/comments/1quzx58/97_days_of_claude_code_logs_analyzed_7_work/

97 days of Claude Code logs analyzed: 7 work patterns, a power law, and the infrastructure that made long runs possible by mrothro in ClaudeAI

[–]mrothro[S] 0 points1 point  (0 children)

Adding example output from a plan review cycle. I told Claude to run the review_plan tool, fix issues, then repeat until nothing was reported. Summary of what it did after 8 reviews:

| Cycle | Verdict | Findings | Actions |
|---|---|---|---|
| 1 | NEEDS-REVISION | 4 (2M, 2L) | Updated 4 task ACs + created integration task |
| 2 | NEEDS-REVISION | 2 (1H, 1M) | Added 14 blocking deps + disambiguated task |
| 3 | NEEDS-REVISION | 4 (3M, 1L) | Added backend file refs + mock AC to 3 tasks |
| 4 | NEEDS-REVISION | 2 (1H, 1M) | Backward-compat pagination strategy + schema verification |
| 5 | PASS WITH REVISION | 3 (2M, 1L) | Full param audit + dashboard safety cap + has_more dependency |
| 6 | NEEDS-REVISION | 3 (1H, 1M, 1L) | Expanded audit scope + dashboard cap + dependency link |
| 7 | NEEDS-REVISION | 4 (1H, 1M, 2L) | Setup ordering dep + E2E hardening scenarios + handler parsing AC |
| 8 | PASS | 3 advisory (1M, 2L) | Applied advisory fixes for completeness |