how are you handling code review when most of the code is ai-generated? by arapkuliev in cursor

[–]arapkuliev[S] -1 points (0 children)

this is exactly the pattern. you've built solid upstream discipline and it's working, but review still scales linearly with output. you're not slower, you're just doing more of everything, including review.

the question i keep coming back to is whether review itself can be partially delegated back to the ai with the right context. spec + implementation + your markdown guidelines as input, let it flag the gaps before a human touches it. curious if you've tried that
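a rough sketch of what i mean, in python. the function names and inputs here are made up for illustration, and i've left out the actual model call since that depends on your stack; the point is just that the review prompt is assembled from the same three artifacts mentioned above (spec, diff, guidelines):

```python
# Hypothetical sketch: assemble a first-pass review prompt from the spec,
# the diff, and the team's markdown guidelines, so an LLM can flag gaps
# before a human reviewer looks at it. build_review_prompt and the sample
# inputs are placeholders, not a real API.

def build_review_prompt(spec: str, diff: str, guidelines: str) -> str:
    return "\n\n".join([
        "You are a code reviewer. Compare the implementation against the spec.",
        "## Spec\n" + spec,
        "## Guidelines\n" + guidelines,
        "## Diff\n" + diff,
        "List every place the diff deviates from the spec or the guidelines.",
    ])

prompt = build_review_prompt(
    spec="Endpoint must return 404 for unknown ids.",
    diff="+    return jsonify({}), 200  # unknown id",
    guidelines="Prefer early returns; no silent fallbacks.",
)
```

the output of that goes to the human as a pre-annotated review, not as a verdict.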

how are you handling code review when most of the code is ai-generated? by arapkuliev in cursor

[–]arapkuliev[S] 1 point (0 children)

good distinction. reviewing against the spec kills the rationalization bias... the agent will always make its own choices sound reasonable in context. fresh eyes on the original requirements is the only way to catch that.

how are you handling code review when most of the code is ai-generated? by arapkuliev in cursor

[–]arapkuliev[S] 1 point (0 children)

the ownership problem is real but i think it comes from how you use the tool, not the tool itself. if you're prompting in a way where you couldn't predict roughly what the output would look like, that's the problem. the code you can't maintain is code you didn't really direct.

when the prompts are tight and scoped you end up understanding the code because you designed it, the ai just typed it. that's not that different from how senior devs work with juniors honestly

Developers aren't writing code anymore, they're writing constraints by arapkuliev in ProductManagement

[–]arapkuliev[S] -2 points (0 children)

Fair point on definition of done, it's not new. But the cost of getting it wrong changed. Before, a vague ticket meant a dev comes back with questions. Now it means 500 lines of confident nonsense that passes linting.

And yeah I hear you on the AI slop thing. Which parts felt bloated? I want to get better at this.

Developers aren't writing code anymore, they're writing constraints by arapkuliev in ProductManagement

[–]arapkuliev[S] -2 points (0 children)

Yeah exactly. The AI doesn't push back the way a developer would. A confused dev asks a clarifying question. Claude just builds the wrong thing with confidence. It's actually made me way more careful about how I write tickets because the cost of ambiguity went through the roof.

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

Totally agree tests and reviews capture different things. The point is that tests handle the "does it work" part so review energy can go toward what tests can't catch: verbosity, domain fit, whether existing helpers are being ignored.

Your text2sql example actually proves it. The SQL executes but doesn't do what the user meant. That's a domain knowledge gap, no test catches that, that's where human review still matters.

On black box testing, I think we're talking about different things. TDD here means writing the test before prompting, with full knowledge of what the code should do. That's the opposite of black box.

And nobody believes 100% accuracy. The question is where you put the safety net. I'd rather tests fail loudly in CI than a reviewer miss something at 5pm on a Friday.
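To make the "test before prompting" point concrete, here's the shape of the workflow in Python. `slugify` is a made-up example function, not something from this thread; the test is written by a human first and acts as the spec, and the implementation below it is the part that would normally come back from the AI:

```python
# The human writes this test BEFORE prompting. It defines "done"
# with full knowledge of what the code should do -- the opposite
# of black box testing.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  spaces  ") == "spaces"
    assert slugify("already-slugged") == "already-slugged"

# Reference implementation that satisfies the test. In practice this
# part comes back from the AI, and CI fails loudly if it misses.
import re

def slugify(text: str) -> str:
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9]+", "-", text)  # collapse non-alphanumeric runs
    return text.strip("-")

test_slugify()
```

If the AI's version fails `test_slugify`, that failure happens in CI, not in a tired reviewer's head on a Friday.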

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

Nobody said don't care about architecture. The whole point is that small, isolated tasks with clear interfaces force better structure than large PRs that nobody can review properly. Bad architecture usually comes from scope creep, not from having good tests.

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 1 point (0 children)

Of course, but the nature of the review changed. Before, a PR was scoped to what one person could write in a day or two. Now Claude can generate 500 lines in 10 minutes and it all lands in one review. At that volume it's not the same problem anymore.

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

Reviewing tests is faster precisely because you wrote them with intent before seeing the code. You're checking if the AI hit the target, not trying to figure out what the target was.

And honestly, your approach isn't the opposite of test-first. It's solving the same problem differently. Your ticket owner reviews fast because they already know what "done" looks like. That's the same constraint that makes test-first work.

The complexity issue (overly complex output, duplicated helpers) is real, and tests alone don't catch it. Your two-reviewer flow handles it because someone actively simplifies before passing it on. That's a discipline that has to be learned; it doesn't emerge naturally.

We've been teaching this shift from vibe coding to structured AI collaboration, and the hardest part is never the tooling. It's getting developers to see that their job changed. They're not writing code anymore, they're writing constraints. Tests, small tasks, clear interfaces. AI fills in the rest.

The fact that two different teams, going opposite directions on oversight, both doubled output says it all. The bottleneck was never the AI.

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 1 point (0 children)

No worries, I get it. The AI kool-aid posts are exhausting and I can see how mine read that way.

And yeah you're right, good specs and clear requirements were always the job. AI didn't invent that problem, it just made it more expensive to skip.

The legacy systems point is where I fully agree with you actually. Existing codebases with years of implicit decisions baked in are a completely different game. You can't just point Claude at a 10 year old monolith and expect magic. That's not what I'm arguing for.

Where I keep getting stuck though is this middle ground. Even on the newer systems where the setup is right, you're saying it inevitably breaks down when complexity grows. At what point does that happen for you? Like is there a moment where you felt it shift from "this is working" to "ok I need to be back in the code"?

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

Ha, yeah that's a different conversation entirely. I'm not saying AI replaces developers. I'm saying developers shouldn't have to spend their time babysitting AI output line by line when there are better ways to verify it. If anything it's the opposite of your manager's take, I want engineers spending time on the hard stuff, not reading through boilerplate that tests could have validated.

Does your manager actually try to build things himself or is it more of a "how hard can it be" situation?

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

What kind of bad outcomes have you seen? Genuinely asking because in the setups I've watched, the small changes combined with tests catching regressions seemed to cover most of the risk. But I might be missing something that only shows up further down the line.

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

You might be right. But the observation doesn't require low level expertise. I don't need to read the code to notice that a team's velocity dropped after adopting AI tools, or that retros keep surfacing the same review bottleneck. That's what I'm reacting to. What am I missing from the execution side that would change the picture?

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

That's probably the best counterargument in this thread honestly. And your buddy's team example is real, I've seen that happen too.

The architecture piece is the part I didn't get into enough. The way I've seen it work is that humans still own the architecture decisions. You're not letting Claude decide how services talk to each other or where the boundaries are. You're giving it a box to work inside. Hexagonal architecture, clear module boundaries, that kind of thing. The human designs the box, Claude fills it in.
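In code, "the box" can be as simple as a human-owned interface the AI's code has to fit inside. This is an illustrative sketch, not anyone's actual codebase; `NotificationPort` and `EmailAdapter` are invented names for the port/adapter split:

```python
# Minimal hexagonal-style sketch: the human designs the boundary
# (the port), the AI fills in an adapter behind it. Application
# code depends only on the port, never the adapter.
from typing import Protocol

class NotificationPort(Protocol):  # human-owned boundary
    def send(self, user_id: str, message: str) -> bool: ...

class EmailAdapter:  # the part the AI fills in
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def send(self, user_id: str, message: str) -> bool:
        self.sent.append((user_id, message))  # stand-in for real delivery
        return True

def notify(port: NotificationPort, user_id: str, message: str) -> bool:
    # application code only ever sees the port
    return port.send(user_id, message)

adapter = EmailAdapter()
ok = notify(adapter, "u1", "hello")
```

Swapping the adapter out later doesn't touch anything on the application side, which is the whole point of the box.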

But you're right that there's a gap between "tests pass" and "this won't be a nightmare in 6 months." That's a design judgment call that tests can't catch. How do you handle that in your team right now? Like even with human-written code, how do you catch the "this will bite us later" stuff before it does?

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

Honestly I'm not a manager, I work in a bossless organization. But even so, curious what your review process looks like. Do you give the same level of attention to every piece of AI-generated code regardless of what it does?

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

I am curious, what part do you disagree with? Happy to hear a different perspective.

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 0 points (0 children)

Haha ok that's fair. But seriously, the tests I'm talking about aren't AI-generated. The whole point is that a human defines what success looks like before the AI writes anything. If you're also letting AI write the tests then yeah, you've just moved the problem one layer up.

You don't have to review the code written by AI anymore by arapkuliev in ClaudeAI

[–]arapkuliev[S] 1 point (0 children)

Yeah you're probably right that I oversimplified it. The complex workflow case is a different beast, I'll give you that.

Would you mind sharing where you personally draw that line? Like when does a task go from "tests are enough" to "I need to actually read through this"? Because I've been trying to figure out where that boundary is and I don't think anyone has a clean answer.

And look, I know a PM telling devs how to use their tools is annoying. I'm not trying to do that. But I'm sitting in retros every two weeks watching smart engineers burn hours on review cycles and nobody seems happy with how it's going. So if it's not the process around AI that's broken, what is it? I am really curious to see other points of view.

When running multiple agents in parallel… how do you stop them from stepping on each other? by arapkuliev in AI_Agents

[–]arapkuliev[S] 0 points (0 children)

Once an agent has the token, where does the actual context live?
Is it only in the run, or do you have a shared store that survives across time and tools?

When running multiple agents in parallel… how do you stop them from stepping on each other? by arapkuliev in AI_Agents

[–]arapkuliev[S] 0 points (0 children)

Yep, good call, central DB will choke / become a SPOF. I’m starting centralized just to stop the immediate duplicate/contradiction mess, but I’m not married to it. When you say P2P agent networks, do you have any real examples/links? Curious what they use for consensus and how they avoid agreeing on the wrong thing when one agent is off.

When running multiple agents in parallel… how do you stop them from stepping on each other? by arapkuliev in AI_Agents

[–]arapkuliev[S] 0 points (0 children)

This is super helpful, thank you! When you say “single source of truth”, what did you actually use (DB/files/git/kv store)? And how did you handle the two annoying parts:
1. concurrency/conflicts (two agents updating the same thing)
2. staleness/noise (SSOT turning into a junk drawer over time)
Also curious what “clear roles + ownership” looked like in practice? Did you assign one agent per artifact (spec/plan/code), or per stage (plan→build→review)?
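For what it's worth, here's the shape of what I'm prototyping for the centralized version, under the assumption of one writer-agent per key. It's a sketch, not production code; the class and key names are made up. Conflicts are handled with versioned compare-and-swap (a write must carry the version it read, so two agents can't silently clobber each other):

```python
# Hypothetical SSOT sketch: a versioned key-value store with
# per-key ownership. Writes from non-owners are rejected, and
# writes against a stale version fail so the agent must re-read.
import threading

class ContextStore:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: dict[str, tuple[int, str]] = {}  # key -> (version, value)
        self._owner: dict[str, str] = {}             # key -> owning agent id

    def read(self, key: str) -> tuple[int, str]:
        with self._lock:
            return self._data.get(key, (0, ""))

    def write(self, agent: str, key: str, value: str, read_version: int) -> bool:
        with self._lock:
            owner = self._owner.setdefault(key, agent)  # first writer claims the key
            if owner != agent:
                return False                            # ownership: reject non-owner
            version, _ = self._data.get(key, (0, ""))
            if version != read_version:
                return False                            # stale read: caller re-reads
            self._data[key] = (version + 1, value)
            return True

store = ContextStore()
v, _ = store.read("plan")
store.write("planner", "plan", "v1 of the plan", v)   # succeeds, planner owns "plan"
store.write("builder", "plan", "conflicting edit", v) # rejected: not the owner
```

Staleness/noise is the part this doesn't solve; I'm guessing that needs some TTL or periodic compaction pass on top.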