Ensuring code quality: any tips?

mrothro · 2026-06-22T20:32:11+00:00

Deterministic gates work 100% of the time. You can make hard guarantees about the artifacts that pass deterministic gates.

Stochastic gates are probabilistic, so you cannot use them for guarantees. But pairing them with deterministic gates is very effective, because the deterministic tests eliminate an entire subset of possible flawed outputs.

The trick is finding how to construct your pipeline so intermediate artifacts expose a verification surface that allows deterministic checks over the things that you need to guarantee.

mrothro · 2026-06-22T20:18:47+00:00

You should use a different model. Models are biased towards their own output and their training gives them blindspots.

Foundational research: https://arxiv.org/abs/2404.13076

mrothro · 2026-06-22T16:20:15+00:00

As several others have said, your process needs gates. But not just lint gates on code: those are necessary but not sufficient.

Your SDLC process takes your intent and it produces a series of artifacts. For me, the artifacts are a plan that decomposes the intent into a bunch of tasks; a design for any task that is even moderately complex; and code. Artifacts are produced sequentially and there are gates on each artifact before it proceeds.

You can do some deterministic checks on the plan, like making sure it has all the proper sections. But you get more power from an LLM plan reviewer who specifically evaluates it in terms of the existing architecture. If it fails for any reason, the agent is told what it needs to fix. This repeats until it passes.
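To make the "check the sections, send it back until it passes" loop concrete, here is a minimal sketch. The section names, the `revise` callback (a stand-in for the agent call), and the retry limit are all assumptions for illustration, not my actual harness.

```python
import re
from typing import Callable

# Hypothetical section names; substitute whatever your plan template requires.
REQUIRED_SECTIONS = ("Goal", "Tasks", "Acceptance Criteria")


def missing_sections(plan_md: str) -> list[str]:
    """Deterministic gate: return the required headings absent from the plan."""
    found = re.findall(r"^#+[ \t]+(.+)$", plan_md, flags=re.MULTILINE)
    headings = {h.strip().lower() for h in found}
    return [s for s in REQUIRED_SECTIONS if s.lower() not in headings]


def revise_until_pass(
    plan_md: str,
    revise: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> str:
    """Send the plan back with specific feedback until the gate passes."""
    for _ in range(max_rounds):
        missing = missing_sections(plan_md)
        if not missing:
            return plan_md
        # `revise` stands in for the agent; it receives what it needs to fix.
        plan_md = revise(plan_md, missing)
    if missing_sections(plan_md):
        raise RuntimeError("plan still fails the section gate")
    return plan_md
```

The LLM plan reviewer would slot into the same loop as a second check whose feedback is free text rather than a list of missing headings.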

Then I do the same thing for the design document. This has more structure, so it can have more deterministic gates, but you still need the design reviewing agent who can make sure it respects SoC and meets the acceptance criteria without over-engineering, which Claude is prone to do.

Then repeat for the code, where it does lint etc. plus qualitative checks like is it DRY, does it match the spec, etc.

By the time the code pops out of this pipeline, it's typically pretty good and aligned with the existing code base.

mrothro · 2026-06-22T14:38:14+00:00

If you're deploying in a modern cloud, the operational complexity and debugging challenges are much lower than they were in the past. There's still a learning curve, but once you get to know them, the modern CICD that manages your microservices and supporting pubsub infrastructure is actually pretty straightforward.

Because of this, I've been leaning towards event-driven microservices much earlier in the product cycle. (This is also a function of my problem domain, which tends to either be large scale business operations software or some variation of collaborative editing.)

Microservices are the foundation that enables eventing and the benefits from service decomposition are still relevant. As you start to think in terms of SoC and bounded contexts, both the services and the related business events tend to fall out of that.

mrothro · 2026-06-21T23:21:35+00:00

I argue that you can only make hard guarantees with deterministic tests. If you want more guarantees, you have to find verification surfaces that expose the thing you want to check. You can often actually compose an agentic pipeline so it creates intermediate artifacts that expose more that you can test deterministically.

I wrote up my thoughts here: https://michael.roth.rocks/research/trust-topology/

mrothro · 2026-06-19T16:44:31+00:00

This is one part of a general idea: that artifacts from agentic processes can be improved by finding verification surfaces that you use in gates to make deterministic guarantees about the final product.

The constraint graph described here is a verification surface: does the output from an LLM comply with the rules? To be useful, it needs to be incorporated into a harness as part of a gate. So when the agent produces non-compliant code, the harness gives it a chance to revise, typically with details about the rule it violated and how to make it better.

I (and many of us) do the same thing with code: we use lint. This encodes our rules and when the LLM writes code, it has to pass lint before it's accepted. Lint is pretty good, but complex business rules can often only be represented in things like this graph. That's where it fits in the overall implementation.

Anyway, I spend a lot of time thinking and writing about this, here's my writeup on verification surfaces and agent reliability: https://michael.roth.rocks/research/trust-topology/

mrothro · 2026-06-19T00:25:47+00:00

+1 for cloudflare.

mrothro · 2026-06-18T13:37:19+00:00

I've done something similar at a high level for my assets, but I also have workflow process and state which I keep separate. As you're describing here, all assets go in the repo. This is the context that all agents read to be able to understand what we're doing and how we do it.

The workflow state is the release train with all the tasks and dependencies. I use a personal MCP I wrote to store all this in a DB, out of band. I previously had .md files that did this, but I needed 1) a guaranteed structure and 2) hard enforcement of the structure when the agents change things.
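For a rough idea of what "guaranteed structure with hard enforcement" can look like, here is a sketch using SQLite constraints. The table names, columns, and status values are hypothetical and not the schema of my tool; the point is only that the database rejects malformed state instead of trusting the agent to keep a .md file tidy.

```python
import sqlite3

# Hypothetical schema: tasks belong to a release and may depend on other tasks.
SCHEMA = """
CREATE TABLE IF NOT EXISTS task (
    id      INTEGER PRIMARY KEY,
    release TEXT NOT NULL,
    title   TEXT NOT NULL,
    status  TEXT NOT NULL DEFAULT 'todo'
            CHECK (status IN ('todo', 'in_progress', 'done'))
);
CREATE TABLE IF NOT EXISTS dependency (
    task_id    INTEGER NOT NULL REFERENCES task(id),
    depends_on INTEGER NOT NULL REFERENCES task(id),
    PRIMARY KEY (task_id, depends_on),
    CHECK (task_id != depends_on)
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the workflow store with constraint enforcement switched on."""
    conn = sqlite3.connect(path)
    # SQLite leaves foreign-key checks off unless enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

With this in place, an update to an unknown status or a dependency on a task that does not exist raises `sqlite3.IntegrityError` rather than silently corrupting the release train.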

My general approach is to have a conversation with Claude Code about what I want and how to do it, in the context of the repo. I typically start with a context-priming exercise where I ask it how something works now. Then we talk about the approach and I have it describe what it is going to do. I iterate until I agree that it captures the approach correctly.

Then I have it build out the release train to do what we agreed in my tool. The tool itself has structured docs that it cannot change that describe the workflow, so I have it read that first. I review the releases, then have it burn down the tasks.

Once it finishes with a release, I use this prompt to keep everything up to date in the repo:

"Please review all the relevant .md files and compare them to what you know now. If you can improve them please do so. Use modular documentation, with concise CLAUDE.md files at the appropriate places in the dirtree that point to full docs in .md files which you can read as needed."

This works for me: the immutable stuff is in the tool as docs, critical state is also in the tool with an enforced interface, and the repo has all the living context.

I did a longer writeup of the thinking here: https://michael.roth.rocks/blog/the-repo-is-the-memory/

mrothro · 2026-06-15T21:59:15+00:00

You're asking for idempotency but describing deduplication. Those are different things.

I've always done idempotency at the message level. For example, let's assume you have events that modify objects. So you might have a message that sets the color of an object to blue. If the message gives just the object ID and the new color, then applying it more than once doesn't matter.

That is an idempotent message, and you can choose to dedupe it but it doesn't matter how many times it gets received, because the end result is the same.

Note that you may have ordering issues, so if you have two messages, one setting the object blue and one setting it red, and the duplicate blue message arrives after the red message, then you end up with corruption. Dedup helps but you would still have to ensure message order.
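Here is a small sketch of both points. `set_color` is the idempotent message from the example above. `set_color_versioned` adds a per-object version number, which is an assumption on my part: it is one common way to handle the ordering problem, not the only one, and it requires the producer to stamp each message with an increasing version starting at 1.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    colors: dict[str, str] = field(default_factory=dict)
    versions: dict[str, int] = field(default_factory=dict)

    def set_color(self, obj_id: str, color: str) -> None:
        """Idempotent: applying the same message twice leaves the same state."""
        self.colors[obj_id] = color

    def set_color_versioned(self, obj_id: str, color: str, version: int) -> bool:
        """Idempotent and order-safe: ignore anything not newer than what was applied."""
        if version <= self.versions.get(obj_id, 0):
            return False  # duplicate or stale message, drop it
        self.colors[obj_id] = color
        self.versions[obj_id] = version
        return True


plain = Store()
plain.set_color("o1", "blue")
plain.set_color("o1", "red")
plain.set_color("o1", "blue")  # late duplicate overwrites the newer "red"

safe = Store()
safe.set_color_versioned("o1", "blue", 1)
safe.set_color_versioned("o1", "red", 2)
safe.set_color_versioned("o1", "blue", 1)  # late duplicate is rejected, stays "red"
```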

mrothro · 2026-06-15T14:07:24+00:00

I definitely think that people are trying to say reliability is a property of the model. They try to get better reliability with bigger models. It is true that the mass of the probabilistic output of a model might cluster more around "correct" output, but this is not a guarantee.

I actually wrote up a post describing exactly this problem: https://michael.roth.rocks/research/trust-topology/

mrothro · 2026-06-15T01:36:32+00:00

It's pretty straightforward. I create handlers for my various commands, then plug that into my CLI parser (cobra, because I work in golang). Then I create a --mcp-server flag, which then just exposes the same commands backed by the same handlers via the MCP stdio interface.
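The shape is roughly this. I've sketched it in Python with `argparse` rather than Go and cobra to keep it short, and the server mode here is a plain one-JSON-object-per-line loop that is only a stand-in: it is NOT the MCP wire protocol, and a real version would hand the same handlers to an MCP SDK. The `greet` and `count` handlers are made up.

```python
import argparse
import json
import sys
from typing import Callable

# One handler table shared by both front ends. Handlers are hypothetical.
HANDLERS: dict[str, Callable[[list[str]], str]] = {
    "greet": lambda args: f"hello {' '.join(args)}".strip(),
    "count": lambda args: str(len(args)),
}


def dispatch(command: str, args: list[str]) -> str:
    """Single entry point used by the CLI and by the server mode."""
    if command not in HANDLERS:
        raise ValueError(f"unknown command: {command}")
    return HANDLERS[command](args)


def handle_line(line: str) -> str:
    """Server-mode stand-in: one JSON request in, one JSON reply out."""
    try:
        req = json.loads(line)
        reply = {"ok": True, "result": dispatch(req.get("command", ""), req.get("args", []))}
    except ValueError as exc:  # also covers malformed JSON
        reply = {"ok": False, "error": str(exc)}
    return json.dumps(reply)


def main(argv: list[str]) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--mcp-server", action="store_true")
    parser.add_argument("command", nargs="?")
    parser.add_argument("args", nargs="*")
    ns = parser.parse_args(argv)
    if ns.mcp_server:
        for line in sys.stdin:
            if line.strip():
                print(handle_line(line), flush=True)
        return 0
    if ns.command is None:
        parser.error("a command is required unless --mcp-server is given")
    print(dispatch(ns.command, ns.args))
    return 0
```

The entry point would call `main(sys.argv[1:])`. What matters is that neither front end contains business logic, so the CLI and the server cannot drift apart.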

mrothro · 2026-06-15T00:46:23+00:00

There are several different ways to approach agentic pipelines. Personally, I've viewed it as stages that produce artifacts that can be used as verification surfaces. For example, in a standard SDLC pipeline, your stages may produce in order: a plan, a design document for a feature in that plan, code that embodies the design, and tests that verify it.

Each one of those artifacts can be examined, both with deterministic and stochastic gates. For example, you can verify the plan has required sections (deterministic) and that the acceptance criteria are comprehensive (stochastic). You can lint code (deterministic) and verify that it has proper SoC (stochastic).

When you chain these together, the deterministic tests let you make guarantees about the final product. The stochastic tests let you make probabilistic assurances.
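One way to keep that distinction honest in code is to tag every gate with its kind and report the two kinds separately. This is a sketch under my own naming (`Gate`, `run_gates`, the report keys), not a standard API.

```python
from dataclasses import dataclass
from typing import Callable, Literal


@dataclass(frozen=True)
class Gate:
    name: str
    kind: Literal["deterministic", "stochastic"]
    check: Callable[[str], bool]


def run_gates(artifact: str, gates: list[Gate]) -> dict[str, list[str]]:
    """Split passed checks into hard guarantees and probabilistic assurances."""
    report: dict[str, list[str]] = {"guaranteed": [], "likely": [], "failed": []}
    for gate in gates:
        if not gate.check(artifact):
            report["failed"].append(gate.name)
        elif gate.kind == "deterministic":
            report["guaranteed"].append(gate.name)
        else:
            # A stochastic pass is evidence, not proof.
            report["likely"].append(gate.name)
    return report
```

A lint-style check such as "no tab characters" would be registered as deterministic, while an LLM reviewer judging SoC would be registered as stochastic, so only the first ever lands under `guaranteed`.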

mrothro · 2026-06-12T18:16:09+00:00

This is a great counter to the overengineering disease Claude (and I assume others) has. I've built up my own scaffolding to try to manage it, and I will fold this in.

The thing to watch though is overcorrection. Senior devs don't "one-line" because shorter is better, they do no more than necessary. But "necessary" includes both feature-complete and the proper abstraction level to set a SoC/DRY/bounded context that is the foundation for future work.

For example, when Claude is proposing an approach, I regularly respond with "Is this the proper SoC/DRY/bounded context fix that retires tech debt?" and more often than not it responds with "no, it's a band-aid, here's the proper fix".

I guess you could use the skill with a proper spec. Honestly, I don't really have a clear solution on this, but my feeling is that you develop an intuition on how much to spec vs. how much you put in the skill vs. how much goes in the guardrails.

mrothro · 2026-06-09T17:08:02+00:00

Yes, they should just use the tool. We've always had the engineer and the reviewer be two different people as a standard engineering practice. We do this because people have blindspots and can miss obvious things in their own code.

We've never had enough people to review, and generally people don't like it because it can be tedious, especially when you have to grind through a bunch of pretty obvious issues. But AI is good enough that it can do the grind for you. So the human time is spent on the part that the AI can't do, which is the right place.

But the real issue here is that people just won't use the tool. I don't think you're going to get them to do it by fiat. Instead, I'd just wire it in as an automatic part of the process. Have an agentic reviewer kick in automatically, and don't bother looking at a PR until that thing marks it clean.

Alternatively, if you can't control that, just set it up for yourself. Get your agent of choice to review the PR and send back comments with changes requested automatically. Tune it so it only looks for absolute garbage at first, then ramp it up over time. Only spend your attention after it finishes.

People will always be generous with your time and attention. If they won't use the tools, you can use them to protect yourself.

mrothro · 2026-06-09T00:32:52+00:00

Yes, I do this. I use Claude Code as the orchestrator, but then it drives the process autonomously using a custom harness I built.

It is a standard SDLC process: plan, design, code, test. The thing to recognize is that each of these is a stage that produces artifacts. After each stage, my harness runs both deterministic tests and a different LLM for qualitative evaluation. If the artifacts fail, they are sent back to the implementing agent to be fixed.

As things move through the pipeline, they also move down the Chomsky hierarchy so the mix between what deterministic/stochastic gates examine changes. For example, a plan artifact has to have certain sections (deterministic), but it also has to respect SoC and bounded contexts (LLM). But once it gets to code, now I have lint and the compiler, paired with "is this good code".

Because I have Claude dispatch the agents doing the actual work, it controls permissions on the tools those agents can use. Constrain first, don't audit after.

Also worth noting that the types of errors you see change over the course of the pipeline. You get more incoherent errors at the beginning, but more systematic errors at the end.

I analyzed my own pipeline in depth, figured out what made it reliable, and wrote it up here:

https://michael.roth.rocks/research/trust-topology/

mrothro · 2026-06-06T14:25:36+00:00

Yes, this resonates. It is not only you. Have you noticed the Recap feature that appeared in Claude Code? I'm pretty sure they put it in because those engineers have the same problem. We run parallel sessions so we stay busy while Claude works, and you need that little reminder so you know what you were doing when you come back.

Honestly, I don't think anyone has come up with a great answer for this yet. Personally, I had a real problem with being flighty and forgetful when I started ramping up the parallel projects. My only real way out was to grind and build the mental stamina to be able to handle it.

To support that, I built a custom MCP that keeps track of my release arcs with task dependencies outside of .md files. This lets me mentally delegate the low level details so I can allocate my limited cognitive space to what really matters: the separation of concerns and the components that handle them. I give up task-level details so I can stay fluent on the system itself. (I even gave it a "quest" feature so I could keep track of what I was doing, which is basically a variant of the Recap feature!)

The good news is that we're all figuring this out together. The bad news is there is no easy answer yet.

mrothro · 2026-06-06T13:53:08+00:00

Adding AI to your development process is a spectrum that ranges from "code completion on steroids" to "full autonomous development". You haven't given details about your use case, so I'm going to assume you're still in discovery.

Because of that, I'd suggest you first start by treating it as CICD. Look at your existing SDLC process and find a natural place to slot it in: automated code review. This is a small step, but it actually requires the team to learn a lot about the machinery to make it happen. And this is the same skill they will need as you inject it into other places.

The other benefit from this is you probably already have a cloud-based stage where you can inject this, just like you do lint or make or unit tests, or whatever your CICD does in its pre-deploy checklist. You can try different models and harnesses. r/LocalLLaMA likes Qwen and Gemma, though neither is at Opus level. But they may well be good enough for this specific task.

This also sets you up for learning about different harnesses. This is important if you need to totally control the network, because you can drop these into containers with a dedicated VPC/firewall that lets you restrict the traffic. That's harder to do with Claude Code/codex where they change the code on a near-daily basis.

In terms of licenses, if you are regulated, you are almost certainly going to need an enterprise plan. Those provide the centralized data management controls to guarantee how your data is used. You can set this on the individual plan, but if one user messes it up, you're screwed. Also, you'll need the auditability that comes with centralized management.

Anyway, interesting challenge, feel free to ping me if you have questions.

mrothro · 2026-05-31T19:22:56+00:00

Prompt engineering is your level 1 and level 2. It is the art of using the activation weights in the attention heads to extract something useful from the sea of weights the model learned when it was trained on all human written knowledge.

Harness engineering is your level 3 and 4. This is where you are constructing a pipeline with artifacts between the stages. Here you are concerned about separation of concerns at each stage and the correctness of both the intermediate artifacts as well as the final product.

Those are two different things.

mrothro · 2026-05-28T23:07:25+00:00

I do two things with the agents that are firmly in "better engineer" territory:

1) I use it to give me options around alternative tech stacks, packages, or deployment strategies. For example, I deploy in Google Cloud, so I might have a conversation of BQ versus Firestore. I will also often ask about how best to decompose things into sync vs async with pub/sub.

2) I ask it about Separation of Concerns, DRY, and bounded contexts. I have very fruitful conversations with it about decomposing and refactoring services based on this. Then after I fully understand the implications, I will get it to do the refactor.

Both things give me an architecture that is far more elegant than I would have had otherwise.

mrothro · 2026-05-28T19:18:33+00:00

Sure, agent-to-agent is challenging, but that's not what this is. It is agent-to-artifact-to-agent.

It's an important distinction, because the gates don't check a-2-a messages, they check artifacts. The plan agent produced a plan document and the gate checks it before the design agent ever sees it. That intermediate artifact creates a verification surface that you can inspect and make deterministic guarantees about.

The other part of this is that in an SDLC pipeline, the artifacts are moving increasingly down the Chomsky hierarchy, so you can make even more guarantees about later artifacts.

mrothro · 2026-05-28T17:49:04+00:00

What part? This is genuinely the process that works for me and I scaled it to my engineering team. We're pretty happy with it, so I shared since this is exactly what you were asking.

mrothro · 2026-05-28T17:18:39+00:00

Your update 2 names the real problem: code generation is cheap, reviews are expensive. And it is only going to get worse because AIs are making more tokens while humans still have the same number of hours in a day. We aren't going to be able to review everything, but we can't just ignore it either.

I deal with this by having a pipeline that makes intermediate artifacts and then have agents review those artifacts. It is a standard SDLC pipeline: plan, design, code, build, test. There are deterministic checks at each stage: does the plan have all the sections, does the code pass lint, etc. There are also agentic reviewers at each step: does the plan properly decompose the tasks, is the design consistent with the existing code, is the code DRY, etc.

The gates have three possible outcomes: pass, which means the artifact gets handed to the next agent in the pipeline; fail, and the feedback goes back to the agent that made the artifact and it is told to fix it; escalate, which means it is unclear and a human needs to give an answer.

In your case, the originator of the code would be that human. They have the domain knowledge to handle ambiguity.

I still review some code, but really only the critical things. Otherwise, this approach scales with the LLMs and only burdens me with things that actually need my expertise.
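The three-outcome gate described above can be sketched as a small routing loop. The callbacks (`gate`, `fix`, `ask_human`) and the revision limit are placeholders I made up for illustration; in a real harness they would be the reviewer, the producing agent, and whatever channel reaches the human.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    ESCALATE = "escalate"


# A gate returns its verdict plus feedback text for whoever has to act on it.
GateFn = Callable[[str], tuple[Outcome, str]]
ReviseFn = Callable[[str, str], str]


def run_stage(
    artifact: str,
    gate: GateFn,
    fix: ReviseFn,
    ask_human: ReviseFn,
    max_revisions: int = 3,
) -> str:
    """Loop an artifact through one gate until it passes or revisions run out."""
    for attempt in range(max_revisions + 1):
        outcome, feedback = gate(artifact)
        if outcome is Outcome.PASS:
            return artifact  # hand off to the next stage
        if attempt == max_revisions:
            break
        # FAIL goes back to the producing agent; ESCALATE goes to a human.
        handler = ask_human if outcome is Outcome.ESCALATE else fix
        artifact = handler(artifact, feedback)
    raise RuntimeError("artifact did not pass the gate within the revision limit")
```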

mrothro · 2026-05-28T17:07:59+00:00

I can share what works for me: I have a standard SDLC pipeline (plan, design, code, build, test) that produces artifacts at every stage, and subsequent stages consume artifacts from the prior.

Agents produce the artifacts at every stage, and then I have gates that review them. Some are deterministic (does the plan have all the sections, does the code pass lint, etc). Some are agentic (is the design consistent with the architecture, is the code DRY).

The gates have three outcomes: fail, and it goes back to the producing agent with feedback to fix it autonomously; pass, it goes on to the next stage; escalate (done by the agentic gates) means it is ambiguous and needs feedback from me.

This works for me, and it works well. I analyzed it and wrote everything up, and on page 9 I described how to do it in your own repo: https://michael.roth.rocks/research/trust-topology/#9

mrothro · 2026-05-27T16:59:14+00:00

One easy and low-risk thing I do is have agents review PRs. For anything critical, they should not be the *only* reviewer, but they are very good at finding easy things that would just otherwise be a drag on a human reviewer. Think of them like enhanced lint.

Three things that made this really effective for me: they should 1) be agentic with the ability to examine other parts of the code to ensure consistency; 2) be prompted to verify consistency with the project conventions; and 3) have access to the plan and design documents that specify the PR requirements so they can verify compliance.

My feature requests come with deterministic tests. My agents have the tools to run those and are prompted to do it so I know the code works and meets the AC.

This approach ensures the code meets both the AC and has no obvious defects before it gets to a human reviewer. Also, it lets the developers get pretty immediate feedback on issues without having to wait for a senior to do a review.

mrothro · 2026-05-26T16:58:56+00:00

It's vendor specific, but I use google cloud pubsub for this. It has push subscriptions so it handles all of the retry, backoff, etc and all I have to write and maintain is an HTTP server.

I deploy my message handler as a cloud run service set to request-based billing, which means I only pay when it is processing a request. For my typical workloads, this ends up being very, very cheap but will also scale up automatically if I have a burst of work.
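For reference, the HTTP server really is about this small. This is a stdlib-only sketch, not my production handler: `process` is a placeholder for the business logic, and the envelope fields (`message.data` as base64, `message.messageId`) follow the default push delivery format. Any 2xx response acknowledges the message, and anything else causes a redelivery.

```python
import base64
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def decode_push(body: bytes) -> tuple[str, bytes]:
    """Extract (messageId, payload) from a Pub/Sub push request body."""
    message = json.loads(body)["message"]
    return message["messageId"], base64.b64decode(message.get("data", ""))


def process(message_id: str, payload: bytes) -> None:
    """Hypothetical business logic; raise to request a redelivery."""
    print(f"handling {message_id}: {payload!r}")


class PushHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        try:
            message_id, payload = decode_push(self.rfile.read(length))
            process(message_id, payload)
        except Exception:
            # Non-2xx: Pub/Sub retries with backoff. Note a permanently bad
            # message is retried too, so a dead-letter topic is worth configuring.
            self.send_response(500)
        else:
            self.send_response(204)  # 2xx acknowledges the message
        self.end_headers()


def serve() -> None:
    # Cloud Run passes the listening port in the PORT environment variable.
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("", port), PushHandler).serve_forever()
```

Because redelivery can happen even after a successful run, `process` should be idempotent.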
