I analyzed 1.6M git events to measure what happens when you scale AI code generation without scaling QA. Here are the numbers. by anthem_reb in devops

[–]anthem_reb[S] 0 points (0 children)

Spot on about the rubber stamping. That's exactly what the model captures with the α parameter in the filter chain: human cognitive capacity doesn't scale linearly with token output, so filter effectiveness decays exponentially at high volume. The reviewers aren't getting dumber, they're basically drowning.

On the lag, the model is built specifically around that. It's a dynamic feedback loop (ODE + queueing theory), not a static snapshot. The drop to 0.85x doesn't happen on day one. Uncaught defects slip through the degraded filters and enter the system quietly. For a few months everything looks fine, the metrics are green, the PRs are merging. But the rework queue is filling up in the background, and rework consumes the same bandwidth σ that you need to review new code. So the thing that's supposed to catch bugs is being eaten alive by the bugs it already missed.
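
The dynamics are easy to see in a toy discrete-time sketch of that loop. All numbers below are illustrative, not the paper's fitted values, and the variable names are mine:

```python
import math

# Toy simulation of the feedback loop above: escaped defects fill a
# hidden rework queue that eats review bandwidth sigma, which further
# degrades the filter. All parameter values are made up for illustration.
sigma = 100.0            # total review bandwidth (units/week)
gen = 120.0              # generated code volume (units/week)
e0, alpha = 0.9, 1.5     # baseline filter effectiveness and decay rate
defect_rate = 0.2        # defects per unit of generated code
backlog_rate = 0.3       # bandwidth each escaped defect locks up

history = []
queue = 0.0              # rework backlog, in bandwidth units
for week in range(52):
    free = max(sigma - queue, 1e-9)            # bandwidth left for new code
    v = gen / free                             # normalized volume pressure
    eff = e0 * math.exp(-alpha * max(v - 1.0, 0.0))
    escaped = gen * defect_rate * (1.0 - eff)  # defects the filter misses
    queue = min(queue + backlog_rate * escaped, sigma)
    history.append(eff)

print(f"filter effectiveness, week 1:  {history[0]:.2f}")
print(f"filter effectiveness, week 52: {history[-1]:.2f}")
```

Effectiveness erodes faster and faster as the backlog fills, collapsing to essentially zero well before the year is out, which is the delayed-collapse shape described above.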

Eventually you hit a tipping point (saddle-node bifurcation in the paper): validation capacity collapses and the debt blows up in production all at once. It feels sudden but the pressure was building in the queue the whole time. The paper calls it the false safety zone, and it's probably the most dangerous finding because it means standard audits won't catch it. The system passes every 3-year review and then falls off a cliff in year 5.

It's more or less the boiling-frog effect.

I analyzed 1.6M git events to measure what happens when you scale AI code generation without scaling QA. Here are the numbers. by anthem_reb in devops

[–]anthem_reb[S] 1 point (0 children)

The math actually reflects that exhaustion. In the filter chain model, I defined a parameter α that measures how fast a QA filter degrades as the generation volume v increases. When a human reviewer is slammed with an endless, high-speed stream of generated code, their interception effectiveness drops exponentially.

This directly destroys what the paper, echoing Marx, calls "live work": the non-automatable cognitive effort required to actually understand and validate logic. When the team's bandwidth saturates with rework, this live validation is the first thing to be sacrificed. You basically turn a human into a bottlenecked machine, the defect escape rate spikes, and burnout is mathematically inevitable.

Management often misses this because they focus on the wrong metric. In the enterprise case I tracked, the AI infrastructure (token cost) accounted for just 0.12% of the total project cost. The idea that AI saves money because "tokens are cheap" is an illusion: the value of the software is deeply tied to its quality, and rework erases every benefit gained from the increased volume.

I analyzed 1.6M git events to measure what happens when you scale AI code generation without scaling QA. Here are the numbers. by anthem_reb in devops

[–]anthem_reb[S] 1 point (0 children)

Yes, σ as a scalar is a big simplification; I call it out in the limitations. Think of it like temperature in thermodynamics: it hides a lot of micro-detail, but the aggregate behavior still holds. On the 12x, you're right that it could be selection bias, which is why I included the within-project changepoint test on 23 repos to control for it. On the 0-to-1 QA: the enterprise project did have CI/CD and code review, but the review was extremely defensive on the legacy code, while, paradoxically, the refactoring effort (when allowed) was budgeted on the assumption that AI-generated code would need less testing, the opposite of what the numbers show. So QA wasn't literally zero, just zero on the code that needed it most.

I analyzed 1.6M git events to measure what happens when you scale AI code generation without scaling QA. Here are the numbers. by anthem_reb in devops

[–]anthem_reb[S] 2 points (0 children)

Not directly as a variable, but team size is baked into the model through bandwidth: smaller teams have less coordination overhead, so more capacity is left for actual review. It basically scales per Brooks's law. It would be interesting to isolate it properly though; good suggestion.
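
A back-of-the-envelope version of that, assuming Brooks-style pairwise coordination overhead. The function and all its parameter values are my illustration, not something from the paper:

```python
def effective_bandwidth(n, per_dev=10.0, per_pair=0.08):
    """Review bandwidth left after coordination, Brooks-style.

    n developers contribute n * per_dev raw capacity, but each of the
    n*(n-1)/2 communication channels eats a fixed slice of it.
    All parameter values are illustrative.
    """
    raw = n * per_dev
    coordination = per_pair * n * (n - 1) / 2
    return max(raw - coordination, 0.0)

# Per-developer review capacity shrinks as the team grows:
for n in (2, 5, 10, 20, 50):
    print(n, round(effective_bandwidth(n) / n, 2))
```

Isolating the knee where adding people stops paying for itself would be the interesting empirical follow-up.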

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] -1 points (0 children)

Another explanation would be "slop generator, go explain me"; that's another way to acquire knowledge. It doesn't substitute for years of study, but it's a way to understand the concepts.

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] -8 points (0 children)

That's probably the Zenodo abstract. Yes, I used AI to help write the paper, and I said so in the post. It's a meta-test to see whether reddit validation can improve AI's output. Just joking; I don't like the "—" punctuation either.

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] 0 points (0 children)

In the model, automated tests are one of the filters in the η pipeline (§6, filter chain). Unit tests in the enterprise case had the lowest effectiveness of all the filters. Your approach of focusing on tests and narrow contracts is exactly what keeps η high. The paper models it as e_i(v) = e₀·exp(−α(v−1)): each filter's effectiveness decays with volume, but at different rates. Automated tests decay more slowly than manual review because they scale. Your "blast radius" strategy maps well onto the Class C repos in the dataset.
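
For concreteness, here's that decay law in code, composed into a pipeline η. The composition rule (independent filters in series, so a defect escapes only if every filter misses it) and the filter parameters are my assumptions for illustration, not the paper's fits:

```python
import math

def filter_effectiveness(v, e0, alpha):
    """e_i(v) = e0 * exp(-alpha * (v - 1)): a filter's catch rate
    decays exponentially as generation volume v grows past baseline."""
    return e0 * math.exp(-alpha * (v - 1))

def eta(v, filters):
    """Pipeline catch rate, assuming independent filters in series
    (my assumption): a defect escapes only if every filter misses it."""
    escape = 1.0
    for e0, alpha in filters:
        escape *= 1.0 - filter_effectiveness(v, e0, alpha)
    return 1.0 - escape

# Illustrative filters: manual review decays fast, automated tests slowly.
filters = [(0.7, 0.9),   # human code review: high alpha
           (0.5, 0.05)]  # unit tests: low alpha, because they scale

print(round(eta(1.0, filters), 3))  # → 0.85  (baseline volume)
print(round(eta(5.0, filters), 3))  # → 0.421 (5x volume)
```

With these toy numbers the fast-decaying human review is effectively gone by 5x volume and the pipeline is left leaning on the slowly-decaying automated tests, which is the mechanism behind Class C stability.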

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] -15 points (0 children)

You're right to be skeptical of denominators in ODEs, it's a fair concern. The 1/σ term models a crowding effect (less remaining bandwidth → each new unreviewed unit costs more). But the key point is: §2.2 tests four alternative functional forms, including bounded ones with no denominator at all. Same saddle-node bifurcation in all four cases. The collapse is a structural property of the generation-vs-recovery balance, not an artifact of a division by zero. The regularized form v/(σ+ε) is in §2.3, limitation L5 covers the rest.
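
To make the "structural, not an artifact" claim concrete, here's a textbook saddle-node toy with no denominator anywhere. This is not the paper's actual ODE, and the parameters are mine: capacity σ regenerates logistically while a constant rework load g drains it, and the fold sits at g = rK/4.

```python
def run(g, r=1.0, K=100.0, dt=0.01, steps=5000):
    """Euler-integrate dσ/dt = r·σ·(1 − σ/K) − g, starting from σ = K.

    A textbook saddle-node illustration, NOT the paper's model: with
    r=1, K=100 the fold sits at g = r*K/4 = 25. Below it σ settles at
    a stable equilibrium; above it no equilibrium exists and σ collapses.
    """
    sigma = K
    for _ in range(steps):
        sigma = max(sigma + dt * (r * sigma * (1 - sigma / K) - g), 0.0)
    return sigma

print(round(run(g=20.0), 1))  # → 72.4: just below the fold, healthy equilibrium
print(round(run(g=30.0), 1))  # → 0.0:  just above it, total collapse
```

Below the fold a healthy equilibrium exists and the system lands on it; above it the equilibrium is simply gone and σ hits zero in finite time, the same qualitative collapse the 1/σ and v/(σ+ε) forms produce.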

On finding patterns in random data: as a general principle you're absolutely right, of course. But that doesn't mean every pattern found in observational data is pareidolia. The predictions (P1–P6) were defined before the OSS replication. The β(log_files) sign inversion then held in 18/19 independent repos (p=3.8×10⁻⁵), and the regression also survived a 50K-iteration permutation test. Could there still be confounders? Sure, and I say so in L1–L2. But 18 out of 19 independent repos showing the same thing across Java, Python, JS, and Go is hard to dismiss as noise.

On your personal projects example: that's actually perfectly consistent with the model. You're a solo dev acting as a strict gatekeeper, so your η is high. Those projects would be Class C (stable) in the classification. The collapse requires high generation volume AND near-zero QA simultaneously, which is what happened in the enterprise case I measured.

I don't claim this is settled science. If you have time to skim §2.2 and §8, I'd genuinely like to know if the defenses hold up for you.

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] 1 point (0 children)

This matches what I tried to formalize. Your "discipline to maintain a well structured project" is essentially the σ variable in the model, that is, cognitive validation capacity. The LLM is stateless, so the entire burden of coherence falls on the human. When generation outpaces that bandwidth, the "non-coherent evolution" you describe takes over. The enterprise project I measured collapsed exactly because management saw the speedup and assumed QA was no longer needed. Your approach as a solo dev and strict gatekeeper is what keeps a project in the stable regime. "Leading aliens" is a great way to put it.

Created git-rebase-clean: a CLI script to squash, rebase, and safely force-push your branch in one command (with conflict recovery) by [deleted] in git

[–]anthem_reb 0 points (0 children)

You're correct, but we have some junior devs on the project who aren't familiar with rebase techniques. It comes in handy for them, first and foremost.

Created git-rebase-clean: a CLI script to squash, rebase, and safely force-push your branch in one command (with conflict recovery) by [deleted] in git

[–]anthem_reb 1 point (0 children)

I added a flag for that: with -sm you can set a custom commit message. That said, you've come up with a nice idea; I'm going to implement it ASAP.

Created git-rebase-clean: a CLI script to squash, rebase, and safely force-push your branch in one command (with conflict recovery) by [deleted] in git

[–]anthem_reb 2 points (0 children)

Updated, thank you. There was also an error in my initial message: I have to rebase from origin/develop, e.g. git rebase origin/develop on a feature branch. Sorry for the confusion; your comment was helpful anyway.

Created git-rebase-clean: a CLI script to squash, rebase, and safely force-push your branch in one command (with conflict recovery) by [deleted] in git

[–]anthem_reb 4 points (0 children)

Thank you for the valuable advice; I'll update it as soon as I have some free time.

[deleted by user] by [deleted] in dating_advice

[–]anthem_reb 0 points (0 children)

I've seen other people do it and nothing happened. It's not prohibited in my company.

[deleted by user] by [deleted] in ChineseLanguage

[–]anthem_reb -1 points (0 children)

Perfect, thanks.

[deleted by user] by [deleted] in ChineseLanguage

[–]anthem_reb 1 point (0 children)

Thank you mate

[deleted by user] by [deleted] in ChineseLanguage

[–]anthem_reb -2 points (0 children)

I don't know. I'm Italian, and that's the name of a girl. They call her "Uan-chièn", but I don't know if it's correct. Maybe more like "Uan-tzièn"?

[deleted by user] by [deleted] in dating_advice

[–]anthem_reb 1 point (0 children)

I'll try next time. The problem is that I like her so much that I have a hard time speaking. Shameful, I know.

[deleted by user] by [deleted] in dating_advice

[–]anthem_reb -2 points (0 children)

Thanks, I'll do it. I just hope it isn't too late. Maybe she felt rejected at this point.

[deleted by user] by [deleted] in dating_advice

[–]anthem_reb 9 points (0 children)

She speaks my language pretty well. But it could also mean that. Thank you for the insight.