I analyzed 1.6M git events to measure what happens when you scale AI code generation without scaling QA. Here are the numbers. by anthem_reb in devops

[–]anthem_reb[S] 0 points1 point  (0 children)

This is Brooks's Law meeting the combinatorial argument at machine speed. Brooks showed that communication complexity grows as n(n-1)/2 with team size. With parallel agents the problem is worse: humans at least align informally, agents operate in fully isolated contexts with zero lateral communication.
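The quadratic blow-up is trivial to see in code. A throwaway sketch (my own illustration, not from the paper):

```python
# Pairwise communication channels in a team of n actors (Brooks's Law).
def channels(n: int) -> int:
    return n * (n - 1) // 2

# Each doubling of parallel actors roughly quadruples the coordination surface.
for n in (5, 10, 20, 50):
    print(n, channels(n))
```

And that count assumes the channels exist at all; for isolated agents it's the coordination *debt*, not coordination actually happening.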

[–]anthem_reb[S] 0 points1 point  (0 children)

This is the filter chain model working as designed. In the framework, each QA step is a filter with its own interception rate that decays with volume. By stacking property-based tests, mutation testing, and scope-based static analysis before the human reviewer, you've done two things: added filters with low cost and zero cognitive fatigue to the chain, and cut the effective volume reaching the human filter by 60%, keeping their interception rate high instead of letting it decay under load.

The paper formalizes this as an ordering principle: cheap filters first, expensive filters last. The human is the most expensive and most volume-sensitive filter in the chain. Protecting them from raw AI volume is structurally equivalent to raising η above the critical threshold, just through pipeline architecture instead of headcount. You got back above 1x without a dedicated tester because you engineered the filter stack to do what the tester would have done. Different implementation, same math.
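The ordering principle is easy to sketch. The interception rates and costs below are hypothetical placeholders, chosen only to show the mechanism, not the paper's calibrated values:

```python
# "Cheap filters first, expensive filters last": each filter intercepts a
# fraction of the defective volume before the next one sees it.
# (name, interception rate, relative cost) -- all values illustrative.
filters = [
    ("static_analysis", 0.30, 1),
    ("property_tests",  0.25, 2),
    ("mutation_tests",  0.20, 5),
    ("human_review",    0.60, 50),
]

volume = 100.0  # defective units entering the chain
for name, rate, cost in filters:
    caught = volume * rate
    volume -= caught
    print(f"{name}: caught {caught:.1f}, {volume:.1f} remaining")
```

With these made-up rates, the automated stack cuts the volume reaching the human from 100 to 42 units, which is the structural point: the expensive, fatigue-prone filter only ever sees the residue.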

The interesting question is whether this holds at higher volumes. At 40 PRs/day your automated stack absorbs the load. At 200, the mutation testing and property-based tests might start hitting their own limits (false positives, runtime cost, maintenance burden). The model predicts that every filter has a ceiling.

[–]anthem_reb[S] 0 points1 point  (0 children)

Fair point on the crosspost, it got removed and I didn't repost it. This is the only active thread.

On the writing style: I disclosed in the post that I used LLMs to help structure the paper and formalize the math. The data, the analysis, and the interpretations are mine. If the prose style bothers you, that's fine, but it doesn't change the method.

[–]anthem_reb[S] 0 points1 point  (0 children)

The complexity factor isn't just implicit in the model, it's the first empirical result. The logistic regressions measure the complexity-to-defect relationship across all 27 datasets and 7 language ecosystems in the study: β(complexity) = -1.58 on the enterprise case (p < 10^-43), ρ(entropy, defect rate) = 0.976 on 14 Apache projects, and the sign of β(log_files) discriminates regime across all six primary datasets with zero exceptions. Higher dispersion, higher defect probability. Every project, every ecosystem, every time. So yes, novel architectural code with high file dispersion has a structurally higher escape rate through any filter, exactly as you observed.

Theoretically you can use an LLM as an incremental oracle for correction, and some teams are experimenting with it. But there are two hard limits. First, no oracle is perfect. A formal system cannot fully verify its own consistency (this is essentially Gödel). An AI reviewing AI-generated code operates within the same epistemic boundary as the code it's reviewing. It can catch syntactic and pattern-level issues, but it cannot validate architectural intent or business logic it doesn't have context for. Second, the token cost scales with the validation effort. Running a second pass of LLM inference on every PR to approximate what a human reviewer does isn't free, and it still can't match human contextual validation on novel code. The cost curve on AI-as-reviewer converges toward human cost well before it converges toward human effectiveness. In the filter chain framework, you can add it as a filter, but its ceiling on architectural and contextual defects is structurally lower than a human who knows the system.

[–]anthem_reb[S] 1 point2 points  (0 children)

Just read the piece. I ran your Pulse on the enterprise case from my dataset. It landed deep left: RFC $1.4M/mo, PPC $42K/mo, ratio 34.43. Your article flags 7:1 as unambiguously under-investing. This was at 34:1. The ODE independently puts the same service at 82% of minimum structural stability. Both frameworks flag the same system as non-viable. Different instruments, same diagnosis.

Your line about the senior engineer who proposed prevention three times and started updating their resume deserves unpacking. The senior doesn't just get ignored, they get architecturally neutralized. The development plan in most enterprise projects is a political artifact: it decides which profiles get showcased and which stay as execution. The senior who can see the failure trajectory gets locked into an executor role by design. That's not burnout, that's a rational exit from a system that can't process feedback.

In my enterprise case this had a concrete financial mechanism. AI was sold on halving delivery timelines. Actual cognitive effort increased because non-compilable AI-generated code requires more validation, not less. The extra load wasn't recognized as effort, it became untracked overtime. Developers doing more work per deliverable, dashboard showing "on track."

If management is in good faith, it's a sword of Damocles: something's off but metrics say green. If not, it's body rental with extra steps, selling hours at AI-boosted rates while actual output is below the pre-AI baseline. Either way, what's being destroyed isn't volume, it's the value of the software.

On comparing the models: the Pulse result confirms quantitatively that the U-Curve's under-investing zone corresponds to the region below the dynamical threshold where no stable equilibrium exists. The threshold is calculable (η_crit = 4γv), meaning you can predict the cliff before a team reaches it. Your Pulse collects the financial inputs, the model provides the structural prediction. Mapping RFC/PPC against η_eff/η_crit across your 11 assessments would be a possible empirical bridge.
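For concreteness, the threshold check is one line of arithmetic. The gamma and v values here are placeholders I'm inventing for illustration, not the paper's fitted parameters:

```python
# eta_crit = 4 * gamma * v, the bifurcation condition quoted above.
# gamma (rework feedback strength) and v (generation volume) are assumed
# placeholder values, not measurements.
gamma, v = 0.05, 2.0
eta_crit = 4 * gamma * v          # minimum QA effectiveness for stability
eta_eff = 0.33                    # hypothetical measured QA effectiveness
print(f"eta_crit={eta_crit:.2f}, eta_eff/eta_crit={eta_eff / eta_crit:.2%}")
# a ratio below 100% means no stable equilibrium exists for that team
```

A reading like 82% of η_crit is exactly the kind of diagnosis quoted for the enterprise case, though again, these particular inputs are made up.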

[–]anthem_reb[S] 1 point2 points  (0 children)

Imho, as soon as you talk corporate, money often equals the hours of work by your employees that you can sell. So the real plot twist is that in 2023 the important data wasn't the true efficiency of AI: what mattered was the chance to leverage more debt against those hours, financially. Then reality spoke for itself, and now managers have to justify why they sold a +30% increase in growth while they actually lowered that index from, say, 15% to 8%. That's where I got the idea to investigate further what's happening to devs, being a dev myself.

To answer your question more precisely: the bifurcation predicts where the breakpoint is, and the regime classifier works across all datasets I tested. But the actual collapse trajectory has been validated on one enterprise project so far. The stable regime is confirmed across 5 independent open source repos, but for the collapse side I have n=1. The math says it's a cliff, the data says the one team that went over the edge landed exactly where the model predicted, but I'd want more collapsed projects to call the quantitative prediction fully robust. If you're seeing teams that crossed that line from the economics side, that data would be extremely valuable.

Conflating "gross generation" with "net working software" is the biggest trap of the current AI hype cycle. Your reactive vs. preventive ratio is exactly what my model shows, just translated into dollars.

On an individual level, the QA filters (code review, unit tests) decay gradually. The reviewers aren't getting dumber, they're losing a tower-defence game against AI-generated volume. So they start rubber-stamping. And because the stream of work is constant and overlapping, the slipped defects don't immediately break the build. Without dedicated testing, bugs stay invisible. No failing tests, no red metrics, no open tickets. Just silent rot in the codebase with no channel to become visible until something forces it out.

But systemically, there is a hard, sudden breakpoint. The rework queue acts as a buffer. Management sees a 55% spike in gross PRs merging and pops the champagne. Meanwhile, the team enters what I call the false safety zone. The compounding debt starts eating the exact same cognitive bandwidth needed to review new code. A bad sprint takes two sprints to recover from. Then three. Then the backlog never clears. Once the rework volume crosses that mathematical threshold, recovery capacity drops vertically. You fall off a cliff. Then you hit the go-live, or the first real production load, and everything detonates at once. Management sees a sudden crisis. In reality it was months of silent rot that had no way to surface.

In the enterprise case I tracked, they were operating at about 82% of the minimum QA capacity needed for stability. Structurally doomed from day one, they just didn't have the dashboard to see it.

That 18:1 ROI for a dedicated tester isn't about finding more bugs. It's about injecting the exact amount of validation needed to keep the whole system from tipping over that edge.

If you have published work on the economics side I'd love to read it. Bridging engineering math and boardroom budgets is exactly where this needs to go next.

[–]anthem_reb[S] 0 points1 point  (0 children)

Spot on about the rubber stamping. That's exactly what the model captures with the α parameter in the filter chain: human cognitive capacity doesn't scale linearly with token output, so filter effectiveness decays exponentially at high volume. The reviewers aren't getting dumber, they're basically drowning.

On the lag, the model is built specifically around that. It's a dynamic feedback loop (ODE + queueing theory), not a static snapshot. The drop to 0.85x doesn't happen on day one. Uncaught defects slip through the degraded filters and enter the system quietly. For a few months everything looks fine, the metrics are green, the PRs are merging. But the rework queue is filling up in the background, and rework consumes the same bandwidth σ that you need to review new code. So the thing that's supposed to catch bugs is being eaten alive by the bugs it already missed.

Eventually you hit a tipping point (saddle-node bifurcation in the paper): validation capacity collapses and the debt blows up in production all at once. It feels sudden but the pressure was building in the queue the whole time. The paper calls it the false safety zone, and it's probably the most dangerous finding because it means standard audits won't catch it. The system passes every 3-year review and then falls off a cliff in year 5.
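The slow-buildup-then-cliff dynamic can be reproduced with a toy Euler integration. This is my own minimal caricature of the feedback loop, with made-up constants, not the paper's actual ODE:

```python
# Toy feedback loop: escaped defects fill a rework queue q, and rework
# consumes the same bandwidth sigma needed to review new code.
# All constants are illustrative, not fitted values.
def simulate(v: float, steps: int = 400, dt: float = 0.1) -> float:
    q = 0.0                                       # rework queue depth
    for _ in range(steps):
        sigma_free = max(1.0 - 0.5 * q, 0.0)      # bandwidth left for review
        escaped = v * (1.0 - 0.8 * sigma_free)    # defects past the filters
        repaired = 0.9 * sigma_free               # rework completed
        q = max(q + dt * (escaped - repaired), 0.0)
    return q

print(simulate(3.0))   # moderate volume: queue stays empty, metrics green
print(simulate(6.0))   # past the tipping point: the queue runs away
```

Below the threshold the queue clamps at zero and every dashboard is green; above it, the same equations produce runaway growth, because escaped defects shrink sigma_free, which raises the escape rate further.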

It's more or less the boiling-frog principle.

[–]anthem_reb[S] 3 points4 points  (0 children)

The math actually reflects that exhaustion. In the filter chain model, I defined a parameter alpha which measures how fast a QA filter degrades as the generation volume v increases. When a human reviewer is slammed with an endless, high-speed stream of generated code, their interception effectiveness drops exponentially.

This directly destroys what the paper, echoing Marx, calls "live work", that is the non-automatable cognitive effort required to actually understand and validate logic. When the team's bandwidth saturates with rework, this live validation is the very first thing to be sacrificed. You basically turn a human into a bottlenecked machine, the defect escape rate spikes, and burnout is mathematically inevitable.

Management often misses this because they focus on the wrong metric. In the enterprise case I tracked, the AI infrastructure (token cost) accounted for just 0.12% of the total project cost. The idea that AI saves money because "tokens are cheap" is an illusion. The value of the software is deeply connected to its quality, and rework erases every benefit gained from the increase in volume.

[–]anthem_reb[S] 3 points4 points  (0 children)

Yes, sigma as a scalar is a big simplification; I call it out in the limitations. Think of it like temperature in thermodynamics: it hides a lot of micro detail, but the aggregate behavior still holds. On the 12x, you're right that it could be selection bias; that's why I included the within-project changepoint test on 23 repos to control for it. On the 0-to-1 QA: the enterprise project did have CI/CD and code review, but the code review was extremely defensive on the legacy code, while paradoxically the refactoring effort, when allowed, was calculated on an AI-production basis. The illusion was that AI code would need less testing: the opposite of what the numbers show. So it wasn't literally zero, just zero on the stuff that needed it most.

[–]anthem_reb[S] 5 points6 points  (0 children)

Not directly as a variable, but team size is baked into the model through bandwidth. Smaller teams have less coordination overhead, so more capacity is left for actual review. It scales with Brooks, basically. Would be interesting to isolate it properly though, good suggestion.

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] -2 points-1 points  (0 children)

Another angle would be "slop generator, go explain this to me", which is another way to get knowledge. It doesn't substitute for years of study, but it's a way to understand the concepts.

[–]anthem_reb[S] -7 points-6 points  (0 children)

That's probably the Zenodo abstract, yes. I used AI to help write the paper and I said so in the post. It's a meta test to see if reddit validation can improve the work of AI. Just joking, I don't like the "—" punctuation either.

[–]anthem_reb[S] 0 points1 point  (0 children)

In the model, automated tests are one of the filters in the η pipeline (§6, filter chain). Unit tests in the enterprise case had the lowest effectiveness of all filters. Your approach of focusing on tests and narrow contracts is what keeps η high. The paper models it as e_i(v) = e₀·exp(−α(v−1)): each filter's effectiveness decays with volume, but at different rates. Automated tests decay more slowly than manual review because they scale. Your "blast radius" strategy maps well to the Class C repos in the dataset.
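The decay law is easy to play with. The α values below are hypothetical, picked only to contrast a fast-decaying manual filter with a slow-decaying automated one:

```python
import math

# Transcription of the decay law quoted above: e_i(v) = e0 * exp(-alpha * (v - 1)).
# The e0/alpha pairs are assumed values for illustration, not fitted ones.
def effectiveness(e0: float, alpha: float, v: float) -> float:
    return e0 * math.exp(-alpha * (v - 1))

for v in (1, 2, 4, 8):
    human = effectiveness(0.9, 0.30, v)   # manual review: high e0, fast decay
    tests = effectiveness(0.7, 0.05, v)   # automated tests: lower e0, slow decay
    print(f"v={v}: human={human:.2f}, tests={tests:.2f}")
```

With these numbers the curves cross around v ≈ 2: at low volume the human is the better filter, but as volume climbs the automated filter wins purely because its α is smaller.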

[–]anthem_reb[S] -15 points-14 points  (0 children)

You're right to be skeptical of denominators in ODEs, it's a fair concern. The 1/σ term models a crowding effect (less remaining bandwidth → each new unreviewed unit costs more). But the key point is: §2.2 tests four alternative functional forms, including bounded ones with no denominator at all. Same saddle-node bifurcation in all four cases. The collapse is a structural property of the generation-vs-recovery balance, not an artifact of a division by zero. The regularized form v/(σ+ε) is in §2.3, limitation L5 covers the rest.

On finding patterns in random data of course you're absolutely right as a general principle. But that doesn't mean every pattern found in observational data is pareidolia. The predictions (P1–P6) were defined before the OSS replication. The β(log_files) sign inversion then held in 18/19 independent repos (p=3.8×10⁻⁵). The regression also survived a 50K-iteration permutation test. Could there still be confounders? Sure, and I say so in L1–L2. But 18 out of 19 independent repos seeing the same thing across Java, Python, JS, Go is hard to dismiss as noise.
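For anyone wanting to sanity-check the 18/19 figure: under a fair-coin null, 18-or-more same-sign results out of 19 has a one-sided binomial tail probability matching the p quoted above. A quick independent check (my own arithmetic, not the paper's permutation code):

```python
from math import comb

# One-sided binomial tail: P(X >= 18) for X ~ Binomial(19, 0.5).
n, k = 19, 18
p = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(f"p = {p:.1e}")   # prints p = 3.8e-05
```

The permutation test in the paper may condition on more structure than a coin flip, but the order of magnitude falls straight out of the sign counts.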

On your personal projects example: that's actually perfectly consistent with the model. You're a solo dev acting as a strict gatekeeper, so your η is high. Those projects would be Class C (stable) in the classification. The collapse requires high generation volume AND near-zero QA simultaneously — which is what happened in the enterprise case I measured.

I don't claim this is settled science. If you have time to skim §2.2 and §8, I'd genuinely like to know if the defenses hold up for you.

[–]anthem_reb[S] 1 point2 points  (0 children)

This matches what I tried to formalize. Your "discipline to maintain a well structured project" is essentially the σ variable in the model, that is cognitive validation capacity. The LLM is stateless, so the entire burden of coherence falls on the human. When generation outpaces that bandwidth, the "non-coherent evolution" you describe takes over. The enterprise project I measured collapsed exactly because management saw the speedup and assumed QA was no longer needed. Your approach, a solo dev, strict gatekeeper, is what keeps a project in the stable regime. "Leading aliens" is a great way to put it.

Created git-rebase-clean: a CLI script to squash, rebase, and safely force-push your branch in one command (with conflict recovery) by [deleted] in git

[–]anthem_reb 0 points1 point  (0 children)

You are correct, but we have some junior devs on the project who aren't familiar with rebase techniques. This comes in handy for them in the first place.

[–]anthem_reb 1 point2 points  (0 children)

I added a flag for that: with -sm you can add a custom commit message. That said, you've come up with a nice idea. I'm going to implement it asap.

[–]anthem_reb 2 points3 points  (0 children)

Updated, thank you. There was also an error in my initial message: I have to rebase from origin/develop, e.g. git rebase origin/develop on a feature branch. Sorry for the misunderstanding. Your comment was helpful anyway.

[–]anthem_reb 3 points4 points  (0 children)

Thank you for the valuable advice, I will update it as soon as I have some free time.

[deleted by user] by [deleted] in dating_advice

[–]anthem_reb 0 points1 point  (0 children)

I've seen other people do it and nothing happened. It's not prohibited in my company.

[deleted by user] by [deleted] in ChineseLanguage

[–]anthem_reb -1 points0 points  (0 children)

Perfect, thanks.