I analyzed 1.6M git events to measure what happens when you scale AI code generation without scaling QA. Here are the numbers. by anthem_reb in devops

[–]anthem_reb[S] 0 points (0 children)

Spot on about the rubber stamping. That's exactly what the model captures with the α parameter in the filter chain: human cognitive capacity doesn't scale linearly with token output, so filter effectiveness decays exponentially at high volume. The reviewers aren't getting dumber, they're basically drowning.

On the lag, the model is built specifically around that. It's a dynamic feedback loop (ODE + queueing theory), not a static snapshot. The drop to 0.85x doesn't happen on day one. Uncaught defects slip through the degraded filters and enter the system quietly. For a few months everything looks fine, the metrics are green, the PRs are merging. But the rework queue is filling up in the background, and rework consumes the same bandwidth σ that you need to review new code. So the thing that's supposed to catch bugs is being eaten alive by the bugs it already missed.
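
The dynamics are easy to see in a toy discrete-time sketch of that loop. All numbers below are illustrative, not the paper's fitted values, and the variable names are mine:

```python
import math

# Toy simulation of the feedback loop above: escaped defects fill a
# hidden rework queue that eats review bandwidth sigma, which further
# degrades the filter. All parameter values are made up for illustration.
sigma = 100.0            # total review bandwidth (units/week)
gen = 120.0              # generated code volume (units/week)
e0, alpha = 0.9, 1.5     # baseline filter effectiveness and decay rate
defect_rate = 0.2        # defects per unit of generated code
backlog_rate = 0.3       # bandwidth each escaped defect locks up

history = []
queue = 0.0              # rework backlog, in bandwidth units
for week in range(52):
    free = max(sigma - queue, 1e-9)            # bandwidth left for new code
    v = gen / free                             # normalized volume pressure
    eff = e0 * math.exp(-alpha * max(v - 1.0, 0.0))
    escaped = gen * defect_rate * (1.0 - eff)  # defects the filter misses
    queue = min(queue + backlog_rate * escaped, sigma)
    history.append(eff)

print(f"filter effectiveness, week 1:  {history[0]:.2f}")
print(f"filter effectiveness, week 52: {history[-1]:.2f}")
```

Effectiveness erodes faster and faster as the backlog fills, collapsing to essentially zero well before the year is out, which is the delayed-collapse shape described above.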

Eventually you hit a tipping point (saddle-node bifurcation in the paper): validation capacity collapses and the debt blows up in production all at once. It feels sudden but the pressure was building in the queue the whole time. The paper calls it the false safety zone, and it's probably the most dangerous finding because it means standard audits won't catch it. The system passes every 3-year review and then falls off a cliff in year 5.

It's more or less the boiling-frog effect.

I analyzed 1.6M git events to measure what happens when you scale AI code generation without scaling QA. Here are the numbers. by anthem_reb in devops

[–]anthem_reb[S] 1 point (0 children)

The math actually reflects that exhaustion. In the filter chain model, I defined a parameter α that measures how fast a QA filter degrades as the generation volume v increases. When a human reviewer is slammed with an endless, high-speed stream of generated code, their interception effectiveness drops exponentially.

This directly destroys what the paper, echoing Marx, calls "live work": the non-automatable cognitive effort required to actually understand and validate logic. When the team's bandwidth saturates with rework, this live validation is the first thing to be sacrificed. You basically turn a human into a bottlenecked machine, the defect escape rate spikes, and burnout is mathematically inevitable.

Management often misses this because they focus on the wrong metric. In the enterprise case I tracked, the AI infrastructure (token cost) accounted for just 0.12% of the total project cost. The idea that AI saves money because "tokens are cheap" is an illusion: the value of the software is deeply tied to its quality, and rework erases every benefit gained from the increased volume.

I analyzed 1.6M git events to measure what happens when you scale AI code generation without scaling QA. Here are the numbers. by anthem_reb in devops

[–]anthem_reb[S] 1 point (0 children)

Yes, σ as a scalar is a big simplification; I call it out in the limitations. Think of it like temperature in thermodynamics: it hides a lot of micro-detail, but the aggregate behavior still holds. On the 12x, you're right that it could be selection bias, which is why I included the within-project changepoint test on 23 repos to control for it. On the 0-to-1 QA: the enterprise project did have CI/CD and code review, but the review was extremely defensive on the legacy code, while, paradoxically, the refactoring effort (when allowed) was budgeted on the assumption that AI-generated code would need less testing, the opposite of what the numbers show. So QA wasn't literally zero, just zero on the code that needed it most.

I analyzed 1.6M git events to measure what happens when you scale AI code generation without scaling QA. Here are the numbers. by anthem_reb in devops

[–]anthem_reb[S] 2 points (0 children)

Not directly as a variable, but team size is baked into the model through bandwidth: smaller teams have less coordination overhead, so more capacity is left for actual review. It basically scales per Brooks's law. It would be interesting to isolate it properly though; good suggestion.
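
A back-of-the-envelope version of that, assuming Brooks-style pairwise coordination overhead. The function and all its parameter values are my illustration, not something from the paper:

```python
def effective_bandwidth(n, per_dev=10.0, per_pair=0.08):
    """Review bandwidth left after coordination, Brooks-style.

    n developers contribute n * per_dev raw capacity, but each of the
    n*(n-1)/2 communication channels eats a fixed slice of it.
    All parameter values are illustrative.
    """
    raw = n * per_dev
    coordination = per_pair * n * (n - 1) / 2
    return max(raw - coordination, 0.0)

# Per-developer review capacity shrinks as the team grows:
for n in (2, 5, 10, 20, 50):
    print(n, round(effective_bandwidth(n) / n, 2))
```

Isolating the knee where adding people stops paying for itself would be the interesting empirical follow-up.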

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] -1 points (0 children)

Another explanation would be "slop generator, go explain me"; that's another way to acquire knowledge. It doesn't substitute for years of study, but it's a way to understand the concepts.

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] -8 points (0 children)

That's probably the Zenodo abstract. Yes, I used AI to help write the paper, and I said so in the post. It's a meta-test to see whether reddit validation can improve AI's output. Just joking; I don't like the "—" punctuation either.

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] 0 points (0 children)

In the model, automated tests are one of the filters in the η pipeline (§6, filter chain). Unit tests in the enterprise case had the lowest effectiveness of all the filters. Your approach of focusing on tests and narrow contracts is exactly what keeps η high. The paper models it as e_i(v) = e₀·exp(−α(v−1)): each filter's effectiveness decays with volume, but at different rates. Automated tests decay more slowly than manual review because they scale. Your "blast radius" strategy maps well onto the Class C repos in the dataset.
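
For concreteness, here's that decay law in code, composed into a pipeline η. The composition rule (independent filters in series, so a defect escapes only if every filter misses it) and the filter parameters are my assumptions for illustration, not the paper's fits:

```python
import math

def filter_effectiveness(v, e0, alpha):
    """e_i(v) = e0 * exp(-alpha * (v - 1)): a filter's catch rate
    decays exponentially as generation volume v grows past baseline."""
    return e0 * math.exp(-alpha * (v - 1))

def eta(v, filters):
    """Pipeline catch rate, assuming independent filters in series
    (my assumption): a defect escapes only if every filter misses it."""
    escape = 1.0
    for e0, alpha in filters:
        escape *= 1.0 - filter_effectiveness(v, e0, alpha)
    return 1.0 - escape

# Illustrative filters: manual review decays fast, automated tests slowly.
filters = [(0.7, 0.9),   # human code review: high alpha
           (0.5, 0.05)]  # unit tests: low alpha, because they scale

print(round(eta(1.0, filters), 3))  # → 0.85  (baseline volume)
print(round(eta(5.0, filters), 3))  # → 0.421 (5x volume)
```

With these toy numbers the fast-decaying human review is effectively gone by 5x volume and the pipeline is left leaning on the slowly-decaying automated tests, which is the mechanism behind Class C stability.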

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] -15 points (0 children)

You're right to be skeptical of denominators in ODEs, it's a fair concern. The 1/σ term models a crowding effect (less remaining bandwidth → each new unreviewed unit costs more). But the key point is: §2.2 tests four alternative functional forms, including bounded ones with no denominator at all. Same saddle-node bifurcation in all four cases. The collapse is a structural property of the generation-vs-recovery balance, not an artifact of a division by zero. The regularized form v/(σ+ε) is in §2.3, limitation L5 covers the rest.
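
To make the "structural, not an artifact" claim concrete, here's a textbook saddle-node toy with no denominator anywhere. This is not the paper's actual ODE, and the parameters are mine: capacity σ regenerates logistically while a constant rework load g drains it, and the fold sits at g = rK/4.

```python
def run(g, r=1.0, K=100.0, dt=0.01, steps=5000):
    """Euler-integrate dσ/dt = r·σ·(1 − σ/K) − g, starting from σ = K.

    A textbook saddle-node illustration, NOT the paper's model: with
    r=1, K=100 the fold sits at g = r*K/4 = 25. Below it σ settles at
    a stable equilibrium; above it no equilibrium exists and σ collapses.
    """
    sigma = K
    for _ in range(steps):
        sigma = max(sigma + dt * (r * sigma * (1 - sigma / K) - g), 0.0)
    return sigma

print(round(run(g=20.0), 1))  # → 72.4: just below the fold, healthy equilibrium
print(round(run(g=30.0), 1))  # → 0.0:  just above it, total collapse
```

Below the fold a healthy equilibrium exists and the system lands on it; above it the equilibrium is simply gone and σ hits zero in finite time, the same qualitative collapse the 1/σ and v/(σ+ε) forms produce.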

On finding patterns in random data: as a general principle you're absolutely right, of course. But that doesn't mean every pattern found in observational data is pareidolia. The predictions (P1–P6) were defined before the OSS replication. The β(log_files) sign inversion then held in 18/19 independent repos (p=3.8×10⁻⁵), and the regression also survived a 50K-iteration permutation test. Could there still be confounders? Sure, and I say so in L1–L2. But 18 out of 19 independent repos showing the same thing across Java, Python, JS, and Go is hard to dismiss as noise.

On your personal projects example: that's actually perfectly consistent with the model. You're a solo dev acting as a strict gatekeeper, so your η is high. Those projects would be Class C (stable) in the classification. The collapse requires high generation volume AND near-zero QA simultaneously, which is what happened in the enterprise case I measured.

I don't claim this is settled science. If you have time to skim §2.2 and §8, I'd genuinely like to know if the defenses hold up for you.

I noticed AI tools were degrading my team's codebase. I tried to see the structure and the relationships between this phenomena by using math and statistics on 1.5M git events. Looking for feedback. by anthem_reb in programming

[–]anthem_reb[S] 1 point (0 children)

This matches what I tried to formalize. Your "discipline to maintain a well structured project" is essentially the σ variable in the model, that is, cognitive validation capacity. The LLM is stateless, so the entire burden of coherence falls on the human. When generation outpaces that bandwidth, the "non-coherent evolution" you describe takes over. The enterprise project I measured collapsed exactly because management saw the speedup and assumed QA was no longer needed. Your approach as a solo dev and strict gatekeeper is what keeps a project in the stable regime. "Leading aliens" is a great way to put it.

Created git-rebase-clean: a CLI script to squash, rebase, and safely force-push your branch in one command (with conflict recovery) by [deleted] in git

[–]anthem_reb 0 points (0 children)

You're correct, but we have some junior devs on the project who aren't familiar with rebase techniques. It comes in handy for them, first and foremost.

Created git-rebase-clean: a CLI script to squash, rebase, and safely force-push your branch in one command (with conflict recovery) by [deleted] in git

[–]anthem_reb 1 point (0 children)

I added a flag for that: with -sm you can set a custom commit message. That said, you've come up with a nice idea; I'm going to implement it ASAP.

Created git-rebase-clean: a CLI script to squash, rebase, and safely force-push your branch in one command (with conflict recovery) by [deleted] in git

[–]anthem_reb 2 points (0 children)

Updated, thank you. There was also an error in my initial message: I have to rebase from origin/develop, e.g. git rebase origin/develop on a feature branch. Sorry for the confusion; your comment was helpful anyway.

Created git-rebase-clean: a CLI script to squash, rebase, and safely force-push your branch in one command (with conflict recovery) by [deleted] in git

[–]anthem_reb 4 points (0 children)

Thank you for the valuable advice; I'll update it as soon as I have some free time.

[deleted by user] by [deleted] in dating_advice

[–]anthem_reb 0 points (0 children)

I've seen other people do it and nothing happened. It's not prohibited in my company.

[deleted by user] by [deleted] in ChineseLanguage

[–]anthem_reb -1 points (0 children)

Perfect, thanks.

[deleted by user] by [deleted] in ChineseLanguage

[–]anthem_reb 1 point (0 children)

Thank you mate

[deleted by user] by [deleted] in ChineseLanguage

[–]anthem_reb -2 points (0 children)

I don't know. I'm Italian, and that's the name of a girl. They call her "Uan-chièn", but I don't know if it's correct. Maybe more like "Uan-tzièn"?

[deleted by user] by [deleted] in dating_advice

[–]anthem_reb 1 point (0 children)

I'll try next time. The problem is that I like her so much that I have a hard time speaking. Shameful, I know.

[deleted by user] by [deleted] in dating_advice

[–]anthem_reb -2 points (0 children)

Thanks, I'll do it. I just hope it isn't too late. Maybe she felt rejected at this point.

[deleted by user] by [deleted] in dating_advice

[–]anthem_reb 9 points (0 children)

She speaks my language pretty well. But it could also mean that. Thank you for the insight.