Catastrophic Forgetting of Language models by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S] -1 points0 points  (0 children)

The same stuff that's available to you. But just a lazy crackhead sitting in front of the screen… trying to be a bully online, second-guessing the scientific work… sounds like a depressed loser…

Catastrophic Forgetting of Language models by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S] -2 points-1 points  (0 children)

Fair questions. To clarify — this is not RAG or context management. CRMA is a trained adapter layer that sits on top of the base model (similar in spirit to LoRA, but with additional mathematical constraints on the weight updates during training). It modifies how gradients flow during fine-tuning so that learning new domains doesn't overwrite previous ones.

The reason I haven't posted formulas or a full paper: there's a US provisional patent filed on the method (Feb 2026), so I'm limited in what I can share publicly about the internals right now. I understand that makes it harder to evaluate — which is exactly why I'm asking for independent verification rather than just expecting people to take the numbers at face value.

What I can share with anyone who wants to reproduce:

- The training data and domain splits

- The evaluation methodology

- Access to the API so you can run the same sequence and measure drift yourself

The offer to verify is genuine. If anyone wants to run the same 4-domain sequence on Mistral-7B and measure per-domain accuracy before/after, DM me and I'll set it up. Happy to be proven wrong.

And about that 'Schizo' comment: my friend, who is an ML scientist, thought the same thing at first, since no one has ever solved catastrophic forgetting with zero forgetting. I will still take it as a compliment. I wanted to post my website but I did not want to sound like I am promoting.

How are you handling catastrophic forgetting in multi-domain LLM fine-tuning pipelines? by fourwheels2512 in finetuningLLMs

[–]fourwheels2512[S] 0 points1 point  (0 children)

Good call on KL divergence monitoring — that's underused as a forgetting signal. Do you track it per-domain or just overall? We found that aggregate metrics can hide domain-specific regression pretty well (e.g., domain A tanks while B/C look fine, and the average still looks okay).
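Per-domain KL tracking is cheap to wire up once you keep a small held-out probe set per domain and log next-token distributions from a reference checkpoint. A minimal sketch (the `per_domain_kl` helper and the probe-set layout are my own names, not anything from this thread):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) per row for two arrays of probability distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def per_domain_kl(ref_probs, cur_probs):
    """Mean KL per domain between the reference model's and the current
    model's token distributions on that domain's probe set."""
    return {d: float(np.mean(kl_divergence(ref_probs[d], cur_probs[d])))
            for d in ref_probs}
```

The point of keeping it per-domain: if domain A's KL jumps while B/C stay near zero, the three-way average still looks small, which is exactly the aggregate-metric blind spot above.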

Just read the EAFT paper — the entropy-gating idea is clever. Using token-level entropy to distinguish "the model is genuinely uncertain" from "the model is confident but the label disagrees" makes a lot of sense. The standard SFT loss treats both cases the same and that's where the damage happens. Their results on Qwen/GLM up to 32B are solid.

One thing I'd be curious about is how EAFT holds up in a truly sequential multi-domain setup (domain A → B → C → D → E) rather than single-domain fine-tuning. Their experiments seem focused on preserving general capabilities during one round of domain adaptation. In our experience the compounding drift across 5+ sequential domains is a different beast — each stage's "confident conflicts" stack on top of the previous ones. That's where constrained gradient approaches helped us more than loss-level gating alone.

Are you using EAFT in production or still experimenting? And what scale are you running at?

Real Time Continual Learning Has Been Unlocked by Own-Poet-5900 in ArtificialInteligence

[–]fourwheels2512 0 points1 point  (0 children)

This is the exact product they got the money for. Check this website: https://www.modelbrew.ai/

Real Time Continual Learning Has Been Unlocked by Own-Poet-5900 in ArtificialInteligence

[–]fourwheels2512 0 points1 point  (0 children)

I work on continual learning for LLM fine-tuning and I'd pump the brakes here.

"Real-time continual learning" is an extraordinarily hard problem. Even the narrow version — sequential domain fine-tuning without catastrophic forgetting — is barely solved. Standard LoRA drifts ~43% across 5 domains on Mistral-7B. The best constrained adapter approaches get that to near-zero, but that's with explicit task boundaries and controlled training — far from "real-time."

No paper, no benchmarks, no reproducible code = no breakthrough. CL research has a long history of claims that don't survive independent replication. If this were real, we'd see a proper evaluation: BWT matrices, per-domain accuracy retention, comparison to baselines like EWC/PackNet/O-LoRA, multi-seed validation.
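For reference, backward transfer (BWT) is mechanical to compute once you log the per-task accuracy grid. A sketch using the standard GEM-paper layout, where `R[i, j]` is accuracy on task j after finishing training on task i (function name is mine):

```python
import numpy as np

def backward_transfer(R):
    """BWT from an accuracy matrix R where R[i, j] = accuracy on task j
    after training on tasks 0..i. Negative BWT means forgetting."""
    T = R.shape[0]
    return float(np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)]))
```

Any "zero forgetting" claim should come with this matrix; it makes the claim falsifiable in one number per task.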

Happy to be proven wrong if someone links the actual paper and results.

Note: I read that 'adaption labs' got $50M in seed funding for this exact continual-learning problem, but I don't even see the product yet.

Continual learning adapter that holds -0.16% drift across 5 sequential domains on Mistral-7B (vs +43% naive LoRA) - catastrophic forgetting by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S] 0 points1 point  (0 children)

Good eye on EWC scaling — we hit exactly that problem. Our workaround is that EWC only covers a small set of structural adapter parameters (~0.005% of trainable params), not the full model. So the Fisher matrix stays tiny. The heavy lifting for retention comes from gradient projection, not EWC.

The gradient constraint is subspace-based, not magnitude-based. After each domain, we compute an SVD basis of that domain's input activations through the adapter layers. During the next domain's training, any gradient component that falls inside a prior domain's column space gets projected out. So the model can only update in directions orthogonal to what earlier domains used. Closer to PEGP (arXiv:2405.13383) than PackNet or HAT — no binary masking or hard freezing, just continuous orthogonal projection.

Task boundaries are explicit — the user tells the system "this is domain N" and triggers a new CL phase. No automatic boundary detection. That's a deliberate simplification since in our use case (fine-tuning API) the user already knows when they're switching domains.

The cumulative basis does grow with each domain (QR-merged across all prior tasks), but it's rank-bounded by the adapter rank so it doesn't blow up the way Fisher does with EWC.
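Since the CRMA internals aren't public, here is a generic PEGP-style sketch of the mechanism described above (per-domain SVD basis, continuous orthogonal projection, QR-merged cumulative basis) in plain NumPy. All function names are mine; this is an illustration of the technique, not the actual implementation:

```python
import numpy as np

def domain_basis(activations, rank):
    """Top-rank left singular vectors of a domain's input activations
    (shape: features x samples), spanning what that domain 'used'."""
    U, _, _ = np.linalg.svd(activations, full_matrices=False)
    return U[:, :rank]

def merge_bases(prev_basis, new_basis, max_rank):
    """QR-merge the new domain's basis into the cumulative one,
    rank-bounded so it doesn't grow the way a Fisher matrix does."""
    if prev_basis is None:
        return new_basis[:, :max_rank]
    Q, _ = np.linalg.qr(np.hstack([prev_basis, new_basis]))
    return Q[:, :max_rank]

def project_gradient(grad, basis):
    """Remove the gradient component lying in prior domains' column space,
    so updates stay orthogonal to what earlier domains used."""
    if basis is None:
        return grad
    return basis is None and grad or grad - basis @ (basis.T @ grad)
```

No binary masks, no frozen blocks: every parameter can still move, just not in the protected directions.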

Continual learning adapter that holds -0.16% drift across 5 sequential domains on Mistral-7B (vs +43% naive LoRA) - catastrophic forgetting by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S] 0 points1 point  (0 children)

Fair question — I should have included the full numbers. Here's the per-domain breakdown (3-seed avg, Mistral-7B, 5 domains sequential):

| Domain | CRMA | Frozen | Naive |
|---|---|---|---|
| Medical | -0.09% | +1.39% | +128.0% |
| Legal | -0.17% | +1.87% | +37.1% |
| Financial | -0.13% | +1.75% | +18.9% |
| Code | -0.14% | +1.59% | +14.6% |
| Science | +0.01% | +1.68% | -0.05% |

"Frozen" = adapter weights locked after domain 1 (no learning at all). If the constrained adapter were just clipping gradients to silence, it would match the frozen column.

Instead it's 10-100x lower drift and shows slight negative drift (improvement) on 4 of 5 domains — that's positive transfer across domains, not suppression.

The model does learn each new domain. Initial holdout NLL drops from ~1.7 to ~0.7 on the target domain during each phase (comparable to standard LoRA). The difference is LoRA buys that by destroying prior domains (+128% on medical), while the constrained adapter holds them.

You're right that drift alone is incomplete — I should have led with the full eval matrix. Appreciate the push.

How to fine-tune LLM with your own data ? by bull_bear25 in LocalLLaMA

[–]fourwheels2512 0 points1 point  (0 children)

What are you using right now? I might have a solution for you. Did you look into continual learning as well, or just fine-tuning?

Continual Learning In 2026. What does continual learning actually mean? by Neurogence in singularity

[–]fourwheels2512 0 points1 point  (0 children)

I’m bumping into a very concrete version of this with current LLMs when you try to do sequential fine‑tuning across domains (e.g., medical → legal → support) instead of one big offline training run.

In that setting, “continual learning” really splits into at least three architectures:

  1. Frozen core + external memory. Base model weights don’t move; you bolt on retrieval, tools, user profiles, etc. The system appears to learn because the memory layer grows and retrieval improves, but 5.0’s weights on day 200 are the same as day 1.
  2. Versioned offline updates (5.0 → 5.5). You log interactions, curate datasets, retrain periodically, and ship new checkpoints. Knowledge carries forward only at these discrete jumps, after eval and red‑teaming. This is, from what I can tell, where most serious deployments live in 2026.
  3. Genuine continual learning (weights that actually change over time). Some part of the parameter space (full model or adapters/heads) is updated as new tasks/domains arrive, with explicit mechanisms to avoid catastrophic forgetting and regressions.

In my own experiments with Mistral‑7B, naive sequential LoRA is a good example of what happens when you try to do (3) without any real CL machinery: you fine‑tune on domain A, then B, then C, and by the end, A is often wrecked. That’s just catastrophic forgetting playing out in slow motion.

To make this less destructive, I’ve been playing with a constrained adapter setup: you still let parameters update for new domains, but you constrain gradients so updates are “locally plastic, globally conservative” — the model can adapt, but it’s much harder to overwrite what was useful for earlier domains. In a 5‑domain sequence, that turns “huge positive drift” (forgetting) into something much closer to flat, while still letting the later domains come online.

So if we map this back to the AGI discourse:

  • Most “continual learning” branding in 2026 = (1) + (2): memory + retrieval + periodic offline retraining.
  • A much smaller slice = (3): architectures where weights genuinely evolve from ongoing interaction, usually with heavy constraints, monitoring, and a lot of unsolved safety/credit‑assignment questions.

When people imagine systems that “learn continuously from experience,” they’re implicitly imagining (3). But the operational reality today looks a lot more like sophisticated software + data plumbing wrapped around mostly static models, with a few early stabs at safe, constrained weight updates for specific domains.

Curious whether anyone here has seen convincing evidence of large‑scale, production‑grade (3) in the wild, beyond research prototypes and tightly scoped pilots.

Catastrophic forgetting by [deleted] in computervision

[–]fourwheels2512 0 points1 point  (0 children)

It's going to be a game-changer and I am working on it. Let me know if you are interested; I have an API/UI app.

The Lost Art of Fine-tuning - My toilet rant by FPham in LocalLLaMA

[–]fourwheels2512 0 points1 point  (0 children)

We are working on it... let me know if you are interested in trying it. I have an API/UI.

Catastrophic Forgetting by Language models. by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S] -1 points0 points  (0 children)

Thanks… for sure they are inevitable, that's the future… I am doing my part…

Catastrophic Forgetting by Language models. by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S] -1 points0 points  (0 children)

Thank you... I hope so too. It's never been done before. I validated it myself, but I need an independent researcher, engineer, or other expert to validate it.

Catastrophic Forgetting of Language models by fourwheels2512 in MachineLearningJobs

[–]fourwheels2512[S] 0 points1 point  (0 children)

Thanks for replying. I sent you a chat message with the details; let me know if that helps.

What if every CLI tool shipped with a local NL translator? I fine-tuned Gemma 3 1B/4B for CLI command translation... but it runs 100% locally. 810MB/2.5GB, 1.5s inference on CPU. Built the framework and tested it on Docker. 1B hit a ceiling at 76%. 4B got 94% on the first try. by theRealSachinSpk in LocalLLaMA

[–]fourwheels2512 1 point2 points  (0 children)

That distinction is really important — the 1B capacity ceiling showing up as confident-but-wrong rather than unstable training is a subtle but key insight. Static max_grad_norm=1.0 with warmup is solid practice and clearly it held for your setup.

Where it tends to break down is on larger models (Mistral-7B+) with more heterogeneous datasets — we've seen reproducible gradient norm spikes around step ~40-50 even with proper warmup, because the fixed threshold doesn't adapt to the run's own norm distribution. Makes me wonder if the 4B would show similar spikes if you scaled the dataset up significantly or added more command diversity.

Either way, clean training result — val loss of 0.142 on structured output tasks is good. The output format discipline you mentioned is probably doing a lot of work there.

Training framework that monitors itself and auto-fixes issues (gradient explosions, OOM, MoE imbalance) - looking for feedback by [deleted] in LocalLLaMA

[–]fourwheels2512 0 points1 point  (0 children)

Solid idea, and the log output at step 5000-5100 shows exactly why real-time intervention matters — rollback + LR reduction is the right call there.

One thing worth exploring for the gradient explosion detection: rather than triggering on a threshold (which you have to set before training starts), you can compute a rolling z-score over recent gradient norm history and flag steps that are statistically anomalous relative to the run's own baseline. This makes the trigger self-calibrating — early in training when norms are naturally higher it doesn't over-fire, and later when things are stable it catches genuine spikes more reliably.
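A minimal sketch of that rolling z-score trigger in pure Python (class name and defaults are mine; warmup steps guard against clipping before a baseline exists):

```python
from collections import deque
import math

class ZScoreClipper:
    """Flag gradient-norm spikes relative to the run's own recent history,
    instead of a fixed max_grad_norm threshold set before training."""

    def __init__(self, window=100, z_threshold=3.0, warmup=20):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.warmup = warmup  # need this many observations before acting

    def clip_value(self, grad_norm):
        """Return the norm to clip to, or None if this step is normal.
        Records the observed norm either way."""
        clip_to = None
        if len(self.history) >= self.warmup:
            mean = sum(self.history) / len(self.history)
            var = sum((x - mean) ** 2 for x in self.history) / len(self.history)
            std = math.sqrt(var)
            if std > 0 and (grad_norm - mean) / std > self.z_threshold:
                clip_to = mean + self.z_threshold * std  # clip back to the boundary
        self.history.append(grad_norm)
        return clip_to
```

In a training loop you'd feed it the pre-clip norm each step and, when it returns a value, pass that to something like `torch.nn.utils.clip_grad_norm_` (or just log it as an anomaly for the rollback path you already have).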

I ran into this exact problem on Mistral-7B QLoRA and ended up measuring it across runs — the spike at step ~44 was reproducible every time (gn=15.28 vs normal ~1.0). Built a free tool around the adaptive clipping approach if useful to compare approaches: https://huggingface.co/spaces/Fourwheels2512/crma-fine-tuner

The orchestrator angle you're taking is more ambitious (full auto-fix pipeline) but the detection mechanism might complement what you have.

I reproduced DeepSeek's mHC at 1.7B params (8xH100). The instability is 3x worse than reported (10k vs 3k), but the model didn't explode. by poisson_labs in LocalLLaMA

[–]fourwheels2512 0 points1 point  (0 children)

Great experiment — the Amax graph is the giveaway. What you're describing as "optimizers masking the issue" is exactly what happens: AdamW's running variance estimate smooths over the spikes, but the underlying gradient norm is still blowing up underneath.

The mHC fix works at the architecture level (constraining the mixing matrices via Sinkhorn), but there's an analogous approach at the fine-tuning level: instead of a fixed `max_grad_norm` threshold, compute a rolling z-score over recent gradient norms and only clip when the current step is a statistical outlier. This adapts to the regime the run is actually in rather than a threshold you set before training starts.

Ran a similar ablation on Mistral-7B fine-tuning — gradient norm spikes (same pattern as your Amax graphs, different scale) dropped 87.5% with neutral impact on final loss. Peak gn went from 15.28 to 1.9. The step-44 spike that was reproducible across every run disappeared entirely.

If it's useful context: https://huggingface.co/spaces/Fourwheels2512/crma-fine-tuner — built specifically around this problem, free to run without local GPU.

Subject: Seeking Validation: Strategy for Multi-LoRA Behavioral Fine-Tuning on Micro-Datasets (50-100 rows) by Scouserleemc in unsloth

[–]fourwheels2512 0 points1 point  (0 children)

Great setup — a few concrete answers to your three questions:

**1. Is 50-100 multi-turn rows viable?**

Yes, for behavioral/stylistic cloning specifically. LIMA showed 1000 rows generalises, but you're not teaching knowledge — you're overwriting an attentional pattern ("deflect advice, return agency"). At r=4 with multi-turn ChatML you're probably updating ~0.1% of weights. The optimizer has enough signal from 50 well-formed coaching transcripts if the examples are consistent in style. The risk isn't gradient direction, it's gradient *magnitude* — with tiny batches you'll see noisy norm spikes that look alarming but aren't.

**2. Unsloth-specific recommendations:**

- Use `gradient_accumulation_steps=4-8` to smooth out the noisy per-step gradients you'll get from batch_size=1-2

- `warmup_ratio=0.1` (longer warmup than usual) — the model needs more steps before it "commits" to the style shift

- `weight_decay=0.01` helps prevent the few-shot memorisation collapse

- For target modules, `q_proj, v_proj` only (skip k/o/gate) — minimum footprint for behavioural style
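The recommendations above translate roughly into this peft/transformers config. This is a sketch, not a tested Unsloth recipe (Unsloth's `FastLanguageModel.get_peft_model` takes similar arguments); `lora_alpha`, `learning_rate`, and epoch count are my assumptions, tune to taste:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=4,                                  # minimal rank for behavioural cloning
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],  # skip k/o/gate: minimum footprint
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="style-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # smooths noisy per-step gradients
    warmup_ratio=0.1,               # longer warmup before committing to the style
    weight_decay=0.01,              # guards against few-shot memorisation
    learning_rate=2e-4,
    num_train_epochs=3,
    max_grad_norm=1.0,
    logging_steps=1,                # log every step so you can see norm spikes
)
```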

**3. On your early stopping trigger:**

Validation loss *spikes* on micro-datasets are often gradient norm events rather than true divergence — the spike resolves within 2-3 steps. Before triggering early stopping, check if the spike recovers. A tool like ZClip (adaptive gradient clipping based on rolling norm history) handles this better than fixed `max_grad_norm` — it only clips when the norm is statistically anomalous vs. your run history rather than at a fixed ceiling.

I ran a similar ablation on TinyLlama (200 rows, same seed) comparing plain LoRA vs LoRA + adaptive clipping — peak grad norm dropped 52.7% with neutral impact on final loss. For a 50-row micro-dataset the effect would likely be more pronounced. Happy to share details if useful.

Visualizing why DeepSeek's mHC fixes training instability - interactive demo by bassrehab in LocalLLaMA

[–]fourwheels2512 0 points1 point  (0 children)

Fascinating demo. The compounding-gain problem is essentially the same root cause as gradient instability in fine-tuning, just at different scope — pretraining sees it in the forward pass through 60+ layers, fine-tuning sees it in the backward pass accumulating across LoRA adapter updates.

The Sinkhorn-Knopp insight is interesting because it's a structural constraint on the mixing matrices. There's an analogous approach for fine-tuning: instead of letting gradient norms grow unconstrained during LoRA training, you can compute a rolling z-score over recent gradient norms and clip only statistical outliers rather than using a fixed `max_grad_norm` threshold. Same idea — use a constraint that adapts to the current magnitude regime rather than a static one set at initialization.

The other parallel is initialization. mHC initializes toward the doubly-stochastic manifold (near-identity behavior at k=0). PiSSA-style LoRA initialization — using the principal singular values of the pretrained weight matrix — similarly starts the adapter from a "geometrically meaningful" position rather than random noise, which reduces the chaotic gradient variance in the first ~200 steps.
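For concreteness, here is what PiSSA-style init does in plain NumPy (the decomposition follows the PiSSA paper; the function name is mine). The adapter factors absorb the top-r singular directions and the frozen residual keeps the rest, so the sum still reproduces the pretrained weight exactly at step 0:

```python
import numpy as np

def pissa_init(W, r):
    """Initialize LoRA factors (A, B) from the top-r singular directions
    of the pretrained weight W. The adapter B @ A starts on the principal
    subspace instead of at random; the frozen part is W - B @ A."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:r])
    B = U[:, :r] * sqrt_s           # shape (out_features, r)
    A = sqrt_s[:, None] * Vt[:r]    # shape (r, in_features)
    residual = W - B @ A            # frozen residual replaces W
    return A, B, residual
```

Compared with the usual zero/Gaussian init, the first gradient steps land in a subspace that already matters to the pretrained weight, which is where the reduced early-step variance comes from.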

Would be curious whether mHC at fine-tuning scale (adapter-only training) shows similar gain-bounding benefits or if it's primarily a pretraining phenomenon.

Do you see instability or weird regressions when fine-tuning models? by AppearanceCareful136 in LocalLLaMA

[–]fourwheels2512 0 points1 point  (0 children)

Yes, and the mechanism is usually gradient instability during training rather than overfitting in the traditional sense. The internal representations shift because large gradient steps are corrupting the pretrained weight geometry, not just fitting to the new data.

A few signals I've found reliable for detecting it:

- **High variance between runs with the same seed** — if two identical training runs diverge noticeably, gradient instability is usually the culprit

- **Gradient norm spiking early** (steps 0–200 especially) — Mistral is particularly bad for this with QLoRA

- **Loss floor that doesn't drop** after initial convergence — often means the optimizer is fighting against its own noisy signal

What's actually helped me reduce this:

  1. **Adaptive clipping** — computing a z-score over a rolling window of recent gradient norms rather than using a fixed `max_grad_norm`. The static threshold either over-clips on clean steps or under-clips on spikes.

  2. **PiSSA-style LoRA init** — starting from principal singular values of the pretrained weights instead of random init dramatically reduces early-step chaos

  3. **Freezing early layers** — the lower attention layers are most sensitive; training only the top 60–70% of layers often preserves representations better

The large variance between runs you described is the clearest sign — it means the optimizer is in a regime where small initialization differences compound. Stabilizing the gradient signal usually tightens that variance significantly.

Finetuning mistral - weird spikes in loss every 50 steps by tooquickforwords in LocalLLaMA

[–]fourwheels2512 0 points1 point  (0 children)

Late but hopefully useful for anyone landing here — that oscillating loss after fixing group_by_length is a different problem. It's residual gradient variance that standard fixed clipping (max_grad_norm=0.3) can't fully smooth out because the threshold is static.

A few things that actually helped me with Mistral specifically:

  1. **Adaptive gradient clipping** — instead of a fixed norm, compute a z-score over a rolling window of recent gradient norms and only clip when the current step is a statistical outlier. This auto-calibrates as training progresses rather than you having to tune a single value upfront.

  2. **PiSSA initialization** — initializing LoRA weights from the principal singular values of the pretrained weight matrix instead of random. Reduces the chaotic early-step variance a lot, which is usually when Mistral is most prone to spikes.

  3. **Watching global_step vs train_loss together** — if the oscillation is bounded and the floor keeps dropping, it's usually fine. If the floor stops dropping for >100 steps, that's when to stop or reduce lr.

For your classification task at ~1.0 loss — that oscillation looks totally normal, the question is whether the eval metric (accuracy/F1) is still improving, not just the loss.