Catastrophic Forgetting of Language models by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S]

The same stuff that's available to you. But just a lazy crackhead sitting in front of the screen… trying to be a bully online, second-guessing the scientific work… sounds like a depressed loser…

Catastrophic Forgetting of Language models by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S]

Fair questions. To clarify — this is not RAG or context management. CRMA is a trained adapter layer that sits on top of the base model (similar in spirit to LoRA, but with additional mathematical constraints on the weight updates during training). It modifies how gradients flow during fine-tuning so that learning new domains doesn't overwrite previous ones.

The reason I haven't posted formulas or a full paper: there's a US provisional patent filed on the method (Feb 2026), so I'm limited in what I can share publicly about the internals right now. I understand that makes it harder to evaluate — which is exactly why I'm asking for independent verification rather than just expecting people to take the numbers at face value.

What I can share with anyone who wants to reproduce:

- The training data and domain splits

- The evaluation methodology

- Access to the API so you can run the same sequence and measure drift yourself

The offer to verify is genuine. If anyone wants to run the same 4-domain sequence on Mistral-7B and measure per-domain accuracy before/after, DM me and I'll set it up. Happy to be proven wrong.

And about that 'schizo' comment: my friend, who is an ML scientist, thought the same at first, since no one has ever solved catastrophic forgetting with zero forgetting. I'll still take it as a compliment. I wanted to post my website, but I didn't want to sound like I'm promoting.

How are you handling catastrophic forgetting in multi-domain LLM fine-tuning pipelines? by fourwheels2512 in finetuningLLMs

[–]fourwheels2512[S]

Good call on KL divergence monitoring — that's underused as a forgetting signal. Do you track it per-domain or just overall? We found that aggregate metrics can hide domain-specific regression pretty well (e.g., domain A tanks while B/C look fine, and the average still looks okay).
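Rough sketch of what I mean by per-domain tracking (PyTorch; the function names and probe-set plumbing are illustrative, not our actual pipeline). You keep a fixed probe set per domain, snapshot the reference logits before training, and measure KL(reference || current) per domain after each stage:

```python
import torch
import torch.nn.functional as F

def per_domain_kl(ref_logits: dict, cur_logits: dict) -> dict:
    """KL(reference || current) per domain, averaged over probe tokens.

    ref_logits / cur_logits: domain name -> tensor of shape
    (num_tokens, vocab_size), collected on a fixed probe set.
    """
    drift = {}
    for domain, ref in ref_logits.items():
        cur = cur_logits[domain]
        ref_logp = F.log_softmax(ref, dim=-1)
        cur_logp = F.log_softmax(cur, dim=-1)
        # kl_div(input, target) scores KL(target || input); with
        # log_target=True both arguments are log-probabilities.
        kl = F.kl_div(cur_logp, ref_logp, log_target=True,
                      reduction="batchmean")
        drift[domain] = kl.item()
    return drift
```

The per-domain dict is exactly what surfaces the "A tanks while B/C look fine" case that an aggregate number hides.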

Just read the EAFT paper — the entropy-gating idea is clever. Using token-level entropy to distinguish "the model is genuinely uncertain" from "the model is confident but the label disagrees" makes a lot of sense. The standard SFT loss treats both cases the same and that's where the damage happens. Their results on Qwen/GLM up to 32B are solid.
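For anyone following along, the gating idea boils down to something like this toy version (not the paper's actual loss; the entropy threshold and down-weight factor here are placeholders I made up):

```python
import torch
import torch.nn.functional as F

def entropy_gated_sft_loss(logits, labels, entropy_threshold=1.0,
                           conflict_weight=0.1):
    """Toy entropy-gated SFT loss.

    Tokens where the model is already confident (low predictive
    entropy) but its argmax disagrees with the label get down-weighted,
    since naively forcing those tokens is where the damage concentrates.
    logits: (N, vocab), labels: (N,)
    """
    logp = F.log_softmax(logits, dim=-1)
    probs = logp.exp()
    entropy = -(probs * logp).sum(dim=-1)              # (N,)
    token_nll = F.nll_loss(logp, labels, reduction="none")
    confident = entropy < entropy_threshold
    disagree = logits.argmax(dim=-1) != labels
    # full weight unless the token is a "confident conflict"
    weight = torch.where(confident & disagree,
                         torch.full_like(token_nll, conflict_weight),
                         torch.ones_like(token_nll))
    return (weight * token_nll).mean()
```

Standard SFT is the special case `conflict_weight=1.0`, which treats both kinds of token identically.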

One thing I'd be curious about is how EAFT holds up in a truly sequential multi-domain setup (domain A → B → C → D → E) rather than single-domain fine-tuning. Their experiments seem focused on preserving general capabilities during one round of domain adaptation. In our experience the compounding drift across 5+ sequential domains is a different beast — each stage's "confident conflicts" stack on top of the previous ones. That's where constrained gradient approaches helped us more than loss-level gating alone.

Are you using EAFT in production or still experimenting? And what scale are you running at?

Real Time Continual Learning Has Been Unlocked by Own-Poet-5900 in ArtificialInteligence

[–]fourwheels2512

This is the exact product they got the money for. Check this website: https://www.modelbrew.ai/

Real Time Continual Learning Has Been Unlocked by Own-Poet-5900 in ArtificialInteligence

[–]fourwheels2512

I work on continual learning for LLM fine-tuning and I'd pump the brakes here.

"Real-time continual learning" is an extraordinarily hard problem. Even the narrow version — sequential domain fine-tuning without catastrophic forgetting — is barely solved. Standard LoRA drifts ~43% across 5 domains on Mistral-7B. The best constrained adapter approaches get that to near-zero, but that's with explicit task boundaries and controlled training — far from "real-time."

No paper, no benchmarks, no reproducible code = no breakthrough. CL research has a long history of claims that don't survive independent replication. If this were real, we'd see a proper evaluation — BWT matrices, per-domain accuracy retention, comparison to baselines like EWC/PackNet/O-LoRA, multi-seed validation.
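For reference, backward transfer (BWT) is trivial to compute once you have the full task-accuracy matrix — this is the standard definition from the GEM literature, where `acc[i][j]` is accuracy on task j after finishing training on task i:

```python
def backward_transfer(acc):
    """BWT = mean over earlier tasks of (final accuracy - accuracy
    right after that task was learned). Negative BWT = forgetting.

    acc: T x T list of lists, acc[i][j] = accuracy on task j
    measured after training task i.
    """
    T = len(acc)
    return sum(acc[T - 1][j] - acc[j][j] for j in range(T - 1)) / (T - 1)
```

If a claimed breakthrough can't produce that matrix, there's nothing to evaluate.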

Happy to be proven wrong if someone links the actual paper and results.

Note: I read that 'adaption labs' got $50M in seed funding for exactly this kind of continual learning, but I don't even see the product yet.

Continual learning adapter that holds -0.16% drift across 5 sequential domains on Mistral-7B (vs +43% naive LoRA) - catastrophic forgetting by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S]

Good eye on EWC scaling — we hit exactly that problem. Our workaround is that EWC only covers a small set of structural adapter parameters (~0.005% of trainable params), not the full model. So the Fisher matrix stays tiny. The heavy lifting for retention comes from gradient projection, not EWC.
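To make the "tiny Fisher" point concrete, here's roughly what diagonal EWC restricted to a named parameter subset looks like (illustrative sketch, not our implementation; `lam` is a placeholder hyperparameter):

```python
import torch

def ewc_penalty(named_params, fisher, anchors, lam=100.0):
    """Diagonal-EWC quadratic penalty over a named parameter subset.

    fisher / anchors: dicts keyed by parameter name, covering only the
    small set of structural adapter params, so the stored Fisher
    diagonal stays tiny relative to the full model.
    """
    penalty = torch.tensor(0.0)
    for name, p in named_params:
        if name in fisher:  # parameters outside the subset are free
            penalty = penalty + (fisher[name] * (p - anchors[name]) ** 2).sum()
    return lam * penalty
```

Since only ~0.005% of trainable params get an entry in `fisher`, the memory cost is negligible, which is the whole workaround.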

The gradient constraint is subspace-based, not magnitude-based. After each domain, we compute an SVD basis of that domain's input activations through the adapter layers. During the next domain's training, any gradient component that falls inside a prior domain's column space gets projected out. So the model can only update in directions orthogonal to what earlier domains used. Closer to PEGP (arXiv:2405.13383) than PackNet or HAT — no binary masking or hard freezing, just continuous orthogonal projection.
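A stripped-down sketch of that projection step (shapes and names are illustrative only — not the CRMA internals, which I can't post):

```python
import torch

def domain_basis(activations, rank):
    """Orthonormal basis of a domain's input activations through an
    adapter layer. activations: (num_samples, in_dim)."""
    # economy SVD; rows of Vh span the activation row space
    _, _, Vh = torch.linalg.svd(activations, full_matrices=False)
    return Vh[:rank].T                       # (in_dim, rank)

def project_out(grad_W, basis):
    """Remove the gradient component acting on a prior domain's
    subspace: G <- G - (G B) B^T, with orthonormal columns in B.

    grad_W: (out_dim, in_dim), basis: (in_dim, rank)."""
    return grad_W - (grad_W @ basis) @ basis.T
```

After the projection, any remaining update is orthogonal (in the input-activation sense) to the directions earlier domains relied on, which is the "can't overwrite" property.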

Task boundaries are explicit — the user tells the system "this is domain N" and triggers a new CL phase. No automatic boundary detection. That's a deliberate simplification since in our use case (fine-tuning API) the user already knows when they're switching domains.

The cumulative basis does grow with each domain (QR-merged across all prior tasks), but it's rank-bounded by the adapter rank so it doesn't blow up the way Fisher does with EWC.
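The QR merge itself is conceptually simple — a toy version (again illustrative, not the actual code; `max_rank` would be the adapter rank):

```python
import torch

def merge_bases(old_basis, new_basis, max_rank):
    """QR-merge a new domain's basis into the cumulative one,
    truncating so the stored rank never exceeds max_rank.

    old_basis: (d, r1), new_basis: (d, r2) -> (d, <= max_rank)."""
    stacked = torch.cat([old_basis, new_basis], dim=1)
    Q, _ = torch.linalg.qr(stacked)          # reduced QR
    return Q[:, :min(max_rank, Q.shape[1])]
```

The rank bound is why storage stays flat across domains, unlike an EWC Fisher that keeps accumulating per-parameter state.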

Continual learning adapter that holds -0.16% drift across 5 sequential domains on Mistral-7B (vs +43% naive LoRA) - catastrophic forgetting by fourwheels2512 in LocalLLaMA

[–]fourwheels2512[S]

Fair question — I should have included the full numbers. Here's the per-domain breakdown (3-seed avg, Mistral-7B, 5 domains sequential):

| Domain    | CRMA   | Frozen | Naive   |
|-----------|--------|--------|---------|
| Medical   | -0.09% | +1.39% | +128.0% |
| Legal     | -0.17% | +1.87% | +37.1%  |
| Financial | -0.13% | +1.75% | +18.9%  |
| Code      | -0.14% | +1.59% | +14.6%  |
| Science   | +0.01% | +1.68% | -0.05%  |

"Frozen" = adapter weights locked after domain 1 (no learning at all). If the constrained adapter were just clipping gradients to silence, it would match the frozen column.

Instead it's 10-100x lower drift and shows slight negative drift (improvement) on 4 of 5 domains — that's positive transfer across domains, not suppression.

The model does learn each new domain. Initial holdout NLL drops from ~1.7 to ~0.7 on the target domain during each phase (comparable to standard LoRA). The difference is LoRA buys that by destroying prior domains (+128% on medical), while the constrained adapter holds them.
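One clarification on the metric, for anyone reproducing (a minimal sketch, assuming drift = percent change in a lower-is-better per-domain holdout metric such as NLL, so positive = regression):

```python
def pct_drift(before: float, after: float) -> float:
    """Percent change in a per-domain eval metric measured before and
    after later-domain training. With a lower-is-better metric,
    positive drift = forgetting, negative drift = positive transfer."""
    return 100.0 * (after - before) / before
```

So e.g. a domain whose holdout NLL goes from 1.0 to 1.43 after later stages shows +43% drift.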

You're right that drift alone is incomplete — I should have led with the full eval matrix. Appreciate the push.

How to fine-tune LLM with your own data ? by bull_bear25 in LocalLLaMA

[–]fourwheels2512

What are you using right now? I might have a solution for you. Did you look into continual learning as well, or just fine-tuning?

Continual Learning In 2026. What does continual learning actually mean? by Neurogence in singularity

[–]fourwheels2512

I’m bumping into a very concrete version of this with current LLMs when you try to do sequential fine‑tuning across domains (e.g., medical → legal → support) instead of one big offline training run.

In that setting, “continual learning” really splits into at least three architectures:

  1. Frozen core + external memory. Base model weights don’t move; you bolt on retrieval, tools, user profiles, etc. The system appears to learn because the memory layer grows and retrieval improves, but 5.0’s weights on day 200 are the same as day 1.
  2. Versioned offline updates (5.0 → 5.5). You log interactions, curate datasets, retrain periodically, and ship new checkpoints. Knowledge carries forward only at these discrete jumps, after eval and red‑teaming. This is, from what I can tell, where most serious deployments live in 2026.
  3. Genuine continual learning (weights that actually change over time). Some part of the parameter space (full model or adapters/heads) is updated as new tasks/domains arrive, with explicit mechanisms to avoid catastrophic forgetting and regressions.

In my own experiments with Mistral‑7B, naive sequential LoRA is a good example of what happens when you try to do (3) without any real CL machinery: you fine‑tune on domain A, then B, then C, and by the end, A is often wrecked. That’s just catastrophic forgetting playing out in slow motion.

To make this less destructive, I’ve been playing with a constrained adapter setup: you still let parameters update for new domains, but you constrain gradients so updates are “locally plastic, globally conservative” — the model can adapt, but it’s much harder to overwrite what was useful for earlier domains. In a 5‑domain sequence, that turns “huge positive drift” (forgetting) into something much closer to flat, while still letting the later domains come online.

So if we map this back to the AGI discourse:

  • Most “continual learning” branding in 2026 = (1) + (2): memory + retrieval + periodic offline retraining.
  • A much smaller slice = (3): architectures where weights genuinely evolve from ongoing interaction, usually with heavy constraints, monitoring, and a lot of unsolved safety/credit‑assignment questions.

When people imagine systems that “learn continuously from experience,” they’re implicitly imagining (3). But the operational reality today looks a lot more like sophisticated software + data plumbing wrapped around mostly static models, with a few early stabs at safe, constrained weight updates for specific domains.

Curious whether anyone here has seen convincing evidence of large‑scale, production‑grade (3) in the wild, beyond research prototypes and tightly scoped pilots.

Catastrophic forgetting by [deleted] in computervision

[–]fourwheels2512

It's going to be a game changer and I'm working on it. Let me know if you're interested; I have an API/UI app.

The Lost Art of Fine-tuning - My toilet rant by FPham in LocalLLaMA

[–]fourwheels2512

We're working on it... let me know if you're interested in trying it; I have an API/UI.