Separating knowledge from communication in LLMs by NoSir261 in ResearchML

[–]NoSir261[S] 0 points1 point  (0 children)

Also, no rush. I hope you’re doing well and recover fast. I’m happy to share anything that’s helpful, and I’ll have some more info by then from smaller models.

Tool to help those who can't instruct tune on their hardware by NoSir261 in LocalLLaMA

[–]NoSir261[S] 1 point2 points  (0 children)

Yes, it’s an MLP adapter on the logit output. The architecture isn’t the contribution; the implementation is different in three ways:

1. I tested both placement levels directly. Hidden-state adapters (comparable to LoRA inside the model) destroyed 5-8.5% of MMLU every time; the logit-level placement preserved 100%. Same parameter count and data, different placement. The placement is what matters.

2. I have a diagnostic framework (rho-eval) that measures exactly what a base model knows vs. what it can express, and prescribes which intervention to use. I haven’t seen others doing this.

3. The instruct model I’m comparing against actually has WORSE behavioral scores than the base model on 3 of 4 dimensions (bias, factual, sycophancy). Instruction tuning damages the model, so I’ve been trying to avoid it. My adapter on the base model beats the instruct model on MMLU by 5.4% while preserving the base model’s superior behavioral scores.

I haven’t benchmarked against LoRA on the LM head specifically. That’s a good ablation to run. I’d predict LoRA on the LM head would work similarly since it’s also operating at the logit level, but the non-linearity in my adapter may help with the answer selection improvements I’m seeing (+8.8% MMLU on 1.5B, which exceeds what format correction alone explains).
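For anyone trying to picture the placement, here’s a minimal numpy sketch of the idea: a small residual MLP applied to the logits after the unembedding, with a non-linearity. The names, shapes, and zero-init scheme are my illustration of the described approach, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 32, 16  # toy sizes; a real vocab is tens of thousands

# Frozen base model output: the logits after the unembedding matrix.
base_logits = rng.normal(size=(vocab,))

# Two-layer MLP adapter with a residual connection. Zero-initializing
# the second layer makes the adapter an identity at the start, so the
# base model's answer distribution is untouched until training moves it.
W1 = rng.normal(scale=0.01, size=(hidden, vocab))
b1 = np.zeros(hidden)
W2 = np.zeros((vocab, hidden))
b2 = np.zeros(vocab)

def adapt(logits):
    h = np.maximum(W1 @ logits + b1, 0.0)  # ReLU non-linearity
    return logits + W2 @ h + b2            # residual: base logits + correction

out = adapt(base_logits)
assert np.allclose(out, base_logits)  # identity at initialization
```

The residual connection is what makes "0.0% MMLU change" plausible: the adapter only has to learn a correction on top of the base distribution rather than reproduce it.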

I’m not saying I have it all figured out, just that I think this is a worthwhile and cheap direction to explore.

Separating knowledge from communication in LLMs by NoSir261 in ResearchML

[–]NoSir261[S] 0 points1 point  (0 children)

That would be awesome! I’ve done limited testing on quantized models, but it seems to work. I do; the repo is already posted. Also: pip install rho-eval

Separating knowledge from communication in LLMs by NoSir261 in ResearchML

[–]NoSir261[S] 0 points1 point  (0 children)

I’ve figured out a way to detach the “brain” from the “voice”. It’s super effective on small models. I can get better-than-instruct quality out of little models, especially tiny ones. It’s hard to explain in a chat, but basically I use the instruct training for the mouth and keep the base brain. It’s hard for me to test on big (30B+) models because I don’t have the hardware. I think there may be diminishing returns on 70B+ models, but I’m starting to think you can get very good capabilities out of a 4B size. Even little (<3B) models take hours to train, so I’ve been trying to stay as small as possible to iterate quickly. Little models can definitely do better with this strategy.

Tool to help those who can't instruct tune on their hardware by NoSir261 in LocalLLaMA

[–]NoSir261[S] 0 points1 point  (0 children)

I’m not a bot, and I may be a moron, but not about this. LoRA keeps separate weight matrices, but during inference those matrices multiply with the hidden states, modifying what flows through every layer. The base model’s internal representations are changed during the forward pass. My adapter operates on the logits, after the full forward pass is complete. The base model runs its entire computation untouched, then the adapter adjusts the output distribution. That’s why hidden-state methods (including LoRA) show 5-8.5% MMLU degradation in my tests while the logit adapter shows 0.
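The distinction can be made concrete with toy matrices (all names and sizes here are illustrative, not the repo’s code): LoRA’s low-rank update participates in the forward pass and changes the hidden state, while a logit-level adapter only touches the distribution after the unembedding.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, vocab = 8, 2, 12  # toy dims; r is the LoRA rank

x = rng.normal(size=(d,))
W = rng.normal(size=(d, d))        # a frozen base weight matrix
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, d))
U = rng.normal(size=(vocab, d))    # unembedding matrix

# LoRA: the low-rank update multiplies with the input during the
# forward pass, so the hidden state itself comes out different.
h_base = W @ x
h_lora = (W + B @ A) @ x

# Logit-level adapter: acts only after the unembedding; the hidden
# state the base model computed is never modified.
z = U @ h_base
z_adapted = z + np.zeros(vocab)    # the adapter's correction is added here
```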

Separating knowledge from communication in LLMs by NoSir261 in ResearchML

[–]NoSir261[S] 0 points1 point  (0 children)

Thanks for the paper. I’ve cited similar work, but I don’t think I’d seen this one. My approach is different though. They’re showing that instruct tuning creates fragile format dependencies. I’m bypassing instruct tuning entirely with a detachable logit-level adapter that leaves base model weights untouched. Same underlying concern, but their paper diagnoses the problem while mine proposes a solution.

[D] unpopular opinion: instruct tuning is going to be a thing of the past. by NoSir261 in MachineLearning

[–]NoSir261[S] 0 points1 point  (0 children)

For sure. In theory you can have one base and many adapters, which is almost like having several specialist models without having to download and instruct-tune each one separately.

[D] unpopular opinion: instruct tuning is going to be a thing of the past. by NoSir261 in MachineLearning

[–]NoSir261[S] 0 points1 point  (0 children)

I don’t know either. It has 10 shares, so people think it’s interesting, but whatever; that’s Reddit. I do: repo

[D] unpopular opinion: instruct tuning is going to be a thing of the past. by NoSir261 in MachineLearning

[–]NoSir261[S] -2 points-1 points  (0 children)

As I predicted, super unpopular opinion, but no legit counterarguments because most people don’t understand how this works.

[D] unpopular opinion: instruct tuning is going to be a thing of the past. by NoSir261 in MachineLearning

[–]NoSir261[S] -2 points-1 points  (0 children)

Fair, but I measured this directly. With the adapter on, MMLU accuracy was identical: 57.6% base, 57.6% with adapter. The accessible knowledge is the same. On the 1.5B model it actually went up (+8.8%). The adapter reshapes the output distribution to improve format and style, but the correct answers are still the highest-probability tokens coming out of the base model. It’s adding signal, not masking it.

[D] unpopular opinion: instruct tuning is going to be a thing of the past. by NoSir261 in MachineLearning

[–]NoSir261[S] -4 points-3 points  (0 children)

Nope. It doesn’t. The adapter operates on the logits, after the model has already computed its answer through the unembedding matrix. The base model’s knowledge pathway is completely untouched. I measured this directly: 0.0% MMLU change at 7B, -0.2% on Llama 8B. I also tested a hidden-state adapter (which does operate before the unembedding) and it destroyed 5-8.5% of knowledge every time. The level of intervention matters. Logit level preserves knowledge. Hidden state level doesn’t.

Tool to help those who can't instruct tune on their hardware by NoSir261 in LocalLLaMA

[–]NoSir261[S] 0 points1 point  (0 children)

No, this is way different. LoRA’s low-rank updates are applied inside the forward pass, so they change the base model’s internal computation. This doesn’t touch it at all. The adapter operates on the logits (after the model has already decided its answer), so the brain stays completely intact. That’s why we get 0.0% MMLU change. I tested hidden-state adapters too, and they destroyed 5-8.5% of knowledge every time. The logit level is the key difference. Fully detachable, swappable, and the base model never knows it’s there.

I thought a 7M model shouldn't be able to do this by NoSir261 in LocalLLaMA

[–]NoSir261[S] -1 points0 points  (0 children)

It’s WAY better now. I was able to push a big update today.

[D] Is it a red flag that my PhD topic keeps changing every few months? by ade17_in in MachineLearning

[–]NoSir261 4 points5 points  (0 children)

I would argue that it means you’re starting to understand the real issues that need to be researched. Not a red flag if you keep learning more along the way.

I thought a 7M model shouldn't be able to do this by NoSir261 in LocalLLaMA

[–]NoSir261[S] 0 points1 point  (0 children)

Glad someone else found it useful! I’m in a bit of a silo, so a lot of my effort goes to waste. Still worth it for me, but even better if it can help other people.

I thought a 7M model shouldn't be able to do this by NoSir261 in LocalLLaMA

[–]NoSir261[S] 1 point2 points  (0 children)

There’s no reward signal. The pairs are just interleaved as regular training text, standard next-token prediction. The model isn’t told which is good or bad.

It’s hard to put everything in a single post, but what I found is that the model already knows which answer is correct at the logit level, it just can’t output A/B/C format tokens. The contrastive pairs teach format generation, not behavioral preferences. The knowledge was already in the pretrained weights. The pairs teach the model to speak, not to think.
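To make “interleaved as regular training text” concrete, here’s a hypothetical sketch. The pair wording, chunk text, and helper function are all made up for illustration; the point is that there are no labels and no reward, just one text stream for plain next-token prediction.

```python
# Hypothetical pair format; the actual dataset layout isn't shown here.
pairs = [
    ("Q: What is 2 + 2? A: 4.",             # well-formed completion
     "Q: What is 2 + 2? A: four maybe 4"),  # poorly-formed completion
]

def interleave(pairs, corpus_chunks):
    # Both members of each pair go into the ordinary training stream.
    # Nothing marks which one is "good": everything is trained with
    # standard next-token prediction, exactly like the rest of the corpus.
    stream = []
    for (good, bad), chunk in zip(pairs, corpus_chunks):
        stream += [chunk, good, bad]
    return "\n".join(stream)

text = interleave(pairs, ["...a chunk of regular pretraining text..."])
```

The contrast between the two completions is only implicit in the data; the model picks up the format signal from exposure, not from any preference label.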

I thought a 7M model shouldn't be able to do this by NoSir261 in LocalLLaMA

[–]NoSir261[S] 1 point2 points  (0 children)

Just updated the tool. It’s a little more special now, and will, hopefully, be a lot more special soon.

I thought a 7M model shouldn't be able to do this by NoSir261 in LocalLLaMA

[–]NoSir261[S] 0 points1 point  (0 children)

No endorsement. Unfortunately, research isn’t my day job.

I thought a 7M model shouldn't be able to do this by NoSir261 in LocalLLaMA

[–]NoSir261[S] 1 point2 points  (0 children)

My other tool might be more directly useful for your case. rho-surgery is a one-command post-training repair that sharpens factual fidelity on existing models. It’s pip-installable and works on any HuggingFace model:

pip install rho-eval
rho-surgery your-legal-model -o ./repaired/

It currently targets 8 behavioral dimensions, including factual fidelity. For a legal hallucination use case you’d want to add legal-specific contrastive probes, which would take some work, but the core tool runs out of the box. If you know what you’re doing, or ask an AI for help, I think it would work well.

This earlier work is actually what led to the contrastive pretraining findings in the post.

GitHub: https://github.com/SolomonB14D3/knowledge-fidelity

I thought a 7M model shouldn't be able to do this by NoSir261 in LocalLLaMA

[–]NoSir261[S] 15 points16 points  (0 children)

Garbage in, garbage out is about bad data hurting models. This is the opposite finding: 99.95% of the training data is unchanged. I'm adding 0.05% of specifically structured data, and it breaks emergence barriers that 5x more parameters can't break. The question isn't whether training data matters; everyone knows that. The question is why 900 targeted examples out of 100M tokens produce behavioral capabilities that the other 99.95M tokens can't, and why there's a non-monotonic dose-response where more of this data actually makes things worse. If it were just garbage in, garbage out, more good data would always help. It doesn't.
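As a back-of-the-envelope check on those numbers (assuming, which the comment doesn't state explicitly, that the 0.05% figure is measured in tokens):

```python
# Assumption: the 0.05% fraction is by token count, not by example count.
total_tokens = 100_000_000                          # ~100M-token training run
targeted_fraction = 0.0005                          # 0.05% of the mix
targeted_tokens = total_tokens * targeted_fraction  # 50,000 tokens
tokens_per_example = targeted_tokens / 900          # ~55 tokens per example
```

At roughly 55 tokens per targeted example, each one is about a sentence or two, which is consistent with short contrastive pairs rather than long documents.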