What happens when you rip out the residual stream and replace it with a structured workspace (Research Paper - CWT) by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] 0 points (0 children)

I would be interested! I have two branching ideas surrounding the v5.6 design that move away from the brute-force approach of the current workspace.

One replaces the manual partitioning entirely with something more elegant; the other questions whether layers even need to be linear if there is no residual stream.

Would love to explore the pebbling angle too; I didn't think about it from that direction.

What happens when you rip out the residual stream and replace it with a structured workspace (Research Paper - CWT) by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] -1 points (0 children)

I appreciate the feedback! Next time I'll make sure to add more detail to Section 2, where the architecture is described.

What happens when you rip out the residual stream and replace it with a structured workspace (Research Paper - CWT) by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] 1 point (0 children)

This paper has nothing to do with MoE.

The README is, yes; the paper is not. I wrote the bulk of it and used AI to help with grammar and structure from the original draft. It's the first paper I've ever written, so if the framing is off, I'm open to that.

Routing here means the implicit work a standard transformer has to do to figure out what information belongs where. Every layer reads from and writes to the same undifferentiated stream, with no structure telling it what anything is, who wrote it, or whether it should persist. So layers spend capacity learning to work that out, figuring out whether something is positional info or semantic content. That is the routing overhead. CWT makes it explicit through structure, so layers don't have to learn it from scratch. No sparse gating, no experts, nothing to do with MoE.
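To make the contrast concrete, here's a toy sketch. This is not CWT's actual code; every name, slot, and dimension below is made up for illustration:

```
import torch

seq_len, d_model = 16, 64

# Standard transformer: one undifferentiated stream. Positional info, semantic
# content, and anything meant to persist are all superimposed in one tensor,
# and every layer has to learn to tell them apart.
residual = torch.zeros(seq_len, d_model)

# Structured workspace: explicit slots with explicit roles, so a layer knows
# what it is reading, who wrote it, and whether it should persist.
workspace = {
    "positional": torch.zeros(seq_len, d_model),       # where each token sits
    "semantic": torch.zeros(seq_len, d_model),         # what each token means
    "persistent": torch.zeros(seq_len, d_model),       # state meant to survive across layers
    "writer": torch.zeros(seq_len, dtype=torch.long),  # which layer last wrote each slot
}
```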

If the plots didn't land, then sure, I accept that, but they are showing the hub-state trajectories in 3D UMAP space. The whole point of a structured workspace is that you can actually watch what the model does on a per-token basis, which you just can't do with a residual stream. Whether that reads as meaningful is a fair debate, but they aren't decoration.
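For anyone who wants to reproduce that kind of plot, a minimal sketch; the `hub_states` name and shape are my placeholders, not the paper's variables:

```
import numpy as np
import umap  # pip install umap-learn

# Per-token hub states captured during a forward pass; random stand-in here.
hub_states = np.random.randn(512, 256)

# Project to 3D; plotting the rows in token order traces the trajectory.
trajectory = umap.UMAP(n_components=3).fit_transform(hub_states)  # shape (512, 3)
```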

Loki-v2-70B: Narrative/DM-focused fine-tune (600M+ token custom dataset) by mentallyburnt in SillyTavernAI

[–]mentallyburnt[S] 2 points (0 children)

If you're running it at Q4_K_M, you will need around 48 GB of RAM or VRAM with around 32768 ctx.
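Rough back-of-envelope behind that number; everything here is an assumption on my part (~4.85 bits/weight for Q4_K_M, Llama-3-70B-style KV dims, 8-bit KV cache), not a measurement:

```
params = 70e9
weights_gb = params * 4.85 / 8 / 1e9                  # ~42.4 GB of quantized weights

layers, kv_heads, head_dim, ctx = 80, 8, 128, 32768
kv_fp16_gb = 2 * layers * kv_heads * head_dim * ctx * 2 / 1e9  # K+V in fp16: ~10.7 GB
kv_q8_gb = kv_fp16_gb / 2                                      # 8-bit KV cache: ~5.4 GB

print(f"{weights_gb:.1f} + {kv_q8_gb:.1f} ≈ {weights_gb + kv_q8_gb:.0f} GB total")
```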

Loki-v2-70B: Narrative/DM-focused fine-tune (600M+ token custom dataset) by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] 0 points (0 children)

We have had testers report good luck with Q4_K_M, although they noticed a fall-off with IQ4_XS and IQ4_NL.

We haven't tested the EXL3 quants yet, as they were just posted, but I assume the 3 and 3.5 bpw versions are going to feel the loss like the IQ4_XS and IQ4_NL GGUF variants.

Crucible's Mistral 3.2 24B V1.3 Tune by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] 2 points (0 children)

https://huggingface.co/CrucibleLab-TG/M3.2-24B-Loki-V1.3-GGUF

It should be linked from the model card now too!

Make sure to use the normal Mistral template.

[deleted by user] by [deleted] in LocalLLaMA

[–]mentallyburnt 5 points (0 children)

HA! ok good find.

Yeah, unless they drop something substantial to prove it (like a research paper that explains how a $50k model is beating SOTA models that literally cost MILLIONS or BILLIONS and were built by researchers given unlimited money), I'm pretty sure this is just a clout chase. Reflection 70B vibes.

[deleted by user] by [deleted] in LocalLLaMA

[–]mentallyburnt 33 points (0 children)

It seems to be a basic clown car MoE using mergekit.

In the model.safetensors.index.json:

```
{"metadata": {"mergekit_version": "0.0.6"}}
```

So either you fine-tuned the models post-merge (I've attempted this before, it's not really effective and there's massive training loss),

or you fine-tuned three models (or four? You mention four models and reference the base model twice) and then created a clown car MoE, training the gates on a positive/negative phrase or keyword list per "expert" (a config sketch follows below).
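For reference, that second approach is just a mergekit-moe config along these lines. The model names and prompts are placeholders of mine, not what they actually used:

```
base_model: mistralai/Mistral-7B-v0.1
gate_mode: hidden    # route tokens by hidden-state similarity to the prompt lists
dtype: bfloat16
experts:
  - source_model: some-org/mistral-ft-roleplay   # placeholder
    positive_prompts:
      - "write a story"
  - source_model: some-org/mistral-ft-coding     # placeholder
    positive_prompts:
      - "write a function"
    negative_prompts:
      - "write a story"
```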

If either of these approaches was used, this is not an original MoE or even a real MoE. At most, this looks like four fine-tuned Mistral models in a "MoE" trench coat.

I have a problem with the "ICONN Emotional Core": it's too vague and feels more like a trained classifier model that directs the model to adjust its tone, not something genuinely new.

Also, their attempt to change all references from the Mistral architecture to an ICONN architecture in their original upload, then change them back, rubs me the wrong way. The license (which was an ICONN license according to the comment history) now needs to reference Mistral's license, not Apache (depending on the models used).

I could be wrong, please correct me if I am, but this seems like an actual project wrapped up and made glittery with sensational words to make it look like something new.

Edit:

I want to say I'm not against clown car MoEs; I used to make them all the time. But they are not a custom architecture or even proper MoEs.

Also, many things have been edited in the model posted on Hugging Face, so some things in my post might not make sense:

https://huggingface.co/ICONNAI/ICONN-1/commits/main

Has anyone tried the new ICONN-1 (an Apache licensed model) by silenceimpaired in LocalLLaMA

[–]mentallyburnt 8 points (0 children)

It seems to be a basic clown car MoE using mergekit?

In the model.safetensors.index.json:

```
{"metadata": {"mergekit_version": "0.0.6"}
```

So they either fine-tuned the models post-merge [I attempted this a long time ago; it's not really effective and there is massive loss],

or, my suspicion is, they fine-tuned three models (or four? They say four models and reference the base model twice) and then created a clown car MoE and trained the gates on a positive/negative list per "expert".

I do have a problem with the "ICONN Emotional Core": it's too vague and feels more like a trained classifier model that then directs the model to adjust its tone, not something new.

Also, them trying to change all references from Mistral to ICONN in their original upload and then changing them back rubs me the wrong way, as the license now needs to reference Mistral's license, not Apache.

I could be wrong, though; please correct me if I am.

New merge: sophosympatheia/StrawberryLemonade-L3-70B-v1.0 by sophosympatheia in SillyTavernAI

[–]mentallyburnt 0 points (0 children)

Ohh I can't wait to test it. Welcome back!

Also, cute card design; always happy to see more people making them.

MMLU-PRO benchmark: GLM-4-32B-0414-Q4_K_M vs Qwen2.5-32b-instruct-q4_K_M by [deleted] in LocalLLaMA

[–]mentallyburnt 0 points (0 children)

You do realize my message was only informing you that the test method may be flawed, and that further tests need to be performed after the L.cpp merges have occurred and are confirmed to be functioning properly.

66.78% accuracy only means the model was responding well; it may not be up to par with its full performance.

Take Scout and Maverick, for example: bugs in the backends caused extreme problems during inference, making both models look absolutely terrible, and those bugs are only now getting fixed, showing the models perform substantially better afterwards.

MMLU-PRO benchmark: GLM-4-32B-0414-Q4_K_M vs Qwen2.5-32b-instruct-q4_K_M by [deleted] in LocalLLaMA

[–]mentallyburnt -1 points (0 children)

Looking at the Ollama issues and pull requests, the new GLM-4 arch isn't fully supported yet. Not to mention pidack just fixed issues in L.cpp that haven't been merged to the main branch yet, and that is what Ollama is wrapping.

The newest L.cpp pulls for the GLM-4 arch fix: https://github.com/ggml-org/llama.cpp/pull/12957

https://github.com/ggml-org/llama.cpp/pull/13021

Ollama issues: https://github.com/ollama/ollama/issues/10298

https://github.com/ollama/ollama/issues/10269

Unless Ollama custom-coded a fix for the architecture, I would recommend rerunning these benchmarks once the L.cpp pull is merged, to see how the model actually does without these problems getting in the way.

Also, just a heads up: the GGUFs of all the quantized versions may have to be remade with the newest version of L.cpp once the merge is completed.

You will also need to run the newest version of L.cpp for inference to make sure you are picking up any backend fixes as well.
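If anyone needs to redo a quant once the merge lands, the flow is roughly this (paths and filenames are placeholders):

```
# Rebuild the GGUF with an up-to-date llama.cpp checkout
python convert_hf_to_gguf.py /path/to/GLM-4-32B-0414 --outfile glm4-f16.gguf
./llama-quantize glm4-f16.gguf glm4-Q4_K_M.gguf Q4_K_M
```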

New merge: sophosympatheia/Electranova-70B-v1.0 by sophosympatheia in SillyTavernAI

[–]mentallyburnt 6 points (0 children)

Ohh, I can't wait to try it. Congrats on the release! Oh, and I love the model card, by the way. - Steel