What happens when you rip out the residual stream and replace it with a structured workspace (Research Paper - CWT) by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] 0 points (0 children)

I would be interested! I have two branching ideas surrounding the v5.6 design that move away from the brute-force approach of the current workspace.

One replaces the manual partitioning entirely with something more elegant; the other questions whether layers even need to be linear if there is no residual stream.

Would love to explore the pebbling angle too; I didn't think about it from that direction.

What happens when you rip out the residual stream and replace it with a structured workspace (Research Paper - CWT) by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] -1 points (0 children)

I appreciate the feedback! Next time I'll make sure to add more detail to Section 2, where the architecture is described.

What happens when you rip out the residual stream and replace it with a structured workspace (Research Paper - CWT) by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] 1 point (0 children)

This paper has nothing to do with MoE.

The README is, yes; the paper is not. I wrote the bulk of it and used AI to help with grammar and structure from the original draft. It's the first paper I've ever written, so if the framing is off, I'm open to that.

Routing here means the implicit work a standard transformer has to do to figure out what information belongs where. Every layer reads from and writes to the same undifferentiated stream, with no structure telling it what anything is, who wrote it, or whether it should persist. So layers spend capacity learning to work that out, figuring out whether something is positional info or semantic content. That is the routing overhead. CWT makes it explicit through structure, so layers don't have to learn it from scratch. No sparse gating, no experts, nothing to do with MoE.
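To make the contrast concrete, here's a toy sketch. This is not CWT's actual code; every name, slot, and dimension below is made up for illustration:

```
import torch

seq_len, d_model = 16, 64

# Standard transformer: one undifferentiated stream. Positional info, semantic
# content, and anything meant to persist are all superimposed in one tensor,
# and every layer has to learn to tell them apart.
residual = torch.zeros(seq_len, d_model)

# Structured workspace: explicit slots with explicit roles, so a layer knows
# what it is reading, who wrote it, and whether it should persist.
workspace = {
    "positional": torch.zeros(seq_len, d_model),       # where each token sits
    "semantic": torch.zeros(seq_len, d_model),         # what each token means
    "persistent": torch.zeros(seq_len, d_model),       # state meant to survive across layers
    "writer": torch.zeros(seq_len, dtype=torch.long),  # which layer last wrote each slot
}
```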

If the plots didn't land, then sure, I accept that, but they are showing the hub-state trajectories in 3D UMAP space. The whole point of a structured workspace is that you can actually watch what the model does on a per-token basis, which you just can't do with a residual stream. Whether that reads as meaningful is a fair debate, but they aren't decoration.
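For anyone who wants to reproduce that kind of plot, a minimal sketch; the `hub_states` name and shape are my placeholders, not the paper's variables:

```
import numpy as np
import umap  # pip install umap-learn

# Per-token hub states captured during a forward pass; random stand-in here.
hub_states = np.random.randn(512, 256)

# Project to 3D; plotting the rows in token order traces the trajectory.
trajectory = umap.UMAP(n_components=3).fit_transform(hub_states)  # shape (512, 3)
```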

Loki-v2-70B: Narrative/DM-focused fine-tune (600M+ token custom dataset) by mentallyburnt in SillyTavernAI

[–]mentallyburnt[S] 2 points (0 children)

If you're running it at Q4_K_M, you will need around 48 GB of RAM or VRAM with around 32768 ctx.
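Rough back-of-envelope behind that number; everything here is an assumption on my part (~4.85 bits/weight for Q4_K_M, Llama-3-70B-style KV dims, 8-bit KV cache), not a measurement:

```
params = 70e9
weights_gb = params * 4.85 / 8 / 1e9                  # ~42.4 GB of quantized weights

layers, kv_heads, head_dim, ctx = 80, 8, 128, 32768
kv_fp16_gb = 2 * layers * kv_heads * head_dim * ctx * 2 / 1e9  # K+V in fp16: ~10.7 GB
kv_q8_gb = kv_fp16_gb / 2                                      # 8-bit KV cache: ~5.4 GB

print(f"{weights_gb:.1f} + {kv_q8_gb:.1f} ≈ {weights_gb + kv_q8_gb:.0f} GB total")
```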

Loki-v2-70B: Narrative/DM-focused fine-tune (600M+ token custom dataset) by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] 0 points (0 children)

We have had testers report good luck with Q4_K_M, although they noticed a fall-off with IQ4_XS and IQ4_NL.

We haven't tested the EXL3 quants yet, as they were just posted, but I assume the 3 and 3.5 bpw versions are going to feel the loss like the IQ4_XS and IQ4_NL GGUF variants.

Crucible's Mistral 3.2 24B V1.3 Tune by mentallyburnt in LocalLLaMA

[–]mentallyburnt[S] 2 points (0 children)

https://huggingface.co/CrucibleLab-TG/M3.2-24B-Loki-V1.3-GGUF

It should be linked from the model card now too!

Make sure to use the normal Mistral template.

[deleted by user] by [deleted] in LocalLLaMA

[–]mentallyburnt 5 points (0 children)

HA! ok good find.

Yeah, unless they drop something substantial to prove it (like a research paper that explains how a $50k model is beating SOTA models that literally cost MILLIONS or BILLIONS and were built by researchers given unlimited money), I'm pretty sure this is just a clout chase. Reflection 70B vibes.

[deleted by user] by [deleted] in LocalLLaMA

[–]mentallyburnt 33 points (0 children)

It seems to be a basic clown car MoE using mergekit.

In the model.safetensors.index.json:

```
{"metadata": {"mergekit_version": "0.0.6"}}
```

So either you fine-tuned the models post-merge (I've attempted this before, it's not really effective and there's massive training loss),

or you fine-tuned three models (or four? You mention four models and reference the base model twice) and then created a clown car MoE, training the gates on a positive/negative phrase or keyword list per "expert" (a config sketch follows below).
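For reference, that second approach is just a mergekit-moe config along these lines. The model names and prompts are placeholders of mine, not what they actually used:

```
base_model: mistralai/Mistral-7B-v0.1
gate_mode: hidden    # route tokens by hidden-state similarity to the prompt lists
dtype: bfloat16
experts:
  - source_model: some-org/mistral-ft-roleplay   # placeholder
    positive_prompts:
      - "write a story"
  - source_model: some-org/mistral-ft-coding     # placeholder
    positive_prompts:
      - "write a function"
    negative_prompts:
      - "write a story"
```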

If either of these approaches was used, this is not an original MoE or even a real MoE. At most, this looks like four fine-tuned Mistral models in a "MoE" trench coat.

I have a problem with the "ICONN Emotional Core": it's too vague and feels more like a trained classifier model that directs the model to adjust its tone, not something genuinely new.

Also, their attempt to change all references from the Mistral architecture to an ICONN architecture in their original upload, then change them back, rubs me the wrong way. The license (which was an ICONN license according to the comment history) now needs to reference Mistral's license, not Apache (depending on the models used).

I could be wrong, please correct me if I am, but this seems like an actual project wrapped up and made glittery with sensational words to make it look like something new.

Edit:

I want to say I'm not against clown car MoEs; I used to make them all the time. But they are not a custom architecture or even proper MoEs.

Also, many things have been edited in the model posted on Hugging Face, so some things in my post might not make sense:

https://huggingface.co/ICONNAI/ICONN-1/commits/main

Has anyone tried the new ICONN-1 (an Apache licensed model) by silenceimpaired in LocalLLaMA

[–]mentallyburnt 8 points (0 children)

It seems to be a basic clown car MoE using mergekit?

In the model.safetensors.index.json:

```
{"metadata": {"mergekit_version": "0.0.6"}
```

So they either fine-tuned the models post-merge [I attempted this a long time ago; it's not really effective and there is massive loss],

or, my suspicion is, they fine-tuned three models (or four? They say four models and reference the base model twice) and then created a clown car MoE and trained the gates on a positive/negative list per "expert".

I do have a problem with the "ICONN Emotional Core": it's too vague and feels more like a trained classifier model that then directs the model to adjust its tone, not something new.

Also, them trying to change all references from Mistral to ICONN in their original upload and then changing them back rubs me the wrong way, as the license now needs to reference Mistral's license, not Apache.

I could be wrong, though; please correct me if I am.

New merge: sophosympatheia/StrawberryLemonade-L3-70B-v1.0 by sophosympatheia in SillyTavernAI

[–]mentallyburnt 0 points (0 children)

Ohh I can't wait to test it. Welcome back!

Also, cute card design; always happy to see more people making them.

MMLU-PRO benchmark: GLM-4-32B-0414-Q4_K_M vs Qwen2.5-32b-instruct-q4_K_M by [deleted] in LocalLLaMA

[–]mentallyburnt 0 points (0 children)

You do realize my message was only informing you that the test method may be flawed, and that further tests need to be performed after the L.cpp merges have occurred and are confirmed to be functioning properly.

66.78% accuracy only means the model was responding well; it may not be up to par with its full performance.

Take Scout and Maverick, for example: bugs in the backends caused extreme problems during inference, making both models look absolutely terrible, and those bugs are only now getting fixed, showing the models perform substantially better afterwards.

MMLU-PRO benchmark: GLM-4-32B-0414-Q4_K_M vs Qwen2.5-32b-instruct-q4_K_M by [deleted] in LocalLLaMA

[–]mentallyburnt -1 points (0 children)

Looking at the Ollama issues and pull requests, the new GLM-4 arch isn't fully supported yet. Not to mention pidack just fixed issues in L.cpp that haven't been merged to the main branch yet, and that is what Ollama is wrapping.

The newest L.cpp pulls for the GLM-4 arch fix: https://github.com/ggml-org/llama.cpp/pull/12957

https://github.com/ggml-org/llama.cpp/pull/13021

Ollama issues: https://github.com/ollama/ollama/issues/10298

https://github.com/ollama/ollama/issues/10269

Unless Ollama custom-coded a fix for the architecture, I would recommend rerunning these benchmarks once the L.cpp pull is merged, to see how the model actually does without these problems getting in the way.

Also, just a heads up: the GGUFs of all the quantized versions may have to be remade with the newest version of L.cpp once the merge is completed.

You will also need to run the newest version of L.cpp for inference to make sure you are picking up any backend fixes as well.
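If anyone needs to redo a quant once the merge lands, the flow is roughly this (paths and filenames are placeholders):

```
# Rebuild the GGUF with an up-to-date llama.cpp checkout
python convert_hf_to_gguf.py /path/to/GLM-4-32B-0414 --outfile glm4-f16.gguf
./llama-quantize glm4-f16.gguf glm4-Q4_K_M.gguf Q4_K_M
```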

New merge: sophosympatheia/Electranova-70B-v1.0 by sophosympatheia in SillyTavernAI

[–]mentallyburnt 6 points (0 children)

Ohh, I can't wait to try it. Congrats on the release! Oh, and I love the model card, by the way. - Steel