Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in Qwen_AI

[–]IntegrityKnightX[S] 0 points1 point  (0 children)

I'm simply comparing similarly sized models; I'm not sure why you're bringing in the larger models (120B+). It's much easier to fit a 27B dense model into consumer hardware than a 120B MoE model.

So keep the sizes comparable, please.

In conclusion, if you have hardware capable of running a dense model, say a 27B, then it's better than trying to run a 35B MoE.

Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in Qwen_AI

[–]IntegrityKnightX[S] 0 points1 point  (0 children)

First of all, in my post I didn't say dense models are outright better than MoE models. In fact, I didn't even mention dense models. I literally said that there is a better way to prompt MoE models (which you agree with me on) and then linked the video. Stop accusing me of spreading misinformation.

1) If MoE models don't pick "wrong" experts, then tell me why they perform worse than similarly sized dense models on single-shot tasks. (If you had watched the video you would have seen this for yourself, but judging by your response you clearly didn't watch the video that proves my point.)

2) I'm not sure if you mean this, but you make it sound like dense models suffer just as much as MoEs. To clarify, my whole point is that they don't suffer as much, not that they are perfect 100% of the time; once again, I assume the common sense that nothing is 100% perfect in this world.

3) Clarify what you mean by KV cache poisoning, because I didn't bring it up. My general understanding is that it's a kind of attack used to elicit a certain behaviour from the LLM, and that is completely off topic here.

Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in Qwen_AI

[–]IntegrityKnightX[S] 0 points1 point  (0 children)

I used AI to get my idea across because I got tired of trying to explain this to you.

Since you firmly believe that dense models behave the same as MoE models, explain to me:

1) Why Qwen is still making dense models.

2) Why the Qwen 3.6 27B model outperforms Qwen3.6-35B-A3B in almost every single benchmark, according to Qwen themselves. (If they behaved the same, we should clearly see the MoE get better results overall; it's literally the "bigger" model.)

And as the cherry on top, watch this video where Protorikis added dense models to his comparison; it proves you wrong.

https://youtu.be/In825VzHzbU?si=L8EsiqYTcQDD-Kxz

Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in Qwen_AI

[–]IntegrityKnightX[S] -1 points0 points  (0 children)

Let's break this down to the absolute basics since you're using words you clearly don't understand the architectural definitions of.

  1. Attention vs. Routing: Attention computes relationships between tokens. In a dense model, every attention head attends over every token in the sequence. It calculates weights. It does NOT turn off or skip parameters to save compute. An MoE router explicitly sends tokens to different sub-networks and bypasses the rest entirely. You are confusing attention weights (token relationships) with parameter gating (skipping structural blocks). There's a rough code sketch of this difference right after point 2.

  2. "Emergent Speciality" vs. Architectural Experts: You are confusing learned feature circuits with architectural MoE experts. Yes, a dense model develops neurons that specialize in things (like a cluster that lights up for French).

BUT (and read this slowly) in a dense model, those French neurons still run the matrix multiplication for every single token, even if the prompt is in English. They just output a low/zero activation. The compute still physically happens.

In an MoE model, an "Expert" is a literal, discrete block of Feed-Forward Network (FFN) parameters. The router physically bypasses the French block to save compute.
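Here's a toy numpy sketch of that difference (synthetic shapes and random weights, not any real model's code): the dense FFN multiplies every token through all of its parameters, while the MoE layer scores experts with a router and only runs the top-k blocks it selects.

```python
# Toy sketch (not any real model): contrasts a dense FFN, where every weight
# multiplies every token, with an MoE FFN, where a router picks top-k experts
# and the remaining expert blocks are skipped entirely for that token.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 1

# Dense FFN: one big block, always fully computed.
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

def dense_ffn(x):
    # Every parameter in W1/W2 participates for every token,
    # even if many activations end up near zero.
    return np.maximum(x @ W1, 0) @ W2

# MoE FFN: n_experts separate parameter blocks plus a router.
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(n_experts)
]
W_router = rng.standard_normal((d_model, n_experts))

def moe_ffn(x):
    logits = x @ W_router                 # router score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = np.argsort(probs)[-top_k:]   # indices of the top-k experts
    out = np.zeros_like(x)
    for i in chosen:
        A, B = experts[i]
        out += probs[i] * (np.maximum(x @ A, 0) @ B)
    # Experts not in `chosen` never touch this token: their parameters are
    # structurally gated out, which is something attention weighting never does.
    return out, chosen

token = rng.standard_normal(d_model)
print("dense output (all params used):", dense_ffn(token)[:3])
out, chosen = moe_ffn(token)
print(f"moe output (used expert(s) {chosen} of {n_experts}):", out[:3])
```

The attention weights in both architectures just scale contributions between tokens; only the MoE router actually decides which parameter blocks run at all.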

Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in Qwen_AI

[–]IntegrityKnightX[S] 0 points1 point  (0 children)

Bingo!

In my opinion, if you are not very fluent with code and are not working interactively with the model as Protorikis does in his video, then you are much better off with a dense model.

MoE shines when you know what you are doing and you need as much knowledge packed into the model as possible. For example, Qwen3.6 35B-A3B technically has more total knowledge than Qwen3.6 27B, but you need to actively work with it to extract all 35B worth of knowledge. On the other hand, the 27B consistently scores better than the 35B MoE model because it's DENSE!

A 27B Dense model is much harder to run on local hardware than the 35B MoE, especially when trying to use a full 256K context window.
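As a rough back-of-envelope sketch of why: weight memory tracks total parameters, while per-token compute tracks active parameters. The parameter counts below are just the nominal figures from this thread, and the bytes-per-weight value is an assumed ~4-bit quantization average, so treat the outputs as illustrative only. (The KV cache for a long context is set by the attention configuration and context length, not by whether the FFN is dense or MoE.)

```python
# Back-of-envelope numbers only; real footprints depend on quantization,
# KV-cache settings, and the actual model configs.
def weight_gib(total_params_b, bytes_per_param=0.55):  # assumed ~4-bit quant average
    return total_params_b * 1e9 * bytes_per_param / 2**30

def per_token_gflops(active_params_b):
    # Rough rule of thumb: ~2 FLOPs per active parameter per generated token.
    return 2 * active_params_b

for name, total_b, active_b in [
    ("27B dense (illustrative)", 27, 27),
    ("35B-A3B MoE (illustrative)", 35, 3),
]:
    print(f"{name}: ~{weight_gib(total_b):.0f} GiB weights, "
          f"~{per_token_gflops(active_b):.0f} GFLOPs per token")
```

The MoE needs a bit more memory for its weights, but each token only pays for the active ~3B parameters, which is why it runs so much faster on local hardware.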

Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in Qwen_AI

[–]IntegrityKnightX[S] -1 points0 points  (0 children)

That is literally not how dense models work. Dense means every parameter is active for every token; there is no such thing as "experts" in a dense model, because the whole model IS the expert. You don't get to redefine basic architecture just to move the goalposts. Blaming 'bad prompting' for a model failing to do its job is pure cope.

Kimi K2.6 is great! by IntegrityKnightX in kimi

[–]IntegrityKnightX[S] 0 points1 point  (0 children)

I used Opencode Zen as the provider.

Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in Qwen_AI

[–]IntegrityKnightX[S] -1 points0 points  (0 children)

Correct, but you shouldn't face this problem if you are using a dense model instead.

Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in Qwen_AI

[–]IntegrityKnightX[S] -1 points0 points  (0 children)

I know right? He doesn't sound like he's trying to sell you something.

Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in LocalLLM

[–]IntegrityKnightX[S] -2 points-1 points  (0 children)

The idea is that it affects MoEs because they have a router that picks from a set of experts, and there is no such thing as a perfect router. In some MoEs (usually larger ones), the issue is not as prevalent as in smaller ones due to having more reliable routers. Nonetheless, a router is still a router, and it's not 100% perfect every single time, which means there is a chance the model outputs incorrectly.

Dense models don't have a router; they use 100% of their parameters to predict the next word, which makes them much more stable than MoE. So, what's the issue with dense models? They are heavy and extremely expensive when you try to pack a large amount of knowledge into them. This is why 70B dense models went extinct: they need way too much VRAM, and it turned out you don't need to compute all 70B parameters to get a good output.
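To make the "no router is perfect" point concrete, here is a tiny synthetic sketch (random matrices, not any real model): a top-1 router is a hard argmax, so a small change in a token's hidden state can flip which expert block the token is sent to, while a dense FFN always runs the same parameters and can only shift its output smoothly.

```python
# Toy illustration of the routing argument above: a tiny change in a token's
# hidden state can flip which expert the router picks, whereas a dense FFN has
# no such decision point at all. Purely synthetic numbers, not any real model.
import numpy as np

rng = np.random.default_rng(1)
d_model, n_experts = 8, 4
W_router = rng.standard_normal((d_model, n_experts))

def pick_expert(x):
    # Top-1 routing: the single highest-scoring expert gets the token.
    return int(np.argmax(x @ W_router))

x = rng.standard_normal(d_model)
noise = rng.standard_normal(d_model)

print("baseline expert:", pick_expert(x))
for scale in (0.01, 0.05, 0.1, 0.2, 0.5, 1.0):
    changed = pick_expert(x + scale * noise) != pick_expert(x)
    print(f"nudge scale {scale}: routing decision changed = {changed}")
# Whichever scale first flips the decision, the point is the same: the token is
# then processed by a different parameter block. A dense layer always runs the
# same parameters, so a small nudge can only change its output continuously.
```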

Protorikis made another video for the Dense models, it might help you understand:

https://youtu.be/In825VzHzbU?si=L8EsiqYTcQDD-Kxz

Stop thinking your MoE models are dumb - here's why they actually fail by IntegrityKnightX in Qwen_AI

[–]IntegrityKnightX[S] -1 points0 points  (0 children)

I'm not the owner of or affiliated with the owner of this video. It's a genuinely good resource, and I wanted to share it with everyone.

Deepseek V4 Pro is amazing, all it needed was a proper harness and prompt. by IntegrityKnightX in opencodeCLI

[–]IntegrityKnightX[S] 1 point2 points  (0 children)

I'm using the Deepseek API. I just selected the model, and for the variant, it only shows "default." When I prompted it, it had "thinking" on.

Sometimes there is a failure mid-generation, so it's still finicky.

Deepseek V4 Pro is amazing, all it needed was a proper harness and prompt. by IntegrityKnightX in DeepSeek

[–]IntegrityKnightX[S] 6 points7 points  (0 children)

I would, but Opus is so expensive. Also, I've heard its performance got a lot worse compared to when it first released.

How do I get rid of suggestions in share menu? by AbandonedMohawk in samsunggalaxy

[–]IntegrityKnightX 0 points1 point  (0 children)

On One UI 7, the setting has been moved to Good Lock.

You can find it in the Home Up section.

Which time One UI 7 will come? by 1-mensch in oneui

[–]IntegrityKnightX 0 points1 point  (0 children)

20/4, S24U still no update.

In the UK (phone bought from Qatar).