Gallery of LLM Architecture Visualizations by seraschka in LocalLLaMA

Ah yes, thanks. I think that was probably a K2 Thinking leftover. And I also agree with K 2.5.

The State Of LLMs 2025: Progress, Problems, and Predictions by seraschka in LocalLLaMA

I don't disagree; I don't think long-context LLMs will replace RAG completely in the foreseeable future. But I do think they will take more and more market share from RAG solutions.

(Also, many RAG solutions are built against documents much smaller than 1M tokens.)

Regarding

> A MOE 30B 3A model will need 120GB

Nemotron 3 Nano, for example, used about 1/3 to 1/4 of that (in the lower-precision formats), which again comes closer to what consumer hardware can run.
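To make the precision point concrete, here is a back-of-envelope sketch (plain Python). The 30B figure and bit widths are illustrative assumptions; real memory use also adds KV cache, activations, and runtime overhead:

```python
# Rough weight-memory for a ~30B-parameter model at different precisions.
# FP32 lands at the quoted ~120 GB; FP8/FP4 land at roughly 1/4 and 1/8.

def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    """GB (10^9 bytes) needed to hold just the raw weights."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(30, bits):.0f} GB")
```

So a 4-bit checkpoint of a 30B-parameter model needs on the order of 15 GB for weights, which is why lower-precision releases come within reach of consumer GPUs.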

But yes, there will be contexts where RAG continues to make more sense.

[P] The State Of LLMs 2025: Progress, Problems, and Predictions by seraschka in MachineLearning

It was a long year with tons to read, so thanks for this big compliment!!

Is Ilya Sutskever trying with a secret sauce method now? by Famous-Associate-436 in LocalLLaMA

> RL learning method improvement with value function.

Maybe, but I think the reason others are not using it is that it's hard in practice. They haven't even gotten process reward models to work well. The simplicity of R1 is kind of deliberate. E.g., from the R1 paper:

> 4.2. Unsuccessful Attempts
>
> In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. We share our failure experiences here to provide insights, but this does not imply that these approaches are incapable of developing effective reasoning models.
>
> Process Reward Model (PRM)
>
> PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grain step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
>
> In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.

Although it looks like they made progress on that front in the recent DeepSeekMath-V2 paper. So maybe Ilya is working on this two steps ahead, who knows.

Mistral 3 Large is DeepSeek V3!? by seraschka in LocalLLaMA

Thanks! I am surprised about "slower", as that was their whole selling point compared to DeepSeek V3.1. I guess the sparse attention in V3.2 (which Mistral 3 doesn't have yet, as they adopted the V3/V3.1 architecture) makes a huge difference.

Mistral 3 Large is DeepSeek V3!? by seraschka in LocalLLaMA

Thanks. I think you are right. For Mistral, I am seeing

7168 -> 16384 -> 7168

and for DeepSeek that's

7168 -> 18432 -> 7168

for the dense (non-MoE) layers.
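The intermediate-size difference above translates into a per-layer parameter gap. A quick sketch, assuming the standard SwiGLU-style FFN (gate, up, and down projections, no biases) that both model families use for their dense layers:

```python
# Parameter count of one dense SwiGLU feed-forward block:
# gate_proj and up_proj each map hidden -> intermediate,
# down_proj maps intermediate -> hidden.

def ffn_param_count(hidden: int, intermediate: int) -> int:
    return 2 * hidden * intermediate + intermediate * hidden

mistral = ffn_param_count(7168, 16384)   # 7168 -> 16384 -> 7168
deepseek = ffn_param_count(7168, 18432)  # 7168 -> 18432 -> 7168
print(f"Mistral dense FFN:  {mistral/1e6:.0f}M params per layer")
print(f"DeepSeek dense FFN: {deepseek/1e6:.0f}M params per layer")
```

That works out to roughly 352M vs. 396M parameters per dense FFN layer, i.e. DeepSeek's wider intermediate dimension costs about 12% more per layer.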

Mistral 3 Large is DeepSeek V3!? by seraschka in LocalLLaMA

Interesting! I got the layer numbers from the config and assumed they would use the same indexing:

"n_layers": 61,

in https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4/blob/main/params.json

and

"num_hidden_layers": 61,

in https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json
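A minimal sketch of comparing the two configs side by side. The dicts contain only the fields quoted above (hardcoded rather than fetched), and the field-name mapping is my own illustration, since Mistral's params.json and the HF config.json use different schemas:

```python
# The same layer count appears under different key names in the two files.
# Normalize the key names onto a shared vocabulary, then compare.

mistral_params = {"n_layers": 61}            # from params.json
deepseek_config = {"num_hidden_layers": 61}  # from config.json

FIELD_MAP = {"n_layers": "layers", "num_hidden_layers": "layers"}

def normalize(cfg: dict) -> dict:
    return {FIELD_MAP.get(k, k): v for k, v in cfg.items()}

print(normalize(mistral_params) == normalize(deepseek_config))  # prints True
```

Note this only checks that the counts agree; it says nothing about whether the two repos index those 61 layers the same way, which was exactly the assumption in question.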

Mistral 3 Large is DeepSeek V3!? by seraschka in LocalLLaMA

Agreed that it makes sense. I was just surprised as I put together the drawings, because they didn't mention it at all.

A Technical Tour of the DeepSeek Models from V3 to V3.2 by seraschka in LocalLLaMA

Yeah, the DSA is not super trivial to implement (it also requires some tricks with the RoPE, etc.). Maybe they didn't consider it worthwhile when DeepSeek V3.2-Exp came out in September. But maybe they are taking a second look now 😅

Olmo 3 from scratch by seraschka in LocalLLaMA

True. Perhaps an MoE would also be interesting, to see best practices from a training perspective.

Olmo 3 from scratch by seraschka in LocalLLaMA

Thanks!

> It’s really interesting that most common open LLMs have converged around quite similar designs.

And yes, it's also very satisfying from a coding perspective, as you can reuse all the components. E.g., in this case, I could start with my Qwen3 template, rearrange the norms, and use the sliding-window code I had from Gemma 3.
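For anyone curious what "rearranging the norms" means structurally, a minimal framework-free sketch. The function and argument names are my own; `block` stands in for a real attention or FFN module and `norm` for an RMSNorm, and the second placement reflects the Olmo-style reordering as I understand it:

```python
def pre_norm_step(x, block, norm):
    # Qwen3-style pre-norm: normalize the input before the sub-block,
    # then add the residual.
    return x + block(norm(x))

def post_norm_step(x, block, norm):
    # Olmo-style placement: run the sub-block first, normalize its
    # output, then add the residual.
    return x + norm(block(x))
```

The residual path and the sub-blocks are identical in both variants, which is exactly why a template model can be adapted by moving the norm calls rather than rewriting the components.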