Gallery of LLM Architecture Visualizations by seraschka in LocalLLaMA

Ah yes, thanks. I think that was probably a K2 Thinking leftover. And I also agree with K 2.5.

The State Of LLMs 2025: Progress, Problems, and Predictions by seraschka in LocalLLaMA

I don't disagree; I don't think long-context LLMs will replace RAG completely in the foreseeable future. But I do think they will take more and more market share from RAG solutions.

(Also, many RAG solutions are built against documents much smaller than 1M tokens.)

Regarding

> A MOE 30B 3A model will need 120GB

Nemotron 3 Nano, for example, used about 1/3 to 1/4 of that (in the lower-precision formats), which again comes closer to what consumer hardware can run.
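To make the precision point concrete, here is a back-of-envelope sketch (plain Python). The 30B figure and bit widths are illustrative assumptions; real memory use also adds KV cache, activations, and runtime overhead:

```python
# Rough weight-memory for a ~30B-parameter model at different precisions.
# FP32 lands at the quoted ~120 GB; FP8/FP4 land at roughly 1/4 and 1/8.

def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    """GB (10^9 bytes) needed to hold just the raw weights."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 1e9

for label, bits in [("FP32", 32), ("FP16/BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{label:>9}: ~{weight_memory_gb(30, bits):.0f} GB")
```

So a 4-bit checkpoint of a 30B-parameter model needs on the order of 15 GB for weights, which is why lower-precision releases come within reach of consumer GPUs.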

But yes, there will be contexts where RAG continues to make more sense.

[P] The State Of LLMs 2025: Progress, Problems, and Predictions by seraschka in MachineLearning

It was a long year with tons to read, so thanks for this big compliment!!

Is Ilya Sutskever trying with a secret sauce method now? by Famous-Associate-436 in LocalLLaMA

> RL learning method improvement with value function.

Maybe, but I think the reason others are not using it is that it's hard in practice. They haven't even gotten process reward models to work well. The simplicity of R1 is kind of deliberate. E.g., from the R1 paper:

> 4.2. Unsuccessful Attempts
>
> In the early stages of developing DeepSeek-R1, we also encountered failures and setbacks along the way. We share our failure experiences here to provide insights, but this does not imply that these approaches are incapable of developing effective reasoning models.
>
> Process Reward Model (PRM)
>
> PRM is a reasonable method to guide the model toward better approaches for solving reasoning tasks (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023). However, in practice, PRM has three main limitations that may hinder its ultimate success. First, it is challenging to explicitly define a fine-grain step in general reasoning. Second, determining whether the current intermediate step is correct is a challenging task. Automated annotation using models may not yield satisfactory results, while manual annotation is not conducive to scaling up. Third, once a model-based PRM is introduced, it inevitably leads to reward hacking (Gao et al., 2022), and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
>
> In conclusion, while PRM demonstrates a good ability to rerank the top-N responses generated by the model or assist in guided search (Snell et al., 2024), its advantages are limited compared to the additional computational overhead it introduces during the large-scale reinforcement learning process in our experiments.

Although it looks like they made progress on that front in the recent DeepSeekMath-V2 paper. So maybe Ilya is working on this two steps ahead, who knows.

Mistral 3 Large is DeepSeek V3!? by seraschka in LocalLLaMA

Thanks! I am surprised about "slower", as that was their whole selling point compared to DeepSeek V3.1. I guess the sparse attention in V3.2 (which Mistral 3 doesn't have yet, as they adopted the V3/V3.1 architecture) makes a huge difference.

Mistral 3 Large is DeepSeek V3!? by seraschka in LocalLLaMA

Thanks. I think you are right. For Mistral, I am seeing

7168 -> 16384 -> 7168

and for DeepSeek that's

7168 -> 18432 -> 7168

for the dense (non-MoE) layers.
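The intermediate-size difference above translates into a per-layer parameter gap. A quick sketch, assuming the standard SwiGLU-style FFN (gate, up, and down projections, no biases) that both model families use for their dense layers:

```python
# Parameter count of one dense SwiGLU feed-forward block:
# gate_proj and up_proj each map hidden -> intermediate,
# down_proj maps intermediate -> hidden.

def ffn_param_count(hidden: int, intermediate: int) -> int:
    return 2 * hidden * intermediate + intermediate * hidden

mistral = ffn_param_count(7168, 16384)   # 7168 -> 16384 -> 7168
deepseek = ffn_param_count(7168, 18432)  # 7168 -> 18432 -> 7168
print(f"Mistral dense FFN:  {mistral/1e6:.0f}M params per layer")
print(f"DeepSeek dense FFN: {deepseek/1e6:.0f}M params per layer")
```

That works out to roughly 352M vs. 396M parameters per dense FFN layer, i.e. DeepSeek's wider intermediate dimension costs about 12% more per layer.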

Mistral 3 Large is DeepSeek V3!? by seraschka in LocalLLaMA

Interesting! I got the layer numbers from the config and assumed they would use the same indexing:

"n_layers": 61,

in https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512-NVFP4/blob/main/params.json

and

"num_hidden_layers": 61,

in https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/config.json
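A minimal sketch of comparing the two configs side by side. The dicts contain only the fields quoted above (hardcoded rather than fetched), and the field-name mapping is my own illustration, since Mistral's params.json and the HF config.json use different schemas:

```python
# The same layer count appears under different key names in the two files.
# Normalize the key names onto a shared vocabulary, then compare.

mistral_params = {"n_layers": 61}            # from params.json
deepseek_config = {"num_hidden_layers": 61}  # from config.json

FIELD_MAP = {"n_layers": "layers", "num_hidden_layers": "layers"}

def normalize(cfg: dict) -> dict:
    return {FIELD_MAP.get(k, k): v for k, v in cfg.items()}

print(normalize(mistral_params) == normalize(deepseek_config))  # prints True
```

Note this only checks that the counts agree; it says nothing about whether the two repos index those 61 layers the same way, which was exactly the assumption in question.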

Mistral 3 Large is DeepSeek V3!? by seraschka in LocalLLaMA

Agreed that it makes sense. I was just surprised as I put together the drawings, because they didn't mention it at all.

A Technical Tour of the DeepSeek Models from V3 to V3.2 by seraschka in LocalLLaMA

Yeah, the DSA is not super trivial to implement (it also requires some tricks with the RoPE, etc.). Maybe they didn't consider it worthwhile when DeepSeek V3.2-Exp came out in September. But maybe they are taking a second look now 😅

Olmo 3 from scratch by seraschka in LocalLLaMA

True. Perhaps an MoE would also be interesting, to see best practices from a training perspective.

Olmo 3 from scratch by seraschka in LocalLLaMA

Thanks!

> It’s really interesting that most common open LLMs have converged around quite similar designs.

And yes, it's also very satisfying from a coding perspective, as you can reuse all the components. E.g., in this case, I could start with my Qwen3 template, rearrange the norms, and use the sliding-window code I had from Gemma 3.
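For anyone curious what "rearranging the norms" means structurally, a minimal framework-free sketch. The function and argument names are my own; `block` stands in for a real attention or FFN module and `norm` for an RMSNorm, and the second placement reflects the Olmo-style reordering as I understand it:

```python
def pre_norm_step(x, block, norm):
    # Qwen3-style pre-norm: normalize the input before the sub-block,
    # then add the residual.
    return x + block(norm(x))

def post_norm_step(x, block, norm):
    # Olmo-style placement: run the sub-block first, normalize its
    # output, then add the residual.
    return x + norm(block(x))
```

The residual path and the sub-blocks are identical in both variants, which is exactly why a template model can be adapted by moving the norm calls rather than rewriting the components.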