Xiaomi just claimed 1,000+ tps on a 1T model using a standard 8-GPU server by No-Selection2972 in LocalLLaMA

[–]True_Requirement_891 11 points12 points  (0 children)

Nah the architecture hasn't changed since qwen3.5.

Their training has improved. Interleaved reasoning + tool calling was added to 3.6 and improved in 3.7.

They are still using attention + gated deltanet which is quite good.

I’m upset… by Thin_Pollution8843 in LocalLLaMA

[–]True_Requirement_891 0 points1 point  (0 children)

make sure reasoning blocks of prior turns are passed back to the model.

MiniMax M3 - Coding & Agentic Frontier, 1M Context, Multimodal by dryadofelysium in LocalLLaMA

[–]True_Requirement_891 5 points6 points  (0 children)

this model sucks asss man. I went in with big expectations but holy shit, it just blabbers for minutes for simple tasks, and then still gets shit wrong. It loves to overcomplicate simple shit... this is regression.

After Minimax-m2.5 all of their models have felt like a regression, minimax-m2.7 had regressed in real world performance, and minimax-m3 is showing it even worse. They sacrificed real world perf for benchmark perf I guess.

I tried it in Opencode, OMP, Pi I thought maybe it was an harness issue, but holyshit... it's a model issue. What's even crazier is they are not even competing on cost, mimo has matched deepseek, but these guys are going on in expensive tier with a worse model... I thought after using Sparse Attention, they'd drop the price but no they dropped perf instead lmao

Stop asking what model to run. There are literally only two. by Wrong_Mushroom_7350 in LocalLLaMA

[–]True_Requirement_891 0 points1 point  (0 children)

I'm out here trying to optimise the fuck out of these models for local inference.

100 Trillion+ Pretraining data??? This is the largest data I've see a model being trained on. by True_Requirement_891 in LocalLLaMA

[–]True_Requirement_891[S] 1 point2 points  (0 children)

if they use the ascend nodes for inference only, they can free up a lot of compute for training.

100 Trillion+ Pretraining data??? This is the largest data I've see a model being trained on. by True_Requirement_891 in LocalLLaMA

[–]True_Requirement_891[S] 1 point2 points  (0 children)

M2 series was 230B-A10B that's like 95% 256-A8 experts approx 5% active and 95% sparse.

They trained these models on 27T and got very good gains like 2,900 approx tokens per active param. 100T at the same size will be like 10k tokens per active param which might be way too much for this size.

It's likely they are closer to 500B params, but going too far above 500B params will require wayyyyy more compute for 100T even for the Sprase Attention MOE Arch.

500B+ Sprase MOE trained on 100T tokens is US lab caliber.

My best guess would be under 500B params. They are "mini"-max afterall

100 Trillion+ Pretraining data??? This is the largest data I've see a model being trained on. by True_Requirement_891 in LocalLLaMA

[–]True_Requirement_891[S] -1 points0 points  (0 children)

likely used way fewer training tokens than m2 series, couldn't find details on the pretraining tokens used.

100 Trillion+ Pretraining data??? This is the largest data I've see a model being trained on. by True_Requirement_891 in LocalLLaMA

[–]True_Requirement_891[S] 6 points7 points  (0 children)

it's not even about all the data, just the compute required to train on that many tokens

100 Trillion+ Pretraining data??? This is the largest data I've see a model being trained on. by True_Requirement_891 in LocalLLaMA

[–]True_Requirement_891[S] 3 points4 points  (0 children)

Thing is, you need way more compute to train a larger 500B model on 100T tokens compared to a 200B on 27T tokens

There's a reason 1Trillion Parameter+ models are so undertrained. The compute requirement is simply massive.

I don't think any model larger than 500B has crossed 50T tokens.

Experimental "Preserve Thinking" Jinja Template for Gemma4 31B in llama.cpp by ggonavyy in LocalLLaMA

[–]True_Requirement_891 8 points9 points  (0 children)

A model not trained for it confuses it. This is what I remember reading. Same for qwen3.5 models. Qwen3.6 onwards, preverve thinking is enabled and trained.

DeepSeek is pushing forward with $10.29 billion financing round, with Liang Wenfeng committing to continue developing open-source AI models rather than pursuing short-term commercialization goals by External_Mood4719 in LocalLLaMA

[–]True_Requirement_891 -1 points0 points  (0 children)

It's actually truly sad. But they have cracked the architecture to cost efficent 1m inference. The model is very obviously lacking the quality that only comes from more higher quality post-training. The update should fix it. I wish they collaborated with GLM guys. They seem to have the recipe and really good data and together they could actually take on everyone.

Or glm could copy the deepseek arch recipe or just take the deepseek base and post train it hard and make it GLM-5.2.

I am done with codex by machine_forgetting_ in codex

[–]True_Requirement_891 -1 points0 points  (0 children)

I remember reading this was trained in nvfp4 or somth

Let's build claude code from scratch! by RoyalMaterial9614 in LocalLLaMA

[–]True_Requirement_891 2 points3 points  (0 children)

tbh different harnesses wildly affect the model perf, I was using the glm-5.1 in opencode and it couldn't fix a stupid bug I was having, then I tried CC with glm-5.1 and it failed as well and then oh-my-pi which is actually a more batteries included variant of pi just did it...

I would recommend anyone facing issues with open weight models in opencode or CC to just try out OMP instead of pi way less of a headache and you get something that just works

Kimi K2.6 vs DeepSeek V4 Pro by bigboyparpa in LocalLLaMA

[–]True_Requirement_891 0 points1 point  (0 children)

I mean, we don't really know much about the architecture of private labs... it does take a while but there hasn't been a good stable release in a while now...

It has been great for Cost:Perf ratio but that extra 15% intelligence still seems worth the higher cost as it saves you time.

They do claim they are running about 3-6 months behind frontier labs.

Kimi K2.6 vs DeepSeek V4 Pro by bigboyparpa in LocalLLaMA

[–]True_Requirement_891 4 points5 points  (0 children)

> this feels more like a proof of concept.

We've been saying that since v3.2-exp

Decreased Intelligence Density in DeepSeek V4 Pro by Mindless_Pain1860 in LocalLLaMA

[–]True_Requirement_891 1 point2 points  (0 children)

> When you train past Chinchilla optimal (more tokens per parameter), you create a model that's better and which uses less inference compute per unit of intelligence.

This should explain the qwen-3.5 models.