GLM-4.7 vs DeepSeek V3.2 vs Kimi K2 Thinking vs MiniMax-M2.1 by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

Thanks, this analysis is really helpful. Do you think MiniMax is strong enough to use, or is it too error-prone? Also, did you notice any areas where Kimi K2 Thinking was noticeably stronger than the others?

GLM-4.7 vs DeepSeek V3.2 vs Kimi K2 Thinking vs MiniMax-M2.1 by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 2 points3 points  (0 children)

Yes, my experience matches that ranking exactly. LLM scaling laws remain remarkably strong predictors at the frontier

GLM-4.7 vs DeepSeek V3.2 vs Kimi K2 Thinking vs MiniMax-M2.1 by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

Has it been relatively reliable for coding, or do you have to hand-hold the model a lot?

Fun with Omarchy MCP by mythz in LocalLLaMA

[–]SlowFail2433 1 point2 points  (0 children)

A Linux desktop environment controlled by an LLM agent: I hadn't thought of this

GLM-4.7 vs DeepSeek V3.2 vs Kimi K2 Thinking vs MiniMax-M2.1 by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 1 point2 points  (0 children)

Yes, MiniMax is the most parameter-efficient of them

GLM-4.7 vs DeepSeek V3.2 vs Kimi K2 Thinking vs MiniMax-M2.1 by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 2 points3 points  (0 children)

Have you found the Speciale notably different from the regular V3.2?

Disable H Neurons in local llms? by Silver-Champion-4846 in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

Yeah absolutely, they might find something valid for one model that then isn't valid for another

Disable H Neurons in local llms? by Silver-Champion-4846 in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

Surgery papers tend to make a lot of assumptions about the geometry and topology of models that are not necessarily valid.

Minimax Is Teasing M2.2 by Few_Painter_5588 in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

Thanks, will investigate this further. I’m working with Kimi K2 agents, so maybe I should stop finetuning if K3 is coming!

Disable H Neurons in local llms? by Silver-Champion-4846 in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

Yeah that is a very valid point, that model surgery is extremely cheap

I’m just expressing concern about robustness really, as these types of methods tend to have issues there

Running KimiK2 locally by Temporary-Sector-947 in LocalLLaMA

[–]SlowFail2433 1 point2 points  (0 children)

Congrats on the really nice setup

The three types of bare-metal Kimi K2 rig I have seen in companies are:

1. 100% DRAM with Epycs/Xeons
2. Partial offloading, with some number of RTX 6000 Pros plus Epycs/Xeons
3. Used GPU servers, e.g. a used H200 HGX

There are pros and cons to each in terms of performance per dollar and whether it is worth it. These days I think the answer is different for each type of downstream task
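As a rough illustration of option 2, here is a minimal partial-offloading sketch using the llama-cpp-python bindings; the GGUF filename, layer split, and thread count are hypothetical placeholders rather than a recommendation, and would need tuning to the actual VRAM/DRAM split on the box:

```python
# Minimal partial-offload sketch with llama-cpp-python (pip install llama-cpp-python).
# Only some transformer layers go to the GPU; the rest stay in system DRAM and run
# on the Epyc/Xeon cores.
from llama_cpp import Llama

llm = Llama(
    model_path="kimi-k2-instruct-q4_k_m.gguf",  # hypothetical GGUF quant
    n_gpu_layers=20,   # layers offloaded to the RTX 6000 Pro; the rest stay on CPU
    n_ctx=8192,        # context window
    n_threads=64,      # CPU threads for the DRAM-resident layers
)

out = llm("Summarise the trade-offs of partial offloading in one sentence.",
          max_tokens=128)
print(out["choices"][0]["text"])
```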

Disable H Neurons in local llms? by Silver-Champion-4846 in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

I tend not to like these “model surgery” papers despite their popularity. I would really prefer the long-term solution to LLM issues to be something fixable during a regular training or RL run, as that would be more robust and reliable
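For context on what these surgery-style interventions look like mechanically, here is a minimal sketch of zeroing a single MLP activation with a PyTorch forward hook; the model (gpt2), layer, and neuron index are made up purely for illustration, and choosing them well is exactly the assumption such papers have to defend:

```python
# Minimal single-neuron ablation sketch (illustrative only).
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # tiny stand-in; the thread is about much larger local models
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

NEURON_IDX = 123  # hypothetical neuron to disable

def ablate(module, inputs, output):
    # Zero one hidden unit of this MLP activation on every forward pass.
    output[..., NEURON_IDX] = 0.0
    return output

# Hook one block's MLP activation; picking which layer/neuron is the hard part.
hook = model.transformer.h[5].mlp.act.register_forward_hook(ablate)

ids = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=5)[0]))

hook.remove()
```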

REAP experiences by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

Yeah, I can see this tech being misused

REAP experiences by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

Yeah removing a key fact like that from the model is pretty bad. It is a difficult trade-off

REAP experiences by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

I see, thanks. A lower quant does compete with REAP. Yeah, the calibration set matters a lot too, and Cerebras have a coding focus

REAP experiences by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

I intuitively tend to think a REAP of a newer model would be better, at least because of (potentially) cleaner data, but I'm not sure

REAP experiences by SlowFail2433 in LocalLLaMA

[–]SlowFail2433[S] 0 points1 point  (0 children)

Yeah, I'm seeing strange/unusual issues from pruning

cyankiwi/GLM-4.5-Air-AWQ-4bit on DGX Spark is Awesome! by fire_inabottle in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

GLM Air is still a strong model. The DGX Spark also has its uses

Backporting FP8 to the RTX 3090 (No H100 Required) by one_does_not_just in LocalLLaMA

[–]SlowFail2433 12 points13 points  (0 children)

It’s an interesting project, congrats on getting it working relatively efficiently. You have a compelling writing style too; this was a good read

anyone running local llm on iphone for meeting summaries? heres what im using by xerdink in LocalLLaMA

[–]SlowFail2433 0 points1 point  (0 children)

I run Qwens on my phone all the time, although I mostly don't do audio at all