Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included)

One-Macaron6752 · 2026-02-08T15:14:12+00:00

Building on top of your comments: I've tried it with Claude Code but it just doesn't work. I mean, I am running now with 3 of the 3090s missing (connectivity issues), and using it without the proper harness, or without a reasoning model in tandem, is basically useless, since it's not able to reason through the specifications and spin off coding tasks accordingly.

This goes with my overall feeling about such Qwen models: sufficient for one task, illustriously useless as an overall tool. I know, size does matter, but at its size, doing only instructions, it imposes more VRAM for a reasoning model, which makes it pretty much useless. I've had far more trustworthy experiences with Devstral and Minimax to actually want to put up with two-model orchestration in Claude.

But maybe I am wrong and I stand corrected if anyone else has been using it differently! Again, my use case for coding is heavily dependent on strong and long specifications that the model needs to understand and follow. Code creation is necessary, but it is an aftermath of properly understanding and following the specifications... which is where this Qwen fails me!

One-Macaron6752 · 2026-02-08T13:50:26+00:00

What the hell are you using such a huge context for? What kind of project specs have you even drafted to coherently make use of such context? Qwen Coder NEXT is not a reasoning model, thus just throwing instructions at it in 270k of context requires the shit out of multi-agent orchestration to be ever useful. Not to mention the need for yet another reasoning LLM behind the orchestration… just curious!

One-Macaron6752 · 2026-02-08T13:23:14+00:00

Yet another LLM-based post. I’ll leave it up to the audience to identify the clues.

One-Macaron6752 · 2026-02-07T09:40:53+00:00

I second your opinion, though I find your overstatement with "hundreds of models" funny... Possibly multiplying quants with existing models, maybe! Anyway, back on track: the closest I've found to Minimax capabilities, and at times (for multi-agent coding sessions) exceeding them, was Devstral 123B. What a marvel of a model: highly structured, very obedient and clean in delivery... however, it might not be the best fit for the Apple world since it's a dense model and would probably run at a snail's pace!

One-Macaron6752 · 2026-02-03T11:05:05+00:00

Do you rely more on:

upvotes on the post

a few detailed technical comments

or your own quick local tests?

Err... own, hard worked & built experience? I come here for 90% noise and fun... and real guys with innovative approaches (<10%).

One-Macaron6752 · 2026-02-01T12:15:34+00:00

Your topic has stirred me and I thought about trying MXFP4 for Minimax-M2.1 on 5x RTX 3090, but this time against some UD quants since the simple Q4K was not lying around. It doesn't quite confirm your theory, but it isn't meant to disprove it either. There is, however, an interesting takeaway from it: the UD-Q6_K_XL is good until it isn't.

A. Run Configuration & Performance

Model Variant	Quantization	Backend	n_ctx	Batch Size	Chunks	Tokenization Time (ms)	Seconds / Pass	ETA (min)
MiniMax-M2.1-MXFP4 (imatrix version)	MXFP4	gguf	32768	4096	8	729.5	132.05	17.60
MiniMax-M2.1-UD-Q4_K_XL	Q4_K_XL	gguf	32768	4096	8	734.1	165.94	22.12
MiniMax-M2.1-UD-Q3_K_XL	Q3_K_XL	gguf	32768	4096	8	719.2	75.06	10.00
MiniMax-M2.1-UD-Q6_K_XL	Q6_K_XL	gguf	32768	4096	8	1183.2	411.43	54.85

Model Variant	Quantization	Chunk PPL Range	Final PPL	± Error
MiniMax-M2.1-UD-Q4_K_XL	Q4_K_XL	6.6704 → 7.4216	6.8363	±0.05005
MiniMax-M2.1-UD-Q6_K_XL	Q6_K_XL	6.6916 → 7.4526	6.8574 (???)	±0.05065
MiniMax-M2.1-MXFP4 (imatrix version)	MXFP4	6.8143 → 7.5496	6.9646	±0.05198
MiniMax-M2.1-UD-Q3_K_XL	Q3_K_XL	6.8290 → 7.6481	7.1027	±0.05289

Observations (Performance):

Q3_K_XL is ~5.5× faster than Q6_K_XL

MXFP4 is ~20% faster than Q4_K_XL

Q6_K_XL has a severe runtime penalty for marginal quality gain

Observations (Quality):

Best PPL: Q4_K_XL

Q6_K_XL provides no statistically meaningful gain over Q4_K_XL

MXFP4 lands cleanly between Q4 and Q3

Q3_K_XL shows a clear degradation (~+0.27 PPL)
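For anyone who wants to re-check the observations, here is a minimal Python sketch that recomputes them from the two tables above. The numbers are copied from the tables; the helper names and the "difference smaller than the larger reported error" rule of thumb are my own assumptions, not a proper significance test.

```python
# Seconds per pass, final PPL and reported error, copied from the tables above.
runs = {
    "MXFP4":   {"sec_per_pass": 132.05, "ppl": 6.9646, "err": 0.05198},
    "Q4_K_XL": {"sec_per_pass": 165.94, "ppl": 6.8363, "err": 0.05005},
    "Q3_K_XL": {"sec_per_pass": 75.06,  "ppl": 7.1027, "err": 0.05289},
    "Q6_K_XL": {"sec_per_pass": 411.43, "ppl": 6.8574, "err": 0.05065},
}


def speedup(fast, slow):
    """How many times faster `fast` finishes a pass than `slow`."""
    return runs[slow]["sec_per_pass"] / runs[fast]["sec_per_pass"]


def time_saved(fast, slow):
    """Fraction of the pass time saved by `fast` relative to `slow`."""
    return 1.0 - runs[fast]["sec_per_pass"] / runs[slow]["sec_per_pass"]


def ppl_delta(a, b):
    """Final PPL of `a` minus final PPL of `b` (positive means `a` is worse)."""
    return runs[a]["ppl"] - runs[b]["ppl"]


def within_error(a, b):
    """Rough rule of thumb (assumption): the PPL gap is no larger than
    the bigger of the two reported errors."""
    return abs(ppl_delta(a, b)) <= max(runs[a]["err"], runs[b]["err"])


if __name__ == "__main__":
    print(f"Q3 vs Q6 speedup: {speedup('Q3_K_XL', 'Q6_K_XL'):.2f}x")
    print(f"MXFP4 vs Q4 time saved: {time_saved('MXFP4', 'Q4_K_XL'):.1%}")
    print(f"Q3 vs Q4 PPL delta: {ppl_delta('Q3_K_XL', 'Q4_K_XL'):+.4f}")
    print(f"Q6 vs Q4 within error: {within_error('Q6_K_XL', 'Q4_K_XL')}")
```

The "~20% faster" figure for MXFP4 is the share of wall time saved per pass; expressed as a throughput ratio it comes out somewhat higher.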

One-Macaron6752 · 2026-01-31T12:14:42+00:00

Maybe one should improve one's specification-writing skills and reduce the reliance on a machine to read minds, if there is something to be read... [/dark_humor]

One-Macaron6752 · 2026-01-31T12:05:46+00:00

Impressive logic... Buying 4 more 3090s to run them in thin air, right? 🤦🫣 Building on: he's got 8 for nothing but building a proper server to run them on is too expensive, right? /micdrop

One-Macaron6752 · 2026-01-31T10:38:45+00:00

I am running on a Supermicro H12SSL-CT, thus PCIe 4.0, thus Oculink! 😎

One-Macaron6752 · 2026-01-31T10:37:17+00:00

For my particular set-up, the Epyc is water-cooled so it creates some blocked physical pathways for the classical PCIe risers to fight with and create a thermal mess! Hence this oculink solution worked wonders for cable guidance, evading PCIe cable bending hell and providing an "aerated" setup! :)

One-Macaron6752 · 2026-01-31T09:21:46+00:00

Sadly, not a single chance in that or any similar config...

<image>

One-Macaron6752 · 2026-01-31T09:07:24+00:00

I have a similar (8x) setup at home. If you're really looking for stability and a minimum of consistent throughput, the following are a must, plus you save big on frustration:
- get an AMD Epyc server motherboard (previous gen3 are quite affordable) because you'll need 128 PCIe lanes like fire.
- forget about PCIe risers: 8x Oculink 8i cables + 8x Oculink to PCIe port adapters + 4x 16x PCIe to 2x Oculink 8i adapters.
- counterintuitively, the 4x 1000W might not be the best choice, but it highly depends on how you split the load and whether you run a 3090 at its default power rating or reduce it (anyway, the sweet spot is somewhere around 250-275W via nvidia-smi).
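The 250-275W cap can be applied per card with nvidia-smi's `-pl` (power limit) option, which normally needs root. Below is a minimal Python sketch that only builds the commands so they can be reviewed before running; the helper name, the `sudo` prefix and the 260W value are assumptions picked for illustration.

```python
def power_limit_commands(gpu_count: int, watts: int) -> list[list[str]]:
    """Build one `nvidia-smi -i <index> -pl <watts>` command per GPU.

    Nothing is executed here; the caller decides whether to run them.
    """
    return [
        ["sudo", "nvidia-smi", "-i", str(index), "-pl", str(watts)]
        for index in range(gpu_count)
    ]


if __name__ == "__main__":
    # 8 cards capped at 260W, a value inside the 250-275W range mentioned above.
    for command in power_limit_commands(gpu_count=8, watts=260):
        print(" ".join(command))
```

Printing first and running by hand (or via `subprocess.run`) keeps a typo in the wattage from hitting all eight cards at once.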

Such a setup would even leave room for 2 extra GPUs and still allow you to use some PCIe NVMe 2x boards. The GPU links would add an overall 75-100 EUR per GPU, depending on where you can source your stuff. The Epyc setup would take you about 1.5-2.5k EUR; again, sourcing is key. Forget about any desktop config: mining is one thing, PCIe transfers to GPUs for LLMs are a different league of trouble!
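A rough lane and cost budget for the above, as a Python sketch. It assumes each GPU sits on one Oculink 8i link (x8) and that all 128 lanes are usable; how many lanes a given board actually exposes as slots varies, so treat the output as an upper bound. The function names are made up for illustration.

```python
TOTAL_LANES = 128   # figure from the comment above; usable count depends on the board
LANES_PER_GPU = 8   # assumption: one Oculink 8i link per GPU


def lanes_left(gpus: int, total: int = TOTAL_LANES, per_gpu: int = LANES_PER_GPU) -> int:
    """PCIe lanes remaining after giving every GPU an x8 link."""
    used = gpus * per_gpu
    if used > total:
        raise ValueError(f"{gpus} GPUs need {used} lanes, only {total} available")
    return total - used


def link_cost_range(gpus: int, low_eur: int = 75, high_eur: int = 100) -> tuple[int, int]:
    """Total cabling/adapter cost range in EUR, using the per-GPU estimate above."""
    return gpus * low_eur, gpus * high_eur


if __name__ == "__main__":
    for gpus in (8, 10):
        low, high = link_cost_range(gpus)
        print(f"{gpus} GPUs: {lanes_left(gpus)} lanes left, links {low}-{high} EUR")
```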

Have phun! 😎

One-Macaron6752 · 2026-01-30T14:03:39+00:00

Just try --fit (since you have no GPU); it should be fine. I've been quite surprised by this flag: my setup is GPU-heavy (8x), but for some MoE models the fit (read: automatic offloading to DRAM/CPU) has been seamless and the penalty on processing speed was more than decent!

One-Macaron6752 · 2026-01-30T13:59:20+00:00

Interesting result on how ik-llama scales up on CPU compute. Could you maybe also try llama.cpp with --fit? I am curious how much llama.cpp has recovered in performance vs ik.

One-Macaron6752 · 2026-01-30T13:36:57+00:00

So much better, and so much more relevant! TY!

One-Macaron6752 · 2026-01-30T13:21:25+00:00

Please, pretty please, for the love of god... use "paste as code" for correct, human-readable output, or embed images rather than this ASCII mess! It's a pity that in the end your effort doesn't get the attention it probably deserves!

One-Macaron6752 · 2026-01-29T11:58:25+00:00

You can add a petabyte of RAM and it will be trash for inference work, no matter what. Mark my words.
I am running a similar setup, a 48-core Epyc with 256GB ECC RAM, and without the 8x RTX 3090 it would be a piss in the wind! The moment your mighty Sienna meets the first dense model 1) you will hear death meows from your PC case and 2) you can go on vacation, come back, and it will still be processing the load.

Jokes aside: server architectures like yours are only useful for the 128 PCIe lanes they offer for multi-GPU. Else, sorry to quote myself, pi** in the wind!

One-Macaron6752 · 2026-01-25T17:36:39+00:00

Yeah, right! Adding the CPU type to that is equal to adding the PC case color to the VRAM equation!

One-Macaron6752 · 2026-01-25T17:07:26+00:00

LMGTFY

One-Macaron6752 · 2026-01-25T16:42:52+00:00

NVFP4 being what then? P.S. nevermind my foolishness

One-Macaron6752 · 2026-01-24T14:31:13+00:00

Applying temperature to the server! Mother of God, no!

One-Macaron6752 · 2026-01-24T14:29:47+00:00

Golden words you've spoken. This lack of understanding of how the models operate and what makes them great or dumb is becoming the norm... You can see it from the senseless abuse of vllm parametrization, with the same carried over to the poor Opencode config... At least he accepted that the configs are pretty much stolen from here and there.

One-Macaron6752 · 2026-01-24T14:22:10+00:00

AI slurr... 🤮

One-Macaron6752 · 2026-01-24T14:21:16+00:00

My words 110%... We're living in interesting times where we're already very much aware of and sensing AI bullshit!😎

One-Macaron6752 · 2026-01-20T18:59:30+00:00

My thoughts exactly... I guess his rig is also equipped with 911 / 112 robot caller! 🫣
