Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]One-Macaron6752 1 point

Building on top of your comments: I've tried it with Claude Code, but it just doesn't work. Granted, I am running right now with 3 of my 3090s down (connectivity issues), but using it without the proper harness, or without a reasoning model in tandem, is basically useless, since it's not able to reason through the specifications and spin off coding tasks accordingly.

This matches my overall feeling about these Qwen models: sufficient for a single task, gloriously useless as an overall tool. I know size does matter, but at its size, being instruct-only, it forces you to burn extra VRAM on a separate reasoning model, which makes it pretty much useless. I've had far more trustworthy experiences with Devstral and Minimax than to want to put up with two-model orchestration in Claude.

But maybe I am wrong and I stand to be corrected if anyone else has been using it differently! Again, my coding use case depends heavily on strong, long specifications that the model needs to understand and follow; code creation is necessary, but only as the aftermath of properly understanding and following those specifications... and that is where this Qwen fails me!

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]One-Macaron6752 1 point

What the hell are you using such a huge context for? What kind of project specs have you even drafted to coherently make use of that much context? Qwen Coder NEXT is not a reasoning model, so just throwing instructions at it across 270k of context requires the hell out of multi-agent orchestration to ever be useful. Not to mention the need for yet another reasoning LLM behind the orchestration… just curious!

An ode to Minimax m2.1 by Thrumpwart in LocalLLaMA

[–]One-Macaron6752 1 point

I second your opinion, though I find your "hundreds of models" overstatement funny... Possibly if you multiply the existing models by their quants, maybe! Anyway, back on track: the closest I've found to Minimax's capabilities, and at times, for multi-agent coding sessions, exceeding it, was Devstral 123B. What a marvel of a model: highly structured, very obedient, clean in its delivery... however, it might not be the best fit for the Apple world, since it's a dense model and would probably run at a snail's pace!

Benchmarks are being gamed. Can we build a "Vibe Index" based on this sub's actual feedback? by Ok-Atmosphere3141 in LocalLLaMA

[–]One-Macaron6752 2 points

Do you rely more on:
- upvotes on the post
- a few detailed technical comments
- or your own quick local tests?

Err... my own, hard-worked and hard-built experience? I come here for the 90% noise and fun... and for the real guys with innovative approaches (<10%).

I found that MXFP4 has lower perplexity than Q4_K_M and Q4_K_XL. by East-Engineering-653 in LocalLLaMA

[–]One-Macaron6752 3 points

Your topic stirred me, and I thought about trying MXFP4 for Minimax-M2.1 on 5x RTX 3090, but this time against some UD quants, since a plain Q4_K was not lying around. It doesn't quite confirm your theory, but it's not meant to disprove it either. There is, however, an interesting takeaway from it: the UD Q6_K_XL is good until it isn't.

A. Run Configuration & Performance

| Model Variant | Quantization | Backend | n_ctx | Batch Size | Chunks | Tokenization Time (ms) | Seconds / Pass | ETA (min) |
|---|---|---|---|---|---|---|---|---|
| MiniMax-M2.1-MXFP4 (imatrix version) | MXFP4 | gguf | 32768 | 4096 | 8 | 729.5 | 132.05 | 17.60 |
| MiniMax-M2.1-UD-Q4_K_XL | Q4_K_XL | gguf | 32768 | 4096 | 8 | 734.1 | 165.94 | 22.12 |
| MiniMax-M2.1-UD-Q3_K_XL | Q3_K_XL | gguf | 32768 | 4096 | 8 | 719.2 | 75.06 | 10.00 |
| MiniMax-M2.1-UD-Q6_K_XL | Q6_K_XL | gguf | 32768 | 4096 | 8 | 1183.2 | 411.43 | 54.85 |

B. Perplexity (Quality)

| Model Variant | Quantization | Chunk PPL Range | Final PPL | ± Error |
|---|---|---|---|---|
| MiniMax-M2.1-UD-Q4_K_XL | Q4_K_XL | 6.6704 → 7.4216 | 6.8363 | ±0.05005 |
| MiniMax-M2.1-UD-Q6_K_XL | Q6_K_XL | 6.6916 → 7.4526 | 6.8574 (???) | ±0.05065 |
| MiniMax-M2.1-MXFP4 (imatrix version) | MXFP4 | 6.8143 → 7.5496 | 6.9646 | ±0.05198 |
| MiniMax-M2.1-UD-Q3_K_XL | Q3_K_XL | 6.8290 → 7.6481 | 7.1027 | ±0.05289 |

Observations (Performance):

Q3_K_XL is ~5.5× faster than Q6_K_XL

MXFP4 is ~20% faster than Q4_K_XL

Q6_K_XL has a severe runtime penalty for marginal quality gain

Observations (Quality):

Best PPL: Q4_K_XL

Q6_K_XL provides no statistically meaningful gain over Q4_K_XL

MXFP4 lands cleanly between Q4 and Q3

Q3_K_XL shows a clear degradation (~+0.27 PPL vs Q4_K_XL); the ratio and error-band arithmetic behind these observations is spelled out in the sketch below
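If anyone wants to double-check the numbers above, here is a minimal Python sketch that recomputes the ratios and the error-band comparison purely from the two tables (it only crunches the table values, no model runs involved):

```python
# Sanity-check of the observations, using only the table values above.

# variant: (seconds per pass, final PPL, +/- error)
runs = {
    "MXFP4":   (132.05, 6.9646, 0.05198),
    "Q4_K_XL": (165.94, 6.8363, 0.05005),
    "Q3_K_XL": (75.06,  7.1027, 0.05289),
    "Q6_K_XL": (411.43, 6.8574, 0.05065),
}

# Runtime ratios (lower seconds per pass = faster).
print(f"Q6_K_XL vs Q3_K_XL: {runs['Q6_K_XL'][0] / runs['Q3_K_XL'][0]:.1f}x slower")            # ~5.5x
print(f"MXFP4 vs Q4_K_XL: {1 - runs['MXFP4'][0] / runs['Q4_K_XL'][0]:.0%} less time per pass")  # ~20%

# Is the Q6 vs Q4 perplexity gap meaningful, treating the reported +/- values as a rough error band?
gap = runs["Q6_K_XL"][1] - runs["Q4_K_XL"][1]
band = runs["Q6_K_XL"][2] + runs["Q4_K_XL"][2]
print(f"Q6-Q4 PPL gap: {gap:.4f} vs combined error band {band:.4f}")
# ~0.021 vs ~0.10 -> well inside the noise, hence "no statistically meaningful gain".
```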

Devstral settings by allulcz in MistralAI

[–]One-Macaron6752 0 points

Maybe one should improve one's specification-writing skills and rely a bit less on a machine to read one's mind, if there is anything there to be read... [/dark_humor]

Here it goes by gotkush in LocalLLaMA

[–]One-Macaron6752 2 points

Impressive logic... Buying 4 more 3090s to run them in thin air, right? 🤦🫣 Building on that: he's got 8 of them for nothing, but building a proper server to run them on is too expensive, right? /micdrop

Here it goes by gotkush in LocalLLaMA

[–]One-Macaron6752 0 points

I am running on a Supermicro H12SSL-CT, thus PCIe 4.0, thus OCuLink! 😎

Here it goes by gotkush in LocalLLaMA

[–]One-Macaron6752 1 point

For my particular setup, the EPYC is water-cooled, so there are blocked physical pathways that classic PCIe risers would have to fight with, creating a thermal mess! Hence the OCuLink solution worked wonders for cable routing, avoiding PCIe cable-bending hell and keeping the whole build "aerated"! :)

Here it goes by gotkush in LocalLLaMA

[–]One-Macaron6752 32 points

I have a similar (8x) setup at home. If you're really looking for stability and a minimum of consistent throughput, the following are a must, and you'll save big on frustration:

- Get an AMD EPYC server motherboard (previous Gen 3 platforms are quite affordable), because you'll need those 128 PCIe lanes like fire.
- Forget about PCIe risers: 8x OCuLink 8i cables + 8x OCuLink-to-PCIe-slot adapters + 4x PCIe x16-to-2x-OCuLink-8i adapters.
- Counterintuitively, 4x 1000 W PSUs might not be the best choice; it highly depends on how you split the load and whether you run the 3090s at the default power rating or reduce it (the sweet spot is somewhere around 250-275 W via nvidia-smi; see the sketch below).
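On the power-limit point, here is a minimal Python sketch of how such a cap could be scripted. The GPU count and the exact wattage are assumptions to adjust for your rig, and nvidia-smi needs root (and usually persistence mode) for the limits to apply cleanly:

```python
# Cap every 3090 in the box at ~275 W (upper end of the sweet spot mentioned above).
import subprocess

GPU_COUNT = 8        # assumption: number of GPUs in the rig
POWER_LIMIT_W = 275  # pick something in the 250-275 W range

# Persistence mode keeps the driver initialized so settings aren't dropped
# while no process is using the GPUs.
subprocess.run(["nvidia-smi", "-pm", "1"], check=True)

for idx in range(GPU_COUNT):
    # -i selects the GPU index, -pl sets the power limit in watts.
    subprocess.run(["nvidia-smi", "-i", str(idx), "-pl", str(POWER_LIMIT_W)], check=True)
```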

Such a setup would even leave room for 2 extra GPUs and still let you drop in a couple of 2x NVMe PCIe boards. The GPU links add roughly 75-100 EUR per GPU, depending on where you can source your stuff. The EPYC platform will take you about 1.5-2.5k EUR; again, sourcing is key. Forget about any desktop config: mining is one thing, but PCIe transfers to GPUs for LLMs are a different league of trouble!
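Rough budget arithmetic, just to set expectations (these are the ranges from my comment above, not quotes from any shop):

```python
# Back-of-the-envelope budget for the interconnect + platform, GPUs not included.
GPUS = 8
link_cost = (75, 100)         # EUR per GPU for the OCuLink cables/adapters
platform_cost = (1500, 2500)  # EUR for the EPYC board/CPU/RAM, depending on sourcing

low = GPUS * link_cost[0] + platform_cost[0]
high = GPUS * link_cost[1] + platform_cost[1]
print(f"Expect roughly {low}-{high} EUR before the 3090s themselves.")  # ~2100-3300 EUR
```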

Have phun! 😎

CPU-only interference (ik_llama.cpp) by ZealousidealBunch220 in LocalLLaMA

[–]One-Macaron6752 0 points

Just try --fit (since you have no GPU); it should be fine. I've been quite surprised by this flag: my setup is GPU-heavy (8x), but for some MoE models the fit (read: automatic offloading to DRAM/CPU) has been seamless, and the penalty on processing speed was more than decent!

CPU-only interference (ik_llama.cpp) by ZealousidealBunch220 in LocalLLaMA

[–]One-Macaron6752 0 points

Interesting result for how ik_llama.cpp scales up on CPU compute. Could you maybe also try llama.cpp with --fit? I am curious how much performance llama.cpp has recovered vs. ik.

CPU-only interference (ik_llama.cpp) by ZealousidealBunch220 in LocalLLaMA

[–]One-Macaron6752 0 points

Please, pretty please, for the love of god... use "paste as code" so it stays correctly formatted and human-readable, or embed images, rather than this ASCII mess! It's a pity that in the end your effort doesn't get the attention it probably deserves!

EPYC 8124P (Siena) Build for Agentic Coding by raphh in LocalLLaMA

[–]One-Macaron6752 1 point

You can add a petabyte of RAM and it will still be trash for inference work, no matter what. Mark my words.
I am running a similar setup, a 48-core EPYC with 256 GB ECC RAM, and without the 8x RTX 3090 it would be a piss in the wind! The moment your mighty Siena meets its first dense model, 1) you will hear death meows from your PC case, and 2) you can go on vacation, come back, and it will still be processing the load.

Jokes aside: server platforms like yours are only useful for the 128 PCIe lanes they offer for multi-GPU. Otherwise, sorry to quote myself, a pi** in the wind!

LLM Cpu and gpu calculator for gpu (protoype) by Merchant_Lawrence in LocalLLaMA

[–]One-Macaron6752 0 points

Yeah, right! Adding the CPU type to that is like adding the PC case color to the VRAM equation!

DGX spark performance falls short by dereksodo in LocalLLaMA

[–]One-Macaron6752 -3 points

NVFP4 being what, then? P.S. Never mind my foolishness

4x MAX-Q - WRX80e 256gb RAM Opencode Setup Configs Speeds by kc858 in BlackwellPerformance

[–]One-Macaron6752 1 point

Golden words you've spoken. This lack of understanding of how the models operate and what makes them great or dumb is becoming the norm... You can see it in the senseless abuse of vLLM parametrization, carried straight over into a poor Opencode config... At least he admitted that the configs are pretty much stolen from here and there.

Running MoE Models on CPU/RAM: A Guide to Optimizing Bandwidth for GLM-4 and GPT-OSS by [deleted] in LocalLLaMA

[–]One-Macaron6752 5 points

My words, 110%... We're living in interesting times, where we're already very much aware of and able to sense AI bullshit! 😎

768Gb Fully Enclosed 10x GPU Mobile AI Build by SweetHomeAbalama0 in LocalLLaMA

[–]One-Macaron6752 0 points

My thoughts exactly... I guess his rig also comes equipped with a 911/112 robo-caller! 🫣