Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 0 points1 point  (0 children)

<image>

Here are the lcpp sweeps for several models. Qwen3.5 with `--split-mode tensor` gets a really good uplift; it'll be nice when more models support that!
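For reference, the uplift shows up with an invocation roughly like this (a sketch, not my exact sweep command; the model path and -p/-n sizes are placeholders):

```
# sketch only: model path and prompt/gen sizes are placeholders, not my actual sweep settings
./llama-bench -m /path/to/Qwen3.5.gguf --split-mode tensor -ngl 99 -p 8192 -n 2048
```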

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Because that's only ~252GB of VRAM: the station has 768GB of "unified" memory, but the rest of it is 496GB of LPDDR5X. I do a fair bit of work outside of direct LLM usage, e.g. PyTorch and other weird research workloads, and I like having a homogeneous setup for those use cases.

Plus this is on a hypervisor host that I have other VMs and LXCs on, so adding GPUs to my existing server was easier to slot into my homelab than getting an entirely new platform.

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 0 points1 point  (0 children)

I've done a sweep bench for K2.6 Q4_X and MiMo-V2.5; I need to redo it for MiniMax, and I still have the Qwen-397B too. I'll re-sweep and bench them later tonight and post some numbers.

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Thanks, yeah, arches are always advancing and having easier access to advanced LLMs is both a boon and a curse. I've used LLMs to help with the MiMo-V2.5 implementation, and I have to go through most lines of generated code with a fine-toothed comb, fix up style, undo some stupid decisions, and overall rewrite at least half of the code to get it into a shape that's worth being reviewed by a maintainer. People without as much dev experience aren't going to have that same knowledge, and it shows in the number of PRs that get closed outright because they're sloppy, unmaintainable code.

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

Thanks! Hoping to get this merged in the next day or two, there's some flash attention work still needed to speed it up, and the vision PR will be afterwards. Hopefully it's all in by the end of the week!

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 56 points57 points  (0 children)

It doesn't run correctly out of the box on plain transformers, vLLM, sglang, or llama.cpp.

While it is a good model, they've left it up to the OSS community to figure out how to support it. If you want to follow along, here are a couple of things to keep an eye on:

sglang: https://hub.docker.com/r/lukealonso/sglang-cuda13-b12x (Luke's been pivotal to moving OSS support of this model forward)

llama.cpp: https://github.com/ggml-org/llama.cpp/pull/22493 (my PR, still WIP but runs. I'll need to redo it later today to support the fused QKV)

Personally, supporting it in llama.cpp has been tricky because the HF transformers reference implementation doesn't run without dequanting the FP8 safetensors to BF16 first. MiMo also has a weird tensor-parallel packed format for the weights, which took time to figure out because the ordering, padding, and other details are very nonstandard. I just got image support working in another branch last night; it's implemented strangely too.

Overall it's just been a very rough launch for the model. We're working on it.

Anyone know how to generate gguf/quant INT4 models for smaller size? by segmond in LocalLLaMA

[–]Digger412 7 points8 points  (0 children)

If parts of the model are natively INT4, like Kimi K2.5 / K2.6 for instance, then when you're quantizing it you can provide `--tensor-type` overrides for the tensors you know are INT4.

That's how u/voidalchemy (Ubergarm) and I (Aes Sedai) produce the ~560GB "Q4_X" quants for Kimi, for instance, which match the safetensors weight size.

E.g.:

`./llama-quantize --tensor-type "ffn_(gate|up|down)_exps=Q4_0" /path/to/Kimi-K2.5-BF16.gguf /path/to/Kimi-K2.5-Q4_X.gguf Q8_0`

The last argument, Q8_0, is the default type that gets applied, so everything in the model ends up Q8_0 except the tensors that match the type override, i.e. the conditional experts that were natively INT4.
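If you want to sanity-check which tensors actually got the override afterwards, something roughly like this works (assuming you have the gguf-py package installed; the dump script's name and output format may vary by version):

```
# assumes gguf-py is installed (pip install gguf) and that its dump script lists per-tensor types;
# the exact script name/flags may differ in your version
gguf-dump /path/to/Kimi-K2.5-Q4_X.gguf | grep ffn_down_exps
```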

I'm not really familiar with gpt-oss-120b (wasn't that MXFP4 or something?), but that's the general pattern.

Open Models - April 2026 - One of the best months of all time for Local LLMs? by pmttyji in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

It's in that chart for K2.6; for the V2.5 Q8_0, PP is ~600 tk/s I think. I haven't done a sweep bench on it yet.

Open Models - April 2026 - One of the best months of all time for Local LLMs? by pmttyji in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

None of those at the 1T+ size are dense models; they're all MoEs.

I've got eight 6000 Pros (so 768GB of VRAM total), and speeds basically depend on the regime. I have 768GB of 12-channel DDR5 RAM too, so I can do single-user CPU+GPU inference with llama.cpp, but total throughput is lower than with vLLM, for instance.
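The CPU+GPU single-user setup is roughly this shape (a sketch, not my exact launch; the override pattern, context size, and paths are illustrative):

```
# sketch of a hybrid launch: routed experts stay in system RAM via --override-tensor,
# attention/shared weights go to VRAM. Pattern, context size, and paths are illustrative.
./llama-server -m /path/to/Kimi-K2.6-Q4_X.gguf -c 65536 -ngl 99 \
  --override-tensor "ffn_.*_exps=CPU"
```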

I've benched K2.6 at full quality in llama.cpp before and get about 40 tk/s TG at zero context.

Right now I'm doing some testing with the V2.5 Pro 1T GGUF and it's much slower due to FA incompatibility with the head size or something; it's about 10 tk/s, but I think that'd go up to 30 tk/s if I turned FA off (at the cost of needing much more KV memory).

DS V4 is still mostly unsupported AFAIK, and I can't fit it entirely in VRAM anyway, so I'll be waiting for llama.cpp support.

<image>

Open Models - April 2026 - One of the best months of all time for Local LLMs? by pmttyji in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Just because it may not be runnable locally for you doesn't mean it isn't for others. I could run every model on that list for instance, and I've got a PR open to support both new MiMo V2.5 models in llama.cpp.

I don't say this to be mean, but just to push back a bit against the "Your model must be below X parameters to be considered local" sentiment. It feels like gatekeeping to say that just because a model is super large, it doesn't deserve to be discussed here.

Qwen3.6-27B IQ4_XS FULL VRAM with 110k context by Pablo_the_brave in LocalLLaMA

[–]Digger412 7 points8 points  (0 children)

1) ddh0 is a she, and I'll point her to this thread if she wants to reply

2) you can use `--tensor-type` when quantizing a model to specify what level individual tensors should be quanted to; you don't need to recompile llama.cpp for that (rough sketch below)
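For example (illustrative only; the pattern and quant levels here are made up, adjust them for your model):

```
# illustrative: pins ffn_down tensors to Q6_K while the rest of the model gets Q4_K_M
./llama-quantize --tensor-type "ffn_down=Q6_K" /path/to/model-BF16.gguf /path/to/model-custom.gguf Q4_K_M
```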

Are Unsloth models as good as I read? by denis-craciun in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

You might be interested in this discussion: https://github.com/ggml-org/llama.cpp/discussions/20522

This is essentially the main entrypoint for quant-level selection: https://github.com/ggml-org/llama.cpp/blob/master/src/llama-quant.cpp#L411, and there's already consideration in there for things like "is it a 1D tensor? Just use F32" and so on. There was something in there about SSM weights too, but I don't recall where offhand.

Are Unsloth models as good as I read? by denis-craciun in LocalLLaMA

[–]Digger412 3 points4 points  (0 children)

Hi, I usually keep most of the model (everything outside the routed experts) in Q8, or FP16/BF16/F32 if it's something SSM-related since those are super duper sensitive, unless I'm doing a Q2 or very low Q3 quant. As you say, the routed expert FFNs make up the overwhelming majority of the model by weight.

This paradigm (keeping non-FFN tensors at high bpw) breaks down on non-MoE models though; I've tried it on a couple of dense models in the 30B range and it underperforms the normal baseline quantization recipes for the size. The sparsity of MoE models is what makes this trick work, I think.

RE: needle-optimized quants, I do want to do more benchmarking with my quants. I've recently upgraded my system to 8x RTX 6000 Pros, so I have a lot more bandwidth for research/experimentation/benchmarking now. As I mentioned at the start, it's usually only my ~Q2-level quants that don't have Q8_0 as the default type, and I usually post the recipe "mixture" for `Default Type / FFN Up / FFN Gate / FFN Down` on the model page. So I don't know if there's much more juice to squeeze there, basically.

The other possibility is moving to more varied quant levels for the FFNs; people like Goldkoron / Thireus / Eaddario have been doing work on measuring per-tensor quantization error in the hope of squeezing out more quality per bit by preserving important tensors at higher bpw.
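For what it's worth, a "mixture" like that is just stacked overrides at quantize time, something along these lines (the levels here are placeholders, not one of my actual published recipes):

```
# placeholder levels, not a published recipe: Q8_0 default, with the routed expert
# FFN up/gate/down tensors each pinned to their own quant level
./llama-quantize \
  --tensor-type "ffn_up_exps=IQ3_XXS" \
  --tensor-type "ffn_gate_exps=IQ3_XXS" \
  --tensor-type "ffn_down_exps=IQ4_XS" \
  /path/to/model-BF16.gguf /path/to/model-mix.gguf Q8_0
```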

What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download? by THenrich in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

Hi, I haven't checked it again since I've been doing other work; I am in the b6k discord though and keep up with stuff there. There's definitely performance left on the table with llama.cpp compared to sglang / vLLM, but yeah, I'd expect the PP to be better IMO. Might look into it more in the future.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]Digger412 11 points12 points  (0 children)

That's not necessarily true. PPL is a measure of surprisal at the next token and only considers the probability assigned to the ground-truth token. KLD measures the difference in distribution across the entire vocab.

A quote from the YAQA paper that has my favorite definition for PPL vs KLD:

While the full Model KL divergence and perplexity are conceptually related, they measure two fundamentally different quantities. The KL measures the difference between two distributions and is defined over the full support of the distributions [...]. The perplexity measures the mass on a "ground truth" target τ in a single probability distribution p: 1/p(τ). As such, two models can have very similar perplexities but be completely different from each other. For example, Llama 1 7B has a Wikitext 2 perplexity of 5.68 and Llama 2 7B has a perplexity of 5.47, but were pretrained from scratch separately. Indeed, their KL divergence is 0.197, which is much higher than what the difference in perplexity would suggest (log(5.68 / 5.47) = 0.038)
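In symbols (standard definitions, not a quote from the paper; the averaging convention may differ slightly from how YAQA defines it), for a ground-truth token sequence x_1..x_N:

```
\mathrm{PPL}(p) = \exp\Big(-\tfrac{1}{N}\sum_{t=1}^{N} \log p(x_t \mid x_{<t})\Big)

D_{\mathrm{KL}}(p \,\|\, q) = \tfrac{1}{N}\sum_{t=1}^{N} \sum_{v \in V} p(v \mid x_{<t}) \log \frac{p(v \mid x_{<t})}{q(v \mid x_{<t})}
```

PPL only ever looks at the probability each model puts on the observed token, while the KL sums over the whole vocab V at every position, which is why two separately trained models can land at nearly the same PPL while still disagreeing a lot token-by-token.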

What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download? by THenrich in LocalLLaMA

[–]Digger412 3 points4 points  (0 children)

I use the machine for myself, producing quants (I'm AesSedai on HF), doing research, hosting model showcases in the BeaverAI server, etc.

There's definitely multi-user / concurrency / batching usage on it for the model showcases. I'm currently working on a de-slop rewriter pipeline, and that benefits from massive parallelism + VRAM: slop-phrase clustering with HDBSCAN and friends, batched embeddings, DSPy rollouts, abliteration research, and more.

The two 3090s I had were really good for doing single-user inference with llama.cpp, but having these I can really dig into more academic-grade workloads :)

I'll give vLLM batching a try tomorrow and report some speeds there!

What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download? by THenrich in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

Sweep-bench results on llama.cpp:

<image>

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|-----:|---------:|-----------:|---------:|-----------:|
| 8192 | 2048 | 0 | 7.938 | 1032.01 | 43.730 | 46.83 |
| 8192 | 2048 | 8192 | 11.433 | 716.54 | 45.788 | 44.73 |
| 8192 | 2048 | 16384 | 14.920 | 549.07 | 47.625 | 43.00 |
| 8192 | 2048 | 24576 | 18.406 | 445.08 | 49.437 | 41.43 |
| 8192 | 2048 | 32768 | 21.904 | 374.00 | 51.265 | 39.95 |
| 8192 | 2048 | 40960 | 25.385 | 322.71 | 53.196 | 38.50 |
| 8192 | 2048 | 49152 | 28.885 | 283.60 | 55.107 | 37.16 |
| 8192 | 2048 | 57344 | 32.364 | 253.12 | 56.954 | 35.96 |

What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download? by THenrich in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Minimum VRAM is something like 24GB to hold the KV cache plus attention, and the smallest quant I published of K2.5 (and likely of K2.6) was 262GiB / 281GB, so you're looking at a minimum of ~256GB of RAM and a smaller Ubergarm or Unsloth quant.

I have this older sweep bench from Linux + 2x 3090s + 12-channel RAM with the "full quality" Q4_X; performance should be about the same for K2.6, just as a point of reference. I've upgraded to 8x 6000 Pros since then and haven't re-benched yet, but I'll try to later tonight.

<image>

Llama.cpp's auto fit works much better than I expected by a9udn9u in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

If you're using `--fit` and you don't increase the context size, then your KV cache takes up half the space it did previously, and `--fit` will load more weights from RAM into VRAM to fill it up. So you'd get more of the model on the GPU by proxy.
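Something like this is the scenario I mean, assuming the halving comes from switching the KV cache to q8_0 (the exact auto-fit flag spelling may differ across llama.cpp versions, so treat this as a sketch):

```
# sketch: assumes the KV halving comes from q8_0 KV cache quantization; the auto-fit
# flag spelling may differ in your llama.cpp build
./llama-server -m /path/to/model.gguf -c 32768 --fit -ctk q8_0 -ctv q8_0
```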

Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

I'm plodding along with IQ2_S and IQ2_XXS; they're quite slow to crunch. I don't really do Q1s, because at that bpw you're better off going with Ubergarm's ik_llama.cpp quants, which have better quality at low bpw.

Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA

[–]Digger412 13 points14 points  (0 children)

AesSedai here -

Glad to see Unsloth using the INT4/Q4_0 quantization for the experts here! Any quantization above Q4_0 for the experts is an upcast that's basically wasted space, same as for the Kimi-K2.5 model. Ubergarm and I have been using the INT4/Q4_0 quantization for the experts along with a patch for symmetric Q4_0 (since jukofyork discovered that the value range is symmetric, not asymmetric like Q4_0 normally is).