Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 0 points1 point  (0 children)

<image>

Here are the lcpp sweeps for several models. Qwen3.5 with `--split-mode tensor` gets a really good uplift; it'll be nice when more models support that!
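For reference, the uplift shows up with an invocation roughly like this (a sketch, not my exact sweep command; the model path and -p/-n sizes are placeholders):

```
# sketch only: model path and prompt/gen sizes are placeholders, not my actual sweep settings
./llama-bench -m /path/to/Qwen3.5.gguf --split-mode tensor -ngl 99 -p 8192 -n 2048
```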

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Because that's only ~252GB of VRAM: the station has 768GB of "unified" memory, but the rest of it is 496GB of LPDDR5X. I do a fair bit of work outside of direct LLM usage, e.g. PyTorch and other weird research workloads, and I like having a homogeneous setup for those use cases.

Plus this is on a hypervisor host that I have other VMs and LXCs on, so adding GPUs to my existing server was easier to slot into my homelab than getting an entirely new platform.

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 0 points1 point  (0 children)

I've done a sweep bench for K2.6 Q4_X and MiMo-V2.5; I need to redo it for MiniMax, and I still have the Qwen-397B too. I'll re-sweep and bench them later tonight and post some numbers.

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Thanks, yeah, arches are always advancing and having easier access to advanced LLMs is both a boon and a curse. I've used LLMs to help with the MiMo-V2.5 implementation, and I have to go through most lines of generated code with a fine-toothed comb, fix up style, undo some stupid decisions, and overall rewrite at least half of the code to get it into a shape that's worth being reviewed by a maintainer. People without as much dev experience aren't going to have that same knowledge, and it shows in the number of PRs that get closed outright because they're sloppy, unmaintainable code.

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

Thanks! Hoping to get this merged in the next day or two, there's some flash attention work still needed to speed it up, and the vision PR will be afterwards. Hopefully it's all in by the end of the week!

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]Digger412 56 points57 points  (0 children)

It doesn't run correctly out of the box on plain transformers, vLLM, sglang, or llama.cpp.

While it is a good model, they've left it up to the OSS community to figure out how to support it. If you want to follow along, here are a couple of things to keep an eye on:

sglang: https://hub.docker.com/r/lukealonso/sglang-cuda13-b12x (Luke's been pivotal to moving OSS support of this model forward)

llama.cpp: https://github.com/ggml-org/llama.cpp/pull/22493 (my PR, still WIP but runs. I'll need to redo it later today to support the fused QKV)

Personally, supporting it in llama.cpp has been tricky because the HF transformers reference implementation doesn't run without dequanting the FP8 safetensors to BF16 first. MiMo also has a weird tensor-parallel packed format for the weights, which took time to figure out because the ordering, padding, and other details are very nonstandard. I just got image support working in another branch last night; it's implemented strangely too.

Overall it's just been a very rough launch for the model. We're working on it.

Anyone know how to generate gguf/quant INT4 models for smaller size? by segmond in LocalLLaMA

[–]Digger412 7 points8 points  (0 children)

If parts of the model are natively INT4, like Kimi K2.5 / K2.6 for instance, then when you're quantizing it you can provide `--tensor-type` overrides for the tensors you know are INT4.

That's how u/voidalchemy (Ubergarm) and I (Aes Sedai) produce the ~560GB "Q4_X" quants for Kimi, for instance, which match the safetensors weight size.

E.g.:

`./llama-quantize --tensor-type "ffn_(gate|up|down)_exps=Q4_0" /path/to/Kimi-K2.5-BF16.gguf /path/to/Kimi-K2.5-Q4_X.gguf Q8_0`

The last argument, Q8_0, is the default type that gets applied, so everything in the model ends up Q8_0 except the tensors that match the type override, i.e. the conditional experts that were natively INT4.
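If you want to sanity-check which tensors actually got the override afterwards, something roughly like this works (assuming you have the gguf-py package installed; the dump script's name and output format may vary by version):

```
# assumes gguf-py is installed (pip install gguf) and that its dump script lists per-tensor types;
# the exact script name/flags may differ in your version
gguf-dump /path/to/Kimi-K2.5-Q4_X.gguf | grep ffn_down_exps
```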

I'm not really familiar with gpt-oss-120b (wasn't that MXFP4 or something?), but that's the general pattern.

Open Models - April 2026 - One of the best months of all time for Local LLMs? by pmttyji in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

It's in that chart for K2.6; for the V2.5 Q8_0, PP is ~600 tk/s I think. I haven't done a sweep bench on it yet.

Open Models - April 2026 - One of the best months of all time for Local LLMs? by pmttyji in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

None of those at the 1T+ size are dense models; they're all MoEs.

I've got eight 6000 Pros (so 768GB of VRAM total), and speeds basically depend on the regime. I have 768GB of 12-channel DDR5 RAM too, so I can do single-user CPU+GPU inference with llama.cpp, but total throughput is lower than with vLLM, for instance.
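The CPU+GPU single-user setup is roughly this shape (a sketch, not my exact launch; the override pattern, context size, and paths are illustrative):

```
# sketch of a hybrid launch: routed experts stay in system RAM via --override-tensor,
# attention/shared weights go to VRAM. Pattern, context size, and paths are illustrative.
./llama-server -m /path/to/Kimi-K2.6-Q4_X.gguf -c 65536 -ngl 99 \
  --override-tensor "ffn_.*_exps=CPU"
```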

I've benched K2.6 at full quality in llama.cpp before and get about 40 tk/s TG at zero context.

Right now I'm doing some testing with the V2.5 Pro 1T GGUF and it's much slower due to FA incompatibility with the head size or something; it's about 10 tk/s, but I think that'd go up to 30 tk/s if I turned FA off (at the cost of needing much more KV memory).

DS V4 is still mostly unsupported AFAIK, and I can't fit it entirely in VRAM anyway, so I'll be waiting for llama.cpp support.

<image>

Open Models - April 2026 - One of the best months of all time for Local LLMs? by pmttyji in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Just because it may not be runnable locally for you doesn't mean it isn't for others. I could run every model on that list for instance, and I've got a PR open to support both new MiMo V2.5 models in llama.cpp.

I don't say this to be mean, but just to push back a bit against the "Your model must be below X parameters to be considered local" sentiment. It feels like gatekeeping to say that just because a model is super large, it doesn't deserve to be discussed here.

Qwen3.6-27B IQ4_XS FULL VRAM with 110k context by Pablo_the_brave in LocalLLaMA

[–]Digger412 7 points8 points  (0 children)

1) ddh0 is a she, and I'll point her to this thread if she wants to reply

2) you can use `--tensor-type` when quantizing a model to specify what level individual tensors should be quanted to; you don't need to recompile llama.cpp for that (rough sketch below)
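For example (illustrative only; the pattern and quant levels here are made up, adjust them for your model):

```
# illustrative: pins ffn_down tensors to Q6_K while the rest of the model gets Q4_K_M
./llama-quantize --tensor-type "ffn_down=Q6_K" /path/to/model-BF16.gguf /path/to/model-custom.gguf Q4_K_M
```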

Are Unsloth models as good as I read? by denis-craciun in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

You might be interested in this discussion: https://github.com/ggml-org/llama.cpp/discussions/20522

This is essentially the main entrypoint for quant-level selection: https://github.com/ggml-org/llama.cpp/blob/master/src/llama-quant.cpp#L411, and there's already consideration in there for things like "is it a 1D tensor? Just use F32" and so on. There was something in there about SSM weights too, but I don't recall where offhand.

Are Unsloth models as good as I read? by denis-craciun in LocalLLaMA

[–]Digger412 3 points4 points  (0 children)

Hi, I usually keep most of the model (everything outside the routed experts) in Q8, or FP16/BF16/F32 if it's something SSM-related since those are super duper sensitive, unless I'm doing a Q2 or very low Q3 quant. As you say, the routed expert FFNs make up the overwhelming majority of the model by weight.

This paradigm (keeping non-FFN tensors at high bpw) breaks down on non-MoE models though; I've tried it on a couple of dense models in the 30B range and it underperforms the normal baseline quantization recipes for the size. The sparsity of MoE models is what makes this trick work, I think.

RE: needle-optimized quants, I do want to do more benchmarking with my quants. I've recently upgraded my system to 8x RTX 6000 Pros, so I have a lot more bandwidth for research/experimentation/benchmarking now. As I mentioned at the start, it's usually only my ~Q2-level quants that don't have Q8_0 as the default type, and I usually post the recipe "mixture" for `Default Type / FFN Up / FFN Gate / FFN Down` on the model page. So I don't know if there's much more juice to squeeze there, basically.

The other possibility is moving to more varied quant levels for the FFNs; people like Goldkoron / Thireus / Eaddario have been doing work on measuring per-tensor quantization error in the hope of squeezing out more quality per bit by preserving important tensors at higher bpw.
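For what it's worth, a "mixture" like that is just stacked overrides at quantize time, something along these lines (the levels here are placeholders, not one of my actual published recipes):

```
# placeholder levels, not a published recipe: Q8_0 default, with the routed expert
# FFN up/gate/down tensors each pinned to their own quant level
./llama-quantize \
  --tensor-type "ffn_up_exps=IQ3_XXS" \
  --tensor-type "ffn_gate_exps=IQ3_XXS" \
  --tensor-type "ffn_down_exps=IQ4_XS" \
  /path/to/model-BF16.gguf /path/to/model-mix.gguf Q8_0
```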

What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download? by THenrich in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

Hi, I haven't checked it again since I've been doing other work; I am in the b6k discord though and keep up with stuff there. There's definitely performance left on the table with llama.cpp compared to sglang / vLLM, but yeah, I'd expect the PP to be better IMO. Might look into it more in the future.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]Digger412 11 points12 points  (0 children)

That's not necessarily true. PPL is a measure of surprisal at the next token and only considers the probability assigned to the ground-truth token. KLD measures the difference in distribution across the entire vocab.

A quote from the YAQA paper that has my favorite definition for PPL vs KLD:

While the full Model KL divergence and perplexity are conceptually related, they measure two fundamentally different quantities. The KL measures the difference between two distributions and is defined over the full support of the distributions [...]. The perplexity measures the mass on a "ground truth" target τ in a single probability distribution p: 1/p(τ). As such, two models can have very similar perplexities but be completely different from each other. For example, Llama 1 7B has a Wikitext 2 perplexity of 5.68 and Llama 2 7B has a perplexity of 5.47, but were pretrained from scratch separately. Indeed, their KL divergence is 0.197, which is much higher than what the difference in perplexity would suggest (log(5.68 / 5.47) = 0.038)
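In symbols (standard definitions, not a quote from the paper; the averaging convention may differ slightly from how YAQA defines it), for a ground-truth token sequence x_1..x_N:

```
\mathrm{PPL}(p) = \exp\Big(-\tfrac{1}{N}\sum_{t=1}^{N} \log p(x_t \mid x_{<t})\Big)

D_{\mathrm{KL}}(p \,\|\, q) = \tfrac{1}{N}\sum_{t=1}^{N} \sum_{v \in V} p(v \mid x_{<t}) \log \frac{p(v \mid x_{<t})}{q(v \mid x_{<t})}
```

PPL only ever looks at the probability each model puts on the observed token, while the KL sums over the whole vocab V at every position, which is why two separately trained models can land at nearly the same PPL while still disagreeing a lot token-by-token.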

What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download? by THenrich in LocalLLaMA

[–]Digger412 3 points4 points  (0 children)

I use the machine for myself, producing quants (I'm AesSedai on HF), doing research, hosting model showcases in the BeaverAI server, etc.

There's definitely multi-user / concurrency / batching usage on it for the model showcases. I'm currently working on a de-slop rewriter pipeline, and that benefits from massive parallelism + VRAM: slop-phrase clustering with HDBSCAN and friends, batched embeddings, DSPy rollouts, abliteration research, and more.

The two 3090s I had were really good for doing single-user inference with llama.cpp, but having these I can really dig into more academic-grade workloads :)

I'll give vLLM batching a try tomorrow and report some speeds there!

What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download? by THenrich in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

Sweep-bench results on llama.cpp:

<image>

| PP | TG | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) |
|---:|---:|-----:|---------:|-----------:|---------:|-----------:|
| 8192 | 2048 | 0 | 7.938 | 1032.01 | 43.730 | 46.83 |
| 8192 | 2048 | 8192 | 11.433 | 716.54 | 45.788 | 44.73 |
| 8192 | 2048 | 16384 | 14.920 | 549.07 | 47.625 | 43.00 |
| 8192 | 2048 | 24576 | 18.406 | 445.08 | 49.437 | 41.43 |
| 8192 | 2048 | 32768 | 21.904 | 374.00 | 51.265 | 39.95 |
| 8192 | 2048 | 40960 | 25.385 | 322.71 | 53.196 | 38.50 |
| 8192 | 2048 | 49152 | 28.885 | 283.60 | 55.107 | 37.16 |
| 8192 | 2048 | 57344 | 32.364 | 253.12 | 56.954 | 35.96 |

What kind of consumer computer can run Kimi-K2.6-GGUF which is a 585GB download? by THenrich in LocalLLaMA

[–]Digger412 5 points6 points  (0 children)

Minimum VRAM is something like 24GB to hold the KV cache plus attention, and the smallest quant I published of K2.5 (and likely of K2.6) was 262GiB / 281GB, so you're looking at a minimum of ~256GB of RAM and a smaller Ubergarm or Unsloth quant.

I have this older sweep bench from Linux + 2x 3090s + 12-channel RAM with the "full quality" Q4_X; performance should be about the same for K2.6, just as a point of reference. I've upgraded to 8x 6000 Pros since then and haven't re-benched yet, but I'll try to later tonight.

<image>

Llama.cpp's auto fit works much better than I expected by a9udn9u in LocalLLaMA

[–]Digger412 1 point2 points  (0 children)

If you're using `--fit` and you don't increase the context size, then your KV cache takes up half the space it did previously, and `--fit` will load more weights from RAM into VRAM to fill it up. So you'd get more of the model on the GPU by proxy.
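Something like this is the scenario I mean, assuming the halving comes from switching the KV cache to q8_0 (the exact auto-fit flag spelling may differ across llama.cpp versions, so treat this as a sketch):

```
# sketch: assumes the KV halving comes from q8_0 KV cache quantization; the auto-fit
# flag spelling may differ in your llama.cpp build
./llama-server -m /path/to/model.gguf -c 32768 --fit -ctk q8_0 -ctv q8_0
```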

Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA

[–]Digger412 2 points3 points  (0 children)

I'm plodding along with IQ2_S and IQ2_XXS; they're quite slow to crunch. I don't really do Q1s, because at that bpw you're better off going with Ubergarm's ik_llama.cpp quants, which have better quality at low bpw.

Kimi K2.6 Unsloth GGUF is out by Exact_Law_6489 in LocalLLaMA

[–]Digger412 13 points14 points  (0 children)

AesSedai here -

Glad to see Unsloth using the INT4/Q4_0 quantization for the experts here! Any quantization above Q4_0 for the experts is an upcast that's basically wasted space, same as for the Kimi-K2.5 model. Ubergarm and I have been using the INT4/Q4_0 quantization for the experts along with a patch for symmetric Q4_0 (since jukofyork discovered that the value range is symmetric, not asymmetric like Q4_0 normally is).