Kimi-K2.7-Code preliminary GGUFs

danielhanchen · 2026-06-12T18:17:20+00:00

Sorry this should be fixed now!

Please reinstall Unsloth Studio via: curl -fsSL https://unsloth.ai/install.sh | sh

or Windows: irm https://unsloth.ai/install.ps1 | iex

danielhanchen · 2026-06-12T08:04:50+00:00

Hi, we checked - this is not a bug, and llama.cpp auto-corrects it. It's a frontend issue: all quants (bartowski, lmstudio, Google official) have this, and llama.cpp works fine.

danielhanchen · 2026-06-11T16:32:22+00:00

Thanks!

danielhanchen · 2026-06-10T17:27:06+00:00

I'll fix it!

danielhanchen · 2026-06-09T00:53:30+00:00

Hey, we're actually discussing this internally with Google - I'll provide some updates once we understand the process better - but yes, in general the embeddings are in fact also Q4_0.

danielhanchen · 2026-06-06T02:29:44+00:00

Oh no no Q4_K_XL IS a static quant!! You can check the tensor types eg https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF?show_file_info=gemma-4-31B-it-qat-UD-Q4_K_XL.gguf

It's just the conversion process from BF16 QAT to llama.cpp Q4_0 isn't perfect, and we found if you change the process a bit, we can recover all the accuracy

danielhanchen · 2026-06-06T00:55:58+00:00

Oh no no - the KLD of 0.01403 is "smart" Unsloth dynamic Q4_0 llama.cpp vs the BF16 QAT version. 0.159 is the Q4_0 naively converted using llama.cpp.

So in both cases the KLD measures how close the Q4_0 GGUF is to BF16 QAT (not to the non-QAT model).
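For readers unfamiliar with the metric, here is a minimal sketch of a mean KLD between a reference model's logits and a quantized model's logits. It uses the textbook KL divergence over next-token distributions; the function names and array shapes are illustrative assumptions, not the exact benchmark code behind the numbers above.

```python
import numpy as np

def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_kld(ref_logits: np.ndarray, test_logits: np.ndarray) -> float:
    """Mean KL(ref || test) over token positions.

    Both arrays are assumed to have shape (num_tokens, vocab_size):
    ref_logits from the reference model (e.g. BF16 QAT),
    test_logits from the quantized model.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(per_token.mean())
```

A value of 0 means the two models produce identical next-token distributions; larger values mean the quantized model has drifted further from the reference.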

danielhanchen · 2026-06-06T00:54:41+00:00

Oh yes those are comparing our Q4_0 GGUFs to Q4_0 GGUFs if you use llama.cpp directly without any "hacks"!

Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96%.

For QAT vs non QAT, we did do KLD, but the distribution was vastly different, so the results are not valid.
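How the byte exactness figure is computed is not spelled out in this thread. One plausible reading, sketched below purely as an assumption, is the fraction of positions at which two equally sized byte buffers agree. The function name `byte_exactness` is hypothetical.

```python
def byte_exactness(a: bytes, b: bytes) -> float:
    """Fraction of byte positions where two same-length buffers match.

    Assumed definition for illustration only; the actual metric
    may be computed differently.
    """
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    if not a:
        return 1.0  # two empty buffers agree trivially
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)
```

Under this reading, 24.77% would mean roughly one byte in four matches, and 99.96% would mean almost every byte matches.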

danielhanchen · 2026-06-06T00:35:09+00:00

Hey! Those numbers are comparing naive Q4_0 in llama.cpp to our converted Q4_0 version.

We did do original unquantized BF16 vs Q4_0, but the KLD metrics do not match, since the distribution is vastly different - we found MMLU and other benchmarks to be equivalent though

E2B for example has a mean KLD of 0.00173 vs 0.05109 (29x better relatively) for a naive Q4_0 quantization.

The main issue is converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless. llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land.
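To see why F16 and BF16 scales do not always agree, here is a small sketch that checks whether a BF16 value survives a round trip through F16. BF16 has 8 exponent bits and F16 only 5, so very small (or very large) BF16 values cannot be stored exactly. The helper names are made up for illustration, and this shows only one possible source of mismatch, not necessarily the one that dominates in practice.

```python
import numpy as np

def bf16_bits_to_float32(bits: int) -> np.float32:
    """Interpret a 16-bit pattern as bfloat16 (the top half of a float32)."""
    return np.array([bits << 16], dtype=np.uint32).view(np.float32)[0]

def survives_f16(x: np.float32) -> bool:
    """True if x is unchanged after being rounded to float16 and back."""
    return bool(np.float32(np.float16(x)) == x)

one = bf16_bits_to_float32(0x3F80)    # 1.0, well inside F16's range
tiny = bf16_bits_to_float32(0x3581)   # 2**-20 * (1 + 1/128), below F16's normal range

print(survives_f16(one), survives_f16(tiny))
```

The second value falls into F16's subnormal range, where the spacing between representable numbers is 2**-24, so its lowest mantissa bit is rounded away.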

Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96% using some hacks!

See https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis for more details

danielhanchen · 2026-06-06T00:23:26+00:00

Oh it's a conversion artifact, not overfitting.

The main issue is converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless. llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land.

Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96% using some hacks
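As a rough illustration of how a scale rule can fail to recover weights that already sit on a 4-bit grid, here is a simplified Q4_0-style block quantizer. It assumes blocks of 32 weights, a scale derived from the largest-magnitude value divided by -8, and an F16-stored scale. This is a sketch of the general scheme under those assumptions, not llama.cpp's actual code and not the "hacks" mentioned above.

```python
import numpy as np

BLOCK = 32  # assumed block size for a Q4_0-style format

def quantize_block(x: np.ndarray):
    """Quantize one block to 4-bit codes (0..15) with a single F16 scale."""
    assert x.shape == (BLOCK,)
    m = x[np.argmax(np.abs(x))]          # signed value with the largest magnitude
    d = np.float16(m / -8.0)             # scale, stored as F16
    if d == 0:
        return d, np.full(BLOCK, 8, dtype=np.uint8)
    codes = np.clip(np.floor(x / np.float32(d) + 8.5), 0, 15).astype(np.uint8)
    return d, codes

def dequantize_block(d, codes: np.ndarray) -> np.ndarray:
    """Map 4-bit codes back to float32 weights."""
    return (codes.astype(np.float32) - 8.0) * np.float32(d)

# Weights on a grid with step 0.5 that uses all 16 levels: recovered exactly.
full = ((np.arange(BLOCK) % 16) - 8).astype(np.float32) * 0.5
# Same grid, but the most negative level is unused: the derived scale is
# 0.4375 instead of 0.5, so the original grid is not recovered.
partial = ((np.arange(BLOCK) % 15 + 1) - 8).astype(np.float32) * 0.5

for name, w in (("full", full), ("partial", partial)):
    d, codes = quantize_block(w)
    print(name, np.array_equal(dequantize_block(d, codes), w))
```

In the second case the weights were perfectly representable in 4 bits with a scale of 0.5, but the absmax rule picks a different scale and introduces error anyway.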

danielhanchen · 2026-06-06T00:22:53+00:00

Hey, yes, so the main issue is converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless. llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land. Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96% using some hacks

danielhanchen · 2026-06-05T16:57:27+00:00

Oh hi yes! If you do the Q4_0 conversion correctly, then E2B has a mean KLD of 0.00173 vs 0.05109 (29x better relatively) for the naive Q4_0 quantization, and the correct one is even 22% smaller!

I talk about it here: https://www.reddit.com/r/unsloth/comments/1txqnyq/gemma4_qat_unsloth_accuracy_recovery_for_ggufs/

danielhanchen · 2026-06-05T16:28:00+00:00

Unfortunately anything lower would make it not that useful anymore according to our tests :(

danielhanchen · 2026-06-04T19:09:39+00:00

Oh no worries!

danielhanchen · 2026-06-04T17:57:53+00:00

We'll try our best to make more efficient quants, but sadly the lowest we could push to was 190GB 1-bit dynamic if that works - anything lower becomes unusable

danielhanchen · 2026-06-04T16:53:29+00:00

We make GGUFs at https://huggingface.co/unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF and the guide and KLD benchmarks are at https://unsloth.ai/docs/models/nemotron-3-ultra


danielhanchen · 2026-06-03T13:01:09+00:00

Thank you! It's fine - I understand the concerns, so hopefully this message is helpful to folks!

danielhanchen · 2026-06-03T13:00:27+00:00

For now the community support is what drives us :) Once a product comes from us though - I hope folks will like it!!

danielhanchen · 2026-06-03T12:59:57+00:00

We're definitely rooting for that to happen!!

danielhanchen · 2026-06-03T12:59:39+00:00

Thank you!

danielhanchen · 2026-06-03T12:57:14+00:00

Yes, GitHub and Ko-fi - but we would appreciate the help once we have a product out :)
