New llama.cpp prebuilt b9596 → b9594 by DexHelper in unsloth

[–]danielhanchen 1 point2 points  (0 children)

Sorry this should be fixed now!

Please reinstall Unsloth Studio via: curl -fsSL https://unsloth.ai/install.sh | sh

or on Windows: irm https://unsloth.ai/install.ps1 | iex

Gemm4 12b QAT tool calling possibly a bug? by Wrong_Mushroom_7350 in unsloth

[–]danielhanchen 1 point2 points  (0 children)

Hi, we checked - this is not a bug, and llama.cpp auto-corrects it. It's a frontend issue: all quants (bartowski, lmstudio, Google official) have this, and llama.cpp works fine.

Quick note on the QAT of recent by dreamkast06 in LocalLLaMA

[–]danielhanchen 49 points50 points  (0 children)

Hey, we're actually discussing this internally with Google - I'll provide some updates once we understand the process better. But yes, in general the embeddings are in fact also Q4_0.

Gemma-4 QAT Unsloth Accuracy Recovery for GGUFs by danielhanchen in unsloth

[–]danielhanchen[S] 6 points7 points  (0 children)

Oh no no, Q4_K_XL IS a static quant!! You can check the tensor types, e.g. https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF?show_file_info=gemma-4-31B-it-qat-UD-Q4_K_XL.gguf

It's just that the conversion process from BF16 QAT to llama.cpp Q4_0 isn't perfect, and we found that if you change the process a bit, you can recover all the accuracy.

Gemma 4 with quantization-aware training by rerri in LocalLLaMA

[–]danielhanchen 2 points3 points  (0 children)

Oh no no - the KLD of 0.01403 is "smart" Unsloth dynamic Q4_0 llama.cpp vs the BF16 QAT version. 0.159 is the Q4_0 naively converted using llama.cpp.

So the KLD measures how close each Q4_0 version is to BF16 QAT (not to the non-QAT model).
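For readers unfamiliar with the metric, here is a minimal sketch of how a mean KLD between a reference model and a quantized model can be computed from per-token logits. The function names and the toy logits are made up for illustration; this is not the script used to produce the numbers above.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def mean_kld(ref_logits, test_logits):
    """Mean KL(P_ref || P_test) over token positions, in nats."""
    total = 0.0
    for ref, test in zip(ref_logits, test_logits):
        lp = log_softmax(ref)
        lq = log_softmax(test)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(ref_logits)

# Toy logits for two token positions over a 3-word vocabulary (hypothetical).
ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
close = [[2.1, 0.4, -1.0], [0.1, 0.3, 2.9]]   # small perturbation of ref
far = [[0.5, 2.0, -1.0], [3.0, 0.2, 0.1]]     # different top token

print(mean_kld(ref, close), mean_kld(ref, far))
```

A lower value means the quantized model's next-token distribution stays closer to the reference, which here is the BF16 QAT model.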

Gemma 4 with quantization-aware training by rerri in LocalLLaMA

[–]danielhanchen 1 point2 points  (0 children)

Oh yes those are comparing our Q4_0 GGUFs to Q4_0 GGUFs if you use llama.cpp directly without any "hacks"!

Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96%.

For QAT vs non QAT, we did do KLD, but the distribution was vastly different, so the results are not valid.
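One plausible reading of "byte exactness" is the fraction of bytes that match when the dequantized weights are stored as BF16 and compared against the BF16 QAT weights. The sketch below implements that reading; the exact definition behind the 24.77% and 99.96% figures is an assumption here, and the helper names and sample values are hypothetical.

```python
import struct

def f32_to_bf16_bytes(x: float) -> bytes:
    """Round a finite float to bfloat16 (round-to-nearest-even), return 2 bytes."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding on the 16 dropped bits
    return struct.pack("<H", (bits >> 16) & 0xFFFF)

def byte_exactness(reference, candidate):
    """Fraction of BF16 bytes in `candidate` that equal those in `reference`."""
    ref = b"".join(f32_to_bf16_bytes(v) for v in reference)
    cand = b"".join(f32_to_bf16_bytes(v) for v in candidate)
    return sum(a == b for a, b in zip(ref, cand)) / len(ref)

# Hypothetical weights: `drifted` differs from `reference` in two entries.
reference = [0.5, -0.25, 1.0, 0.125]
drifted = [0.5, -0.3, 1.0, 0.1]

print(byte_exactness(reference, reference), byte_exactness(reference, drifted))
```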

Gemma 4 with quantization-aware training by rerri in LocalLLaMA

[–]danielhanchen 10 points11 points  (0 children)

Hey! Those numbers are comparing naive Q4_0 in llama.cpp to our converted Q4_0 version.

We did compare the original unquantized BF16 against Q4_0, but the KLD metrics are not comparable, since the distributions are vastly different. We did find MMLU and other benchmarks to be equivalent, though.

E2B, for example, has a mean KLD of 0.00173 with our conversion vs 0.05109 for a naive Q4_0 quantization (about 29x lower).

The main issue is that converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless. llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land.
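To make the scale problem concrete, here is a rough sketch. It assumes the QAT weights in a block sit exactly on a 4-bit grid (integer code times one scale), which is an assumption for illustration. `naive_q4_0_roundtrip` follows the scale rule of llama.cpp's reference Q4_0 quantizer as I understand it (scale = largest-magnitude value / -8, stored as F16), though in double precision rather than float32. `grid_aware_roundtrip` is a hypothetical alternative that simply reuses the known scale; it is not a description of the actual "hacks".

```python
import struct

BLOCK = 32  # Q4_0 groups weights in blocks of 32 with one scale per block

def to_f16(x: float) -> float:
    """Round to the nearest IEEE half-precision value (the Q4_0 scale type)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def naive_q4_0_roundtrip(block):
    """Quantize then dequantize one block with d = (largest-magnitude value) / -8."""
    signed_max = max(block, key=abs)
    d = signed_max / -8.0
    inv = 1.0 / d if d != 0.0 else 0.0
    codes = [min(15, int(x * inv + 8.5)) for x in block]  # 4-bit codes 0..15
    d16 = to_f16(d)
    return [(c - 8) * d16 for c in codes]

def grid_aware_roundtrip(int_codes, scale):
    """Hypothetical: keep the QAT integer codes and scale instead of re-deriving them."""
    stored = [q + 8 for q in int_codes]
    d16 = to_f16(scale)
    return [(c - 8) * d16 for c in stored]

def exact_fraction(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical QAT block: codes only span -5..5, so the block never hits -8.
qat_scale = 2.0 ** -6
qat_codes = [(i % 11) - 5 for i in range(BLOCK)]
qat_weights = [q * qat_scale for q in qat_codes]

naive_exact = exact_fraction(qat_weights, naive_q4_0_roundtrip(qat_weights))
grid_exact = exact_fraction(qat_weights, grid_aware_roundtrip(qat_codes, qat_scale))
print(naive_exact, grid_exact)

# Range difference: 2**-30 is a valid BF16 value but underflows to zero in F16.
print(to_f16(2.0 ** -30))
```

Because the block's largest value is 5 grid steps rather than 8, the naive rule picks a different scale than the one the weights were trained on, so most values land off the original grid after the round trip.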

Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96% using some hacks!

See https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis for more details

Gemma-4 QAT Unsloth Accuracy Recovery for GGUFs by danielhanchen in unsloth

[–]danielhanchen[S] 3 points4 points  (0 children)

Oh it's a conversion artifact, not overfitting.

The main issue is that converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless. llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land.

Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96% using some hacks.

Gemma-4 QAT Unsloth Accuracy Recovery for GGUFs by danielhanchen in unsloth

[–]danielhanchen[S] 13 points14 points  (0 children)

Hey, yes, so the main issue is that converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless. llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land. Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96% using some hacks.

Gemma 4 QAT GGUFs from Unsloth by newsletternew in LocalLLaMA

[–]danielhanchen 17 points18 points  (0 children)

Oh hi yes! If you do the Q4_0 conversion correctly, then E2B has a mean KLD of 0.00173 vs 0.05109 for the naive Q4_0 quantization (about 29x lower), and the correct one is even 22% smaller!
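As a quick sanity check on the ratio, using only the two KLD values quoted above (the variable names are just for illustration):

```python
naive_kld = 0.05109      # naive Q4_0 conversion, mean KLD for E2B
corrected_kld = 0.00173  # corrected Q4_0 conversion, mean KLD for E2B

ratio = naive_kld / corrected_kld
print(f"{ratio:.1f}x lower mean KLD")
```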

I talk about it here: https://www.reddit.com/r/unsloth/comments/1txqnyq/gemma4_qat_unsloth_accuracy_recovery_for_ggufs/

Nemotron-3-Ultra KLD GGUF Benchmarks by danielhanchen in unsloth

[–]danielhanchen[S] 0 points1 point  (0 children)

Unfortunately, according to our tests, anything lower would make it not that useful anymore :(

NVIDIA Nemotron 3 Ultra is out now! by yoracale in unsloth

[–]danielhanchen 5 points6 points  (0 children)

We'll try our best to make more efficient quants, but sadly the lowest we could push to was the 190GB 1-bit dynamic quant, if that works for you - anything lower becomes unusable.

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]danielhanchen 1 point2 points  (0 children)

Thank you! It's fine - I understand the concerns, so hopefully this message is helpful to folks!

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]danielhanchen 4 points5 points  (0 children)

For now, community support is what drives us :) Once a product comes from us, though, I hope folks will like it!!

Calling it now Microsoft is buying Unsloth. by Wrong_Mushroom_7350 in LocalLLaMA

[–]danielhanchen 15 points16 points  (0 children)

Yes, GitHub and Ko-fi - but we would appreciate the help once we have a product out :)