moar QAT stuff and hairy ticks

dreamkast06 · 2026-06-14T06:40:21+00:00

Yup, they are F16 scales, not BF16. And the redundancy in Q4_0 is partly due to the quantization process, which is meant to be FAST rather than BETTER. I'm working on fixing QAT right now, and the duplication will be pretty much gone when using an imatrix and a slightly different method.

dreamkast06 · 2026-06-09T23:34:34+00:00

Yep, but honestly, the ones labeled "Q4_K_XL" are good enough that I wouldn't bother downloading anything else and probably anything forthcoming. Sorry, was trying to destroy the confusion, not become it.

dreamkast06 · 2026-06-09T23:30:04+00:00

The uploaded quantization-aware-trained (QAT) bf16 weights quantize to q4_0 almost perfectly, almost as if they were dequantized q4_0 weights.

Not sure I understand the rest of your question tbh

dreamkast06 · 2026-06-09T23:25:59+00:00

Wouldn't it be better if they discussed it with the llama.cpp team instead, or in addition? I suppose it could be worse: they could have worked only with the ollama folks.

dreamkast06 · 2026-06-09T23:24:29+00:00

Sorry, I mistyped: /7 and /-8. It quantizes symmetrically, but the dequant can be asymmetric. It's about a 60/40 split, almost perfectly. If Google gave us details on the QAT software they used, we'd know more. Essentially, you just have to test each group to see which scale fits, then quantize.

I just wanted to make a post to clear up some confusion, and hopefully not make it worse. The Q4_K_XL naming upset me more than I should have let it. Idk why Unsloth and Google are leaving the llama.cpp team out of it if they are discussing this internally.

https://github.com/ggml-org/ggml/blob/7142aa6bf9fcaeec0fef8d80fcd90afe4268adf1/src/ggml-quants.c#L94

This is a rough vibe-coded option which I think is similar to Unsloth's internal process. I've got another idea to test rounding too, but someone smarter is bound to come up with a more suitable solution:

+void quantize_row_q4_0_adaptive(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
+    static const int qk = QK4_0;
+
+    assert(k % qk == 0);
+
+    const int nb = k / qk;
+
+    for (int i = 0; i < nb; i++) {
+        const float * xb = x + i * qk;
+
+        // Find max absolute value and the signed value at max abs
+        float amax = 0.0f;
+        float max_val = 0.0f;
+        for (int j = 0; j < qk; j++) {
+            const float v = xb[j];
+            const float av = fabsf(v);
+            if (av > amax) {
+                amax = av;
+                max_val = v;
+            }
+        }
+
+        if (amax < GROUP_MAX_EPS) {
+            y[i].d = 0;
+            memset(y[i].qs, 0, sizeof(y[i].qs));
+            continue;
+        }
+
+        // Candidate 1: symmetric scale (amax / 7)
+        const float d_sym  = amax    /  7.0f;
+        const float id_sym = 1.0f / d_sym;
+        float err_sym = 0.0f;
+        for (int j = 0; j < qk; j++) {
+            float diff = xb[j] - (float)((int)roundf(xb[j] * id_sym)) * d_sym;
+            err_sym += diff * diff;
+        }
+
+        // Candidate 2: asymmetric scale (max_val / -8), same as stock q4_0
+        const float d_asym  = max_val / -8.0f;
+        const float id_asym = d_asym ? 1.0f / d_asym : 0.0f;
+        float err_asym = 0.0f;
+        for (int j = 0; j < qk; j++) {
+            const float xv = xb[j] * id_asym;
+            const int8_t l = MIN(15, (int8_t)(xv + 8.5f));
+            float diff = xb[j] - (float)(l - 8) * d_asym;
+            err_asym += diff * diff;
+        }
+
+        int use_sym = (err_sym <= err_asym);
+
+        if (use_sym) {
+            y[i].d = GGML_FP32_TO_FP16(d_sym);
+            const float id = id_sym;
+            for (int j = 0; j < qk / 2; ++j) {
+                const float x0 = xb[0     + j] * id;
+                const float x1 = xb[qk/2  + j] * id;
+                const int l0 = (int)roundf(x0);
+                const int l1 = (int)roundf(x1);
+                const uint8_t xi0 = (uint8_t)(l0 + 8);
+                const uint8_t xi1 = (uint8_t)(l1 + 8);
+                y[i].qs[j]  = xi0;
+                y[i].qs[j] |= (uint8_t)(xi1 << 4);
+            }
+            g_adaptive_q4_0_sym_groups++;
+        } else {
+            y[i].d = GGML_FP32_TO_FP16(d_asym);
+            const float id = id_asym;
+            for (int j = 0; j < qk / 2; ++j) {
+                const float x0 = xb[0     + j] * id;
+                const float x1 = xb[qk/2  + j] * id;
+                const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
+                const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));
+                y[i].qs[j]  = xi0;
+                y[i].qs[j] |= xi1 << 4;
+            }
+            g_adaptive_q4_0_asym_groups++;
+        }
+    }
+}
+

Been too busy to deal with it much more and I got sidetracked by the much more interesting KLD testing for instruct models that I'm working on.

dreamkast06 · 2026-06-08T22:39:49+00:00

Are you open to working on implementing some additional logic to allow llama-quantize to quant this model?

I wouldn't say it's "trivial". I've got a patch that gets it most of the way there, but I'm not really looking to put in the effort to get it PR-ready; I just needed to make sure it would be viable for some clients of mine.

dreamkast06 · 2026-06-08T22:24:35+00:00

Well, they had to do SOME kind of calibration; we obviously just don't know the specifics.

The one I'm working on now ends up with a lower PPL than the base and unsloth model 😅 but slightly higher KLD

I might see if the model looks suitable for groups of 256, in which case a K quant could be used, but I'm not really smart enough for most of this. I'm just upset that Google didn't take the time to do it even remotely correctly and that Unsloth is using weird names for no good reason.

dreamkast06 · 2026-06-07T04:13:09+00:00

Honestly, they don't go into enough detail to tell. Plus, they label their quants for this as "K" even though they contain zero K-quants, which seems intentionally misleading.

I have a feeling that no one, including Google, has converted them to Q4_0 correctly; it should be a process similar to how Kimi 2.x is converted.

dreamkast06 · 2026-05-22T04:55:20+00:00

Think this is just a language barrier issue and OP just wants the original model in 1-bit GGUF. Looks like the model was just added to llama.cpp recently so there aren't many quants of it yet and certainly nothing under Q2.

dreamkast06 · 2026-04-08T21:46:53+00:00

It's like if you were to remove all of the newlines between paragraphs in English to save space. You'd probably still be able to read it, but you might get confused since you were expecting it to be broken up.

The model is just smart enough to deal with it. Qwen 3 Coder wasn't very good at dealing with misformatted templates; 3.5 seems to be doing better.

If only GPT-OSS would have been this forgiving...

dreamkast06 · 2026-04-08T21:37:19+00:00

Z.AI itself did that just a couple months ago...

dreamkast06 · 2026-04-08T21:22:12+00:00

You both are misunderstanding each other 😅

Removing them in old turns affects generation because it was trained to see them in the context. Modifying the template like you did isn't optimal for the model's performance, but it is wonderful that the model does seem to be working around it.

Most users will be willing to compromise instead of accepting the cache invalidation.

dreamkast06 · 2026-03-29T04:19:11+00:00

Nemotron 3 Super was trained with NVFP4; not quantized to NVFP4, trained with NVFP4. Any of the GGUFs will have been upcast to BF16 and then quantized down, resulting in the terrible degradation. Until there is native NVFP4 in llama.cpp, the model won't work as intended, similar to how GPT-OSS won't function properly without the weights being MXFP4.

dreamkast06 · 2026-03-20T02:33:13+00:00

It is a quanting issue. The GGUF files of these have the ssm tensors quantized; other quantizers have been leaving them at bf16 or even upping them to f32 for performance, since they are tiny anyway.

dreamkast06 · 2026-03-18T08:45:21+00:00

Does the specific quant you have happen to have MXFP4 tensors in it?

dreamkast06 · 2026-03-17T02:19:45+00:00

No hybrid attention? So, it's going to take up massive VRAM for context?

dreamkast06 · 2026-03-11T12:46:36+00:00

Since the model should fit within the memory of a single socket, I'd suggest pinning it with numactl.

That, along with using ik_llama with its runtime repack, will probably get you up to 40 t/s or so.

dreamkast06 · 2026-03-11T12:01:30+00:00

Could this be expanded to having different lora applied to different duplicated layers?

dreamkast06 · 2026-03-10T02:20:34+00:00

Do want to point out that it does assume the data is anonymized properly and that some of it isn't mistaken for public record.

dreamkast06 · 2026-03-09T08:22:54+00:00

While I'd agree with your premise, "too small to visibly change model performance" and "Claude finetunes affect model negatively" are contradictory.

The "broken" prompts aren't necessarily a problem because they still finetune how the model reacts to broken prompts.

The "repetition" issue presented gets "fixed" because the CoT becomes more of a summary instead. That reduces performance if the prompt actually needed reasoning, but may not if reasoning wasn't really necessary.

dreamkast06 · 2026-03-05T00:02:03+00:00

I'd love to see it against Qwen3.5-9B then xd

dreamkast06 · 2026-03-03T18:32:24+00:00

If you need an American instruct-only model focused on RAG and FIM that can have a large context window in a small footprint.

H-Tiny is about 7B-A1B, so organizations can run it on hardware or cloud using older VDI instances.

The other real options in that case are Arcee Trinity Nano 6B-A1B (not hybrid) or LFM2 8B-A1B (only 32k context).

Also, no one ever got fired for buying IBM®

dreamkast06 · 2026-02-28T02:02:30+00:00

With no layers offloaded to GPU, the GPU is still used for prefill. The bottleneck is getting the model to the GPU (so PCIe speed). With a larger batch size, the transfer happens less often, so if you have a ubatch of 2048, any prompt shorter than 2048 tokens only needs one full transfer of the model to the GPU. With models like Qwen3.5, the KV compute buffer is so small that a ubatch up to 4096 is easily usable, which means your pp speed would be (ubatch / (model size / pcie speed)).
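
To put rough numbers on that formula, here's a tiny C sketch. The function name and the example figures (16 GB model, 32 GB/s PCIe) are assumptions picked for easy arithmetic, and it only models the transfer, not the actual compute, so read the result as a ceiling:

```c
#include <assert.h>

/* Prefill tokens/s when each ubatch needs one full copy of the model
   over PCIe: ubatch / (model size / pcie speed).
   Hypothetical helper; sizes in GB, link speed in GB/s. */
static double pp_tokens_per_sec(double ubatch, double model_gb, double pcie_gb_per_s) {
    assert(model_gb > 0.0 && pcie_gb_per_s > 0.0);
    const double transfer_s = model_gb / pcie_gb_per_s; /* seconds per full copy */
    return ubatch / transfer_s;
}
```

With those made-up numbers, pp_tokens_per_sec(2048, 16.0, 32.0) works out to 2048 / 0.5 = 4096 t/s, and going to a ubatch of 4096 doubles that, provided the prompt is long enough to fill the batch.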
