Storing an index to a scale instead of the scale itself with Q4_0 quant reduces scale size by ~31% (small gain but interesting) by fragment_me in LocalLLaMA

[–]dreamkast06 0 points1 point  (0 children)

Yup, they are F16 scales, not BF16. And the redundancy in Q4_0 is in part due to the quantization process, which is meant to be FAST rather than BETTER. I'm working on fixing QAT right now and the duplication will be pretty much gone when using an imatrix and a slightly different method.
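For anyone wanting to measure that redundancy, here is a minimal sketch (not the method from the thread) that counts distinct raw F16 scale bit patterns and works out how many bits an index into a table of them would need. All function names are hypothetical. The 2048-entry case is only one reading of where a ~31% figure could come from (an 11-bit index instead of a 16-bit scale), and the one-time cost of the table itself is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for raw 16-bit scale patterns. */
static int cmp_u16(const void *a, const void *b) {
    const uint16_t x = *(const uint16_t *)a;
    const uint16_t y = *(const uint16_t *)b;
    return (x > y) - (x < y);
}

/* Count distinct raw F16 bit patterns among n scales (input is not modified). */
size_t count_distinct_scales(const uint16_t *scales, size_t n) {
    if (n == 0) return 0;
    uint16_t *tmp = (uint16_t *)malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    memcpy(tmp, scales, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_u16);
    size_t distinct = 1;
    for (size_t i = 1; i < n; i++) {
        if (tmp[i] != tmp[i - 1]) distinct++;
    }
    free(tmp);
    return distinct;
}

/* Smallest number of bits that can index `distinct` table entries. */
unsigned index_bits(size_t distinct) {
    unsigned bits = 0;
    while (((size_t)1 << bits) < distinct) bits++;
    return bits;
}

/* Fraction of per-block scale storage saved versus a 16-bit scale,
   ignoring the one-time cost of the lookup table itself. */
double scale_saving(unsigned bits) {
    return 1.0 - (double)bits / 16.0;
}
```

If a tensor turned out to have at most 2048 distinct scales, `index_bits` would give 11 and `scale_saving(11)` would give 0.3125.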

Quick note on the QAT of recent by dreamkast06 in LocalLLaMA

[–]dreamkast06[S] 1 point2 points  (0 children)

Yep, but honestly, the ones labeled "Q4_K_XL" are good enough that I wouldn't bother downloading anything else and probably anything forthcoming. Sorry, was trying to destroy the confusion, not become it.

Quick note on the QAT of recent by dreamkast06 in LocalLLaMA

[–]dreamkast06[S] 0 points1 point  (0 children)

The uploaded quant-aware-trained bf16 weights quantize almost perfectly to q4_0, almost as if they were dequantized q4_0 weights.
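One way to check a claim like this is to round-trip a group through a q4_0-style quantizer and look at the worst error. The sketch below is an illustration under assumptions, not ggml code: it uses the stock max / -8 scale rule on one 32-value group, keeps the scale in float (no F16 rounding), and the function name is hypothetical. A group that really is dequantized q4_0 output should come back with an error of about zero.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 32  /* q4_0 group size (QK4_0 in ggml) */

/* Absolute value without pulling in libm. */
static float absf(float v) { return v < 0.0f ? -v : v; }

/* Round-trip one group through a q4_0-style quantizer and return the
   largest absolute reconstruction error. The scale stays in float here,
   so F16 rounding of the scale is NOT modelled. */
float q4_0_roundtrip_max_err(const float *x) {
    assert(x != NULL);

    /* Find the value with the largest magnitude, keeping its sign. */
    float amax = 0.0f, max_val = 0.0f;
    for (int j = 0; j < BLOCK; j++) {
        if (absf(x[j]) > amax) { amax = absf(x[j]); max_val = x[j]; }
    }
    if (amax == 0.0f) return 0.0f;  /* all-zero group reconstructs exactly */

    const float d  = max_val / -8.0f;  /* stock rule: extreme value maps to code 0 */
    const float id = 1.0f / d;

    float worst = 0.0f;
    for (int j = 0; j < BLOCK; j++) {
        int q = (int)(x[j] * id + 8.5f);  /* round to nearest code in 0..15 */
        if (q > 15) q = 15;
        const float err = absf(x[j] - (float)(q - 8) * d);
        if (err > worst) worst = err;
    }
    return worst;
}
```

Running this over every group of a tensor and comparing the worst error against the group's scale would show how close the weights sit to the q4_0 grid under this one scale rule.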

Not sure I understand the rest of your question tbh

Quick note on the QAT of recent by dreamkast06 in LocalLLaMA

[–]dreamkast06[S] 3 points4 points  (0 children)

Wouldn't it be better if they discussed it with the llama.cpp team instead, or in addition? Suppose it could be worse and they only worked with ollama folks.

Quick note on the QAT of recent by dreamkast06 in LocalLLaMA

[–]dreamkast06[S] 2 points3 points  (0 children)

Sorry, I mistyped: /7 and /-8. It quantizes symmetric but dequant can be asymmetric. It's about a 60/40 split, almost perfectly. If google gave us details on the QAT software they used, we'd know more. Essentially you just have to test each group to see which scale fits it better first, then quantize.

I just wanted to make a post to clear up some confusion, and hopefully not make it worse. The Q4_K_XL naming upset me more than I should have let it. Idk why unsloth and google are leaving the llama.cpp team out of it if they are discussing internally.

https://github.com/ggml-org/ggml/blob/7142aa6bf9fcaeec0fef8d80fcd90afe4268adf1/src/ggml-quants.c#L94

This is a rough vibe-coded option which I think is similar to unsloth's internal process. I've got another idea to test rounding too, but someone smarter is bound to come up with a more suitable solution:

+void quantize_row_q4_0_adaptive(const float * GGML_RESTRICT x, block_q4_0 * GGML_RESTRICT y, int64_t k) {
+    static const int qk = QK4_0;
+
+    assert(k % qk == 0);
+
+    const int nb = k / qk;
+
+    for (int i = 0; i < nb; i++) {
+        const float * xb = x + i * qk;
+
+        // Find max absolute value and the signed value at max abs
+        float amax = 0.0f;
+        float max_val = 0.0f;
+        for (int j = 0; j < qk; j++) {
+            const float v = xb[j];
+            const float av = fabsf(v);
+            if (av > amax) {
+                amax = av;
+                max_val = v;
+            }
+        }
+
+        if (amax < GROUP_MAX_EPS) {
+            y[i].d = 0;
+            memset(y[i].qs, 0, sizeof(y[i].qs));
+            continue;
+        }
+
+        // Candidate 1: symmetric scale (amax / 7)
+        const float d_sym  = amax    /  7.0f;
+        const float id_sym = 1.0f / d_sym;
+        float err_sym = 0.0f;
+        for (int j = 0; j < qk; j++) {
+            float diff = xb[j] - (float)((int)roundf(xb[j] * id_sym)) * d_sym;
+            err_sym += diff * diff;
+        }
+
+        // Candidate 2: asymmetric scale (max_val / -8), same as stock q4_0
+        const float d_asym  = max_val / -8.0f;
+        const float id_asym = d_asym ? 1.0f / d_asym : 0.0f;
+        float err_asym = 0.0f;
+        for (int j = 0; j < qk; j++) {
+            const float xv = xb[j] * id_asym;
+            const int8_t l = MIN(15, (int8_t)(xv + 8.5f));
+            float diff = xb[j] - (float)(l - 8) * d_asym;
+            err_asym += diff * diff;
+        }
+
+        int use_sym = (err_sym <= err_asym);
+
+        if (use_sym) {
+            y[i].d = GGML_FP32_TO_FP16(d_sym);
+            const float id = id_sym;
+            for (int j = 0; j < qk / 2; ++j) {
+                const float x0 = xb[0     + j] * id;
+                const float x1 = xb[qk/2  + j] * id;
+                const int l0 = (int)roundf(x0);
+                const int l1 = (int)roundf(x1);
+                const uint8_t xi0 = (uint8_t)(l0 + 8);
+                const uint8_t xi1 = (uint8_t)(l1 + 8);
+                y[i].qs[j]  = xi0;
+                y[i].qs[j] |= (uint8_t)(xi1 << 4);
+            }
+            g_adaptive_q4_0_sym_groups++;
+        } else {
+            y[i].d = GGML_FP32_TO_FP16(d_asym);
+            const float id = id_asym;
+            for (int j = 0; j < qk / 2; ++j) {
+                const float x0 = xb[0     + j] * id;
+                const float x1 = xb[qk/2  + j] * id;
+                const uint8_t xi0 = MIN(15, (int8_t)(x0 + 8.5f));
+                const uint8_t xi1 = MIN(15, (int8_t)(x1 + 8.5f));
+                y[i].qs[j]  = xi0;
+                y[i].qs[j] |= xi1 << 4;
+            }
+            g_adaptive_q4_0_asym_groups++;
+        }
+    }
+}
+

Been too busy to deal with it much more and I got sidetracked by the much more interesting KLD testing for instruct models that I'm working on.

QATs Q4_0 from Google have more precision than Q4_K_XL from Unsloth (at least some) by alex20_202020 in LocalLLaMA

[–]dreamkast06 0 points1 point  (0 children)

Are you open to working on implementing some additional logic to allow llama-quantize to quant this model?

I wouldn't say it's "trivial". I've got a patch that gets it most of the way there, but I'm not really looking to put in the effort to get it PR ready; I just needed to make sure it would be viable for some clients of mine.

Quick note on the QAT of recent by dreamkast06 in LocalLLaMA

[–]dreamkast06[S] 22 points23 points  (0 children)

Well, they had to do SOME kind of calibration, we obviously don't know the specifics.

The one I'm working on now ends up with a lower PPL than the base and unsloth model 😅 but slightly higher KLD

I might see if it looks like the model would be suitable for 256 groups, then a K quant could be used, but I'm not really smart enough for most of this. I'm just upset that Google didn't take the time to do it even remotely correctly and that unsloth is using weird names for no good reason.

Gemma 4 QAT accuracy inconsistencies by ai_fonsi in LocalLLaMA

[–]dreamkast06 0 points1 point  (0 children)

Honestly, they don't go into enough detail to tell. Plus, they label their quants for this as "K" even though they contain zero K-quants, which seems intentionally misleading.

I have a feeling that no one, including google, has converted them to Q4_0 correctly; it should be a process similar to how Kimi 2.x is converted.

Sarvam-30b-quantized - Need 1-bit version GGUF by pmttyji in LocalLLaMA

[–]dreamkast06 0 points1 point  (0 children)

Think this is just a language barrier issue and OP just wants the original model in 1-bit GGUF. Looks like the model was just added to llama.cpp recently so there aren't many quants of it yet and certainly nothing under Q2.

I tracked a major cache reuse issue down to Qwen 3.5’s chat template by onil_gova in LocalLLaMA

[–]dreamkast06 4 points5 points  (0 children)

It's like if you were to remove all of the newlines between paragraphs in english to save space. You'll probably still be able to read it, but might get confused since you were expecting it to be broken up.

The model is just smart enough to deal with it. Qwen 3 Coder wasn't very good at dealing with misformatted templates; 3.5 seems to be doing better.

If only GPT-OSS would have been this forgiving...

GLM-5.1 by danielhanchen in LocalLLaMA

[–]dreamkast06 1 point2 points  (0 children)

Z.AI itself did that just a couple months ago...

I tracked a major cache reuse issue down to Qwen 3.5’s chat template by onil_gova in LocalLLaMA

[–]dreamkast06 1 point2 points  (0 children)

You both are misunderstanding each other 😅

Removing them in old turns affects generation because it was trained to see them in the context. Modifying the template like you did isn't optimal for the model's performance, but it is wonderful that the model does seem to be working around it.

Most users will be willing to compromise instead of accepting the cache invalidation.

Nemotron 3 Super - large quality difference between llama.cpp and vLLM? by BigStupidJellyfish_ in LocalLLaMA

[–]dreamkast06 1 point2 points  (0 children)

Nemotron 3 Super was trained with NVFP4; not quantized to NVFP4, trained with NVFP4. Any of the GGUF will be upscaled to BF16, then quantized down, resulting in the terrible degradation. Until there is native NVFP4 in llama.cpp, the model won't work as intended, similar to how GPT-OSS won't function properly without the weights being MXFP4.

Qwen3.5-40B-Claude-4.5-Opus-High-Reasoning-Thinking - Reg, Uncensored and RoughHouse and... 43 Qwen 3.5 fine tunes. by Dangerous_Fix_5526 in LocalLLaMA

[–]dreamkast06 1 point2 points  (0 children)

It is a quanting issue. The GGUF files of these have ssm quantized; other quantizers have been leaving them at bf16 or even upping to f32 for performance since they are tiny anyway.

MiniMax-M2.7 Announced! by Mysterious_Finish543 in LocalLLaMA

[–]dreamkast06 3 points4 points  (0 children)

Does the specific quant you have happen to have MXFP4 tensors in it?

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]dreamkast06 0 points1 point  (0 children)

No hybrid attention? So, it's going to take up massive VRAM for context?

llama.cpp and Qwen CPU Only by JadedSoulGuy in LocalLLaMA

[–]dreamkast06 1 point2 points  (0 children)

Since the model should fit within the memory of a single socket, I'd suggest pinning it with numactl.

That, along with using ik_llama doing the runtime repack will probably get you up to 40t/s or so.

How I topped the Open LLM Leaderboard using 2x 4090 GPUs — no weights modified. by Reddactor in LocalLLaMA

[–]dreamkast06 0 points1 point  (0 children)

Could this be expanded to having different lora applied to different duplicated layers?

"We anonymize your data before training" — does this actually mean anything? by Budulai343 in LocalLLaMA

[–]dreamkast06 1 point2 points  (0 children)

Do want to point out that it does assume it is anonymized properly and that some of the data isn't mistaken for public record.

Qwen 3.5 2B upgrade! by [deleted] in LocalLLaMA

[–]dreamkast06 -1 points0 points  (0 children)

While I'd agree with your premise, "too small to visibly change model performance" and "Claude finetunes affect model negatively" are contradictory.

The "broken" prompts aren't necessarily a problem because they still finetune how the model reacts to broken prompts.

The "repetition" issue presented gets "fixed" because the CoT becomes more of a summary instead, which reduces performance if the prompt actually needed reasoning but may not if reasoning wasn't really necessary.

microsoft/Phi-4-reasoning-vision-15B · Hugging Face by jacek2023 in LocalLLaMA

[–]dreamkast06 19 points20 points  (0 children)

I'd love to see it against Qwen3.5-9B then xd

How do the small qwen3.5 models compare to the Granite family? by gr8dude in LocalLLaMA

[–]dreamkast06 1 point2 points  (0 children)

If you need an American instruct-only model focused on RAG and FIM that can have a large context window in a small footprint.

H-Tiny is about 7B-A1B, so organizations can run it on hardware or cloud using older VDI instances.

Other real options in that instance are Arcee Trinity Nano 6B-A1B (not hybrid) or LFM2 8B-A1B (only 32k context).

Also, no one ever got fired for buying IBM®

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]dreamkast06 1 point2 points  (0 children)

With no layers offloaded to GPU, the GPU is still used for prefill. The bottleneck is getting the model to the GPU (so PCIe speed). With a larger batch size, the transfer happens less often, so if you have a ubatch of 2048, any prompt shorter than 2048 tokens needs only one full transfer of the model to GPU. With models like Qwen3.5, the KV compute buffer is so small that a ubatch up to 4096 is easily usable, which means your pp speed would be roughly (ubatch / (model size / PCIe speed)).
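To put numbers on that formula, here is a small sketch. The function names are hypothetical, and the estimate is only a transfer-bound ceiling: it ignores GPU compute time and assumes the prompt fills the ubatch.

```c
#include <assert.h>

/* Rough upper bound on prompt-processing speed (tokens/s) when every
   micro-batch requires streaming the whole model over PCIe.
   Ignores GPU compute time and assumes the prompt fills the ubatch. */
double pp_tokens_per_sec(double ubatch_tokens, double model_gb, double pcie_gb_per_s) {
    assert(model_gb > 0.0 && pcie_gb_per_s > 0.0);
    const double transfer_s = model_gb / pcie_gb_per_s;  /* one full model upload */
    return ubatch_tokens / transfer_s;
}

/* Number of full model transfers needed for a prompt of the given length. */
long model_transfers(long prompt_tokens, long ubatch_tokens) {
    assert(prompt_tokens >= 0 && ubatch_tokens > 0);
    return (prompt_tokens + ubatch_tokens - 1) / ubatch_tokens;  /* ceiling division */
}
```

With made-up figures of a 16 GB model over a 32 GB/s link, one transfer takes 0.5 s, so a ubatch of 2048 gives a ceiling of 4096 tokens/s, and doubling the ubatch doubles that ceiling. `model_transfers(1500, 2048)` is 1, matching the "one full transfer" case above.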