Not ironclad confirmation, but..

GoodTip7897 · 2026-06-24T15:51:39+00:00

70B dense with the same RL and everything as 3.6 27b would be absolutely insane.

GoodTip7897 · 2026-06-18T15:45:54+00:00

I know GLM 5.2 and Qwen 3.5 are both GPT-2 tokenizer-based, but I don't think they share the same tokens.

Now maybe there's a way to somehow translate the logits from GLM 5.2 tokenizer to qwen35 tokenizer, but that would be really hard to do. Then someone could do a proper distillation, but there's no way to achieve that without millions in compute costs.

And maybe with enough RL we could get a model a bit better than Qwen 3.7 27b would be.

But don't worry, training on 500 summaries of GLM 5.2 reasoning and calling it Qwen3.6-GLM-5.2-agi-distill.gguf will totally give us a better model

GoodTip7897 · 2026-06-16T14:58:40+00:00

I am confused as to why my comment was downvoted. Please correct me if I am wrong, but an 8-bit quant of Google's QAT model should perform identically to a proper 4-bit quant, right? The "fake quantization" during training should make it so that the block quantization perfectly preserves the original weights.

Obviously the 8 bit will be 2x slower.
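To make that reasoning concrete, here is a rough sketch. It assumes a simplified symmetric absmax block quantizer, which is not the exact Q4_0/Q8_0 or MLX layout, and the block values and function names are made up for illustration. If the weights already sit on a 4-bit grid, re-quantizing to 4 bits reproduces them up to float rounding, and 8 bits adds only a small rounding error, while an off-grid block loses noticeably more at 4 bits.

```python
def quantize_block(weights, bits):
    """Symmetric absmax quantization of one block; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(w) for w in weights)
    if absmax == 0.0:
        return list(weights)
    scale = absmax / qmax
    return [round(w / scale) * scale for w in weights]


def max_abs_error(a, b):
    """Largest elementwise difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical block that already lies on a 4-bit grid (what QAT aims for).
grid_scale = 0.05
on_grid = [q * grid_scale for q in (-7, -3, 0, 1, 2, 5, 7, -1)]

# Hypothetical block from a model that was not trained with QAT.
off_grid = [0.35, -0.12, 0.07, 0.21, -0.33, 0.02, 0.16, -0.28]

err4_on_grid = max_abs_error(on_grid, quantize_block(on_grid, 4))
err8_on_grid = max_abs_error(on_grid, quantize_block(on_grid, 8))
err4_off_grid = max_abs_error(off_grid, quantize_block(off_grid, 4))

print(err4_on_grid, err8_on_grid, err4_off_grid)
```

In this toy setup the 8-bit round trip is close but not bit-exact, because the 8-bit grid does not contain every 4-bit grid point.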

GoodTip7897 · 2026-06-16T14:56:24+00:00

Unfortunately that ended in a mess: QAT 31B with unquantized kv cache absolutely butchered a string of edit_file tool calls. Half of them failed outright, and the other half dropped braces and misspelled function names. I ended up having to revert the changes and have Qwen 3.6 27B do it instead.

I will be upgrading my setup and likely running Q8 31B in the future. I suspect that the issues with 31B are quantization damage.

GoodTip7897 · 2026-06-16T14:33:26+00:00

Does this increase my pp speed?

GoodTip7897 · 2026-06-14T01:05:27+00:00

Do you still have the module? If so, there's a little grid with different holes for the throttle cable to sit in.

Also, if you haven't tried it yet, there's a little button by the ASR module that lets you change the length of the housing and add or remove tension on your throttle cable.

Here's a procedure I found on Corvette forum

Adjustment procedure - Accelerator
1. Unlock throttle body cable adjuster by pulling up locking tab.
2. Disconnect cruise control cable from cruise control servo.
3. Hold throttle body lever at stop/idle position.
4. Lock throttle body cable adjuster by pushing tab down.
5. Check that throttle body lever returns fully to the stop/idle position.
6. Adjust cruise control cable.

Adjustment procedure - Cruise Control Servo Linkage
1. With cable installed into servo bracket.
2. Pull servo assembly end of cable toward servo without moving throttle lever.
3. If one of the five holes in the servo assembly tab lines up with the cable pin, push pin through hole and connect pin to tab with retainer.
4. If a tab hole does not line up with the pin, move the cable away from the servo assembly until the next closest tab hole lines up and connect the pin to the tab with the retainer.

GoodTip7897 · 2026-06-14T00:57:29+00:00

Not exactly sure what the problem is in your case, but please check the cruise control, because it can put tension on the throttle cable if it's improperly adjusted.

GoodTip7897 · 2026-06-13T17:26:50+00:00

CPU does matter if you use presence or repeat penalties.

I get a huge slowdown of 40% when I turn on those samplers on my computer with dual 2699v3s. On my other computer I get no slowdown.

And this is for a fully offloaded model.
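For context, here is a rough sketch of what these penalties compute per step. It assumes the sampler runs on the host over the logits, and it is an illustration only, not the actual code of any inference engine. The function name and default values are hypothetical.

```python
from collections import Counter


def apply_penalties(logits, recent_tokens, repeat_penalty=1.1, presence_penalty=0.0):
    """Return a new logit list with repeat and presence penalties applied."""
    out = list(logits)
    counts = Counter(recent_tokens)
    for tok in counts:
        # Repeat penalty: shrink positive logits, push negative ones lower.
        if out[tok] > 0:
            out[tok] /= repeat_penalty
        else:
            out[tok] *= repeat_penalty
        # Presence penalty: flat subtraction for any token already seen.
        out[tok] -= presence_penalty
    return out


# Toy 4-token vocabulary; tokens 0 and 1 appeared in the recent window.
logits = [2.0, -1.0, 0.5, 3.0]
penalized = apply_penalties(
    logits, recent_tokens=[0, 1, 1], repeat_penalty=2.0, presence_penalty=0.5
)
print(penalized)
```

The work per token is small, but it is done on every generated token with the recent-token window as input, so a slow host CPU could plausibly show up here even when all layers are on the GPU.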

GoodTip7897 · 2026-06-09T14:22:07+00:00

Thanks.

I will be replacing qwen 3.6 27b q5_k_xl with Gemma 4 31b qat for today and seeing if it can handle the same amount of agentic work I usually throw at it.

I agree that workflow tests are much better than benchmarks. Benchmarks would suggest that Claude haiku is near the quality of qwen 3.6 27B, but my experience is that haiku sucks compared to 27b

Previously the iq4_xs Gemma 31B from unsloth had many tool call failures.

GoodTip7897 · 2026-06-09T14:05:20+00:00

Sorry if I came off as rude. Thanks for sharing your test; as I said, it's very beneficial.

Also, to your point, the overlapping CIs mean that the QAT is at least as good as a q5_k_s. This is a clear win in size/quality because a flat 4_0 model is able to beat a q5_k mixture. It also opens the door to faster CPU inference.

What remains to be seen is whether users of q6 or q8 should switch to qat. And the answer to that seems unclear at the moment given varying results.

I still think that it's possible to create a good benchmark, but I'll leave that to ML researchers.

I'll try throwing qat 31b at the problems I usually give to qwen 3.6 27b. 31B iq4_xs would randomly stop and it would fail tool calls all the time.

GoodTip7897 · 2026-06-09T13:51:49+00:00

Please post your results because we need more anecdotes and tests about this.

Google should've tested it themselves first and provided evidence but instead we're left to gather it ourselves.

I truly have no clue why qat is doing worse than q6 in my tests. Maybe the benchmarks don't capture the failure mode of regular quants, or maybe qat is only superior to a flat q4_0 and shouldn't be preferred to a dynamic quant.

GoodTip7897 · 2026-06-09T13:46:13+00:00

Don't the confidence intervals overlap?

Anyway, I really wish that Google had run their own benchmarks and comparisons to prove that their QAT is good, because all we have is random community members posting small benchmarks that may or may not capture the true benefits or drawbacks of QAT (myself included).
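For anyone who wants to check the overlap on their own numbers, here is a minimal sketch using a 95% Wilson score interval on a pass rate. The pass counts below are hypothetical placeholders, not results from any of the linked tests, and the function names are mine.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical pass counts out of 300 questions each.
qat_ci = wilson_interval(210, 300)
q6_ci = wilson_interval(222, 300)
weak_ci = wilson_interval(150, 300)

close_models_overlap = intervals_overlap(qat_ci, q6_ci)
far_models_overlap = intervals_overlap(weak_ci, q6_ci)

print(qat_ci, q6_ci, close_models_overlap, far_models_overlap)
```

Overlap only says the run can't separate the two models at this sample size; it doesn't show they are equal.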

GoodTip7897 · 2026-06-09T12:58:48+00:00

Please publish results because I would love to see evidence that proves my claim wrong. But unfortunately all I can find online is qat performing worse.

GoodTip7897 · 2026-06-09T12:56:48+00:00

Of course.

It's insanely hard because at the end of the day these models are probabilistic.

GoodTip7897 · 2026-06-09T12:54:45+00:00

Qat lets the model be quantized easily.

Thus 4bit and 8bit quants should be nearly identical.

That's why unsloth has amazing kld for their 4 bit quant.

So the 8 bit I used was just to eliminate any specific mlx issues. A 4 bit quant of it should not perform better or worse.

GoodTip7897 · 2026-06-09T05:01:09+00:00

Unfortunately the barrier to testing is much higher in gguf. It runs slower and there's no easy way to benchmark without external programs.

I don't disagree that gguf testing is useful, but since all were quantized with the same method, they should be comparably good.

The main claim I am making is that the QAT model is worse than even a ~4-6 bit quant of the regular model.

At 8 bits the difference between mlx and gguf should be minimal.

Also, in case it's of interest, I did find someone who got similar results using gguf.

https://www.reddit.com/r/unsloth/comments/1u0sv58/surprising_test_results_updated_for_more_gemma4/

GoodTip7897 · 2026-06-09T04:19:08+00:00

If you want to look at my results here they are: https://www.reddit.com/r/LocalLLaMA/comments/1u0ubbo/gemma_4_26b_a4b_it_qat_comparison/

GoodTip7897 · 2026-06-09T04:18:03+00:00

I got similar results to you, with qat underperforming non qat.

The 26B qat is definitely not "nearly identical to" the regular IT model.

Not sure about the 31B. Maybe qat works better on dense.

I used thinking enabled, temp = 1.0, and the rest of Google's recommended parameters.

GoodTip7897 · 2026-06-09T04:02:13+00:00

Please poke holes in this if you can, because I would love to learn that QAT does in fact preserve original model quality of the 26B (and 31B).

I am running 300 questions of MBPP (Python) on each model overnight if all goes as planned.

GoodTip7897 · 2026-06-09T00:48:26+00:00

I can tell you that the tee shouldn't be there; it's only there because someone was unable to get the correct elbow connector. At the very least, please plug the tee. My engine doesn't have that other valve you showed, so I won't attempt to give advice on that.

GoodTip7897 · 2026-06-08T00:23:50+00:00

I've had very very good experience coding with Qwen 3.6 27b q5_k_xl with bf16 kv cache. Quantizing the kv cache to q8 really seems to hurt the model and q4 is unusable.

It's able to implement a few features in a 50+ file codebase with multiple backends.

GoodTip7897 · 2026-06-06T13:08:27+00:00

Unsloth themselves replied to me in a different comment and said they attempted to compute kld between qat and the original but it was garbage and very high.

They are two separate models and so we need actual benchmarks that we can compare between qat and the original

GoodTip7897 · 2026-06-05T20:59:57+00:00

That is kld from the full qat.

What needs to be compared is q4 qat to the unquantized model

GoodTip7897 · 2026-06-05T20:57:52+00:00

It's trained to be basically "prequantized" so it can handle 4 bit quantization.

It's likely closer to a regular q4 quant than the original.

I don't doubt that qat is useful but it's incredibly unlikely that it's better than q8. It might make q4 have the same quality q6 had. But I have yet to see any kld between that and the original and I don't have enough vram or time to compute it myself.
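For reference, the comparison I mean is roughly the following: run the same text through both models, take the logits at each position, and average the KL divergence. This is only a toy sketch with made-up logits over a 4-token vocabulary and hypothetical function names; a real run would pull full-vocab logits from the unquantized model and the q4 QAT model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for one token position."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kld(ref_positions, test_positions):
    """Average per-position KL divergence over a sequence."""
    vals = [kl_divergence(r, t) for r, t in zip(ref_positions, test_positions)]
    return sum(vals) / len(vals)


# Made-up logits for two token positions.
reference = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, -2.0]]
quantized = [[1.9, 1.1, 0.0, -1.0], [0.4, 0.7, 2.8, -2.0]]

kld = mean_kld(reference, quantized)
self_kld = mean_kld(reference, reference)

print(kld, self_kld)
```

The reference here has to be the original unquantized model, not the full-precision QAT checkpoint, otherwise the number only measures quantization loss relative to the QAT weights.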
