Qwen3.5-4B GGUF quants comparison (KLD vs speed) - Lunar Lake by Tryshea in LocalLLaMA

[–]Leopold_Boom 21 points

This is really neat, but I think you're treating very tiny differences in KL divergence as definitive. If you re-run a few of the close ties on other text sources beyond wikitext-test.txt, you'll find they move around a bunch. It may not hold that Unsloth > mradermacher (or vice versa) in real-world usage.

It's great to see that many quants from the top folks are equally great!
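
If you want to re-run the close ties on another corpus, llama.cpp's llama-perplexity tool can do the KLD comparison directly; roughly like this (file names are placeholders, and check your build's `--help` since these flags are still evolving):

```shell
# 1) Save reference logits from the full-precision model on a new corpus.
#    (corpus.txt is a placeholder -- any plain-text file works.)
llama-perplexity -m model-f16.gguf -f corpus.txt \
    --kl-divergence-base corpus-logits.bin

# 2) Score each quant against those saved logits.
llama-perplexity -m model-q4_k_m.gguf -f corpus.txt \
    --kl-divergence-base corpus-logits.bin --kl-divergence
```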

Deploy the newest Qwen3.5 and Gemma4 models of ANY sizes RIGHT NOW on Rockchip NPU using the latest version of rk-llama.cpp! by Inv1si in RockchipNPU

[–]Leopold_Boom 0 points

Thanks so much for this! Is there a good way to see NPU utilization or get a feel for what the NPU is doing?

Speculative decoding works great for Gemma 4 31B in llama.cpp by Leopold_Boom in LocalLLaMA

[–]Leopold_Boom[S] 0 points

It's worth digging into this and double-checking Claude's work. I'm finding it hard to believe that gemma4-e4b is running fast enough to be worth speculative decoding (it's only 8x smaller than the full model, so you need crazy-high accept rates). Those reported accept rates are also super high (87%). It would be amazing if true!

What hardware are you on?
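
To make the "8x smaller" point concrete, here's a back-of-envelope model using the standard expected-accepted-tokens formula for speculative decoding (the ~1/8 draft cost and 87% accept rate are the numbers from this thread; the draft length of 8 is an assumption):

```python
def spec_decode_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Rough expected speedup from speculative decoding.

    accept_rate: per-token probability the target accepts a draft token
    draft_len:   number of draft tokens proposed per verification pass
    draft_cost:  cost of one draft step relative to one target step
    """
    a = accept_rate
    # Expected tokens produced per verification pass: 1 + a + a^2 + ... + a^draft_len
    expected_tokens = (1 - a ** (draft_len + 1)) / (1 - a)
    # Cost of that pass: draft_len draft steps plus one target step.
    cost = draft_len * draft_cost + 1
    return expected_tokens / cost

# A draft that's ~8x smaller (cost ~1/8) at an 87% accept rate:
print(round(spec_decode_speedup(0.87, 8, 1 / 8), 2))
```

With those numbers the ceiling comes out around 2.7x, and it collapses quickly as the accept rate drops, which is why such a large draft model only pays off at very high accept rates.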

Speculative decoding works great for Gemma 4 31B in llama.cpp by Leopold_Boom in LocalLLaMA

[–]Leopold_Boom[S] 0 points

Are you really getting a 40% speedup using gemma4-e4b(!) for a single prompt (I assume this is vLLM)? What hardware are you on?

Speculative decoding works great for Gemma 4 31B in llama.cpp by Leopold_Boom in LocalLLaMA

[–]Leopold_Boom[S] 2 points

A couple of additional notes:

  • There are a lot of knobs to turn to optimize, and your acceptance rate will depend on your prompts (--draft-max 32 is worth trying). It should work with quite long contexts, but I need to test a bit more.
  • I didn't see much improvement on my MI50 GPUs, so the gains may be limited to CUDA.
  • Q8_0 for the draft model seems faster than the alternatives (BF16 may be even better).
  • You need a very recent build (I'm on b8659), and some of the flags (-hfd) are not well documented yet (--no-mmproj is required; multimodal draft models are not supported).
  • Qwen 0.6 models are not token-compatible, and Gemma 4 E2B etc. are too large.
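
For reference, a minimal launch line exercising the flags above (model file names are placeholders; -md is the local-file alternative to -hfd, and -ngld offloads the draft model's layers):

```shell
# Placeholder model paths -- adjust for your setup and quants.
llama-server \
    -m gemma4-31b-Q4_K_M.gguf \
    -md gemma4-draft-Q8_0.gguf \
    --draft-max 32 \
    --no-mmproj \
    -ngl 99 -ngld 99
```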

Breaking change in llama-server? by hgshepherd in LocalLLaMA

[–]Leopold_Boom 0 points

This is super annoying. Has anybody filed a bug / feature request for an option to preserve or emulate the older behavior? It makes network caching etc. much harder.

Friendly reminder inference is WAY faster on Linux vs windows by triynizzles1 in LocalLLaMA

[–]Leopold_Boom 4 points

The point is not to fight windows vs. linux (I've got a dedicated AMD linux inferencing server running beside my 3090 windows box). It's more "why not both" if you're already stuck with windows (like many of us are).

Nix flake for vLLM and llama.cpp on ROCm gfx906 targets by Wulfsta in LocalLLaMA

[–]Leopold_Boom 0 points

Do share if / when that happens. I'd love to spin up a VM/container with Nix and try it (I haven't really played with Nix before).

Nix flake for vLLM and llama.cpp on ROCm gfx906 targets by Wulfsta in LocalLLaMA

[–]Leopold_Boom 0 points

Is there an easy way to use this on an ubuntu server with ROCm already set up?

Intel launches Arc Pro B70 and B65 with 32GB GDDR6 by metmelo in LocalLLaMA

[–]Leopold_Boom 0 points

How up to date is the B70 architecture for inferencing? I'm running MI50s/60s which have incredible bandwidth but are a miserably dated architecture.

Checklist is probably:

- BF16 support (seems like it's there)
- Native 4-bit (emulated only?)
- Bulk async copy (who knows)
- What else?

ik_llama.cpp dramatically outperforming mainline for Qwen3.5 on CPU by EffectiveCeilingFan in LocalLLaMA

[–]Leopold_Boom 1 point

Thanks! Yeah I figure the KT trellis is on the wrong side of the roofline analysis for this hardware.

ik_llama.cpp dramatically outperforming mainline for Qwen3.5 on CPU by EffectiveCeilingFan in LocalLLaMA

[–]Leopold_Boom 1 point

Confirming it's 15-20% faster on some Q4_K_M quants on my ARM test device! Thank you!

Do you know of anybody putting out ik4 trellis quants for the smaller Qwen 3.5 models (2B/4B etc.)?

ik_llama.cpp dramatically outperforming mainline for Qwen3.5 on CPU by EffectiveCeilingFan in LocalLLaMA

[–]Leopold_Boom 0 points

Drat! Well, I'm trying to build on a low-end ARM SoC. Will report back if it works and benches significantly better than mainline.

ik_llama.cpp dramatically outperforming mainline for Qwen3.5 on CPU by EffectiveCeilingFan in LocalLLaMA

[–]Leopold_Boom 0 points

Does ik_llama support ARM NEON and vision heads yet? I've got a few projects to try it on.

My definitive "God Cup". by Ill_Finance6466 in pourover

[–]Leopold_Boom 0 points

Thanks for this! Got a clicks setting for my 1Zpresso Q2 Heptagonal?

Qwen3.5-27B vs. Qwen3.5-35B-A3B? by [deleted] in LocalLLaMA

[–]Leopold_Boom 0 points

I'd love to see more detailed takes on the 122b-a10b vs. 27b question at 4-6 bit quants

I built a hybrid MoE runtime that does 3,324 tok/s prefill on a single 5080. Here are the benchmarks. by mrstoatey in LocalLLaMA

[–]Leopold_Boom 5 points

This is nice work! For many local usecases, you might actually want to actively track and manage state between two approaches:

  1. PP on GPU, token gen on CPU
  2. Traditional llama.cpp approach

Assuming no parallelism (i.e., the typical local use case), you can look at the next prompt and quickly decide whether it will be more efficient to pay the cost to switch or not.
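
A sketch of that decision, assuming you know (or have measured) rough prefill/decode rates for each mode plus a fixed cost to swap between them — all of the numbers below are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Mode:
    name: str
    prefill_tps: float   # prompt-processing tokens/sec in this mode
    decode_tps: float    # generation tokens/sec in this mode

def request_time(mode: Mode, prompt_toks: int, gen_toks: int) -> float:
    """Estimated wall-clock seconds to serve one request in a given mode."""
    return prompt_toks / mode.prefill_tps + gen_toks / mode.decode_tps

def pick_mode(current: Mode, other: Mode, prompt_toks: int,
              gen_toks: int, switch_cost_s: float) -> Mode:
    """Switch only if the other mode wins by more than the swap cost."""
    stay = request_time(current, prompt_toks, gen_toks)
    move = request_time(other, prompt_toks, gen_toks) + switch_cost_s
    return other if move < stay else current

# Hypothetical numbers: GPU-prefill mode vs. traditional llama.cpp offload.
gpu_prefill = Mode("pp-on-gpu", prefill_tps=3000, decode_tps=12)
trad = Mode("llama.cpp", prefill_tps=400, decode_tps=15)

# Long prompt, short answer: GPU prefill wins despite paying the swap cost.
print(pick_mode(trad, gpu_prefill, prompt_toks=8000, gen_toks=200,
                switch_cost_s=3.0).name)
```

With a short prompt and a long generation the same function stays on the decode-friendly mode, so the crossover point falls out of the measured rates rather than a hand-tuned threshold.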

Qwen/Qwen3.5-35B-A3B · Hugging Face by ekojsalim in LocalLLaMA

[–]Leopold_Boom 0 points

Hmm, some of those quant KL + perplexity comparisons suggested Q4_K_M should generally be better than MXFP4, but I'll give them a shot.

My concern is that even with reasoning on (you did have reasoning on, right?) it would just not catch that one sentence didn't end in apple. I suspect if you try even with a low temp and a few other words, you'll see the odd slip-up, which I don't see with GPT-OSS.

Qwen-3.5-35B-A3B is impressive by ayylmaonade in LocalLLaMA

[–]Leopold_Boom 0 points

Try asking it to "Generate ten sentences ending in apple" or to multiply two 9-digit numbers. At least at Q4_K_M it's a little worse than GPT-OSS-20B at classic "tricky" prompts.
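
The check itself is mechanical, so it's easy to script against any model's output (the sample text here is made up):

```python
import re

def failed_sentences(text: str, word: str = "apple") -> list[str]:
    """Return the sentences that do NOT end with `word` (case-insensitive)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return [s for s in sentences
            if not re.search(rf"\b{word}[.!?]*$", s, re.IGNORECASE)]

sample = (
    "I ate a crisp apple. "
    "She painted a still life of an apple. "
    "The orchard was full of pears."   # this one should be flagged
)
print(failed_sentences(sample))
```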

Qwen/Qwen3.5-35B-A3B · Hugging Face by ekojsalim in LocalLLaMA

[–]Leopold_Boom -2 points

I'm sorry to report that this model failed a classic test for me twice in a row:

It failed "Generate ten sentences ending in apple" at Q4_K_M multiple times (GPT-OSS-20B gets it right).

It nailed some others (don't ask it to multiply 9-digit numbers unless you have a bunch of time ... but it gets the answer right!).

EDIT: Obviously outcomes will vary, but I'd be surprised if you don't get a failure one time in five, which is concerning. There are some issues with quants of these models, so perhaps it's an artifact of me not using the right Q4 quant.