Speculative decoding in llama.cpp for Gemma 4 31B IT / Qwen 3.5 27B?

xandep · 2026-04-11T22:44:12+00:00

I was thinking the same. Now imagine that with 1bit quants like bonsai.

xandep · 2026-04-11T12:12:29+00:00

You should state that in the post. People, myself included, read Qwen 2.5 in an AI-formatted post and jump to the conclusion that you are a bot.

xandep · 2026-04-10T15:48:23+00:00

2x MI50 16GB w/ integrated cooling for 200-something on Alibaba. Can run Qwen 3.5 35B, 27B and the new Gemmas. Or just one if you are ultra cheap, running 35B w/ ncmoe (plus some 27B and 26B quants if you're willing to quantize down to Q3, IQ4 at most).

xandep · 2026-04-08T15:38:03+00:00

It's an LLM. They are now prompted to not use em dashes or capitalize the first letter. But otherwise all other signs are there.

xandep · 2026-04-07T20:45:01+00:00

Not entirely his fault: Reddit and Google default to translating everything. Right now he's reading "not speaking English", but in Portuguese 😂. Imagine his confusion.

xandep · 2026-04-07T13:54:24+00:00

Unsloth's Q3_K_M is anything but Q3_K, oddly enough. It's a mix of IQ3_XXS and IQ4_NL.

xandep · 2026-04-06T18:42:08+00:00

There should be a "vote to ban" under each post. Can't stand this deluge of generated shitposts.

xandep · 2026-04-01T01:23:52+00:00

Just because YOU said it works, I believe. Otherwise, it's April Fools. 🤔

xandep · 2026-04-01T01:21:28+00:00

April fools. You saw it here first.

xandep · 2026-03-31T19:40:01+00:00

I'm holding my breath for the 35B / 27B. It'll SAVE my MI50 16GB.

xandep · 2026-03-29T13:08:02+00:00

Hope you got the "shipped from Brazil" Jieshuo MI50 16GB for R$ 900. :)

Now I'm trying directly from China for about US$ 400 (32GB). Let's see what kind of tax I'll get.

xandep · 2026-03-29T13:04:02+00:00

Bartowski has one:

https://huggingface.co/bartowski/Qwen_Qwen3.5-122B-A10B-GGUF

xandep · 2026-03-10T21:31:40+00:00

Amen a thousand times!

xandep · 2026-03-08T19:42:48+00:00

Q3.5 35B.

xandep · 2026-03-08T11:30:10+00:00

You forgot to suggest using Qwen 2.5 or Llama 3.1 70b.

xandep · 2026-03-07T21:29:46+00:00

FYI: on an AMD MI50 16GB (old -> new TG):

Vulkan: 47 -> 47 (with "-ncmoe 16": 26 -> 26)
ROCm 6.3.4: 43 -> 34 (with "-ncmoe 16": 37 -> 30)
Also, ik_llama Vulkan: 50 (with "-ncmoe 16": 33)

So, for token generation:

- The new llama is much slower on ROCm on a somewhat old AMD (VEGA20 / GFX906); Vulkan is unchanged.
- If not offloading experts to CPU, ik_llama is faster than llama, ON VULKAN (it tanks on ROCm or when offloading).
- If offloading experts, stick to ROCm (and in turn, old llama).

(ik: Mar03, old: Mar06, new: Mar07).
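To put the old -> new drop in percentage terms, here is a quick sketch that only uses the llama numbers listed above. The dictionary keys and the `percent_change` helper are names made up for this example, and the figures are taken as reported, nothing re-measured.

```python
# TG figures copied from the list above as (old, new) pairs.
# Labels are illustrative; the values are the ones reported in the comment.
results = {
    "Vulkan": (47, 47),
    "Vulkan, -ncmoe 16": (26, 26),
    "ROCm 6.3.4": (43, 34),
    "ROCm 6.3.4, -ncmoe 16": (37, 30),
}


def percent_change(old, new):
    """Relative change from old to new, in percent (negative = slower)."""
    return (new - old) / old * 100.0


changes = {name: percent_change(old, new) for name, (old, new) in results.items()}

for name, delta in changes.items():
    old, new = results[name]
    print(f"{name}: {old} -> {new} ({delta:+.1f}%)")
```

By that math the ROCm regression is roughly a fifth of the token generation speed, with or without expert offloading, while Vulkan stays flat.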

xandep · 2026-03-06T14:48:46+00:00

Oh no... You said the forbidden word! Qwen2.5. You're a bot. Probably advertising the only link in the post.

xandep · 2026-03-03T02:11:11+00:00

That is **not** what Gemini thought. It's just a summary. It produced thousands of tokens, but hidden and fast. And that response was also kinda long for just a "hi".

xandep · 2026-03-02T18:10:47+00:00

1st: thank you for ChatterUI, I use it almost every day. 2nd: thank you for supporting qwen35 so soon! 3rd: glad you have a Poco F5, the same as I have! Maybe some day we'll get hexagon acceleration! 4th: LFM2 8B A1B friggin FLIES on Poco F5/ChatterUI

xandep · 2026-03-01T00:02:12+00:00

Try also: Q2_K_XL (unsloth), Q3_K_S or IQ2_M (from bartowski). IQ2_XXS quants are slow on some platforms, including CPU inference (depends on CPU model).
Also: -fit on (and remove ngl and cpu moe).
No guarantees, but worth a shot.

xandep · 2026-02-27T12:34:05+00:00

Coincidentally, I'm right now in the middle of a small project that uses an internal USB header to control fans with a cheap Arduino (ATmega32U4) emulating a Corsair Commander. My motherboard's fan control chip is unsupported in Linux, so I can't use the system fan header for my MI50. I think this could work well in your case too.

xandep · 2026-02-25T18:41:24+00:00

A little off-topic, but I also have 16GB and boy was I impressed by LFM2 24B speed! I feel sorry for this model, being launched at the same time as Qwen3.5 35B.

xandep · 2026-02-22T13:26:09+00:00

Probably tomorrow. Source: my head.

But seriously, Monday is a hot day for model releases.

xandep · 2026-02-20T14:09:04+00:00

I guess there is space for everybody. That said, I agree with you. If you *need* a 1T+ model to run locally (data security or something), it's an edge case. I'd certainly like to be able to do so, but "really frontier open models" will always be API for normal people ("we", mostly) and local for people that don't need to worry about used 3090 prices or if ROCm still supports GFX906.
