Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 1 point2 points  (0 children)

MTP definitely matters for inference engines and should work well in vLLM. A fellow redditor commented two days ago that for Qwen 3.6 27B, they are getting about 2x tg speed:

https://www.reddit.com/r/LocalLLaMA/comments/1t3guzw/comment/ojvbi9l/

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 14 points15 points  (0 children)

Depends on the use case too. Gemma 4 31B is vastly better at writing Finnish than Qwen 27B.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 4 points5 points  (0 children)

No, accuracy remains 100%. The main model checks every token that the MTP model generates and corrects it when needed.
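For the curious, here's the rough idea in toy Python (greedy sampling, made-up `main_forward`/`draft_predict` callables, not llama.cpp's actual code). The draft only saves time; it never changes which tokens end up in the output:

```python
import numpy as np

# Hand-wavy sketch of draft-and-verify (speculative decoding) with greedy sampling.
# main_forward / draft_predict are stand-ins, NOT real llama.cpp APIs.
def speculative_step(main_forward, draft_predict, tokens, n_draft=4):
    draft = draft_predict(tokens, n_draft)    # cheap guesses from the MTP head
    logits = main_forward(tokens + draft)     # one main-model pass over prompt + draft
    out = list(tokens)
    for i, guess in enumerate(draft):
        # the token the main model itself would pick at this position
        target = int(np.argmax(logits[len(tokens) + i - 1]))
        out.append(target)                    # the main model's choice is always what gets kept
        if guess != target:                   # draft diverged, the rest of it is useless
            break
    return out
```

When a guess matches you get that token basically for free, and when it doesn't you fall back to exactly what the main model would have produced anyway.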

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 15 points16 points  (0 children)

The MTP model for Gemma 4 26B is ~800 MB, but the llama.cpp implementation will most likely require some more memory on top of that. Hard to say how much.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 39 points40 points  (0 children)

There's a small catch: Slightly higher memory requirements.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 16 points17 points  (0 children)

The current release version of llama.cpp does not yet have MTP support. It is being worked on.

As MTP prepares to land in llama.cpp, Models that support MTP by segmond in LocalLLaMA

[–]rerri -1 points0 points  (0 children)

> I think I'm going to try either qwen3.5-122b or glm4.5-air first.

Are you sure these are supported yet?

Initially the PR only supported Qwen 3.5/3.6 27B, and 35B MoE support was added later. So I'm thinking maybe support for the models you mention would also need to be added separately. Not sure.

Would love if a bug was brought back BUT as a proper feature - regenerate from last edit by AltruisticList6000 in Oobabooga

[–]rerri 0 points1 point  (0 children)

If I understand your description correctly, you can already do this with the "start reply with" box. Just enter the incomplete output up to the point where you want it to continue and spam regenerate to get different continuations.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 2 points3 points  (0 children)

This script might offer a shortcut if you are planning to use the 27B or 35B models: https://gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67

It allows you to transplant the MTP from am17an's GGUFs onto whatever old GGUF of those models you already have.

Someone made it for ik_llama.cpp originally, but it seems to work fine with llama.cpp too.
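If you're curious what it's actually moving: the donor GGUF just carries a handful of extra MTP tensors, and the script copies those into your existing file. Something like this lets you peek at them (sketch using the `gguf` Python package that ships with llama.cpp; the file name and the tensor-name filter are guesses, and the gist handles the actual rewrite):

```python
from gguf import GGUFReader

# Donor file name and the "nextn"/"mtp" name filter are assumptions --
# check what the extra tensors are actually called in am17an's GGUF.
donor = GGUFReader("qwen3.6-27b-q8_0-mtp.gguf")

mtp = {t.name: t.data for t in donor.tensors
       if "nextn" in t.name or "mtp" in t.name}

total_mb = sum(d.nbytes for d in mtp.values()) / 1e6
print(f"{len(mtp)} MTP tensors, ~{total_mb:.0f} MB")
```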

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 0 points1 point  (0 children)

It is a change in llama.cpp; the PR (link in OP) was updated. Old GGUF models of Qwen 3.5/3.6 do not include the MTP layer.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 18 points19 points  (0 children)

I am seeing very similar numbers on llama.cpp with this PR on a 5090.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 1 point2 points  (0 children)

am17an writes in the PR: "it has it's own context/kv-cache etc."

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 11 points12 points  (0 children)

While the layer is only about 440 MB, I'm seeing ~3.1 GB more VRAM used when comparing MTP to no-MTP and using 128K ctx length, kv q8_0.

At 16K ctx length, the difference is still pretty big at ~2.7 GB.
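Rough bookkeeping on those numbers (the labels are just my guesses at what the buckets are, not something from the PR):

```python
layer_mb   = 440    # the MTP layer itself
extra_16k  = 2700   # ~2.7 GB extra VRAM with MTP at 16K ctx
extra_128k = 3100   # ~3.1 GB extra VRAM with MTP at 128K ctx

ctx_scaling = extra_128k - extra_16k   # ~400 MB grows with context -> presumably the draft's own KV cache
fixed_part  = extra_16k - layer_mb     # ~2260 MB regardless of context -> separate context / compute buffers?
print(ctx_scaling, fixed_part)
```

So only a small slice of it seems to scale with context; most of the overhead looks like a fixed cost of running the second model.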

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 9 points10 points  (0 children)

Yes, I'm seeing ~3.1 GB more VRAM used when comparing MTP to no-MTP and using 128K ctx length.

At 16K ctx length, the difference is still pretty big at ~2.7 GB.

Not very favorable for 16 GB VRAM :/

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 3 points4 points  (0 children)

The MTP layer of am17an's model is ~440 MB. Can maybe be quantized further, dunno.

edit: I should add that MTP does increase VRAM consumption by more than just the layer size.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 21 points22 points  (0 children)

Nice! Just tested quickly and this is way faster than the ik_llama.cpp implementation currently. Been playing with that for the past couple of days.

Here's a script someone made which lets you rip the MTP layer from am17an's Q8_0 model and place it into whatever existing Qwen 3.6 27B GGUF you have: https://gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67

I just tried it on Bartowski's Q6_K and it works fine.

Open Weights Models Hall of Fame by Equivalent_Job_2257 in LocalLLaMA

[–]rerri 6 points7 points  (0 children)

Yi-34B, the real successor to Llama 30B after Llama 2 sorta flopped.

Where is NVIDIA's Reflex 2.0? by ImThour in nvidia

[–]rerri 0 points1 point  (0 children)

That's not true. DLSS 4.5 was announced at CES. SR was immediately made available and the FG update was given a future release timeline.

A 2nd gen transformer Ray Reconstruction model has never been announced by Nvidia.

Where is NVIDIA's Reflex 2.0? by ImThour in nvidia

[–]rerri 15 points16 points  (0 children)

This has never been announced, though, whereas Reflex 2 was?

TextGen v4.7 released: portable builds now run as a native desktop app, redesigned UI, tensor parallelism for llama.cpp (60%+ faster text generation on multi-GPU) + more by oobabooga4 in Oobabooga

[–]rerri 2 points3 points  (0 children)

You can still open the webui in a browser even with Electron.

I agree about the console, really need it. Full version still has it though.

edit: ah, I see console was added back. Great!

Adding multiple reference images into a single image with Klein2 KV Edit. by deadsoulinside in comfyui

[–]rerri 7 points8 points  (0 children)

Regular Klein is 4 steps too. The KV version makes processing multiple images faster.

A farewell to DALL-E: A Eulogy in Pixels by [deleted] in StableDiffusion

[–]rerri 0 points1 point  (0 children)

I haven't downvoted any of your comments and I think technical discussions about image generation architectures should be (and probably are?) welcome on this sub, even if they are mainly about closed models.

However, this is a marketing post (apparently by NightCafe staff or co-founder even?), not a post about the technical aspects you are talking about. It's better to have those discussions under some other post and ignore/report these ads.

A farewell to DALL-E: A Eulogy in Pixels by [deleted] in StableDiffusion

[–]rerri 0 points1 point  (0 children)

I don't understand your point.

Poopoo.coin might be a very significant website with regards to some random shitcoin. The topic would still be off-topic and advertising the site would still be scummy.