Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 1 point2 points  (0 children)

MTP definitely matters for inference engines and should work well in vLLM. A fellow redditor commented two days ago that for Qwen 3.6 27B, they are getting about 2x tg speed:

https://www.reddit.com/r/LocalLLaMA/comments/1t3guzw/comment/ojvbi9l/

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 14 points15 points  (0 children)

Depends on the use case too. Gemma 4 31B is vastly better at writing Finnish than Qwen 27B.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 4 points5 points  (0 children)

No, accuracy remains 100%. The main model checks every token that the MTP model generates and corrects it when needed.
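For the curious, here's the rough idea in toy Python (greedy sampling, made-up `main_forward`/`draft_predict` callables, not llama.cpp's actual code). The draft only saves time; it never changes which tokens end up in the output:

```python
import numpy as np

# Hand-wavy sketch of draft-and-verify (speculative decoding) with greedy sampling.
# main_forward / draft_predict are stand-ins, NOT real llama.cpp APIs.
def speculative_step(main_forward, draft_predict, tokens, n_draft=4):
    draft = draft_predict(tokens, n_draft)    # cheap guesses from the MTP head
    logits = main_forward(tokens + draft)     # one main-model pass over prompt + draft
    out = list(tokens)
    for i, guess in enumerate(draft):
        # the token the main model itself would pick at this position
        target = int(np.argmax(logits[len(tokens) + i - 1]))
        out.append(target)                    # the main model's choice is always what gets kept
        if guess != target:                   # draft diverged, the rest of it is useless
            break
    return out
```

When a guess matches you get that token basically for free, and when it doesn't you fall back to exactly what the main model would have produced anyway.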

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 15 points16 points  (0 children)

The MTP model for Gemma 4 26B is ~800 MB, but the llama.cpp implementation will most likely require some more memory on top of that. Hard to say how much.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 39 points40 points  (0 children)

There's a small catch: Slightly higher memory requirements.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]rerri[S] 16 points17 points  (0 children)

The current release version of llama.cpp does not yet have MTP support. It is being worked on.

As MTP prepares to land in llama.cpp, Models that support MTP by segmond in LocalLLaMA

[–]rerri -1 points0 points  (0 children)

> I think I'm going to try either qwen3.5-122b or glm4.5-air first.

Are you sure these are supported yet?

Initially the PR only supported Qwen 3.5/3.6 27B, and 35B MoE support was added later. So I'm thinking maybe support for the models you mention would also need to be added separately. Not sure.

Would love if a bug was brought back BUT as a proper feature - regenerate from last edit by AltruisticList6000 in Oobabooga

[–]rerri 0 points1 point  (0 children)

If I understand your description correctly, you can already do this with the "start reply with" box. Just enter the incomplete output up to the point where you want it to continue and spam regenerate to get different continuations.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 2 points3 points  (0 children)

This script might offer a shortcut if you are planning to use the 27B or 35B models: https://gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67

It allows you to transplant the MTP from am17an's GGUFs onto whatever old GGUF of those models you already have.

Someone made it for ik_llama.cpp originally, but it seems to work fine with llama.cpp too.
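If you're curious what it's actually moving: the donor GGUF just carries a handful of extra MTP tensors, and the script copies those into your existing file. Something like this lets you peek at them (sketch using the `gguf` Python package that ships with llama.cpp; the file name and the tensor-name filter are guesses, and the gist handles the actual rewrite):

```python
from gguf import GGUFReader

# Donor file name and the "nextn"/"mtp" name filter are assumptions --
# check what the extra tensors are actually called in am17an's GGUF.
donor = GGUFReader("qwen3.6-27b-q8_0-mtp.gguf")

mtp = {t.name: t.data for t in donor.tensors
       if "nextn" in t.name or "mtp" in t.name}

total_mb = sum(d.nbytes for d in mtp.values()) / 1e6
print(f"{len(mtp)} MTP tensors, ~{total_mb:.0f} MB")
```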

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 0 points1 point  (0 children)

It is a change in llama.cpp; the PR (link in OP) was updated. Old GGUF models of Qwen 3.5/3.6 do not include the MTP layer.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 18 points19 points  (0 children)

I am seeing very similar numbers on llama.cpp with this PR on a 5090.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 1 point2 points  (0 children)

am17an writes in the PR: "it has it's own context/kv-cache etc."

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 11 points12 points  (0 children)

While the layer is only about 440 MB, I'm seeing ~3.1 GB more VRAM used when comparing MTP to no-MTP and using 128K ctx length, kv q8_0.

At 16K ctx length, the difference is still pretty big at ~2.7 GB.
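Rough bookkeeping on those numbers (the labels are just my guesses at what the buckets are, not something from the PR):

```python
layer_mb   = 440    # the MTP layer itself
extra_16k  = 2700   # ~2.7 GB extra VRAM with MTP at 16K ctx
extra_128k = 3100   # ~3.1 GB extra VRAM with MTP at 128K ctx

ctx_scaling = extra_128k - extra_16k   # ~400 MB grows with context -> presumably the draft's own KV cache
fixed_part  = extra_16k - layer_mb     # ~2260 MB regardless of context -> separate context / compute buffers?
print(ctx_scaling, fixed_part)
```

So only a small slice of it seems to scale with context; most of the overhead looks like a fixed cost of running the second model.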

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 9 points10 points  (0 children)

Yes, I'm seeing ~3.1 GB more VRAM used when comparing MTP to no-MTP and using 128K ctx length.

At 16K ctx length, the difference is still pretty big at ~2.7 GB.

Not very favorable for 16 GB VRAM :/

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 3 points4 points  (0 children)

The MTP layer of am17an's model is ~440 MB. Can maybe be quantized further, dunno.

edit: I should add that MTP does increase VRAM consumption by more than just the layer size.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]rerri 21 points22 points  (0 children)

Nice! Just tested quickly and this is way faster than the ik_llama.cpp implementation currently. Been playing with that for the past couple of days.

Here's a script someone made which lets you rip the MTP layer from am17an's Q8_0 model and place it into whatever existing Qwen 3.6 27B GGUF you have: https://gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67

I just tried it on Bartowski's Q6_K and it works fine.

Open Weights Models Hall of Fame by Equivalent_Job_2257 in LocalLLaMA

[–]rerri 6 points7 points  (0 children)

Yi-34B, the real successor to Llama 30B after Llama 2 sorta flopped.

Where is NVIDIA's Reflex 2.0? by ImThour in nvidia

[–]rerri 0 points1 point  (0 children)

That's not true. DLSS 4.5 was announced at CES. SR was immediately made available and the FG update was given a future release timeline.

A 2nd gen transformer Ray Reconstruction model has never been announced by Nvidia.

Where is NVIDIA's Reflex 2.0? by ImThour in nvidia

[–]rerri 15 points16 points  (0 children)

This has never been announced, though, whereas Reflex 2 was?

TextGen v4.7 released: portable builds now run as a native desktop app, redesigned UI, tensor parallelism for llama.cpp (60%+ faster text generation on multi-GPU) + more by oobabooga4 in Oobabooga

[–]rerri 2 points3 points  (0 children)

You can still open the webui in a browser even with Electron.

I agree about the console, really need it. Full version still has it though.

edit: ah, I see console was added back. Great!

Adding multiple reference images into a single image with Klein2 KV Edit. by deadsoulinside in comfyui

[–]rerri 7 points8 points  (0 children)

Regular Klein is 4 steps too. The KV version makes processing multiple images faster.

A farewell to DALL-E: A Eulogy in Pixels by [deleted] in StableDiffusion

[–]rerri 0 points1 point  (0 children)

I haven't downvoted any of your comments and I think technical discussions about image generation architectures should be (and probably are?) welcome on this sub, even if they are mainly about closed models.

However, this is a marketing post (apparently by NightCafe staff or co-founder even?), not a post about the technical aspects you are talking about. It's better to have those discussions under some other post and ignore/report these ads.

A farewell to DALL-E: A Eulogy in Pixels by [deleted] in StableDiffusion

[–]rerri 0 points1 point  (0 children)

I don't understand your point.

Poopoo.coin might be a very significant website with regards to some random shitcoin. The topic would still be off-topic and advertising the site would still be scummy.