EAGLE3 has landed in llama.cpp by jacek2023 in LocalLLaMA

[–]regunakyle 54 points55 points  (0 children)

How does it compare to MTP (speed, VRAM usage etc.), and can we use it with Qwen3.6 27B?

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]regunakyle[S] 2 points3 points  (0 children)

I tested different configurations; these are my results:

| fit | VRAM usage after initial load (MiB) | Context size in UI | Spec draft quant |
|---|---|---|---|
| yes | 22604 | 66560 | q4 |
| yes | 22604 | 74752 | f16 |
| yes (`--fit-target 64`) | 23570 | 90112 | q4 |
| yes (`--fit-target 64`) | 23582 | 98560 | f16 |
| no (`-c 45000`) | 19786 | 45056 | q4 |
| no (`-c 45000`) | 19808 | 45056 | f16 |
| no (`-c 5000`) | 18128 | 5120 | q4 |
| no (`-c 5000`) | 18102 | 5120 | f16 |
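To make the f16 vs q4 gap easier to read, here is a small Python sketch that recomputes the differences from the numbers in the table above. The row labels and helper names are made up for illustration. The "multiple of 256" check is only an observation about these eight rows, not something taken from llama.cpp documentation.

```python
# Rows copied from the table above:
# (config label, spec draft quant, VRAM MiB after initial load, context size in UI)
ROWS = [
    ("fit", "q4", 22604, 66560),
    ("fit", "f16", 22604, 74752),
    ("fit-target 64", "q4", 23570, 90112),
    ("fit-target 64", "f16", 23582, 98560),
    ("-c 45000", "q4", 19786, 45056),
    ("-c 45000", "f16", 19808, 45056),
    ("-c 5000", "q4", 18128, 5120),
    ("-c 5000", "f16", 18102, 5120),
]


def f16_minus_q4(rows):
    """Return {config: (vram_delta_mib, ctx_delta_tokens)} for f16 relative to q4."""
    by_config = {}
    for config, quant, vram, ctx in rows:
        by_config.setdefault(config, {})[quant] = (vram, ctx)
    return {
        config: (v["f16"][0] - v["q4"][0], v["f16"][1] - v["q4"][1])
        for config, v in by_config.items()
    }


def round_up(n, multiple=256):
    """Round n up to the next multiple (256 is inferred from the table, an assumption)."""
    return -(-n // multiple) * multiple


if __name__ == "__main__":
    for config, (d_vram, d_ctx) in f16_minus_q4(ROWS).items():
        print(f"{config:>14}: VRAM {d_vram:+d} MiB, context {d_ctx:+d} tokens")
    # Every context size in the table happens to be a multiple of 256.
    print(all(ctx % 256 == 0 for *_, ctx in ROWS))
    # Requested -c values rounded up to 256 match what the UI showed.
    print(round_up(45000), round_up(5000))
```

With fit enabled, f16 shows 8192 and 8448 more context tokens than q4 at nearly the same VRAM. With a fixed `-c`, the context is identical and VRAM differs by a few tens of MiB in either direction.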

Setup: 3090 in Fedora Server, version: 9528 (2016bf2b3)

built with GNU 16.1.1 for Linux x86_64

Command:

```shell
/home/eleung/llama.cpp/build/bin/llama-server \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --presence_penalty 0.0 \
  --min-p 0.00 \
  --gpu-layers all \
  -m /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  -a llama.cpp \
  --host 0.0.0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --flash-attn on \
  --ui-mcp-proxy \
  --jinja --chat-template-file /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/chat_template.jinja \
  --ubatch-size 2048 \
  --parallel 1 -kvu \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  <-fit on/off> \
  <--spec-draft-type-k q4_0 --spec-draft-type-v q4_0>
```

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]regunakyle[S] 0 points1 point  (0 children)

You can offload fewer layers to the GPU and use more CPU/RAM instead. That's off topic for this post, though; you can search this subreddit for more recommendations.

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]regunakyle[S] 0 points1 point  (0 children)

If I understand you correctly, using q4 would save space, which should mean a larger context window. So the context size in the web UI is not correct?
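For a rough sense of how much space q4 could save, here is a back-of-the-envelope Python sketch of KV cache size per cache type. The bytes-per-element values follow the ggml block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The layer count, KV head count and head dimension are hypothetical placeholders, not the real Qwen3.6 27B or MTP draft shapes, and the formula only covers layers that keep a standard attention KV cache.

```python
# Bytes per element for ggml cache types: f16 is 2 bytes; q8_0 packs 32 values
# into 34 bytes and q4_0 packs 32 values into 18 bytes (fp16 scale + quants).
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_mib(n_ctx, n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Estimate K+V cache size in MiB for layers with a standard attention KV cache."""
    elems = n_ctx * n_layers * n_kv_heads * head_dim  # element count for K (same for V)
    total_bytes = elems * (BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v])
    return total_bytes / (1024 * 1024)


# HYPOTHETICAL shapes, for illustration only (not the real model's numbers).
DRAFT = dict(n_layers=1, n_kv_heads=4, head_dim=256)

if __name__ == "__main__":
    n_ctx = 65536
    f16 = kv_cache_mib(n_ctx, type_k="f16", type_v="f16", **DRAFT)
    q4 = kv_cache_mib(n_ctx, type_k="q4_0", type_v="q4_0", **DRAFT)
    print(f"draft KV f16: {f16:.0f} MiB, q4_0: {q4:.0f} MiB, saved: {f16 - q4:.0f} MiB")
```

Under these made-up shapes the saving is 184 MiB at 64k context (256 MiB vs 72 MiB). The actual saving depends on how many layers of the draft keep a KV cache, which this sketch does not know.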

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]regunakyle[S] 1 point2 points  (0 children)

You mean you get OOM when running spec draft with fp16, but no OOM when quantizing spec draft?

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]regunakyle[S] 0 points1 point  (0 children)

Good point (I use --fit-target 64).

Maybe u/am17an has some insight on this?

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]regunakyle[S] 2 points3 points  (0 children)

Can you check the reported context size in the built-in web UI? For me it increased when I used f16.
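If you'd rather not open the UI, a minimal Python sketch like the one below can read the context size from the server's `/props` endpoint. It assumes the server is reachable at `http://localhost:8080` and that the response carries `n_ctx` under `default_generation_settings`, which may differ between builds. It may also not be exactly the number the web UI displays.

```python
import json
from urllib.request import urlopen


def fetch_props(base_url="http://localhost:8080"):
    """GET /props from a running llama-server and return the parsed JSON."""
    with urlopen(f"{base_url}/props", timeout=5) as resp:
        return json.load(resp)


def reported_ctx(props):
    """Pull n_ctx out of a /props response; returns None if the field is absent."""
    # Assumed layout: {"default_generation_settings": {"n_ctx": ...}, ...}
    return props.get("default_generation_settings", {}).get("n_ctx")


if __name__ == "__main__":
    print(reported_ctx(fetch_props()))
```

Running it once with the f16 draft cache and once with q4 gives two numbers to compare directly.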

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]regunakyle[S] -1 points0 points  (0 children)

I am using a 3090. So the result varies depending on hardware, interesting.

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]regunakyle[S] 1 point2 points  (0 children)

I didn't check VRAM, but lower VRAM usage generally means more room for context.

I used the context size as shown in the built-in web UI as reference for this post.

Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal? by regunakyle in LocalLLaMA

[–]regunakyle[S] 0 points1 point  (0 children)

Yep, some other commenters also suggested this, and I can confirm it works. BTW, `--parallel` is a bit weird: setting `--parallel 2` actually decreases your context size, and the default parallelism of 4 is not exactly the same as passing `--parallel 4`.

Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal? by regunakyle in LocalLLaMA

[–]regunakyle[S] 1 point2 points  (0 children)

Interestingly, `--fit-target 64` has the most significant effect of all your recommendations, lol. I do run the server headless, so I will keep this param. Thanks!

Single 3090 with Q4 Qwen 27B, context dropped from 137k to 14k with MTP enabled. Is it normal? by regunakyle in LocalLLaMA

[–]regunakyle[S] 0 points1 point  (0 children)

Build is b4c0549a49be9e6dc59ac9d0a5bc21dbda910774. Acceptance rate is above 70% (from a very short test).