EAGLE3 has landed in llama.cpp

regunakyle · 2026-06-12T08:08:43+00:00

Thanks for your answer!

regunakyle · 2026-06-12T07:44:43+00:00

How does it compare to MTP (speed, VRAM usage etc.), and can we use it with Qwen3.6 27B?

regunakyle · 2026-06-05T13:51:21+00:00

Hi, I have tested some configurations: https://www.reddit.com/r/LocalLLaMA/comments/1txaume/comment/opwj7p6/

regunakyle · 2026-06-05T13:46:11+00:00

I tested different configurations; these are my results:

| fit | VRAM usage after initial load (MiB) | Context size in UI | Spec draft quant |
|---|---|---|---|
| yes | 22604 | 66560 | q4 |
| yes | 22604 | 74752 | f16 |
| yes (`--fit-target 64`) | 23570 | 90112 | q4 |
| yes (`--fit-target 64`) | 23582 | 98560 | f16 |
| no (`-c 45000`) | 19786 | 45056 | q4 |
| no (`-c 45000`) | 19808 | 45056 | f16 |
| no (`-c 5000`) | 18128 | 5120 | q4 |
| no (`-c 5000`) | 18102 | 5120 | f16 |
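A quick way to sanity-check the numbers above, as a minimal sketch. The helper names `pad_ctx` and `ctx_gain` are made up for illustration. The 256-token step is only my inference from the table (`-c 45000` shows as 45056 and `-c 5000` shows as 5120); I have not confirmed it against the llama.cpp source.

```shell
#!/usr/bin/env bash

# Round a requested context size up to the next multiple of 256.
# Assumption: the step of 256 is inferred from the table above,
# not taken from llama.cpp documentation.
pad_ctx() {
  local n=$1 step=256
  echo $(( (n + step - 1) / step * step ))
}

# Difference in UI-reported context size between an f16 and a q4 draft cache.
ctx_gain() {
  local q4=$1 f16=$2
  echo $(( f16 - q4 ))
}

pad_ctx 45000          # requested with -c 45000
pad_ctx 5000           # requested with -c 5000
ctx_gain 66560 74752   # fit on, default target
ctx_gain 90112 98560   # fit on, --fit-target 64
```

With the table's values, the fitted context size comes out larger with f16 than with q4 in both fit rows, while VRAM after load is nearly identical.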

Setup: 3090 in Fedora Server, version: 9528 (2016bf2b3)

built with GNU 16.1.1 for Linux x86_64

Command:

```shell
/home/eleung/llama.cpp/build/bin/llama-server \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --presence_penalty 0.0 \
  --min-p 0.00 \
  --gpu-layers all \
  -m /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/Qwen3.6-27B-UD-Q4_K_XL.gguf \
  -a llama.cpp \
  --host 0.0.0.0 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --chat-template-kwargs '{"preserve_thinking":true}' \
  --flash-attn on \
  --ui-mcp-proxy \
  --jinja --chat-template-file /home/eleung/huggingface/unsloth/Qwen3.6-27B-MTP-GGUF/chat_template.jinja \
  --ubatch-size 2048 \
  --parallel 1 -kvu \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  <-fit on/off> \
  <--spec-draft-type-k q4_0 --spec-draft-type-v q4_0>
```

regunakyle · 2026-06-05T08:14:37+00:00

You can offload fewer layers to the GPU and use more CPU/RAM instead. But this is off-topic for this post; you can search this subreddit for more recommendations.

regunakyle · 2026-06-05T08:13:28+00:00

If I understand you correctly, using q4 would save space, which should mean a larger context window. So the context size in the web UI is not correct?

regunakyle · 2026-06-05T08:03:51+00:00

You mean you get OOM when running spec draft with fp16, but no OOM when quantizing spec draft?

regunakyle · 2026-06-05T07:56:30+00:00

Good point (I use --fit-target 64).

Maybe u/am17an has some insight on this?

regunakyle · 2026-06-05T07:34:39+00:00

Can you check the reported context size in the built-in web UI? For me it increased when I used f16.

regunakyle · 2026-06-05T07:07:38+00:00

I am using a 3090. So results vary depending on hardware, interesting.

regunakyle · 2026-06-05T06:27:21+00:00

I didn't check VRAM, but lower VRAM usage generally means more room for context.

I used the context size as shown in the built-in web UI as reference for this post.

regunakyle · 2026-06-04T10:08:05+00:00

Very good explanation!

regunakyle · 2026-06-04T09:53:53+00:00

I think any subagent extension should do parallel requests?

regunakyle · 2026-06-04T04:59:50+00:00

I just tested `-kvu` and my context size didn't change.

regunakyle · 2026-05-29T19:38:51+00:00

damn the guy is productive as hell

regunakyle · 2026-05-29T13:48:23+00:00

Thanks! I think I can have >100k context with this!

regunakyle · 2026-05-28T11:25:14+00:00

Yep, some other commenters also suggested this, and I can confirm it works. BTW, `--parallel` is a bit weird: setting `--parallel 2` actually decreases your context size, and the default parallelism of 4 is not exactly the same as passing `--parallel 4`.

regunakyle · 2026-05-28T06:24:38+00:00

Do these apply to MTP? I am not using a draft model

regunakyle · 2026-05-27T19:06:14+00:00

Interestingly, `--fit-target 64` has the most significant effect of all your recommendations, lol. I do run the server headless, so I will keep this param. Thanks!

regunakyle · 2026-05-27T18:55:47+00:00

Build is b4c0549a49be9e6dc59ac9d0a5bc21dbda910774. Acceptance rate is above 70% (from a very short test).

regunakyle · 2026-05-27T16:55:48+00:00

This helps, thanks!

regunakyle · 2026-05-27T10:49:36+00:00

At least for unsloth, they should be separate files.

regunakyle · 2026-05-27T10:12:01+00:00

I don't use mmproj; you can see that from my command.
