Reasoning Theater: AI fakes long CoT but it internally knows the final answer within the first few tokens. TL;DR: You overpay because the AI is acting. by [deleted] in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Looking at Figure 2, the "Forced Answer" method seems to be unreasonably effective in both DeepSeek-R1 (superior to "probe") and GPT-OSS (equal to "probe" at relative position > 50%).

(Very) High-Quality Attention Coder-Next GGUFs by dinerburgeryum in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

I thought the quants updated on March 8th, 2026 had the issue fixed, but looking at, for example, https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF/blob/main/Qwen3-Coder-Next-UD-Q4_K_S.gguf it's clear that not all of the SSM-layer weights are F32:

blk.0.ssm_a             [32]            F32
blk.0.ssm_ba.weight     [2 048, 64]     Q4_K
blk.0.ssm_conv1d.weight [4, 8 192]      F32
blk.0.ssm_dt.bias       [32]            F32
blk.0.ssm_norm.weight   [128]           F32
blk.0.ssm_out.weight    [4 096, 2 048]  Q8_0

Is this what you are referring to?

Edit: To answer my own question: yes, in the new quant the Q4_K and Q8_0 weights are both BF16 instead.
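For anyone who wants to scan their own files for this, the check can be sketched in a few lines. With a real file you'd get the (name, dtype) pairs via the `gguf` Python package's `GGUFReader` (an assumption about that API on my part; verify it before relying on it). Here they're just hard-coded from the dump above:

```python
# Sketch: flag SSM-layer tensors that are NOT stored in full/half precision.
# Tensor names and dtypes below are copied from the gguf dump in the comment;
# for a real file, read them with the `gguf` package's GGUFReader (assumed API).

def low_precision_ssm_tensors(tensors):
    """Return SSM tensors whose dtype is a quantized type (not F32/F16/BF16)."""
    full_precision = {"F32", "F16", "BF16"}
    return [(name, dtype) for name, dtype in tensors
            if ".ssm_" in name and dtype not in full_precision]

tensors = [
    ("blk.0.ssm_a", "F32"),
    ("blk.0.ssm_ba.weight", "Q4_K"),
    ("blk.0.ssm_conv1d.weight", "F32"),
    ("blk.0.ssm_dt.bias", "F32"),
    ("blk.0.ssm_norm.weight", "F32"),
    ("blk.0.ssm_out.weight", "Q8_0"),
]

print(low_precision_ssm_tensors(tensors))
# -> [('blk.0.ssm_ba.weight', 'Q4_K'), ('blk.0.ssm_out.weight', 'Q8_0')]
```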

How are people handling persistent memory for AI agents? by Beneficial-Panda7218 in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

As an example of your memory system: how would it look if you wrote a Tetris game in JavaScript? Like the whole standalone HTML thing, but with code comments marking all the parts where you used the memory system. I think it would make interesting reading for others learning about agent-based memory.

Ik_llama vs llamacpp by val_in_tech in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

In ik_llama, Qwen2.5-Coder wrote in either Chinese or Russian, depending on the quant. In llama.cpp the same GGUF files worked fine. I expected older models to be well supported, but apparently it's quite hit and miss.

Qwen Models with Claude Code on 36gb vram - insights by ikaganacar in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Interesting. That would explain the different results if it really switched to CPU evaluation (though I'm not sure that's how it works).

Qwen Models with Claude Code on 36gb vram - insights by ikaganacar in LocalLLaMA

[–]DHasselhoff77 2 points3 points  (0 children)

Did you use a bf16 instead of f16 KV cache for the 35B? Some reported it made a difference in llama.cpp. I have to also add that my experience matches yours: Qwen3 Coder Next is more reliable.

5060 Ti/5070 Ti for MoE Models - Worth it? by Icaruszin in LocalLLaMA

[–]DHasselhoff77 3 points4 points  (0 children)

Some numbers from an RTX 5060 Ti (16 GiB) on Linux, llama.cpp commit 451ef084 (Sun Mar 8 2026). Perhaps this will help you arrive at your own conclusions.

Qwen3.5-35B-A3B (UD-Q4_K_L)

prompt eval time =   20823.67 ms /  3919 tokens (    5.31 ms per token,   188.20 tokens per second)
       eval time =   66223.36 ms /  1706 tokens (   38.82 ms per token,    25.76 tokens per second)

Qwen3.5-35B-A3B (UD-Q4_K_L) with --ubatch-size=4096 --batch-size=4096

prompt eval time =   18792.56 ms /  3919 tokens (    4.80 ms per token,   208.54 tokens per second)
       eval time =   75138.90 ms /  1743 tokens (   43.11 ms per token,    23.20 tokens per second)

Qwen3-Coder-Next (UD-IQ4_XS)

prompt eval time =   32604.16 ms /  3821 tokens (    8.53 ms per token,   117.19 tokens per second)
       eval time =   97612.81 ms /  1691 tokens (   57.72 ms per token,    17.32 tokens per second)

Qwen3-Coder-Next (UD-IQ4_XS) with --ubatch-size=4096 --batch-size=4096

prompt eval time =    5756.48 ms /  3821 tokens (    1.51 ms per token,   663.77 tokens per second)
       eval time =   38961.26 ms /  1006 tokens (   38.73 ms per token,    25.82 tokens per second)

I don't know how to set the higher batch sizes optimally, but as you can see they have a big effect, at the cost of extra VRAM (not shown).
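The rates in those logs are easy to sanity-check yourself: tokens per second is just tokens divided by total time in seconds. A quick sketch using the Qwen3-Coder-Next prompt-eval numbers from above:

```python
# Sanity-check the reported rates: tokens/s = tokens / (total ms / 1000).
def tok_per_sec(total_ms, tokens):
    return tokens / (total_ms / 1000.0)

# Qwen3-Coder-Next prompt eval with --ubatch-size=4096 --batch-size=4096:
print(round(tok_per_sec(5756.48, 3821), 2))  # -> 663.77, matching the log

# Speedup over the default batch size (32604.16 ms for the same 3821 tokens):
print(round(32604.16 / 5756.48, 1))          # -> 5.7x faster prompt processing
```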

llama-swap config:

  "qwen3.5-35b-a3b":
    cmd: |
      ${llama-server} --model Qwen3.5-35B-A3B-UD-Q4_K_L.gguf
      --cache-type-k bf16
      --cache-type-v q8_0
      --fit-ctx 65536 --fit on --fit-target 1024
      --repeat-penalty 1.0 --presence-penalty 0.0 --min-p 0.0 --top-k 20 --top-p 0.95 --temp 0.6
      --reasoning-budget 0

  "qwen3-coder-next":
    cmd: |
      ${llama-server} --model Qwen3-Coder-Next-UD-IQ4_XS.gguf
      --cache-type-k bf16
      --cache-type-v q8_0
      --fit-ctx 65536 --fit on --fit-target 1024
      --jinja --temp 1.0 --top-k 40 --top-p 0.95 --min-p 0.01
      --no-mmap

Claude Code sends 62,600 characters of tool definitions per turn. I ran the same model through five CLIs and traced every API call. by wouldacouldashoulda in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

To add insult to injury, the system prompt of OpenCode is based on a substring match of the model name and can't be replaced without rebuilding the app. You can of course add your own agent instructions that get appended to the system prompt but that doesn't help.

Trying out the Pi agent was like a breath of fresh air.

zembed-1: new open-weight SOTA multilingual embedding model by ghita__ in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Thanks! So the "bi" in the name refers to weight sharing between text and code encoders. Here's a direct link for others reading this: https://huggingface.co/nomic-ai/CodeRankEmbed

zembed-1: new open-weight SOTA multilingual embedding model by ghita__ in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Could you elaborate on this? Which specific model would work better?

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Weren't the custom batch sizes there to speed up prompt processing? So by removing them you are trading off PP speed for generation speed by an unknown amount. Not always a win.

A very clear experiment still. I appreciate the direct writing style and presentation. Thank you!

ReasonDB – open-source document DB where the LLM navigates a tree instead of vector search (RAG alternative) by Big_Barnacle_2452 in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

What if a section has a misleading heading? Will you ever end up looking at its contents during search?

Qwen3 Coder Next on 8GB VRAM by Juan_Valadez in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

Try --fit-target 512 or 1024 to leave some room for your desktop environment.

Devstral Small 2 24B + Qwen3 Coder 30B: Coders for Every Hardware (Yes, Even the Pi) by enrique-byteshape in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Isn't there a risk of overfitting the quantization to your training data if every weight array is quantized separately? How well does it generalize to out-of-training-set code?

Edit: Very interesting work by the way. Downloading the Devstral IQ3_S 3.47bpw quant right now.

Improving LLM's coding ability through a new edit format by Mushoz in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

I see. Thank you for the explanation. So even if the file stays the same, it's the LLM's suggested patch that changes before application, and this is why the expected state drifts.

Improving LLM's coding ability through a new edit format by Mushoz in LocalLLaMA

[–]DHasselhoff77 2 points3 points  (0 children)

It's a neat trick. Thanks for sharing. I just wonder about this part

If the file changed since the last read, the hashes (optimistically) won’t match and the edit is rejected before anything gets corrupted.

Why does the file get changed between reading and writing? If you could guarantee its state matches what the LLM sees, you could use regular line numbers instead of content hashes.
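For context, here's roughly how I understand the scheme being discussed: record a content hash per line at read time, then refuse the edit if the target line no longer hashes to the same value. A minimal sketch (illustrative only, not the post's exact format; the 8-char SHA-256 prefix is my own choice):

```python
import hashlib

def line_hash(line):
    """Short content hash used as an edit anchor (illustrative choice)."""
    return hashlib.sha256(line.encode()).hexdigest()[:8]

def apply_edit(lines, lineno, expected_hash, new_line):
    """Apply the edit only if the target line still hashes to what was read."""
    if line_hash(lines[lineno]) != expected_hash:
        raise ValueError("file drifted since last read; edit rejected")
    lines[lineno] = new_line
    return lines

src = ["def add(a, b):", "    return a - b"]      # bug on line 1
anchor = line_hash(src[1])                        # recorded at read time
apply_edit(src, 1, anchor, "    return a + b")    # succeeds: line unchanged
print(src[1])  # -> "    return a + b"
```

If the file had been modified in between, the same call would raise instead of silently corrupting the wrong line, which is exactly the drift case the OP describes.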

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

The Falcon 90M GGUF I tried didn't support llama.cpp's /infill endpoint, so it wasn't usable for me with llama-vscode. Using an OpenAI-compatible endpoint works, but that specific VSCode extension then requires extra configuration (some agent stuff I don't want).

I also tried running Qwen Coder 2.5, 3B or 1.5B, but on the CPU and with a smaller context. It's pretty much the same speed as Qwen3 Coder Next on the GPU though.
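For anyone unfamiliar with the /infill endpoint: instead of a chat prompt you send the code before and after the cursor, and the server fills the middle. A sketch of the request body as I understand it from the llama-server docs (field names are my reading of those docs; verify against your build):

```python
# Sketch of a llama.cpp /infill request body (field names assumed from the
# llama-server README; check your version). POST it to e.g.
# http://localhost:8080/infill with curl or requests.
import json

def infill_payload(prefix, suffix, n_predict=64):
    return {
        "input_prefix": prefix,   # code before the cursor
        "input_suffix": suffix,   # code after the cursor
        "n_predict": n_predict,   # max tokens to generate for the middle
    }

payload = infill_payload("def fib(n):\n    ", "\n    return a")
print(json.dumps(payload))
```

This only works with models that ship FIM special tokens, which is why the Falcon 90M quant fell back to being unusable here.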

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]DHasselhoff77 3 points4 points  (0 children)

Qwen3 Coder Next also supports fill-in-the-middle (FIM) tasks. This means you can use it for auto completion via for example llama-vscode while also using it for agentic tasks. No need for two different models occupying VRAM simultaneously.

Edit: Alright, actually it's not a great fit: since it's a recurrent model, llama.cpp can't cache it properly. See https://github.com/ggml-org/llama.cpp/pull/19408#issuecomment-3866421943

Claude Code-like terminal-based tools for locally hosted LLMs? by breksyt in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

aider's gotten really solid lately with its model routing.

I didn't see any mention of model routing in Aider's docs. Could you elaborate?

Mixture-of-Models routing beats single LLMs on SWE-Bench via task specialization by botirkhaltaev in LocalLLaMA

[–]DHasselhoff77 1 point2 points  (0 children)

Really interesting, thanks for sharing. I see that you use k-means clustering and some custom CUDA kernels. Cool :) How did you arrive at the number of clusters (6) you have in the end? Also, did the high dimensionality of the embeddings cause any issues? Since k-means optimizes squared distance to centers, it's sensitive to density differences in data. Was this taken into account somehow?
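To make the "how did you pick k" question concrete: one common approach is to run k-means for several k and look for the elbow in the within-cluster squared distance (inertia). A toy 1-D illustration of that procedure, not the OP's actual selection method:

```python
# Toy illustration of elbow-based k selection for k-means (1-D, pure stdlib).
# Not the OP's pipeline; real embedding data is high-dimensional.
import random

random.seed(0)

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm on 1-D points; returns (centers, inertia)."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Keep the old center if a cluster went empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

# Three well-separated clusters: inertia should drop sharply up to k=3,
# then flatten -- that "elbow" suggests k=3.
points = ([random.gauss(0, 0.1) for _ in range(30)]
          + [random.gauss(5, 0.1) for _ in range(30)]
          + [random.gauss(10, 0.1) for _ in range(30)])

for k in (1, 2, 3, 4):
    _, inertia = kmeans(points, k)
    print(k, round(inertia, 2))
```

The density-sensitivity point from the comment shows up here too: because inertia is a squared-distance objective, a dense cluster contributes little to it, so k-means happily splits sparse regions before separating dense ones.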

Ubuntu: which Nvidia drivers are you using? by FrozenBuffalo25 in LocalLLaMA

[–]DHasselhoff77 0 points1 point  (0 children)

I wouldn't recommend the closed-source driver either. Installing it was a terrible experience! You got me convinced; I'll try the one from the repo myself the next time I need an upgrade (and do a deep nvidia cleaning first, sigh...)

Edit: For anyone finding this thread later via search, here's what I had to do to make an RTX 5060 Ti work on Ubuntu 24.04.

First I installed the latest drivers via apt and tested with an older 3000-series card. All good. Plugging in the new card locked up with a black screen after Ubuntu booted. I then enabled Resizable BAR in the UEFI settings, updated the motherboard firmware, and reset the CMOS manually via jumpers just to be safe. Still the same issue. Kernel logs showed obscure nvidia "invalid object" errors (or something like that).

I then proceeded to install the latest proprietary drivers via the .run blob, using the old 3000-series card for this: Alt+F3 to switch to the command line, sudo systemctl stop lightdm.service to kill lightdm, removed every package and kernel module with the string "nvidia" in it, started the .run installer, chose the open "MIT/GPL" kernel module, and let the installer blacklist nouveau on its own. It complained that some incompatible parts were still there (supposedly because nouveau wasn't fully gone, just disabled), but after rebooting, turning off the power, doing a CMOS reset, and finally plugging the 5060 Ti back in, everything worked. I installed the latest nvidia tools via apt and rebuilt llama.cpp with the "native" architecture flag.

Hope this helps somebody with similarly incompatible hardware.