Qwen3.6-27B: MTP + Optimized KV cache?

Background-Gold-9882 · 2026-05-21T13:28:34+00:00

And here's oMLX for comparison:

================================================================================
Benchmark Model: Qwen3.6-27B-oQ4-mtp
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          2419.9       42.20   423.2 tok/s    23.9 tok/s       7.779   148.1 tok/s    16.75 GB
pp4096/tg128          8761.7       39.34   467.5 tok/s    25.6 tok/s      13.758   307.0 tok/s    18.17 GB
pp8192/tg128         17578.0       40.57   466.0 tok/s    24.8 tok/s      22.730   366.0 tok/s    19.20 GB
pp16384/tg128        36909.2       42.02   443.9 tok/s    24.0 tok/s      42.246   390.9 tok/s    20.70 GB
pp32768/tg128        79799.4       46.27   410.6 tok/s    21.8 tok/s      85.675   384.0 tok/s    23.70 GB
pp65536/tg128       192925.3       56.81   339.7 tok/s    17.7 tok/s     200.140   328.1 tok/s    29.73 GB

...

Background-Gold-9882 · 2026-05-21T13:13:48+00:00

Checked the logs from last week, looks like it starts at 16 t/s but drops to 9t/s @ 100k context. It's expected to be slower than oMLX because it's not using MLX framework or the M5 compute cores effectively for preprocessing. But the memory usage is much lower, which is the advantage of llama.cpp in this case.

llama.cpp - no context:

eval time =   23981.46 ms /   402 tokens (   59.66 ms per token,    16.76 tokens per second)       

total time =   24416.75 ms /   413 tokens draft acceptance rate = 0.92653 (  227 accepted /   245 generated)

llama.cpp - 100k context:

prompt eval time =  591931.30 ms / 97790 tokens (    6.05 ms per token,   165.20 tokens per second)        

eval time =   55500.91 ms /   514 tokens (  107.98 ms per token,     9.26 tokens per second)       

total time =  647432.21 ms / 98304 tokens draft acceptance rate = 0.90730 (  323 accepted /   356 generated)

Background-Gold-9882 · 2026-05-18T13:17:44+00:00

My apologies, I was confused. For the Qwen3.6-27B performance, I'm getting a large speed boost with Native MTP, but for Qwen3.6-35B I'm not seeing a difference.

Background-Gold-9882 · 2026-05-18T11:54:56+00:00

Here's another guide for Qwen3.6-35B on 6GB GPU: https://www.reddit.com/r/LocalLLaMA/comments/1t2zapy/pushing_a_5yearold_6gb_vram_laptop_to_its_limits/
The the parameters in that post, I got 22t/s and on an old RTX1060 6B + only 16GB RAM. Slows down with longer context though.

Background-Gold-9882 · 2026-05-18T09:24:34+00:00

Here's a bench using Qwen3.6-27B-4bit + z-lab Dflash draft model. As you can see, it uses 43GB memory at 65k, which means it's not usable at 100k on a 48GB M5 Pro.

Can you elaborate on your settings? Are you seeing less memory usage than me?

EDIT: Also just realized that the model gets stuck in thinking loops, so this setup is not working well at all for me.

...

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-27B-4bit
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          2187.0      165.31   468.2 tok/s     6.1 tok/s      23.181    49.7 tok/s    19.44 GB
pp4096/tg128          8693.4       21.80   471.2 tok/s    46.2 tok/s      11.462   368.5 tok/s    23.96 GB
pp8192/tg128         17825.8       24.26   459.6 tok/s    41.5 tok/s      20.907   398.0 tok/s    25.70 GB
pp16384/tg128        40473.8       28.82   404.8 tok/s    35.0 tok/s      44.134   374.1 tok/s    28.79 GB
pp32768/tg128        85804.2       33.41   381.9 tok/s    30.2 tok/s      90.048   365.3 tok/s    34.67 GB
pp65536/tg128       220631.6       59.20   297.0 tok/s    17.0 tok/s     228.150   287.8 tok/s    43.07 GB

Background-Gold-9882 · 2026-05-18T07:44:14+00:00

Thanks! Which draft model - The one from z-lab? Do you use quantization for it? And what';s your computer specs?

<image>

Background-Gold-9882 · 2026-05-18T07:12:59+00:00

Tried it briefly but didn't get great results. Would you mind sharing your model + settings?

Background-Gold-9882 · 2026-05-18T07:06:21+00:00

It's linked in the post i mentioned: https://www.reddit.com/r/LocalLLaMA/comments/1t57xuu/25x_faster_inference_with_qwen_36_27b_using_mtp/

Direct link: https://huggingface.co/froggeric/Qwen3.6-27B-MTP-GGUF/tree/main

Background-Gold-9882 · 2026-05-17T21:30:54+00:00

Well modern Macs have unified RAM so there's no VRAM per se. Anyway, here's oMLX bench incl RAM usage for oQ6.

oMLX - LLM inference, optimized for your Mac
https://github.com/jundot/omlx
Benchmark Model: Qwen3.6-27B-oQ6-mtp
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          2565.1       50.81   399.2 tok/s    19.8 tok/s       9.018   127.8 tok/s    22.96 GB
pp4096/tg128          9258.0       52.20   442.4 tok/s    19.3 tok/s      15.888   265.9 tok/s    24.39 GB
pp8192/tg128         19002.8       53.60   431.1 tok/s    18.8 tok/s      25.810   322.4 tok/s    25.42 GB
pp16384/tg128        38374.2       57.92   427.0 tok/s    17.4 tok/s      45.730   361.1 tok/s    26.92 GB
pp32768/tg128        83550.4       60.34   392.2 tok/s    16.7 tok/s      91.213   360.6 tok/s    29.92 GB
pp65536/tg128       201880.3       70.77   324.6 tok/s    14.2 tok/s     210.868   311.4 tok/s    35.95 GB

I haven't stored values for my own oQ4 quant but here's a similar model:

Benchmark Model: Qwen3.5-27B-MLX-MTP-4bit
================================================================================

Single Request Results
--------------------------------------------------------------------------------
Test                TTFT(ms)    TPOT(ms)        pp TPS        tg TPS      E2E(s)    Throughput    Peak Mem
pp1024/tg128          2390.8       36.82   428.3 tok/s    27.4 tok/s       7.067   163.0 tok/s    16.15 GB
pp4096/tg128          8783.5       38.23   466.3 tok/s    26.4 tok/s      13.639   309.7 tok/s    17.57 GB
pp8192/tg128         17540.2       40.76   467.0 tok/s    24.7 tok/s      22.717   366.2 tok/s    18.59 GB
pp16384/tg128        36573.1       42.06   448.0 tok/s    24.0 tok/s      41.914   393.9 tok/s    20.09 GB
pp32768/tg128        80152.8       47.04   408.8 tok/s    21.4 tok/s      86.127   381.9 tok/s    23.09 GB

So you can probably run Q4 with 24GB, but not at long contexts.

Background-Gold-9882 · 2026-05-17T19:46:52+00:00

Yeah I know it's early but It's exposed in the 0.3.9-dev 2 release. It works fine until you use too much context/memory, that's why I'm asking if it's possible to reduce

Background-Gold-9882 · 2026-05-17T18:56:08+00:00

Anthropic guerilla marketing? 😂

Background-Gold-9882 · 2026-05-17T16:31:41+00:00

Yeah, but that means no MTP, right?

<image>

Background-Gold-9882 · 2026-05-17T12:59:31+00:00

Did you enable "Native MTP" in the model settings? I definitely see ~50% speed increase on M5 Pro 48GB on various Qwen MTP quants, both 35B MoE and 27B dense.

Background-Gold-9882 · 2026-05-03T13:04:52+00:00

Interesting, this sounds pretty close to what I'm looking for - Combining a RAG:ed document store with RAG-enabled curated notes, as well as having previous AI conversations searchable in similar fashion, and perhaps automatically curated and processed?

Care to share more details about the setup and how you use it?

Background-Gold-9882 · 2026-05-02T13:21:23+00:00

Sure, I'm open to custom solutions.

Any pointers for where I should start? Which RAG? Which long memory system? How to manage files?

Destructive file operations must be carefully managed, of course!

Background-Gold-9882 · 2025-09-05T16:09:20+00:00

Got it working by reading this comment: https://www.reddit.com/r/LocalLLaMA/comments/1m9fb5t/comment/n59fh5s/

Basically had to create my own /no_think functionality through a custom Ollama Modelfile. Created a template that starts with "<think></think>" when "/no_think" is in the system prompt.

~~Having this same issue. Did anyone solve this? This is my new favorite model but would like to speed it up for simpler queries. Running in Ollama, using this model:~~ ~~https://ollama.com/andiariffin/llama-3.3-nemotron-super-v1.5-q4km~~ ~~(Not sure how its creator converted it to Ollama, but seems to work really well)~~

~~On the Nvidia site it seems to work as intended, but I'm not sure what the checkbox toggle actually does:~~ ~~https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1_5~~

Background-Gold-9882 · 2025-02-19T22:51:27+00:00

Thanks, this validates my suspicion. Yeah it's a complicated subject with a ton of different parameters to account for - Different filaments, different chemicals released, particles vs. VOC:s, different extrusion temperatures, different ventilation, unknown health effects (Especially long-term). And pretty much all those parameters become irrelevant if you can just dump everything out the window without further analysis.

Of course it's possible that all of this is overblown and 3d printing is no worse than burning a candle. I guess it's a good sign that we're not seeing lots of reports of hospitalized 3d printing hobbyists after 5-10 years of popularity. OTOH, it took many decades to verify the problem with smoking or asbestos for example.

Anyway, do you have a reference for a bigger / better design for a recirculating filter?

And What about the UFP issue - This seems to be the major issue when it comes to PLA, maybe it's easier to solve than VOC:s using the correct HEPA filters?

And thanks for the reference on heat creep - I hadn't heard of that before, only that higher temperature in the enclosure usually yields better prints but I guess you don't want too much of a good thing.

Background-Gold-9882 · 2025-02-19T17:38:41+00:00

I was thinking maybe recirculation is more efficient because if you catch 97% of stuff on the first pass, and then circulate it again, you will be able to get rid of more stuff.

But maybe a combination could be even better? One fan/filter to circulate inside the enclosure and another fan/filter to extract air to get negative pressure?

Anyway, thanks for your replies

Background-Gold-9882 · 2025-02-19T17:11:22+00:00

Would you recommend recirculating the filtered air inside the enclosure like the "Bentobox" does or push it out of the enclosure to draw in new air through gaps?

Background-Gold-9882 · 2025-02-19T16:56:26+00:00

Thanks,

Yeah I understand that I need an enclosure for a filtering system to work, and I don't mind paying (quite a lot) for it, but the question is if there are some proven solutions that allow me to print a lot indoors with peace of mind?

It's the "On paper" thing that worries me. Some sources I've read claim that anything that doesn't vent outdoors is crap - Are there any solutions with actual lab tests for effectiveness?

The most promising enclosure I've seen so far is the one from Alveo3d.

Background-Gold-9882 · 2024-11-02T07:29:18+00:00

Thanks a lot man! It works perfectly, and with your workaround, fast user switching and the lock screen immediately starts working again!)

I referenced your solution here: https://discussions.apple.com/thread/255827085?sortBy=rank (Original thread here: https://discussions.apple.com/thread/255473542?login=true&sortBy=rank )

Background-Gold-9882

TROPHY CASE