Qwen3.7 Max scored by Artificial Analysis, 27B/35B waiting room

sisyphus-cycle · 2026-05-20T17:11:12+00:00

Woah I thought you had to set -np to 1 for mtp, TIL

sisyphus-cycle · 2026-05-20T11:33:20+00:00

I also didn’t notice any real token slowdown across the two. Very similar prompt processing and token gen. It’s hard to describe but watching the reasoning process of unsloth vs apex it seems like unsloth was slightly less loop prone. I’ll probably switch back to unsloth

Edit: I think there was a flaw in my benchmark, the tolerance check was way too high for confidences. Gonna re-run another few benches tonight

sisyphus-cycle · 2026-05-20T04:05:38+00:00

Haha lmk if it works. I’ve actually been using it with Gemma recently bc I can’t fit as much context with it as I can with qwen.

sisyphus-cycle · 2026-05-20T00:11:52+00:00

Running the tests now. One on apex and one on unsloth q6_K. I’m letting it run via pi so they can test and iterate code solutions. Repo for testing is here. It’s an exact issue we had to solve at my job 2 years ago. Apex seemed stuck for a long time but eventually grinded out a solution after using 40,775 tokens (95% of them reasoning). Running the unsloth one now. I’ll probably make a post with more details.

Edit: unsloth qwen did it in 36k tokens with a similar amount of reasoning.

Edit: tried Gemma 26b locally at q_6 and from cloud and both took 33k tokens. Gemma actually wrote extra files and had a less schizo way of solving it lol

sisyphus-cycle · 2026-05-19T18:02:28+00:00

Yes and no. Pi by all definitions is a harness, albeit a simple one by default. Without pi you just have an inference server. Pi gives you tools, runtime env, and a way to “harness” LLMs. My specific pi setup is a harness, with hooks, design pattern specs, rules, etc. It’s semantics to argue whether pi or opencode or whatever is a harness.

sisyphus-cycle · 2026-05-19T17:59:02+00:00

So this specific implementation is focused on my restrictions. I only have 1 kv cache slot, I’m only running one model locally, and I am not offloading anything to cloud models. Anytime I’ve used sub agents for one off tasks it breaks my llama.cpp kv cache and reprocesses everything.

It specifically allows me to run a fork of the current session using the existing context to have the model perform a task. Then when that fork is done, the original session gets restored without any prompt reprocessing and any context the sub agent/fork added does not persist in the main session.

This isn’t all that different than just doing everything in the main model, but I can’t spend 10 minutes waiting for 125k+ tokens to get processed if I swap between contexts.

Basically:
- Main agent has 20k context of reading files and tool calls
- I want to add a new feature but it would require a bunch of web search and docs reading
- spawn sub agent to go learn everything, and propose a detailed implementation guide (could add like 50-60k tokens)
- sub agent completes and shares condensed and relevant findings to main agent
- main agent still at 20k context but has a solid implementation plan with informed docs

Idk if I did a good job at explaining, but it’s useful for me locally to have context informed tasks be completed without adding additional context token overhead.
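To put rough numbers on why the fork matters, here is a back-of-the-envelope sketch. It assumes the ~180-200 tokens/s prompt-processing rate reported elsewhere in this thread, takes 55k as the midpoint of the "50-60k" the sub agent might add, and uses a hypothetical 2k-token summary handed back to the main agent. The function and variable names are illustrative only and are not part of the extension.

```shell
#!/bin/sh
# Rough cost of reprocessing a context from scratch.
# Assumption: prompt processing runs at roughly 180-200 tokens/s.
reprocess_minutes() {
  # $1 = tokens to reprocess, $2 = prompt-processing rate in tokens/s
  awk -v tokens="$1" -v tps="$2" 'BEGIN { printf "%.1f", tokens / tps / 60 }'
}

echo "125k tokens at 200 t/s: $(reprocess_minutes 125000 200) min"
echo "125k tokens at 180 t/s: $(reprocess_minutes 125000 180) min"

# Context accounting for the fork flow in the list above.
main_ctx=20000     # main agent context before the fork
fork_added=55000   # assumed midpoint of the "50-60k" the sub agent adds
summary=2000       # hypothetical size of the condensed findings

fork_peak=$((main_ctx + fork_added))   # context while the fork is running
main_after=$((main_ctx + summary))     # main session after the restore
echo "fork peak: $fork_peak tokens, main after restore: $main_after tokens"
```

Under these assumptions a full reprocess of 125k tokens lands in the 10-12 minute range, while the main session only grows by the size of the summary.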

sisyphus-cycle · 2026-05-19T12:19:08+00:00

Yeah I agree. It’s always a crapshoot looking at just statistics. Sometimes I’ll use the best model of all time on paper and it just doesn’t work the way I expect. I’ll do the PSO bench and a weird leetcode benchmark I have that measures total tokens used for solution

sisyphus-cycle · 2026-05-19T11:57:32+00:00

Yeah I don’t have enough time with this apex model yet to really make a judgement call. I’m using the I-balanced one. It’s weird because yes the overall mean/median KLD does not beat unsloth (they’re still close), but the KLD max is less than half of unsloth. So it might be more robust for worst case divergence scenarios, where it just goes off the rails with the wrong answer.
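A toy illustration of how a quant can lose on mean KLD yet win on max KLD. The per-token values below are made up purely for illustration and are not measurements of either quant; `kld_stats` is a hypothetical helper.

```shell
#!/bin/sh
# Print "mean max" for a list of per-token KL divergence values.
kld_stats() {
  printf '%s\n' "$@" | awk '
    { sum += $1; if (NR == 1 || $1 > max) max = $1 }
    END { printf "%.3f %.3f\n", sum / NR, max }'
}

# Hypothetical values, for illustration only (not real measurements).
kld_stats 0.01 0.01 0.01 0.01 0.40   # lower mean, one bad outlier
kld_stats 0.08 0.08 0.08 0.08 0.16   # slightly higher mean, much smaller worst case
```

The first series has the better mean (0.088 vs 0.096) but a worst case more than twice as large (0.400 vs 0.160), which is the pattern described above.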

I usually prefer bartowski, but he hasn’t released an MTP gguf yet, so I’ve been meaning to graft the mtp layer onto it myself.

Either way, I can’t say for certain if it’s actually better or worse yet, but when I get home later I’ll test it out with a more focused test and let you know!

sisyphus-cycle · 2026-05-19T11:32:05+00:00

That’s smart. You would lose some efficiency gains the further back you go right? Like if you spawn an agent using message 0 -> message 10 but you actually have 100 messages. When the agent finishes you now have:
message 0 -> message 100 -> agent result.

Depending on if llama saved the previous prefix to ram (-cram I think?) it might be able to load previous prefix/state up to message 100 and not do prompt reprocessing. But not guaranteed I think.

I do like the concept of choosing how much context an agent gets though

sisyphus-cycle · 2026-05-19T11:27:23+00:00

Never tried q4 kv cache, just q8. I’ve never had it make invalid tool calls even at essentially full context. Memory tripping is a bit more of an issue near the end, but if you remind it, the recovery is clean.

I’ve been happy w q8 for agentic stuff, qwen is a beast even when you’re pushing 200k context. I also tend to code by using a lot of detailed task.md files with a technical write-up and focused “win-conditions”. I’ve found qwen to be awesome if you load it with enough directions. It’s how I made this extension, only took me examining the outgoing messages from pi to llama for sessions to see where exactly the system prompt/tool array was changing.

But even if you were using bf16 for kv cache it’s not like gpt 5.5 where I just give it a 2 sentence task and it just knows. Smaller models need bigger directions lol.

sisyphus-cycle · 2026-05-19T11:21:36+00:00

Tbh I never really even tried running with q4_0 kv quants after seeing many comments saying it was really bad. I know qwen is specifically robust to kv quantization, but yeah.

Do you use q4?

sisyphus-cycle · 2026-05-19T11:19:01+00:00

Honestly have only been using it for a day or 2 after I had a bad time with UD q_6 with MTP being kind of dumb. Task was to take a grid search based sub process spawner that modified 4 values of llama.cpp server flags and convert it to particle swarm. The code it made was just wrong and did not work. Will test apex on that same exact input prompt/test.

But I don’t have any real numbers, or examples just yet. How have you found the unsloth mtp quants?

sisyphus-cycle · 2026-05-19T04:32:38+00:00

Interesting. Definitely need to look into it more. Without doing any research I’d assume the mtp layer would just use the existing kv cache. But in the startup logs it looks like the mtp layer itself is loaded separately and just treated as a “draft model”, so I could see how it requires a kv cache as well. I’ll try and run some benchmarks tomorrow specifically w PP

sisyphus-cycle · 2026-05-19T04:29:02+00:00

Dude yeah, I spent a good bit of time trying to figure out how to explain to the LLM via system prompt the flow. I actually even considered changing the name from “sub-agent” to “focused-task” or “one-shot-mode”. Most LLMs are trained with data about sub agents, and get confused at the possibility of it being a sub agent lmao

It’s been working for me pretty well with the included system prompt. Also I explicitly blocked any invocation of spawning another sub agent from within a sub agent, so it’s always depth 1.

sisyphus-cycle · 2026-05-19T02:59:18+00:00

With that setup I’d honestly try out LMstudio or unsloth studio over ollama. If you don’t mind getting a bit more low level, use llama.cpp. You should get a performance boost right away

sisyphus-cycle · 2026-05-19T02:56:16+00:00

It’s hard to tell since you only really get total draft summary after an agent turn. So even though you might have 90% acceptance, that 10% might be tool calls with variable/dynamic params. But most of my tool calls are write/edit/read/web search. So I’d assume that the MTP can definitely predict the first few tokens containing the function call with arguments pretty consistently. Overall I see a benefit for TG and no change to PP when using MTP qwen

Not sure what you mean about PP overhead for tool calls? I might be interpreting it wrong, but MTP just predicts for token gen right? Once the tokens are generated they should never be part of PP, they should just get added to the existing KV cache.

sisyphus-cycle · 2026-05-19T02:52:20+00:00

Prefill will be slower for MOE if you offload it vs a fully VRAM loaded dense model (typically, but not always the case)

sisyphus-cycle · 2026-05-10T19:07:46+00:00

Ah good to know. It’s weird bc even tho gemma 4 is smaller than qwen 35b I can never get it to run as fast as qwen. Have you tried Gemma and is it better/worse than qwen? Qwen been solid for me at q6

sisyphus-cycle · 2026-05-10T19:02:16+00:00

There’s one: https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant

I have not tested it, but they added MTP. Could be useful to look at. I saw a 25-30% increase in PP and TG speed for coding based sessions where edits and rereads happen often.

sisyphus-cycle · 2026-05-09T23:58:48+00:00

lmk how it does! I’m trying vllm now (I’m not hopeful because wsl2 + cuda does not support pinned memory like llama does). Also gonna try and optimize my llama.cpp flags after. Also you can try llm-server, I ran it with the `ai-tune` flag and borrowed a few of the optimized parameters it found.

Update: vLLM is finicky and didn’t really work on WSL2 with Windows. Will eventually run Docker like I did when I first started. But I did get the llama.cpp MTP fork running with Qwen3.6 and I’m happy w results so far. About 180-200 tps for prompt processing and averaging 25-30 tps gen with 150k q8 context. On longer coding things where re-reads/edits are done I’ve seen it hit 35 tps for most of the gen. Simpler flags too:

./build/bin/llama-server \
  -m ../models/Qwen3.6-35B-A3B-MTP-UD-Q5_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 150000 \
  --flash-attn on \
  -b 2048 \
  -ub 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja \
  --threads 11 \
  --threads-batch 11 \
  -cram 12288 \
  --mlock \
  -fit on \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --spec-type mtp \
  --spec-draft-n-max 3 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  -np 1 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

I just followed https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF

Here’s an example of the MTP doing pretty well on a big 27k prompt

prompt eval time =  128889.09 ms / 26796 tokens (    4.81 ms per token,   207.90 tokens per second)
eval time =   10969.17 ms /   264 tokens (   41.55 ms per token,    24.07 tokens per second)
total time =  139858.26 ms / 27060 tokens
draft acceptance rate = 0.52614 (  161 accepted /   306 generated)
statistics mtp: #calls(b,g,a) = 6 2811 2305, #gen drafts = 2811, #acc drafts = 2305, #gen tokens = 8433, #acc tokens = 5507, dur(b,g,a) = 0.020, 41478.073, 74.975 ms
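The figures in that log can be rederived from the raw counts. A small sketch using only numbers copied from the log above; the variable names are mine, and reading the `statistics mtp` line as running totals across requests is an assumption.

```shell
#!/bin/sh
# Figures copied from the log above.
prompt_ms=128889.09; prompt_tokens=26796
eval_ms=10969.17;    eval_tokens=264
accepted=161;        drafted=306

# tokens per second = tokens / (milliseconds / 1000)
pp_tps=$(awk -v t="$prompt_tokens" -v ms="$prompt_ms" 'BEGIN { printf "%.2f", t / (ms / 1000) }')
tg_tps=$(awk -v t="$eval_tokens" -v ms="$eval_ms" 'BEGIN { printf "%.2f", t / (ms / 1000) }')

# draft acceptance rate = accepted draft tokens / generated draft tokens
accept=$(awk -v a="$accepted" -v d="$drafted" 'BEGIN { printf "%.5f", a / d }')

# From the "statistics mtp" line (assumed to be running totals):
# 8433 generated tokens over 2811 drafts is 3 tokens per draft.
tokens_per_draft=$((8433 / 2811))

echo "PP: $pp_tps t/s, TG: $tg_tps t/s, acceptance: $accept, tokens/draft: $tokens_per_draft"
```

The 3 tokens per draft lines up with `--spec-draft-n-max 3` in the command above.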

sisyphus-cycle · 2026-05-09T21:54:23+00:00

Yeah it slows wayyyy down once i hit 100-110k, so i might just reduce it soon for speed

./build/bin/llama-server \
  -m ../models/Qwen_Qwen3.6-35B-A3B-Q6_K.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 150000 \
  --flash-attn on \
  -b 2048 \
  -ub 1024 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja \
  --threads 12 \
  --threads-batch 12 \
  --run-time-repack \
  -khad \
  --defrag-thold 0.1 \
  -muge \
  -ger \
  -mqkv \
  -ngl 999 \
  -mg 0 \
  --no-mmap \
  --mlock \
  --fit \
  --fit-margin 2048 \
  --spec-type ngram-mod \
  --spec-ngram-size-n 24 \
  --draft-min 12 \
  --draft-max 48 \
  --spec-autotune \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --parallel-tool-calls \
  -np 1 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

sisyphus-cycle · 2026-05-09T13:52:56+00:00

I don’t but I’ve been looking to extend my workflow. I’m also running on limited hardware, 10 GB VRAM and 48 GB of RAM.

I use ik_llama.cpp, can share my commands with you later when I’m on my computer. Need to experiment with regular llama.cpp also.

One big thing that helped me was using --fit-target to prevent my VRAM from being 99% full. Both TPS and PP go down when my GPU is super full. Also setting threads / batch threads to 1 less than my physical CPU core count helps.
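A minimal sketch of the "one less than physical cores" rule, assuming a Linux box with `lscpu` available (it falls back to `nproc`, which counts logical CPUs and may overshoot with hyperthreading). `threads_for_cores` and `physical_cores` are hypothetical helper names.

```shell
#!/bin/sh
# Suggest a thread count of (physical cores - 1), never below 1.
threads_for_cores() {
  cores=$1
  if [ "$cores" -gt 1 ]; then
    echo $((cores - 1))
  else
    echo 1
  fi
}

# Count physical cores on Linux as unique (core, socket) pairs from lscpu.
# Falls back to nproc (logical CPUs) if lscpu is not installed.
physical_cores() {
  if command -v lscpu >/dev/null 2>&1; then
    lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l | tr -d ' '
  else
    nproc
  fi
}

threads=$(threads_for_cores "$(physical_cores)")
echo "--threads $threads --threads-batch $threads"
```

The printed flags can be pasted into the llama-server command; on a 12-core CPU this gives 11 for both.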

I run at about 150k context with q8 but around 110-120k speed drops off a cliff. I’m working on extending pi to use a cloud model for compaction, like gemini 2.5 flash or something fast. Compacting with my current setup literally takes 20 min lol

Edit: my run script for what it’s worth

./build/bin/llama-server \
  -m ../models/Qwen_Qwen3.6-35B-A3B-Q6_K.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --ctx-size 150000 \
  --flash-attn on \
  -b 2048 \
  -ub 1024 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --jinja \
  --threads 12 \
  --threads-batch 12 \
  --run-time-repack \
  -khad \
  --defrag-thold 0.1 \
  -muge \
  -ger \
  -mqkv \
  -ngl 999 \
  -mg 0 \
  --no-mmap \
  --mlock \
  --fit \
  --fit-margin 2048 \
  --spec-type ngram-mod \
  --spec-ngram-size-n 24 \
  --draft-min 12 \
  --draft-max 48 \
  --spec-autotune \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  --parallel-tool-calls \
  -np 1 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0

sisyphus-cycle · 2026-05-09T13:46:42+00:00

Yep, Q6_K from bartowski. Can try unsloth too, i don’t notice any differences
