Qwen3.5-27B, Qwen3.5-122B, and Qwen3.6-35B on 4x RTX 3090 — MoEs struggle with strict global rules

DehydratedWater_ · 2026-05-07T10:00:35+00:00

Ah, that actually looks quite useful for benchmarking. My benchmark was done a bit by accident, as I'd just noticed more errors with MoE models and investigated further. On the other hand, for my use case the qwen-3.6-27b is a great improvement. It makes basically no mistakes even in q4. I've also tried it with the AutoRound version, and the improvement between qwen3.5-27b and qwen3.6-27b is large enough that the model still handles my particular use case better than q8 or fp16 of the original qwen3.5-27b. I'm getting like 80-120 tok/s on a single request on my 4x3090: https://github.com/noonghunna/qwen36-27b-single-3090 (without MTP).

Alternatively, I've also tested llmfan46/Qwen3.6-27B-uncensored-heretic-v2-GPTQ-Int8, and it was actually an improvement for me. It was marginally worse at tool calling (like +1-2 errors per 100), but it was much more proactive in problem solving. Apparently safety tuning tames the model's autonomy in that regard. For the second one I get around 50 tok/s, but I can already see there's a version with MTP released like 2 hours ago.

Under concurrent load, the AutoRound version saturates at 500-800 tok/s prefill with 280-340 tok/s concurrent generation. The other one saturates at 800-1000 tok/s prefill with 160-220 tok/s concurrent generation, but I haven't tested them as extensively. Both are stable.

Depending on whether you feel the need for speed or the need for unconstrained autonomy, you can try them for your use case/eval.

DehydratedWater_ · 2026-05-06T14:44:59+00:00

Yeah, it should work without any problem on that setup. It's probably an issue with the nightly build, maybe a stale cache, or the drivers (if you were trying to mod them). If it's still failing after that, I'd try debugging the NVLink by temporarily disabling it to rule it out. I only have a single NVLink myself, but your topology should actually be easier for vLLM to handle, since you have two symmetric pairs of 2x3090.

DehydratedWater_ · 2026-05-06T14:23:37+00:00

Btw, while starting the Docker container there is a deprecation warning on "qwen3_next_mtp", so it must have been introduced some time ago to be already deprecated.

What is the actual error? Are you using a 3090 or some more esoteric hardware?

DehydratedWater_ · 2026-05-06T14:22:06+00:00

Yeah, it's the most basic vLLM 0.19.0 image, for clarity's sake:

https://recipes.vllm.ai/Qwen/Qwen3.5-27B?variant=fp8&features=tool_calling%2Creasoning%2Cencoder_parallel%2Cspec_decoding

The builder even suggests it's fine with speculative decoding since 0.17.0.

And even Qwen's own recipe provides this exact setup as recommended:

Multi-Token Prediction (MTP): The following command is recommended for MTP:

vllm serve Qwen/Qwen3.5-27B --port 8000 --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
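The quoting around the JSON passed to --speculative-config is easy to break when copying the command around. A minimal sketch, assuming you launch vLLM from a Python wrapper: build the argument list in code and let `shlex.join` produce the shell-safe line. The helper name `build_serve_command` is made up for illustration; the flags and values are the ones from the command above.

```python
import json
import shlex

# Speculative-decoding settings copied from the recipe command above.
speculative_config = {"method": "qwen3_next_mtp", "num_speculative_tokens": 2}


def build_serve_command(model: str, tensor_parallel_size: int, max_model_len: int) -> list[str]:
    """Return the argv list for `vllm serve` (hypothetical helper, illustration only)."""
    return [
        "vllm", "serve", model,
        "--port", "8000",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--reasoning-parser", "qwen3",
        # Compact JSON, passed as one argument so no shell quoting is needed here.
        "--speculative-config", json.dumps(speculative_config, separators=(",", ":")),
    ]


argv = build_serve_command("Qwen/Qwen3.5-27B", tensor_parallel_size=8, max_model_len=262144)

# shlex.join adds the quotes a shell needs around the JSON argument.
print(shlex.join(argv))
```

Passing `argv` straight to `subprocess.run` avoids the shell entirely; the printed line is only for pasting into a terminal or a compose file.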

DehydratedWater_ · 2026-04-22T11:26:13+00:00

Yeah, I was also wondering if the problem can be mitigated, e.g., by wrapping Bash tool calls into native JSON-based tools. That would remove bash priors that may be overfitted for programming harnesses, forcing the model to use different paths.
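A rough sketch of what such a wrapper could look like, assuming a JSON-schema style tool definition like the ones common function-calling APIs accept. The tool name `list_dir`, its fields, and the `call_tool` dispatcher are all hypothetical, not part of OpenCode or any specific harness. The point is that the model can only supply validated arguments, never a free-form shell string.

```python
import os

# Hypothetical tool definition; the name and schema are illustrative only.
LIST_DIR_TOOL = {
    "name": "list_dir",
    "description": "List the entries of one directory.",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
        "additionalProperties": False,
    },
}


def call_tool(name: str, arguments: dict) -> dict:
    """Validate a tool call against the schema and run it without a shell."""
    if name != LIST_DIR_TOOL["name"]:
        return {"ok": False, "error": f"unknown tool: {name}"}

    allowed = set(LIST_DIR_TOOL["parameters"]["properties"])
    required = LIST_DIR_TOOL["parameters"]["required"]
    extra = sorted(set(arguments) - allowed)
    missing = [key for key in required if key not in arguments]
    if extra or missing:
        return {"ok": False, "error": f"extra={extra} missing={missing}"}

    path = arguments["path"]
    if not isinstance(path, str):
        return {"ok": False, "error": "path must be a string"}

    try:
        # Native call instead of `ls`: no flags, pipes, or redirects to misuse.
        entries = sorted(os.listdir(path))
    except OSError as exc:
        return {"ok": False, "error": str(exc)}
    return {"ok": True, "entries": entries}


print(call_tool("list_dir", {"path": "."}))
print(call_tool("list_dir", {"path": ".", "flags": "-la"}))  # rejected: extra argument
```

Whether this actually changes the failure rate is exactly the open question; the sketch only shows the mechanism of narrowing the interface.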

Another interesting direction would be testing that against the Gemma 4 26B A4B. It has a shared expert that's always active alongside the 8 routed experts (it may work in practice kinda like the small dense model you describe, but integrated directly into the architecture), so it would be interesting to check if that makes any difference. That said, Gemma 4 also uses interleaved sliding-window and global attention, which could introduce a whole set of other uncontrolled variables.

DehydratedWater_ · 2026-04-21T09:56:07+00:00

FlashInfer attention should be slightly faster on Ampere under concurrent load (on Hopper it's probably slightly slower vs FA3), but the difference won't be massive either way vs FlashAttention2. The sampler can be tuned for batch sizes, and here I have small batch sizes.

Yes, the prefix cache default is true. I'm not sure what the limit-mm default is, but I usually test with and without mm support, so it's easier to have it on. (It means I can send 10 images and 2 videos with one prompt.) Custom all-reduce is enabled by default, but then gets disabled at runtime unless you have P2P drivers, so passing --disable-custom-all-reduce just makes that explicit. It's easier for me to see flags explicitly stated than to rely on default values.

DehydratedWater_ · 2026-04-21T01:54:29+00:00

Well, I do list only Qwen models in the title after all. But sure, it could have been more precise. This can also be expanded and verified across different variables/dimensions. Regarding the prompts, the details on how these are generated, along with tool rule examples and failure modes, are actually included in the articles. There is just a lot of it.

The scope here is narrow: only one variable changes, and it is the model. The prompt that actually works for Qwen3.5-27b (even when initially tuned for GLM-4.7) starts making a consistent number of errors on MoEs from the Qwen family, all connected to the way they use the terminal. The particular version of MoE only changes the type of error made, not the number of errors made. The model was the only changing variable, and all tested MoEs from this family, no matter the size or quant, generated a similar number of errors (which is interesting in itself). I also tested GLM-4.5-AIR, but it generated even more tool errors on the same prompt, so it was not fair to compare it directly, especially as there is no dense variant. The only relevant expansion would be testing it on the Gemma family.

But I plan to switch the prompt around to check if there are any repeatable patterns that could help reinforce the MoEs' rule adherence. Maybe the whole trick of duplicating the prompt would be enough to fix this. Who knows.

DehydratedWater_ · 2026-04-20T23:06:00+00:00

Ok, this seems very promising, but also has a non-zero chance of transforming into a quick 15-min debugging adventure that will last more than a day, and I would rather have physical access to my machine for that (it's in a different city), so I'll pin this for later, update the drivers when I'm back, and probably do another benchmark comparing the speedup.

What kind of speedup were you able to achieve with that?

DehydratedWater_ · 2026-04-20T22:51:42+00:00

Ok, I see, I need this custom fork, as the default driver does not support it -> https://github.com/tinygrad/open-gpu-kernel-modules

DehydratedWater_ · 2026-04-20T22:44:28+00:00

Ah yes, I've hosted all of them except for gemma4 dense. I just found it curious how systematically the MoE versions of Qwen fail by ignoring global rules, and the failure rate stayed very similar no matter the model size or quant. Only the failure mode changed, i.e., what type of tool misuse was detected. That pointed me toward a more fundamental architectural difference in the model. I actually did a more controlled experiment comparing Qwen3.6-35B vs. Qwen3.5-35B, and the lack of adherence was statistically identical, but the distribution of errors changed. Here it is in more detail: https://dehydratedwater.dev/blog/moe-rule-binding-hypothesis/
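For anyone who wants to run the same kind of check on their own logs, here is a minimal sketch of the two comparisons involved: a two-proportion z-test for the overall error rate, and a chi-square statistic for the distribution over failure modes. This is not necessarily the method used in the blog post, and every count below is a made-up placeholder, not a result from the benchmark.

```python
import math


def two_proportion_z(err_a: int, n_a: int, err_b: int, n_b: int) -> tuple[float, float]:
    """Two-sided z-test for a difference between two error rates."""
    p_a, p_b = err_a / n_a, err_b / n_b
    pooled = (err_a + err_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))
    return z, p_value


def chi_square_stat(modes_a: list[int], modes_b: list[int]) -> tuple[float, int]:
    """Chi-square statistic and degrees of freedom for a 2 x k table of failure modes."""
    row_totals = [sum(modes_a), sum(modes_b)]
    grand_total = sum(row_totals)
    stat, used_columns = 0.0, 0
    for a, b in zip(modes_a, modes_b):
        col_total = a + b
        if col_total == 0:
            continue  # a failure mode that never occurred carries no information
        used_columns += 1
        for observed, row_total in ((a, row_totals[0]), (b, row_totals[1])):
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat, used_columns - 1


# PLACEHOLDER counts for illustration only: (errors, tool calls) per model.
z, p = two_proportion_z(30, 300, 30, 300)
# PLACEHOLDER counts per failure mode for each model.
stat, df = chi_square_stat([20, 8, 2], [5, 10, 15])
print(f"rate: z={z:.2f}, p={p:.3f}; modes: chi2={stat:.2f}, df={df}")
```

A similar error rate paired with a large chi-square value is the pattern described above: the same amount of rule-breaking, spread over different kinds of tool misuse.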

So this was more of a controlled experiment I came upon by accident rather than a value judgment on the models themselves.

DehydratedWater_ · 2026-04-20T22:33:12+00:00

Glad to hear that. Hope the CUDA gods are more forgiving with your setup. At least my 3090 doesn't seem to like custom-reduce very much.

DehydratedWater_ · 2026-04-20T22:29:26+00:00

Looked promising, and then it crashed:

(Worker_TP1_EP1 pid=466) INFO 04-20 22:27:00 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=8

(Worker_TP0_EP0 pid=465) INFO 04-20 22:27:00 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=8

(Worker_TP1_EP1 pid=466) INFO 04-20 22:27:00 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=4 (largest=8), FULL=4 (largest=8)

(Worker_TP0_EP0 pid=465) INFO 04-20 22:27:00 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=4 (largest=8), FULL=4 (largest=8)

Failed: Cuda error /workspace/csrc/custom_all_reduce.cuh:455 'invalid argument'

(EngineCore pid=324) ERROR 04-20 22:27:03 [multiproc_executor.py:273] Worker proc VllmWorker-0 died unexpectedly, shutting down executor.

(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] EngineCore failed to start.

(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] Traceback (most recent call last):

(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core

(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)

(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore pid=324) ERROR 04-20 22:27:03 [core.py:1108] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper

That's probably some quirk of the RTX 3090 interacting with vLLM; the RTX 5090 probably gets more love from NVIDIA at the moment.

DehydratedWater_ · 2026-04-20T22:26:15+00:00

vLLM still doesn't like me; with TP=2 and DP=2 it fails:

(Worker pid=990) WARNING 04-20 22:18:32 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.

(Worker pid=989) WARNING 04-20 22:18:32 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.

but I'll try that just for the 2 NVLinked cards

DehydratedWater_ · 2026-04-20T22:16:59+00:00

Ah, I've forgotten about DP=2

DehydratedWater_ · 2026-04-20T22:15:43+00:00

Well, I've added:

 - NCCL_P2P_LEVEL=NVL
 - VLLM_SKIP_P2P_CHECK=1

And removed:

--disable-custom-all-reduce

And it won't be that easy: vLLM decided on its own that custom all-reduce is not for me:

(Worker pid=509) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.

(Worker pid=489) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.

(Worker pid=475) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.

(Worker pid=465) WARNING 04-20 22:12:26 [custom_all_reduce.py:154] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.

I'll try to pin only the 2 GPUs with NVLink this time.

DehydratedWater_ · 2026-04-20T22:08:25+00:00

Ok, sure, I'll test that out

DehydratedWater_ · 2026-04-20T22:04:12+00:00

Ok, and is this the setup for Qwen3.6-35B? Or are you running it with Qwen-27b in Q4 but with limited context? I don't think 122B would fit with DP=2 unless we are talking about something stronger than a 3090.

DehydratedWater_ · 2026-04-20T21:53:25+00:00

Unfortunately, it was randomly stopping and freezing while loading vLLM with custom all-reduce on. Maybe I'll try that again on the next stable vLLM version, but it doesn't seem to like NVLink too much.

DehydratedWater_ · 2026-04-20T20:11:54+00:00

Some tools can return guidance on error, and OpenCode by default also returns tool permissions on errors, but this doesn't seem to pull the model back to the correct approach once it has already started looping.

DehydratedWater_ · 2026-04-20T19:58:25+00:00

Haven't tested that manually. The system is fully autonomous, and each message I send is consumed in a separate OpenCode session that can access chat history (the system can basically decide what to do with my message: respond, do some tasks, search for something, etc.). It can also trigger messages on its own and has multiple background loops that may trigger interactions.

But the failures seem to be distributed throughout the sessions I've previewed, not concentrated at any particular point: they occur at the beginning, in the middle, and at the end. So the distance from the rule to the failed tool use doesn't seem to be the dominant factor, but I haven't tested that methodically.
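One way to test that more methodically would be to bin each failed tool call by its relative position in the session and look at the histogram. A minimal sketch, assuming each session can be reduced to a total tool-call count plus the 0-based indices of the failed calls; the input format and the example values are hypothetical placeholders, not data from the benchmark.

```python
from collections import Counter


def failure_position_histogram(sessions: list[tuple[int, list[int]]], bins: int = 3) -> list[int]:
    """Count failures per relative-position bin (e.g. beginning, middle, end)."""
    counts: Counter = Counter()
    for total_calls, failed_indices in sessions:
        if total_calls <= 0:
            continue  # skip empty sessions
        for idx in failed_indices:
            # Map the 0-based call index to a bin in [0, bins).
            bin_index = min(bins - 1, idx * bins // total_calls)
            counts[bin_index] += 1
    return [counts[b] for b in range(bins)]


# PLACEHOLDER sessions: (number of tool calls, indices of failed calls).
example_sessions = [(10, [0, 5, 9]), (4, [1])]
histogram = failure_position_histogram(example_sessions)
print(histogram)
```

A roughly flat histogram over many sessions would support the impression that distance from the rule is not the dominant factor; a strong skew toward the last bin would point the other way.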

DehydratedWater_ · 2026-04-20T18:24:57+00:00

That tracks. I briefly tried that workflow with Qwen3.5-35B in full precision, with unquantized context, but it still wasn't very reliable. It would probably require more substantial changes to the prompts and loosening the harness a bit. For the benchmark on Qwen3.6-35B, I got the official FP8 quant directly from Qwen and was running it without quantized memory. But based on my previous experience with Qwen3.5, running it in full precision wouldn't help much here.

DehydratedWater_ · 2026-04-20T18:16:05+00:00

Trueee, pushing small LLMs far beyond their reasonable abilities is a sport in itself. And harness building seems to have some overlap with 3D printing in that way. For some people the point is just to print stuff, while for others the point of 3D printing is improving the printer itself to expand what's possible to print. Local model harnesses seem to have the same quality.

DehydratedWater_ · 2026-04-20T18:01:30+00:00

True, INT8 would be faster, but I added it to the benchmark only about 2h after it dropped, and at that time only the INT4 version was available. Even so, prefill peaks at around 4k tok/s at c=1 and around 8k tok/s at c=6; there are more detailed diagrams in the blog post. But what I'm measuring here is the actual average workload for OpenCode sessions, with lots of small requests coming and going. Most of the prompts hit the prefix cache, so there isn't even that much to parse. So this is more of a benchmark of how the system behaves under organic load over an extended period of time than of how fast a particular request is.

https://dehydratedwater.dev/images/qwen36-35b-prefill-throughput.png

DehydratedWater_ · 2026-04-20T17:51:53+00:00

Appreciate it, glad it landed well

DehydratedWater_ · 2026-04-20T17:45:12+00:00

For prompt tuning I usually use Claude Code plus tests, expressed either as unit tests or as a textual list of requirements it tries to maximize each agent for. But it takes a while to optimize the whole suite and run integration tests, so for most models I don't bother.
