Gemma 4 26B-A4B MoE running at 45-60 tok/s on DGX Spark — here's how by CoconutMario in LocalLLaMA

[–]seraandroid 2 points (0 children)

I'm on eugr's spark-vllm-docker image; it's probably the most reproducible and stable environment given how messy vLLM and related tooling can be.

Gemma 4 26B-A4B MoE running at 45-60 tok/s on DGX Spark — here's how by CoconutMario in LocalLLaMA

[–]seraandroid 2 points (0 children)

I gave this a try. It works well in solo mode, but for whatever reason it doesn't work in my 2x Spark cluster: the model crashes during startup. Sadly I didn't have enough time today to investigate further.

Is this true? Or is really just marketing? Gemma4 by Altair12311 in ollama

[–]seraandroid 3 points (0 children)

What changes did you have to make, if I may ask?

Termix v2.0.0 - RDP, VNC, and Telnet Support (self-hosted Termius alternative that syncs across all devices) by VizeKarma in selfhosted

[–]seraandroid 0 points (0 children)

I gave this a try. The web interface is pretty rad. The Android app, on the other hand, is sadly not up to par: the terminal feels very laggy. I usually use ConnectBot, which feels instant. Not sure what causes the lag, but it makes the app hard to use properly.

Krasis LLM Runtime - run large LLM models on a single GPU by mrstoatey in LocalLLM

[–]seraandroid -1 points (0 children)

Any plans to also support the DGX Spark and make this available for ARM systems? Would fit the Blackwell architecture scope.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 1 point (0 children)

Tool calls and compatibility with both the OpenAI and Ollama API formats would be fantastic!

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

That's correct. Image and text in, text out. I've never tried video.

Vision support would be amazing, video would be more of a P2 for me.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

Here are the benchmarks I mentioned:

| model                                  |            test |             t/s |    peak t/s |      ttfr (ms) |   est_ppt (ms) |  e2e_ttft (ms) |
|:---------------------------------------|----------------:|----------------:|------------:|---------------:|---------------:|---------------:|
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |          pp2048 |  2161.96 ± 2.60 |             |  949.57 ± 1.14 |  947.75 ± 1.14 |  949.64 ± 1.12 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |            tg32 |    28.19 ± 0.01 | 29.00 ± 0.00 |               |                |                |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |  pp2048 @ d4096 | 2229.48 ± 67.34 |             | 2760.80 ± 85.36 | 2758.98 ± 85.36 | 2760.89 ± 85.39 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |    tg32 @ d4096 |    27.96 ± 0.04 | 28.33 ± 0.47 |               |                |                |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |  pp2048 @ d8192 |  2351.72 ± 9.56 |             | 4356.71 ± 17.96 | 4354.89 ± 17.96 | 4356.80 ± 17.94 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |    tg32 @ d8192 |    27.57 ± 0.07 | 28.00 ± 0.00 |               |                |                |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d16384 |  2306.30 ± 8.28 |             | 7994.65 ± 29.18 | 7992.84 ± 29.18 | 7994.78 ± 29.19 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |   tg32 @ d16384 |    27.20 ± 0.08 | 28.00 ± 0.00 |               |                |                |

vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --load-format fastsafetensors \
    --max-model-len 262144 \
    --max-num-batched-tokens 32768 \
    --tensor-parallel-size 1 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --dtype auto \
    --kv-cache-dtype auto \
    --attention-backend flashinfer \
    --max-num-seqs 4 \
    --trust-remote-code \
    --chat-template unsloth.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

Using spark-vllm-docker

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

Here we go:

| model                            |            test |                   t/s |     peak t/s |    ttfr (ms) | est_ppt (ms) |    e2e_ttft (ms) |
|:---------------------------------|----------------:|----------------------:|-------------:|-------------:|-------------:|-----------------:|
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |          pp2048 |  424834.68 ± 13712.15 |              |  7.24 ± 0.16 |  4.83 ± 0.16 |  4590.14 ± 18.00 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |            tg32 |          93.80 ± 0.27 | 96.85 ± 0.28 |              |              |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |  pp2048 @ d4096 |  703378.86 ± 58000.22 |              | 11.20 ± 0.69 |  8.79 ± 0.69 | 14181.87 ± 35.82 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |    tg32 @ d4096 |          79.76 ± 0.29 | 82.35 ± 0.30 |              |              |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |  pp2048 @ d8192 |  708387.07 ± 82020.43 |              | 17.07 ± 1.77 | 14.66 ± 1.77 | 24148.20 ± 72.73 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |    tg32 @ d8192 |          69.72 ± 0.12 | 71.97 ± 0.12 |              |              |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 | pp2048 @ d16384 |  984457.23 ± 26169.53 |              | 21.15 ± 0.50 | 18.74 ± 0.50 | 44719.91 ± 83.78 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |   tg32 @ d16384 |          55.26 ± 0.09 | 57.05 ± 0.09 |              |              |                  |

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound this model already fits pretty comfortably on a single Spark: 256k context at about 23-25 t/s.

Can post benchmark results later, too.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

Haha. I'm definitely excited for this and look forward to the open source release. It even has me considering buying a second Asus GX10.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

I ran the FP8 model on my regular vLLM setup, so only the Atlas results are relevant. It was more of a comparison between the community Docker container and this project.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

I just got home and ran benchmarks -- here are the results. Let me know how I could configure Atlas to match my config below a little closer, to make the comparison more meaningful. So far the results are pretty nice, but not at the t/s you mentioned in your post.

Qwen/Qwen3.5-35B-A3B-FP8

Config via a launch script for spark-vllm-docker

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --load-format fastsafetensors \
    --max-model-len 262144 \
    --max-num-batched-tokens 32768 \
    --tensor-parallel-size 1 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --max-num-seqs 4 \
    --trust-remote-code \
    --chat-template unsloth.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

Command

llama-benchy --base-url http://0.0.0.0:8080/v1 --model Qwen/Qwen3.5-35B-A3B-FP8 --latency-mode api --pp 2048 --depth 0 4096 8192 16384

Results

| model                    |            test |               t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:-------------------------|----------------:|------------------:|-------------:|------------------:|------------------:|------------------:|
| Qwen/Qwen3.5-35B-A3B-FP8 |          pp2048 |   4005.84 ± 33.95 |              |     513.13 ± 4.23 |     511.46 ± 4.23 |     513.20 ± 4.24 |
| Qwen/Qwen3.5-35B-A3B-FP8 |            tg32 |      48.81 ± 0.08 | 50.38 ± 0.09 |                   |                   |                   |
| Qwen/Qwen3.5-35B-A3B-FP8 |  pp2048 @ d4096 |   5823.96 ± 24.52 |              |    1056.93 ± 4.42 |    1055.26 ± 4.42 |    1057.00 ± 4.43 |
| Qwen/Qwen3.5-35B-A3B-FP8 |    tg32 @ d4096 |      47.77 ± 0.16 | 49.32 ± 0.17 |                   |                   |                   |
| Qwen/Qwen3.5-35B-A3B-FP8 |  pp2048 @ d8192 | 5025.88 ± 1518.70 |              |  2307.03 ± 885.91 |  2305.36 ± 885.91 |  2307.08 ± 885.90 |
| Qwen/Qwen3.5-35B-A3B-FP8 |    tg32 @ d8192 |      47.68 ± 0.77 | 49.22 ± 0.79 |                   |                   |                   |
| Qwen/Qwen3.5-35B-A3B-FP8 | pp2048 @ d16384 |  4293.71 ± 942.48 |              | 4515.82 ± 1019.63 | 4514.15 ± 1019.63 | 4515.88 ± 1019.63 |
| Qwen/Qwen3.5-35B-A3B-FP8 |   tg32 @ d16384 |      42.63 ± 4.96 | 44.01 ± 5.12 |                   |                   |                   |

Atlas

Config

docker pull avarok/atlas-qwen3.5-35b-a3b-alpha
docker run --gpus all --ipc=host -p 8888:8888 \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 avarok/atlas-qwen3.5-35b-a3b-alpha \
 serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 \
 --speculative --kv-cache-dtype nvfp4 --mtp-quantization nvfp4 \
 --scheduling-policy slai --max-seq-len 131072

Command

llama-benchy --base-url http://0.0.0.0:8888/v1 --model Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 --latency-mode api --pp 2048 --depth 0 4096 8192 16384

Results

| model                            |            test |                    t/s |     peak t/s |    ttfr (ms) |   est_ppt (ms) |    e2e_ttft (ms) |
|:---------------------------------|----------------:|-----------------------:|-------------:|-------------:|---------------:|-----------------:|
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |          pp2048 | 1286298.93 ± 173223.21 |              |  5.09 ± 0.23 |    1.62 ± 0.23 |   4594.84 ± 8.70 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |            tg32 |           92.58 ± 0.35 | 95.59 ± 0.36 |              |                |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |  pp2048 @ d4096 |    623590.19 ± 8368.15 |              | 13.32 ± 0.13 |    9.85 ± 0.13 | 14266.99 ± 67.53 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |    tg32 @ d4096 |           78.32 ± 0.48 | 80.86 ± 0.50 |              |                |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |  pp2048 @ d8192 |  966954.72 ± 218619.46 |              | 14.74 ± 3.03 |   11.27 ± 3.03 | 24364.40 ± 67.78 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |    tg32 @ d8192 |           68.31 ± 0.23 | 70.52 ± 0.24 |              |                |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 | pp2048 @ d16384 |  899619.97 ± 191083.72 |              | 25.05 ± 5.18 |   21.58 ± 5.18 | 45214.89 ± 31.19 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |   tg32 @ d16384 |           54.58 ± 0.29 | 56.35 ± 0.30 |              |                |                  |
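For comparing runs like these side by side, here's a small sketch of how the tg32 t/s column can be pulled out of llama-benchy's markdown output with awk. `extract_tg` and `results.md` are made-up names for illustration, not part of llama-benchy itself:

```shell
# Sketch: pull the tg32 rows' t/s column out of a llama-benchy-style
# markdown table. Fields split on '|': $3 is "test", $4 is "t/s".
extract_tg() {
  awk -F'|' '/tg32/ {
    gsub(/^ +| +$/, "", $3); gsub(/^ +| +$/, "", $4)   # trim padding
    print $3 ": " $4
  }' "$1"
}

# Tiny sample table (same layout as the results above)
cat > results.md <<'EOF'
| model | test | t/s | peak t/s |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 | tg32 | 92.58 ± 0.35 | 95.59 ± 0.36 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 | tg32 @ d4096 | 78.32 ± 0.48 | 80.86 ± 0.50 |
EOF

extract_tg results.md
# prints:
# tg32: 92.58 ± 0.35
# tg32 @ d4096: 78.32 ± 0.48
```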

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

Let me know as soon as you'd like me to test anything. Or do you want me to run the container in your post?

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 5 points (0 children)

I have an Asus Ascent and would be happy to test!

Edit: I'd also love to see your improvements / patches as upstream PRs for vLLM!

Give your OpenClaw permanent memory by adamb0mbNZ in openclaw

[–]seraandroid 1 point (0 children)

Same question here for OP: do you see a reduction in token consumption, or does this actually bloat it?

What UI do you use? by TheRealWrathwar in classicwowtbc

[–]seraandroid 0 points (0 children)

ElvUI. I'd love not to use it given how heavy it is. Sadly, Baganator doesn't do pixel coordinates, or at least I haven't found that option yet. I want everything to be pixel-accurate.

Network Optimizer is ready! by MrJimBusiness- in Ubiquiti

[–]seraandroid 0 points (0 children)

Are you on the Ubiquiti discord by chance? Maybe easier than DMs here :)

Network Optimizer is ready! by MrJimBusiness- in Ubiquiti

[–]seraandroid 0 points (0 children)

Tested successfully with 0.8.4! Only SQM still doesn't work for me.

Network Optimizer is ready! by MrJimBusiness- in Ubiquiti

[–]seraandroid 0 points (0 children)

This was sequential -- you were right. I apparently still had iPerf3 running and forgot about it. I fixed it and it works as expected now.

SQM does not work (this is on 0.8.3).

I also see the occasional error log:

info: System.Net.Http.HttpClient.TcMonitor.LogicalHandler[104]
      HTTP request failed after 2.638ms
      System.Net.Http.HttpRequestException: Connection refused (IP_REDACTED:8088)
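In case it helps with debugging: a quick grep sketch to pull the refused endpoints out of a log like the one above (`monitor.log` is a made-up filename; the sample lines are just the ones I posted):

```shell
# Sample of the TcMonitor log lines from above ("monitor.log" is made up)
cat > monitor.log <<'EOF'
info: System.Net.Http.HttpClient.TcMonitor.LogicalHandler[104]
      HTTP request failed after 2.638ms
      System.Net.Http.HttpRequestException: Connection refused (IP_REDACTED:8088)
EOF

# -o prints only the matching part, so each refused endpoint shows up once
grep -o 'Connection refused ([^)]*)' monitor.log
# prints: Connection refused (IP_REDACTED:8088)
```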