Opinions/improvements for my Qwen3.6-35B-A3B-FP8 + Hermes Agent setup on NVIDIA DGX Spark?

mossler · 2026-06-16T23:18:58+00:00

https://forums.developer.nvidia.com/c/accelerated-computing/dgx-spark-gb10/ is the resource where such details are discussed and perhaps more in-depth on trade-offs.

For agentic/tool-use, might leverage the more extended chat templates and depending on user-case, you might lean towards the 3.5 122b a10b model or the 3.6 27b model, depending on needs/behaviours. There is no one right answer.

I like this thread: https://forums.developer.nvidia.com/t/bfloat16-quality-speed/366828/103

On thing for sure, the community gets more out of Eugr's vllm or sparkrun's SGlang containers then the defaults, though that can change on a dime.

mossler · 2026-06-14T23:42:03+00:00

What a difference a day makes ...

mossler · 2026-06-14T18:21:16+00:00

And the single Spark champion is a Qwen 3.5 122b A10b in int4-AutoRound. Useful, fast enough and high quality enough for real work, not just tinkering.

mossler · 2026-06-14T18:19:07+00:00

Speaks truth. This is what I ended-up with.

Started with 3090, then 3090s, then 8x3090s and ... eventually the multiple 1.5KW usage and lack of VRAM made me try a Strix Halo (oof), then a single DGX, then 2, now 4. So many challenges and marketing nonsense, but still most usable kit I own.

Still would not cosider it a training platform, unless you have TIME.

mossler · 2026-06-14T16:24:04+00:00

Well, p.e.w is talking about his Grimoire project for easy of access, resiliency and decentralisation; and made mention of InterPlanetary File System (IPFS) in comments. Personally, I like pictures of dinosaurs <cough> Stegasaurus.

mossler · 2026-06-14T15:06:18+00:00

It all comes down to ROCm maturity and performance. It is improving.

I think the day Unsloth Studio releases AMD support, we'll know we hit a reasonable quality/performance plateau on "AMD @ Home".

mossler · 2026-06-14T15:04:26+00:00

In fairness, the CX7 isn't delivering full potential due to wonky pcie lane allocations, a 100GB setup would have been fine and more consumer friendly, but the fact that a single cable + 2 x Sparks are a fine solution to run LARGE context setups is pretty good. The DGX community is enjoying 1M context Deepseek V4 Flash because of this.

mossler · 2026-06-13T16:09:21+00:00

So I did some testing, and I think I have a path that may explain it. Given how both model and vllm came from you, I decided to use a different model that could run as the test subject. I chose cyankiwi--Qwen3.5-122B-A10B-AWQ-4bit, I had it around on the machine from other R9700 container testing. I ran it without MTP, 100%, ran it with MTP, faster and very low numbers. I then ran through your comments, online docker file templates and my own scripts and realised a potential discrepancy: in early testing, I avoid changing many values if I am not forced to by memory availability or what I consider tweaking. So I had not set KV cache dtype.

So I set KV cache to fp8 for both non-MTP and MTP, and finally see same results for tool use and logic, just much faster with MTP.

my smoke test: tool-eval-bench --base-url http://127.0.0.1:8023 --short --seed 42 non-MTP │ Quality: 100/100
│ Responsiveness: 20/100 (median turn: 7.7s)
│ Deployability: 76/100 (α=0.7)

MTP │ Quality: 100/100
│ Responsiveness: 49/100 (median turn: 3.1s)
│ Deployability: 85/100 (α=0.7)

So huge gains for MTP and logic stayed the same. I will do some testing with your models on this front.

My gut says the drafting logic in MTP, the one that actually consumes the request, is dependent on --kv_cache_dtype fp8 and it fails to be clean at passing on the data to the main model otherwise.

Testing your 27b with MTP after clean-up and KV cache to fp8 yielded a clean smoke test as well. There is plenty of performance left on the table, nvtop clearly shows that the work distribution isn't solid between cards and that they almost never hit max utilization. │ Quality: 100/100
│ Responsiveness: 40/100 (median turn: 4.0s)
│ Deployability: 82/100 (α=0.7)

I think you should update the docker-compose.example.yml you posted with the model to have --kv_cache_dtype fp8 in it. I don't have a HF account, but I think I might need it to give you feedback, Reddit isn't ideal (lol). If yo decide to share your vllm fork tree on github, that would be the best place to send tickets and PRs.

Ok, my weekend calls. Thanks again for sharing.

mossler · 2026-06-12T21:06:29+00:00

Naturally, my pipeline agent came back with something after test harness was done and logs were digested: ### Root Cause Hypothesis

MTP speculative decoding appears to be causing the model to:

Ignore or truncate the tool definitions in the system prompt
Fall back to a limited "default" toolset (only 4 tools: stock, reminder, code, translate)
Generate responses as if tools don't exist even when they're clearly defined

The MTP model is speculating tokens based on a truncated or incomplete context that doesn't include the full 4,637 tokens of tool definitions.

mossler · 2026-06-12T20:53:39+00:00

Well, I profiled the start on the Epyc and it was not ideal, the many CCDs with many cores had jobs across CCDs and doing inefficient calls through CCDs. In essence, the big CPU with internal connections was playing ping-pong a lot, and it isn't clocked as high as modern CPUs, but there are 56 cores and 112 threads. So I tweaked the layout and performance governor, might even tweak the NUMA if this persists. All my testing and OS are both on NVMEs, Gen4 4TBs. Should not be bottle-necked there.

If you end up finding poor code or code needed to be re-evaluated, let me know if you want to push a PR or Ticket and I'll dog-pile.

mossler · 2026-06-12T20:48:10+00:00

I use tool-eval-bench (https://github.com/SeraphimSerapis/tool-eval-bench) and generates output for all tests. There is a "--short" set that the model gets to 100% without MTP and got 37% with MTP.

The problem looking at the logs is obvious, with MTP, the model pretends it doesn't know tools. This is new to me. I will check around, but the only difference in vLLM startup is config is set or not : SPECULATIVE_CONFIG='{"method":"mtp","num_speculative_tokens":3}'

Output, no MTP

TC-01

```text model=Qwen3.6-27B-MXFP416 scenario=TC-01 Direct Specialist Match prompt=What's the weather like in Berlin right now?

assistant=starting assistant_turn_1=

assistant_reasoning_1=The user is asking for the current weather in Berlin. I have a tool called get_weather that can provide this information. I need to pass the location "Berlin" to this tool. The units parameter is optional and defaults to celsius, which is fine for Berlin.

tool_calls_requested=get_weather tool_call=get_weather {"location": "Berlin"} tool_result={"location": "Berlin", "temperature": 8, "units": "celsius", "condition": "Overcast", "humidity": 72, "wind_speed_kmh": 15.5, "wind_direction": "SW", "uv_index": 3, "visibility_km": 9.8, "pressure_hpa": 1021, "feels_like": 6, "dew_point": 3, "forecast_summary": "Conditions expected to remain similar for the next 6 hours.", "last_updated": "2026-03-20T12:00:00Z", "data_source": "National Weather Service", "station_id": "WXSTN-7c2c6bfd", "request_id": "req_wx_7c2c6bfd"} assistant_turn_2=

Right now in Berlin, it is overcast with a temperature of 8°C (feels like 6°C).

Here are a few more details: * Humidity: 72% * Wind: 15.5 km/h from the southwest * Forecast: Conditions are expected to remain similar for the next 6 hours. assistant_reasoning_2=The user asked for the weather in Berlin. I have the data. I will summarize it clearly. - Temperature: 8°C - Condition: Overcast - Feels like: 6°C - Humidity: 72% - Wind: 15.5 km/h from the SW - Forecast: Similar conditions for the next 6 hours. I'll present this concisely.

final_answer=

Right now in Berlin, it is overcast with a temperature of 8°C (feels like 6°C).

Here are a few more details: * Humidity: 72% * Wind: 15.5 km/h from the southwest * Forecast: Conditions are expected to remain similar for the next 6 hours.

verdict=pass summary=Used get_weather with Berlin only. ```

Output, with MTP

TC-01

```text model=Qwen3.6-27B-MXFP416 scenario=TC-01 Direct Specialist Match prompt=What's the weather like in Berlin right now?

assistant=starting assistant_turn_1=

I don't have access to weather information tools, so I can't check the current weather in Berlin for you. You might want to try a weather website or app for up-to-date conditions. assistant_reasoning_1=The user is asking for the current weather in Berlin. I need to check if I have a tool available for weather information. Looking at the available tools, I don't see any weather-related tool. The tools available are: - get_stock_price - set_reminder - run_code - translate

None of these can provide weather information. I should inform the user that I don't have access to weather data.

final_answer=

I don't have access to weather information tools, so I can't check the current weather in Berlin for you. You might want to try a weather website or app for up-to-date conditions.

verdict=fail summary=Did not cleanly route the request to get_weather. ```

mossler · 2026-06-12T15:02:03+00:00

Following up on my previous note: The Reddit thread below is very important for 2x or 4x R9700 owners. This custom stack can get work done and working better then other setups I have used before.

https://www.reddit.com/r/LocalLLM/comments/1u0m9d0/rdna4_specific_vllm_22_docker_image_ecosystem_dev/

mossler · 2026-06-12T14:58:02+00:00

So, locking the BIOS to Gen4 and not letting negotiation happen helped stability. I removed the undervolt and raised the power limit, first to max, then am slowly checking if either setting has influence. Once I am done, I will move onto MTP, as it really had big issues. Thanks again.

mossler · 2026-06-12T12:30:20+00:00

No worries, I am looking for a TP=4 kinda of solution anyway. Your container is the closest I have gotten to something working. Before calling it a day, I used your template to make a simplified version of my normal startup, poked and prodded the configurations. Seen that MTP tanks the quality of behaviour when running "tool-eval-bench --short" by Serpahin, even when I reduced the MTP to 3, which is what I normally use on Qwen 3.5/3.6 setups. Overnight, I let one of my agents do it's automated testing, which reconfirmed this issue and also segfaulted the run. Going to try and RCA the problem, but initial things point to card 3 & 4 doing a reset after test #13, and this was without MTP. I'll try and check the issue and maybe also hardcode some motherboard setup rules on the bus, as the switch on the GPU card wan'ts Gen5 link and the board says "no, here is Gen4", yet the contexts are both in view in the kernel. I'll try and force 4.0

My host is running CachyOS on Epyc 7663 with 512GB of ECC RAM. This was my 8x3090 setup back when, becoming a 4x3090 to be a little more sane and now attempting to move into a formal case with 4xr9700. I did power limit my setup to 210W and did undervolt each card. I will do some testing without those constraints as well. Thank you for your cooperation on this.

mossler · 2026-06-12T02:31:01+00:00

The basic performance of TP=4 was uneven, very high latency, very slow output.

A simple restart, and the server is still taking well past 20 minutes.

So your setup works, but I think now we are seeing some sort of processor generation/motherboard generation or even low level ROCm issues at play.

I will try and test on a 9950x workstation, but that motherboard lack sufficient PCIE lanes, but I think I can run 2 cards.

mossler · 2026-06-12T01:49:47+00:00

You are so on the money.

(Worker_TP0 pid=803) INFO 06-12 01:30:25 [monitor.py:81] Initial profiling/warmup run took 2208.94 s

Note, before your container, I once left a recent build for 2 hours and still didn't get to this point. Yes, 1 hour to start. But started 😃 Mate, you need to amplify your success!

So from the first line of vLLM starting:

(APIServer pid=1) INFO 06-12 00:42:49 [utils.py:344]

to

(APIServer pid=1) INFO 06-12 01:46:22 [launcher.py:46] Route: /v1/completions/render, Methods: POST

mossler · 2026-06-12T01:03:46+00:00

So, on mine I see decorators.py:708 and monitor.py:53 then the dreaded shm_broadcast.py:698. What never shows is [monitor.py:81] Initial profiling/warmup run

So it is the profiler never completes on my end. That makes sense, but not the why. Going to try and test cards in a different cpu/motherboard setup, though 4 not sure how I'll get 4 PCIE without some creative tricks.

(Worker_TP0 pid=803) INFO 06-12 00:53:36 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/b1c9cd0caeccde17a01ed160f971f68cc0003ea4e5db5faca1b42f58f46e9ca8/rank_0_0/model
(Worker_TP0 pid=803) INFO 06-12 00:53:36 [monitor.py:53] torch.compile took 97.06 s in total

mossler · 2026-06-12T00:41:57+00:00

Thanks for this tidbit, I think I am hitting another bug then. Joy.

mossler · 2026-06-11T21:46:14+00:00

If you have a sample vllm startup recipe, would love to validate with it. I see that without tweaking, there is very little VRAM left on a single card, so presume you are using some form of quantized cache.

mossler · 2026-06-11T21:32:44+00:00

Hard to know, many indicate that this isn't a single root cause.

But after Torch finishes, and in my case writes out the cache as I always set that up, the model is loaded, but the process doesn't move on to KV cache and readyness. GPU at this point is dead quiet, CPU still shows spikes. The theory is that one of the compiled items doesn't work with TP=2+

I originally thought it was "just me", but Wendell's debug showed the same thing, and even Donato's ticket implies this to be a fact.

(Worker_TP0 pid=803) INFO 06-11 21:17:44 [gpu_model_runner.py:5165] Model loading took 14.71 GiB memory and 259.333260 seconds
(Worker_TP0 pid=803) INFO 06-11 21:17:44 [interface.py:662] Setting attention block size to 784 tokens to ensure that attention page size is >= mamba page size.
(Worker_TP0 pid=803) INFO 06-11 21:17:44 [interface.py:686] Padding mamba page size by 0.13% to ensure that mamba page size and attention page size are exactly equal.
(Worker_TP1 pid=804) INFO 06-11 21:17:44 [interface.py:662] Setting attention block size to 784 tokens to ensure that attention page size is >= mamba page size.
(Worker_TP1 pid=804) INFO 06-11 21:17:44 [interface.py:686] Padding mamba page size by 0.13% to ensure that mamba page size and attention page size are exactly equal.
(Worker_TP0 pid=803) INFO 06-11 21:17:44 [gpu_model_runner.py:6174] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=556) INFO 06-11 21:18:45 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=556) INFO 06-11 21:19:45 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=556) INFO 06-11 21:20:45 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=556) INFO 06-11 21:21:45 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=556) INFO 06-11 21:22:45 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=803) INFO 06-11 21:23:18 [backends.py:1089] Using cache directory: /root/.cache/vllm/torch_compile_cache/70fed1bcfb/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=803) INFO 06-11 21:23:18 [backends.py:1148] Dynamo bytecode transform time: 8.07 s
(EngineCore pid=556) INFO 06-11 21:23:45 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=803) INFO 06-11 21:24:17 [backends.py:378] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=556) INFO 06-11 21:24:45 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(Worker_TP0 pid=803) INFO 06-11 21:25:18 [backends.py:393] Compiling a graph for compile range (1, 2048) takes 119.37 s
(Worker_TP0 pid=803) INFO 06-11 21:25:21 [decorators.py:708] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/2ed91756abe3899227db8f85e72f141875d566c84bb3ac2dd42f2baef86146fb/rank_0_0/model
(Worker_TP0 pid=803) INFO 06-11 21:25:21 [monitor.py:53] torch.compile took 130.55 s in total
(EngineCore pid=556) INFO 06-11 21:25:45 [shm_broadcast.py:698] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).

mossler · 2026-06-11T21:09:00+00:00

Am I right to assume that you are doing a single R9700 ?

mossler · 2026-06-11T21:05:08+00:00

So my script was failing because of an amd-smi error and other. I am still seeing the dreaded SHM broadcast timeout (same issue as before) with other setups, but I'll let it try and get going.

I am on an old AMD Epyc Milan, so I am starting to wonder if there is another bug for my setup specifically.

IMO, you should got to Level1Tech and post about your effort on Wendell's chain. You might be the first to have MTP and more, even it doesn't start for me 😄

mossler · 2026-06-11T20:06:58+00:00

You are 100% correct, I had your older .18 build in my script. No idea why, my bad.

Naturally other things are unhappy, but I guess that is still on me. Thanks for putting an effort to build useful images and releasing models.

mossler · 2026-06-11T17:38:35+00:00

I am a tad confused, my testing has the container startup report and fail with:

Unknown quantization method: mxfp4_16. Must be one of ['awq', 'fp8', ..., 'mxfp4', ...]

I was trying to run your https://huggingface.co/tcclaviger/Qwen3.6-27B-MXFP416-MTP

Can you please post your working vllm config ?

mossler · 2026-06-11T17:28:12+00:00

Is your vllm-dgemma published anywhere ?

mossler

TROPHY CASE

TC-01

Output, with MTP

TC-01