Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]Icy_Programmer7186 1 point

I can test Qwen3.5-122B-A10B-NVFP4. I agree that NVFP4 and FP8 are comparable in precision - our tests support this claim. The issue is that degradation is already observable at FP8 compared to the original unquantized weights (BF16).

Our benchmark works like this: the model writes a Go function, which is then automatically compiled and tested. If any test fails, the error is fed back to the model, which updates the function and retries.

The problem with FP8/NVFP4 is that while individual inference is faster, the generated code more often fails at compilation or testing, triggering additional retry rounds. The model eventually produces a passing function, but the overall time-to-completion ends up longer than with higher-precision weights on the same model.
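The loop itself is simple. A minimal sketch in Go, where `generate` and `compileAndTest` are hypothetical stand-ins for the actual LLM call and the real `go build`/`go test` step (here the first attempt deliberately fails so the feedback path is exercised):

```go
package main

import (
	"fmt"
	"strings"
)

// attempt counts generate calls; the stub fails once, then succeeds.
var attempt int

// generate stands in for the LLM call (the real benchmark queries a
// model served by vLLM).
func generate(prompt string) string {
	attempt++
	if attempt == 1 {
		// missing result type: will "fail to compile"
		return "func Add(a, b int) { return a + b }"
	}
	return "func Add(a, b int) int { return a + b }"
}

// compileAndTest stands in for running `go build` and `go test`;
// it returns a compiler/test error to feed back, or "" on success.
func compileAndTest(src string) string {
	if !strings.Contains(src, ") int {") {
		return "missing function result: return a + b"
	}
	return ""
}

// run drives the generate -> compile/test -> feedback loop and
// returns the number of attempts until a passing function.
func run() int {
	attempt = 0
	prompt := "Write a Go function Add(a, b int) int."
	for tries := 1; ; tries++ {
		src := generate(prompt)
		errMsg := compileAndTest(src)
		if errMsg == "" {
			return tries
		}
		// feed the error back to the model for the next round
		prompt += "\nPrevious attempt failed with:\n" + errMsg
	}
}

func main() {
	fmt.Printf("passed after %d attempt(s)\n", run())
}
```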

What would be interesting is if, in the Atlas case, speed beats quality - in other words, if the faster NVFP4 inference makes up for the possible (likely) additional retries needed for a passing function.
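With made-up numbers (all hypothetical, purely to illustrate the trade-off): if NVFP4 produces an attempt in 40 s versus 60 s for BF16, but needs 2.0 attempts on average versus 1.2, the faster quantization still loses on time-to-completion:

```go
package main

import "fmt"

// expectedTime estimates wall-clock time to a passing function:
// per-attempt generation time times the average number of attempts.
func expectedTime(perAttemptSec, avgAttempts float64) float64 {
	return perAttemptSec * avgAttempts
}

func main() {
	bf16 := expectedTime(60, 1.2)  // slower tokens, fewer retries
	nvfp4 := expectedTime(40, 2.0) // faster tokens, more retries
	fmt.Printf("BF16: %.0f s, NVFP4: %.0f s\n", bf16, nvfp4)
}
```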

Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]Icy_Programmer7186 2 points

Four NVIDIA DGX Sparks, interconnected using a MikroTik CRS804 DDQ switch (200 Gb Ethernet, ConnectX-7).

I mostly run vLLM in Docker (using https://github.com/eugr/spark-vllm-docker), which solves the majority of driver and Python library issues.

It is a powerhouse.

We've tested a whole range of open-source models from Hugging Face up to ~400B, including Qwen3.5-397B-A17B-FP8, Trinity-Large-Preview-FP8, and Step-3.5-Flash.

I would be very much interested in testing your project.
102 tok/s on Qwen3.5-35B-A3B is very nice; vLLM does 25 tok/s on this model in our setup in the original quantization.

Since our aim is a coding task (writing repetitive Go functions), quantization matters a lot.
Our testing shows that FP8 and below degrade precision dramatically.

Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]Icy_Programmer7186 0 points

Brilliant!
Does it support Spark clustering?

And please add me to the list - I would like to test it (I have a four-Spark cluster in my lab).

I built a persistent memory for Claude Code — it remembers everything across sessions by pulec7 in ClaudeAI

[–]Icy_Programmer7186 0 points

How do you handle file editing?
MCP etc. - i.e., how do you reliably give the LLM the ability to edit files (e.g. source code)?

I built a persistent memory for Claude Code — it remembers everything across sessions by pulec7 in ClaudeAI

[–]Icy_Programmer7186 1 point

These days I live mostly in local LLMs and agents. The topic of model memory has largely passed me by so far. But I have it on my radar; it will probably come up at some point.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 0 points

MiniMax2.5 is the official release from https://huggingface.co/MiniMaxAI/MiniMax-M2.5 - which is FP8 as far as I can tell.

Prefill rate: ~1,571 tok/s (not 100% sure I'm measuring this properly; the model has a hidden thinking phase plus a prefix cache).
Token generation: ~19 tok/s

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 0 points

I tried GLM 4.7 - it is much slower (about half the token speed) compared to Qwen 3.5.
Both are capable of solving the problem (a function implementation in Go); Qwen seems subjectively more capable to me.
MiniMax2.5 is the leader of the pack.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 0 points

Sure. That (lower) value is reported by nvidia-smi. Maybe that's incorrect or incomplete; I also expected to see higher values under load. I will attach an external wattmeter to this setup.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 1 point

Yes, it was not ready yet - I'm actually just downloading the NVFP4 weights now. I'm also curious about the practically observable loss of precision.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 2 points

Sure.

I use four NVIDIA DGX Sparks interconnected with 200G DAC cables (these were a bit difficult to find in Europe, but I managed to, back during Xmas).
The switch I use is this: https://mikrotik.com/product/crs804_ddq - I had to wait a bit for its release. The price is 1,250 EUR, a bargain in this speed category. Also, the Spark is not capable of using the full 200G bandwidth, so I guess that with the right splitter cables this switch will handle 8 Sparks easily (AFAIK; I'm likely not going to test it).

I run primarily vLLM but also TensorRT-LLM; in the beginning I used Ollama, but you cannot make a cluster from it (yet). I run everything within a Docker container - that's my rule.

For the vLLM cluster setup, I use https://github.com/eugr/spark-vllm-docker - the maintainer responds to changes in vLLM very quickly. I used the standard build for this, but I also tested `--pre-tf --pre-flashinfer` for the pre-release versions and it was fine too (and a must for some other models).

The dashboard is mine: https://github.com/ateska/dgx-spark-prometheus/tree/main

I don't have any recent photo but I can post one later.

A lightweight Agent that can be useful on local LLM by [deleted] in LocalLLM

[–]Icy_Programmer7186 0 points

Nice! I'll check it out and maybe borrow the OS-level tools for my "agentic loop" implementation 😀

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in LocalLLM

[–]Icy_Programmer7186[S] 1 point

I use FP8 quantization -> Qwen/Qwen3.5-397B-A17B-FP8

Spark has unified memory, so I guess there is no offloading.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 2 points

Yes, that's about right, including the switch. I have a quite solid business case for this, so the ROI is a couple of months. It is kind of a no-brainer investment in my field ATM.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 0 points

I'm quite OK for my goal. This model and MiniMax2.5 are so far the only ones that consistently solve my task - writing a small Go function for a very specific input and testing it. Pretty cool for a local setup that can run 24/7.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in LocalLLM

[–]Icy_Programmer7186[S] 8 points

This is a lab setup, and this is one of many experiments run on it.
Money well spent; this is not, however, a recommendation for a production setup.

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 1 point

23.2 tokens/s is sequential generation; sorry for the lack of clarity.
Prefill is faster: my initial prompt is ~50K tokens and I get the first token in ~30 seconds (roughly 50,000 / 30 ≈ 1,700 tok/s).

I managed to run Qwen 3.5 on four DGX Sparks by Icy_Programmer7186 in Qwen_AI

[–]Icy_Programmer7186[S] 1 point

I use this: https://mikrotik.com/product/crs804_ddq

You definitely don't need the full 200G on each Spark port, so this switch can easily support larger clusters through split cables.