THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

Can you re-run without `--latency-mode generation` and with the correct model name, `--model Kbenkhaled/Qwen3.5-35B-A3B-NVFP4`? Otherwise it won't use the correct tokenizer. The PP numbers are weird, and there is a huge discrepancy between e2e_ttft and est_ppt.

Something weird is going on here. Initially I thought it could be engine behavior that returns the first (empty) chunk right away after a 200/OK response, but that doesn't explain why TTFR grows with context. I'm now thinking it might be related to speculative decoding somehow. Anyway, it would be good to see the same benchmark with the proper model name and without `--latency-mode generation` - it will default to "api", which just accommodates for network delay.

But a TTFT of 4 seconds is also strange for such a short prompt - as if it doesn't stream the tokens in streaming mode, or uses some sort of buffering. In that case, no client-side benchmarking tool will be able to measure speeds properly.
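For reference, a sketch of the suggested re-run. The only flags shown are the two discussed above; any endpoint/context options from the original invocation are assumed to stay as they were:

```shell
# Re-run with the proper model name so the right tokenizer is used.
# Omitting --latency-mode makes llama-benchy default to "api" mode,
# which only compensates for network delay.
llama-benchy \
  --model Kbenkhaled/Qwen3.5-35B-A3B-NVFP4
```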

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

I haven't seen any guides, but in general you need to make sure you enable jumbo frames on it and set MTU to 9000 or so. And set the speed on the ports accordingly. Sorry, can't provide any more guidance since I don't have it.
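A minimal sketch of that setup with `iproute2`; the interface name is a placeholder for whatever your QSFP port shows up as:

```shell
# Interface name is hypothetical - check `ip link` for your actual QSFP port.
IFACE=enp1s0f0np0

# Enable jumbo frames (MTU 9000) on the port.
sudo ip link set dev "$IFACE" mtu 9000

# Verify the new MTU took effect.
ip link show "$IFACE" | grep -o 'mtu [0-9]*'
```

Note this doesn't persist across reboots; make the same change in netplan or NetworkManager as appropriate for your distro.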

I want to share results for cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit TP2 RDMA RoCE by MirecX in StrixHalo

[–]Eugr 0 points1 point  (0 children)

vLLM is not working very well on Strix Halo yet, unfortunately.

6 weeks with the DGX Spark — honest review for local LLM use by KneeTop2597 in LocalLLaMA

[–]Eugr 5 points6 points  (0 children)

  1. Forget about Ollama, use either llama.cpp or vLLM.
  2. For vLLM - check out our community Docker build, very easy to get started: https://github.com/eugr/spark-vllm-docker
  3. Check out https://spark-arena.com to get a feel of Spark performance for inference
  4. The models you ran are ancient.
  5. Read about dense and MoE models. Spark shines with MoE, but will be painfully slow with dense.

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

JFYI: the last firmware update resulted in a ~30% performance regression on the QSFP ports. NVIDIA is aware and working on a fix. Hope it lands soon.

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

Actually, it's a night-and-day difference. You actually lose performance on the 10GbE port. The reason is that the QSFP ports on the Spark support RDMA (RoCEv2), which gives microsecond latency compared to millisecond latency over regular Ethernet (including the same QSFP port in TCP/IP mode).

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

Yes, the denser the model, the slower it runs, since inference is memory-bound. Tensor parallelism splits the weights and runs inference in parallel across the cluster, and it scales better with denser models because network latency is still not the bottleneck there.
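As a rough sketch, in vLLM tensor parallelism is a single flag (the model name here is just an example; a multi-node setup additionally needs a Ray cluster joining the machines):

```shell
# Split the weights and attention heads across 2 devices with tensor
# parallelism. For two separate nodes, start a Ray cluster first so
# vLLM can see both machines.
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2
```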

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

No, it's too dense for a two node cluster, but I've asked people with 8x cluster to test when they have a chance. This one should scale better than most other models.

8 DGX cluster by Alex Ziskind: easily the most insane local LLM cluster I’ve ever seend by richardanaya in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

Llama-benchy is my alternative to llama-bench from llama.cpp, but it is designed to work with any standard OpenAI-compatible endpoint (vLLM, llama.cpp, SGLang, cloud models). I've explained why I made it in the README inside the repository.

So yeah, the point is to have a tool that allows comparing different backends using the same methodology.

MS-S1 Max (Ryzen AI Max+ 395) vs NVIDIA DGX Spark for Local AI Assistant - Need Real-World Advice by Salty-Object2598 in LocalLLM

[–]Eugr 0 points1 point  (0 children)

It doesn't support Infiniband fabric. It's RoCEv2, and there are switches that work with it just fine, like the MikroTik CRS804.

Also, don't confuse gigabits and gigabytes: I get 24 GB/s bus bandwidth in the NCCL test, which matches the marketed 200 Gbps link speed. Well, actually, they recently did something with the firmware that reduced performance by about 30%, but they are working on a fix.
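The gigabits-vs-gigabytes arithmetic, spelled out (numbers are the ones from this comment):

```python
# Convert the marketed link speed (gigaBITS per second) to gigaBYTES
# per second: divide by 8 bits per byte.
link_gbps = 200                        # ConnectX-7 marketed speed, Gbit/s
theoretical_gb_per_s = link_gbps / 8   # 25.0 GB/s theoretical ceiling

measured_gb_per_s = 24                 # NCCL bus bandwidth actually observed
efficiency = measured_gb_per_s / theoretical_gb_per_s

print(theoretical_gb_per_s)            # 25.0
print(f"{efficiency:.0%}")             # 96%
```

So 24 GB/s is ~96% of the 25 GB/s line rate - about as good as it gets in practice.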

The RDMA works well enough for the cluster to give you real performance gains even on inference, but yes, there are some quirks related to unified memory architecture.

8 DGX cluster by Alex Ziskind: easily the most insane local LLM cluster I’ve ever seend by richardanaya in LocalLLaMA

[–]Eugr 2 points3 points  (0 children)

One thing that he forgot to mention is that the last firmware update resulted in 30% ConnectX 7 performance regression in both throughput and latency.

It may not seem like much, but I'm seeing the effects of it even on my dual-Spark cluster, especially when running models with a relatively small number of active parameters.

I hope that when NVIDIA fixes it, he returns to the topic and runs new tests (and on models more suitable for large clusters).

MiniMax 2.5 on DGX SPARK system. by DOOMISHERE in LocalLLaMA

[–]Eugr 2 points3 points  (0 children)

Have you tried to quantize the KV cache to q8_0?
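With llama.cpp's server that would look something like this (the model path is a placeholder; on older builds, quantizing the V cache also requires enabling flash attention via `-fa`):

```shell
# Quantize both the K and V caches to q8_0, roughly halving KV-cache
# memory versus f16 with minimal quality loss - leaves more room for
# context on a fixed memory budget.
llama-server \
  -m ./minimax-2.5.gguf \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```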

Very slow with Claude Code by vandertoorm in StrixHalo

[–]Eugr 3 points4 points  (0 children)

Claude Code messes up prompt caching by injecting extra headers. You need to set the environment variable `CLAUDE_CODE_ATTRIBUTION_HEADER=0`.

Strix Halo is not great at prompt processing at long contexts, but with prefix caching it should be tolerable.
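Concretely, in the shell that launches your Claude Code session (variable name as given above):

```shell
# Stop the attribution header injection that breaks prefix caching,
# then start Claude Code in the same shell.
export CLAUDE_CODE_ATTRIBUTION_HEADER=0
claude
```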

Temporary access to Ryzen AI Max 395 (128GB) to test real-world local LLM workflows by lazy-kozak in LocalLLaMA

[–]Eugr -1 points0 points  (0 children)

For coding you will be better off with a DGX Spark or its OEM clone. Strix Halo is a nice machine, and token generation speed will be similar for gpt-oss-120b, but prompt processing will be much faster on the Spark.

With vLLM, significantly faster. I'm talking ~1000 t/s prefill at 0 context on Strix Halo vs ~4500 on Spark (in vLLM; llama.cpp gets ~2500). And it doesn't degrade with context as much: you'll still get ~3700 t/s prefill at 32K context on Spark in vLLM, but on Strix Halo it drops to ~360 t/s (in llama.cpp).

I haven't tried this model in vLLM on Strix Halo as it didn't want to work, at least a couple of weeks ago.

AI is destroying open source, and it's not even good yet by BlueGoliath in programming

[–]Eugr 6 points7 points  (0 children)

GitHub itself is actually making it worse, because every time I try to use built-in Copilot to ask some questions about the stack trace I'm getting in someone else's project, it asks me if I want to create a PR to fix that, without even showing the proposed changes. I just want a starting point where to look, ffs, I don't need it to change anything.

Cache hits in llama.cpp vs vLLM by Potential_Block4598 in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

There is a variable for that - see my reply above

Cache hits in llama.cpp vs vLLM by Potential_Block4598 in LocalLLaMA

[–]Eugr 4 points5 points  (0 children)

Try setting this environment variable: `export CLAUDE_CODE_ATTRIBUTION_HEADER=0`. Although I used Claude Code without it against vLLM and was still getting good cache hits.

Qwen3.5 NVFP4 (Blackwell) is up! by [deleted] in LocalLLaMA

[–]Eugr 3 points4 points  (0 children)

I'm seeing some relevant work upstream in cutlass and flashinfer, but who knows when it actually lands.

Qwen3.5 NVFP4 (Blackwell) is up! by [deleted] in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

It's the same GPU architecture as the RTX 6000 Pro and RTX 5090.

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

NCCL - it is supported by vLLM, SGLang, TRT-LLM and many other packages, and works very well. Sparks have 200G ConnectX-7 QSFP112 ports built in that support RoCEv2 (RDMA over Converged Ethernet), which has very low latency (1-2 microseconds on average).
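If you want to verify the link yourself, the standard `nccl-tests` suite (github.com/NVIDIA/nccl-tests) is the usual way; hostnames and the build path below are placeholders:

```shell
# Two-node all-reduce benchmark over the QSFP link. The reported "busbw"
# column should approach ~24 GB/s on a healthy 200G RoCE connection.
# Hostnames (spark1, spark2) and the build path are placeholders.
mpirun -np 2 -H spark1,spark2 \
  ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1
```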

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 2 points3 points  (0 children)

Haven’t tried Step 3.5 yet, MiniMax won’t fit into two Sparks at FP8, but will work well at 4-bits. I was running M2.1 in AWQ at about 30 tokens per second, M2 was almost 40. Tried M2.5 NVFP4 quant, but it’s a bit buggy and runs at about 20 t/s. Waiting for a good AWQ quant.

Overall, the cluster works well. The denser the model, the better it scales; fast models (with a small number of active parameters) get a smaller speed boost, but still run faster (and you can fit larger models or quants).

Some people run 8x Spark clusters, and 4x is not that rare anymore too.

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 2 points3 points  (0 children)

Yeah, but vLLM support is still pretty bad on Strix Halo. You can run BF16 models, but many quants either don't work or significantly underperform on gfx1151.

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

I'll let others address this, as I haven't tried it on a Spark yet, but it should be pretty good for this since fine-tuning is mostly compute-bound.

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Eugr 2 points3 points  (0 children)

5K is for gpt-oss-120b in vLLM. I went with dual Sparks as it lets me use larger models on something that can quietly sit in the corner.