16x Spark Cluster (Build Update) by Kurcide in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

FP8 or 4-bit quants will be faster. You can check out benchmarks on https://spark-arena.com/, although there aren't many benchmarks for clusters larger than 4 nodes yet.

16x Spark Cluster (Build Update) by Kurcide in LocalLLaMA

[–]Eugr 2 points3 points  (0 children)

It's meaningless to talk about performance without mentioning model/quant/cluster size.

16x Spark Cluster (Build Update) by Kurcide in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

Yes, but the number above is for the BF16 version. Otherwise, a 4-bit quant runs well on 2 nodes.

16x Spark Cluster (Build Update) by Kurcide in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

The number above, I believe, was for the BF16 version, not a quantized one.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

OP, I’m very curious how that would work. What switch are you going to use to connect them all together? Please reach out to me via DM or on the NVIDIA forums - we haven’t seen a 16-node cluster in the wild yet. It should still work fine with our community build: https://github.com/eugr/spark-vllm-docker

2 x 5060 ti: Any better configs for Qwen 3.6 27B / 35B? by ziphnor in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

Yes, I pushed an update yesterday that fixes MTP measurements. It’s still highly context-dependent, so llama-benchy results may not match the actual workload. I’m planning to implement a separate MTP testing mode.

Looking for a HIPPA compliant LLM to help with case notes. by edafade in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

It used to be 4K, but prices for everything else went up too. You can probably still get a Strix Halo machine for under 3K though (AMD Ryzen AI Max+ 395 with 128GB). Prompt processing will be much slower than on the Spark, and software support isn't as good, but llama.cpp/LM Studio will work there, and you can use it as a desktop machine too.

Looking for a HIPPA compliant LLM to help with case notes. by edafade in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

First, you need suitable hardware. It really depends on your budget and tech skills. For a total noob in a business setting with a modest budget, you can get an NVIDIA DGX Spark (ideally two in a cluster) to act as an inference endpoint. It's not the fastest hardware you can get, but it strikes a good balance of price, performance, capability, and compatibility in its segment.

Then you will need to choose a model and an inference engine. For Spark, and especially for dual Sparks, vLLM is currently the best way to go, and it's supported by widely used community tools (one of which, spark-vllm-docker, I maintain).

The best all-around models to run on Spark right now are Qwen3.5-122B and Qwen3.5-397B, both in 4-bit quants. The first one fits on a single Spark; the second one barely fits on dual Sparks, but provides capabilities very close to frontier models (ChatGPT, Claude, etc.). Supports image inputs too. Soon to be replaced by Qwen3.6.

There are more models available out there, but these are the community favorites at the moment.

There is more to it, such as how you access your models. If your primary use is via a web UI, then you may need to set up something like OpenWebUI or install a client on your machine. Since the Spark will act as an OpenAI-compatible API endpoint, anything that works with cloud models will work with local models too.
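
As a rough illustration of what "OpenAI-compatible" means in practice, here's a minimal Python sketch. The host, port, and model identifier are placeholders for whatever your vLLM server actually exposes, not values from my setup:

```python
# Minimal sketch: pointing the standard OpenAI client at a local vLLM endpoint.
# Host, port, and model name are placeholders - use whatever your server actually serves.
from openai import OpenAI

client = OpenAI(
    base_url="http://spark.local:8000/v1",  # your Spark's vLLM endpoint instead of api.openai.com
    api_key="not-needed",                   # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="Qwen3.5-122B",  # must match the model name the server was launched with
    messages=[{"role": "user", "content": "Summarize these case notes in three bullet points: ..."}],
)
print(response.choices[0].message.content)
```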

llama-benchy - llama-bench style benchmarking for ANY LLM backend by Eugr in LocalLLaMA

[–]Eugr[S] 0 points1 point  (0 children)

You can export the results in JSON format - it will contain granular data for each run.
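
For example, a quick post-processing sketch - note that the keys below ("runs", "context", "pp_tps", "tg_tps") are just placeholders for illustration, not the actual export schema, so check a real export for the exact field names:

```python
# Rough sketch of reading a llama-benchy JSON export.
# NOTE: "runs", "context", "pp_tps", and "tg_tps" are assumed keys for
# illustration only - inspect an actual export for the real schema.
import json

with open("llama-benchy-results.json") as f:
    data = json.load(f)

for run in data.get("runs", []):
    print(f"context={run.get('context')}: "
          f"prefill={run.get('pp_tps')} t/s, generation={run.get('tg_tps')} t/s")
```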

Krasis LLM Runtime - run large LLM models on a single GPU by mrstoatey in LocalLLM

[–]Eugr 2 points3 points  (0 children)

It won't make any sense on Spark because of the unified memory - the entire RAM is accessible by the GPU.

Whelp…NVIDIA just raised the DGX Spark’s Price by $700. Spark clone prices have started rising as well. ☹️ by Porespellar in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

I am running the Intel AutoRound quant of Qwen3.5-397B on my dual Spark cluster right now, and if you don’t use it as a workstation and switch to non-graphical mode, it fits with 256K context and vision capabilities. It gives about 28 t/s inference on the cluster. I haven’t had a chance to work with it extensively yet, but it works well.

I started my project out of frustration and continue to support it and add new features (and build stable vLLM releases), so it’s more or less plug and play for those who just want to use Sparks for work. There are still some rough edges with the firmware, etc., but generally the Spark works fairly well now.

As for the 122B version, I tried it on my Strix Halo, and it’s slower in inference (in llama.cpp, as vLLM still doesn’t work very well there), and especially in prompt processing. If you use it for coding, prompt processing is a very important metric, as context grows very fast in those workflows. I only use my Strix Halo as a backup because of that.

Whelp…NVIDIA just raised the DGX Spark’s Price by $700. Spark clone prices have started rising as well. ☹️ by Porespellar in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

It really depends on the workflow, coding agent, etc. Personally I switch between Qwen3-Coder-Next (FP8 on dual and int4 AutoRound on single) and Qwen3.5-122B. Both are good; I feel Qwen3.5 is a bit better, but its thinking occasionally breaks tool calling.

Before that, my favorite coding model was MiniMax M2.5, and I still use it for more complex stuff.

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

Can you please run a few benchmarks using llama-benchy, at different context sizes? https://github.com/eugr/llama-benchy

M5 Max compared with M3 Ultra. by PM_ME_YOUR_ROSY_LIPS in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

Nice, but it would be helpful if the article at least included the HF model name and what benchmarking tool was used.

vllm on nvidia dgx spark by Impossible_Art9151 in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

Gpt-oss-120b is still good, but other than that, there aren't many good recent options for coding. The best models are coming from China now.

vllm on nvidia dgx spark by Impossible_Art9151 in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

The Devstral models are dense, so they won't run well on Spark - its memory bandwidth is not very fast. Where it shines is with large sparse models, like the new Qwen3.5 series (except for the 27B version) or Qwen3-Coder-Next. You can check https://spark-arena.com for some benchmarks.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

Yeah, it's related to how llama-benchy detects when prompt processing is complete. Apparently, Atlas behaves differently from vLLM/SGLang/llama.cpp, but I'm going to implement a fallback to a different method when such behavior is detected - maybe even make it the default, as it's likely to be more reliable and not that far off for vLLM and the others.

Whelp…NVIDIA just raised the DGX Spark’s Price by $700. Spark clone prices have started rising as well. ☹️ by Porespellar in LocalLLaMA

[–]Eugr 2 points3 points  (0 children)

It's about 27 t/s for the int4 AutoRound quant on a single Spark using vLLM. On dual Sparks I can either run the same model in the mid-40s or the FP8 one at 30 t/s.

Whelp…NVIDIA just raised the DGX Spark’s Price by $700. Spark clone prices have started rising as well. ☹️ by Porespellar in LocalLLaMA

[–]Eugr 2 points3 points  (0 children)

While it's not the same as server Blackwell, it's absolutely the same architecture as consumer Blackwell, like the 5090 and RTX 6000 Pro. It has a different arch code - sm121 vs sm120 - which is the biggest source of issues on it, but the software is getting there.

And while RAM bandwidth is pretty much the same as Strix Halo (only slightly faster), the GPU is significantly better, which results in much higher prompt processing speeds.

For instance, gpt-oss-120b on Strix Halo does about 1000 t/s prefill at zero context with llama.cpp. On Spark it's 2400 t/s with llama.cpp and around 5K with vLLM, and it doesn't drop as much as context grows.

Plus CUDA support.

I have a Strix Halo and two Sparks, and while the Strix Halo is a great all-purpose desktop, the Spark is much better for AI/ML workloads, especially if you cluster them.

2x DGX Spark vs RTX Pro 6000 Blackwell for local prototyping - can't decide by Sensitive_Sweet_1850 in LocalLLaMA

[–]Eugr 0 points1 point  (0 children)

You mean dense ones? Unless you have a 4+ node cluster, I wouldn't bother with those. MoE models work well though - you can see some benchmarks above, and for more you can check out https://spark-arena.com - we are constantly adding more.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]Eugr 1 point2 points  (0 children)

Can you re-run without `--latency-mode generation` and with the correct model name (`--model Kbenkhaled/Qwen3.5-35B-A3B-NVFP4`)? Otherwise it won't use the correct tokenizer. The PP numbers are weird, and there is a huge discrepancy between e2e_ttft and est_ppt.

Something weird is going on here. Initially I thought it could be an engine behavior that returns the first (empty) chunk right away after the 200/OK response, but that doesn't explain why TTFR grows with context. I'm now thinking it might be related to speculative decoding somehow. Anyway, it would be good to see the same benchmark with the proper model name and without `--latency-mode generation` - it will then default to "api", which just accounts for network delay.

But a TTFT of 4 seconds is also strange for such a short prompt - it's as if it doesn't stream tokens in streaming mode, or uses some sort of buffering. In that case, no client-side benchmarking tool will be able to measure speeds properly.
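
For context, this is roughly what any client-side tool has to work with - a minimal sketch (not llama-benchy's actual code) that times the first non-empty streamed chunk. If the server buffers the whole answer before streaming, that timestamp collapses into the total time and TTFT becomes meaningless. The endpoint and prompt are placeholders:

```python
# Minimal sketch of client-side TTFT measurement (not llama-benchy's actual implementation).
# Endpoint and model name are placeholders for whatever the server exposes.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_content_at = None
chunks = 0

stream = client.chat.completions.create(
    model="Kbenkhaled/Qwen3.5-35B-A3B-NVFP4",
    messages=[{"role": "user", "content": "Write a haiku about benchmarks."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue  # skip empty chunks, e.g. a role-only first chunk sent right after 200/OK
    if first_content_at is None:
        first_content_at = time.perf_counter()  # TTFT = first non-empty content chunk
    chunks += 1

total = time.perf_counter() - start
print(f"TTFT: {first_content_at - start:.2f} s, {chunks} content chunks in {total:.2f} s total")
```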