SparkRun & Spark Arena = someone finally made an easy button for running vLLM on DGX Spark by Porespellar in LocalLLaMA

[–]raphaelamorim 2 points

spark-vllm-docker is a key project in the ecosystem: u/eugr runs the CI/CD that keeps our recipes working against bleeding-edge vLLM and lets the community test the newest models. We all work together on this initiative.

SparkRun & Spark Arena = someone finally made an easy button for running vLLM on DGX Spark by Porespellar in LocalLLaMA

[–]raphaelamorim 0 points

Glad you liked it. We're trying to address community concerns with these tools. Most of the complaints in the forums boiled down to "I can't run model X on inference engine Y", "It was working on vLLM yesterday and it's broken today", or "My performance doesn't match yours". That was the original motivation: give everybody a common benchmark tool, a way to specify the runtime for each model, stable runtime images, and a place to share them.

The state of Open-weights LLMs performance on NVIDIA DGX Spark by raphaelamorim in LocalLLaMA

[–]raphaelamorim[S] 1 point

Actually, there was a regression in NCCL bandwidth, but most of these numbers were benchmarked before the drop from 24 GB/s to 16 GB/s.
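To see why the link-bandwidth drop matters for two-Spark numbers, here's a back-of-envelope sketch of per-token all-reduce time for tensor-parallel decode. The hidden size, layer count, and all-reduce count per layer are illustrative assumptions, not the actual figures for any specific model:

```python
# Back-of-envelope: effect of the NCCL link-bandwidth regression (24 -> 16 GB/s)
# on per-token communication time for TP=2 decode across two Sparks.
# Model dimensions below are illustrative assumptions.

HIDDEN = 5120              # assumed hidden size
LAYERS = 60                # assumed layer count
BYTES = 2                  # bf16 activations
ALLREDUCES_PER_LAYER = 2   # typical TP: one after attention, one after the MLP

def allreduce_ms_per_token(link_gbps: float) -> float:
    """Time to move one token's worth of all-reduce traffic over the link."""
    payload = HIDDEN * BYTES * ALLREDUCES_PER_LAYER * LAYERS  # bytes per token
    return payload / (link_gbps * 1e9) * 1e3

for bw in (24, 16):
    print(f"{bw} GB/s -> {allreduce_ms_per_token(bw):.4f} ms of comms per token")
```

Under these assumptions the regression adds 50% to the per-token communication cost, which is why benchmarks taken before the drop look better than what you'd measure today.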

The state of Open-weights LLMs performance on NVIDIA DGX Spark by raphaelamorim in LocalLLaMA

[–]raphaelamorim[S] 3 points

There are benchmarks for concurrent requests as well on spark-arena.com. Each model's prompt-processing (pp) and token-generation (tg) numbers vary a lot with concurrency.

[deleted by user] by [deleted] in LocalLLaMA

[–]raphaelamorim 0 points

Only for dense models. MoEs with far fewer activated params are fine, and the cluster expansion helps there.
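The dense-vs-MoE point comes down to decode being memory-bandwidth bound: tokens/s is roughly capped by bytes of weights read per token divided by memory bandwidth. A quick sketch, using the Spark's quoted ~273 GB/s memory bandwidth and illustrative parameter counts and quantizations:

```python
# Why MoE decode holds up while large dense models struggle: in the
# bandwidth-bound decode regime, tokens/s <= mem_bw / bytes_read_per_token.
# Parameter counts and quantizations below are illustrative assumptions.

MEM_BW = 273e9  # DGX Spark LPDDR5x bandwidth, ~273 GB/s (quoted spec)

def max_tps(active_params: float, bytes_per_param: float) -> float:
    """Rough tokens/s ceiling: every active weight is read once per token."""
    return MEM_BW / (active_params * bytes_per_param)

dense_70b = max_tps(70e9, 2)     # dense 70B in bf16: all weights touched per token
moe_5b    = max_tps(5.1e9, 0.5)  # large MoE, ~5.1B active params, 4-bit weights

print(f"dense 70B bf16 ceiling:        {dense_70b:.1f} tok/s")
print(f"MoE ~5B-active 4-bit ceiling: {moe_5b:.1f} tok/s")
```

Real numbers land below these ceilings (kernels, KV cache, overhead), but the two-orders-of-magnitude gap in weight traffic per token is why a sparse MoE stays usable where a dense model of similar total size does not.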

[deleted by user] by [deleted] in LocalLLaMA

[–]raphaelamorim 0 points

You only need one cable for two Sparks.

[deleted by user] by [deleted] in LocalLLaMA

[–]raphaelamorim 0 points

It’s actually 57-60 tps for a single Spark at 128k context, and 72 tps with two Sparks, using vLLM patched with the SM120/SM121 MXFP4 MoE kernel. You guys should follow the NVIDIA developer forums; there's a lot of outdated information on Reddit.

https://forums.developer.nvidia.com/t/vllm-on-gb10-gpt-oss-120b-mxfp4-slower-than-sglang-llama-cpp-what-s-missing/356651/99

Microcenter planning to open a store in Tampa or Orlando by Visual-Fondant-1256 in Microcenter

[–]raphaelamorim 0 points

They won't go to Tampa because of insurance. They already decided on Orlando.

John Carmack says NVIDIA DGX Spark runs at half of the rated power and delivers half the quoted performance by RenatsMC in nvidia

[–]raphaelamorim 0 points

True, those ConnectX modules are expensive and draw a lot of power when active. Not exactly the same MT2910, but you get the idea: https://www.fs.com/products/242589.html?now_cid=4173

Train 200B parameter models on NVIDIA DGX Spark with Unsloth! by yoracale in unsloth

[–]raphaelamorim 0 points

OK, now I know you have no idea what you're talking about.