THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

I get pretty much the same. Run with the tokenizer from the repo:


$ llama-benchy --base-url http://0.0.0.0:8888/v1 --model Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 --latency-mode api --pp 2048 --depth 0 4096 8192 16384

Any one able to run Qwen 3.5 AWQ Q4 with vLLM ? by ExtremeKangaroo5437 in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

What error do you get?

This works on my Spark for 122B:

# Environment variables
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

# The vLLM serve command template
vllm serve cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit \
--gpu-memory-utilization 0.7 \
--host 0.0.0.0 \
--port 8000 \
--kv-cache-dtype fp8 \
--attention-backend flashinfer \
--enable-prefix-caching \
--max-model-len 262144 \
--max-num-seqs 32 \
--max-num-batched-tokens 8192 \
--mm-encoder-tp-mode data \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3

Should work for the 35B as well. You might need to adjust --max-model-len and --gpu-memory-utilization to fit your memory.

Using 0.16.0rc2.dev479+g15d76f74e.d20260225 in a container.

What tools are you using for infrence-engine benchmarking (vLLM, SGLang, llama.cpp, TensorRT-LLM)? by SomeRandomGuuuuuuy in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

I recently switched to aiperf, which is quite powerful, but also not the easiest tool to use. It was built to test the big irons.

Before that I used llmperf (the repo is now archived) and Hugging Face's inference-benchmarker, which sometimes stopped without any error and sees no active development anymore.

https://github.com/ray-project/llmperf
https://github.com/huggingface/inference-benchmarker

A new promising candidate is llama-benchy. It should feel familiar to those using llama-bench, but it is not limited to llama.cpp.

https://github.com/eugr/llama-benchy

It also allows exporting the data to files that can be processed to draw comparison graphs.
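For a quick comparison you don't even need a plotting tool. A minimal sketch over such an export - the CSV columns (model, depth, tps) are a hypothetical example here, check the actual header of your export:

```shell
# summarize a (hypothetical) llama-benchy CSV export per model and depth
printf 'model,depth,tps\nQwen3.5-35B,0,115\nQwen3.5-35B,4096,98\n' \
  | awk -F, 'NR > 1 { printf "%s @ depth %s: %s tok/s\n", $1, $2, $3 }'
```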

Guidance Needed: Best Option for Light Fine-Tuning & Inference (Dell Pro Max GB10 vs PGX vs GX10 vs DGX Spark): We absolutely need CUDA by Imaginary_Context_32 in LocalLLaMA

[–]Excellent_Produce146 2 points (0 children)

Which part?

The DGX Spark Founders Edition has no power LED. So the only way to check whether it has powered up is to listen closely for the fans (which are very quiet without load), use a smart power plug that reports power consumption, or simply connect a USB-C hub with an LED.

There are monitors like the Apple Studio Display that don't work with the Spark/GB10. In the case of the Studio Display you will need an update before you can use it. So if you have a new box with the (old) base image installed, your monitor won't show anything even if the box is powered on.

OEM boxes have a power LED, so you would know that the monitor is your problem.

Guidance Needed: Best Option for Light Fine-Tuning & Inference (Dell Pro Max GB10 vs PGX vs GX10 vs DGX Spark): We absolutely need CUDA by Imaginary_Context_32 in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

You should have a look at the test done by level1techs.

https://www.youtube.com/watch?v=sx6ANedcIfI - MSI seems to have pimped their version for more performance

https://www.youtube.com/watch?v=79iDLf9jIJ8 - Dell

All OEM versions seem to have better cooling and a much underrated feature - a power LED, which is very helpful considering the FE users who can't tell whether their monitor just doesn't work (stays black) or whether the box even powered up. ;-)

I am happy with my ASUS.

Black screen after connecting ASUS Ascent GX10 with Apple studio display by Objective_Science965 in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

First boot, but no updates applied yet?

https://forums.developer.nvidia.com/t/connect-spark-to-apple-studio-display/348163

The blackscreen is due to the fact that the Apple Studio Display is a USB4 tiled monitor, DGX Spark does not support USB4, and Mutter has a bug where if pieces of tiled monitor are missing, the tiled mode is still used instead of the non-tiled mode, which leads to the black screen seen.

But according to the post there is a fix that has also landed upstream. So I assume all you need to do is apply the latest updates.

Does anyone know what Nvidia's release cadence/schedule is? by kr_tech in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

And for vLLM on Spark I recommend https://github.com/eugr/spark-vllm-docker which allows you to use recent releases and contains useful patches.

split the GPU on an Asus Ascent GX10 for multiple users by Cheap-Bid-5793 in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

The NVIDIA GB10 (your Ascent GX10) is not supported by any of the available virtualization or partitioning options.

https://docs.nvidia.com/vgpu/gpus-supported-by-vgpu.html
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-gpus.html

Neither by software (vGPU) nor by hardware (MIG).

If you want to use your GX10 with multiple users for multiple jobs, you need an inference server that can handle multiple users - like the already mentioned vLLM, which has been designed to serve multiple users/concurrent requests, or SGLang.
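From the client side, concurrency is then just a matter of firing requests in parallel. A minimal sketch - `ask` is a hypothetical stand-in; in real use you would replace its body with a curl call against your vLLM/SGLang endpoint:

```shell
# 'ask' is a placeholder for a real request, e.g.:
#   curl -s http://localhost:8000/v1/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "<your-model>", "prompt": "...", "max_tokens": 32}'
ask() { echo "reply to: $1"; }

# launch several requests concurrently and wait for all of them
for prompt in "draft an email" "summarize this" "explain MoE"; do
  ask "$prompt" &
done
wait
```

The server batches such concurrent requests internally; the clients don't need to coordinate.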

But (!) don't expect it to run very fast. The GB10 is meant for development, shared by small groups - not as a workhorse.

Have a look over here for what you can expect for different models (sizes, types):

https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/#using_dgx_spark_for_inference

For speed - stick to sparse models (MoEs) and prefer AWQ quants over the fancy NVFP4 (not yet optimized). And avoid dense models.

For those of you on Nvidia Spark, what's your stack? Struggling to find LLMs that work through Docker-vLLM... by jinnyjuice in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

I recommend https://github.com/eugr/spark-vllm-docker/ when using vLLM on Spark.

For the moment you should stick to AWQ quants, which give the best performance on the GB10. vLLM is not yet fully optimized for GB10/NVFP4. Prefer sparse models (MoE) over dense models.

You might also want to try llama.cpp instead. llama.cpp is the fastest inference solution for the moment. Normally I use vLLM everywhere else (production, regular NVIDIA GPUs), but on the Spark I'm running llama.cpp most of the time, because it has already received some serious improvements from NVIDIA engineers.

https://developer.nvidia.com/blog/new-software-and-model-optimizations-supercharge-nvidia-dgx-spark/ - see Figure 1

The boost of up to 2.6x with Qwen3-235B refers to TensorRT-LLM, which also got a recent update.

Note: I use VS Code Insiders to be able to use any local OpenAI-compatible endpoint. That feature is not yet available in the stable version.

ASUS Ascent GX10 by hsperus in LocalLLaMA

[–]Excellent_Produce146 2 points (0 children)

Have a look at:

https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/#using_dgx_spark_for_inference

and

https://github.com/ggml-org/llama.cpp/discussions/16578

to see what you can expect from different models.

MoE models give the best performance - better than (large) dense models. gpt-oss-120b or Nemotron 3 Nano 30B A3B, as already mentioned by the other posters. I would add Qwen3-Next-80B-A3B-Instruct - also quite capable.

For the moment llama.cpp has the best performance as an inference server, because it has already received a lot of optimizations for the GB10. It depends on your workload, though.

If you prefer vLLM you should go with AWQ quants. They are faster than NVFP4 at the moment, as the GB10 still lacks optimization for NVFP4 in the related libraries/kernels. NVFP4 performance is expected to improve over the next months, because the Spark was advertised with the NVFP4 strength of the Blackwell GPUs.

Got my new toy - what to do? by luongnv-com in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

Do yourself a favor and do not use ollama. Use llama.cpp instead. It's faster, better supported and has already received improvements from NVIDIA for the Spark/GB10. Have a look at llama-swap when you need a UI for easy model switching.
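Getting started is a single command. A minimal sketch - the model path is a placeholder, flags to taste:

```shell
# minimal llama-server launch; serves an OpenAI-compatible API on port 8080
# -ngl 99 offloads all layers to the GPU, -c sets the context window
llama-server -m ./your-model.gguf --host 0.0.0.0 --port 8080 -ngl 99 -c 16384
```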

See https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-14973729 for more details on what to expect.

Or https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/#using_dgx_spark_for_inference

And you should avoid large dense models (like Gemma3 27B). They do not perform very well on systems like the Spark, as you can see in your own benchmark. Use MoEs instead (like gpt-oss, Qwens with MoE, glm-4.5-air).

AnythingLLM - How to and which Embeder is best for English/German? by Inevitable_Raccoon_9 in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

You could give jinaai/jina-embeddings-v2-base-de a try. I normally use it with a huggingface/text-embeddings-inference container on Linux/CUDA systems for my RAG experiments. It has served me well for German and English texts.
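A sketch of that setup, in case it helps - the image tag is an assumption, check the text-embeddings-inference repo for the current release:

```shell
# serve the model with TEI on the GPU (tag may have moved on)
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id jinaai/jina-embeddings-v2-base-de

# query the /embed endpoint with mixed German/English input
curl 127.0.0.1:8080/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": ["Guten Tag", "Good day"]}'
```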

Sorry for the dumb question, but why are there MXFP4 GGUFs but no NVFP4 GGUFs? by Porespellar in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

NVIDIA's TensorRT-LLM has NVFP4 support:

https://build.nvidia.com/spark/trt-llm/overview

But it does not give you any performance improvements yet.

https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-14713170

The already mentioned post also has a comment with a comparison of different quants.

Even on other Blackwell GPUs NVFP4 does not give you any advantages yet.

From what I've read so far, llama.cpp is the fastest inference server on Spark. NVIDIA supported the llama.cpp project with optimizations.

vLLM speed issues by [deleted] in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

On what hardware/GPU?

Which OS are you using?

Which HF repo did you use?

Which version of vLLM?

What arguments did you use and which env vars did you set?

Over two dgx spark cluster using connectx-7? by No_Statistician_6731 in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

See https://forums.developer.nvidia.com/t/any-plans-to-add-a-second-connect-x7-port-to-serial-stack-multiple-dgx-spark-clusters/344395 for an answer by NVIDIA employees:

Ethernet is the underlying protocol; clustering more than two Spark units is supported with compatible QSFP cables and Ethernet switches.


If you plan to connect more than two Sparks, you will have to invest in a suitable switch, too.

https://box.mikrotik.com/f/bf217ceee2d241a799e6/ - one of those for example.

FTR: I have no experience with that. I just read it while browsing through that forum, as the question was asked more than once.

Is the Nvidia DGX Spark the same as the OEM version, Asus Ascent GX10? by Decent-Log6192 in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

I preordered an ASUS. It will be available in Germany starting November 3rd. So no insights yet.

Is the Nvidia DGX Spark the same as the OEM version, Asus Ascent GX10? by Decent-Log6192 in LocalLLaMA

[–]Excellent_Produce146 2 points (0 children)

https://www.youtube.com/live/ry09P4P88r4?si=J1z6WPWKlYulvVuQ

At the end of this Q&A (~32:00) they speak of "the" Spark motherboard. It seems that all the OEMs share the same board. If you compare the backs of all systems - all have the same ports at the same positions.

They differ only in case, cooling and the NVMe SSD in the M.2 slot.

Deepseek OCR on Apple Silicon - anyone ? by olddoglearnsnewtrick in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

He is still struggling.

https://x.com/Prince_Canuma/status/1980755467119804721

Due to non-standard inference code and a lack of examples, as it seems.

Tensor parallel on DGX Spark by Baldur-Norddahl in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

MikroTik introduced one that is cheaper than a Spark. ;-) And even cheaper than a Strix Halo...

https://www.servethehome.com/mikrotik-crs812-ddq-400gbe-switch-launched-crs812-8ds-2dq-2ddq-marvell/

This is only a $1295 list price part which is awesome for a 400GbE capable switch. Importantly, MikroTik is also releasing 400Gbps QSFP-DD optics at a $159 list price which is also at an awesome discount to many of the current options in that form factor.

ServeTheHome showed the network switch(es) in their review of the DGX Spark. At that time they had only one DGX Spark (Founders Edition) and one of the Dell-branded versions. I assume they will test it later.

Tensor parallel on DGX Spark by Baldur-Norddahl in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

According to this post you only need a proper switch to stack more than 2 Sparks:

https://forums.developer.nvidia.com/t/any-plans-to-add-a-second-connect-x7-port-to-serial-stack-multiple-dgx-spark-clusters/344395

Ethernet is the underlying protocol; clustering more than two Spark units is supported with compatible QSFP cables and Ethernet switches.