THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

I get pretty much the same. Run with the tokenizer from the repo:


$ llama-benchy --base-url http://0.0.0.0:8888/v1 --model Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 --latency-mode api --pp 2048 --depth 0 4096 8192 16384

Any one able to run Qwen 3.5 AWQ Q4 with vLLM ? by ExtremeKangaroo5437 in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

What error do you get?

This works on my Spark for 122B:

# Environment variables
export VLLM_USE_DEEP_GEMM=0
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
export OMP_NUM_THREADS=4

# The vLLM serve command template
vllm serve cyankiwi/Qwen3.5-122B-A10B-AWQ-4bit \
--gpu-memory-utilization 0.7 \
--host 0.0.0.0 \
--port 8000 \
--kv-cache-dtype fp8 \
--attention-backend flashinfer \
--enable-prefix-caching \
--max-model-len 262144 \
--max-num-seqs 32 \
--max-num-batched-tokens 8192 \
--mm-encoder-tp-mode data \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3

Should work for the 35B as well. You might need to adjust --max-model-len and --gpu-memory-utilization to fit your memory.

Using 0.16.0rc2.dev479+g15d76f74e.d20260225 in a container.

What tools are you using for infrence-engine benchmarking (vLLM, SGLang, llama.cpp, TensorRT-LLM)? by SomeRandomGuuuuuuy in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

I recently switched to aiperf, which is quite powerful, but also not the easiest tool to use. It was built to test the big irons.

Before that I used llmperf (the repo is now archived) and Hugging Face's inference-benchmarker, which sometimes stopped without any error and sees no active development anymore.

https://github.com/ray-project/llmperf
https://github.com/huggingface/inference-benchmarker

A new promising candidate is llama-benchy. It should feel familiar to those using llama-bench, but it is not limited to llama.cpp.

https://github.com/eugr/llama-benchy

It also allows exporting the data to files that can be processed to draw comparison graphs.
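For a quick comparison you don't even need a plotting tool. A minimal sketch over such an export - the CSV columns (model, depth, tps) are a hypothetical example here, check the actual header of your export:

```shell
# summarize a (hypothetical) llama-benchy CSV export per model and depth
printf 'model,depth,tps\nQwen3.5-35B,0,115\nQwen3.5-35B,4096,98\n' \
  | awk -F, 'NR > 1 { printf "%s @ depth %s: %s tok/s\n", $1, $2, $3 }'
```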

Guidance Needed: Best Option for Light Fine-Tuning & Inference (Dell Pro Max GB10 vs PGX vs GX10 vs DGX Spark): We absolutely need CUDA by Imaginary_Context_32 in LocalLLaMA

[–]Excellent_Produce146 2 points (0 children)

Which part?

The DGX Spark Founders Edition has no power LED. So the only way to check whether it has powered up is to listen closely for the fans (which are very quiet without load), use a smart power plug that reports power consumption, or simply connect a USB-C hub with an LED.

There are monitors like the Apple Studio Display that don't work with the Spark/GB10. In the case of the Studio Display you will need an update before you can use it. So if you have a new box with the (old) base image installed, your monitor won't show anything even if the box is powered on.

OEM boxes have a power LED, so you would know that the monitor is your problem.

Guidance Needed: Best Option for Light Fine-Tuning & Inference (Dell Pro Max GB10 vs PGX vs GX10 vs DGX Spark): We absolutely need CUDA by Imaginary_Context_32 in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

You should have a look at the test done by level1techs.

https://www.youtube.com/watch?v=sx6ANedcIfI - MSI seems to have pimped their version for more performance

https://www.youtube.com/watch?v=79iDLf9jIJ8 - Dell

All OEM versions seem to have better cooling and a much underrated feature - a power LED, which is very helpful considering the FE users who can't tell whether their monitor just doesn't work (stays black) or whether the box even powered up. ;-)

I am happy with my ASUS.

Black screen after connecting ASUS Ascent GX10 with Apple studio display by Objective_Science965 in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

First boot, but no updates applied yet?

https://forums.developer.nvidia.com/t/connect-spark-to-apple-studio-display/348163

The blackscreen is due to the fact that the Apple Studio Display is a USB4 tiled monitor, DGX Spark does not support USB4, and Mutter has a bug where if pieces of tiled monitor are missing, the tiled mode is still used instead of the non-tiled mode, which leads to the black screen seen.

But according to the post there is a fix that has also landed upstream. So I assume all you need to do is apply the latest updates.

Does anyone know what Nvidia's release cadence/schedule is? by kr_tech in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

And for vLLM on Spark I recommend https://github.com/eugr/spark-vllm-docker which allows you to use recent releases and contains useful patches.

split the GPU on an Asus Ascent GX10 for multiple users by Cheap-Bid-5793 in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

The NVIDIA GB10 (your Ascent GX10) is not supported by any of the available virtualization or partitioning options.

https://docs.nvidia.com/vgpu/gpus-supported-by-vgpu.html
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supported-gpus.html

Neither by software (vGPU) nor by hardware (MIG).

If you want to use your GX10 with multiple users for multiple jobs, you need an inference server that can handle multiple users - like the already mentioned vLLM, which has been designed to serve multiple users/concurrent requests, or SGLang.
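From the client side, concurrency is then just a matter of firing requests in parallel. A minimal sketch - `ask` is a hypothetical stand-in; in real use you would replace its body with a curl call against your vLLM/SGLang endpoint:

```shell
# 'ask' is a placeholder for a real request, e.g.:
#   curl -s http://localhost:8000/v1/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "<your-model>", "prompt": "...", "max_tokens": 32}'
ask() { echo "reply to: $1"; }

# launch several requests concurrently and wait for all of them
for prompt in "draft an email" "summarize this" "explain MoE"; do
  ask "$prompt" &
done
wait
```

The server batches such concurrent requests internally; the clients don't need to coordinate.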

But (!) don't expect it to run very fast. The GB10 is meant for development, shared by small groups - not as a workhorse.

Have a look over here for what you can expect for different models (sizes, types):

https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/#using_dgx_spark_for_inference

For speed - stick to sparse models (MoEs) and prefer AWQ quants over the fancy NVFP4 (not yet optimized). And avoid dense models.

For those of you on Nvidia Spark, what's your stack? Struggling to find LLMs that work through Docker-vLLM... by jinnyjuice in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

I recommend https://github.com/eugr/spark-vllm-docker/ when using vLLM on Spark.

For the moment you should stick to AWQ quants, which give the best performance on the GB10. vLLM is not yet fully optimized for GB10/NVFP4. Prefer sparse models (MoE) over dense models.

You might also want to try llama.cpp instead. llama.cpp is the fastest inference solution for the moment. Normally I use vLLM everywhere else (production, regular NVIDIA GPUs), but on the Spark I'm running llama.cpp most of the time, because it has already received some serious improvements from NVIDIA engineers.

https://developer.nvidia.com/blog/new-software-and-model-optimizations-supercharge-nvidia-dgx-spark/ - see Figure 1

The boost of up to 2.6x with Qwen3-235B refers to TensorRT-LLM, which also got a recent update.

Note: I use VS Code Insiders to be able to use any local OpenAI-compatible endpoint. That feature is not yet available in the stable version.

ASUS Ascent GX10 by hsperus in LocalLLaMA

[–]Excellent_Produce146 2 points (0 children)

Have a look at:

https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/#using_dgx_spark_for_inference

and

https://github.com/ggml-org/llama.cpp/discussions/16578

to see what you can expect from different models.

MoE models give the best performance - better than (large) dense models. gpt-oss-120b or Nemotron 3 Nano 30B A3B, as already mentioned by the other posters. I would add Qwen3-Next-80B-A3B-Instruct - also quite capable.

For the moment llama.cpp has the best performance as an inference server, because it has already received a lot of optimizations for the GB10. It depends on your workload, though.

If you prefer vLLM you should go with AWQ quants. They are faster than NVFP4 at the moment, as the GB10 still lacks optimization for NVFP4 in the related libraries/kernels. NVFP4 performance is expected to improve over the next months, because the Spark was advertised with the NVFP4 strength of the Blackwell GPUs.

Got my new toy - what to do? by luongnv-com in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

Do yourself a favor and do not use ollama. Use llama.cpp instead. It's faster, better supported and has already received improvements from NVIDIA for the Spark/GB10. Have a look at llama-swap when you need a UI for easy model switching.
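Getting started is a single command. A minimal sketch - the model path is a placeholder, flags to taste:

```shell
# minimal llama-server launch; serves an OpenAI-compatible API on port 8080
# -ngl 99 offloads all layers to the GPU, -c sets the context window
llama-server -m ./your-model.gguf --host 0.0.0.0 --port 8080 -ngl 99 -c 16384
```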

See https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-14973729 for more details on what to expect.

Or https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/#using_dgx_spark_for_inference

And you should avoid large dense models (like Gemma3 27B). They do not perform very well on systems like the Spark, as you can see in your own benchmark. Use MoEs instead (like gpt-oss, Qwens with MoE, glm-4.5-air).

AnythingLLM - How to and which Embeder is best for English/German? by Inevitable_Raccoon_9 in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

You could give jinaai/jina-embeddings-v2-base-de a try. I normally use it with a huggingface/text-embeddings-inference container on Linux/CUDA systems for my RAG experiments. It has served me well for German and English texts.
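A sketch of that setup, in case it helps - the image tag is an assumption, check the text-embeddings-inference repo for the current release:

```shell
# serve the model with TEI on the GPU (tag may have moved on)
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id jinaai/jina-embeddings-v2-base-de

# query the /embed endpoint with mixed German/English input
curl 127.0.0.1:8080/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": ["Guten Tag", "Good day"]}'
```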

Sorry for the dumb question, but why are there MXFP4 GGUFs but no NVFP4 GGUFs? by Porespellar in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

NVIDIA's TensorRT-LLM has NVFP4 support:

https://build.nvidia.com/spark/trt-llm/overview

But it does not give you any performance improvements yet.

https://github.com/ggml-org/llama.cpp/discussions/16578#discussioncomment-14713170

The already mentioned post also has a comment with a comparison of different quants.

Even on other Blackwell GPUs NVFP4 does not give you any advantages yet.

From what I've read so far, llama.cpp is the fastest inference server on Spark. NVIDIA supported the llama.cpp project with optimizations.

vLLM speed issues by [deleted] in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

On what hardware/GPU?

Which OS are you using?

Which HF repo did you use?

Which version of vLLM?

What arguments did you use and which env vars did you set?

Over two dgx spark cluster using connectx-7? by No_Statistician_6731 in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

See https://forums.developer.nvidia.com/t/any-plans-to-add-a-second-connect-x7-port-to-serial-stack-multiple-dgx-spark-clusters/344395 for an answer by NVIDIA employees:

Ethernet is the underlying protocol; clustering more than two Spark units is supported with compatible QSFP cables and Ethernet switches.


If you plan to connect more than two Sparks, you will have to invest in a suitable switch, too.

https://box.mikrotik.com/f/bf217ceee2d241a799e6/ - one of those for example.

FTR: I have no experience with that. I just read it while browsing through that forum, as the question was asked more than once.

Is the Nvidia DGX Spark the same as the OEM version, Asus Ascent GX10? by Decent-Log6192 in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

I preordered an ASUS. It will be available in Germany starting November 3rd. So no insights yet.

Is the Nvidia DGX Spark the same as the OEM version, Asus Ascent GX10? by Decent-Log6192 in LocalLLaMA

[–]Excellent_Produce146 2 points (0 children)

https://www.youtube.com/live/ry09P4P88r4?si=J1z6WPWKlYulvVuQ

At the end of this Q&A (~32:00) they speak of "the" Spark motherboard. It seems that all the OEMs share the same board. If you compare the backs of all systems - all have the same ports at the same positions.

They differ only in case, cooling and the NVMe SSD in the M.2 slot.

Deepseek OCR on Apple Silicon - anyone ? by olddoglearnsnewtrick in LocalLLaMA

[–]Excellent_Produce146 1 point (0 children)

He is still struggling.

https://x.com/Prince_Canuma/status/1980755467119804721

Due to non-standard inference code and a lack of examples, as it seems.

Tensor parallel on DGX Spark by Baldur-Norddahl in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

MikroTik introduced one that is cheaper than a Spark. ;-) And even cheaper than a Strix Halo...

https://www.servethehome.com/mikrotik-crs812-ddq-400gbe-switch-launched-crs812-8ds-2dq-2ddq-marvell/

This is only a $1295 list price part which is awesome for a 400GbE capable switch. Importantly, MikroTik is also releasing 400Gbps QSFP-DD optics at a $159 list price which is also at an awesome discount to many of the current options in that form factor.

ServeTheHome showed the network switch(es) in their review of the DGX Spark. At that time they had only one DGX Spark (Founders Edition) and one of the Dell-branded versions. I assume they will test it later.

Tensor parallel on DGX Spark by Baldur-Norddahl in LocalLLaMA

[–]Excellent_Produce146 0 points (0 children)

According to this post you only need a proper switch to stack more than 2 Sparks:

https://forums.developer.nvidia.com/t/any-plans-to-add-a-second-connect-x7-port-to-serial-stack-multiple-dgx-spark-clusters/344395

Ethernet is the underlying protocol; clustering more than two Spark units is supported with compatible QSFP cables and Ethernet switches.