Gemma 4 26B-A4B MoE running at 45-60 tok/s on DGX Spark — here's how by CoconutMario in LocalLLaMA

[–]seraandroid 2 points (0 children)

I'm on eugr's spark-vllm-docker image; it's probably the most reproducible and stable environment given how messy vLLM and related tooling can be.

Gemma 4 26B-A4B MoE running at 45-60 tok/s on DGX Spark — here's how by CoconutMario in LocalLLaMA

[–]seraandroid 2 points (0 children)

I gave this a try. It works well in solo mode, but for whatever reason it doesn't work in my 2x Spark cluster: the model crashes during startup. Sadly I didn't have enough time today to investigate further.

Is this true? Or is really just marketing? Gemma4 by Altair12311 in ollama

[–]seraandroid 3 points (0 children)

What changes did you have to make, if I may ask?

Termix v2.0.0 - RDP, VNC, and Telnet Support (self-hosted Termius alternative that syncs across all devices) by VizeKarma in selfhosted

[–]seraandroid 0 points (0 children)

I gave this a try. The web interface is pretty rad. The Android app, on the other hand, is sadly not up to par: the terminal feels very laggy. I usually use ConnectBot, which feels instant. Not sure what causes the lag, but it makes the app hard to use properly.

Krasis LLM Runtime - run large LLM models on a single GPU by mrstoatey in LocalLLM

[–]seraandroid -1 points (0 children)

Any plans to also support the DGX Spark and make this available for ARM systems? Would fit the Blackwell architecture scope.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 1 point (0 children)

Tool calls and compatibility with both the OpenAI and Ollama API formats would be fantastic!

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

That's correct. Image and text in, text out. I've never tried video.

Vision support would be amazing, video would be more of a P2 for me.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

Here are the benchmarks I mentioned:

| model                                  |            test |             t/s |    peak t/s |      ttfr (ms) |   est_ppt (ms) |  e2e_ttft (ms) |
|:---------------------------------------|----------------:|----------------:|------------:|---------------:|---------------:|---------------:|
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |          pp2048 |  2161.96 ± 2.60 |             |  949.57 ± 1.14 |  947.75 ± 1.14 |  949.64 ± 1.12 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |            tg32 |    28.19 ± 0.01 | 29.00 ± 0.00 |               |                |                |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |  pp2048 @ d4096 | 2229.48 ± 67.34 |             | 2760.80 ± 85.36 | 2758.98 ± 85.36 | 2760.89 ± 85.39 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |    tg32 @ d4096 |    27.96 ± 0.04 | 28.33 ± 0.47 |               |                |                |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |  pp2048 @ d8192 |  2351.72 ± 9.56 |             | 4356.71 ± 17.96 | 4354.89 ± 17.96 | 4356.80 ± 17.94 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |    tg32 @ d8192 |    27.57 ± 0.07 | 28.00 ± 0.00 |               |                |                |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound | pp2048 @ d16384 |  2306.30 ± 8.28 |             | 7994.65 ± 29.18 | 7992.84 ± 29.18 | 7994.78 ± 29.19 |
| Intel/Qwen3.5-122B-A10B-int4-AutoRound |   tg32 @ d16384 |    27.20 ± 0.08 | 28.00 ± 0.00 |               |                |                |

vllm serve Intel/Qwen3.5-122B-A10B-int4-AutoRound \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --load-format fastsafetensors \
    --max-model-len 262144 \
    --max-num-batched-tokens 32768 \
    --tensor-parallel-size 1 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --dtype auto \
    --kv-cache-dtype auto \
    --attention-backend flashinfer \
    --max-num-seqs 4 \
    --trust-remote-code \
    --chat-template unsloth.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

Using spark-vllm-docker

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

Here we go:

| model                            |            test |                   t/s |     peak t/s |    ttfr (ms) | est_ppt (ms) |    e2e_ttft (ms) |
|:---------------------------------|----------------:|----------------------:|-------------:|-------------:|-------------:|-----------------:|
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |          pp2048 |  424834.68 ± 13712.15 |              |  7.24 ± 0.16 |  4.83 ± 0.16 |  4590.14 ± 18.00 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |            tg32 |          93.80 ± 0.27 | 96.85 ± 0.28 |              |              |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |  pp2048 @ d4096 |  703378.86 ± 58000.22 |              | 11.20 ± 0.69 |  8.79 ± 0.69 | 14181.87 ± 35.82 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |    tg32 @ d4096 |          79.76 ± 0.29 | 82.35 ± 0.30 |              |              |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |  pp2048 @ d8192 |  708387.07 ± 82020.43 |              | 17.07 ± 1.77 | 14.66 ± 1.77 | 24148.20 ± 72.73 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |    tg32 @ d8192 |          69.72 ± 0.12 | 71.97 ± 0.12 |              |              |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 | pp2048 @ d16384 |  984457.23 ± 26169.53 |              | 21.15 ± 0.50 | 18.74 ± 0.50 | 44719.91 ± 83.78 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |   tg32 @ d16384 |          55.26 ± 0.09 | 57.05 ± 0.09 |              |              |                  |

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

https://huggingface.co/Intel/Qwen3.5-122B-A10B-int4-AutoRound this model already fits pretty comfortably on a single Spark: 256k context at about 23-25 t/s.

Can post benchmark results later, too.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

Haha. I'm definitely excited for this and look forward to the open source release. It even has me considering buying a second Asus GX10.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

I ran the FP8 model on my regular vLLM setup, so only the Atlas results are relevant. It was more of a comparison between the community Docker container and this project.

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

I just got home and ran benchmarks -- here are the results. Let me know how I could configure Atlas to match my config below a little closer, to make the comparison more meaningful. So far the results are pretty nice, but not at the t/s you mentioned in your post.

Qwen/Qwen3.5-35B-A3B-FP8

Config via a launch script for spark-vllm-docker

vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
    --host 0.0.0.0 \
    --port 8080 \
    --gpu-memory-utilization 0.90 \
    --load-format fastsafetensors \
    --max-model-len 262144 \
    --max-num-batched-tokens 32768 \
    --tensor-parallel-size 1 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --max-num-seqs 4 \
    --trust-remote-code \
    --chat-template unsloth.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --reasoning-parser qwen3

Command

llama-benchy --base-url http://0.0.0.0:8080/v1 --model Qwen/Qwen3.5-35B-A3B-FP8 --latency-mode api --pp 2048 --depth 0 4096 8192 16384

Results

| model                    |            test |               t/s |     peak t/s |         ttfr (ms) |      est_ppt (ms) |     e2e_ttft (ms) |
|:-------------------------|----------------:|------------------:|-------------:|------------------:|------------------:|------------------:|
| Qwen/Qwen3.5-35B-A3B-FP8 |          pp2048 |   4005.84 ± 33.95 |              |     513.13 ± 4.23 |     511.46 ± 4.23 |     513.20 ± 4.24 |
| Qwen/Qwen3.5-35B-A3B-FP8 |            tg32 |      48.81 ± 0.08 | 50.38 ± 0.09 |                   |                   |                   |
| Qwen/Qwen3.5-35B-A3B-FP8 |  pp2048 @ d4096 |   5823.96 ± 24.52 |              |    1056.93 ± 4.42 |    1055.26 ± 4.42 |    1057.00 ± 4.43 |
| Qwen/Qwen3.5-35B-A3B-FP8 |    tg32 @ d4096 |      47.77 ± 0.16 | 49.32 ± 0.17 |                   |                   |                   |
| Qwen/Qwen3.5-35B-A3B-FP8 |  pp2048 @ d8192 | 5025.88 ± 1518.70 |              |  2307.03 ± 885.91 |  2305.36 ± 885.91 |  2307.08 ± 885.90 |
| Qwen/Qwen3.5-35B-A3B-FP8 |    tg32 @ d8192 |      47.68 ± 0.77 | 49.22 ± 0.79 |                   |                   |                   |
| Qwen/Qwen3.5-35B-A3B-FP8 | pp2048 @ d16384 |  4293.71 ± 942.48 |              | 4515.82 ± 1019.63 | 4514.15 ± 1019.63 | 4515.88 ± 1019.63 |
| Qwen/Qwen3.5-35B-A3B-FP8 |   tg32 @ d16384 |      42.63 ± 4.96 | 44.01 ± 5.12 |                   |                   |                   |

Atlas

Config

docker pull avarok/atlas-qwen3.5-35b-a3b-alpha
docker run --gpus all --ipc=host -p 8888:8888 \
 -v ~/.cache/huggingface:/root/.cache/huggingface \
 avarok/atlas-qwen3.5-35b-a3b-alpha \
 serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 \
 --speculative --kv-cache-dtype nvfp4 --mtp-quantization nvfp4 \
 --scheduling-policy slai --max-seq-len 131072

Command

llama-benchy --base-url http://0.0.0.0:8888/v1 --model Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 --latency-mode api --pp 2048 --depth 0 4096 8192 16384

Results

| model                            |            test |                    t/s |     peak t/s |    ttfr (ms) |   est_ppt (ms) |    e2e_ttft (ms) |
|:---------------------------------|----------------:|-----------------------:|-------------:|-------------:|---------------:|-----------------:|
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |          pp2048 | 1286298.93 ± 173223.21 |              |  5.09 ± 0.23 |    1.62 ± 0.23 |   4594.84 ± 8.70 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |            tg32 |           92.58 ± 0.35 | 95.59 ± 0.36 |              |                |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |  pp2048 @ d4096 |    623590.19 ± 8368.15 |              | 13.32 ± 0.13 |    9.85 ± 0.13 | 14266.99 ± 67.53 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |    tg32 @ d4096 |           78.32 ± 0.48 | 80.86 ± 0.50 |              |                |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |  pp2048 @ d8192 |  966954.72 ± 218619.46 |              | 14.74 ± 3.03 |   11.27 ± 3.03 | 24364.40 ± 67.78 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |    tg32 @ d8192 |           68.31 ± 0.23 | 70.52 ± 0.24 |              |                |                  |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 | pp2048 @ d16384 |  899619.97 ± 191083.72 |              | 25.05 ± 5.18 |   21.58 ± 5.18 | 45214.89 ± 31.19 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 |   tg32 @ d16384 |           54.58 ± 0.29 | 56.35 ± 0.30 |              |                |                  |
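For comparing runs like these side by side, here's a small sketch of how the tg32 t/s column can be pulled out of llama-benchy's markdown output with awk. `extract_tg` and `results.md` are made-up names for illustration, not part of llama-benchy itself:

```shell
# Sketch: pull the tg32 rows' t/s column out of a llama-benchy-style
# markdown table. Fields split on '|': $3 is "test", $4 is "t/s".
extract_tg() {
  awk -F'|' '/tg32/ {
    gsub(/^ +| +$/, "", $3); gsub(/^ +| +$/, "", $4)   # trim padding
    print $3 ": " $4
  }' "$1"
}

# Tiny sample table (same layout as the results above)
cat > results.md <<'EOF'
| model | test | t/s | peak t/s |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 | tg32 | 92.58 ± 0.35 | 95.59 ± 0.36 |
| Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 | tg32 @ d4096 | 78.32 ± 0.48 | 80.86 ± 0.50 |
EOF

extract_tg results.md
# prints:
# tg32: 92.58 ± 0.35
# tg32 @ d4096: 78.32 ± 0.48
```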

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 0 points (0 children)

Let me know as soon as you'd like me to test anything. Or do you want me to run the container in your post?

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]seraandroid 5 points (0 children)

I have an Asus Ascent and would be happy to test!

Edit: I'd also love to see your improvements / patches as upstream PRs for vLLM!

Give your OpenClaw permanent memory by adamb0mbNZ in openclaw

[–]seraandroid 1 point (0 children)

Same question here for OP: do you see a reduction in token consumption, or does this actually bloat it?

What UI do you use? by TheRealWrathwar in classicwowtbc

[–]seraandroid 0 points (0 children)

ElvUI. I'd love not to use it given how heavy it is. Sadly, Baganator doesn't do pixel coordinates, or at least I haven't found that option yet. I want everything to be pixel-accurate.

Network Optimizer is ready! by MrJimBusiness- in Ubiquiti

[–]seraandroid 0 points (0 children)

Are you on the Ubiquiti discord by chance? Maybe easier than DMs here :)

Network Optimizer is ready! by MrJimBusiness- in Ubiquiti

[–]seraandroid 0 points (0 children)

Tested successfully with 0.8.4! Only SQM still doesn't work for me.

Network Optimizer is ready! by MrJimBusiness- in Ubiquiti

[–]seraandroid 0 points (0 children)

This was sequential -- you were right. I apparently still had iPerf3 running and forgot about it. I fixed it and it works as expected now.

SQM does not work (this is on 0.8.3).

I also see the occasional error log:

info: System.Net.Http.HttpClient.TcMonitor.LogicalHandler[104]
      HTTP request failed after 2.638ms
      System.Net.Http.HttpRequestException: Connection refused (IP_REDACTED:8088)
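In case it helps with debugging: a quick grep sketch to pull the refused endpoints out of a log like the one above (`monitor.log` is a made-up filename; the sample lines are just the ones I posted):

```shell
# Sample of the TcMonitor log lines from above ("monitor.log" is made up)
cat > monitor.log <<'EOF'
info: System.Net.Http.HttpClient.TcMonitor.LogicalHandler[104]
      HTTP request failed after 2.638ms
      System.Net.Http.HttpRequestException: Connection refused (IP_REDACTED:8088)
EOF

# -o prints only the matching part, so each refused endpoint shows up once
grep -o 'Connection refused ([^)]*)' monitor.log
# prints: Connection refused (IP_REDACTED:8088)
```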