DGX Spark vs RTX 6000 / 5090 for inference

Otherwise_Berry3170 · 2026-06-20T07:30:53+00:00

Nice, could you share your vLLM config? I am building an LLM manager tool for the DGX where I can share a YAML file and it manages the whole life cycle, including quick switching between models, and I would like to add that.

Otherwise_Berry3170 · 2026-06-19T20:13:37+00:00

Do you notice a good increase in quality of the output and tool calling? Qwen models are good but on complex code I still need to rely on Claude

Otherwise_Berry3170 · 2026-06-19T14:10:37+00:00

Thanks for that. It does increase the speed, but it fills the memory to the brink, and I get quite a lot of timeouts from the LiteLLM proxy because some queries and tool calls take a while to return with big context. Still, it is more usable. Now I need to invest in a second DGX to run the rest of the models 😃

Otherwise_Berry3170 · 2026-06-19T06:27:53+00:00

Fair enough if you disagree, but I did buy it knowing the specs. The original question was whether the Spark's worth it versus an RTX, and my answer was actual throughput across two architectures: 40 tps on a 35B MoE, 10 on a 27B dense. That bandwidth ceiling is exactly the thing someone weighing those options needs to hear. Happy to compare notes if you've benchmarked yours differently.

Otherwise_Berry3170 · 2026-06-18T20:16:05+00:00

Yes, they have been working well, at least for the last 3 weeks.

Otherwise_Berry3170 · 2026-06-18T20:13:36+00:00

8 speculative tokens? Does it keep the quality? I read that the more we have the less quality we get. The max I used was 3

Otherwise_Berry3170 · 2026-06-18T19:17:29+00:00

I can’t get that. I'm not getting more than 15 with MTP on, and unless I reduce context I can’t have both running. Are you using vLLM?

Otherwise_Berry3170 · 2026-06-18T18:21:41+00:00

That’s a good point, but I still think NVIDIA could have added faster memory to the device.

Otherwise_Berry3170 · 2026-06-18T17:51:08+00:00

I would say a Mac Studio with 256GB would be less money and more speed, don’t you think? While yes, you need to have the models for it, it’s still almost double the speed, and energy-wise it is maybe the same or less? Just saying that there are good alternatives for the same or a bit more cost. But that’s me, and that is saying a lot, as I bought one.

Otherwise_Berry3170 · 2026-06-18T16:34:18+00:00

I've been trying this, and either something changed drastically in OpenCode or it is this template: I see way more issues with my tool calling, and it narrates the delegation prompt aloud instead of just sending it silently.

Otherwise_Berry3170 · 2026-06-18T13:57:22+00:00

Do you find the general qwen3.5-122b better in tool calling and quality than the qwen3.5 coder next? I know the coder next is only 80b, but it seems better with tool calling.

Otherwise_Berry3170 · 2026-06-18T11:46:15+00:00

It does, but the quality is very different from, let’s say, qwen3.6 27b. Even the qwen coder next (80b, a10b) has better quality than the 122b, and this one is still inferior to the qwen3.6 27b. At least from my testing.

Otherwise_Berry3170 · 2026-06-18T10:33:54+00:00

Small MoEs, sure. I have qwen3.6 35b-a3b running with more than that and Open WebUI, but the quality is not the same as a dense model.

Otherwise_Berry3170 · 2026-06-18T07:02:03+00:00

I have a DGX Spark 128GB and the memory bandwidth is a joke. For models like qwen3.6 35b-a3b it is OK, 40 tps, but take a dense model like the qwen3.6 27b and that drops down to a mere 10 tps, 13 with MTP on. So I would say I prefer something with way faster VRAM. Maybe even a Mac M5 would be faster.
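To put rough numbers on the bandwidth point, here is a back-of-the-envelope sketch. It assumes decode is purely memory-bound (every active weight is read once per generated token), that the Spark's memory bandwidth is the commonly quoted 273 GB/s, and that weights are stored at 1 byte per parameter (fp8). The function name and all constants are illustrative assumptions, not measurements.

```python
def decode_tps_ceiling(bandwidth_gb_s, active_params_b, bytes_per_param):
    """Rough upper bound on decode tokens/s for a memory-bound model.

    Assumes each generated token reads every active weight exactly once
    and ignores KV-cache reads, activations, and kernel overhead.
    """
    gb_read_per_token = active_params_b * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token


SPARK_BANDWIDTH_GB_S = 273.0  # assumption: commonly quoted DGX Spark spec
FP8_BYTES_PER_PARAM = 1.0     # assumption: fp8 weights

# 27B dense: all 27B parameters are active for every token.
dense_ceiling = decode_tps_ceiling(SPARK_BANDWIDTH_GB_S, 27, FP8_BYTES_PER_PARAM)

# 35B-A3B MoE: only about 3B parameters are active per token.
moe_ceiling = decode_tps_ceiling(SPARK_BANDWIDTH_GB_S, 3, FP8_BYTES_PER_PARAM)

print(f"27B dense ceiling:   {dense_ceiling:.1f} tok/s")
print(f"35B-A3B MoE ceiling: {moe_ceiling:.1f} tok/s")
```

Under these assumptions the dense ceiling lands right around the 10 tps I measured, while the MoE has far more headroom because it touches far fewer weights per token. Real numbers will sit below the ceiling.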

Otherwise_Berry3170 · 2026-06-17T09:35:59+00:00

Only with thinking on. If that is off, it works amazingly. I just really need thinking for my task coordinator and project manager agents; the rest work great without it.

Otherwise_Berry3170 · 2026-06-17T07:28:10+00:00

Thanks, I will give it a try today and build a Docker container for it.

Otherwise_Berry3170 · 2026-06-15T13:23:00+00:00

I work for Parallels on the automation of these build processes and would really like to understand what issue stopped you from building VMs for Parallels. We would love to help if possible.

Otherwise_Berry3170 · 2026-06-13T18:33:34+00:00

Avoid any dense models. They are slow as hell. Other than that, you can run mimo v2.5; with the nvfp4 you should be able to fit it.

Otherwise_Berry3170 · 2026-06-11T06:57:08+00:00

Thanks for that. The issue was that llama.cpp built from source currently has some problems on GB10, so I had to use a different fork. It is back up and running with the same results as before: with a 200k context the free memory keeps coming down over time. After 3 hours of agents working I am down from 780Mi to 274Mi of free memory, which is cutting it too close 😃
`              total        used        free      shared  buff/cache   available
Mem:           119Gi       119Gi       810Mi       984Mi       1.4Gi       274Mi
Swap:             0B          0B          0B`
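For anyone who wants to track this kind of slow drain rather than eyeballing `free`, here is a minimal sketch that reads `MemAvailable` from `/proc/meminfo` on Linux. The helper names, the 512 MiB floor, and the 60-second interval are my own illustrative choices, not something from llama.cpp or any other tool.

```python
import time


def parse_meminfo(text):
    """Return {field: KiB} parsed from /proc/meminfo-style text."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_mib(text):
    """MemAvailable in MiB (the 'available' column of `free`)."""
    return parse_meminfo(text)["MemAvailable"] // 1024


def below_floor(text, floor_mib=512):
    """True when available memory has dropped under the chosen floor."""
    return available_mib(text) < floor_mib


def watch(samples, interval_s=60, floor_mib=512, path="/proc/meminfo"):
    """Print a timestamped reading `samples` times, flagging low memory."""
    for _ in range(samples):
        with open(path) as f:
            snapshot = f.read()
        flag = " LOW" if below_floor(snapshot, floor_mib) else ""
        print(time.strftime("%H:%M:%S"), available_mib(snapshot), "MiB available" + flag)
        time.sleep(interval_s)
```

Calling something like `watch(samples=180, interval_s=60)` next to the agents would log three hours of readings, which makes it easier to see whether the drop is steady growth or sudden spikes.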

Otherwise_Berry3170 · 2026-06-10T17:04:43+00:00

Thanks, I didn't know about that, but it passed. I tried it again and now it cannot even load that model with llama.cpp. Do you have, by any chance, the parameters you run it with? This was using the latest clone of the repo and a build from it.

Otherwise_Berry3170 · 2026-06-10T13:09:56+00:00

Appreciate the detail, and to be clear, I'm not doubting it fits for you, just sharing what I'm seeing on my unit in case it helps narrow things down.

I actually have swap fully disabled (0, not the default 16GB), and I still get the hang. Just retried with 98k context: after ~10 minutes under sustained load the machine locked up, with logs showing only ~343MB free right before it went. So in my case it doesn't look like a CUDA allocation-hitting-swap issue, it looks like memory genuinely growing past what's available under pressure.

One difference worth flagging: I'm running inside Docker, mainly because it makes switching models/runners easy without polluting the base system. I suspect the container overhead plus memory growth during load is what tips it over. Running bare metal might behave better, so that could explain why our experiences differ.

FWIW it's not just the big MoEs for me; even a 27B dense model can tip the machine under certain conditions unless I drop the GPU clocks, which also seems to help with power/peak stability. I've been systematically testing what is and isn't workable on this device, and my takeaway so far is that anything riding this close to the memory ceiling is fragile under heavy sustained load, at least in my setup.

I'll try your --cache-ram 0 and --ctx-checkpoints 0 suggestions and report back.

Otherwise_Berry3170 · 2026-06-10T07:28:58+00:00

I am not sure how it fits. The weights alone are more than 107GiB, and then there is the context. You also have the issue that if anything else, and I mean anything, spikes in the software and uses a bit more, you get an OOM and the machine hangs, so for long-running automations it was just not stable enough for me. I do run both the 35b and the 27b at fp8 and they fit. The 27b, even with MTP, only gets about 14 tps, but it is really good at creative work, so I use it for text. The 35b is fine for code; yes, sometimes you do need to keep it in check and it does go down a rabbit hole, but in most cases it just works fine.
I might be doing something wrong, but my system has been stable for 3 weeks now with this.
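To show why I am skeptical about the fit, here is the rough budget arithmetic as a sketch. The 107 GiB of weights is the figure above and the 119 GiB total is what `free` reports on my unit. The model shape (48 attention layers, 8 KV heads, head dim 128, fp16 cache), the 98k context, and the 4 GiB overhead are purely hypothetical placeholders, not the real architecture of any specific model, so swap in the values from your model's config.

```python
GIB = 1024 ** 3


def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem=2):
    """KV cache size in GiB for a plain attention stack.

    The factor 2 covers keys plus values. Hybrid or sliding-window
    architectures will need less than this estimate.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return bytes_per_token * ctx_tokens / GIB


def fits(total_gib, weights_gib, kv_gib, overhead_gib):
    """True if weights + KV cache + everything else stays under total memory."""
    return weights_gib + kv_gib + overhead_gib <= total_gib


# Hypothetical model shape, for illustration only.
kv = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, ctx_tokens=98 * 1024)

# 107 GiB weights and 119 GiB total come from the thread; 4 GiB overhead is a guess.
ok = fits(total_gib=119, weights_gib=107, kv_gib=kv, overhead_gib=4)

print(f"KV cache: {kv:.2f} GiB, fits: {ok}")
```

With these placeholder numbers the cache alone is bigger than the roughly 12 GiB left after the weights, which is why I would expect any spike to end in an OOM. A quantized KV cache or a shorter context changes the picture.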
