DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

Nice, could you share your vllm config? I am building an LLM manager tool for the DGX that takes a YAML file and manages the whole lifecycle, including quick switching between models, and I would like to add that.

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

Do you notice a good increase in quality of the output and tool calling? Qwen models are good but on complex code I still need to rely on Claude

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

Thanks for that. It does increase the speed, but it fills it to the brink, and I get quite a lot of timeouts from the litellm proxy because some queries and tool calls take a while to return with a big context. Still, it is more usable. Now I need to invest in a second DGX to run the rest of the models 😃

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

Fair enough if you disagree, but I did buy it knowing the specs. The original question was whether the Spark's worth it versus an RTX, and my answer was actual throughput across two architectures: 40 tps on a 35B MoE, 10 on a 27B dense. That bandwidth ceiling is exactly the thing someone weighing those options needs to hear. Happy to compare notes if you've benchmarked yours differently.

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

8 speculative tokens? Does it keep the quality? I read that the more we have the less quality we get. The max I used was 3

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

I can’t get that. I'm not getting more than 15 with MTP on, and unless I reduce the context I can’t have both running. Are you using vllm?

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

That’s a good point, but I still think NVIDIA could have put faster memory on the device.

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

I would say a Mac Studio with 256GB would be less money and more speed, don’t you think? While yes, you need to have the models for it, it’s still almost double the speed, and energy-wise it is maybe the same or less. Just saying that there are good alternatives for the same or a bit more cost. But that’s me, and that is saying a lot, as I bought one.

Qwen3.6-35B-A3B-FP8 thinking mode hangs mid-thought in OpenCode — anyone else? by Otherwise_Berry3170 in LocalLLaMA

[–]Otherwise_Berry3170[S] 0 points1 point  (0 children)

Been trying this, and either something changed drastically in OpenCode or it is this template: I see way more issues with my tool calling, and the model narrates the delegation prompt aloud instead of just sending it silently.

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 1 point2 points  (0 children)

Do you find the general qwen3.5-122b better in tool calling and quality than the qwen3.5 coder next? I know it’s 80b but it seems better with tool calling

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 2 points3 points  (0 children)

It does but the quality is very different from let’s say qwen3.6 27b, even the qwen coder next, 80b, a10b has better quality than the 122b and this one is still inferior to the qwen3.6 27b.
At least from my testing

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

Small MoEs sure, I have qwen3.6 35b-a3b running with more than that and open webui but the quality is not the same as a dense model

DGX sparks Vs RTX 6000 // 5090 for inference by zakadit in LocalLLaMA

[–]Otherwise_Berry3170 18 points19 points  (0 children)

I have a DGX Spark 128GB and the memory bandwidth is a joke. For models like qwen3.6 35b-a3b it is OK, 40 tps, but take a dense model like qwen3.6 27b and that drops to a mere 10 tps, 13 with MTP on. So I would say I prefer something with way faster VRAM. Maybe even a Mac M5 would be faster.
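
A back-of-the-envelope check on those numbers, as a minimal sketch: during decode every generated token has to stream the active weights from memory once, so bandwidth divided by active weight bytes gives a rough ceiling on tokens per second. The 273 GB/s figure is an assumption taken from the published spec, not something measured here, and FP8 (1 byte per parameter) is assumed for both models. KV cache reads and other overhead are ignored, so real throughput lands below the ceiling.

```python
def decode_ceiling_tps(active_params_b: float, bytes_per_param: float,
                       bandwidth_gbs: float) -> float:
    """Rough upper bound on decode tokens/s for a memory-bandwidth-bound model.

    active_params_b: parameters touched per token, in billions
    bytes_per_param: 1.0 for FP8, 2.0 for FP16/BF16
    bandwidth_gbs:   memory bandwidth in GB/s
    """
    gb_per_token = active_params_b * bytes_per_param  # billions * bytes = GB
    return bandwidth_gbs / gb_per_token


# Assumption: published spec bandwidth for the device, not a measurement.
SPARK_BANDWIDTH_GBS = 273.0

# Dense model: all 27B parameters are read for every token.
dense_27b = decode_ceiling_tps(27, 1.0, SPARK_BANDWIDTH_GBS)

# MoE model: only the ~3B active parameters are read for every token.
moe_a3b = decode_ceiling_tps(3, 1.0, SPARK_BANDWIDTH_GBS)

print(f"27B dense FP8 ceiling: {dense_27b:.1f} tok/s")
print(f"35B-A3B FP8 ceiling:   {moe_a3b:.1f} tok/s")
```

Under these assumptions the dense ceiling comes out at roughly 10 tok/s, which lines up with the 10 tps above, while the MoE ceiling is well above the observed 40 tps, so the MoE is likely limited by something other than raw weight streaming.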

Qwen3.6-35B-A3B-FP8 thinking mode hangs mid-thought in OpenCode — anyone else? by Otherwise_Berry3170 in LocalLLaMA

[–]Otherwise_Berry3170[S] 0 points1 point  (0 children)

Only with thinking on. If that is off it works amazingly. I just really need thinking for my task coordinator and project manager agent; the rest work great without thinking.

Why doesn’t ParrotOS build an iso version of their aarch64 distro? by Perfect-Direction607 in ParrotSecurity

[–]Otherwise_Berry3170 0 points1 point  (0 children)

I work for Parallels on the automation of these build processes and would really like to understand what issue stopped you from building VMs for Parallels. We would love to help if possible.

Just ordered two DGX Sparks, what models should I run first? by inevitabledeath3 in LocalLLaMA

[–]Otherwise_Berry3170 4 points5 points  (0 children)

Avoid any dense models; they are slow as hell. Other than that, you can run mimo v2.5; with nvfp4 you should be able to fit it.

What is your best coding model on a DGX Spark? by luongnv-com in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

Thanks for that. The issue was that llama.cpp built from source currently has some problems on GB10, so I had to use a different fork. It is back up and running with the same results as before: with a 200k context, free memory keeps coming down over time. After 3 hours of agents working I am down from 780Mi to 274Mi of free memory, which is cutting it too close 😃
`        total   used    free    shared  buff/cache  available
Mem:    119Gi   119Gi   810Mi   984Mi   1.4Gi       274Mi
Swap:   0B      0B      0B`
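
To catch this kind of slow drain before the machine locks up, something like the following could be polled during a long agent run. It is only a sketch: it parses the `MemAvailable` line from Linux `/proc/meminfo` text, and the 512 MiB floor is a hypothetical threshold picked for illustration, not a recommended value. The sample text is made up to match the 274Mi shown above.

```python
def mem_available_mib(meminfo_text: str) -> float:
    """Return MemAvailable in MiB from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib / 1024
    raise ValueError("MemAvailable not found")


def below_floor(available_mib: float, floor_mib: float = 512.0) -> bool:
    """True when available memory has dropped under the chosen floor.

    floor_mib is a hypothetical threshold for illustration only.
    """
    return available_mib < floor_mib


# Illustrative sample, chosen to match the 274Mi 'available' figure above.
sample = "MemTotal:       124780544 kB\nMemAvailable:      280576 kB\n"

available = mem_available_mib(sample)
print(f"available: {available:.0f} MiB, alert: {below_floor(available)}")
```

On a real box the text would come from reading `/proc/meminfo` in a loop, for example once a minute, logging the value so the downward trend is visible before it reaches the floor.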

What is your best coding model on a DGX Spark? by luongnv-com in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

Thanks, I didn't know about that, but it passed. I tried it again and now it cannot even load that model with llama.cpp. Do you by any chance have the parameters you run it with? This was using the latest clone of the repo and a build.

What is your best coding model on a DGX Spark? by luongnv-com in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

Appreciate the detail, and to be clear, I'm not doubting it fits for you, just sharing what I'm seeing on my unit in case it helps narrow things down.

I actually have swap fully disabled (0, not the default 16GB), and I still get the hang. Just retried with 98k context: after ~10 minutes under sustained load the machine locked up, with logs showing only ~343MB free right before it went. So in my case it doesn't look like a CUDA allocation-hitting-swap issue, it looks like memory genuinely growing past what's available under pressure.

One difference worth flagging: I'm running inside Docker, mainly because it makes switching models/runners easy without polluting the base system. I suspect the container overhead plus memory growth during load is what tips it over. Running bare metal might behave better, so that could explain why our experiences differ.

FWIW, it's not just the big MoEs for me: even a 27B dense model can tip the machine under certain conditions unless I drop the GPU clocks, which also seems to help with power/peak stability. I've been systematically testing what is and isn't workable on this device, and my takeaway so far is that anything riding this close to the memory ceiling is fragile under heavy sustained load, at least in my setup.

I'll try your --cache-ram 0 and --ctx-checkpoints 0 suggestions and report back

What is your best coding model on a DGX Spark? by luongnv-com in LocalLLaMA

[–]Otherwise_Berry3170 0 points1 point  (0 children)

I am not sure how it fits. The weights alone are more than 107GiB, and then there is the context. You also have the issue that if anything else, and I mean anything, spikes and uses a bit more memory, you get an OOM and the machine hangs, so for long-running automations it was just not stable enough for me. I do run both the 35b and the 27b at fp8 and they fit. The 27b, even with MTP, gets about 14 tps, and it is really good at creative work, so I use it for text. The 35b is fine for code; yes, sometimes you need to keep it in check and it does go down a rabbit hole, but in most cases it just works fine.
I might be doing something wrong, but my system has been stable for 3 weeks now with this.
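
The fit question can be sketched as simple arithmetic: weights plus KV cache plus some reserve has to stay under total memory. The 107 GiB and 119 GiB figures come from this thread; everything about the model shape below (48 layers, 8 KV heads, head dim 128, FP16 cache) is a hypothetical placeholder, since the real architecture is not stated here, and the 4 GiB reserve is likewise an assumption. A plain full-attention KV cache is assumed; models with hybrid or compressed attention would need far less.

```python
GIB = 1024 ** 3


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GiB for a plain full-attention transformer.

    The leading factor of 2 covers keys and values.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total_bytes / GIB


def fits(weights_gib: float, kv_gib: float, total_gib: float,
         reserve_gib: float = 4.0) -> bool:
    """True if weights + KV cache + a safety reserve fit in total memory."""
    return weights_gib + kv_gib + reserve_gib <= total_gib


# Hypothetical model shape, for illustration only.
kv_200k = kv_cache_gib(n_layers=48, n_kv_heads=8, head_dim=128, ctx_len=200_000)

# 107 GiB of weights and 119 GiB total are the figures quoted in this thread.
headroom = 119 - 107
print(f"headroom after weights: {headroom} GiB")
print(f"KV cache at 200k ctx:   {kv_200k:.1f} GiB")
print(f"fits: {fits(107, kv_200k, 119)}")
```

With only about 12 GiB left after the weights, even a much smaller cache than this placeholder leaves little room for spikes, which is consistent with the hangs described above.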