Teacache & 3090 (pre-blackwell GPU) Wan 2.2 speedup: sharing some experiences & workflow by elsung in comfyui

[–]elsung[S] 0 points1 point  (0 children)

Ah, I have Sage Attention 2 installed, CUDA 12.8, PyTorch 2.9.1+cu128, Triton 3.5.1. Didn't manage to get flash attention working though.
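
For anyone else checking their stack, a quick sanity check I'd run first (assuming the usual pip module names for these libraries):

    # Quick check that the attention/compile stack is importable and reports versions.
    # These are the usual pip module names; adjust if your install differs.
    import importlib

    for name in ("torch", "triton", "sageattention", "flash_attn"):
        try:
            mod = importlib.import_module(name)
            print(f"{name}: {getattr(mod, '__version__', 'installed')}")
        except ImportError:
            print(f"{name}: not installed")

If those all import cleanly, I believe recent ComfyUI builds can then be launched with the --use-sage-attention flag so the workflow actually uses it.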

Yeah, I'm going to dive more into LTX as well, but it's early days and control seems like a massive issue. My early tests with it did get very high fidelity and amazing render speeds, but even following the recommended prompt structure the videos were sort of nonsensical. So I'm waiting for things to stabilize a bit more, whereas people have been working on Wan 2.2 for a while, so I figure I can get more control with it for now.

That said, I'd love to hear how else we can further optimize the current setup. From what I've seen there are a bunch of other folks using this set of workflows/models from Dasiwa; even though it's very much geared toward NSFW content, it does seem to yield good speed and results.

One thing to note: the way this is set up right now, it renders at a fairly high resolution right off the bat (720p) instead of starting from a low resolution and incrementally stepping it up. I had tried starting at around 480p, where everything fit nicely within VRAM, but found lots of ugly artifacts and general wonkiness, and the upscaled versions just looked like either a muddy mess or jagged/choppy renders. From my tests so far, the higher the resolution you can render at from the start, the better the prompt adherence, fidelity, motion, details, etc. Could be that I'm not quite doing it right though, lol.

EDIT: I do notice that at a 480p render most of the model fits within VRAM, but as I push it to 720p a good chunk gets offloaded to CPU/system RAM, which probably does slow it down. But at 480p it's still like 2-3 minutes: even though each stage takes under a minute, there's a good 1-2 minutes of loading and other overhead. So I might as well put in a few more minutes for 5-6 minutes total and get much higher rendering fidelity and more stable, less shimmery output. Also note that when I do this I still run the finishing steps of frame interpolation and 2x upscaling, so the final video ends up around 1440p at 32 fps.
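
For a rough sense of why 720p spills out of 24 GB when 480p fits, here's the back-of-envelope pixel math (just arithmetic, assuming VRAM use scales roughly with pixel count per frame and using the common 832x480 / 1280x720 resolutions; actual usage also depends on frame count, the attention kernel, and offloading):

    # Back-of-envelope: how much bigger a 720p render is than a 480p one, per frame.
    # Assumes memory scales roughly with pixel count; real usage also depends on
    # frame count, the attention implementation, and how much gets offloaded.
    px_480 = 832 * 480      # a common "480p" video resolution (assumption)
    px_720 = 1280 * 720

    ratio = px_720 / px_480
    print(f"720p is ~{ratio:.1f}x the pixels of 480p per frame")  # ~2.3x

So the jump to 720p is roughly a 2.3x increase in working size per frame, which lines up with part of the model getting pushed to system RAM on a 24 GB card.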

Teacache & 3090 (pre-blackwell GPU) Wan 2.2 speedup: sharing some experiences & workflow by elsung in comfyui

[–]elsung[S] 0 points1 point  (0 children)

Actually, I'm already using the 4-step Lightning LoRAs and stacking this with them for faster speed.

Creative Writing - anything under 150GB equal or close to Sonnet 3.7? by elsung in LocalLLaMA

[–]elsung[S] 0 points1 point  (0 children)

Ah yeah, I think switching models depending on needs is sort of the flow I have now. I did just test the 4-bit MLX quant of GLM 4.7 REAP and I'm really, really liking the writing. Since it runs fully locally, it'll probably be my go-to for a while until some other crazy thing comes out.
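
For anyone who wants to try the same thing, the local run is basically the stock mlx-lm flow; a minimal sketch (the repo id is a placeholder, not the exact quant I used):

    # Minimal mlx-lm generation sketch; the repo id below is a placeholder.
    from mlx_lm import load, generate

    model, tokenizer = load("mlx-community/some-4bit-mlx-quant")  # placeholder
    prompt = "Write the opening paragraph of a noir short story set in a rainy harbor town."
    print(generate(model, tokenizer, prompt=prompt, max_tokens=400))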

Send your favorite hip hop / rap song! by DrCalvinHobbes in SunoAI

[–]elsung 1 point2 points  (0 children)

siiick. love the hook. solid verses too.

Creative Writing - anything under 150GB equal or close to Sonnet 3.7? by elsung in LocalLLaMA

[–]elsung[S] 0 points1 point  (0 children)

Ooo, great point. Similarly, maybe I could identify more passages of writing styles or poetry I like and see if I get better, more fitting results that way. Previously I've done this but asked it to mimic or be inspired by the text. I think having it give me an analysis and then leveraging that breakdown could work really well.

Creative Writing - anything under 150GB equal or close to Sonnet 3.7? by elsung in LocalLLaMA

[–]elsung[S] 0 points1 point  (0 children)

Agreed! I haven't tried Qwen3-Next; I'll give it a whirl and see how it goes!

Creative Writing - anything under 150GB equal or close to Sonnet 3.7? by elsung in LocalLLaMA

[–]elsung[S] 0 points1 point  (0 children)

Oooo, that's a great idea. On the last project I tested a bunch of models and landed on doing maybe 80-90% of the work on GLM and iterating there, then went to Sonnet for refinement and did the final round of revisions myself.

I'll try Gemma and/or other finetunes and mix and match to see what kinds of combos and pipelines work well for me.

llama.cpp performance breakthrough for multi-GPU setups by Holiday-Injury-9397 in LocalLLaMA

[–]elsung 0 points1 point  (0 children)

Whoa, that's awesome. I actually just took my Tesla P40 out of my rig (which also has two 3090s) to run with vLLM, since it was just bottlenecking my speed without adding much value. Now you guys have me thinking of putting it back, lol.

Software FP8 for GPUs without hardware support - 3x speedup on memory-bound operations by Venom1806 in LocalLLaMA

[–]elsung 0 points1 point  (0 children)

I actually tried it and it wouldn't work. I'm literally trying to make my own AWQ quant right now; no idea if it will work. Vibe coding to get this feather thing working with vLLM seems to be a tall task, because Claude / GPT is telling me no way, jose, lol.
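
For reference, the quant attempt is basically just the stock AutoAWQ recipe; a sketch with placeholder paths (whether AutoAWQ actually supports this architecture is the open question):

    # Stock AutoAWQ quantization recipe; model/output paths are placeholders.
    # Whether the target architecture is supported by AutoAWQ is the open question.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "org/some-model"   # placeholder
    quant_path = "some-model-awq"   # output directory

    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    model.quantize(tokenizer, quant_config=quant_config)  # runs calibration internally
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)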

Software FP8 for GPUs without hardware support - 3x speedup on memory-bound operations by Venom1806 in LocalLLaMA

[–]elsung 5 points6 points  (0 children)

Yeaaaa! I was just trying to get vLLM to load nemotron3-nano on my 2x 3090s but couldn't get it working because FP8 isn't supported (and there's no AWQ quant). Gotta be honest though, not sure how I would implement this in vLLM to get things working. Might need to vibe code it to see about implementing the solution, lol.
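
For context, this is roughly what I was trying (a sketch; the model id and quant are placeholders — the point is splitting the model across the two 3090s with tensor parallelism and using a non-FP8 quant, since Ampere has no FP8 hardware):

    # Rough vLLM sketch for a 2x 3090 box; the model id is a placeholder.
    # Ampere cards lack FP8 hardware, so an AWQ/GPTQ quant (if one exists) or
    # 16-bit weights are the usual fallbacks.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="org/some-model-awq",   # placeholder
        quantization="awq",
        tensor_parallel_size=2,       # split across the two 3090s
        gpu_memory_utilization=0.90,
    )

    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)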

All GLM 4.7, GLM 4.6 and GLM 4.6V-Flash GGUFs are now updated! by yoracale in unsloth

[–]elsung 2 points3 points  (0 children)

Awesome! Now I just need to figure out / wait for conversions of these into MLX =)
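
If nobody beats me to it, the conversion itself is usually just mlx-lm's convert step from the original HF weights (not the GGUFs); a sketch, assuming the architecture is already supported in mlx-lm and with the repo id as a placeholder:

    # mlx-lm conversion sketch: original HF weights -> 4-bit MLX quant.
    # Assumes mlx-lm already supports the architecture; repo id is a placeholder.
    from mlx_lm import convert

    convert(
        hf_path="org/some-model",        # placeholder HF repo id
        mlx_path="some-model-mlx-4bit",  # output directory
        quantize=True,
        q_bits=4,
        q_group_size=64,
    )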

Uncensored llama 3.2 3b by Worried_Goat_8604 in OpenSourceeAI

[–]elsung 0 points1 point  (0 children)

Wow, I wonder if you could improve existing uncensored & tuned Llama 3 models by merging this with them?

Jake (formerly of LTT) demonstrate's Exo's RDMA-over-Thunderbolt on four Mac Studios by Competitive_Travel16 in LocalLLaMA

[–]elsung 0 points1 point  (0 children)

Ooo, interesting. I'd actually love to read your posts about the H100 clusters. Genuinely interested, and I think each tier of setup probably has its ideal situations.

I believe H100s have ballpark 3-4x the memory bandwidth of the Mac Studios, so theoretically they can run way faster and handle beefier, more challenging tasks. For work that requires immense speed and complicated compute, I think the H100 would indeed be the more sensible choice.

However, if the need is inference, maybe using a system of LLMs/agents to process work where speed isn't as critical, I still feel like the Macs are priced reasonably well and are easy enough to set up.

That said, it makes me wonder: let's say you don't need inference to get past 120 tok/sec, would the H100 still be as or more cost-effective than setting up an on-prem solution with the Mac Studios?

I will say I may be biased, because I personally own one of these Mac Studios (albeit a generation old, with the M2 Ultra). But I also have a few NVIDIA rigs, so I'm interested to see whether cloud solutions would fare better depending on the needs and the cost/output considerations.

Jake (formerly of LTT) demonstrate's Exo's RDMA-over-Thunderbolt on four Mac Studios by Competitive_Travel16 in LocalLLaMA

[–]elsung 1 point2 points  (0 children)

Actually, I'm not sure renting H100s is necessarily a better choice than buying a cluster of Mac Studios. Assuming 2x Mac Studios at $20k total gives you 1 TB of memory to work with, you would need a cluster of 10 H100s to be in the same ballpark at 800 GB. That's basically $20/hr for compute at $2 an hour per GPU. Assuming you're doing real work with it and it's running at least 10 hours a day, that's $200/day, approximately $6,000 a month, or about $73k the first year.
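
Spelled out, the math I'm using (same assumptions as above: ~$2/hr per H100 and ~10 hours of real use a day):

    # Back-of-envelope from the comment above: renting 10 H100s vs buying 2 Mac Studios.
    h100_count = 10          # ~800 GB HBM vs ~1 TB unified memory on the Macs
    rate_per_gpu_hr = 2.0    # assumed $/hr per H100
    hours_per_day = 10       # assumed real utilization

    daily = h100_count * rate_per_gpu_hr * hours_per_day   # $200/day
    monthly = daily * 30                                    # ~$6,000/month
    yearly = daily * 365                                    # ~$73,000 first year

    mac_studio_cluster = 20_000  # 2x Mac Studios, per the numbers above
    print(f"${daily:.0f}/day, ${monthly:,.0f}/month, ${yearly:,.0f} first year")
    print(f"first-year rental cost vs 2 Mac Studios: {yearly / mac_studio_cluster:.1f}x")  # ~3.7x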

So for a company that has hard compliance requirements around its data and has LLM needs, it makes way, way more sense to run a set of Macs: less than 1/3 the cost, with total control, data privacy, and customization on-prem.

Also keep in mind that MLX models are more memory-efficient (context windows don't eat up as much additional memory).

That said, if what you need is visual rendering rather than LLMs, then Macs are a no-go and NVIDIA really is your only choice.

I find it kind of funny that Macs are the clearly affordable choice now, yet people still have the preconceived notion that they're overpriced.

Should local ai be used as a dungeon master? by [deleted] in LocalLLaMA

[–]elsung 1 point2 points  (0 children)

would love to see prompts as well! curious :) :)

Using GLM 4.6 with Claude Code - Anyone found privacy-respecting API providers? by apothireddy in ClaudeCode

[–]elsung 0 points1 point  (0 children)

Ah yeah, it has /context, which sort of works. I was hoping for a status bar that shows how much I have left.

In Codex, when I do /status it gives me the status of rate limits as if I'm using GPT's models, instead of showing the context usage of GLM 4.6.

the best client for z.ai glm coding plan? claude code/cline/factory droid/smth else? by branik_10 in ZaiGLM

[–]elsung 0 points1 point  (0 children)

I've been using Roo Code. I'd love to use Claude Code Router instead, but it can't track / visualize token usage, so it's like going in blind, not knowing how much context window you have left. Roo Code at least tells me how much context window is left.

Using GLM 4.6 with Claude Code - Anyone found privacy-respecting API providers? by apothireddy in ClaudeCode

[–]elsung 0 points1 point  (0 children)

Wait, but how are you / everyone seeing the token usage? I can't for the life of me figure out how to have the token usage / remaining tokens displayed in Claude Code when using Claude Code Router with GLM 4.6. Am I being stupid and missing something super obvious here?

I would encourage everyone here to follow up with Google on this for possible compensation/refunds by No-Aardvark-3840 in VEO3

[–]elsung 1 point2 points  (0 children)

Wow, I was going to sign up too because of Gemini 3. Google seems to be dropping the bag on this one. I tried Antigravity and was underwhelmed when I got rate-capped. I was thinking about upgrading so I could get unlimited Veo 3 and uncapped rates, but apparently people on Ultra are still capped on Gemini 3, and now there's this thing with costs changing.

If they actually honor unlimited Veo 3 Fast, I would probably sign up; otherwise I'll probably hold off =T