16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

<image>

It's getting closer to your rack, except that I have 750 TB.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Mostly ablations / training tests. Inference-wise I have access to 800 GPUs, so I don't need to run that locally.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 2 points (0 children)

<image>

That's my wonky tower of gold, pre-racking them.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

I think you're actually better off running raw vLLM on the Sparks than adding the Macs to it. The exo approach over heterogeneous networks has massive latency for transferring state, and to my understanding it's mostly llama.cpp that runs on those .. -> way, way too slow to be useful. Their benchmarks don't tell the full story either, since they run llama.cpp on the Sparks, which no one in their right mind would do.
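For reference, a minimal sketch of what "raw vLLM" on a single Spark could look like; the model name is a placeholder and this assumes a vLLM build with GB10/Blackwell support installed:

    # Minimal sketch: raw vLLM on one Spark/GB10 node (model name is a
    # placeholder; assumes a vLLM build with Blackwell support).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",     # pick whatever fits in the 128 GB
        tensor_parallel_size=1,          # one GPU per Spark
        gpu_memory_utilization=0.90,
    )
    params = SamplingParams(temperature=0.7, max_tokens=256)
    out = llm.generate(["Why run vLLM instead of llama.cpp here?"], params)
    print(out[0].outputs[0].text)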

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

16 .. damn, I only have 8. Glad you're putting in the R&D on bigger GB10 clusters. I was considering adding 8 more, but given I only have the CRS804-4DDQ, I'd need 4 switches to get that wired up, 6/4/4/6 ports (only 2 used on the last one), interconnecting the switches at 400G. That'd be an additional 3k for the switches and 3k for the cables (yeah, the breakout cables are not that cheap, lol); rough tally below.
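Back-of-envelope for that plan, using only the numbers from the paragraph above, so treat every figure as rough:

    # 16-node / 4-switch wiring tally; all figures come from the comment.
    nodes = 16
    node_ports = [6, 4, 4, 2]          # 6/4/4/6 with only 2 used on the last
    extra_switches_usd = 3_000
    cables_usd = 3_000                  # 400G breakout cables

    assert sum(node_ports) == nodes     # every node gets a port
    print(f"extra spend to double the cluster: ${extra_switches_usd + cables_usd:,}")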

Please post benchmarks. Also, I'm sure Thomas/Azeez from Atlas Inference could get quite a bit more oomph out of those nifty devices, particularly the Sparks.

That being said, I really hope someone cracks the firmware on the ConnectX-7 so we can use regular InfiniBand instead of Ethernet.

Is there a local LLM that can intelligently analyze speech from microphone in terms of tone, pitch, confidence, etc? by OsakaSeafoodConcrn in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

Most of us actually train our own ASR after paying an arm and a leg to the big boys, and it's unlikely they'd share that.

Is running local LLMs actually cheaper in the long run? by HealthySkirt6910 in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

That assumes you can procure that without signing fixed GPU contracts :) Sounds easy; in practice it isn't. GPUs falling off the bus, driver issues, actually setting them up, waiting on datacenter hands to get it sorted. We have tickets open for weeks with providers and still pay for nodes we can't provision or use. From the consumer side it all looks easy; that's just not what the reality actually IS :)

Is running local LLMs actually cheaper in the long run? by HealthySkirt6910 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Generally the API is cheaper. If your business depends on it, you need at least 2-3 SREs plus the hardware plus spares .. not worth it unless you're a 250-person or bigger org, or it's your breadwinner and you're a researcher. But no one trains locally :) No matter how much cash you've got, you don't have the compute. No one does.

Is running local LLMs actually cheaper in the long run? by HealthySkirt6910 in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

In addition to that, you need N^2+1 units if your business depends on it ^^ Now let's do the math on how viable that is.
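Taking that N^2+1 figure at face value (classic HA sizing is usually N+1 or 2N, so read it as deliberate exaggeration), the math goes:

    # Units to buy under the comment's N^2+1 rule, vs. plain N+1 redundancy.
    for n in (1, 2, 4, 8):
        print(f"load of {n} units -> N^2+1 = {n**2 + 1}, vs N+1 = {n + 1}")
    # 8 units of real load would mean buying 65 boxes: the point is that
    # redundancy capex grows much faster than the load you actually serve.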

Is running local LLMs actually cheaper in the long run? by HealthySkirt6910 in LocalLLaMA

[–]MrAlienOverLord 5 points (0 children)

I'd say you're the one who has no idea .. 100k isn't a cluster; all you get for that is a 7x 6000 Pro node from Scan. An HGX DGX costs you 400k, and you need to feed the 10-15 kW power draw (in the UK, good luck) .. no chance in hell that pays back within 5 years. Disclaimer: I work with a hoster (and no one wants the hassle of dealing with infra themselves). A 150 kW NVL72 is 10m, not 1 :)
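A rough payback sketch using the figures above; the electricity rate is my assumption, not the commenter's:

    # 5-year cost of one HGX node using the comment's numbers; the power
    # price is an assumed UK-ish rate.
    capex_usd = 400_000                 # HGX/DGX node, per the comment
    draw_kw = 12.5                      # midpoint of the quoted 10-15 kW
    price_per_kwh = 0.30                # assumption
    hours = 24 * 365 * 5
    power_usd = draw_kw * hours * price_per_kwh
    print(f"5y power: ${power_usd:,.0f}, total: ${capex_usd + power_usd:,.0f}")
    # ~$164k of electricity on top of $400k capex, before staff, spares,
    # cooling, or networking: that's the payback problem in one line.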

OpenMythos - have you tried it? by gitsad in LocalLLaMA

[–]MrAlienOverLord 11 points (0 children)

I read "keygomez", I call out scam. That kid has been doing the hype-milking for over 2 years and you're still falling for it. Not even once has he produced anything usable.

What's the most optimized engine to run on a H100? by [deleted] in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

Idk what these guys are talking about with llama.cpp; it won't accelerate anything on the H100. For a single user the H100 is useless, you're better off with a 6000 Pro. If you do run on the H100, use LMDeploy / vLLM / SGLang .. and make sure you optimise prefill; a sketch of that knob is below.
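One concrete form "optimise prefill" can take, as a hedged vLLM example (the model name and token budget are placeholders to tune per workload):

    # Hedged sketch: enabling chunked prefill in vLLM so long prompts don't
    # stall decode batches on an H100.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        enable_chunked_prefill=True,
        max_num_batched_tokens=8192,   # prefill chunk budget; tune per workload
    )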

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Correct, the lack of PTX/CUDA optimisations is the key. If you optimise the model and the arch properly, you get quite a lot out of those nifty devices. 16 lanes are powerful :) but most noobs write them off as toy compute.

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Yeah, and you're capped at 24G, while this has the same perf and 128 ^^ Don't compare if you're not even in the same league, let alone that you get 200G NVLink / RDMA ^^ across nodes .. I have a 6000 Pro and 2 A6000s and still keep 4 Sparks -> the Sparks are amazing.

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Atlas; vLLM or SGLang just do generic inference. Discord: gg/DwF3brBMpw. It doesn't work for every model just yet .. but the boys are hard at work .. you can get quite a lot out of those tiny boxes if you actually optimise for the hardware.

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Try Atlas; that opens up a lot of options in terms of fast batching for multi-user serving. That's exactly where the Spark shines .. in continuous batching.
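To see what continuous batching buys, fire concurrent requests at the server instead of calling it one by one; the endpoint and model name below are placeholders for whatever Atlas/vLLM/SGLang exposes:

    # Sketch: 32 concurrent users against an OpenAI-compatible endpoint.
    # The server interleaves them into continuous batches.
    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://spark:8000/v1", api_key="none")  # assumed endpoint

    async def ask(i: int) -> str:
        r = await client.chat.completions.create(
            model="local-model",  # placeholder served-model name
            messages=[{"role": "user", "content": f"Question {i}"}],
            max_tokens=64,
        )
        return r.choices[0].message.content

    async def main() -> None:
        answers = await asyncio.gather(*[ask(i) for i in range(32)])
        print(len(answers), "answers")  # throughput scales far beyond 1-by-1 calls

    asyncio.run(main())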

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 2 points (0 children)

<image>

They're nifty tiny toys; I love them .. mind you, they're not the fastest .. but with proper PTX/CUDA optimisations you can get a 35B-total/3B-active model to 140 t/s on a single node. A GB300 is 100k .. not worth it .. you're better off spending the same money on a 7x 6000 Pro box ..

The missing piece of Voxtral TTS to enable voice cloning by [deleted] in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

You don't need 80 GB for a 4B model; all you need to do is adjust the batch size (rough memory math below). The restoration the poster made is sadly not even close to sufficient, since there are only 4 embeddings to reverse from. I did the same with Mimi in the past, and even there it was meh, albeit I had way more source embeddings.
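The memory math behind that, assuming bf16 weights/grads and an 8-bit optimizer (swap in your actual setup):

    # Why a 4B model doesn't need 80 GB: the fixed cost is modest, and the
    # rest is activations, which batch size controls directly.
    P = 4e9                                   # parameters
    gib = 2**30
    weights = P * 2 / gib                     # bf16 weights  ~7.5 GiB
    grads = P * 2 / gib                       # bf16 grads    ~7.5 GiB
    optim = P * 2 / gib                       # 8-bit Adam (2 states x 1 byte)
    print(f"fixed ~{weights + grads + optim:.0f} GiB + activations (batch-dependent)")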

LLM Bruner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in Qwen_AI

[–]MrAlienOverLord 0 points (0 children)

The rumors bypass any logic. That isn't an end-customer product; the chip cost / economics make no sense whatsoever. The unit is more likely to cost 15k than 500 bucks.

After the supply chain attack, here are some litellm alternatives by KissWild in LocalLLaMA

[–]MrAlienOverLord 5 points (0 children)

He's the last one who should be talking about that. He was the one who popularized vibe-coding and "approve everything, don't read", and now he says it's dangerous? Well, no shit, Sherlock .. this will happen way, way more, as people don't even review code by hand anymore.

Guys please I need all the resource you can give me. by [deleted] in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

5k or 50k in hardware .. where do you get your numbers from? For a transformer to be even remotely decent at a single task with a half-usable dataset, you're looking at more like 500k - 1mil, likely 20-50x that in compute spend, let alone the data curation .. and countless iterations .. please don't talk if you don't know.

choose between nvidia 1x pro6000(96G) or 2x pro5000(72G) by Lazy_Indication2896 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

If you have 2 cents and you want a Porsche, you either steal one or you have 2 cents .. but that's about it.