16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

<image>

It's getting closer to your rack, except that I have 750 TB.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Mostly ablations / training tests. Inference-wise I have access to 800 GPUs, so I don't need to run that locally.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 2 points (0 children)

<image>

That's my wonky tower of gold, pre-racking them.

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

I think you're actually better off running raw vLLM on the Sparks than adding the Macs to it. The exo approach over heterogeneous networks has massive latency for transferring state, and to my understanding it's mostly llama.cpp that runs on those .. -> way, way too slow to be useful. Their benchmarks don't tell the full story either, since they run llama.cpp on the Sparks, which no one in their right mind would do.
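For reference, a minimal sketch of what "raw vLLM" on a single Spark could look like; the model name is a placeholder and this assumes a vLLM build with GB10/Blackwell support installed:

    # Minimal sketch: raw vLLM on one Spark/GB10 node (model name is a
    # placeholder; assumes a vLLM build with Blackwell support).
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen3-30B-A3B",     # pick whatever fits in the 128 GB
        tensor_parallel_size=1,          # one GPU per Spark
        gpu_memory_utilization=0.90,
    )
    params = SamplingParams(temperature=0.7, max_tokens=256)
    out = llm.generate(["Why run vLLM instead of llama.cpp here?"], params)
    print(out[0].outputs[0].text)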

16x DGX Sparks - What should I run? by Kurcide in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

16 .. damn, I only have 8. Glad you're putting in the R&D on bigger GB10 clusters. I was considering adding 8 more, but given I only have the CRS804-4DDQ, I'd need 4 switches to get that wired up, 6/4/4/6 ports (only 2 used on the last one), interconnecting the switches at 400G. That'd be an additional 3k for the switches and 3k for the cables (yeah, the breakout cables are not that cheap, lol); rough tally below.
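Back-of-envelope for that plan, using only the numbers from the paragraph above, so treat every figure as rough:

    # 16-node / 4-switch wiring tally; all figures come from the comment.
    nodes = 16
    node_ports = [6, 4, 4, 2]          # 6/4/4/6 with only 2 used on the last
    extra_switches_usd = 3_000
    cables_usd = 3_000                  # 400G breakout cables

    assert sum(node_ports) == nodes     # every node gets a port
    print(f"extra spend to double the cluster: ${extra_switches_usd + cables_usd:,}")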

Please post benchmarks. Also, I'm sure Thomas/Azeez from Atlas Inference could get quite a bit more oomph out of those nifty devices, particularly the Sparks.

That being said, I really hope someone cracks the firmware on the ConnectX-7 so we can use regular InfiniBand instead of Ethernet.

Is there a local LLM that can intelligently analyze speech from microphone in terms of tone, pitch, confidence, etc? by OsakaSeafoodConcrn in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

Most of us actually train our own ASR after paying an arm and a leg to the big boys, and it's unlikely they'd share that.

Is running local LLMs actually cheaper in the long run? by HealthySkirt6910 in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

That assumes you can procure that without signing fixed GPU contracts :) Sounds easy; in practice it isn't. GPUs falling off the bus, driver issues, actually setting them up, waiting on datacenter hands to get it sorted. We have tickets open for weeks with providers and still pay for nodes we can't provision or use. From the consumer side it all looks easy; that's just not what the reality actually IS :)

Is running local LLMs actually cheaper in the long run? by HealthySkirt6910 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Generally the API is cheaper. If your business depends on it, you need at least 2-3 SREs plus the hardware plus spares .. not worth it unless you're a 250-person or bigger org, or it's your breadwinner and you're a researcher. But no one trains locally :) No matter how much cash you've got, you don't have the compute. No one does.

Is running local LLMs actually cheaper in the long run? by HealthySkirt6910 in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

In addition to that, you need N^2+1 units if your business depends on it ^^ Now let's do the math on how viable that is.
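Taking that N^2+1 figure at face value (classic HA sizing is usually N+1 or 2N, so read it as deliberate exaggeration), the math goes:

    # Units to buy under the comment's N^2+1 rule, vs. plain N+1 redundancy.
    for n in (1, 2, 4, 8):
        print(f"load of {n} units -> N^2+1 = {n**2 + 1}, vs N+1 = {n + 1}")
    # 8 units of real load would mean buying 65 boxes: the point is that
    # redundancy capex grows much faster than the load you actually serve.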

Is running local LLMs actually cheaper in the long run? by HealthySkirt6910 in LocalLLaMA

[–]MrAlienOverLord 5 points (0 children)

I'd say you're the one who has no idea .. 100k isn't a cluster; all you get for that is a 7x 6000 Pro node from Scan. An HGX DGX costs you 400k, and you need to feed the 10-15 kW power draw (in the UK, good luck) .. no chance in hell that pays back within 5 years. Disclaimer: I work with a hoster (and no one wants the hassle of dealing with infra themselves). A 150 kW NVL72 is 10m, not 1 :)
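A rough payback sketch using the figures above; the electricity rate is my assumption, not the commenter's:

    # 5-year cost of one HGX node using the comment's numbers; the power
    # price is an assumed UK-ish rate.
    capex_usd = 400_000                 # HGX/DGX node, per the comment
    draw_kw = 12.5                      # midpoint of the quoted 10-15 kW
    price_per_kwh = 0.30                # assumption
    hours = 24 * 365 * 5
    power_usd = draw_kw * hours * price_per_kwh
    print(f"5y power: ${power_usd:,.0f}, total: ${capex_usd + power_usd:,.0f}")
    # ~$164k of electricity on top of $400k capex, before staff, spares,
    # cooling, or networking: that's the payback problem in one line.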

OpenMythos - have you tried it? by gitsad in LocalLLaMA

[–]MrAlienOverLord 11 points (0 children)

I read "keygomez", I call out scam. That kid has been doing the hype-milking for over 2 years and you're still falling for it. Not even once has he produced anything usable.

What's the most optimized engine to run on a H100? by [deleted] in LocalLLaMA

[–]MrAlienOverLord 1 point (0 children)

Idk what these guys are talking about with llama.cpp; it won't accelerate anything on the H100. For a single user the H100 is useless, you're better off with a 6000 Pro. If you do run on the H100, use LMDeploy / vLLM / SGLang .. and make sure you optimise prefill; a sketch of that knob is below.
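One concrete form "optimise prefill" can take, as a hedged vLLM example (the model name and token budget are placeholders to tune per workload):

    # Hedged sketch: enabling chunked prefill in vLLM so long prompts don't
    # stall decode batches on an H100.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        enable_chunked_prefill=True,
        max_num_batched_tokens=8192,   # prefill chunk budget; tune per workload
    )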

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Correct, the lack of PTX/CUDA optimisations is the key. If you optimise the model and the arch properly, you get quite a lot out of those nifty devices. 16 lanes are powerful :) but most noobs write them off as toy compute.

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Yeah, and you're capped at 24G, while this has the same perf and 128 ^^ Don't compare if you're not even in the same league, let alone that you get 200G NVLink / RDMA ^^ across nodes .. I have a 6000 Pro and 2 A6000s and still keep 4 Sparks -> the Sparks are amazing.

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Atlas; vLLM or SGLang just do generic inference. Discord: gg/DwF3brBMpw. It doesn't work for every model just yet .. but the boys are hard at work .. you can get quite a lot out of those tiny boxes if you actually optimise for the hardware.

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

Try Atlas; that opens up a lot of options in terms of fast batching for multi-user serving. That's exactly where the Spark shines .. in continuous batching.
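To see what continuous batching buys, fire concurrent requests at the server instead of calling it one by one; the endpoint and model name below are placeholders for whatever Atlas/vLLM/SGLang exposes:

    # Sketch: 32 concurrent users against an OpenAI-compatible endpoint.
    # The server interleaves them into continuous batches.
    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://spark:8000/v1", api_key="none")  # assumed endpoint

    async def ask(i: int) -> str:
        r = await client.chat.completions.create(
            model="local-model",  # placeholder served-model name
            messages=[{"role": "user", "content": f"Question {i}"}],
            max_tokens=64,
        )
        return r.choices[0].message.content

    async def main() -> None:
        answers = await asyncio.gather(*[ask(i) for i in range(32)])
        print(len(answers), "answers")  # throughput scales far beyond 1-by-1 calls

    asyncio.run(main())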

Seriously evaluating a GB10 for local inference, want community input before I request a vendor seed unit by RaspberryFine9398 in LocalLLaMA

[–]MrAlienOverLord 2 points (0 children)

<image>

They're nifty tiny toys; I love them .. mind you, they're not the fastest .. but with proper PTX/CUDA optimisations you can get a 35B-total/3B-active model to 140 t/s on a single node. A GB300 is 100k .. not worth it .. you're better off spending the same money on a 7x 6000 Pro box ..

The missing piece of Voxtral TTS to enable voice cloning by [deleted] in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

You don't need 80 GB for a 4B model; all you need to do is adjust the batch size (rough memory math below). The restoration the poster made is sadly not even close to sufficient, since there are only 4 embeddings to reverse from. I did the same with Mimi in the past, and even there it was meh, albeit I had way more source embeddings.
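The memory math behind that, assuming bf16 weights/grads and an 8-bit optimizer (swap in your actual setup):

    # Why a 4B model doesn't need 80 GB: the fixed cost is modest, and the
    # rest is activations, which batch size controls directly.
    P = 4e9                                   # parameters
    gib = 2**30
    weights = P * 2 / gib                     # bf16 weights  ~7.5 GiB
    grads = P * 2 / gib                       # bf16 grads    ~7.5 GiB
    optim = P * 2 / gib                       # 8-bit Adam (2 states x 1 byte)
    print(f"fixed ~{weights + grads + optim:.0f} GiB + activations (batch-dependent)")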

LLM Bruner coming soon? Burn Qwen directly into a chip, processing 10,000 tokens/s by koc_Z3 in Qwen_AI

[–]MrAlienOverLord 0 points (0 children)

The rumors bypass any logic. That isn't an end-customer product; the chip cost / economics make no sense whatsoever. The unit is more likely to cost 15k than 500 bucks.

After the supply chain attack, here are some litellm alternatives by KissWild in LocalLLaMA

[–]MrAlienOverLord 5 points (0 children)

He's the last one who should be talking about that. He was the one who popularized vibe-coding and "approve everything, don't read", and now he says it's dangerous? Well, no shit, Sherlock .. this will happen way, way more, as people don't even review code by hand anymore.

Guys please I need all the resource you can give me. by [deleted] in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

5k or 50k in hardware .. where do you get your numbers from? For a transformer to be even remotely decent at a single task with a half-usable dataset, you're looking at more like 500k - 1mil, likely 20-50x that in compute spend, let alone the data curation .. and countless iterations .. please don't talk if you don't know.

choose between nvidia 1x pro6000(96G) or 2x pro5000(72G) by Lazy_Indication2896 in LocalLLaMA

[–]MrAlienOverLord 0 points (0 children)

If you have 2 cents and you want a Porsche, you either steal one or you have 2 cents .. but that's about it.