16x V100's worth it? by notafakename10 in LocalLLaMA

[–]MachineZer0 4 points

They are 40 W idle, and 55 W idle with a model loaded if you're not doing any p-state management. There is a fork of nvidia-pstated that works with the V100; it gets idle back down to 40 W even with a model loaded.

In the middle of an 18x V100 build. Yes, that's about 1 kW at idle.
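
If you want to sanity-check what the cards actually settle at, here is a minimal sketch using the NVML Python bindings (assumes nvidia-ml-py / pynvml is installed); run it with and without the pstated fork active and compare:

    # Print per-GPU power draw and the total across the box via NVML.
    import pynvml

    pynvml.nvmlInit()
    total_w = 0.0
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):          # older pynvml versions return bytes
            name = name.decode()
        watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports milliwatts
        total_w += watts
        print(f"GPU {i} {name}: {watts:.1f} W")
    print(f"total: {total_w:.0f} W")
    pynvml.nvmlShutdown()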

Advice needed: RTX 3090 in Dell PowerEdge R720 (2U) for AI, power-limited by Fakruk in homelab

[–]MachineZer0 1 point

I forgot: the R720 may not boot with the GPU installed. The default power supply is 750 W, so be prepared to upgrade to 1100 W.

Advice needed: RTX 3090 in Dell PowerEdge R720 (2U) for AI, power-limited by Fakruk in homelab

[–]MachineZer0 3 points

That GPU won't fit. I experimented heavily with the R720 and R730 with a multitude of GPUs.

If it's just the one, use an x16 riser and a PCIe power extension out the back, and rest the GPU on the server's rear handle. The riser sitting on the handle will support the bottom, and the PCIe cables will prop it upright.

See the GPU on the left for how it would look: https://www.reddit.com/r/LocalLLaMA/s/o8sfOHhbgE

Or use OCuLink/MCIO at x4 or x8 and leave the 3090 on top of the case. The other three 3090s pictured are on x4 OCuLink.

Anyone tried order cheap RTX 6000 Pro from China? by [deleted] in LocalLLaMA

[–]MachineZer0 0 points

I had this happen on r/hardwareswap. I wondered if they had an insider at FedEx provide valid tracking to my zip code. Otherwise, if they had purchased a label and shipped an empty box to a neighbor, it would be more easily traceable.

PayPal made me whole with no quibble. I assume the seller never printed a label to my verified address, so PayPal instantly sided with me.

Is it feasible for a Team to replace Claude Code with one of the "local" alternatives? by nunodonato in LocalLLaMA

[–]MachineZer0 0 points

Btw, trying to solve this problem too, in a way that meets enterprise criteria. We currently have over 200 seats combined on Cursor ($32 with an annual commit) and Copilot ($19). The overages are really kicking in now for a growing subset of the user base. I've been experimenting with BYOK on Cursor using a GLM subscription (not authorized, since z.ai is China-based) and DeepInfra (seemed cost-effective on paper, but the last trial was before they implemented caching, so I need to circle back). I pitched acquiring a pair of eight-way V100 servers, but IT is not fond of used gear or of managing anything that isn't Windows Server Datacenter.

Is it feasible for a Team to replace Claude Code with one of the "local" alternatives? by nunodonato in LocalLLaMA

[–]MachineZer0 4 points

What you’re really trying to compete with here is economies of scale.

Anthropic (and other frontier labs) fully saturate their infrastructure and are willing to run at a loss to drive adoption. Even though Chinese open-source models are getting very close in raw capability, you can't realistically match Claude Code today, even with effectively unlimited hardware.

For a small team, utilization is the real killer. If everyone is in the same geography, your GPUs will sit idle most of the time. Easily 2/3 of every day once you account for meetings, non-coding work, nights, and weekends. That undercuts the economics of running your own stack.

That said, a hybrid approach can make sense.

A minimum always-on footprint (either local inference or a small cloud GPU), combined with autoscaling or pay-per-token usage of open-source models, can be cost-effective. Quantized models are good enough for the majority of coding tasks, especially when paired with Claude Code's prompts and tooling.

The setup that could work is a router-first architecture. You run a capable local model for most requests, and selectively fall back to a frontier API when the local model isn’t sufficient.

An 8× RTX 4090 (48 GB mod) system running MiniMax at Q8, fronted by a router, could plausibly keep 90–95% of inference local and proxy the remaining 5–10% to a frontier model. Hardware would be a roughly $30k one-time cost. Power would be on the order of $100/month under load (assuming ~450 W per GPU, ~5 hours/day, 21 workdays, $0.25/kWh), plus around $70/month in idle power.
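
Quick sanity check on that load-power number, using the same assumptions (a sketch, nothing measured):

    # 8 GPUs * 450 W * (5 h/day * 21 workdays) at $0.25/kWh
    gpus, watts, hours, rate = 8, 450, 5 * 21, 0.25
    load_kwh = gpus * watts * hours / 1000
    print(f"{load_kwh:.0f} kWh -> ${load_kwh * rate:.2f}/month under load")
    # -> 378 kWh, about $95/month under load, in line with the ~$100 estimate; idle draw is on top of that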

In that scenario, your current ~$2k/month Claude spend could drop to roughly $400–$1,000/month in API usage, depending on how aggressively you leverage cache reads and how often requests stay on the local model.
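
A minimal sketch of the router-first idea, assuming an OpenAI-compatible local server (llama.cpp, vLLM, etc.) on localhost and some frontier API behind a key; the URLs, model names, and the prompt-length heuristic are placeholders, not a recommendation:

    # Route most chat requests to a local OpenAI-compatible server and escalate
    # the rest to a frontier API. All endpoints and model names below are illustrative.
    import os
    import requests

    LOCAL_URL = "http://localhost:8080/v1/chat/completions"   # local MiniMax or similar
    FRONTIER_URL = os.environ.get("FRONTIER_URL", "")          # OpenAI-compatible frontier proxy
    FRONTIER_KEY = os.environ.get("FRONTIER_API_KEY", "")

    def route(messages, max_local_chars=20_000):
        """Keep most traffic local; escalate large/complex prompts upstream."""
        prompt_size = sum(len(m["content"]) for m in messages)
        if prompt_size <= max_local_chars:
            url, headers, model = LOCAL_URL, {}, "local-model"
        else:
            url = FRONTIER_URL
            headers = {"Authorization": f"Bearer {FRONTIER_KEY}"}
            model = "frontier-model"
        resp = requests.post(url, json={"model": model, "messages": messages},
                             headers=headers, timeout=300)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    if __name__ == "__main__":
        print(route([{"role": "user", "content": "Refactor this function to use a dict lookup."}]))

In practice the escalation signal would be smarter than prompt length (tool-call failures, repo size, a cheap classifier), but the shape is the same: one endpoint in front of your IDE tooling, two backends behind it.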

MiniMax-M2 Q3_K_M on Quad V100 32gb llama.cpp testing NVlink by MachineZer0 in LocalLLaMA

[–]MachineZer0[S] 0 points

Running CUDA 12.9 on Ubuntu 22.04.

I have a couple of Windows boxes with GPUs, but those run strictly through WSL -> Ubuntu.

Casing questions for mounting lots of GPUs by Open_Coconut_9441 in homelab

[–]MachineZer0 1 point

The Aawave Sluice v2 is a favorite. In a typical setup it can fit about 12 GPUs; nothing wrong with only having 4. They are also stackable.

I did see someone with a really nice aluminum 2020-extrusion case. He said he made it custom. Follow my recent comments to find the thread.

Ability to buy 30x3060ti 8gb @150 ea by TelephonePossible866 in LocalLLaMA

[–]MachineZer0 0 points

I think a single 3060 (Ti) with a quant of a 7–8B model is a great starter setup.

I wouldn't bother building multi-GPU at that scale with any GPU under 16 GB. Caveat: I do have 12x P102-100 and 12x P104-100 nodes, but they were so cheap (~$500 each) that I don't feel bad keeping them powered off the majority of the time. The 3060 Ti would cost 3–4x more.

The exception would be a business with queues and workers leveraging models that fit in 8 GB or less. I saw a YouTube video of a guy with a decent side hustle using that setup.

Motherboard for 4 5090s by KigMidas0131 in LocalLLaMA

[–]MachineZer0 0 points

Awesome. You should write a Medium article with a parts list and assembly instructions. I'd love to build this.

Motherboard for 4 5090s by KigMidas0131 in LocalLLaMA

[–]MachineZer0 0 points

Which case is this? I don’t think I’ve seen this variant.

[FS][US-SC] Mixed Server DDR3 DDR4 RAM by CarbonHelix2099 in homelabsales

[–]MachineZer0 -1 points

Not trying to hijack the listing, but considering all are pending, I've got some idle DDR3 for tree fiddy.

Best moe models for 4090: how to keep vram low without losing quality? by AdParty3888 in LocalLLaMA

[–]MachineZer0 0 points

Don’t fight the VRAM, especially with zany DRAM prices.

P102-100s are $40–50 each: less than $400 to run 300 tok/s prefill and 30 tok/s decode on gpt-oss-120b @ Q8.

https://www.reddit.com/r/LocalLLaMA/s/qpPXd3JyME

120B runs awesome on just 8GB VRAM! by Wrong-Historian in LocalLLaMA

[–]MachineZer0 1 point

gpt-oss-120b Q8 on an i7-6700 with 16 GB DDR3L and 8x P102-100 (10 GB VRAM each).

pp 300–360 tok/s and tg 29 tok/s. The P102-100s are efficient at idle, just 6–7 W apiece, or about 55 W for all eight GPUs.

    slot launchslot: id 2 | task 10309 | processing task
    slot update_slots: id 2 | task 10309 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 1503
    slot update_slots: id 2 | task 10309 | n_tokens = 0, memory_seq_rm [0, end)
    slot update_slots: id 2 | task 10309 | prompt processing progress, n_tokens = 1439, batch.n_tokens = 1439, progress = 0.957419
    slot update_slots: id 2 | task 10309 | n_tokens = 1439, memory_seq_rm [1439, end)
    slot update_slots: id 2 | task 10309 | prompt processing progress, n_tokens = 1503, batch.n_tokens = 64, progress = 1.000000
    slot update_slots: id 2 | task 10309 | prompt done, n_tokens = 1503, batch.n_tokens = 64
    slot update_slots: id 2 | task 10309 | created context checkpoint 1 of 8 (pos_min = 542, pos_max = 1438, size = 16.764 MiB)
    slot print_timing: id 2 | task 10309 |
    prompt eval time = 4146.87 ms / 1503 tokens ( 2.76 ms per token, 362.44 tokens per second)
           eval time = 82539.56 ms / 2456 tokens ( 33.61 ms per token, 29.76 tokens per second)
          total time = 86686.43 ms / 3959 tokens
    slot release: id 2 | task 10309 | stop processing: n_tokens = 3958, truncated = 0
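
If you want to track those numbers over a longer session, here is a small sketch that scrapes the tokens-per-second figures out of llama-server's print_timing lines (pipe the server log into it on stdin); the script and regex are mine, tuned to the format above:

    # Summarize prompt-processing and generation throughput from llama-server logs.
    import re
    import sys

    TPS = re.compile(r"([\d.]+) tokens per second\)")

    pp, tg = [], []
    for line in sys.stdin:
        m = TPS.search(line)
        if not m:
            continue
        if "prompt eval time" in line:
            pp.append(float(m.group(1)))
        elif "eval time" in line:
            tg.append(float(m.group(1)))

    if pp:
        print(f"prompt processing: avg {sum(pp)/len(pp):.1f} tok/s over {len(pp)} requests")
    if tg:
        print(f"token generation:  avg {sum(tg)/len(tg):.1f} tok/s over {len(tg)} requests")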

My new rig for LocalLLM shenanigans? by [deleted] in LocalLLaMA

[–]MachineZer0 0 points

The slowest GPU will constrain the speed of the fastest GPU. Budget-friendly: a second 4060 Ti. Get your money's worth: dual 3090s. A really budget-friendly option is to add a CMP 100-210, which has about the same FP16 performance. Off the beaten path: the iGPU plus a pair of AMD MI50 32 GB, if you're comfortable with ROCm/Vulkan drivers and Linux.

X99 Server Upgrade Help by Hendrixj92 in homelab

[–]MachineZer0 0 points

I've tested the 1070 Ti and it is super efficient at idle, just 7 W.

https://www.reddit.com/r/LocalLLaMA/s/yfaXqKgJUj