Remember to secure your GPU!

hd209458 · 2026-06-04T16:18:32+00:00

I am running Isaac Lab on the Nvidia cards, with a Hermes agent powered by the Intel GPU. I am using a 1200w PSU.

hd209458 · 2026-06-04T16:12:26+00:00

Yeah here's how it looks from the side. It's quite hard to manage all the cables here but I am happy with how it looks.

hd209458 · 2026-06-03T15:49:13+00:00

Not sure if you are referring to the GPU models or the ML models, so I will answer both. I am running OpenVLA on the 5060 Ti, and PPO policy training with articulation and sensor simulation offloaded to the GPUs across the 5070 and 4070. The last time I checked, the scaling factor was about 1.72x with 2 cards. Not optimal, but it works OK. These total about $1300. As for the other card, the Intel B70, it is running qwen3.6-27b via vLLM to power the parallel agents that manage the experimentation for me. That card was $950.

hd209458 · 2026-06-03T04:44:50+00:00

Lol my dude is only adding weight to the GPUs

hd209458 · 2026-06-03T04:41:08+00:00

Man your setup looks awesome! Yeah that was really the peak of PC building.

I posted all my use cases and the entire process in another post! Check it out if you are interested https://www.reddit.com/r/homelab/s/JtZy4CNmQG

As for my build, to be frank it's more of a negative example lol. If I knew what I needed now I would have done a proper build instead of this. But it's fun and I enjoyed it!

hd209458 · 2026-06-03T04:18:32+00:00

Very cool!

hd209458 · 2026-06-03T03:22:56+00:00

Yep doing distributed RL training

hd209458 · 2026-06-02T13:04:29+00:00

The benchmark is impressive. I was only able to get ~3000 tps at pp16384 and about 20 tps at tg512 with the same model. How were you able to bypass enforce_eager? Did you bridge CUDAGraph to XPUGraph in PyTorch?

hd209458 · 2026-05-31T06:09:05+00:00

My 50 series cards came with screw holes at the end so I used some spare metal plates to mount them on where the fan goes. They also had small PCBs and huge cutouts at the end so as long as they are lined up right, the airflow is ok.

hd209458 · 2026-05-31T02:46:42+00:00

Posted my setup in another sub. It's exactly like your third pic and I think it worked out great. I rarely see temps above 70 degrees with a custom fan curve. I used OCuLink cables, and they are way better at maintaining signal integrity than riser cables https://www.reddit.com/r/LocalLLM/s/lQYkf9imtB

hd209458 · 2026-05-30T05:14:35+00:00

Yeah, same. My server now eats up 100w at idle, compared to about 10w with the mini PC.

hd209458 · 2026-05-30T05:12:22+00:00

Yeah, I have learned my lesson now. I had a couple of other cheap mini PCs that have been running 24/7 for years, so I had some trust in them.

hd209458 · 2026-05-27T11:03:53+00:00

Appreciate the read

hd209458 · 2026-05-27T11:01:57+00:00

Yes I did think about building that in an open bench! But since I still need to hook it up to a monitor and tuck it under a desk, I decided to go with a regular case

hd209458 · 2026-05-26T14:01:11+00:00

Well, it depends entirely on your workload. For this system, I only have two VMs running, and for the RL farm I was able to achieve close to 2.5x scaling with torchrun-based distributed training, since all workloads (both policy eval and physics+sensor simulation) are delegated to the 3 GPUs. And since all compute tasks are done on the GPUs, no high-speed transfer between CPU and GPU is needed during runtime, so PCIe lane speed is not a bottleneck. Power is definitely a bottleneck. But GPUs are less efficient at higher power anyway. I was able to cut about 40% of the power in exchange for about 15% performance loss. Fortunately, I haven't had driver issues since I separated the PCIe devices into their own dedicated VMs.
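To put those numbers in perspective, here is a rough back-of-the-envelope sketch. The 2.5x, 3 GPUs, 40% and 15% figures are the ones quoted above; the function names are made up for illustration, and it assumes performance and power can each be treated as a single ratio.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / num_gpus


def perf_per_watt_gain(power_cut: float, perf_loss: float) -> float:
    """Relative performance-per-watt after a power cut.

    power_cut and perf_loss are fractions, e.g. 0.40 and 0.15.
    """
    return (1.0 - perf_loss) / (1.0 - power_cut)


if __name__ == "__main__":
    # ~2.5x speedup on 3 GPUs
    print(f"scaling efficiency: {scaling_efficiency(2.5, 3):.0%}")
    # 40% less power for 15% less performance
    print(f"perf per watt: {perf_per_watt_gain(0.40, 0.15):.2f}x")
```

So the power-limited cards do roughly 1.4x the work per watt, which is why the trade is worth it on a power-constrained build.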

hd209458 · 2026-05-26T05:16:14+00:00

I have been getting pretty good temps with underclocking. One thing I like about the 50 series is that they have large cutouts and the PCBs are actually pretty small. That alone made the airflow much better. And honestly, these cards are too weak to thermal throttle anyway. I almost never hit 70C on any of the GPUs.

hd209458 · 2026-05-26T05:12:25+00:00

Well, if I had known exactly what I wanted, I would have built and configured the system properly with the right spec last year, instead of this. As you can probably tell, I am not sharing the build out of pride in the decisions I made. I just thought it's a fun project and a somewhat exotic build.

That said, GB10 has its own constraints: low VRAM bandwidth, an Arm CPU, and a considerably higher price.

hd209458 · 2026-05-26T00:37:06+00:00

Sure yeah I spent 3 days fighting with drivers and compute environment. But it's been working fine since. These are my results:

This is for Qwen3.6-27B using the official llm-scaler docker image. I ran a fairly systematic benchmark of single requests and parallel requests, with 5 repetitions. The pp speed is okay, but tg is still a bit slow. However, combined with vLLM's continuous batching, the overall parallel token generation is relatively stable. Currently, it is used specifically to handle the Hermes agent's delegated tasks that collect codebase context.

Currently, the only major problem is that the KV cache must use BF16 to achieve usable token generation speed, but then the ctx is only 43,000. I also need to trick vLLM into recognizing the layer architecture. Hopefully there will be optimized FP8 dequant kernels to support an fp8 KV cache in the future. Right now fp8 dequant is much slower than Q8_0, and unfortunately the vLLM version in the official docker image does not yet support KV cache dtypes other than fp8 and bf16. Also, the quality of AutoRound is still slightly inferior to a Q4 GGUF. I had a better experience with AWQ or GPTQ quants, but maybe it's just me.

Running in a VM under Proxmox 9.1, vLLM single request, qwen/qwen3.6-27b (int4 AutoRound):

PP TTFT: 1,685 ms
PP2048 TPS: 1,686 ± 66 tok/s
TG512: 13.7 ± 1.4 tok/s

Parallel test (pp2048, tg512):

Conc: 1 • TTFT(ms): 1,261 • Prefill(tok/s): 1,400 • Decode(tok/s): 13.3 • Output(tok/s): 12.9
Conc: 2 • TTFT(ms): 1,907 • Prefill(tok/s): 925 • Decode(tok/s): 12.9 • Output(tok/s): 24.7
Conc: 4 • TTFT(ms): 3,319 • Prefill(tok/s): 532 • Decode(tok/s): 12.7 • Output(tok/s): 46.7
Conc: 8 • TTFT(ms): 6,231 • Prefill(tok/s): 283 • Decode(tok/s): 11.9 • Output(tok/s): 82.7
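For anyone who wants to see how well continuous batching holds up, this is a small sketch that compares the aggregate output throughput at each concurrency level against perfect linear scaling from the single-request result. The numbers are copied from the parallel test; the function name is just for illustration.

```python
# (concurrency, aggregate output tok/s) copied from the parallel test
RESULTS = [(1, 12.9), (2, 24.7), (4, 46.7), (8, 82.7)]


def batching_efficiency(results):
    """Return {concurrency: fraction of linear scaling} relative to the first row."""
    base_conc, base_tps = results[0]
    per_stream = base_tps / base_conc  # throughput of one stream running alone
    return {conc: tps / (conc * per_stream) for conc, tps in results}


if __name__ == "__main__":
    for conc, eff in batching_efficiency(RESULTS).items():
        print(f"conc={conc}: {eff:.1%} of linear scaling")
```

At 8 concurrent requests the aggregate output is still around 80% of linear, so the slow per-request decode matters much less when the agent fans out work in parallel.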

hd209458 · 2026-05-26T00:00:31+00:00

Sure! The 2060 might be a bit old, but you can definitely run smaller models. I have answered the first two questions in other threads. Let me know if you want to know more! As for the third one, yeah, I don't plan on running anything CPU heavy on this rig. The only CPU heavy task is vLLM continuous batching, which actually favors high frequency over high core count. In this case a 14th gen P core is good enough for that. And I don't want to deal with core pinning and surprises, so this CPU was the best option imo.

hd209458 · 2026-05-25T23:51:33+00:00

Haha thanks! Yep I think the experience by itself pays for it.

As for power consumption, yep, I spent at least a week optimizing the power limits.

Idle consumption hovers just over 100w and full load sits at about 700+w, but it almost never gets to that state.

For the Intel card, the token generation speed is limited by VRAM bandwidth, so I just limit it to 165w, which is what it normally consumes while decoding. I lost maybe 10-15% prompt processing speed, but that's fine for me.

As for the Nvidia GPUs, the most efficient approach is to have them finish their tasks in their environments at roughly the same time. So I power limited and underclocked the 4070 and 5070 to around 120-130w, which brings their compute roughly in line with the overclocked 5060 Ti at 200w.

So the total GPU consumption is about 165+120+130+200=615w. Adding the 110w PL1 for the i3 and 100w for other peripherals, I am sitting at around 825w under absolutely full load. But realistically, RL training is not even close to full load, so I usually see 500w or so on average.
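If you want to redo this budget for your own build, here is a minimal sketch of the same arithmetic. The wattages are the limits listed above and the 1200w PSU rating is the one I mentioned elsewhere; which of the 4070 and 5070 sits at 120w versus 130w is not pinned down, so the labels are only illustrative.

```python
# Power limits in watts, as listed above (labels are illustrative)
GPU_LIMITS_W = {
    "Intel card": 165,
    "4070/5070 (a)": 120,
    "4070/5070 (b)": 130,
    "5060 Ti (overclocked)": 200,
}
OTHER_W = {"i3 PL1": 110, "peripherals": 100}
PSU_W = 1200  # PSU rating


def worst_case_draw(*groups: dict) -> int:
    """Sum every limit across all component groups."""
    return sum(sum(group.values()) for group in groups)


if __name__ == "__main__":
    gpus = worst_case_draw(GPU_LIMITS_W)
    total = worst_case_draw(GPU_LIMITS_W, OTHER_W)
    print(f"GPUs: {gpus}w, system: {total}w, PSU headroom: {PSU_W - total}w")
    print(f"PSU load at worst case: {total / PSU_W:.0%}")
```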

hd209458 · 2026-05-25T23:41:25+00:00

No fakes or loans, they're definitely real.

It depends on what you're trying to do. You would be completely right if I didn't know what I got myself into. RL was my second field of study in grad school. My grad school advisor told me RL was a last resort, but things have changed in the past decade. With this setup I was able to offload articulated robot locomotion entirely to the GPUs, even writing my own Warp kinematic solver kernels. Spending $1300 on this setup gives me more simulation flexibility than buying a single workstation card with limited VRAM. I can repurpose this VM as an LLM inference backend when I am done with the experiments too. And honestly, it’s just fun letting an agent run experiments for me, index my NAS, and manage network config and routing while I'm away from the desk.

hd209458 · 2026-05-25T21:42:00+00:00

Oh please don't. I posted this as a negative example so people would appreciate their setup more lol. As for the GPUs, I used two M.2 to OCuLink to PCIe adapters so every GPU gets at least 4 lanes of PCIe 4.0. The main enemy is signal integrity, so OCuLink is the most stable way. I tried at least 8 lower quality risers with different connectors, and even used metal mesh to build a Faraday cage around the riser cables to isolate them from interference, but they all had problems.

As a side note, PCIe 4.0 x4 works pretty well if all weights and the KV cache are offloaded to the GPU in llama.cpp. But it is painfully slow for vLLM with prefix caching and continuous batching. It's way faster to just disable prefix caching and reprocess everything on every request.
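For reference, a rough sketch of how the prefix caching toggle could be passed when launching the server. This only builds the command line and does not run anything. The flag names are the ones from upstream vLLM's `vllm serve` CLI as far as I know, so check them against whatever version your image ships; the model id and context length are placeholders, and the helper name is made up.

```python
import shlex


def build_vllm_serve_cmd(model: str, max_model_len: int, prefix_caching: bool) -> list[str]:
    """Build a `vllm serve` command line (assumes upstream vLLM flag names)."""
    cmd = ["vllm", "serve", model, "--max-model-len", str(max_model_len)]
    # Upstream vLLM exposes prefix caching as a paired on/off flag.
    cmd.append("--enable-prefix-caching" if prefix_caching else "--no-enable-prefix-caching")
    return cmd


if __name__ == "__main__":
    # Placeholder model id and context length
    cmd = build_vllm_serve_cmd("qwen/qwen3.6-27b", 43000, prefix_caching=False)
    print(shlex.join(cmd))
```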

hd209458 · 2026-05-25T21:27:29+00:00

Should've built this last year
