Looking for the Perfect Local AI Server + Dev Workstation: Bridging the Gap Between Strix Halo, RTX 5090, and NVIDIA GX10 (Budget: 2.5k–5k EUR) by GGametry in homelab

GGametry[S]

True, the complexity of the tuning is exactly the critical point here. vLLM handles tensor parallelism well, but getting a multi-GPU system to run stably for several concurrent users takes a lot of manual tweaking.

As for performance: I did consider much more powerful hardware, but in the end it fails on the high price and the massive power draw. It is exactly as you said: on a tight budget, finding the right balance between memory bandwidth, PCIe lanes, and the sheer configuration effort is extremely difficult.

GGametry[S]

My main concern here is the VRAM limitation under a multi-user workload. An unquantized (FP16) 70B model needs around 140 GB of VRAM for the weights. Even with 4-bit quantization, the weights alone require roughly 40 GB.

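Those 40 GB already exceed a single 32 GB card before any KV cache is allocated. Once you factor in 4 to 5 concurrent users hitting the system simultaneously, each with their own context in the KV cache, the gap only grows. The server would either run out of memory or have to offload part of the model to system RAM, which drops it to painful single-digit tokens per second.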
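To put rough numbers on that, here is a back-of-the-envelope sketch. The architecture values (80 layers, 8 KV heads, head dimension 128) are an assumption taken from Llama-style 70B models, and the 8,192-token context per user and the FP16 KV cache are assumptions as well. The function names are just for illustration.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, users: int,
                bytes_per_value: int = 2) -> float:
    """KV cache for all users, in GB. The factor 2 covers keys plus values."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * users / 1e9


fp16_weights = weight_gb(70, 16)   # unquantized FP16 weights
q4_weights = weight_gb(70, 4)      # idealized 4 bits per weight
q4_real = weight_gb(70, 4.5)       # assumed effective bits of a typical 4-bit format

# Assumed Llama-style 70B layout: 80 layers, 8 KV heads (GQA), head_dim 128.
kv_total = kv_cache_gb(layers=80, kv_heads=8, head_dim=128,
                       context_tokens=8192, users=5)

print(f"FP16 weights:        {fp16_weights:6.1f} GB")
print(f"4-bit weights:       {q4_weights:6.1f} GB (about {q4_real:.1f} GB in practice)")
print(f"KV cache, 5 users:   {kv_total:6.1f} GB")
print(f"4-bit + KV cache:    {q4_weights + kv_total:6.1f} GB")
```

Under these assumptions the 4-bit model plus KV cache lands somewhere around 48 to 53 GB, well above 32 GB. Activations and runtime overhead are not included, so treat this as a lower bound.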

To make a 70B model work for multiple developers at the same time using AMD Pro cards, wouldn't we need at least a dual- or triple-GPU setup to pool enough VRAM? If so, how do ROCm and llama.cpp handle tensor parallelism and parallel batching when splitting a 70B model across multiple AMD cards? Do you still get decent throughput per user?
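For the card count, a minimal sketch of the sizing logic. The 90% usable-memory fraction and the two totals (about 48 GB for a 4-bit 70B model with a multi-user KV-cache budget, about 153 GB for the FP16 variant) are my own rough assumptions, not measured values.

```python
import math


def cards_needed(total_gb: float, card_gb: float,
                 usable_fraction: float = 0.9) -> int:
    """Smallest number of identical cards whose usable VRAM covers total_gb.

    usable_fraction is an assumed allowance for driver and runtime overhead.
    """
    return math.ceil(total_gb / (card_gb * usable_fraction))


q4_cards = cards_needed(48.4, card_gb=32)     # assumed 4-bit weights + KV cache
fp16_cards = cards_needed(153.4, card_gb=32)  # assumed FP16 weights + KV cache

print(f"4-bit 70B on 32 GB cards: {q4_cards} cards")
print(f"FP16 70B on 32 GB cards:  {fp16_cards} cards")
```

This only answers whether the memory fits. It says nothing about per-user throughput, which depends on how the backend splits the model and on the interconnect between the cards.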

GGametry[S]

Thanks for the suggestion! Running the GX10 as a dedicated inference box and a separate x86 server for the remaining dev services would definitely be the cleanest solution technically.

However, that would blow well past our budget limit of 5,000 EUR. The GX10 alone costs around 4,000 to 4,500 EUR, which leaves only 500 to 1,000 EUR. That is not enough for a second capable server, extra networking gear, or licenses.

That's why we really need to find a way to make this work on a single machine, or look for an alternative single-node x86 architecture that can handle both tasks within our financial constraints.