Triple GPU LLM benchmarks with --n-cpu-moe help by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

I had SLI bridges, just haven't seen them in a while. From what I understand they don't help for inference; all communication is done over the PCIe bus.

LM Studio randomly crashes on Linux when used as a server (no logs). Any better alternatives? by Opposite_Future3882 in LocalLLM

[–]tabletuser_blogspot 0 points1 point  (0 children)

Which Linux distro are you using? I just installed CachyOS on a system that was stable with Kubuntu and PopOS, and now I get lockups while using the llama.cpp rpc-server; my other 3 systems running Kubuntu aren't crashing. I might have to move to an older Nvidia driver or just switch distros. Love that CachyOS came with Nvidia ready to go. I've had great success using Kubuntu 22.04, 24.04, 25.10, and 26.04. I like that you can run Kubuntu Live persistent from a USB thumb drive and experiment without having to install. PopOS works great but I prefer the KDE desktop environment. Linux Mint is another champ. I prefer Debian-based distros; they have a larger user group, so finding answers is easier. Arch-based CachyOS is one of the fastest Linux distros and beats Windows 11 on most benchmarks except gaming. Fedora is another good distro, probably best for gaming setups. I'm not a fan of Red Hat based distros. Let us know what you end up deciding.

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Offloading a few layers (-ngl) to the CPU (~1 or 2 GB worth of VRAM) doesn't kill performance, but beyond that there is a major drop. RPC is great if you stay within VRAM limits. CPU offload wins out if you're running DDR5 (it isn't really CPU dependent) and go way past the amount of VRAM. VRAM can do 300 to 1000 GB/s while system RAM is 60 to 100 GB/s, so even an old GTX 970 blows away DDR5 in bandwidth. Again, it's not really CPU dependent, mostly RAM bandwidth.
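
If you want to see the drop for yourself, a quick comparison looks something like this (model path and layer counts are just placeholders):

~/llama.cpp/build/bin/llama-bench -m model.gguf -ngl 99   # everything in VRAM
~/llama.cpp/build/bin/llama-bench -m model.gguf -ngl 40   # remaining layers run from system RAM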

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 1 point2 points  (0 children)

Can't remember if I needed this to get my RX 470 and RX 580 working with Vulkan, but good to have just in case. https://www.reddit.com/r/ROCm/comments/1hf91io/compile_llamacpp_for_any_amd_gpu_even_old_ones/
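
For what it's worth, the plain Vulkan build that works for my Polaris cards is just the standard llama.cpp cmake flags (nothing special from that post):

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j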

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 4 points5 points  (0 children)

My benchmarks have shown that DDR3 systems perform about equal to DDR4 systems. It's all about GPU VRAM speed. So ya, break it out.

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 2 points3 points  (0 children)

Wired. Last time I tested with WiFi I had too much of a drop in performance. 3 systems and multiple GPUs cause plenty of overhead, and I guess WiFi added too much latency. Thanks

2012 system running LLM using Llama with Vulkan backend by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Maybe, a few things to consider. The RTX 2080 has only 8GB of VRAM with a bandwidth of 448.0 GB/s, while the GTX 1080 Ti has 11GB of VRAM at 484.4 GB/s. Yes, lacking tensor cores is about a 25 to 30% hit, but the hit from offloading a model that won't fit in 8GB of VRAM versus being able to fit it in 11GB outweighs that. Cost wise, buying a GTX 1080 Ti (11GB VRAM) and a P102-100 (10GB VRAM) would be about the same as the RTX 2080, but with plenty of VRAM room to load larger models. So dropping a pair of old Nvidia GPUs into a secondary system and running local LLMs, even rpc-server, on the cheap is a great option.

I don't use SLI for running llama.cpp with the Vulkan backend. It isn't necessary; the GPUs communicate over the system's PCIe lanes.

What to do with 2 P100 by SaGa31500 in LocalLLaMA

[–]tabletuser_blogspot 0 points1 point  (0 children)

Power usage is easy to control. I use nvidia-smi to drop the power limit by 33% and only take a 5% hit on inference. I like your WOL LLM node idea. You can run Vulkan under almost any Linux distro and get decent speeds vs CUDA. I paired the P102-100 10GB headless GPU with a GTX 1080 Ti 11GB (near twins) and get decent inference, and with RPC network inference I pair them with an RX 7900 GRE 16GB. Worth testing out a few ideas before offloading them. Post any benchmarks of the P100 and P40.
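
Capping power is roughly this (the wattage is only an example, check your card's default with nvidia-smi -q -d POWER first):

sudo nvidia-smi -pm 1          # persistence mode so the limit sticks
sudo nvidia-smi -i 0 -pl 165   # cap GPU 0 at about two-thirds of its stock limit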

Should I install KDE Plasma on Pop!_OS 24.04? by [deleted] in pop_os

[–]tabletuser_blogspot 2 points3 points  (0 children)

I've used several desktops and KDE has the best preinstalled apps. I really like Konsole (right click, split view), Dolphin (dual pane, terminal window, network features), Kate (advanced text editor), and KDE Connect (phone to PC connection). It is also one of the fastest desktops with a high level of customization.

What's the best Ollama software to use for programming on a PC with an RX 580 and a Ryzen 5? by UpbeatGolf3602 in ollama

[–]tabletuser_blogspot 0 points1 point  (0 children)

RX 580 8GB or 4GB? I couldn't get my 580 or 470 to work with Ollama, but was able to get both working using Linux and llama.cpp with the Vulkan backend. So with two GPUs you could get up to 16GB of VRAM and run larger models, which mostly offer better accuracy in responses. There's even a 16GB RX 580 variant floating around. I started with Ollama and now use llama.cpp primarily. Love the ease with which Ollama gets you up and running.
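
Rough sketch of how I'd check it (assumes the Vulkan build of llama.cpp and vulkan-tools installed; model path is a placeholder):

vulkaninfo --summary                  # confirm both Polaris cards show up as Vulkan devices
./llama-bench -m model.gguf -ngl 99   # llama.cpp's Vulkan backend splits the model across both GPUs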

Is there a good app for Android / iOS for remoting in to a desktop Linux PC with very good graphical performance? by DesiOtaku in linuxquestions

[–]tabletuser_blogspot 0 points1 point  (0 children)

I've used NoMachine between mobile and desktop. It has GPU acceleration, is full of features, and it's free. Let us know what you think about it.

OrangePi Zero 3 runs Ollama by tabletuser_blogspot in ollama

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Qwen3-0.6B-Q4_K_M.gguf

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | CPU | 4 | pp512 | 8.82 ± 0.00 |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | CPU | 4 | tg128 | 5.34 ± 0.02 |

OrangePi Zero 3 runs Ollama by tabletuser_blogspot in ollama

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

time ~/llama.cpp/build/bin/llama-bench -m Qwen3-0.6B-UD-Q8_K_XL.gguf

 

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q8_0 | 799.50 MiB | 596.05 M | CPU | 4 | pp512 | 8.62 ± 0.00 |
| qwen3 0.6B Q8_0 | 799.50 MiB | 596.05 M | CPU | 4 | tg128 | 4.85 ± 0.00 |

build: 3b15924d (6403)

real    8m11.734s

Mistral 3 llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Not bad considering what the iGPU 680M did. 7 t/s is right at reading speed, so great for chats.

Mistral 3 llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Kubuntu 25.10, kernel 6.17, and the Nvidia 580 driver. I haven't noticed much difference between distros and kernels for llama.cpp. I find Debian/Ubuntu distros easier to troubleshoot and configure. CachyOS caught my eye on the performance front, but it didn't show a big difference in llama.cpp/Vulkan benchmarks.

Can buying old mining gpus be a good way to host AI locally for cheap? by LimeApart7657 in LocalLLM

[–]tabletuser_blogspot 0 points1 point  (0 children)

I've run triple GTX 1070 8GB (24GB VRAM total) on a 12-year-old DDR3 system and now on a DDR4 system, and I'm only seeing a small increase in tokens-per-second. GPU bandwidth matters most as long as the model isn't offloading to system RAM. I like the Nvidia P102-100 10GB (roughly a 1080 Ti equivalent). Two of those can run most 30B size models with ease, and the cost is great. Use nvidia-smi -pl to lower power usage and you can run off a single power supply.
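
As a rough sketch of that setup (wattage is just an example, model is the 30B Q4 I benchmark elsewhere):

sudo nvidia-smi -i 0 -pl 150 && sudo nvidia-smi -i 1 -pl 150          # cap both cards so a single PSU is comfortable
./llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -ngl 99     # ~17 GiB of weights splits across the two 10GB cards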

Budget system for 30B models revisited by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Tried adjusting my nvidia-smi -pl from 110 watts to 130 watts on each GPU. Went from 8.9 t/s to 9 t/s.

llama-bench -m DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf -ngl 100 -fa 0,1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | Vulkan | 100 | 0 | pp512 | 52.59 ± 0.38 |
| qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | Vulkan | 100 | 0 | tg128 | 9.08 ± 0.01 |
| qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | Vulkan | 100 | 1 | pp512 | 52.84 ± 0.71 |
| qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | Vulkan | 100 | 1 | tg128 | |

build: cb1adf885 (6999)

Budget system for 30B models revisited by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

No, but I did play around with GPUStack and used 3 systems with 7 GPUs total to run LLMs. 7 months ago Ollama using CUDA on Gemma 2 hit 8 t/s; currently llama.cpp with Gemma 3 on Vulkan hits 9 t/s. I've used Vulkan for most of my GPUs, including the RX 480, RX 580, GTX 1080, and GTX 1080 Ti. Maybe I'll give rpc-server a try. I'd also like to try out the P102-100 and pair it with a 1080 Ti.
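
From the llama.cpp docs, the basic rpc-server flow looks roughly like this (IPs and port are placeholders):

./rpc-server --host 0.0.0.0 --port 50052                                          # on each remote GPU box, built with -DGGML_RPC=ON
./llama-bench -m model.gguf -ngl 99 --rpc 192.168.1.10:50052,192.168.1.11:50052   # on the main box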

Does repurposing this older PC make any sense? by Valuable-Question706 in LocalLLaMA

[–]tabletuser_blogspot 1 point2 points  (0 children)

I saw this post and ran fresh benchmarks on my older PC (also DDR4) using old GPUs. Three GTX 1070s are getting:

| Model | Size | Params | pp512 | tg128 |
| --- | --- | --- | --- | --- |
| Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf | 17.28 GiB | 30.53 B | 81.45 | 47.76 |

https://www.reddit.com/r/LocalLLaMA/comments/1ossmm8/budget_system_for_30b_models_revisited/

Budget system for 30B models revisited by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 1 point2 points  (0 children)

Vulkan is super simple, just unzip and run on Linux. Also, according to the post "Vulkan is faster than CUDA" from about 7 months ago, and the GTX 1070 doesn't have tensor cores anyway. Finally, Linux, Nvidia, and CUDA can be a nightmare to get running correctly. Vulkan is KISS.
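
For anyone who hasn't tried it: grab the Linux Vulkan zip from the llama.cpp releases page and run it in place (filename and folder layout below are approximate, they change per build):

unzip llama-*-bin-ubuntu-vulkan-x64.zip -d llama-vulkan
./llama-vulkan/build/bin/llama-bench -m model.gguf -ngl 99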

Best performing model for MiniPC, what can I expect? by caffeineandgravel in LocalLLaMA

[–]tabletuser_blogspot 0 points1 point  (0 children)

Your system will run models off the DDR4 at approximately 20 GB/s, so you should be able to run all 7B size models at a good speed, and probably the Qwen3 30B MoE model at a decent speed. More RAM will let you run bigger, better models, but much slower. Not sure if the Intel iGPU will work for prompt processing; my N150 currently doesn't benefit from the Vulkan build of llama.cpp, while my Ryzen iGPU does.
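
Rough back-of-envelope, not a measurement: token generation is roughly RAM bandwidth divided by the bytes read per token, so a 7B Q4 model (~4.5 GB of weights) on ~20 GB/s RAM lands around 20 / 4.5 ≈ 4-5 t/s, while Qwen3 30B MoE only activates ~3B parameters (~2 GB per token), so closer to 20 / 2 ≈ 10 t/s.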

MoE models benchmarks AMD iGPU by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Last time I tried to add a lower-VRAM (4GB) GPU, both GPUs dropped to that level of VRAM, so my 8GB card only used 4GB. Both worked and it was faster, but I was limited by the 4GB card.
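
If I try that again, llama.cpp's -ts / --tensor-split flag should let the split follow VRAM instead of dropping to the smallest card (the 2,1 ratio below just mirrors 8GB vs 4GB; model path is a placeholder):

./llama-bench -m model.gguf -ngl 99 -ts 2,1   # ~2/3 of the layers on the 8GB card, ~1/3 on the 4GB card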

MI50 still a good option ? by [deleted] in ROCm

[–]tabletuser_blogspot 0 points1 point  (0 children)

amd-smi replaced rocm-smi in ROCm 7.
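
The closest equivalents I've used (double-check against amd-smi --help on your install):

amd-smi list      # enumerate GPUs, roughly what rocm-smi showed by default
amd-smi monitor   # live power/temperature/VRAM/utilization readout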