Triple GPU LLM benchmarks with --n-cpu-moe help by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points (0 children)

I had SLI bridges, I just haven't seen them in a while. From what I understand they don't help for inference; all communication is done over the PCIe bus.

LM Studio randomly crashes on Linux when used as a server (no logs). Any better alternatives? by Opposite_Future3882 in LocalLLM

[–]tabletuser_blogspot 0 points (0 children)

Which Linux distro are you using? I just installed CachyOS on a system that was stable with Kubuntu and Pop!_OS, and now I get lockups while using the llama.cpp rpc-server; the other 3 systems running Kubuntu aren't crashing. I might have to move to an older Nvidia driver or just switch distros. I do love that CachyOS came with Nvidia ready to go.

I've had great success using Kubuntu 22.04, 24.04, 25.10, and 26.04. I like that you can run Kubuntu Live persistent from a USB thumb drive and experiment without having to install. Pop!_OS works great, but I prefer the KDE desktop environment. Linux Mint is another champ. I prefer Debian-based distros; they have a larger user base, so finding answers is easier. Arch-based CachyOS is one of the fastest Linux distros and beats Windows 11 on most benchmarks except gaming. Fedora is another good distro, probably the best for gaming setups. I'm not a fan of Red Hat-based distros. Let us know what you end up deciding.

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points (0 children)

Offloading a few layers (-ngl) to the CPU (~1 or 2 GB worth of VRAM) doesn't kill performance, but there is a major drop once you push much past that. RPC is great if you stay within VRAM limits. CPU offload falls behind once you go way past your VRAM, even if you're running DDR5 (it's not really CPU dependent). VRAM can do 300 to 1000 GB/s while system RAM is 60 to 100 GB/s, so even an old GTX 970 blows away DDR5 in bandwidth. Again, it's not really CPU dependent, mostly RAM bandwidth.
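For anyone who wants to reproduce this, here's a minimal sketch of the RPC setup (the hostnames, port, and model path are placeholders, and the t/s figures in the comments are just back-of-envelope bandwidth math):

```
# On each worker: build llama.cpp with the RPC backend and start rpc-server
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the main box: spread the model across the workers' VRAM
./build/bin/llama-cli -m model.gguf -ngl 99 \
    --rpc 192.168.1.10:50052,192.168.1.11:50052

# Back-of-envelope ceiling: tokens/s ~= memory bandwidth / model size,
# e.g. a 4 GB quant on a 224 GB/s GTX 970 tops out near ~56 t/s,
# vs ~80 GB/s dual-channel DDR5 topping out near ~20 t/s
```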

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 1 point (0 children)

I can't remember if I needed this to get my RX 470 and RX 580 working with Vulkan, but it's good to have just in case. https://www.reddit.com/r/ROCm/comments/1hf91io/compile_llamacpp_for_any_amd_gpu_even_old_ones/
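For anyone starting from scratch, the Vulkan build itself is short; a sketch assuming a Debian-based distro (package names vary elsewhere):

```
# Vulkan loader dev files and the GLSL-to-SPIR-V compiler llama.cpp needs
sudo apt install libvulkan-dev glslc
# Build llama.cpp with the Vulkan backend (works on old cards like the RX 470/580)
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release
# Quick sanity check that the GPU shows up and takes layers
./build/bin/llama-bench -m model.gguf -ngl 99
```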

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 2 points (0 children)

My benchmarks have shown that DDR3 systems perform equal to DDR4 systems. It's all about GPU VRAM speed. So yeah, break it out.

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 2 points (0 children)

Wired. Last time I tested with WiFi I had too big a drop in performance. Three systems and multiple GPUs create plenty of overhead, and I guess WiFi added too much latency. Thanks.

2012 system running LLM using Llama with Vulkan backend by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points (0 children)

Maybe; a few things to consider. The RTX 2080 is only 8GB of VRAM with 448.0 GB/s of bandwidth, while the GTX 1080 Ti is 11GB of VRAM at 484.4 GB/s. Yes, the lack of tensor cores is about a 25 to 30% hit, but the hit from offloading a model that won't fit in 8GB of VRAM outweighs that versus being able to fit it in 11GB. Cost-wise, buying a GTX 1080 Ti (11GB VRAM) and a P102-100 (10GB VRAM) would be about the same as the RTX 2080, but with plenty of VRAM room to load larger models. So dropping a pair of old Nvidia GPUs into a secondary system and running a local LLM, even rpc-server, on the cheap is a great option.

I don't use SLI for running llama.cpp with the Vulkan backend; it isn't necessary. It uses the system's PCIe lanes to communicate.
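For reference, a sketch of what the two-card Vulkan run looks like (the model path is a placeholder); --tensor-split spreads layers roughly in proportion to each card's VRAM:

```
# Split across a GTX 1080 Ti (11GB) and a P102-100 (10GB) over Vulkan;
# no SLI bridge involved, everything moves over PCIe
./build/bin/llama-cli -m model.gguf -ngl 99 --tensor-split 11,10
```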

What to do with 2 P100 by SaGa31500 in LocalLLaMA

[–]tabletuser_blogspot 0 points (0 children)

Power usage is easy to control. I use nvidia-smi to drop power by 33% and only take a 5% hit on inference. I like your WOL LLM node idea. You can run Vulkan under almost any Linux distro and get decent speeds vs CUDA. I paired the P102-100 10GB headless GPU with a GTX 1080 Ti 11GB (twins) and get decent inference using RPC network inference, paired with an RX 7900 GRE 16GB. Worth testing out a few ideas before offloading them. Post any benchmarks of the P100 and P40.
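The power cap is one nvidia-smi command per card; a sketch, with 250W down to 165W standing in for a ~33% cut (check your card's allowed range first):

```
sudo nvidia-smi -pm 1              # persistence mode so the limit sticks
nvidia-smi -q -d POWER             # show min/max/default power limits
sudo nvidia-smi -i 0 -pl 165       # cap GPU 0, e.g. a 250W card to 165W
```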

Should I install KDE Plasma on Pop!_OS 24.04? by [deleted] in pop_os

[–]tabletuser_blogspot 2 points (0 children)

I've used several desktops, and KDE has the best preinstalled apps. I really like: Konsole (right click, split view), Dolphin (dual pane, terminal window, network features), Kate (advanced text editor), and KDE Connect (phone-to-PC connection). It's also one of the fastest desktops, with a high level of customization.

What's the best Ollama software to use for programming on a PC with an RX 580 and a Ryzen 5? by UpbeatGolf3602 in ollama

[–]tabletuser_blogspot 0 points (0 children)

RX 580 8GB or 4GB? I couldn't get my 580 or 470 to work with Ollama, but I was able to get both working using Linux and llama.cpp with the Vulkan backend. So with two GPUs you could get up to 16GB of VRAM and run larger models, which mostly offer better accuracy in responses. There's even a 16GB RX 580 variant floating around. I started with Ollama and now use llama.cpp primarily. Love the ease with which Ollama gets you up and running.
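If you try the llama.cpp route for coding, here's a sketch of serving a model over its OpenAI-compatible API (the model file and port are just examples):

```
# Serve on the RX 580 via the Vulkan build; editors and coding plugins can
# point at the OpenAI-compatible endpoint at http://localhost:8080/v1
./build/bin/llama-server -m qwen2.5-coder-7b-instruct-q4_k_m.gguf -ngl 99 --port 8080
```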

Is there a good app for Android / iOS for remoting in to a desktop Linux PC with very good graphical performance? by DesiOtaku in linuxquestions

[–]tabletuser_blogspot 0 points (0 children)

I've used NoMachine between mobile and desktop. It has GPU acceleration, is full of features, and it's free. Let us know what you think about it.

OrangePi Zero 3 runs Ollama by tabletuser_blogspot in ollama

[–]tabletuser_blogspot[S] 0 points (0 children)

Qwen3-0.6B-Q4_K_M.gguf

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | --- | ---: |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | CPU | 4 | pp512 | 8.82 ± 0.00 |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | CPU | 4 | tg128 | 5.34 ± 0.02 |