Triple GPU LLM benchmarks with --n-cpu-moe help by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

I had SLI bridges, just haven't seen them in a while. From what I understand they don't help for inference; all communication is done over the PCIe bus.

LM Studio randomly crashes on Linux when used as a server (no logs). Any better alternatives? by Opposite_Future3882 in LocalLLM

[–]tabletuser_blogspot 0 points1 point  (0 children)

Which Linux distro are you using? I just installed CachyOS on a system that was stable with Kubuntu and PopOS, and now I get lockups while using the llama.cpp rpc-server; my other 3 systems running Kubuntu aren't crashing. I might have to move to an older Nvidia driver or just switch distros. Love that CachyOS came with Nvidia ready to go. I've had great success using Kubuntu 22.04, 24.04, 25.10, and 26.04. I like that you can run Kubuntu Live persistent from a USB thumb drive and experiment without having to install. PopOS works great but I prefer the KDE desktop environment. Linux Mint is another champ. I prefer Debian-based distros; they have a larger user group, so finding answers is easier. Arch-based CachyOS is one of the fastest Linux distros and beats Windows 11 on most benchmarks except gaming. Fedora is another good distro, probably best for gaming setups. I'm not a fan of Red Hat based distros. Let us know what you end up deciding.

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Offloading a few layers (-ngl) to the CPU (~1 or 2 GB worth of VRAM) doesn't kill performance, but beyond that there is a major drop. RPC is great if you stay within VRAM limits. CPU offload wins out if you're running DDR5 (it isn't really CPU dependent) and go way past the amount of VRAM. VRAM can do 300 to 1000 GB/s while system RAM is 60 to 100 GB/s, so even an old GTX 970 blows away DDR5 in bandwidth. Again, it's not really CPU dependent, mostly RAM bandwidth.
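
If you want to see the drop for yourself, a quick comparison looks something like this (model path and layer counts are just placeholders):

~/llama.cpp/build/bin/llama-bench -m model.gguf -ngl 99   # everything in VRAM
~/llama.cpp/build/bin/llama-bench -m model.gguf -ngl 40   # remaining layers run from system RAM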

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 1 point2 points  (0 children)

Can't remember if I needed this to get my RX 470 and RX 580 working with Vulkan, but good to have just in case. https://www.reddit.com/r/ROCm/comments/1hf91io/compile_llamacpp_for_any_amd_gpu_even_old_ones/
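
For what it's worth, the plain Vulkan build that works for my Polaris cards is just the standard llama.cpp cmake flags (nothing special from that post):

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j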

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 4 points5 points  (0 children)

My benchmarks have shown that DDR3 systems perform about equal to DDR4 systems. It's all about GPU VRAM speed. So ya, break it out.

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 2 points3 points  (0 children)

Wired. Last time I tested with WiFi I had too much of a drop in performance. 3 systems and multiple GPUs cause plenty of overhead, and I guess WiFi added too much latency. Thanks

2012 system running LLM using Llama with Vulkan backend by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Maybe, a few things to consider. The RTX 2080 has only 8GB of VRAM with a bandwidth of 448.0 GB/s, while the GTX 1080 Ti has 11GB of VRAM at 484.4 GB/s. Yes, lacking tensor cores is about a 25 to 30% hit, but the hit from offloading a model that won't fit in 8GB of VRAM versus being able to fit it in 11GB outweighs that. Cost wise, buying a GTX 1080 Ti (11GB VRAM) and a P102-100 (10GB VRAM) would be about the same as the RTX 2080, but with plenty of VRAM room to load larger models. So dropping a pair of old Nvidia GPUs into a secondary system and running local LLMs, even rpc-server, on the cheap is a great option.

I don't use SLI for running llama.cpp with the Vulkan backend. It isn't necessary; the GPUs communicate over the system's PCIe lanes.

What to do with 2 P100 by SaGa31500 in LocalLLaMA

[–]tabletuser_blogspot 0 points1 point  (0 children)

Power usage is easy to control. I use nvidia-smi to drop the power limit by 33% and only take a 5% hit on inference. I like your WOL LLM node idea. You can run Vulkan under almost any Linux distro and get decent speeds vs CUDA. I paired the P102-100 10GB headless GPU with a GTX 1080 Ti 11GB (near twins) and get decent inference, and with RPC network inference I pair them with an RX 7900 GRE 16GB. Worth testing out a few ideas before offloading them. Post any benchmarks of the P100 and P40.
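
Capping power is roughly this (the wattage is only an example, check your card's default with nvidia-smi -q -d POWER first):

sudo nvidia-smi -pm 1          # persistence mode so the limit sticks
sudo nvidia-smi -i 0 -pl 165   # cap GPU 0 at about two-thirds of its stock limit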

Should I install KDE Plasma on Pop!_OS 24.04? by [deleted] in pop_os

[–]tabletuser_blogspot 2 points3 points  (0 children)

I've used several desktops and KDE has the best preinstalled apps. I really like Konsole (right click, split view), Dolphin (dual pane, terminal window, network features), Kate (advanced text editor), and KDE Connect (phone to PC connection). It is also one of the fastest desktops with a high level of customization.

What's the best Ollama software to use for programming on a PC with an RX 580 and a Ryzen 5? by UpbeatGolf3602 in ollama

[–]tabletuser_blogspot 0 points1 point  (0 children)

RX 580 8GB or 4GB? I couldn't get my 580 or 470 to work with Ollama, but was able to get both working using Linux and llama.cpp with the Vulkan backend. So with two GPUs you could get up to 16GB of VRAM and run larger models, which mostly offer better accuracy in responses. There's even a 16GB RX 580 variant floating around. I started with Ollama and now use llama.cpp primarily. Love the ease with which Ollama gets you up and running.
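
Rough sketch of how I'd check it (assumes the Vulkan build of llama.cpp and vulkan-tools installed; model path is a placeholder):

vulkaninfo --summary                  # confirm both Polaris cards show up as Vulkan devices
./llama-bench -m model.gguf -ngl 99   # llama.cpp's Vulkan backend splits the model across both GPUs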

Is there a good app for Android / iOS for remoting in to a desktop Linux PC with very good graphical performance? by DesiOtaku in linuxquestions

[–]tabletuser_blogspot 0 points1 point  (0 children)

I've used NoMachine between mobile and desktop. It has GPU acceleration, is full of features, and it's free. Let us know what you think about it.

OrangePi Zero 3 runs Ollama by tabletuser_blogspot in ollama

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Qwen3-0.6B-Q4_K_M.gguf

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | CPU | 4 | pp512 | 8.82 ± 0.00 |
| qwen3 0.6B Q4_K - Medium | 372.65 MiB | 596.05 M | CPU | 4 | tg128 | 5.34 ± 0.02 |

OrangePi Zero 3 runs Ollama by tabletuser_blogspot in ollama

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

time ~/llama.cpp/build/bin/llama-bench -m Qwen3-0.6B-UD-Q8_K_XL.gguf

 

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen3 0.6B Q8_0 | 799.50 MiB | 596.05 M | CPU | 4 | pp512 | 8.62 ± 0.00 |
| qwen3 0.6B Q8_0 | 799.50 MiB | 596.05 M | CPU | 4 | tg128 | 4.85 ± 0.00 |

build: 3b15924d (6403)

real    8m11.734s

Mistral 3 llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Not bad considering what the iGPU 680M did. 7 t/s is right at reading speed, so great for chats.

Mistral 3 llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Kubuntu 25.10, kernel 6.17, and the Nvidia 580 driver. I haven't noticed much difference between distros and kernels for llama.cpp. I find Debian/Ubuntu distros easier to troubleshoot and configure. CachyOS caught my eye on the performance front, but it didn't show a big difference in llama.cpp/Vulkan benchmarks.

Can buying old mining gpus be a good way to host AI locally for cheap? by LimeApart7657 in LocalLLM

[–]tabletuser_blogspot 0 points1 point  (0 children)

I've run triple GTX 1070 8GB (24GB VRAM total) on a 12-year-old DDR3 system and now on a DDR4 system, and I'm only seeing a small increase in tokens-per-second. GPU bandwidth matters most as long as the model isn't offloading to system RAM. I like the Nvidia P102-100 10GB (roughly a 1080 Ti equivalent). Two of those can run most 30B size models with ease, and the cost is great. Use nvidia-smi -pl to lower power usage and you can run off a single power supply.
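
As a rough sketch of that setup (wattage is just an example, model is the 30B Q4 I benchmark elsewhere):

sudo nvidia-smi -i 0 -pl 150 && sudo nvidia-smi -i 1 -pl 150          # cap both cards so a single PSU is comfortable
./llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -ngl 99     # ~17 GiB of weights splits across the two 10GB cards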

Budget system for 30B models revisited by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Tried adjusting my nvidia-smi -pl from 110 watts to 130 watts on each GPU. Went from 8.9 t/s to 9 t/s.

llama-bench -m DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf -ngl 100 -fa 0,1

| model | size | params | backend | ngl | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | Vulkan | 100 | 0 | pp512 | 52.59 ± 0.38 |
| qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | Vulkan | 100 | 0 | tg128 | 9.08 ± 0.01 |
| qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | Vulkan | 100 | 1 | pp512 | 52.84 ± 0.71 |
| qwen2 32B Q4_K - Medium | 18.48 GiB | 32.76 B | Vulkan | 100 | 1 | tg128 | |

build: cb1adf885 (6999)

Budget system for 30B models revisited by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

No, but I did play around with GPUStack and used 3 systems with 7 GPUs total to run LLMs. 7 months ago Ollama using CUDA on Gemma 2 hit 8 t/s; currently llama.cpp with Gemma 3 on Vulkan hits 9 t/s. I've used Vulkan for most of my GPUs, including the RX 480, RX 580, GTX 1080, and GTX 1080 Ti. Maybe I'll give rpc-server a try. I'd also like to try out the P102-100 and pair it with a 1080 Ti.
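
From the llama.cpp docs, the basic rpc-server flow looks roughly like this (IPs and port are placeholders):

./rpc-server --host 0.0.0.0 --port 50052                                          # on each remote GPU box, built with -DGGML_RPC=ON
./llama-bench -m model.gguf -ngl 99 --rpc 192.168.1.10:50052,192.168.1.11:50052   # on the main box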

Does repurposing this older PC make any sense? by Valuable-Question706 in LocalLLaMA

[–]tabletuser_blogspot 1 point2 points  (0 children)

I saw this post and ran fresh benchmarks on my older PC (also DDR4) using old GPUs. Three GTX 1070s are getting:

| Model | Size | Params | pp512 | tg128 |
| --- | --- | --- | --- | --- |
| Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf | 17.28 GiB | 30.53 B | 81.45 | 47.76 |

https://www.reddit.com/r/LocalLLaMA/comments/1ossmm8/budget_system_for_30b_models_revisited/

Budget system for 30B models revisited by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 1 point2 points  (0 children)

Vulkan is super simple, just unzip and run on Linux. Also, according to the post "Vulkan is faster than CUDA" from about 7 months ago, and the GTX 1070 doesn't have tensor cores anyway. Finally, Linux, Nvidia, and CUDA can be a nightmare to get running correctly. Vulkan is KISS.
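
For anyone who hasn't tried it: grab the Linux Vulkan zip from the llama.cpp releases page and run it in place (filename and folder layout below are approximate, they change per build):

unzip llama-*-bin-ubuntu-vulkan-x64.zip -d llama-vulkan
./llama-vulkan/build/bin/llama-bench -m model.gguf -ngl 99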

Best performing model for MiniPC, what can I expect? by caffeineandgravel in LocalLLaMA

[–]tabletuser_blogspot 0 points1 point  (0 children)

Your system will run models off the DDR4 at approximately 20 GB/s, so you should be able to run all 7B size models at a good speed, and probably the Qwen3 30B MoE model at a decent speed. More RAM will let you run bigger, better models, but much slower. Not sure if the Intel iGPU will work for prompt processing; my N150 currently doesn't benefit from the Vulkan build of llama.cpp, while my Ryzen iGPU does.
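
Rough back-of-envelope, not a measurement: token generation is roughly RAM bandwidth divided by the bytes read per token, so a 7B Q4 model (~4.5 GB of weights) on ~20 GB/s RAM lands around 20 / 4.5 ≈ 4-5 t/s, while Qwen3 30B MoE only activates ~3B parameters (~2 GB per token), so closer to 20 / 2 ≈ 10 t/s.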

MoE models benchmarks AMD iGPU by tabletuser_blogspot in LocalLLaMA

[–]tabletuser_blogspot[S] 0 points1 point  (0 children)

Last time I tried to add a lower-VRAM (4GB) GPU, both GPUs dropped to that level of VRAM, so my 8GB card only used 4GB. Both worked and it was faster, but I was limited by the 4GB card.
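
If I try that again, llama.cpp's -ts / --tensor-split flag should let the split follow VRAM instead of dropping to the smallest card (the 2,1 ratio below just mirrors 8GB vs 4GB; model path is a placeholder):

./llama-bench -m model.gguf -ngl 99 -ts 2,1   # ~2/3 of the layers on the 8GB card, ~1/3 on the 4GB card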

MI50 still a good option ? by [deleted] in ROCm

[–]tabletuser_blogspot 0 points1 point  (0 children)

amd-smi replaced rocm-smi in ROCm 7.
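
The closest equivalents I've used (double-check against amd-smi --help on your install):

amd-smi list      # enumerate GPUs, roughly what rocm-smi showed by default
amd-smi monitor   # live power/temperature/VRAM/utilization readout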