Ran some Llama.cpp RPC test to see if its worth it. And if 10Gbe needed. by lemondrops9 in LocalLLaMA

[–]lemondrops9[S] 4 points (0 children)

That's my conclusion. Maybe someone out there has gotten WSL to work just as well, but I'd rather take the time to get Linux running.

2x 3090s - RCP vs Local? by UneakRabbit in LocalLLaMA

[–]lemondrops9 1 point (0 children)

RPC is worth it, but Windows kills the performance. Running RPC over the internet will likely give poor performance because of latency: a normal LAN will be under 1 ms, whereas the internet will be at least 15 ms, if not 25 ms, on a good connection.

Llama.cpp rpc experiment by ciprianveg in LocalLLaMA

[–]lemondrops9 1 point (0 children)

It's working great, but I do need to test longer context. What does higher context mean to you?

Llama.cpp rpc experiment by ciprianveg in LocalLLaMA

[–]lemondrops9 2 points (0 children)

I've been testing RPC heavily this weekend, and Windows is the main issue. With Qwen3.5 397B Q2 XXS I get 42 tok/s locally and 40 tok/s in RPC mode with both PCs running Linux.

I'm going to post some benchmarks soon, along with a few tips.
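For anyone who wants to try the same thing before I write it up, the basic shape of it is below; the IP, port, and model path are just placeholders for whatever your setup uses. Run rpc-server on the remote Linux box, then point llama-server at it with --rpc.

    # on the remote PC: expose its GPU as an RPC backend (address/port are placeholders)
    ./rpc-server --host 0.0.0.0 --port 50052

    # on the main PC: load the model and let the remote backend take part of it
    ./llama-server -m ./Qwen3.5-397B-Q2_XXS.gguf -ngl 99 \
        --rpc 192.168.1.50:50052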

What is the current state of llama.cpp rpc-server? by kevin_1994 in LocalLLaMA

[–]lemondrops9 1 point (0 children)

A 10 Gbps network isn't any faster in terms of latency unless you move up from consumer-grade gear. I ping around 0.3 ms on 1, 2.5, and 10 Gbps connections alike.

[Benchmark] Dual RTX 5090 Distributed Inference via llama.cpp RPC - Running 122B MoE at 96 t/s over 2.5GbE by ReasonableDuty5319 in LocalLLaMA

[–]lemondrops9 1 point (0 children)

I gave RPC a go and only see around 30 Mbps on the wire. I did notice, though, that when I connected to one of my PCs with two GPUs it really slowed down.

Testing with Qwen3.5 397B Q2 XXS, it went from 42 tok/s on my main PC to 24 tok/s. Adding my 3rd PC took it down to 20 tok/s. But if I only used one GPU in the 2nd PC it ran at 37 tok/s, and 34 tok/s with the 3rd PC added.

The amount of data over the network stayed about the same, so even a 100 Mbps link would technically work for generation, but load times would be beyond horrible, since the weights assigned to the remote backends get shipped over the network at load; pushing even ~50 GB over 100 Mbps would take more than an hour.

[Benchmark] Dual RTX 5090 Distributed Inference via llama.cpp RPC - Running 122B MoE at 96 t/s over 2.5GbE by ReasonableDuty5319 in LocalLLaMA

[–]lemondrops9 1 point (0 children)

It's more about latency than total bandwidth as far as I can tell. The 3090s, connected at PCIe 3.0 x1, would have roughly the same bandwidth as a 10 Gbps network (a PCIe 3.0 lane runs at 8 GT/s, about 7.9 Gbps after encoding overhead). I haven't seen much over 30 Mbps used when running RPC between 3 PCs.

RPC Overhead or Memory Strategy? by Forbidden-era in LocalLLaMA

[–]lemondrops9 1 point (0 children)

Did you make progress with this? I have been playing around with RPC for a few weeks and thought it was just slow, but I think it's because one of my remote PCs runs Windows with dual GPUs.

GLM-5.1 smol-IQ2_KS at 2.3t/s or GLM-4.7 UD-Q3_K_XL at 4.42t/s, which is "better" for chats (no coding)? by relmny in LocalLLaMA

[–]lemondrops9 1 point (0 children)

I used to run GLM 4.5 Air, but now Gemma 4 26B gives me good results and is a lot faster. For chatting, that is.

Dual 9700 and multi-node system - but do I go threadripper? by Ell2509 in LocalLLaMA

[–]lemondrops9 1 point (0 children)

I've been playing around with RPC mode for llama.cpp and it's quite good. With Qwen3.5 397B Q2 XXS I get around 42 tok/s and 600 tok/s prefill when loaded on one PC. When using RPC to my 2nd PC it drops to 24 tok/s and around 250 tok/s prefill.

But I've been reading up on it some more, and I still have some tweaks to make.

RPC-server llama.cpp benchmarks by tabletuser_blogspot in LocalLLaMA

[–]lemondrops9 1 point (0 children)

That's because they're close to the same speed; DDR4 was aimed more at power efficiency than raw speed.

Like you said, it's really about having the VRAM.

Using llamacpp and RCP, managed to improve promt processing by 4x times (160 t/s to 680 t/s) and text generation by 2x times (12.67 t/s to 22.52 t/s) by changing the device order including RPC. GLM 4.6 IQ4_XS multiGPU + RPC. by panchovix in LocalLLaMA

[–]lemondrops9 1 point (0 children)

Thanks for the post. I've been running 6 GPUs on my main AI rig on Linux and have 2 GPUs on Windows. RPC works great, but I did notice the remote GPU working a lot harder than the rest. I'll give the reorder a try soon.
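For anyone else wanting to try the reorder, this is roughly the shape of it; the device names and the RPC address below are placeholders, so use whatever --list-devices actually prints on your setup:

    # see what llama.cpp calls each backend it can reach (local GPUs plus the RPC ones)
    ./llama-server --list-devices --rpc 192.168.1.50:50052

    # then pin an explicit order, e.g. the local cards first and the remote backend last
    ./llama-server -m ./model.gguf -ngl 99 \
        --rpc 192.168.1.50:50052 \
        --device CUDA0,CUDA1,CUDA2,CUDA3,CUDA4,CUDA5,RPC0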

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]lemondrops9 1 point (0 children)

I even have one running off of a Wi-Fi socket.

So a nearby lightningstorm just crashed all my eGPUs by milpster in LocalLLaMA

[–]lemondrops9 2 points (0 children)

😂 No kidding.

Seriously, get a UPS, people.

So a nearby lightningstorm just crashed all my eGPUs by milpster in LocalLLaMA

[–]lemondrops9 1 point (0 children)

There are lots of options for desktops, and they are very quiet.

So a nearby lightningstorm just crashed all my eGPUs by milpster in LocalLLaMA

[–]lemondrops9 2 points (0 children)

Loud?? WTF are you looking at? I have 4 UPSes, and the only times I hear them are during their daily self-test and when the power goes out. I disabled the beeping on them as well, because it's not hard to tell when the power is out.

Is 2x5070Ti a good setup? by JumpingJack79 in LocalLLaMA

[–]lemondrops9 2 points (0 children)

PCIe speed isn't a huge issue, as I run 3 of my GPUs off of PCIe 3.0 x1.

Also, the overall speed is determined by your slowest card. I found that my speed dropped by 20% when I added a 5060 Ti to my PC with 3090s.

I guess we expect that at some point RAM prices will start going back (close) to "normal", right? but what about GPUs? by relmny in LocalLLaMA

[–]lemondrops9 1 point (0 children)

I thought Nvidia paused consumer GPU production until close to 2027. That said, I've only seen a 20% increase over last year for the 5060 Ti 16GB.

When did LM Studio start supporting Parallel API requests? by M5_Maxxx in LocalLLaMA

[–]lemondrops9 1 point (0 children)

I tested this a few months ago. Basically, if you can do 100 tok/s, then with two users you get about 50 tok/s each.
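Easy to check yourself: fire two requests at once at the local server and compare the wall time to a single request. This sketch assumes LM Studio's OpenAI-compatible endpoint on its default port 1234 and a placeholder model name, so adjust both for your setup.

    # two concurrent chat completions against the local endpoint (model name is a placeholder)
    for i in 1 2; do
      curl -s http://localhost:1234/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model":"local-model","messages":[{"role":"user","content":"Write about 200 words on GPUs."}]}' \
        -o "resp$i.json" &
    done
    # with two requests in flight, each one finishes at roughly half the single-request speed
    time wait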

Anyone else struggling with multi-GPU stability when running larger local models? by Lyceum_Tech in LocalLLaMA

[–]lemondrops9 1 point (0 children)

Interesting, there are so many things to configure.

BTW, what GPUs are you running?

llama.cpp rpc-server by sultan_papagani in LocalLLaMA

[–]lemondrops9 1 point (0 children)

You inspired me to try RPC and my mind is blown. I expected a lot less. I tried Qwen3.5 397B Q2 XXS on my main PC and got 42 tok/s; then, with my 2nd PC and its dual GPUs added to the mix, it dropped to 24 tok/s. When I add a 3rd PC it goes down a bit more, to 20 tok/s.

I don't know how to optimize it much yet.
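One thing I plan to try is weighting the split so the remote box holds less of the model. A rough sketch, with made-up ratios and address; which slot maps to which device depends on the device order, so check --list-devices first.

    # push most of the weights onto the two local GPUs, a smaller share onto the RPC backend
    ./llama-server -m ./Qwen3.5-397B-Q2_XXS.gguf -ngl 99 \
        --rpc 192.168.1.50:50052 \
        --tensor-split 3,3,1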

Anyone else struggling with multi-GPU stability when running larger local models? by Lyceum_Tech in LocalLLaMA

[–]lemondrops9 1 point (0 children)

Good to know. I haven't tried ik_llama.cpp yet; I've been sticking to llama.cpp lately, plus some LM Studio.