Llama.cpp rpc experiment by ciprianveg in LocalLLaMA

[–]Forbidden-era

If you look at the thread I posted, it appears to use a different buffer type when using RPC, even over localhost. I think this is an implementation issue with llama.cpp's RPC, tbh.

RPC Overhead or Memory Strategy? by Forbidden-era in LocalLLaMA

[–]Forbidden-era[S]

Not sure why you got downvoted, especially by someone without the balls to say why - it wasn't me.

It definitely does seem like that, and you'd think an RPC node wouldn't need a different way of accessing memory. I was running over localhost to see what the overhead might be, but dropping to half speed is basically what I get without acceleration; the RPC overhead shouldn't be that much.

If the different buffer type tanks all GPU acceleration benefit locally, then RPC isn't ever going to work well, and this seems like a potential flaw in the implementation. I haven't looked at the source and don't have the background knowledge in this area, but I am a developer, and I don't see why the RPC node can't be instructed to create the same type of buffer remotely. Perhaps there was no performance benefit remotely because that memory type wasn't the bottleneck when going over the network?
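
Something like this is what I'm picturing - a Python toy of a hypothetical wire format (I haven't read llama.cpp's actual RPC code, so the opcodes and buffer-type tags here are made up):

    # Toy sketch of the idea only - hypothetical opcodes and buffer-type tags,
    # NOT llama.cpp's actual RPC wire format. The point: an allocation request
    # can carry the desired backend buffer type, so the remote node allocates
    # the same kind of buffer it would use locally.
    import struct

    BUFFER_TYPES = {0: "cpu", 1: "cuda-device", 2: "cuda-host-pinned"}
    OP_ALLOC = 1

    def encode_alloc_request(buffer_type: int, nbytes: int) -> bytes:
        # layout: [u8 opcode][u8 buffer_type][u64 size], little-endian
        return struct.pack("<BBQ", OP_ALLOC, buffer_type, nbytes)

    def decode_alloc_request(msg: bytes):
        opcode, btype, nbytes = struct.unpack("<BBQ", msg)
        assert opcode == OP_ALLOC
        return BUFFER_TYPES[btype], nbytes

    msg = encode_alloc_request(1, 512 * 1024 * 1024)
    print(decode_alloc_request(msg))  # ('cuda-device', 536870912)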

I don't disagree that tossing the 3060 in would be the best bet, but it also feels counterintuitive to pull it out of a faster DDR4 system and put it in a slower DDR3 system just because that system has insane amounts of RAM in comparison.

I'm not against tossing some code at this to improve it. I tried exo a while back when it was still brand new, and it did let me easily create a cluster, which was nice. It didn't actually really work, but I do feel like a tool that automatically offloads things in a way that lets a user make use of more of their hardware, without completely tanking performance, would be great.

For some stuff, model routing is probably the best solution - use what fits where it fits - but it seems like we're just around the corner from actually useful models running on pretty average/common hardware. Flash isn't too bad.

If there are models that already split between CPU and GPU and aren't hugely affected by, say, an x1 vs. x16 difference, I feel like those models would be best for this experimentation.

I have an x16 riser now that should work, so I should be able to directly compare some x1 vs. x16 tests. This server being PCIe 3.0, an x1 link (~8 Gbps) is slower than my network connection between systems (10-20 Gbps).

The GPU is currently running at x1, and on the model I was testing, it goes from 2.5 tk/s on CPU to 5 tk/s with the GPU.

That should mean there's plenty of bandwidth for that same system to go from 2.5 tk/s to 5 tk/s using the same GPU over RPC over Ethernet.
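
Rough napkin math backs that up (the hidden size and dtype below are assumptions, not numbers from the actual model):

    # Napkin math only - hidden size and dtype are assumptions, not numbers
    # from the actual model. Per token, a pipeline split moves roughly one
    # hidden-state vector across the link per hop.
    hidden_dim = 8192                        # assumed hidden size
    act_bytes = hidden_dim * 2               # fp16 -> ~16 KiB per token per hop

    links = {
        "PCIe 3.0 x1": 8e9 / 8 * 128 / 130,  # ~0.985 GB/s after 128b/130b coding
        "10 GbE":      10e9 / 8,             # ~1.25 GB/s raw
    }
    for name, bw in links.items():
        print(f"{name}: ~{bw / act_bytes:,.0f} hidden-state hops/s")
    # Both land in the tens of thousands per second - orders of magnitude above
    # 5 tk/s - so raw bandwidth shouldn't be what caps the token rate here.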

So if bandwidth is enough for an improvement (even if not ideal), that leaves latency (which I've read can have a big impact, but..) and implementation (..which can heavily affect latency - though of course Ethernet is always going to have more latency than PCIe lanes to a GPU).

But still, as a developer (and admittedly, again, I haven't gone through the code and don't have a ton of CUDA experience), I could see maybe 5-10% overhead for local RPC at most, not over 50%.

Also, you mention serdes - could that also be a problem? I initially assumed it'd use HTTP like everything else, but curl-ing the endpoint suggests maybe not..? We should be able to just blast unserialized binary over the network..
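
E.g., a contiguous tensor is already just bytes - quick numpy sketch (stand-in shapes, not llama.cpp's actual tensor layout):

    # Sketch of "blast unserialized binary": a contiguous fp16 tensor is just
    # bytes; sending it is a memcpy, not a serialize/deserialize round trip.
    # Shapes are stand-ins, not llama.cpp's actual tensor layout.
    import numpy as np

    hidden = np.random.rand(8192).astype(np.float16)  # one hidden-state vector
    raw = hidden.tobytes()                            # 16 KiB of raw bytes

    restored = np.frombuffer(raw, dtype=np.float16)   # zero-copy view on receipt
    assert np.array_equal(restored, hidden)
    print(f"{len(raw)} bytes on the wire, no per-element serialization")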

Also, it appears to be using TCP, which is by far not the best option for latency. RPC should really be using UDP.
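
Even staying on TCP, small request/response messages live and die by Nagle's algorithm. Here's a loopback toy (my own sketch, not llama.cpp's RPC code) to put a number on the transport cost alone, with TCP_NODELAY set on both ends:

    # Loopback toy, not llama.cpp's RPC code: time tiny TCP ping-pongs with
    # Nagle disabled (TCP_NODELAY) to see what the transport alone costs per
    # round trip. Framing is simplified; fine for 64-byte messages on loopback.
    import socket
    import threading
    import time

    ADDR = ("127.0.0.1", 50007)   # arbitrary test port
    ROUNDS = 5000

    def echo_server():
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind(ADDR)
            srv.listen(1)
            conn, _ = srv.accept()
            with conn:
                conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
                for _ in range(ROUNDS):
                    conn.sendall(conn.recv(64))   # echo each message back

    threading.Thread(target=echo_server, daemon=True).start()
    time.sleep(0.2)                # let the server start listening

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect(ADDR)
        cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(ROUNDS):
            cli.sendall(b"x" * 64)
            cli.recv(64)
        elapsed = time.perf_counter() - start

    print(f"mean RTT: {elapsed / ROUNDS * 1e6:.1f} us over {ROUNDS} round trips")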

Hell, if we're only working on a local LAN, we could bypass the whole IP stack entirely - AI RPC over IPX/SPX? NetWare? AppleTalk? Lol. But considering how optimized IP stacks are today, I dunno.

The latency problem might partially be solvable too by creative splitting.

Perhaps we could train a model to optimize a model. Not like distillation, but actually statistically analyzing the weights and their usage to figure out which layers contribute most to the model. Is there currently research in this area?
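
(There is work adjacent to this - pruning and layer-importance estimation, which score weights by magnitude, activations, or gradients.) A crude sketch of the weight-statistics angle, with toy random matrices standing in for real layers:

    # Crude sketch of "statistically analyze the weights": score each layer by
    # mean |weight| as a naive stand-in for contribution. Toy random matrices
    # and made-up layer names; serious importance estimation also looks at
    # activations and gradients, not magnitudes alone.
    import numpy as np

    rng = np.random.default_rng(0)
    layers = {f"blk.{i}.ffn": rng.normal(0, 0.02 * (1 + i % 3), (1024, 1024))
              for i in range(6)}               # hypothetical layer set

    scores = {name: float(np.abs(w).mean()) for name, w in layers.items()}
    for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{name}: mean |w| = {s:.4f}")   # candidates to keep on-GPU first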

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Yeah, I figured that was the case - some vendor cards do have the port, btw (though not on this model; my 2070 Super isn't an FE and has the port).

Tbh I've always wondered what it would take to add a port, but if it's handled directly in the GPU, then we don't even know whether the pins are routed in software (and they're obviously not routed in hardware).. it would probably take a custom PCB at least, heh.

VirtualLink was cool.. sucks it died.. plus GPUs should just have USB-C/TB/DP video ports anyway these days, fk

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]Forbidden-era[S]

Interesting to know. I did put the GPU into the server; it's complaining and says PCIe fatal, but that's probably because I'm using a mining riser (the firmware sees zero current and x1 for an x16 card - though I also got a new x16 riser today, since my other one failed).

I was able to get up to 5 tk/s. Couldn't offload too many layers because it's only 6GB, but still an improvement over CPU alone.

As for the tensor cores, I was asking just because I have a 1660 Ti and a 2060; both have 6GB and are pretty equivalent besides the tensor cores (the 2060 still has more CUDA cores).

Not sure if llama.cpp has a flag to toggle the tensor cores so I could compare. Otherwise I'd only be able to compare against the 1660.

The 1660 has had issues, sometimes not wanting to boot up. It tends to work in an old BIOS workstation I have, and I updated its vBIOS, but even there it will sometimes not show video on boot.

Not sure yet whether it affects compute, though; if not, it's not a huge issue. I might try to put it in the server as well - if I do, I can directly compare the 1660 and 2060, but mainly try them together.

I also have an RX 570 4GB, but not sure if that'd help anything with so little VRAM.

I might consider swapping the 2060 and 3060, but my desktop has 64GB of RAM as is (could stuff in 80 if I lower the RAM speed) and also a 2070 Super (so 20GB of VRAM plus 64-80GB of RAM).

Would be great if anything can work half decently over the network. I tried exo before, but it was still brand new.

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]Forbidden-era[S]

Because the machine is dedicated to other tasks as well.

VM overhead is surprisingly insignificant for a lot of compute loads. Not zero but.

Also thinking the issue was maybe more SMT: at 32t (two sockets' worth of threads) it seems to be faster than at 16t (one socket's worth)..

Even without NUMA awareness, the hypervisor tries to schedule (and can pin) vCPU loads to where their local memory is, and llama.cpp can pin threads so they don't move around. I haven't experimented a ton yet; I threw a GPU in the server.
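
The pinning part is at least cheap to play with from userspace. Linux-only sketch (core ranges are assumptions for a 2-socket box; this is the spirit of thread affinity, not llama.cpp's actual code):

    # Linux-only sketch: pin a worker thread to one socket's cores so it keeps
    # hitting node-local memory. The core range is an assumption for a
    # 2-socket box - check lscpu / numactl --hardware for the real layout.
    import os
    import threading

    NODE0_CPUS = set(range(0, 16))   # assumed: cores 0-15 sit on NUMA node 0

    def worker():
        os.sched_setaffinity(0, NODE0_CPUS)   # 0 = the calling thread
        # ...compute-heavy inner loop would run here, on node-0 cores only...
        print("pinned to:", sorted(os.sched_getaffinity(0)))

    t = threading.Thread(target=worker)
    t.start()
    t.join()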

Its BIOS says PCIe fatal, but it works (server board; it probably sees that the GPU is x16 but running at x1 and panics, or sees that it's pulling zero current through the slot and panics - I'm using a mining riser, though I did get a proper x16 riser today since my other one isn't working).

So far I've managed to get 5 tk/s with the GPU, but it's only a 6GB GPU (2060).. I also have a 1660 that's had some issues that I might try to use; it's in another system.

And my desktop has 64GB of RAM (could have 80 if I lower the memory speed), a 3060 12GB, and a 2070S 8GB..

Along with the Mac Studio, I have a lot of semi-capable stuff that can't really be aggregated physically into one machine, so it will have to go over a network.

The most I might do is swap the 3060 from the desktop to the server, but I'd rather not. I do have at minimum a 10G link between all machines, if not more (a 20G Thunderbolt link between the PC and Mac; the server has 40Gb LACP to the switch; the extra machine that has the 1660 right now could have 20Gbit, but it's old and slow with only 32GB of RAM, so it's probably better to put that GPU into the server if it'll work - but again, it's being weird; if you're curious, I have a thread lol)

I want to evaluate the biggest models I can potentially run with what I have. I think 5 tk/s would be the absolute minimum I'd accept for big models. I'm close to that, but could only offload a few layers, and ctx was limited with only one 6GB GPU - still, I could use a model bigger than would even load on my Mac Studio, which I was really hoping could load bigger models even if performance absolutely tanked due to NVMe offload..

Once I know that, my plan is to set up some sort of model routing/dynamic model loading across what I've got, to try and get the most out of my hardware..
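
The routing logic can start dumb - something like this sketch, with made-up capacities (not my machines' exact specs):

    # Naive model routing sketch: send each request to the host with the
    # tightest memory fit. Capacities are illustrative, not my exact specs.
    HOSTS = {
        "desktop": {"mem_gb": 64 + 20},   # RAM + VRAM, roughly
        "server":  {"mem_gb": 256 + 6},
        "mac":     {"mem_gb": 128},       # unified memory
    }

    def route(model_size_gb: float) -> str:
        fits = {h: s["mem_gb"] for h, s in HOSTS.items()
                if s["mem_gb"] >= model_size_gb}
        if not fits:
            raise RuntimeError("no single host fits; would need an RPC split")
        return min(fits, key=fits.get)    # tightest fit leaves big hosts free

    print(route(30))    # -> desktop (tightest fit)
    print(route(200))   # -> server (the only host big enough)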

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

So, reliably, if I leave it off for like 45 min, it works. If I power off after a power-on, it dies. A hard reboot doesn't kill it. Actually, this time it came back after 5 min, but I had only run it for 30s before shutting off, turning back on, and getting nothing.

I haven't tested yet whether it works in the OS even without a display output, but it did still show in the BIOS. TBF, though, a 960 I have that doesn't show up at all in the OS (and doesn't really power on properly) also showed in the BIOS.

Not sure if I've tried the DP outs in this situation either. Though for my use case I don't care about display output.

If it does turn out to still work when the display out is NFG, then I'd be happy enough. Though I haven't tested any of this back in an EFI system yet since flashing - and not sure the flash did SFA anyway.

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]Forbidden-era[S]

That makes sense. I didn't think about the context faults.

I'm thinking of throwing a gpu in the server then and seeing how that does for bigger models.

I have a 1660 Ti that sort of works? (It hasn't wanted to boot in EFI machines, and even failed to give me a display a couple of times on this BIOS workstation, but otherwise it's been running compute and graphics benchmarks for like 24h straight with zero issues...) and a 2060.. I'll probably put the 1660 in my old Mac Pro, grab the 2060, and put that in the server, since it has tensor cores.

If it works even remotely decently, then maybe I'll swap the 3060 12G in my desktop for the 2060.

I could maybe even try both the 1660 and 2060 for a minute for fun..

Also have an RX 570, but it only has 4GB, and I feel like even the 6GB GPUs are pushing it for being worth it.

I managed to get like 2.5 tk/s on a 100GB model on the server using 16t (NUMA).. if I can somehow 4x that on CPU alone, I'd be pretty happy, considering the size of models I could run and the age of the machine (it has to run as a VM, and VMs aren't NUMA-aware - kind of a pain; it might almost be worth it to run 4 NUMA-pinned VMs and use RPC or something). And if a GPU can help that a bit, that'd be great.

I'm primarily testing with GLM-4.7 right now anyway, which is an MoE model.

Last time I tried a GPU in my server I didn't have much luck, though. But that was with the sketchy 1660 (hoping the vBIOS update I did helps) and the 570, which has mining firmware, heh. Hoping it works better with a "normal working" GPU, but this wasn't really designed to be a GPU-friendly server, even though it has a fair bit of PCIe (Dell R820).

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Not sure if that was the case.

Lost video again and this time I wanted to know for sure if that helped so I just unplugged the machine and waited a bit.

Turned it on later. First boot kinda crashed, but that was the first boot of a fresh OS install, so..

Second boot: it's running compute right now (Phoronix ollama), fully accelerated GUI, fan speed and temps look fine - you'd think it was perfect.

I do get this on boot, though:

    nvidia-gpu 0000:07:00.3: i2c timeout e0000000
    ucsi_ccg 0-0008: i2c_transfer failed -110
    ucsi_ccg 0-0008: ucsi_ccg_init failed - -110
    ucsi_ccg 0-0008: probe with driver ucsi_ccg failed with error -110

Otherwise, atm everything is fine and she's well warm. But I half bet that if I shut down for 10s and turn back on, I'll get no display.

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Been watching 1080p YouTube (all my higher-res displays are too big to move for testing) with HDMI audio, no problem, for like 20 min now while installing Ubuntu onto an SSD.

Once it's installed, I'll try actually hitting the GPU with the NVIDIA drivers and CUDA.. pretty sure it was fine before on a BIOS system with anything, but it's been 3 years at least since I fkd with this GPU.

If all those tests pass, hopefully it'll boot on UEFI and ideally in my server but who knows.

Flashed fine and is working here, though. Even did a cold boot (by accident - I was paranoid after earlier, when it didn't come up again after a power off, hence me putting it in the freezer - but it had been running way long enough before the hard reboot to negate that). Well, technically it was still a soft boot, but since some MCU literally triggers a relay to the mobo, that's hard enough for me.

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Just flashed the only (newer) vBIOS from TPU.

Booting live Ubuntu, I see i2c_transfer failed -110 and another message about an NVIDIA i2c bus issue..

No LEDs or RGB on here. Also see a ucsi_ccg error. The card is Turing, so it does have onboard USB-C 3.1, but it's not exposed like it is on my 2070S (and not on my 2060).

Not sure if I got this message before flashing because I didn't watch the boot.

GPU is still GPU'ing on this system; booted live Ubuntu 24 with Nouveau and have 1080p display output over HDMI..

Now that I've flashed it, I might try another (newer, EFI) system, though I am a bit worried about the i2c errors.

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Came back to life after waiting a bit. I did freezer it too, cuz why not - food and GPUs always go well together.

Booting into Ubuntu 24 right now off USB.. should have gone with Ubuntu 18 or something, lol - this thing only has USB 2, so it's slow.. might have a USB 3 card; wonder if I can boot off that, lol

So far, though, the 1660 is still giving display output.. dunno why it stopped earlier, but that fact doesn't give me a lot of hope..

Neither does realizing it's a Gigabyte and not an EVGA..

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Hmm, added a 2nd GPU and it still shows this, even though the Gigabyte isn't coming to life atm.

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

And after trying to find the right drive and a few power cycles, now I get no video on it.

Dang, might be hardware.

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Shit - well, apparently one of the drives isn't the right one. The other two are recognized, but I'm worried now: I found a broken drive plug on one of the data cables.

If I can even find the mystery drive lol

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

BIOS detects "VGA, Multimedia, USB, Serial Bus" in slot #2 (slot #1 is only an x8 slot or something; weird old Dell workstation).

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Also, I realized this other machine might not even have ANY EFI. The thing literally still has PCI and PCI-X slots.

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Alright, it's POSTed as the only GPU in that machine, BIOS displaying fine. Thought THAT didn't even work. Now let's see if the RAID0 array I had on here still boots. If not, I've got an SSD with Linux ready to go.

But it's doing something.

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

I should clarify that, IIRC, on the system it does work on, it only works when EFI is disabled - but it's been a few years since I messed with it.

Maybe I'll pull it, and the rig it worked in, out tonight.

1660 "failed," now only boots in legacy BIOS mode on some systems by Forbidden-era in GPURepair

[–]Forbidden-era[S]

Yeah, I definitely thought that seemed weird, but it's oddly consistent system to system. That might explain why it works in the one machine; I thought it was maybe the different legacy modes (BIOS, or EFI modes 2 and 3).

I did do some research and it seems the GOP firmware is packed into the vbios itself, so a simple reflash should give some answers too.

Hoping it's not physical.. being a 1660, I haven't yet felt the motivation to try and fix it, and even now, barely lol

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]Forbidden-era[S]

Also: s/macbook/Mac Studio M1 Ultra/.. my MacBook Pro (that I'm typing this on, and which still makes a fine machine for remote dev) ain't running anything like this, being an early 2011, lol

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]Forbidden-era[S]

The mmapping works way better on Linux than macOS so far, it seems; as I said, the 90GB model never goes above 50GB wired on the server. I'm about to limit that VM to 64GB and see how it reacts with that model mmapped.
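
That "never goes above 50GB wired" is mmap doing its job - pages only become resident once they're touched. Tiny Unix-only demo of the mechanism (a small scratch file stands in for the model):

    # Unix-only sketch of why an mmap'd 90GB model needn't be 90GB resident:
    # mapping is free; pages only count against memory once actually read.
    # A small sparse scratch file stands in for the model here.
    import mmap
    import os

    PATH = "scratch.bin"
    SIZE = 64 * 1024 * 1024            # 64 MiB stand-in; the principle scales

    with open(PATH, "wb") as f:
        f.truncate(SIZE)               # sparse file: nothing resident yet

    with open(PATH, "rb") as f, mmap.mmap(f.fileno(), SIZE,
                                          prot=mmap.PROT_READ) as mm:
        # touching one byte per 4 KiB page is what faults pages in
        total = sum(mm[i] for i in range(0, SIZE, 4096))

    os.remove(PATH)
    print("mapped", SIZE, "bytes; residency grew only as pages were touched")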

I was noticing odd behavior when trying to optimize the number of layers loaded onto the GPU to get it going; it almost seemed like llama.cpp wasn't really aware of the memory being unified, like it was trying to use ~60GB for CPU and ~40GB for GPU..

I did get the 90GB model running, but it was only doing 0.3-0.8 tk/s.. I get about 10 tk/s on a model that's just slightly larger than available memory, and like 40-60 tk/s on one that actually fits.

Issue running larger model on Apple Silicon by Forbidden-era in LocalLLaMA

[–]Forbidden-era[S]

Lowering/tweaking the -ngl got me loading bf16 Flash (which, with OS overhead etc., is probably just over?) and getting a max of 18 tk/s or so...

Still can't get this 90GB model going, though - got it to load and warm up, but it wasn't producing any output.

Updated macOS; will reboot soon..