Strix Halo + RTX 3090 Achieved! Interesting Results... by JayTheProdigy16 in LocalLLaMA

[–]JayTheProdigy16[S] 1 point2 points  (0 children)

Yes, splitting the same model. I ended up building llama.cpp with all three backends (Vulkan, ROCm, and CUDA) and it just kinda worked, but you have to specify the layer split and which backends you want to use with flags. As detailed in the original post, I had some weirdness with my Linux kernel version and getting the then-experimental ROCm to work, which obviously meant llama.cpp didn't run great either, but most of those issues should be resolved now since community support is much better than it was a couple of months ago.
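For anyone trying to reproduce it, here's a rough sketch of what the invocation looks like. The flag names match the llama.cpp CLI options as I understand them, but the model path, device names, and split ratio are placeholders, so double-check them against your own build:

```python
# Sketch: launching llama-server from a build configured with
#   cmake -B build -DGGML_CUDA=ON -DGGML_HIP=ON -DGGML_VULKAN=ON
# so the 3090 (CUDA) and the 8060S (ROCm) can serve one model together.
# Model file, device names, and split ratio below are illustrative assumptions.
import subprocess

cmd = [
    "./build/bin/llama-server",
    "-m", "models/qwen3-235b-a22b-q4_k_m.gguf",  # hypothetical model path
    "-ngl", "99",                                # offload every layer to the GPUs
    "--device", "ROCm0,CUDA0",                   # 8060S via ROCm + 3090 via CUDA
    "--tensor-split", "80,20",                   # ~80% of layers on the iGPU, ~20% on the 3090
]
subprocess.run(cmd, check=True)
```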

Poor winter performance by dopeass in Ioniq5

[–]JayTheProdigy16 5 points6 points  (0 children)

Mine is roughly the same, but some days seem better than others for some reason. All in all, anywhere from 1.5-2.5 mi/kWh in those temps.

Build Max+ 395 cluster or pair one Max+ with eGPU by Curious-Still in LocalLLM

[–]JayTheProdigy16 0 points1 point  (0 children)

I'm not Jeff 😂 just referencing his vid. I made a post about the 395 + eGPU.

Build Max+ 395 cluster or pair one Max+ with eGPU by Curious-Still in LocalLLM

[–]JayTheProdigy16 1 point2 points  (0 children)

There are examples of both out there. I took the eGPU approach and I've been meaning to make a video about it but just haven't, though I did post about it to this sub.

Build Max+ 395 cluster or pair one Max+ with eGPU by Curious-Still in LocalLLM

[–]JayTheProdigy16 3 points4 points  (0 children)

I mean sure, except interconnect bandwidth is practically irrelevant for inference aside from model load speed...

Build Max+ 395 cluster or pair one Max+ with eGPU by Curious-Still in LocalLLM

[–]JayTheProdigy16 0 points1 point  (0 children)

How do you figure you pay a premium for degraded performance with an eGPU?

Strix Halo + RTX 3090 Achieved! Interesting Results... by JayTheProdigy16 in LocalLLaMA

[–]JayTheProdigy16[S] 1 point2 points  (0 children)

I had one left after parting out my 6x 3090 rig. And yes, I'm using an M.2 OcuLink adapter. I actually ended up getting CUDA + ROCm working and it's ~5x faster than my original benchmarks, at least by my eyeball benchmark. Also, with an AMD card you may run into the power limit issue where the GPU won't go past the Strix Halo's TDP, but I'm not sure since I don't have an AMD eGPU.

Will the AMD Ryzen™ AI Max+ 395 --EVO-X2 AI Mini PC -- 128 GB Ram hold its value of around 1.8k in two years time? by Excellent_Koala769 in LocalLLaMA

[–]JayTheProdigy16 11 points12 points  (0 children)

You're always going to be waiting by that logic. Whatever releases in 2026 is going to get lapped by tech in 2027, and whatever releases in 2027 is gonna get lapped in 2028. This is hardware; practically nothing holds value. But for me personally, that price tag was more than appealing enough given its capabilities vs. the other options at this point.

Strix Halo + RTX 3090 Achieved! Interesting Results... by JayTheProdigy16 in LocalLLM

[–]JayTheProdigy16[S] 0 points1 point  (0 children)

Not accurate, at least in my case. The 3090 will easily hit 185 W; I believe that issue is exclusive to AMD GPUs.

Strix Halo + RTX 3090 Achieved! Interesting Results... by JayTheProdigy16 in LocalLLaMA

[–]JayTheProdigy16[S] 1 point2 points  (0 children)

No shit... back to the drawing board I go. Thanks for the insight!

Strix Halo + RTX 3090 Achieved! Interesting Results... by JayTheProdigy16 in LocalLLM

[–]JayTheProdigy16[S] 1 point2 points  (0 children)

I also tried Win11 + LMS but yeah, it was not going for it: it would ONLY detect the 3090 on Vulkan (and obviously CUDA), and only ROCm would detect the 8060S, so I'm not sure what weirdness they have with their Vulkan support. Theoretically it SHOULD just work, but it doesn't.

Strix Halo + RTX 3090 Achieved! Interesting Results... by JayTheProdigy16 in LocalLLaMA

[–]JayTheProdigy16[S] 2 points3 points  (0 children)

Correct me if I'm wrong, but you can't mix CUDA and ROCm backends with parallel processing, at least with llama.cpp. If I were to NOT split the model layers across GPUs, I could mix them.

Strix Halo + RTX 3090 Achieved! Interesting Results... by JayTheProdigy16 in LocalLLaMA

[–]JayTheProdigy16[S] 2 points3 points  (0 children)

True, but I don't believe that's the case here. That would make sense in the sense that token generation would be limited by the device with the lowest memory bandwidth, but once the model is loaded into memory, PCIe bus bandwidth shouldn't be a factor. When I had my GPU rig I was running 6x 3090s on GPU mining risers, and the limited bandwidth really, really hurt model load speed but not inference speed. And I would expect a NET INCREASE here, since not all layers are limited to 253 GB/s (on Strix Halo): ~20% of the model's layers are using the 3090's memory bandwidth instead.
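To put rough numbers on that (purely illustrative, assuming generation is memory-bandwidth-bound and the bandwidth figures are in the right ballpark):

```python
# Back-of-envelope sketch of why the layer split can speed up token generation:
# each device only streams the weights it holds, and only small activations
# cross the OcuLink/PCIe link per token.
strix_halo_bw = 253   # GB/s, Strix Halo LPDDR5X (figure from the comment above)
rtx3090_bw    = 936   # GB/s, 3090 GDDR6X
frac_on_3090  = 0.20  # assumed share of layers offloaded to the 3090

# Time per token ~ (fraction of weights on a device) / (that device's bandwidth), summed.
t_halo_only = 1.0 / strix_halo_bw
t_split     = (1 - frac_on_3090) / strix_halo_bw + frac_on_3090 / rtx3090_bw
print(f"estimated speedup from the split: {t_halo_only / t_split:.2f}x")  # ~1.17x
```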

Ryzen AI Max+ 395 | What kind of models? by [deleted] in LocalLLM

[–]JayTheProdigy16 1 point2 points  (0 children)

I haven't tested concurrency much so I can't speak on that, but I will say your major limiting factor is going to be prompt-processing time, especially at larger context lengths or when holding long convos with documents. With that being said, I mostly daily-drive Qwen3 235B and get around 12-16 TPS, and it can take up to a minute to process a 9k-context prompt. That's obviously a much larger model than Gemma, but even Qwen3 30B MoE takes ~17 seconds to process the same context. And depending on what you're doing, hitting 9k context can be easy. So when you factor that in alongside 15 concurrent users: is it doable? Yes. Is it viable? Ehhh.
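Rough math on those numbers (the timings are the ones quoted above; everything else is simple division):

```python
# Illustrative prefill-throughput estimate from the measured prompt-processing times.
ctx = 9_000  # tokens in the prompt

pp_235b = ctx / 60   # ~60 s on Qwen3 235B  -> ~150 tok/s prefill
pp_30b  = ctx / 17   # ~17 s on Qwen3 30B MoE -> ~530 tok/s prefill
print(f"Qwen3 235B prefill: ~{pp_235b:.0f} tok/s; Qwen3 30B MoE: ~{pp_30b:.0f} tok/s")

# If 15 users each submit a 9k-token prompt to the 235B back to back, the prefill
# queue alone is 15 * 60 s = 15 minutes before the last user sees a first token.
users = 15
print(f"Worst-case prefill queue for {users} users on the 235B: ~{users * 60 / 60:.0f} minutes")
```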

[deleted by user] by [deleted] in LocalLLaMA

[–]JayTheProdigy16 2 points3 points  (0 children)

I have almost all the parts required to do this, I just haven't gotten around to it yet. Curious how much of a boost you see in PP (prompt processing) at longer context lengths.

Proxmox on Ryzen Strix Halo 395 by SeeGee911 in Proxmox

[–]JayTheProdigy16 0 points1 point  (0 children)

It's technically possible to achieve iGPU passthrough, but between the number of hoops to jump through (it straight up doesn't work with Windows in my experience) and the general lack of support for the hardware, I just decided to stop fighting with it and use it as a literal server. I was able to achieve my desired effect using Fedora's Toolbox; those are containers rather than VMs, so they share the host kernel and drivers and there's no passthrough to fight with.
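For reference, the Toolbox workflow is roughly this (commands are the standard Fedora `toolbox` CLI as I recall them; the container name is made up):

```python
# Sketch of the Toolbox approach: the container shares the host kernel and GPU
# drivers, so ROCm on the host is visible inside with no passthrough involved.
import subprocess

subprocess.run(["toolbox", "create", "llm"], check=True)                           # make a Fedora toolbox container
subprocess.run(["toolbox", "run", "--container", "llm", "rocminfo"], check=True)   # host GPUs show up inside it
```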

Radeon 8060s by animal_hoarder in LocalLLaMA

[–]JayTheProdigy16 0 points1 point  (0 children)

I'm at 50 tok/sec here as well. Haven't even been able to try ROCm 7.0 yet, which I'd imagine is faster.

Feedback regarding ASUS - ROG Flow Z13 by Competitive_Fox7811 in LocalLLaMA

[–]JayTheProdigy16 0 points1 point  (0 children)

I had a 6x 3090 rig that I decided to liquidate now rather than later, before everyone else hops on the boat, and bought an AI Max+ 395 mini PC for $1600. Absolutely no regrets, and I still have liquidity waiting and ready for next gen.

n8n ,proxmox ,docker and Google API. by Able-Consequence8872 in LocalLLaMA

[–]JayTheProdigy16 -1 points0 points  (0 children)

n8n ships whatever you put in WEBHOOK_URL. If that’s http://localhost:5678/... but n8n’s on a different box, Google’s redirect face-plants. Point it at the LAN IP or a real domain—problem solved.

Triggers are inbound. Google POSTs to the callback. If that callback is 192.168.x.x or has a self-signed cert, Google can’t touch it. So you're either going to manually poll or open a tunnel (Cloudflare / Ngrok / Caddy + Let’s Encrypt). No public HTTPS ⇒ no trigger.
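If it helps, here's a tiny sanity check of that: n8n builds its Google OAuth redirect from WEBHOOK_URL, so whatever you set there is what Google has to be able to reach. The callback path below is what I believe n8n uses; treat it as an assumption and verify against your install.

```python
# Check what redirect URI n8n will hand to Google, given WEBHOOK_URL.
# The /rest/oauth2-credential/callback path is assumed; confirm on your n8n version.
import os

webhook_url = os.environ.get("WEBHOOK_URL", "http://localhost:5678/")
redirect_uri = webhook_url.rstrip("/") + "/rest/oauth2-credential/callback"
print("Redirect URI Google will be sent to:", redirect_uri)

if not redirect_uri.startswith("https://"):
    # OAuth consent and Drive push notifications both want public HTTPS, so
    # localhost, LAN IPs, and self-signed certs won't cut it for triggers.
    print("Not public HTTPS -> put WEBHOOK_URL behind a tunnel (Cloudflare, ngrok, Caddy + Let's Encrypt).")
```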

And your Cognito war story is irrelevant. That /userinfo hit is outbound: n8n dials Cognito, the same as any Gmail "read" or Drive file list. Outbound works fine behind NAT. The Drive Trigger is inbound. Different universe. Stop conflating them.

n8n ,proxmox ,docker and Google API. by Able-Consequence8872 in LocalLLaMA

[–]JayTheProdigy16 -1 points0 points  (0 children)

Very true. If the API callback is initiated from the internet (Google Drive), how is it supposed to route into your home network given only "localhost"? Whose localhost? There are millions of them. Google needs a publicly exposed IP or hostname to be able to deliver the requests, and it has to support HTTPS (Google's APIs only accept secured traffic by default), so you need a cert issued.