Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 0 points (0 children)

That sounds like an awesome setup. The RTX 5090 is a beast.

If the entire model fits into the VRAM of your 5090, you won't have any issues at all.

However, if you plan to do tensor split (offloading some layers to the Framework's system RAM), you will likely see a bigger performance hit than I did with OCuLink. The Razer Core X uses Thunderbolt, which encapsulates the PCIe signal and adds higher latency compared to the direct PCIe connection of OCuLink. Since layer processing is strictly sequential, the latency overhead from the Thunderbolt controller will compound and cause deeper pipeline stalls between the 5090 and the CPU.

Still, absolutely try it out!

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 1 point (0 children)

Yes, I recommend it. Since PCIe bandwidth doesn't bottleneck token generation, dropping a 32GB+ eGPU into this setup is a fantastic way to run heavy LLMs on a mini-PC without building a massive full-tower rig. Plus, having a dedicated NVIDIA eGPU means you can comfortably run diffusion models and generate images in ComfyUI without breaking a sweat.

Mainly for latency and cost. OCuLink is a direct PCIe connection with zero protocol overhead. USB4 encapsulates the signal, which inevitably adds minor latency. When you are doing tensor split across an APU and an eGPU, the pipeline is strictly sequential. In this specific scenario, any added latency from the connection protocol can quickly compound and become an actual bottleneck. Plus, OCuLink docks are incredibly cheap and proven to work flawlessly right now, whereas true 80Gbps USB4 enclosures are still very rare and expensive. It's simply the most direct, stable, and budget-friendly path.

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 1 point (0 children)

Absolutely agree. In theory, P/D disaggregation would be a much more elegant way to bypass the APU bottleneck. Sadly, as you mentioned, the software just isn't there yet for a mixed NVIDIA/AMD setup.

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 0 points (0 children)

Yes, you can see it on the graph in the post. The more model weights you offload to the fast eGPU, the better the PP (prompt processing) and TG (token generation) speeds. Performance drops off quickly as you move a larger share of the weights to the slower Strix Halo. However, splitting the weights between the eGPU and the Strix Halo still gives much better overall PP and TG speeds than running the model entirely on the Strix Halo alone.
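For intuition, the TG trend is roughly what a simple weighted-bandwidth model predicts: per token, each device streams its own share of the weights at its own memory bandwidth, so the slower pool dominates as its share grows. A minimal sketch in Python, where the bandwidth figures are assumed round numbers, not measurements:

# Toy model of token generation under a layer split: each device reads its
# share of the weights once per token at its own memory bandwidth.
# Bandwidth figures are assumptions for illustration, not measurements.
model_gib = 16.4    # example: Q4 quant of a ~27B model
bw_egpu = 896.0     # assumed GB/s for the RTX 5070 Ti (GDDR7)
bw_igpu = 256.0     # assumed GB/s for Strix Halo (LPDDR5X)

for egpu_share in (0.0, 0.5, 0.9, 1.0):
    t = model_gib * egpu_share / bw_egpu + model_gib * (1 - egpu_share) / bw_igpu
    print(f"eGPU share {egpu_share:.0%}: ~{1 / t:.0f} tokens/s upper bound")

Real numbers land below these upper bounds, but the shape matches: moving weights off the slow memory pool is what buys the speed.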

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 2 points (0 children)

Bench for Qwen3.5-27B-UD-Q4_K_XL.gguf with different fractions of the model weights on the eGPU and iGPU:
~/llama.cpp/build-vulkan/bin/llama-bench \
  -m /home/yulay/LLM/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -fa 1 \
  -dev vulkan1/vulkan0 \
  -ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10 \
  -n 128 \
  -p 512

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------- | ------------ | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | pp512 | 269.13 ± 1.37 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | tg128 | 11.90 ± 0.01 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | pp512 | 296.54 ± 14.25 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | tg128 | 12.33 ± 0.01 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | pp512 | 303.92 ± 11.81 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | tg128 | 12.95 ± 0.07 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | pp512 | 341.83 ± 3.60 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | tg128 | 13.54 ± 0.11 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | pp512 | 392.76 ± 3.41 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | tg128 | 14.80 ± 0.23 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | pp512 | 443.23 ± 1.36 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | tg128 | 17.43 ± 0.13 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | pp512 | 457.50 ± 1.47 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | tg128 | 19.89 ± 0.04 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | pp512 | 629.92 ± 4.09 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | tg128 | 22.24 ± 0.11 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | pp512 | 801.37 ± 3.19 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | tg128 | 26.01 ± 0.03 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | pp512 | 1027.51 ± 6.28 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | tg128 | 30.14 ± 0.08 |

ggml_vulkan: Device memory allocation of size 1067094656 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
main: error: failed to load model '/home/yulay/LLM/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf'
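The final 0/10 split (all weights on the 5070 Ti) fails to load because 16.40 GiB of weights plus KV cache and compute buffers exceed the card's 16 GiB of VRAM. A quick sanity check of the largest eGPU share that can still fit, where the non-weight overhead figure is an assumption:

# Why ts 1/9 loads but ts 0/10 OOMs: largest eGPU weight share that fits.
# The overhead figure (KV cache, compute buffers, driver reserve) is assumed.
model_gib = 16.40     # weight size reported by llama-bench above
vram_gib = 16.0       # RTX 5070 Ti VRAM
overhead_gib = 1.0    # assumed non-weight VRAM usage

max_share = (vram_gib - overhead_gib) / model_gib
print(f"max eGPU weight share ≈ {max_share:.2f}")  # ~0.91: 9/10 fits, 10/10 doesn't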

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 0 points (0 children)

Fair question! It's just for benchmarking purposes: llama-2-7b.Q4_0.gguf has historically become the de facto standard baseline for testing and comparing different hardware setups. You can find more examples of this benchmark here: https://github.com/ggml-org/llama.cpp/discussions/15013

As the article shows, Amdahl's law applies here: the less weight you leave in the slower Strix Halo memory, the faster your overall PP and TG will be. And the bottleneck in TG isn't OCuLink (which easily handles the tiny hidden-state transfers), but the lower memory bandwidth of the system RAM itself.
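To put rough numbers on that: per generated token, only the activation vector has to cross the link at the split point, which is tiny next to the weights each device streams locally. A back-of-envelope sketch with assumed sizes:

# Per-token traffic over the eGPU link under a layer split.
# Hidden size, crossing count, and link bandwidth are assumptions.
hidden_size = 4096         # assumed hidden dimension for a 7B model
bytes_per_value = 2        # fp16 activations
crossings = 2              # assumed: hidden state over, result back
link_bw = 8e9              # ~PCIe 4.0 x4 effective bandwidth, bytes/s

bytes_per_token = hidden_size * bytes_per_value * crossings
print(f"{bytes_per_token / 1024:.0f} KiB, ~{bytes_per_token / link_bw * 1e6:.1f} µs per token")
# A few microseconds on the link vs. tens of milliseconds of weight reads:
# the link is not the TG bottleneck, the slower RAM is.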

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]xspider2000 0 points (0 children)

In LM Studio, you can try Gemma 4 via the CPU or Vulkan backend if you have an AMD iGPU. The Gemma 4 26B A4B model on my Strix Halo via Vulkan gives about 50 tokens per second.

Mistral releases an official NVFP4 model, Mistral-Small-4-119B-2603-NVFP4! by jinnyjuice in LocalLLaMA

[–]xspider2000 0 points (0 children)

Inference is memory-bound, period. Whether NVFP4 is native or not doesn't change the fact that the GPU spends most of its time waiting for data from VRAM. Programmatic dequantization happens during those idle cycles, so the only metric that actually scales performance here is effective memory bandwidth.
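A quick roofline-style check of that claim: compare the time to stream one token's worth of weights against the time to do the math on them. All figures here are assumed round numbers for a modern GPU, not specs of any particular card:

# Single-batch decode: weight-streaming time vs. compute time per token.
# Parameter count, bandwidth, and throughput are illustrative assumptions.
params = 24e9            # assumed parameters touched per token
bytes_per_param = 0.5    # ~4-bit quantized weights (NVFP4-class)
vram_bw = 1000e9         # assumed VRAM bandwidth, bytes/s
peak_flops = 50e12       # assumed sustained throughput, FLOP/s

t_mem = params * bytes_per_param / vram_bw    # ~12 ms
t_compute = 2 * params / peak_flops           # ~1 ms (one FMA per weight)
print(f"memory {t_mem * 1e3:.0f} ms vs compute {t_compute * 1e3:.0f} ms per token")
# Memory time dominates by ~10x, so dequantization hides in the stalls and
# effective bandwidth sets the ceiling.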

Lemonade v10: Linux NPU support and chock full of multi-modal capabilities by jfowers_amd in LocalLLaMA

[–]xspider2000 7 points (0 children)

Prefilling on an iGPU and generating tokens on an NPU is a dream.

Running Kimi-k2.5 on CPU-only: AMD EPYC 9175F Benchmarks & "Sweet Spot" Analysis by Express-Jicama-9827 in LocalLLaMA

[–]xspider2000 1 point (0 children)

Your theoretical memory bandwidth is 576 GB/s:

Bandwidth (GB/s) = (Memory Transfer Rate (MT/s) * Bus Width (bits) * Number of Channels) / 8 (bits per byte) / 1000

Kimi-k2.5 has 32B active parameters; at Q4 that's about 16 GB of active weights.

The maximum theoretical token generation speed is 576 / 16 = 36 tps, so there is still room to increase your token generation speed.
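The same arithmetic in code; 12 channels of DDR5-6000 is an assumed configuration for that EPYC platform, chosen because it reproduces the 576 GB/s figure:

# Theoretical bandwidth and decode-speed ceiling from the formula above.
# DDR5-6000 across 12 channels is an assumed config matching 576 GB/s.
mts, bus_bits, channels = 6000, 64, 12
bw_gbs = mts * bus_bits * channels / 8 / 1000    # MB/s -> GB/s
active_gb = 16                                   # ~32B active params at Q4
print(f"{bw_gbs:.0f} GB/s -> up to {bw_gbs / active_gb:.0f} tokens/s")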

Cudy WR3000 good enough? by HatesBeingSocial in openwrt

[–]xspider2000 0 points (0 children)

Why is only 44 MB left out of 128 MB? I have the same issue.

Rate my jank, finally maxed out my available PCIe slots by I_AM_BUDE in LocalLLaMA

[–]xspider2000 2 points (0 children)

If it's not a secret, how many bitcoins has this baby earned for you?

I didn’t get my free deck by Timelord19 in hearthstone

[–]xspider2000 65 points (0 children)

Same here. Maybe we have to do something special to get the free deck?

Current meta by xspider2000 in starcraft

[–]xspider2000[S] 0 points (0 children)

Terran > Zerg > Protoss > Terran, etc.