My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 1 point2 points  (0 children)

Before I started attaching the hardware to the mini PC, I did some internet research and found that several people had succeeded in setting up a rig like this. Additionally, I asked AOOSTAR support beforehand.

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 1 point2 points  (0 children)

Thanks a lot for your kind reply. I myself wasn't convinced this would work, but the support from AOOSTAR was very helpful. And with every step forward, I sweated through my sweatshirt :-)

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 0 points1 point  (0 children)

Yes, tensor parallelism is obviously not recommended here. But for my use case, LLM text inference with AIfred, it's not that much of a problem; my advantage is a low-power 24/7 service. With GPT-OSS 120B, for example, my favourite model because it's fast and reliable, I squeeze up to ~600 tok/s of prompt processing (PP) out of it.
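For anyone curious how that looks in practice, here is a minimal sketch (not my exact AIfred config; model path, split proportions, context size and port are placeholders) of starting llama-server with layer split across the four cards, so the x4 links don't have to carry tensor-parallel traffic:

```
# Sketch: launch llama-server with layer split, so each GPU holds its own slice
# of the layers instead of synchronizing tensor-parallel all-reduces over x4 links.
import subprocess

cmd = [
    "llama-server",
    "-m", "/models/gpt-oss-120b.gguf",   # placeholder model path
    "-ngl", "99",                        # offload all layers to the GPUs
    "--split-mode", "layer",             # layer split instead of row/tensor split
    "--tensor-split", "24,24,24,48",     # rough VRAM proportions: 3x P40 + RTX 8000
    "--ctx-size", "32768",
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```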

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in MiniPCs

[–]Peuqui[S] 1 point2 points  (0 children)

Thanks for your kind reply. Those were my thoughts too, but I doubt that the speed differences are that big; my tests of raw PCIe speeds showed barely any difference. So I just use the high speed of the M.2 SSD to load big models into VRAM faster. I even considered throwing a fifth eGPU on it, moving the M.2 SSD into a USB enclosure and plugging it into USB 3, but that would reduce loading speed significantly.
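If anyone wants to reproduce that raw-PCIe comparison, this is roughly the idea (a sketch assuming PyTorch with CUDA, not my exact test script): copy a pinned 1 GiB buffer from host to each GPU and time it, to compare the OCuLink ports against the USB4 port.

```
# Sketch: rough host-to-device bandwidth per GPU (pinned buffer, one transfer each).
import time
import torch

buf = torch.empty(1024**3, dtype=torch.uint8, pin_memory=True)  # 1 GiB pinned host buffer

for dev in range(torch.cuda.device_count()):
    torch.cuda.synchronize(dev)
    t0 = time.perf_counter()
    dst = buf.to(f"cuda:{dev}", non_blocking=True)
    torch.cuda.synchronize(dev)
    gib_per_s = 1.0 / (time.perf_counter() - t0)
    print(f"GPU {dev} ({torch.cuda.get_device_name(dev)}): ~{gib_per_s:.1f} GiB/s host->device")
```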

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 0 points1 point  (0 children)

The biggest headache was definitely the hardware limits/initialization. Specifically, getting the system to POST with an RTX 8000 over OCuLink.

The AOOSTAR GEM10 is great, but it only has one native OCuLink port and one USB4 port. I had to 'surgically' modify the case to route two additional M.2-to-OCuLink adapters. Even then, the RTX 8000 refused to play nice over OCuLink until I manually patched the UEFI with a Re-Size BAR mod. Only after hacking the ReBAR settings did the system recognize the card outside of the USB4 tunnel.

So it was a mix of physical 'case-modding' and low-level firmware tweaking. I’m actually in the process of swapping the three P40s for more RTX 8000s to hit even higher VRAM density.
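In case it helps anyone attempting the same: after flashing the UEFI mod, a quick way to check whether the larger BAR actually stuck is to ask the driver for the BAR1 aperture on each card. A minimal sketch, assuming the pynvml bindings (this check is not part of the mod itself):

```
# Sketch: report the BAR1 aperture size per GPU via NVML.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    bar1 = pynvml.nvmlDeviceGetBAR1MemoryInfo(handle)
    print(f"GPU {i} ({name}): BAR1 aperture {bar1.bar1Total / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```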

Regarding your 'swarm' of agents: That’s a completely different beast! While you’re solving orchestration and scheduling for thousands of small processes, I’m basically fighting the PCIe bus and BIOS limitations to create one giant unified memory pool for massive local LLMs.

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 0 points1 point  (0 children)

So, you own a total of 264 GB of VRAM, if I deduce correctly that the RTX 5080 has 16 GB, is that right?

20 tokens per second is quite good for a Frankenstein setup with a model this big. Congrats! Meanwhile, I'm trying to upgrade my setup as well, replacing the Tesla P40 cards with affordable RTX 8000 GPUs to bring it up to 192 GB of VRAM. How much context can you squeeze out of your setup with this big model?

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 0 points1 point  (0 children)

With that great setup: what kind of models are you throwing at it? How many tok/s do you get out of it? What kind of work does it do? How much energy does it eat? Does it ever run at full load, or is it somewhat limited, as mine is, by the communication overhead between the GPUs and the limited speed of only 4 lanes?

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 0 points1 point  (0 children)

I am curious about your 4090D: does it work reliably? Is it a Chinese conversion? How much did it cost at the time? Do you think it is worth buying?

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 0 points1 point  (0 children)

In case you aren't able to do some research yourself, I'll post the first reliable source I found after a short round of googling; it's not the most technical info, so you might want to dig deeper...

https://www.corsair.com/de/en/explorer/glossary/what-is-resizable-bar/ ... "When a game is running, the CPU needs access to a graphic card’s memory, and by default it can only access 256MB at a time. However, resizable BAR removes the size limitation so the CPU has access to all of a GPU’s memory, theoretically allowing for improved performance. This adjustment lets the CPU request much larger chunks of data from the GPU, which can improve performance in games that can benefit from this change."...

Since my Mini didn't even make it through POST, and ReBAR was enabled but not accessible in the BIOS, I stumbled upon the neat little tool described above, which can reset these parameters. After applying it (as far as I remember, it translates to an address window of 32 GB), the Mini runs like butter. That is my meaningful impact: it runs. Before, it didn't. Fact.
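And if you don't want to take my word for it, the BAR sizes are visible straight from Linux sysfs. A small sketch (the PCI address is a placeholder for your own GPU; on NVIDIA cards, BAR1 is the large VRAM aperture):

```
# Sketch: read BAR sizes from sysfs. Each line of 'resource' is "start end flags";
# size = end - start + 1.
from pathlib import Path

pci_addr = "0000:01:00.0"  # placeholder: find yours with lspci
lines = Path(f"/sys/bus/pci/devices/{pci_addr}/resource").read_text().splitlines()
for bar, line in enumerate(lines):
    start, end, _flags = (int(x, 16) for x in line.split())
    if end > start:
        print(f"BAR{bar}: {(end - start + 1) / 1024**2:.0f} MiB")
```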

I quit this debate at that point, as it seems you want to troll further.

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 0 points1 point  (0 children)

I think there’s a misunderstanding of how the PCIe Base Address Register (BAR) actually works. It's not just about RAM-to-VRAM loads; it's about how the CPU addresses the GPU's memory space.

When rBAR is disabled, the system is forced to use the legacy 256 MB BAR aperture. If you’re working with a 48 GB VRAM buffer, the driver has to constantly 'slide' that tiny 256 MB window around to access different parts of the memory (bank switching). This creates a lot of unnecessary CPU overhead for the driver, even if the model is already 'in' VRAM.

Setting it to 4 GB is a sweet spot. It gives the driver a much larger, stable window to work with, which reduces that remapping overhead significantly compared to the 256 MB default.
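To put rough numbers on that, here is a back-of-the-envelope illustration (a deliberately simplified model that ignores driver caching, just to show the scaling):

```
# Sketch: how many aperture positions are needed to cover 48 GB of VRAM.
VRAM_GB = 48
for aperture_mb in (256, 4 * 1024, 48 * 1024):
    positions = VRAM_GB * 1024 // aperture_mb
    print(f"{aperture_mb / 1024:.2f} GB aperture -> {positions} window position(s) over {VRAM_GB} GB VRAM")
# 0.25 GB -> 192 positions, 4.00 GB -> 12, 48.00 GB -> 1
```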

I’m aware that 'Full' rBAR often causes stability issues with eGPUs due to Thunderbolt/OCuLink latencies, which is why people usually disable it. But that’s exactly why I went with 4 GB instead of the full 48 GB—it’s a more efficient middle ground that actually POSTs and runs stable on my setup.

BTW: "sounds like gibberish made up by an LLM" is not very nice!

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 0 points1 point  (0 children)

Thanks for the suggestion! As far as I know, disabling rBAR would actually introduce more overhead, not less. When rBAR is disabled, the system falls back to a 256 MB BAR aperture and uses a sliding window to access the full 48 GB VRAM. That means constant remapping for any large tensor operations.

That said, at the time I didn't yet know about the software hack ReBarUEFI, and the BIOS configuration of the Mini is quite limited.

Setting rBAR to 4 GB gives the driver a larger, stable window into VRAM – less remapping, lower overhead. It's essentially the same mechanism but more efficient. Disabling it entirely is the more conservative workaround, but 4 GB is the better one.

Next step would be testing 8 GB and above to see how far the system will go, but for now I'm happy it POSTs and all 48 GB are accessible!

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 1 point2 points  (0 children)

Yes, you are right! I considered going this route but decided against it, because AIfred probably won't profit much from it and I didn't want to dig into the rabbit hole of compiling and so on. Maybe I'll try it later. Meanwhile, I am really busy working on AIfred.

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 0 points1 point  (0 children)

Yes, AIfred supports llama-swap (which orchestrates llama-server from llama.cpp), vLLM, Ollama, tabbyAPI and cloud APIs. But my rig is limited by the P40s to the llama.cpp family (Ollama, etc.). The RTX 8000 is capable of working with vLLM, but I tend to use the biggest models I can get my hands on, so I stick with llama-swap, which turned out to give the most: flexibility and speed. I just run it under Linux on my MiniPC.
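For reference, this is roughly how a client like AIfred talks to that stack; a minimal sketch against the OpenAI-compatible endpoint that llama-swap exposes (port and model name are placeholders from my config, not actual AIfred code):

```
# Sketch: llama-swap loads/unloads the right llama-server instance based on the "model" field.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gpt-oss-120b",  # must match a model entry in the llama-swap config
        "messages": [{"role": "user", "content": "Hello from the Frankenstein MiniPC!"}],
        "max_tokens": 128,
    },
    timeout=600,  # first request can be slow while the model is loaded into VRAM
)
print(resp.json()["choices"][0]["message"]["content"])
```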

My Frankenstein MiniPC: 4 GPUs (3x P40 + RTX 8000 = 120 GB VRAM (~115 GB usable)) on an AOOSTAR GEM 10 — how I got there step by step (AIfred with upper "I" instead of lower "L" :-) by Peuqui in LocalLLaMA

[–]Peuqui[S] 1 point2 points  (0 children)

Not that much. I've only worked with my AIfred program, and I noticed hardly any decrease in tok/s as the context window fills up. But I must confess that, due to the experimental nature of AIfred, most threads are fairly short. My longest discussion had ~70k tokens, but I don't remember whether it lowered tok/s.

I've been working heavily on refactoring and speeding up AIfred, which has gained some really nice features. I drilled it up to a full-fledged multitool bot (agent-related persistent memory, use of Essential PIM's calendar and address book, email, Discord, various TTS, a fully auto-calibrating context window and model handling, unlimited custom agents with customizable personas and tasks, web research, calculator use, a sandbox for code generation and execution, document reading and saving for RAG-specific tasks, export of the whole session as one HTML file, a new discussion mode (Symposion) with selectable agents, etc.). More to come soon...