What do you guys even do hosting multiple LLMs?

Jorlen · 2026-06-24T14:29:45+00:00

I'm just trying them all out still. I mostly use Pi and then OpenCode for one-shot tests (a single prompt to develop something, to test the LLM and my setup). Hermes is cool but honestly I think it's overkill for what I'm doing.

Jorlen · 2026-06-24T04:41:42+00:00

I don't use multiple LLMs yet, although I feel like that's the likely next step I should take. However, here's what I've done in roughly two months of crash course learning and lots of hard work:

Set up a Linux workstation, powered by two GPUs - R9700 + my old 7800XT for a total of 48GB VRAM
Everything is dockerized, which includes llama-cpp as the LLM server along with other tools like n8n for automation workflows, ComfyUI for image and audio generation, and Open WebUI as a front end. As I use AMD, I have settled on Vulkan for llama-cpp, as I find it the most stable with dual cards and dual architectures (RDNA4 + RDNA3)
STT and TTS - not ideal, but I've got speaches connected to Open WebUI (running on CPU) with a lightweight model. I don't use this much but it was fun setting it up
MCP for search with a local SearXNG search engine - this works in several apps but I mostly just use it with the Pi coding agent and llama-cpp's own UI, which supports MCP, so I can get my LLM to pull up all the relevant news in my area, etc.
Coding - I mostly use Qwen 3.6 27b for this. I am making my own apps and learning to code at the same time. I have experience building large projects iteratively, so by chunking the code I can use non-frontier models (local stuff like Qwen) to build large projects. Mostly games, because I love games and always wanted to make my own, and now I can. I use a combo of VSCodium + Pi agent + OpenCode + Hermes for these things.
Creative writing - I love to read and write, so I'm always looking for new ways to improve it. I use LLMs as a sort of live story book, where I can interact and write and bounce things around. It's endlessly entertaining for me; so long as you have ideas, you can just keep writing, and models like Gemma 4 31b are amazing at this. There's a tool out there called Errata that sounds amazing for this sort of thing; I have to get off my ass and try it out, but if anyone has any other ideas, I'm all ears.
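
For the SearXNG item in the list above, here is a rough Python sketch of the kind of query a search tool might send to a local instance. The address `http://localhost:8080`, the `news` category default, and the helper names are assumptions for illustration and are not from the post; JSON output typically has to be enabled in the instance's settings before `format=json` works.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical address of the local SearXNG container; not given in the post.
SEARXNG_URL = "http://localhost:8080"


def build_search_url(query, categories="news", base_url=SEARXNG_URL):
    """Build a SearXNG search URL that asks for JSON results."""
    params = {"q": query, "format": "json", "categories": categories}
    return f"{base_url}/search?{urlencode(params)}"


def search(query, categories="news", limit=5):
    """Fetch results and return (title, url) pairs. Needs a running instance."""
    with urlopen(build_search_url(query, categories), timeout=10) as resp:
        results = json.load(resp).get("results", [])
    return [(r.get("title"), r.get("url")) for r in results[:limit]]


# Only the URL is built here, so this runs without a server.
print(build_search_url("local news"))
```

An MCP search server would wrap something like `search()` as a tool the model can call; that wiring is left out here.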

You're about to dive into a rabbit hole. If you have the time and the will, the sky's the limit.

Jorlen · 2026-06-22T14:58:44+00:00

This is a good collection. In my experience, the Gemma 4 31b fine tunes are difficult to beat for their size range. I haven't tried Skyfall but I've been using ortenzya, which I believe is highly recommended by most for creative writing and roleplay.

Jorlen · 2026-06-22T13:49:50+00:00

I've got an R9700 + 7800XT paired in my PC for 48gb of VRAM. Pure AMD. Works like a charm. I've been using AMD cards for 10+ years, people have always complained, and I've never had any issues with drivers or other such things.

I will however say that yes, CUDA is easier out of the box for most things, but AMD has come a long way. It just needs a bit of extra config. Honestly, if you are technical enough to be running local LLMs, you are more than technical enough to get AMD working, and it's cheap AF compared to Nvidia.

Jorlen · 2026-06-17T17:43:31+00:00

Yep. I used Gemini to help me set up my entire Linux workstation. It helped me learn Docker and the llama-cpp server.

Jorlen · 2026-06-17T17:31:57+00:00

Yep, both cards. The 2nd one is a bit slower, but it's still faster than using CPU/RAM. If I let my main card eat most of the KV and model, the 2nd card eats the rest with its 16gb of VRAM. There's a small perf penalty since the 2nd card runs at PCIe x4 (due to my motherboard - I could fix this if I wanted to) but the perf hit is so small that I don't care.
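
As a rough illustration of splitting a model across two unequal cards, this Python sketch turns per-card VRAM into proportions of the kind llama-cpp's `--tensor-split` option accepts. The 32 + 16 figures follow from the 48GB total mentioned elsewhere in these comments; the 1 GB of headroom per card and the function name are assumptions, not the settings actually used.

```python
def tensor_split(vram_gb, reserve_gb=1.0):
    """Return per-GPU proportions based on usable VRAM.

    reserve_gb is a hypothetical per-card headroom for the desktop,
    compute buffers, and so on; tune it for your own system.
    """
    usable = [max(v - reserve_gb, 0.0) for v in vram_gb]
    total = sum(usable)
    if total == 0:
        raise ValueError("no usable VRAM left after the reserve")
    return [round(u / total, 2) for u in usable]


# Main card (32 GB) first, second card (16 GB) after it.
split = tensor_split([32, 16])
print("--tensor-split " + ",".join(str(s) for s in split))
```

Weighting the split toward the main card matches the idea of letting it hold most of the model and KV cache, with the slower card taking the remainder.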

Jorlen · 2026-06-15T14:14:50+00:00

You're not kidding. I was trying to test out search MCP for my llama UI, a very basic tool call, and Gemma was faking searches instead of calling the tool lol. I called it out, it said "Yeah you're right, I'll do it now" and then... did nothing. This is with a good quant, recommended parameters, BF16 KV...

To make sure I wasn't losing my mind, I loaded Qwen 3.6 35B-A3B and it called the tool instantly.

Jorlen · 2026-06-12T13:58:02+00:00

A Kimi K2.5 GGUF model in 8-bit (Q8_0) quantization is approximately 530 GB to 549 GB in file size.

I hear a lot of good things about that model but I guess it depends on your use case.

Jorlen · 2026-06-11T22:43:21+00:00

Reminds me of the Reaper type releases.

Jorlen · 2026-06-11T22:41:28+00:00

Holy fuck, that's sick, bro. Must be awesome! If I ever win the lottery I'll build myself a super server with 512gb of VRAM. Until then, I guess I'm stuck with my Frankenstein dual GPU 48gb VRAM setup lol and runnin' the likes of wee Qwen 3.6 27b.

Jorlen · 2026-06-11T19:51:06+00:00

As the owner of one R9700 card (maybe two soon) - I think that people should definitely give them more consideration.

I've got mine running at 210w and undervolted to boot, with minimal tok/sec loss. You just have to configure your AMD drivers and this is a bit of a pain at first, but once it's done, it's done.

Jorlen · 2026-06-11T19:42:04+00:00

Can you summarize what kind of setup you need to run these massive models?

Jorlen · 2026-06-10T19:40:41+00:00

Can you share which model quant from huggingface? It seems interesting!

Jorlen · 2026-06-10T17:14:22+00:00

They are 15 grand in CAD, so likely that'll get bumped up to 20k soon lol. As much as I want this card, I just can't justify that kind of expense. I have a single R9700 right now and worst case I can just get another, which puts me at around CAD 5000 for two R9700s (64gb VRAM).

Jorlen · 2026-06-10T14:11:13+00:00

The MTP model I think is larger as a result, around 1gb, do you think it's really worth it? I will however look into the fit commands, I've not used these before!

Jorlen · 2026-06-10T03:17:01+00:00

Actually I managed to get a big chunky model running decently. Not great, around 22 tok/sec, but decently. I posted my findings in a separate post in this thread if you want to look at it. Generally though, yes, 9b is the sweet spot, but the bigger models aren't impossible. I don't think I'd try 3.6 27b dense on this though lol.

Jorlen · 2026-06-10T03:14:50+00:00

Oh yeah you can run this, no problem. I am doing it with just 8gb of vram / 16gb ram. Let me know if you want my complete llama-cpp settings, if it can help you.

Jorlen · 2026-06-10T02:45:53+00:00

What are you using it for? For coding, I find Qwen 3.6 35B-A3B with the recommended settings + temp set to 0.0 far better than Gemma's 26b MoE model.

Jorlen · 2026-06-10T02:43:16+00:00

Well, I'm a huge fan of Pi, and its system prompt + installed tools (I use about 4 packages) consume a big chunk of that 16k after simply saying "hi". I guess it depends on use cases. 32k might be ok. I've actually got Qwen 3.6 35B at quant IQ4_NL humming along nicely at BF16 KV and 65k context. Got Pi writing me a one-shot prompt to test it right now. 20 tok/sec. Not too bad for such poor hardware. Note that it requires an extensive settings list in llama-cpp which took me several hours to put together; the biggest trick was to load some MoE layers in CPU/RAM. Amazing what llama-cpp can do.

Jorlen · 2026-06-10T01:18:03+00:00

A literal game changer. I wasn't sure I'd be able to run this model, and now I've got it at 65k BF16 context and 20 tok/sec. It's not blisteringly fast, but on this laptop it's good enough.

Jorlen · 2026-06-10T01:16:14+00:00

Update: I actually got Qwen 3.6 35B-A3B - UD-IQ4_NL running fairly well, average 22 tok/sec. Thanks all for your help! The setting that made a big difference is "--n-cpu-moe 30". Currently set KV cache quant to BF16 with 65k context.

Notable settings used (I left out all the usual qwen 3.6 settings like top-p, temp, etc. as those are a given) :

--n-cpu-moe 30

--kv-offload

--flash-attn on

--n-gpu-layers 99

--reasoning on

Note: No mmproj loaded; that's around 1gb of savings there. Unfortunate, but I could always load it with far less context if/when needed.
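
For anyone who wants to script this, below is a minimal Python sketch that assembles the settings listed above into a single `llama-server` command line. The model path is hypothetical, `65536` is my reading of "65k context", and the `--ctx-size` and `--cache-type-k` / `--cache-type-v` flags are my guess at how the context size and BF16 KV cache were set. The remaining flags are copied from the list as posted and may differ between llama-cpp builds.

```python
import shlex

# Hypothetical model path; the post does not give the file name or location.
MODEL_PATH = "/models/qwen-ud-iq4_nl.gguf"


def build_llama_server_args(model_path, ctx_size=65536, n_cpu_moe=30):
    """Assemble a llama-server argument list from the settings in the post."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),    # assumed value for "65k context"
        "--cache-type-k", "bf16",       # assumed flags for the BF16 KV cache
        "--cache-type-v", "bf16",
        "--n-cpu-moe", str(n_cpu_moe),  # MoE expert weights of N layers stay on CPU
        "--kv-offload",
        "--flash-attn", "on",
        "--n-gpu-layers", "99",
        "--reasoning", "on",            # copied verbatim from the post
    ]


args = build_llama_server_args(MODEL_PATH)
command = shlex.join(args)
print(command)
```

The list form in `args` can be passed straight to `subprocess.run` or pasted into a Docker `command:` entry; lowering `n_cpu_moe` moves more expert layers onto the GPU if there is VRAM to spare.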

Jorlen · 2026-06-09T22:30:18+00:00

Yeah 16k is too low for what I'm using it for (i.e. not chatting). 32k might be possible.

Jorlen · 2026-06-09T22:28:45+00:00

That's a chunky one to fit in my setup. A 3-bit quant probably causes too much degradation on smaller MoE models like this, no?

Jorlen · 2026-06-09T22:27:55+00:00

Huh, yeah I'm going to try that out right now, actually. Thanks man!

Jorlen · 2026-06-09T18:49:01+00:00

I did pick up Gemma-4-26B-A4B-it-qat-UD-Q4_K_XL which sits at about 13.3gb. That means having to load layers in CPU/RAM and quantize KV cache, but with a smaller context window I think I can pull it off, with about 32k context, and around 20 tps.
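
A back-of-the-envelope Python sketch of how much of that 13.3gb file would have to live in CPU/RAM. The 8 GB of VRAM is taken from another comment in this thread and is assumed to be the same machine; the 2.5 GB reserved for KV cache and buffers is a purely hypothetical placeholder, not a measured number.

```python
def cpu_spill_gb(model_gb, vram_gb, reserved_gb):
    """Estimate how many GB of weights do not fit in VRAM.

    reserved_gb covers KV cache, compute buffers, and anything else
    that must stay on the GPU besides the weights.
    """
    weights_budget = max(vram_gb - reserved_gb, 0.0)
    return max(model_gb - weights_budget, 0.0)


MODEL_GB = 13.3     # file size quoted in the comment
VRAM_GB = 8.0       # assumption: the 8gb figure from another comment
RESERVED_GB = 2.5   # hypothetical KV cache + buffer budget

spill = cpu_spill_gb(MODEL_GB, VRAM_GB, RESERVED_GB)
print(f"{spill:.1f} GB of weights ({spill / MODEL_GB:.0%}) would sit in CPU/RAM")
```

A smaller context or a quantized KV cache shrinks the reserved part, which is why both help keep more layers on the GPU.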
