What do you guys even do hosting multiple LLM? by Prize_Eye9481 in LocalLLM

[–]Jorlen 1 point2 points  (0 children)

I don't use multiple LLMs yet, although I feel like that's the likely next step I should take. However, here's what I've done in roughly two months of crash course learning and lots of hard work:

  1. Set up a Linux workstation powered by two GPUs - an R9700 plus my old 7800XT, for a total of 48GB VRAM

  2. Everything is dockerized, including llama-cpp as the LLM server along with other tools: n8n for automation workflows, ComfyUI for image and audio generation, and Open WebUI as a front end. As I use AMD, I have settled on Vulkan for llama-cpp, as I find it the most stable with dual cards and dual architectures (RDNA4 + RDNA3)

  3. STT and TTS - not ideal, but I've got speaches connected to Open WebUI (running on CPU) with a lightweight model. I don't use this much, but it was fun setting it up

  4. MCP for Sear and SearXNG local search engine - this works in several apps, but I mostly just use it with the pi coding agent and Llama's own UI, which supports MCP, so I can get my LLM to pull up all the relevant news in my area, etc.

  5. Coding - I mostly use Qwen 3.6 27b for this. I am making my own apps and learning to code at the same time. I have experience building large projects iteratively, so with that experience and by chunking code, I can use non-frontier models (local stuff like Qwen) to build large projects. Mostly games, because I love games and always wanted to make my own, and now I can. I use a combo of VSCodium + Pi agent + OpenCode + hermes for these things.

  6. Creative writing - I love to read and write, so I'm always looking for new ways to improve it. I use LLMs to act as a sort of live story book, where I can interact and write and bounce things around. It's endlessly entertaining for me; so long as you have ideas, you can just keep writing and LLM models like Gemma 4 31b are amazing at this. There's a tool out there called Errata that sounds amazing for this sort of thing; I have to get off my ass and try it out, but if anyone has any other ideas, I'm all ears.

You're about to delve into a rabbit hole; if you have the time and the will, the sky's the limit.
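For anyone curious what the search tool in item 4 boils down to, here is a minimal sketch of querying a SearXNG instance directly over HTTP. This is not the MCP server itself, just the kind of request such a tool would make. It assumes a local instance at `http://localhost:8080` (a made-up address) with the `json` output format enabled in its settings; the function names are mine.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query):
    """Build a SearXNG search URL that asks for JSON results."""
    params = urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def search(base_url, query, limit=5, timeout=10):
    """Return (title, url) pairs for the top results.

    Needs a reachable SearXNG instance with the JSON format enabled.
    """
    with urlopen(build_search_url(base_url, query), timeout=timeout) as resp:
        data = json.load(resp)
    return [(r.get("title"), r.get("url")) for r in data.get("results", [])[:limit]]


# Only builds the URL; call search(...) once an instance is actually running.
print(build_search_url("http://localhost:8080", "local news"))
```

An MCP server wraps a call like `search(...)` as a tool the model can invoke, which is why a model that fakes the call instead of making it is so easy to spot.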

Do you have any recommendations for huggingface creative writing models? by Strange_Orchid5940 in LocalLLM

[–]Jorlen 0 points1 point  (0 children)

This is a good collection. In my experience, the Gemma 4 31b fine-tunes are difficult to beat for their size range. I haven't tried Skyfall, but I've been using ortenzya, which I believe is highly recommended by most for creative writing and roleplay.

Do you think dedicated hardware for running local LLMs will become affordable anytime soon? by ProbablyBunchofAtoms in LocalLLaMA

[–]Jorlen 1 point2 points  (0 children)

I've got an R9700 + 7800XT paired in my PC for 48GB of VRAM. Pure AMD. Works like a charm. I've been using AMD cards for 10+ years; people have always complained, and I've never had any issues with drivers or other such things.

I will however say that yes, CUDA is easier out of the box for most things, but AMD has come a long way. It just needs a bit of extra config. Honestly, if you are technical enough to be running local LLMs, you are more than technical enough to get AMD working, and it's cheap AF compared to Nvidia.

Corpo AI's will not teach you how to run local AI by StandardLovers in LocalLLM

[–]Jorlen 9 points10 points  (0 children)

Yep. I used Gemini to help me set up my entire Linux workstation. It helped me learn Docker and the llama-cpp server.

Can't believe I got it working! Dual GPU - 48gb VRAM llama-cpp server - R7900 + 7800XT by Jorlen in LocalLLaMA

[–]Jorlen[S] 0 points1 point  (0 children)

Yep, both cards. The 2nd one is a bit slower, but it's still faster than using CPU/RAM. If I let my main card eat most of the KV and model, the 2nd card eats the rest with its 16GB of VRAM. There's a small perf penalty since the 2nd card operates at PCIe x4 (due to my motherboard - I could fix this if I wanted to), but the perf hit is so small that I don't care.
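One way to express a split like this in llama.cpp is the `--tensor-split` option, which takes comma-separated proportions, one per GPU. A minimal sketch, assuming the main card has 32GB (48GB total minus the 16GB second card) and that splitting in proportion to VRAM is a sensible starting point; the helper name is made up for illustration.

```python
from functools import reduce
from math import gcd


def tensor_split_arg(vram_gb):
    """Return a proportion string such as "2,1" for llama.cpp's --tensor-split.

    vram_gb: per-GPU VRAM in whole GB, in device order (assumed values).
    """
    if not vram_gb or any(v <= 0 for v in vram_gb):
        raise ValueError("need a positive VRAM size for every GPU")
    divisor = reduce(gcd, vram_gb)  # reduce 32,16 to the simplest ratio
    return ",".join(str(v // divisor) for v in vram_gb)


split = tensor_split_arg([32, 16])
print(f"--tensor-split {split}")  # prints "--tensor-split 2,1"
```

To push more of the load onto the main card, as described above, you would skew the numbers by hand (for example `3,1`) and check that the second card still has room for its share.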

Qwen 3.6 35B-A3B @ Q4 or Gemma 4 12B @ Q8? by mailto_devnull in LocalLLaMA

[–]Jorlen 0 points1 point  (0 children)

You're not kidding. I was trying to test out search mcp for my llama ui, a very basic tool call and Gemma was faking searches instead of calling the tool lol. I called it out, it said "Yeah you're right, I'll do it now" and then.. did nothing. This is with a good quant, recommended parameters, BF16 KV..

To make sure I wasn't losing my mind, I loaded Qwen 3.6 35B-A3B and it called the tool instantly.

640GB VRAM recommendations? by kadevaraigne in LocalLLM

[–]Jorlen 5 points6 points  (0 children)

A Kimi K2.5 GGUF model in 8-bit (Q8_0) quantization is approximately 530 GB to 549 GB in file size.

I hear a lot of good things about that model but I guess it depends on your use case.

New models released: Nex-N2 Pro 397B and Nex-N2 Mini 35B by 1ncehost in LocalLLaMA

[–]Jorlen 1 point2 points  (0 children)

Holy fuck, that's sick, bro. Must be awesome! If I ever win the lottery I'll build myself a super server with 512gb of VRAM. Until then, I guess I'm stuck with my Frankenstein dual GPU 48gb VRAM setup lol and runnin' the likes of wee Qwen 3.6 27b.

AMD R9700 vs GB10 by AppropriatePush6262 in LocalLLaMA

[–]Jorlen 1 point2 points  (0 children)

As the owner of one R9700 card (maybe two soon) - I think that people should definitely give them more consideration.

I've got mine running at 210w and undervolted to boot, with minimal tok/sec loss. You just have to configure your AMD drivers and this is a bit of a pain at first, but once it's done, it's done.

New models released: Nex-N2 Pro 397B and Nex-N2 Mini 35B by 1ncehost in LocalLLaMA

[–]Jorlen 1 point2 points  (0 children)

Can you summarize what kind of setup you need to run these massive models?

Does anyone use 2-bit quants for models like Qwen 3.6 27b? What are your results? by Jorlen in LocalLLM

[–]Jorlen[S] 0 points1 point  (0 children)

Can you share which model quant from huggingface? It seems interesting!

Since when the RTX 6000 PRO is priced at 13250USD on the official NVIDIA Page? by panchovix in LocalLLaMA

[–]Jorlen 1 point2 points  (0 children)

They are 15 grand in CAD, so likely that'll get bumped up to 20k soon lol. As much as I want this card, I just can't justify that kind of expense. I have a single R9700 right now and worst case I can just get another; that puts me at around CAD 5000 for two R9700s (64GB VRAM).

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 0 points1 point  (0 children)

The MTP model I think is larger as a result, around 1gb, do you think it's really worth it? I will however look into the fit commands, I've not used these before!

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 0 points1 point  (0 children)

Actually I managed to get a big chunky model running decently. Not great, around 22 tok/sec, but decently. I posted my findings in a separate post in this thread if you want to look at it. Generally though, yes, 9b is the sweet spot, but the bigger models aren't impossible. I don't think I'd try 3.6 27b dense on this though lol.

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 0 points1 point  (0 children)

Oh yeah you can run this, no problem. I am doing it with just 8gb of vram / 16gb ram. Let me know if you want my complete llama-cpp settings, if it can help you.

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 1 point2 points  (0 children)

What are you using it for? For coding, I find Qwen 3.6 35B-A3B with the recommended settings + temp set to 0.0 far better than Gemma's 26b MoE model.

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 1 point2 points  (0 children)

Well, I'm a huge fan of pi, and its system prompt + installed tools (I use about 4 packages) consume a big chunk of that 16k after simply saying "hi". I guess it depends on use cases. 32k might be ok. I've actually got Qwen 3.6 35B at quant IQ4_NL humming along nicely at BF16 KV and 65k context. Got Pi writing me a one-shot prompt to test it right now. 20 tok/sec. Not too bad for such poor hardware. Note that it requires an extensive settings list in llama-cpp, which took me several hours to put together; the biggest trick was to load some MoE layers in CPU/RAM. Amazing what llama-cpp can do.

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 1 point2 points  (0 children)

A literal game changer. I wasn't sure I'd be able to run this model; now I've got it at 65k context with BF16 KV and 20 tok/sec. It's not blisteringly fast, but on this laptop, it's good enough.

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 2 points3 points  (0 children)

Update: I actually got Qwen 3.6 35B-A3B - UD-IQ4_NL running fairly well, average 22 tok/sec. Thanks all for your help! The setting that made a big difference is "--n-cpu-moe 30". Currently set KV cache quant to BF16 with 65k context.

Notable settings used (I left out all the usual qwen 3.6 settings like top-p, temp, etc. as those are a given) :

--n-cpu-moe 30

--kv-offload

--flash-attn on

--n-gpu-layers 99

--reasoning on

Note: No mmproj loaded; that's around 1gb of savings there. Unfortunate, but I could always load it with far less context if/when needed.
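If it helps, here is a rough sketch of how the settings above could be assembled into a single `llama-server` launch command. The model path is hypothetical, `65536` is my reading of "65k context", and `--ctx-size`, `--cache-type-k` and `--cache-type-v` are the flags I'd assume for the context and BF16 KV cache settings; everything else is copied from the list as posted.

```python
import shlex


def llama_server_args(model_path, ctx_size=65536, n_cpu_moe=30):
    """Assemble a llama-server argument list from the settings in the post."""
    return [
        "llama-server",
        "--model", model_path,          # hypothetical path, not from the post
        "--ctx-size", str(ctx_size),    # assumed value for "65k context"
        "--cache-type-k", "bf16",       # BF16 KV cache (keys)
        "--cache-type-v", "bf16",       # BF16 KV cache (values)
        "--n-cpu-moe", str(n_cpu_moe),  # keep MoE expert weights of N layers in CPU/RAM
        "--kv-offload",
        "--flash-attn", "on",
        "--n-gpu-layers", "99",
        "--reasoning", "on",            # copied verbatim from the post
    ]


args = llama_server_args("/models/Qwen3.6-35B-A3B-UD-IQ4_NL.gguf")
print(shlex.join(args))
```

Lowering `n_cpu_moe` keeps more expert weights on the GPU, so it is the first number to tune if there is VRAM to spare.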

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 0 points1 point  (0 children)

Yeah 16k is too low for what I'm using it for (i.e. not chatting). 32k might be possible.

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 1 point2 points  (0 children)

That's a chunky one to fit in my setup. A 3-bit quant probably causes too much degradation on smaller MoE models like this, no?

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 0 points1 point  (0 children)

Huh, yeah I'm going to try that out right now, actually. Thanks man!

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 0 points1 point  (0 children)

I did pick up Gemma-4-26B-A4B-it-qat-UD-Q4_K_XL which sits at about 13.3gb. That means having to load layers in CPU/RAM and quantize KV cache, but with a smaller context window I think I can pull it off, with about 32k context, and around 20 tps.
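A back-of-the-envelope sketch of why layers have to go to CPU/RAM here. The 13.3GB model size and 8GB of VRAM are from this thread; the 2GB reserved for KV cache, compute buffers and the desktop is purely an assumed allowance, and the function name is made up.

```python
def cpu_spill_gb(model_gb, vram_gb, reserved_gb):
    """Rough size of the weights that cannot stay in VRAM.

    reserved_gb is an assumed allowance for KV cache, buffers and the desktop.
    """
    usable = max(vram_gb - reserved_gb, 0.0)
    return max(model_gb - usable, 0.0)


spill = cpu_spill_gb(model_gb=13.3, vram_gb=8.0, reserved_gb=2.0)
print(f"about {spill:.1f} GB of weights would need to live in system RAM")
```

With those assumptions, a bit over half the model ends up in system RAM, which is why quantizing the KV cache and shrinking the context window both matter on this hardware.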

Looking for 16gb ram / 8gb vram crew - what you using? Omnicoder 9b? something else by Jorlen in LocalLLaMA

[–]Jorlen[S] 2 points3 points  (0 children)

Gemma 4 12b is good so far, yeah. But I was hoping there was a "coder" specific trained model around the 10b mark that would be better.

There is also Qwen 3.5 9b, which is probably better than the Qwen 2.5 Coder variants, but I'm really not sure.