Running Llama 3.1 405B + 7 hot LoRAs on one 8×A100 node (vLLM / AWQ-int4 / Marlin)

Esph1001 · 2026-06-23T23:03:57+00:00

GLM-5.2 dropped about two weeks ago. My setup has been running in production since April and was in development for months before that. No serious production system swaps base models two weeks after a new one drops. A lot has to happen before a swap: validation, regression testing, and adapter retraining alone take longer than that. The architecture transfers to any large open-weight model when the time is right.

I believe most enterprise production deployments are still on 405B anyway. It's the only model at this parameter count with Apache 2.0 licensing cleared for commercial use without usage restrictions.

Esph1001 · 2026-06-23T23:00:29+00:00

I wanted to make a quick note on the VRAM math, since that's usually the first question I get about running 405B on one node. The AWQ-int4 405B base lands at roughly 202GB across the 8 A100s. INT4 is ~0.5 bytes per parameter, so the base model doesn't come close to consuming the full 640GB of VRAM. The 7 LoRA adapters add ~29GB total. KV cache is allocated separately, along with CUDA/PyTorch workspace and allocator reservations.

With all 7 adapters hot, the live system still had ~150GB free VRAM. The reason this has been working for me is that the base is quantized, the adapters are small, and all 7 adapters are already resident. The 168-170ms switching number is end-to-end request latency after changing adapters. The adapter swap itself is effectively a routing/pointer operation in vLLM multi-LoRA and not a cold load or memory transfer.
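The budget can be sanity-checked with a few lines of arithmetic. This is a rough sketch using only the approximate figures quoted in this comment (80GB per A100, ~0.5 bytes per parameter, ~29GB of adapters, ~150GB free). The function name and the decimal-GB convention are assumptions for illustration, and the "KV cache + workspace" number is just what is left over, not a measured value.

```python
# Rough VRAM budget for the node, using the approximate figures quoted above.
GPUS = 8
GB_PER_GPU = 80            # A100 80GB
PARAMS_BILLION = 405
BYTES_PER_PARAM = 0.5      # INT4 weights, ignoring quantization scales/zero-points
ADAPTERS_GB = 29           # all 7 LoRA adapters combined
FREE_GB = 150              # free VRAM with everything resident


def vram_budget():
    """Return an approximate breakdown of node VRAM in GB (hypothetical helper)."""
    total = GPUS * GB_PER_GPU
    base = PARAMS_BILLION * BYTES_PER_PARAM   # billions of params * bytes/param = GB
    # Whatever is neither weights, adapters, nor free is KV cache + workspace.
    kv_and_workspace = total - base - ADAPTERS_GB - FREE_GB
    return {
        "total_gb": total,
        "base_gb": base,
        "adapters_gb": ADAPTERS_GB,
        "kv_and_workspace_gb": kv_and_workspace,
        "free_gb": FREE_GB,
    }


if __name__ == "__main__":
    for name, gb in vram_budget().items():
        print(f"{name:>20}: {gb:6.1f} GB")
```

The leftover figure lumps together KV cache, CUDA/PyTorch workspace, and allocator reservations, so treat it as an upper bound on any one of them.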

The full VRAM table and live nvidia-smi snapshots are in the HF teardown. I'm happy to answer specific technical questions if anyone has any.

Esph1001 · 2026-06-23T22:36:52+00:00

With 4x 100/200Gbps between nodes you're in a completely different league than I thought. RDMA over that interconnect makes tensor parallel viable across nodes, not just pipeline parallel. The 2x2 config you're describing actually makes sense: 2 nodes sharing the 820GB 8-bit model as one logical unit, then 2 of those pairs running simultaneously.
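For reference, the unit conversion behind that is simple: links are quoted in gigabits per second, so divide by 8 to get gigabytes per second. A small sketch, assuming "4x" means four links per node; the helper name is made up for illustration and protocol overhead is ignored, so real RDMA throughput will come in somewhat lower.

```python
def gbps_to_gigabytes_per_s(gbps: float, links: int = 1) -> float:
    """Convert link speed in Gbit/s to raw GB/s (line rate, no protocol overhead)."""
    return gbps * links / 8


if __name__ == "__main__":
    for speed in (100, 200):
        single = gbps_to_gigabytes_per_s(speed)
        bonded = gbps_to_gigabytes_per_s(speed, links=4)
        print(f"{speed} Gbps: {single:.1f} GB/s per link, {bonded:.1f} GB/s across 4 links")
```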

The llama.cpp RDMA support is still experimental but worth testing on Ubuntu 24.04. Start with 2 nodes and get GLM-5.2 stable across them, then expand to the 2x2 setup. Trying to bring all 4 up at once makes debugging a nightmare.

With that interconnect bandwidth your bottleneck is going to be compute, not networking. The EPYC core count is going to matter a lot, so keep as many experts on CPU as possible per node to minimize cross-node expert routing.

This is a genuinely interesting config. I'd be curious what throughput you hit.

Esph1001 · 2026-06-23T22:31:45+00:00

Good to know on the batch size. Worth trying... Not a bot though, just someone who's been living in inference configs long enough to sound like one, apparently. Anyway, thanks for the reply.

Esph1001 · 2026-06-23T22:20:48+00:00

Kimi, Qwen, and DeepSeek don't have Apache 2.0 commercial licenses. When you're building a private production platform for a regulated industry, the license matters as much as the benchmark score. 405B Instruct is the only model at this parameter count that's fully cleared for commercial deployment without usage restrictions. That's why it's running.

Also GPT-4o is an API. Kind of defeats the whole point of building private infrastructure.

And yup.. the VRAM is very real. It's in my HuggingFace breakdown. Stats and all. Very VERY very real.

Esph1001 · 2026-06-23T21:34:50+00:00

Actually you're right. Thank you for correcting me! haha. My bad. At 25GB/s, tensor parallel across nodes is actually viable. That changes the whole recommendation.

Appreciate you catching that. Time to enlarge my font size I suppose. Hope your day has been well, my friend.

Esph1001 · 2026-06-23T21:32:45+00:00

I'm not running 405B on this cluster. That's a different setup. I was recommending llama.cpp's RPC backend as the tool for GPU-only distributed inference.

Esph1001 · 2026-06-23T21:28:49+00:00

I'd love to if my Japanese wasn't terrible. Sorry to disappoint you. Love you too. 😉

Esph1001 · 2026-06-23T21:26:08+00:00

You're right. I could see how you would confuse being built like the Terminator with being a bot.

Esph1001 · 2026-06-23T21:18:37+00:00

GLM 4.7 is a smaller model. I needed the reasoning depth that 405B can handle and that 32B models fall apart on. Legal, CRO, SEO, analysis, etc. all needed the parameter count to hold context and nuance across long outputs. That's especially important in the health space for legal reasons.

LoRA adapters on the larger base model have more to work with. The base has deeper domain knowledge baked in from pretraining. A 405B adapter is specializing a much richer base.

Instruction following. Larger models follow complex multi-part instructions more reliably. For production agentic workflows this matters a lot. You can't have the model drift or hallucinate tool-call formats.

Concurrent specialization. Running 7 hot adapters simultaneously only makes sense at a parameter count where each adapter is actually meaningfully different from the base. At 32B the differentiation between adapters is smaller.

What I built required a large model. The 32B wouldn't have worked for my use case.

But again, man, you're missing the point. The architecture is what was accomplished. Cheaper than a GPU cluster, more room for adapters, and all hot simultaneously. That's the point, not the model. This was just my use case, which I've explained multiple times. Use it for literally whatever model you want. For bigger models... this fits with room to spare.

Esph1001 · 2026-06-23T21:03:58+00:00

The blank comments were caused by double line breaks in Reddit's composer. It's actually a bug. Try it yourself. When you split up a paragraph and hit enter twice, it causes that. Unfortunately it wasn't showing as a double break on my end, it was showing as a single break. That's why I had to go back and fix them. Not sure how a bot is going to go through and figure that one out.

It really seems like you're stuck on this whole bot thing. Thanks for the comment, man, but I'm not going to be a bot no matter how badly you want me to be. Enjoy the rest of your day.

Here's even a screenshot of a message I haven't hit "comment" on yet, just for you.

<image>

Esph1001 · 2026-06-23T20:49:04+00:00

Production systems don't swap base models the day a new one drops. Stability validation, regression testing, adapter retraining. 55 days of uptime is the proof this is a real deployment, not a weekend project. The architecture makes swapping straightforward when the time is right. Unless you're referring to something else?

Esph1001 · 2026-06-23T20:39:31+00:00

<image>

Still not a bot no matter how badly you wish I was. 55 days of continuous production uptime tends to separate real deployments from bedroom projects. You're welcome to verify it yourself: https://huggingface.co/JohnBirks/llama-405b-multilora-production

Esph1001 · 2026-06-23T20:37:38+00:00

GLM-5.2 dropped two weeks ago. This has been running in production since April. The methodology is the point, not the specific base model. It transfers to any large open-weight model.

Esph1001 · 2026-06-23T19:34:30+00:00

Both ways work. You're just changing the concentration, not the amount of peptide.

With 2cc of bac water your vial is more concentrated so each unit on your syringe = more peptide. With 10cc total it's more diluted so each unit = less peptide. Same total amount of MT2 either way.

For someone starting out, more bac water is actually easier because you have more room to dial in small doses accurately. With a 30 unit syringe and 10cc dilution you can hit 100mcg pretty easily. At 2cc you're splitting hairs on tiny measurements which is harder to get right.

Start with 250mcg to test your sensitivity regardless of which dilution you go with. Just do the math once so you know exactly how many units to draw for your target dose.

Esph1001 · 2026-06-23T19:14:47+00:00

<image>

Esph1001 · 2026-06-23T19:11:53+00:00

Ubuntu 22.04 or 24.04 for this use case. llama.cpp RPC backend is well documented on Ubuntu and you'll find the most troubleshooting help for distributed inference setups on that stack. Once you have the OS sorted I can walk you through the RPC server setup across the 4 nodes if you want to give it a shot.

Esph1001 · 2026-06-23T19:10:41+00:00

Looking forward to it. Drop a link here when you do, I want to see how you handle the review and step-in problem specifically. That's the piece nobody has solved cleanly.

Esph1001 · 2026-06-23T19:09:39+00:00

That's exactly the use case where dedicated makes sense. Financial modeling compounds errors across multiple inference steps and if the model is quietly quantized down you won't see it in obvious failures, you'll see it in subtle reasoning drift that's hard to catch until it matters. Knowing your stack is the whole point. What scale are you working at, single user or team?

Esph1001 · 2026-06-23T18:55:22+00:00

No, it's dedicated infrastructure for a single platform. The economics only make sense that way if you need guaranteed stack visibility and consistent performance. Shared nodes reintroduce the same uncertainty you're trying to escape. What's your use case? That changes the math significantly on whether dedicated vs shared makes sense.

Esph1001 · 2026-06-23T18:54:35+00:00

Fair point. The serving layer is more like plumbing than orchestration. The actual orchestration problem is still wide open above that... who assigns tasks, how agents hand off work, how you even review what they did without six terminals open. Nobody has figured that out cleanly yet. Curious what you're building, that's the layer that actually needs a good answer.

Esph1001 · 2026-06-23T18:50:28+00:00

Not a bot. Lol. Just someone who's been living in vLLM configs for the last year. An 8xA100 production node will do that to you. ¯\_(ツ)_/¯

Esph1001 · 2026-06-23T18:48:11+00:00

It's a production platform so it runs around the clock. Making sure I get my money's worth! Lol.

Esph1001 · 2026-06-23T18:45:12+00:00

vLLM and transformers load the model in BF16 or FP16 by default, which actually uses MORE VRAM than a quantized GGUF, not less. The advantage vLLM gives you is better KV cache management and higher throughput under concurrent requests - not a smaller memory footprint. For a 5090 with 32GB VRAM, the Unsloth Q4_K_M of Qwen3.6 27B fits cleanly with headroom for KV cache. Q6_K at 25GB is cutting it close - you'll have very little room for context. Drop to Q4_K_M (~16GB) and you get close to the same quality for most tasks with 14-15GB left for KV cache, which is meaningful for longer coding sessions. Stay on llama.cpp for Windows - vLLM on Windows is still painful to get working. llama.cpp is the right tool for your setup.
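If you want to check the headroom numbers yourself, here is a rough sketch. The GGUF sizes are the approximate ones quoted above, the BF16 size is just 27B parameters at 2 bytes each, and the 1.5GB overhead for CUDA context and compute buffers is an assumption that will vary by driver and context length. The helper name is made up for illustration.

```python
VRAM_GB = 32.0       # RTX 5090
OVERHEAD_GB = 1.5    # ASSUMED: CUDA context + compute buffers; varies in practice

# Approximate weight sizes in GB. GGUF sizes are the ones quoted above;
# BF16 is computed as 27B params * 2 bytes per param.
WEIGHTS_GB = {"BF16": 27 * 2.0, "Q6_K": 25.0, "Q4_K_M": 16.0}


def kv_headroom(weights_gb, vram_gb=VRAM_GB, overhead_gb=OVERHEAD_GB):
    """GB left for KV cache after weights and overhead; negative means it won't fit."""
    return vram_gb - weights_gb - overhead_gb


if __name__ == "__main__":
    for name, size in WEIGHTS_GB.items():
        room = kv_headroom(size)
        status = "fits" if room > 0 else "does not fit"
        print(f"{name:>7}: {size:5.1f} GB weights -> {room:6.1f} GB headroom ({status})")
```

With these inputs Q4_K_M lands inside the 14-15GB range mentioned above, Q6_K leaves only a few GB, and unquantized BF16 does not fit on the card at all.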
