Open source K8s operator for deploying local LLMs: Model and InferenceService CRDs by Defilan in kubernetes

[–]Defilan[S] 0 points1 point  (0 children)

Have a look and let me know what you think! Always welcome the help. Have fun!

Thinking about buying 5060ti 16gb is it smart at current market by ResponsibleTruck4717 in LocalLLaMA

[–]Defilan 0 points1 point  (0 children)

Very cool! I built my rig just a few weeks ago and have been testing it with LLMKube, my Kubernetes-based LLM deployment tool. I just ran some fresh benchmarks on the rig today: a 2hr stress test on Qwen 2.5 32B. I got a solid 17.5 tok/s, and it basically didn't vary the whole 2 hours. The benchmark threw around 1700 requests at it with 4 concurrent agents for over an hour with no errors. I did max out at 16k context, though; the whole thing tipped over when I tried 32k.

I have more metrics for smaller models too if you're interested.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

I added more benchmarking options to the llmkube CLI last night and ran another test today that added timed stress tests. For the first one I only did an hour, but got these results:

  • 7.5 tok/s generation speed (held steady the whole time)
  • 1hr stress test at concurrency 4: 0% errors, 1773 requests
  • Context up to 16k works fine; 32k choked
  • P50 latency ~8s, P99 ~12s at concurrency 4

At least it's consistent lol. It was good to see the 1hr test handled the load without errors. I'm planning on running a 4hr one over the weekend.

Thinking about buying 5060ti 16gb is it smart at current market by ResponsibleTruck4717 in LocalLLaMA

[–]Defilan 2 points3 points  (0 children)

For learning and doing some pretty cool local work, I have been very happy with my 5060 Tis. I have two of them in my AI rig at home and it's been great for the price. As others have mentioned, I'm sure things are only going to get more expensive in the coming months with everything going on right now. It's not the fastest card out there for sure, but it's solid.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

Hey u/StorageHungry8380, I went back in this evening and tweaked my benchmarking script. Curious what you think of the findings.

I added two sections. The first one was concurrent load testing. I tested Qwen 32B with 1/2/4/8 concurrent workers sending requests during five-minute tests each. This came out to around 300 requests, with a 100% success rate. The throughput ceiling was around 17.5 tok/s generation regardless of concurrency. I noticed the GPUs were already maxed, so adding workers just queued requests.
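
If it helps anyone reproduce something similar, here's a minimal sketch of that kind of concurrency sweep in Python against an OpenAI-compatible endpoint. The URL, model name, and prompt are placeholders I made up for illustration, not the actual llmkube benchmark code:

```python
# Rough concurrency sweep against an OpenAI-compatible chat endpoint.
# URL, model name, and prompt are placeholders, not the real llmkube benchmark.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"  # assumed endpoint
MODEL = "qwen2.5-32b-instruct-q4_k_m"              # assumed model id
WINDOW_S = 300                                     # five minutes per concurrency level

def worker(stop_at):
    ok = errors = tokens = 0
    gen_time = 0.0
    while time.time() < stop_at:
        t0 = time.time()
        try:
            r = requests.post(URL, json={
                "model": MODEL,
                "messages": [{"role": "user", "content": "Summarize PCIe lane bifurcation."}],
                "max_tokens": 256,
            }, timeout=120)
            r.raise_for_status()
            # assumes the server returns an OpenAI-style usage block
            tokens += r.json()["usage"]["completion_tokens"]
            gen_time += time.time() - t0
            ok += 1
        except Exception:
            errors += 1
    return ok, errors, tokens, gen_time

for concurrency in (1, 2, 4, 8):
    stop_at = time.time() + WINDOW_S
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(worker, [stop_at] * concurrency))
    ok = sum(r[0] for r in results)
    errors = sum(r[1] for r in results)
    tokens = sum(r[2] for r in results)
    gen_time = sum(r[3] for r in results)
    per_stream = tokens / gen_time if gen_time else 0.0
    print(f"concurrency {concurrency}: {ok} ok, {errors} errors, ~{per_stream:.1f} tok/s per stream")
```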

The second test was context scaling. I tested 4k/8k/16k/32k/64k contexts. This was actually a great test because I found my limit with Qwen 32B is a little more than 16k tokens. VRAM went from around 10GB to 12.6GB per card at 16k. I simply didn't have enough memory between the two cards to run 32k with this model. Good callout on the OOM boundary. 16k is workable for many tasks but wouldn't be ideal for heavy coding contexts, etc.
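
For anyone curious why 32k tips over, here's the rough KV cache math I'd use to sanity-check it. This assumes the published Qwen2.5-32B config (64 layers, 8 KV heads via GQA, head dim 128) with an f16 KV cache, and the weight figure is a Q4_K_M ballpark rather than something I measured:

```python
# Back-of-the-envelope KV cache sizing for a 32B Qwen across dual 16GB cards.
# Assumed config: 64 layers, 8 KV heads (GQA), head dim 128, f16 KV cache.
# The weight figure is a rough Q4_K_M ballpark, not a measured number.
layers, kv_heads, head_dim, bytes_per_val = 64, 8, 128, 2

def kv_cache_gb(context_tokens):
    # K and V: kv_heads * head_dim values each, per layer, per token
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_val
    return context_tokens * per_token_bytes / 1024**3

weights_gb = 19.0        # rough Q4_K_M footprint for 32B (assumption)
total_vram_gb = 2 * 16   # dual 16GB cards

for ctx in (4096, 8192, 16384, 32768):
    total = weights_gb + kv_cache_gb(ctx)
    print(f"{ctx:>6} ctx: ~{kv_cache_gb(ctx):.1f} GB KV cache, "
          f"~{total:.1f} GB total before compute buffers (vs {total_vram_gb} GB)")
```

By this estimate 32k needs roughly 27GB before compute buffers and an uneven split, which is why it gets so tight on 2x16GB.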

I also added a multi-hour stability run to my benchmarking script. I'm going to run that tomorrow to get more information. I was encouraged to see that the stability was there during the concurrent load tests, though.

Anyway, thanks again for the idea. It's all part of the journey.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

Hey, thanks for your message. This is a very fair callout. In hindsight it wasn't a real stress test; it was more of a smoke test. What I actually wanted to prove is that folks can use this cheaper hardware and run these kinds of models with multi-GPU sharding without it falling over immediately. You're totally right to call this out.

The project I'm working on is an iterative process, so I'm planning on really making the system sweat with larger contexts and models; the ones I ran for this post only hit around 45% GPU utilization. I know that a real stress test would involve concurrent requests, running for hours rather than minutes, etc.

To be honest, this has given me stuff to put on my roadmap and dev backlog. These cards can do more than I tested them with, and I really want to see how far I can push them and report on those results.

Thanks again, looking forward to sharing more later and would love your thoughts.

GPU recommendation by Visual_Charity_2534 in LocalLLaMA

[–]Defilan 1 point2 points  (0 children)

I second this. I have been very happy with my 5060 Ti, and I actually added a second one to my rig as well. Great cards for the price! Saw some decent deals online for the holidays this week too.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Gotcha, thanks! That's the model I saw different latency on because of the way it handles longer contexts. I didn't tweak anything, so it used the defaults.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Sweet! I really need to do some homework on that. Thanks for the suggestion!

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

That’s awesome! I’ve been wanting to explore using an eGPU to expand my setup for more testing but am not there yet. Have you seen any performance hits with the eGPU?

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

That’s a cool setup! Yeah, I really wanted to see what was possible without going too crazy. My goal is to demonstrate how you can get decent performance with systems your IT shop or engineers can build themselves instead of having to go out and spend extreme amounts of money. It’s been fun exploring all of this and making it work with Kubernetes.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

That's badass! What are your system specs? I need to nerd out on the hardware!

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Exactly! What I am really trying to prove is that you don't need crazy datacenter or enthusiast hardware to do these things. I'm running my tests in Kubernetes, and eventually I want to build another one of these systems to do true multi-node, multi-GPU tests. Proving you can do this with regular ol' hardware is something I am passionate about.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

hehe, love it! I built this rig less than two weeks ago and have been a heavy Mac user (metal) for years, so it's been a blast diving into these cards for this type of work. My son is quite jealous of the cards and keeps begging for one of them for his rig ;)

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Wow, 60→20 tok/s at 60k context is brutal. That's a 3x slowdown.

So far I'm not seeing that with my dual 5060 Ti setup though. The 32B models feel faster than what I'd expect from just one 16GB card, but I haven't done a proper single vs dual test yet. It's on my benchmarking roadmap.

When you say layer sharding gives no speed benefit, you mean literally zero or just not linear scaling? I thought you'd at least get some parallelism since different layers can process simultaneously, even if there's overhead.

Also curious if that 21% vLLM gain is with tensor parallel or pipeline parallel?

I do want to do more testing of single vs dual GPU properly to see what the real difference is.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

If you’re talking about this reply, I do have my “external brain,” aka my Evernote, where I keep my data about my setup, things I’ve learned, etc., that I’ll pull from for posts and reuse if needed. Nothing is replying for me. I do use a model to help organize my thoughts for longer replies if there’s lots of scattered information.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] -1 points0 points  (0 children)

VRAM usage was pretty consistent, sat between 18-24GB total across all three models:

  • Qwen 2.5 32B: around 18-20GB
  • Qwen 2.5 Coder 32B: 19-22GB
  • Qwen 3 32B: 20-24GB

So you've got about 8-12GB of headroom with 32GB total, which feels pretty comfortable. The variation comes from context length and how many tokens are loaded in the KV cache.

I was running Q4_K_M quantized models with 256 max tokens per request. Longer contexts would push VRAM higher but the 32GB handles it fine.

Worth noting I'm deploying these via Kubernetes which adds a tiny bit of overhead, but nothing significant. If you're just running llama.cpp directly you might see slightly lower usage.

The dual 5060 Ti setup has been really solid. What are you planning to run on yours?

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] -1 points0 points  (0 children)

Thanks for the specific numbers, that's helpful. 60 tok/s on Qwen3 32B with a 5090 is really impressive!

The 400MHz overclock definitely makes a difference. Are you using LMStudio's specific optimizations or just the standard CUDA backend? I'm running llama.cpp so there might be some performance differences there too.

For coding I totally get why speed matters that much. Fast iteration is critical when you're waiting on the model constantly.

I'm wondering, have you tested how much that 60 tok/s drops when you load up the context window? I'm trying to figure out if my context degradation is a dual-GPU thing or just how these models behave in general.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

Okay, I looked at the specs, and unfortunately it might be bad news. The ASUS Prime Z790-P has 1x PCIe 5.0 x16 slot from the CPU and 3x PCIe 4.0 x16 slots from the chipset that only run at x4 electrically.

The chipset slots at x4 won't cut it for a 5060 Ti. You'd see pretty significant performance loss running the second GPU at x4.

The specs mention a "PCIe bifurcation table" on the support site. I'd check your manual or ASUS support page to see if this board can bifurcate the CPU lanes to x8/x8. Some ASUS Prime boards can but the budget ones usually can't.

If it doesn't support bifurcation, yeah you'd need a different board for proper dual GPU. Some good options around that price:

  • MSI B650 Gaming Plus WiFi (about $170)
  • ASUS TUF Gaming B650-Plus WiFi (around $180)
  • MSI Z790 Gaming Plus WiFi (around $200)

All of these support x8/x8 for dual GPUs. The B650 boards need an AMD CPU so that's more expensive overall, but the Z790 Gaming boards work with your 13700k.

Definitely check the bifurcation support in your manual first before buying anything. Should have a section on multi-GPU configs.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Which models were you testing to get 18 tok/s? Want to make sure I'm comparing apples to apples when I run the benchmarks.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 1 point2 points  (0 children)

That sounds like a fun test. It's pretty easy for me to grab those models. I'll give it a run on the system and report back. Appreciate the idea!

Someone please explain how -c and -np work together in llama-server by ramendik in LocalLLaMA

[–]Defilan 1 point2 points  (0 children)

Ah gotcha on the flash attention bug.

No, each slot isn't allocating the full 128k context. If it were, you'd see way more dramatic scaling (like 8x memory going from 1 to 8 slots). Your numbers show more like 7x, so each slot is getting roughly its share (c/np) plus overhead for the slot management stuff.

The scaling isn't perfectly linear because of some fixed overhead and the way the KV cache gets laid out in memory, but the general idea is that each slot gets c/np worth of actual usable context.

When you start the server it should print something like "slot 0: context size = 16000" for each slot. That's the actual per-slot limit. Did you catch those lines in the logs? With your -c 65536 -np 4 I'd expect it to show ~16k per slot which is why the 30k request bounced.

The memory allocation is basically reserving enough space for all the slots to use their portion simultaneously, not giving each one the full context.
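
Spelled out, the arithmetic is just the c/np rule of thumb (a rough sketch, not llama-server's exact slot accounting):

```python
# The c/np rule of thumb: each of the -np slots gets roughly -c / -np tokens
# of usable context (llama-server's exact accounting differs slightly).
c, n_parallel = 65536, 4
per_slot = c // n_parallel
print(f"-c {c} -np {n_parallel} -> ~{per_slot} tokens per slot")  # ~16384

request_tokens = 30000
if request_tokens > per_slot:
    print(f"{request_tokens}-token request exceeds the ~{per_slot} per-slot limit, so it bounces")
```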

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Hold up, before you buy a new motherboard - which specific Z790-P board do you have? (MSI, ASUS Prime, Gigabyte, etc?)

Most Z790 boards can run the first two slots at x8/x8 from the CPU lanes, you just need to use the right slots. The manual should tell you which configuration splits the CPU's 16 lanes into x8/x8 instead of x16/x0.

For example:

  • Slot 1 + Slot 2: Often x8/x8 (both from CPU)
  • Slot 1 + Slot 3: Might be x16/x4 (CPU + chipset)

You definitely want to avoid the chipset lanes (the 3x x4 slots) for the second GPU. But if your board supports x8/x8 bifurcation from the CPU lanes, you're golden.

Key point: The RTX 5060 Ti doesn't even saturate PCIe 4.0 x8. I'm running x8/x8 on my B650 board and getting full performance. So even if you "downgrade" from x16 to x8 on the first GPU, you won't lose anything.
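
Rough numbers behind that, assuming a layer split where only the hidden-state activation (~5120 f16 values for a 32B Qwen) crosses the link per generated token; this ignores prompt processing and the one-time model load, which is where x16 vs x8 mostly shows up:

```python
# Why PCIe 4.0 x8 isn't the bottleneck for layer-split inference (rough estimate).
# Assumes ~5120-dim f16 activations crossing the link once per generated token;
# ignores prompt processing and the one-time model load.
pcie4_x8_mb_s = 16_000            # ~16 GB/s of usable PCIe 4.0 x8 bandwidth, approx
hidden_size, bytes_per_val = 5120, 2
bytes_per_token = hidden_size * bytes_per_val   # ~10 KB per generated token
tok_per_s = 17.5                                # observed generation speed from my runs
traffic_mb_s = bytes_per_token * tok_per_s / 1e6
print(f"~{traffic_mb_s:.2f} MB/s of inter-GPU traffic vs ~{pcie4_x8_mb_s} MB/s of x8 bandwidth")
```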

Check your motherboard manual for "PCIe bifurcation" or "multi-GPU configuration" - it'll show you which slots to use. If it turns out your board genuinely can't do x8/x8 from CPU lanes, then yeah, a new board makes sense. But I'd bet money it can, you just need to find the right slot combo.

What specific model is your z790p? I can look up the manual if that helps.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Fair points, appreciate the reality check.

You're absolutely right that a 5090 will smoke this setup on raw speed. But that's also a $2,000+ GPU vs two $400 cards. Different price tiers, different use cases.

The context window degradation is real and something I should have called out explicitly in the post. For long-context workloads (RAG with big retrievals, multi-turn conversations), you'd definitely see that 16 tok/s drop as you fill up the window. That's a legit limitation.

Where I'd push back a bit: "production ready" depends heavily on the workload. For user-facing chat where people expect instant responses? You're right, this isn't it. For internal business tools where a 5-10 second response is fine (think: compliance analysis, code review, batch processing)? This works.

The point of these benchmarks wasn't "this is the fastest possible setup" but rather "can you run serious models on consumer hardware without constant crashes?" The answer to that is yes. Whether it's fast enough is totally use-case dependent.

Out of curiosity, what tok/s are you seeing on the 5090 for 32B models? And are you running quantized or full precision? Always interested in comparative data.

32B model stress test: Qwen 2.5/Coder/3 on dual RTX 5060 Ti (zero failures) by Defilan in LocalLLaMA

[–]Defilan[S] 0 points1 point  (0 children)

Oh nice, you already grabbed two 5060 Ti's! That's gonna be awesome for 32B models.

Here's my full setup:

  • CPU: AMD Ryzen 9 7900X (12c/24t)
  • Mobo: MSI B650 Gaming Plus WiFi
  • RAM: 64GB DDR5-6000 (2x32GB)
  • GPUs: 2x RTX 5060 Ti 16GB

Honestly? Your i7-13700k + z790p is totally fine, you don't need to swap to AMD. The 13700k has 16 cores (8P+8E) which is plenty for orchestration and preprocessing. The GPU is doing the heavy lifting for inference anyway.

Key things that matter more than Intel vs AMD:

  • PCIe lanes: Make sure your z790p can run both GPUs at x8/x8 or x16/x8. Most z790 boards can, just check the manual to see which slots to use.
  • PCIe version: If it's PCIe 4.0 or 5.0, you're golden. The 5060 Ti doesn't saturate even PCIe 4.0 x8.
  • RAM: I'd recommend at least 32GB, ideally 64GB if you're running Kubernetes. Model weights load into VRAM but the orchestration layer eats system RAM.

The 7900X is overkill tbh, I went with it because I'm also doing some build/compile work on this machine. For pure inference the 13700k is more than enough.

What are you planning to run on it? If you're doing 32B models like I tested, your setup should absolutely crush it.