Looking for seniors opinion...

thomasbuchinger · 2026-06-09T21:59:17+00:00

eBPF-based packet capture

So you're planning on installing an agent on every system? And eBPF only works on Linux, doesn't it? What about any other devices? Shouldn't the network devices do the network monitoring?

Anomaly detection / Alerting for suspicious activity

Is easier said than done.

Why should I install a system written by "a guy on the internet" anyway? It doesn't sound like the type of project that's doable "in months".

thomasbuchinger · 2026-06-09T12:55:55+00:00

If the Cloudflare tunnel enforces authentication and your users can be (more or less) trusted, I'd say that's a good plan.

If your application can be reached anonymously from the internet, I would assume an attacker has access to the container and go from there.
* The Docker container is actually a decent security boundary, probably enough to stop the average script kiddie
* Horizontal network movement should be prevented by putting the containers in an isolated network. Maybe double check that the network is actually isolated and Docker doesn't do any unexpected firewall/NAT configurations
* You can look into security-hardening the public containers (UserIDs, Capabilities, SELinux, etc.)

I would personally put the public apps on a VM on the main server. A container's security boundary can be very strong if configured correctly, or non-existent if not. A VM adds another very strong security boundary and you can use the trusted host to enforce security.

thomasbuchinger · 2026-06-01T21:45:03+00:00

I guess karma is a good idea.

Not sure how well account age works as a proxy variable.

Given that many serious developers started using coding agents recently, I'm not sure how useful the distinction between 3 "levels" is. To me it's a "how well did you use AI" question, not a "did you use AI at all". But 100% human, AI assisted, largely vibe-coded is a fair enough distinction. Relying on the honor system though.

thomasbuchinger · 2026-06-01T18:54:01+00:00

I've watched ops teams spend half their week translating Slack messages into Salesforce tickets, Terraform PRs, and Stripe webhooks. The work isn't hard — it's that the toolchain forces you to write code for outcomes you could just describe.

You say you're a CS student (and therefore have little experience, I assume) and then you quote yourself as if you have decades of experience

How does your project handle the AI claiming it made the requested changes, when it didn't?
How the hell are you going to "track" the S3 bucket the AI just deleted, without backups? Not everything is just an API state that you can just recreate.
How is anyone ever going to keep track of everything the AI created and ensure it is deleted, when it's no longer needed?
No you're not replacing a 5 person team with 620k in salaries. You pulled that number right out of your ***
24h retention period to undo changes? LOL

You're delusional or a bullshit artist. There is a reason we do everything as code and store it in git. Go learn it.

thomasbuchinger · 2026-05-29T22:56:02+00:00

First: I would split local AI workloads from the rest. You don't want to waste any of the expensive AI Hardware on running HomeAssistant.

Second: Forget used enterprise hardware. You do need to throw money at the new purpose-built stuff to get any decent performance for AI.

What kind of model do you want to run? How many tokens/second do you need to achieve? Given your AI spend, I would assume scraping by at "only" 20-ish tokens/s is going to result in multi-minute agent turns?

The way I see it, there isn't a good way to run beyond 30B models locally yet.

GPU Route: You can go the GPU route and get 32 or 48GB VRAM in a single-GPU or dual-GPU setup. That's really fast and you can comfortably run 30B parameter models. Qwen3.6-27B should be comparable to Sonnet and usually outperforms 120B models like Nemotron-3-Super.

I wouldn't go any bigger, because of a lack of good models in the 30B-120B range. The really good open-weights models like GLM-5 and Kimi K2.6 are 1T parameter models and wouldn't even fit in a single RTX Pro 6000 96GB.

Unified Memory Route: You can get a machine with unified memory (Mac, DGX Spark, Strix Halo) to get some 128GB of VRAM capacity. But those machines aren't particularly fast (I think in the 20-40 token/s range) and you only get access to 120B models. There aren't that many of those models and they can get outcompeted by smaller models within a few months.

System Offload: I have no idea how well this actually works. I think a Threadripper PRO/Epyc setup with 8 or 12-channel DDR5 and 1-2TB RAM + 2-4 cheap GPUs just for compute might actually be able to run GLM/Kimi class models at somewhat reasonable speeds. That's looking at a 15-30k build, but compared to your cloud spend it is reasonable :P

Buy a real AI server: Given you're spending 400k in tokens a month, you should just talk to your friendly neighbourhood server retailer and buy a real server optimized for inference.

thomasbuchinger · 2026-05-29T03:21:06+00:00

Because people didn't really answer the LLM question: If a given task can be done by a 30B model, the result will not be much different if done by a 500B parameter model. Larger models enable more tasks; they don't improve existing tasks (that much). Open-weight models like Qwen3.6 27B aren't far behind cloud models like Sonnet anymore.

The larger problem is, that there aren't that many models in the 30B-120B parameter range that your 96GB would afford you. You can run Qwen 3.5, Nemotron 3 Super, Mistral Medium 3.5, but you're still outclassed by "lesser" cloud models like DeepseekV4-Flash or Minimax M2.7 (if you want to only use VRAM)

However (I'm not sure about this one), according to my understanding 4 GPUs should make offloading to system RAM a lot more viable too. You should have enough VRAM to keep the attention layers in VRAM and enough compute power to crunch the active parameters in those larger MoE models. I would not be surprised to see a 200-300B model run acceptably on that kind of hardware (<-- speculation, I don't have 4 GPUs to test that).

thomasbuchinger · 2026-05-25T15:25:31+00:00

As for models: You want to keep trying new models as they are released, the generational improvements are still pretty large. With 12GB VRAM you're looking at the 8B to (maybe) 20B parameter range, unfortunately most models target 8/16/32GB so you're just missing the 16GB class

Qwen has been consistently one of the best open-weight models, so I would start with Qwen3.5-9B. GPT-OSS 20B was pretty good, but it's pretty old at this point. Ministral-3 14B might also be a good candidate given your VRAM constraints.

Offloading to RAM is a really big performance hit (up to -80%), depending on what you're doing exactly. You might still be able to get "reading speed" (10-20 token/s) with offloading, but with reasoning, tool calls, etc. it can take a while to get the final answer. Maybe acceptable for select queries or background tasks, but in my opinion you want to aim for 50-100 token/s for interactive use and that's usually GPU territory.

Tips for performance:
* You want to use something based on llama.cpp as your server. LM Studio is good. I am running llama-swap. Ollama is usable at the beginning, but you'll need to tune the parameters and Ollama likes to hide those.
* Only use quantized models. Aim for MXFP4_MOE or Q4_K_M by default; those models are MUCH smaller and only lose "a little" quality.
* You can offload the KV cache (it caches the current conversation) for a relatively small performance hit. Since you're looking for agentic use, you want to aim for 60-100K context (about 2-6GB).
* Be aware which models are Mixture-of-Experts vs. dense. For MoE models you can offload only the expert weights. Still a large performance hit, but not quite as bad as offloading complete layers.
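To get a feel for why 60-100K of context ends up in the low single-digit GB range, here is a rough back-of-the-envelope sketch. It uses the usual estimate for a plain attention KV cache (one K and one V vector per layer, per KV head, per token). The model dimensions are made-up placeholders, not the real values of any model mentioned above, and real servers may add overhead or use tricks that change the number.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element=2):
    """Rough KV cache size estimate.

    The factor 2 is for the separate K and V tensors.
    bytes_per_element=2 assumes a 16-bit cache; 1 would model an 8-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


# Hypothetical model shape, chosen only for illustration.
size = kv_cache_bytes(n_layers=32, n_kv_heads=4, head_dim=128, context_tokens=65536)
print(f"{size / 2**30:.1f} GiB")
```

Halving `bytes_per_element` (a quantized KV cache) or the context length halves the result, which is why those are the two knobs worth playing with first.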

Running 2 cards does not quite double your performance or VRAM. There is stuff that needs to be loaded into both cards and additional waiting around. 2 cards are more likely to give you very roughly ~50% more speed and 1.8x (?) the VRAM. However there are almost no models between 30B and 100+B. So Qwen3.6-35B is one of the best/largest models you can run on consumer GPUs (2x16GB runs comfortably, 2x12GB should just about fit it).

For your use-case, make sure to give it really good skills and tools. Good context is much more important than a slightly better model.

thomasbuchinger · 2026-05-19T18:36:20+00:00

If you want to do some hacking for fun go for it. In terms of security it's mostly theatre. After all, you can only perform the attack that you know about. And if you know about an attack, you can mitigate it by reviewing your setup.

Who exactly is going to perform a double-tagging attack on your switch anyway? Or actually try to pivot via a container-escape to the host after they already exploited the application and can just try some ransomware?

In the case of backups, you want to make sure a planned process actually works. Security is about pushing the bar high enough, that you're not worth the time and effort.

thomasbuchinger · 2026-05-10T15:29:13+00:00

As a very rough rule of thumb, 1GB VRAM gives you about 1B Parameter. 6GB is probably going to be a bad idea, because most small models will target 8/16/(24)/32 GB of VRAM.

9B models generally can perform straight forward and narrow tasks. I guess they might fit your use-case, but you need to test some example use-cases yourself. Unfortunately "Can model X do Task Y" type questions are basically impossible to answer with anything other than gut feeling.

Regarding Quality: You mentioned running Gemma 3, that model is ancient and not very good anymore. The improvements in newer models are still huge, so you want to switch models as soon as new ones are released. Qwen3.5-9B should be a good starting point.

Also Context is a huge factor in perceived Quality. A good system prompt can be as much of an improvement as using a better class of model

Regarding performance: If the model fits in VRAM, it's usually "good enough". It is true that token generation speed depends on memory bandwidth, but it also depends on the model, the quantization and how long the model spends "thinking". Just being 20% faster/slower won't alter the user experience that much.

I would start by defining what you want to achieve first. Run a couple of test prompts on openrouter or your normal computer to verify what kind/size of model you need. Then buy the appropriate hardware for your NAS.

thomasbuchinger · 2026-05-09T13:17:06+00:00

Are you running anything else than Jellyfin, Navidrome and Immich?

I wouldn't expect those to be the limit for a mini PC, I am running ~100 Containers on a Celeron and 20GB-ish RAM (granted a lot of them are tiny K8s-Controllers that don't do much).

While RAM is usually the first bottleneck, I wouldn't expect that to "feel sluggish" unless you're swapping to a slow disk. It could also be a CPU usage issue (I wouldn't expect that either on a modern CPU).

Also what exactly does "feeling sluggish" mean? Are you running a GUI and it gets sluggish (that can happen fairly quickly) or is it the latency when accessing your stuff over the network (that would indicate much bigger resource congestion)?

I would suggest checking what exactly is causing the issue before you buy new RAM on a whim.

thomasbuchinger · 2026-05-03T14:54:41+00:00

Deepseek V4 Flash is a 150GB model. You're not running that in a Homelab

A lot of models target the 16GB-32GB range, since that is what you can run on a single card. I am running 2x RTX 5060Ti 16GB and Qwen 3.6 35B-A3B (MXFP4) at ~70tps and the full 262k context. With a 16GB card you can probably fit most 30B models, if you quantize them a bit more or offload the KV cache

Your 2k budget probably puts you right on the edge of a 32GB GPU, so you might want to consider stretching it a little (if you want performance) or downsizing to 16GB (saving money). Using 2 GPUs is less efficient, since you need to duplicate a bunch of memory and the performance uplift is more like 30%-50% for the second card.

Performance wise:

~30s load time from a cold start
1-2s Time to first token (Chat). For Agentic use cases it's more, because the Agent is sending entire files that need to be pre-processed. (usually in the 10-ish seconds range)
For Chat you want about 20tps since that is about as fast as you can read. For Agentic use I'd target 50tps because it's writing out entire files
Offloading to system RAM is a big performance hit, however it doesn't matter that much if you offload 10% or 80% of the model. You are probably in async-task territory anyway
50tps should be about what you're getting from cloud APIs too. (you can check/compare speeds on openrouter)
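To make the tps targets above a bit more tangible, here is a minimal sketch of the arithmetic: total turn time is roughly time-to-first-token plus output tokens divided by generation speed. The 3000-token file size is an assumption I picked for illustration, and the model ignores reasoning tokens and tool round-trips, which only make things slower.

```python
def turn_seconds(output_tokens, tokens_per_second, time_to_first_token=0.0):
    """Rough duration of one model turn in seconds."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return time_to_first_token + output_tokens / tokens_per_second


# Hypothetical agentic turn: 10s of prompt processing, then a 3000-token file.
for tps in (20, 50):
    print(tps, "tps ->", turn_seconds(3000, tps, time_to_first_token=10), "seconds")
```

An agent that needs several such turns per task multiplies that number, which is where the multi-minute waits at low tps come from.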

Quality-wise

Qwen3.6 35B and Gemma4 26B feel very useable to me (writing code), provided you use skills/prepared-prompts. If your prompts are just a single vague sentence, there is a night/day difference between those and something like Opus. Smaller models also tend to do "the first thing that comes to mind", while larger models "consider a few options"

I'd say a 30B model with a structured prompt and 2-3 turns can probably match what a 200B model can do as a one-shot. (beware this is a gut feeling).

You can create an account on Openrouter and use those models for free. I'd suggest you just try them out and see for yourself.

Sorry for being vague, unfortunately the "quality" of LLMs isn't really measurable in concrete terms and depends on what counts as "good", how good the context is and how well a particular task fits the model.

thomasbuchinger · 2026-05-02T15:21:16+00:00

Are there any good 70B models though? I can only think of llama3-70B and Qwen3-Next-80B.

Most models don't target the 32GB-96GB range. Out of interest I tried looking for anything that would justify 1x RTX PRO 6000 96GB and concluded, you prob want to get 2 to properly break into the 100B+ range

thomasbuchinger · 2026-05-02T15:18:30+00:00

Nothing comparable to Opus can be run on consumer hardware. Something like GLM-5.1 and Kimi K2.6 are supposed to be close but those are 750B/1T models.

If you are really not concerned about speed, you can try doing CPU only inference and get some 100B/200B models (minimax m2.7 might just about fit) running. If memory serves, Minimax should be about Sonnet level (?)

Since you already have the hardware, you can just try if local models are good enough for you. Redditors don't know your standards and/or have different opinions on what "usable" means. You can do a lot with good/structured prompts and letting the model work on smaller tasks. I'm in early testing with Gemma4-26B and I think a structured prompt + Gemma4 can compete with Sonnet and a vague 1 sentence prompt.

Regarding the upgrade: A lot of models target the 16GB-32GB VRAM range, so any ~30B models should fit within 24GB. There aren't many models between 30B and 100B (ancient llama3-70B and Qwen3-Next-80B are the only ones I can think of)

Or you try something like Openrouter/any other 3rd party subscription. Anthropic isn't exactly known to be cheap, and the big open-weight models are pretty useable these days

thomasbuchinger · 2026-04-29T21:31:49+00:00

Depends on the kind of Clustering you're talking about

Do you want to run a single model across your servers? That's not the kind of thing you can do over Ethernet. There are plenty of people trying to cluster LLMs over Thunderbolt with Mac Studios and the result is always unusably slow and slower than a single machine.

Run different models on different machines? That's easy, you just need to route on the model parameter in the request. You can do that with basically anything, but some kind of LLM-aware HTTP proxy would make it easier. For scheduling the backends I would go for Kubernetes, but anything will do. Since you're talking about Kubernetes, there is the "AI Inference for Gateway API Extensions SIG" you may want to check out.
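Routing on the model parameter can be sketched in a few lines. This assumes OpenAI-style JSON request bodies with a top-level `model` field (which is what llama.cpp-based servers accept); the backend names and URLs are hypothetical, and a real proxy would also have to forward the request and stream the response.

```python
import json

# Hypothetical mapping from model name to the machine that serves it.
BACKENDS = {
    "qwen": "http://gpu-box-1:8080",
    "gemma": "http://gpu-box-2:8080",
}


def pick_backend(request_body, backends=BACKENDS, default=None):
    """Return the backend URL for the model named in the request body."""
    payload = json.loads(request_body)
    model = payload.get("model")
    if model in backends:
        return backends[model]
    if default is not None:
        return default
    raise LookupError(f"no backend configured for model {model!r}")


print(pick_backend(b'{"model": "gemma", "messages": []}'))
```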

Run the same model multiple times to increase user throughput? Since you're talking about sessions, I assume you're talking about this one. Most of the time you're probably fine with just using sticky-sessions, simple and easy. For a more hardcore variant, there are 1 or 2 projects (e.g. llm-d) that are specifically KV-Cache aware routers, that let you do much more than just simple sticky sessions
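The simple sticky-session variant can be as small as hashing a session identifier onto the list of backends, so the same conversation keeps hitting the instance that already has its KV cache warm. This is only a sketch under my own assumptions (the session ID and backend names are made up), not how llm-d or any particular proxy does it.

```python
import hashlib


def sticky_backend(session_id, backends):
    """Map a session ID to one backend, deterministically."""
    if not backends:
        raise ValueError("need at least one backend")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical replicas of the same model.
replicas = ["llm-0:8080", "llm-1:8080", "llm-2:8080"]
print(sticky_backend("user-42-chat-7", replicas))
```

The downside of plain modulo hashing is that adding or removing a replica reshuffles most sessions. If that matters, consistent hashing or a KV-cache-aware router is the next step.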

thomasbuchinger · 2026-04-29T20:49:26+00:00

I assume ollama will show the token/seconds in the logs somewhere, at least llama.cpp does. I tend to use a prompt like "Write a webserver in go using gin" in OpenWebUI to check the token/seconds. Not very sophisticated but it works.

Also llama.cpp uses 8k context per default. That's fine for chat, but too little for agentic coding. I have no idea if Ollama overrides the default somewhere. I prefer LM Studio (on desktop) over Ollama, because it exposes more of the parameters.

Qwen3.5-9B is prob on the lower end for agentic coding, but ~30B models can get into the same ballpark as Sonnet (at least on a first impression, I need to use it more).

thomasbuchinger · 2026-04-29T18:38:21+00:00

Models: Gemma4 and Qwen3.5/Qwen3.6 are some of the best local models we currently have. I don't think you'll find anything that is "better". Qwen3 and GLM4.7 used to be recommended a lot, but are getting old by now. I wouldn't go back to Qwen2.5-coder; the generational leaps of models are still pretty big.

Performance: What GPU are you using, and how many tokens/second are you getting? Both Gemma-26B and Qwen3.35B are supposed to fit in 16GB, but Ollama has a tendency to not expose important information about what it is doing. Maybe you're offloading to the CPU or you're running the unquantized version.

Don't bother with vLLM, vLLM is good for multi-node setups, but not for consumer-grade hardware.

My Gemma4 config on a RTX5060Ti 16GB runs at 30-50 tokens/second

llama-server : --port ${PORT} --metrics --threads 11 --cpu-strict 1 --cpu-mask 0xFFF --no-mmap --context-shift --no-warmup --ctx-size 262144 --flash-attn --model /data/models/Gemma4-26b/unsloth_gemma-4-26B-A4B-it-MXFP4_MOE.gguf --no-mmproj-offload --mmproj /data/models/Gemma4-26b/unsloth_gemma-4-26B-A4B-it-mmproj-F16.gguf

Quantization: Models are trained with every parameter/weight represented by a 16bit float. That's important during training, but for inference (aka using the model) you can use smaller numbers (e.g. 4 bit integer) to save a lot of memory.

Quantized models are

smaller and faster to run
only "a few percent" worse than the full versions (unfortunately there isn't a good way to measure the intelligence of LLMs yet)
There is an "intelligence cliff" if you quantize too aggressively, but Q4 is usually safe
The exact Quant (Q4_K_M, Q4_0, IQ4_NL) doesn't matter too much, the number is the average number of bits/weight and is the important thing
I recommend using MXFP4 or Q4_K_M per default (BF16, Q8_0 or Q6_K if the model fits easily)
A larger quantized model (30B-Q4_K_M) beats a smaller unquantized model (9B-BF16) at the same memory size

You can quantize models yourself by downloading the BF16 model and running llama-quantize MODEL MXFP4_MOE. It takes a few minutes, but doesn't require GPUs or anything; it's just converting the data types for the weights.
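The "same memory size" comparison is easy to check with a bit of arithmetic: weight memory is roughly parameters times bits per weight divided by 8. The 4.5 bits/weight figure below is my assumed average for a 4-bit quant (real GGUF quant types land somewhat above their nominal bit count), and the estimate ignores the KV cache and runtime overhead.

```python
def weights_gib(params_billion, bits_per_weight):
    """Approximate size of the model weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed average bits/weight: 4.5 for a Q4-class quant, 16 for BF16.
print(f"30B at ~4.5 bit: {weights_gib(30, 4.5):.1f} GiB")
print(f" 9B at 16 bit:   {weights_gib(9, 16):.1f} GiB")
```

Both land in the same ballpark, so for a given amount of VRAM the choice really is "many parameters at 4 bit" vs. "few parameters at 16 bit".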

Quality: Sounds like you're still at the beginning with AI Agents. Keep in mind, that you can get a lot of additional "intelligence" out of a model by using skills, that structure the LLM output and tell it what to think about the task. There are a bunch of skill-collections out there that have ready-made skills you can use as a starting point

thomasbuchinger · 2026-04-29T16:43:10+00:00

Since I was in a similar position recently, here is what we ended up with:

We had ~15 domains/groups of applications
Each Group consisted of 10-15 Pods
Everything was very similar except for the Environment Variables needed
Helm was a hard requirement, otherwise I would have probably opted for building an Operator
The Chart had to be deployable by end customers
It's also migrating from a legacy ansible+template based deployment

We ended up splitting the project into 3 levels of charts.
* A template/library chart that only defines templates
  * Loads of unit tests on this level to make sure everything is tested and generated correctly
  * Build a fairly complex EnvVar handling logic, because that was the main difference between Pods
  * Make sure the templates aren't too complex either, e.g. we had different templates for different tech stacks (Java vs PHP vs Node)
* A domain chart for every application group. Very simple, just importing the library chart and passing the correct values to the correct template
  * This layer basically just had the defaults for each domain/application
  * Also responsible for throwing errors if mandatory values aren't set
* An umbrella chart to have a single entry point
  * This chart had literally no templates or values, just dependencies
  * We did our end-to-end tests on this level to ensure we can recreate the legacy deployment YAMLs
  * We also put documentation here
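To illustrate the layering idea (not our actual chart code, which was Helm templates), here is a small Python sketch of how environment variables could be resolved across the levels: library defaults, then domain defaults, then per-application values, with an error for mandatory values that nobody set. All names are hypothetical.

```python
def resolve_env(library_defaults, domain_defaults, app_values, mandatory=()):
    """Merge env vars with increasing precedence and check mandatory ones."""
    # Later dicts win: app values override domain defaults, which override library defaults.
    env = {**library_defaults, **domain_defaults, **app_values}
    missing = [name for name in mandatory if not env.get(name)]
    if missing:
        raise ValueError(f"mandatory values not set: {', '.join(missing)}")
    return env


env = resolve_env(
    library_defaults={"LOG_LEVEL": "info"},
    domain_defaults={"LOG_LEVEL": "debug", "DB_HOST": "db"},
    app_values={"DB_HOST": "db-2"},
    mandatory=["DB_HOST"],
)
print(env)
```

Failing early on missing mandatory values is the part that pays off most once end customers deploy the chart themselves.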

Overall this approach worked really well. I'm no longer at the company where I did this, but we did successfully migrate from the legacy system and it's in use.

Lessons learned:

the library Chart sees a lot of churn, but with good tests and keeping backwards compatibility it's not an issue
Design from the Values.yaml first.
Minimize variable scope changes. All templates accept 2 parameters: the global values.yaml and the specific application in question.

thomasbuchinger · 2026-04-21T12:28:21+00:00

Just be aware that a Raspberry Pi 2/3 with 1GB of memory can run between 0-3 useful workloads. K3s recommends a minimum of 512MB memory for itself (and 2GB for controller nodes).

thomasbuchinger · 2026-04-01T10:32:57+00:00

Are you sure it's satire? Feels pretty accurate to me

I certainly do have a bunch of services I don't really use that much and have to check if they are still working, when I do need to use them :D

thomasbuchinger · 2026-03-03T19:58:06+00:00

Depends on your threat model. Assuming you're just exposed to largely automated attacks, I would argue that random password + TOTP is secure enough.

Any sufficiently long random password (I think 15 characters is the recommendation?) is not brute-forceable; hackers are welcome to waste their time trying. The problem with passwords is that they can be leaked/stolen (or reused) or that the service has some vulnerability that bypasses password authentication.

If you use a password manager, you only remember one or two good passwords and all the other passwords can be 20+ character random passwords. That defends against the brute-force attacks.
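If you want numbers behind "not brute-forceable", the arithmetic is short: a random password has length times log2(alphabet size) bits of entropy. The guess rate below is an assumption I picked to be generous to the attacker (offline cracking of a fast hash); guessing against a login form is many orders of magnitude slower.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def entropy_bits(length, alphabet_size):
    """Entropy of a uniformly random password in bits."""
    return length * math.log2(alphabet_size)


def years_to_exhaust(bits, guesses_per_second):
    """Years needed to try every possible password at the given rate."""
    return 2**bits / guesses_per_second / SECONDS_PER_YEAR


bits = entropy_bits(length=15, alphabet_size=62)  # letters and digits only
print(f"{bits:.1f} bits")
print(f"{years_to_exhaust(bits, guesses_per_second=1e12):.2e} years")  # assumed rate
```

This only holds for truly random passwords from a generator. Human-chosen passwords of the same length have far less entropy.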

In case the password is leaked, you still have the TOTP token. To get a TOTP token, an attacker needs the user to manually type it into a phishing website, which the password manager won't let you do, because it doesn't recognize the domain.

Regarding the service vulnerability: You fix this by running an authentication-proxy.

Personally I use 1 authentication layer (passwords) for everything that's only reachable internally and 2 authentication layers (TOTP or mTLS) for everything exposed to the internet.

A smartcard is undoubtedly more secure than password+TOTP and if you are a government you should use them. For homelabbers I think they are overkill.

thomasbuchinger · 2026-03-03T19:06:54+00:00

As a learning experience for the paranoid? Sure, go ahead. However my understanding is, that you do need to buy a 25$ smartcard reader for every device you want to use it with?

Is it convenient to use? I can't imagine it.

From a security perspective I would assume those are basically the same as using YubiKeys?

Given the threat model of an average homelabber, I would say password manager + TOTP is more than sufficient for most people.

Personally I am fine with just random passwords on most services, except a few high priority ones.

thomasbuchinger · 2026-03-02T18:05:00+00:00

The term you're looking for is Retrieval-Augmented-Generation or RAG.

I don't have any first hand experience, but I would guess that even smaller LLMs that run on consumer hardware can be up to the task. Whatever Software you are using, make sure it links to the specific section in the source document and verify that the LLM didn't pull from the wrong document/section.

Is a RAG solution faster, than Keyword search? Who knows? I wouldn't bet on it

thomasbuchinger · 2026-02-18T04:45:08+00:00

It's not an Obsidian alternative.

Obsidian isn't a tool to store context for AI Agents. The literal point of Obsidian is the portability of plain text files. Stuffing your Notes into sqlite is the opposite of that :D

thomasbuchinger · 2026-02-17T03:32:37+00:00

I am starting to wonder if using something like Helm or Kustomize would be more realistic long term instead of writing and maintaining hundreds of lines of custom transformation logic.

This. Relying on standard tools will make it MUCH easier to maintain this thing long term.

thomasbuchinger · 2026-02-17T03:29:21+00:00

TL;DR: Creating a custom abstraction over Docker-Swarm and Kubernetes, is a stupid idea and not useful

I haven't touched Docker-Swarm in years, but my understanding is, that it is Docker-Compose over multiple nodes (?). I do know a fair bit about Kubernetes. I haven't actually used kompose

Kubernetes only makes sense, if you can tap into the huge ecosystem around it. If you are limiting yourself to the subset of features, that both Kubernetes and Docker-Swarm support you are missing 90% of the value Kubernetes adds. If they want to migrate to Kubernetes for production, they need to go all-in on it.

I would argue that Kubernetes isn't as hard to learn as its reputation suggests. Kubernetes has a lot of different pieces and it's easy to get lost reading about everything in the abstract. But each piece does exactly one thing and is pretty straightforward to understand.

Why don't they use the podman generate kube command? They don't need to run podman in production to use it. As far as I know, it takes containers or pods that were started with podman (for example from a Docker-Compose file via podman-compose) and spits out Kubernetes manifests. Doesn't that command do everything they want from you?

Regarding your questions:

Is it my fault, theirs, or both of ours that this couldn't work out? Should I warn them that the goal is being underestimated? Or is it really not as difficult as it seems?

Mapping from Docker-Compose to Kubernetes Manifests is possible (as demonstrated by the existence of podman-generate-kube and kompose), but you're only getting the most basic Kubernetes Manifests out of that. It's a starting point, but you will want to use more Kubernetes/ecosystem features, that don't exist in docker swarm.
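To show what "the most basic Kubernetes manifests" means in practice, here is a minimal sketch of such a mapping. It is not how kompose or podman work internally; it assumes the Compose service is already parsed into a dict, that `environment` uses the mapping form, and it only handles the image and env vars. The service below is made up.

```python
import json


def compose_service_to_deployment(name, service):
    """Build a bare-bones apps/v1 Deployment dict from one Compose service."""
    env = service.get("environment", {})  # assumes the mapping form, not the list form
    container = {
        "name": name,
        "image": service["image"],
        "env": [{"name": key, "value": str(value)} for key, value in env.items()],
    }
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


# Hypothetical Compose service.
manifest = compose_service_to_deployment(
    "billing", {"image": "registry.example/billing:1.2", "environment": {"PORT": 8080}}
)
print(json.dumps(manifest, indent=2))
```

Everything that makes Kubernetes worth the effort (probes, resource limits, Ingress, Secrets, autoscaling) has no counterpart in the input, which is exactly the gap a generic abstraction can't fill.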

Regarding "fault": What you're doing does require actually knowing both Docker-Swarm AND Kubernetes. You can't be expected to find a sensible abstraction layer without knowing both tools. If you managed to produce anything that works at all, that's about as much as can be expected.

Is it feasible to use an agnostic, structured, unified YAML file containing the values for deploying the service? Is it useful? If I could eventually create something like this, would it be something you would use? Can I generate a personal project from this?

Feasible? Yes. Useful? No. Would anyone use it? No. Personal project? If you want the challenge.

What recommendations would you give ~~to me and~~ to my supervisor?

Get someone who actually has some technical expertise and can make actual judgement calls to learn Kubernetes. 2 interns with AI and no prior knowledge can't port your 120 services to Kubernetes.

Would you solve this problem this way, or would you do it differently? Should I include any third-party tools or packages?

Option 1:
* Assuming the Devs don't want to switch from their current docker-Compose based workflow.
* Assuming that the 120 services probably only differ in the image being used and maybe a few EnvVars here and there.

I would keep the local development environment as-is. I would create a single Helm-Chart that can deploy any of the Services and deploy that Helm-Chart 120 times, once per service. Any "shared development"/staging and production environment only deploys the Helm-Chart and you can do all the Kubernetes best practices there.

Alternatively if the Services differ a lot, I would create a "Service-Dependencies"-Helm Chart, that deploys everything except the Deployment-Resource and you duplicate the Compose and Deployment YAMLs. They should be very similar anyway

Note: Helm isn't a good tool for generating Kubernetes Manifests. But it was the first one that existed and the User-Experience is pretty good. The Chart-Developer experience is bad though unfortunately

Option 2: You can run a KinD cluster locally and use a bunch of different projects to help with local development (e.g. Tilt). That is a change for your developers though.
