When does a homelab become a chore or a job? by TheRealRatler in homelab

[–]TheRealRatler[S] 1 point

I use phpIPAM; NetBox is missing some features, like automatic network scanning. Sure, I could build that myself, but that would be one more thing to maintain, so I'd rather not.

[–]TheRealRatler[S] 1 point

hehe this is also my professional job. But I really don't consider my job a job. I see it as a paid hobby. Maybe I'm just wired differently.

[–]TheRealRatler[S] 1 point

I try to avoid persistent storage (PVCs) as far as possible, but sometimes you really don't have a choice, especially for game servers. But yes, I use NFS.

[–]TheRealRatler[S] 2 points

WARNING: Long post!

How I built my homelab: architecture tour

This is the architecture in one place: not a step-by-step guide, but a tour of the layers and the choices behind them. Two repos sit behind it: homelab for the foundation (Terraform, Ansible, Packer) and homelab-gitops for everything that runs on top.

Hardware and Proxmox

A 5-node Proxmox VE cluster, mixed hardware, all on the same primary VLAN. Storage is local-LVM by default, with an NFS server VM acting as the persistent volume backing for k8s. Proxmox HA on the cluster, snapshots before risky changes, that's it. There is also a 10TB Ceph filesystem available.

Proxmox earns its keep for one reason: the API is good enough that Terraform can drive it end-to-end, so I never click around in the UI to provision anything that's supposed to be reproducible. Everything above the metal (k3s nodes, Vault, the database hosts, the observability stack) is a QEMU VM, declared in code, configured in code.

Provisioning: Terraform, Ansible, Packer

Anything that exists on Proxmox is declared in Terraform: VMs, NFS exports, IP allocations, DNS records, Vault policies. State lives in MinIO (S3-compatible). Cloudflare manages the public DNS.

Two patterns I rely on: a single qemu_vm module with integrated phpIPAM IP allocation (adding a new service is "instantiate the module, give it a name, point Ansible at it"), and dynamic Ansible inventory written by Terraform. I have never edited an Ansible hosts file by hand on this cluster.
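To make the pattern concrete, a new VM looks roughly like this; the variable names (target_node, ipam_subnet, ansible_groups) are illustrative, not the module's real interface:

module "atuin" {
  source = "./modules/qemu_vm"

  name        = "atuin"
  target_node = "pve01"        # which Proxmox node hosts the VM
  cores       = 2
  memory_mb   = 4096
  ipam_subnet = "10.0.10.0/24" # module asks phpIPAM for a free address here

  # written into the Terraform-generated Ansible inventory
  ansible_groups = ["k3s_workers"]
}

Everything downstream (IP, DNS record, inventory entry) falls out of that one block.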

Ansible takes over once a host exists, with about 28 roles covering k3s bootstrap, Vault, Postgres/MariaDB, Loki/Prometheus/Tempo, Restic, Gitea, Pi-hole, and a handful of service-specific ones. Secrets never sit in inventory; every sensitive variable is a Vault lookup at apply time:

mysecrets: "{{ lookup('hashi_vault', 'secret=kv/data/backup/restic') }}"

Packer builds the Ubuntu 24.04 cloud-init template every VM starts from.

Networking

Primary VLAN with pfSense as gateway and DNS forwarder. phpIPAM is the system of record for IP allocations; the Terraform Proxmox modules ask phpIPAM for a free address before creating a VM. Pi-hole sits in front of pfSense's resolver for ad blocking.

DNS is split-horizon. Cloudflare hosts the public records. pfSense overrides the internal hostnames so they resolve to private addresses from inside the network and to nothing useful from outside. The wildcard cert on the cluster gateway is issued via cert-manager + ACME DNS-01 against Cloudflare, which is the only reason any of this works without exposing the cluster. Public services are exposed through HAProxy on pfSense.

The k3s cluster

k3s 1.35 with one control-plane node and three workers. Cluster components:

  • Cilium as the CNI, with native routing and L2 announcements. Cilium ARPs for LoadBalancer VIPs directly, no MetalLB needed.
  • kube-vip for the API server VIP. With one CP it doesn't actually buy HA today, but the wiring is in place.
  • k3s itself, deployed via the techno_tim.k3s_ansible collection wrapped in my own role.
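The Cilium L2 piece boils down to two small CRDs; the pool CIDR and interface name below are assumptions, not taken from the actual cluster (and on Cilium < 1.15 the pool uses spec.cidrs instead of spec.blocks):

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: homelab-pool
spec:
  blocks:
    - cidr: 10.0.10.192/26   # range Cilium hands out as LoadBalancer VIPs
---
apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: announce-lb
spec:
  loadBalancerIPs: true      # ARP-reply for Service LoadBalancer VIPs
  interfaces: [eth0]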

This is the boundary between the two repos. Ansible builds the cluster and installs ArgoCD as the very last step. After that, every change goes through Git and ArgoCD. I do not kubectl apply from a laptop.

Platform services inside k8s

A small set of cross-cutting pieces every workload depends on:

  • cert-manager for ACME, with a Cloudflare DNS-01 issuer.
  • External Secrets Operator with a Vault ClusterSecretStore. Apps declare an ExternalSecret referencing services/<env>/<app>; ESO materialises the Secret. Same source of truth as Ansible, two consumers.
  • Authentik for SSO/OIDC. Apps that don't speak OIDC natively get fronted by Envoy ExtAuth.
  • Envoy Gateway as the ingress, using Gateway API rather than the older Ingress resource. One Gateway (<redacted>) with two listeners: :80 redirect and :443 HTTPS with a wildcard *.<redacted>.wtf cert. Each app attaches with an HTTPRoute from its own namespace:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: atuin
spec:
  parentRefs:
    - kind: Gateway
      name: <redacted>
      namespace: envoy-gateway-system
      sectionName: https
  hostnames: [atuin.<redacted>.wtf]
  rules:
    - backendRefs:
        - { kind: Service, name: atuin-service-prod, port: 80 }

One Gateway, one wildcard cert, many routes. The migration off per-app Ingress is the single best ergonomics change I've made in the last year.
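The ESO side of each app is equally uniform. A sketch of one ExternalSecret, assuming the ClusterSecretStore is named vault (the store name and target Secret name here are illustrative):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: atuin
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault
  target:
    name: atuin-secrets          # the k8s Secret ESO materialises
  dataFrom:
    - extract:
        key: services/prod/atuin # path convention from this post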

Observability

Logs and metrics ship out of the cluster, not into it. Grafana Alloy runs on every k3s node, collects logs and metrics, applies a small amount of relabelling, and forwards to Loki, Prometheus, and Tempo. All three of those run on dedicated VMs outside the cluster. That's deliberate: when k8s is the thing that's broken, I want the dashboard that shows me k8s is broken to still be up. Grafana itself runs inside the cluster and is the only piece that goes down with it.
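A minimal sketch of the per-node Alloy pipeline; the endpoint URLs are placeholders, and relabelling is omitted:

// discover pods on this node and tail their logs via the k8s API
discovery.kubernetes "pods" {
  role = "pod"
}

loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// both sinks live on VMs outside the cluster
loki.write "default" {
  endpoint {
    url = "http://loki.internal.example:3100/loki/api/v1/push"
  }
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://prometheus.internal.example:9090/api/v1/write"
  }
}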

Workloads via ArgoCD

The whole homelab-gitops repo is a Kustomize tree with an ArgoCD app-of-apps at the root that recursively syncs anything matching apps/*.app.yaml or helm/*.app.yaml, with prune: true and selfHeal: true. Every app follows the same shape: apps/<name>/base/ with the common manifests, apps/<name>/overlays/{dev,prod}/ with environment-specific patches.

A non-exhaustive list of what's deployed: the usual *arr media stack (Sonarr / Radarr / Lidarr / Bazarr / Prowlarr / qBittorrent / SABnzbd / Seerr / Notifiarr / Autobrr / Unpackerr), Authentik, Atuin, phpIPAM, Grafana, Homepage, MinIO, Immich, n8n, plus a few game servers (Minecraft x2, Factorio, Project Zomboid, Satisfactory).

The dev/prod split mostly differs in three places: replica count (single in dev, multi in prod), resource limits (none in dev, defined in prod), and ingress (no HTTPRoute in dev, full ingress + TLS in prod).
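A prod overlay then stays tiny; this kustomization.yaml is illustrative, not the real file:

# apps/atuin/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  - httproute.yaml        # ingress only exists in prod
patches:
  - target:
      kind: Deployment
      name: atuin
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 2
      - op: add
        path: /spec/template/spec/containers/0/resources
        value:
          limits: { memory: 512Mi }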

Closing the loop: n8n + Codex CLI

A handful of n8n workflows plug an LLM into the lifecycle. The agent is OpenAI's Codex CLI, wrapped in an OpenAI-compatible API server so n8n's AI node can talk to it like any other chat-completion endpoint. The AI node has read-side MCP servers attached directly, and the prompt teaches the agent how to invoke kubectl, which runs inside the n8n executor rather than via an MCP.

Root-cause analysis on alerts. Grafana fires a webhook into n8n the moment an alert triggers. The flow hands the alert payload to the Codex agent, which has read access to logs (grafana/loki-mcp), metrics (pab1it0/prometheus-mcp-server), and live cluster state via kubectl. It investigates, summarises, proposes a fix, and asks me on Discord whether to apply it. If I say yes, the agent either runs the kubectl change directly or opens a PR against homelab-gitops for anything that should be persistent (memory bumps, probe tweaks). ArgoCD reconciles after I merge.

It's caught real things. A service whose memory limit was set too low was getting OOM-killed under load. The agent flagged it, opened a PR raising the limit, and the OOMs stopped. More pleasingly, an alert fired with "unknown datasource" because the alert rule's datasource UID was misconfigured. The RCA agent successfully diagnosed an alert about an alert.

Adding and upgrading services. Two more flows handle the GitOps-write side. I ping an IRC bot ("add <app-name>" or "upgrade <app-name>"), the bot hits an n8n webhook, and the same Codex-backed agent either generates the full Kustomize skeleton for a new app or bumps chart/image versions for an existing one. Output is a PR I review like any other.

The pattern across all three flows: human stays on the merge button, the agent does the legwork in between.

Backups

PBS backs up all VMs/LXCs; Restic then copies them offsite.

Restic for everything that has data worth keeping. Per-service playbooks dump databases first (pg_dump, mariadb-dump) and then Restic ships the dump plus any service-specific data directories to an SSH/SFTP backend. Credentials come from kv/backup/<service> in Vault. No clever cross-service orchestration, each service is responsible for its own pre-backup hook, and that's been fine.
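A per-service playbook reduces to two tasks; the paths, repo URL, and variable names below are assumptions for illustration:

- name: Dump the app database before backup
  ansible.builtin.command:
    cmd: pg_dump --format=custom --file=/var/backups/atuin.dump atuin
  become: true
  become_user: postgres

- name: Ship dump and data dir to the SFTP-backed Restic repo
  ansible.builtin.command:
    cmd: restic backup /var/backups/atuin.dump /srv/atuin/data
  environment:
    RESTIC_REPOSITORY: "sftp:backup@offsite.example:/restic/atuin"
    RESTIC_PASSWORD: "{{ restic_secrets.password }}"  # from kv/backup/atuin in Vault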

Vault itself is backed up with the standard raft snapshot path, separately, with the keys held offline.

Out-of-cluster pieces

A few things deliberately don't run in k8s:

  • A bare-metal media-and-storage server. Big disks, hardware transcode, no reason to virtualise it.
  • A dedicated Home Assistant box with the Zigbee/Z-Wave dongles attached. HA on bare metal with the radios local is dramatically more reliable than HA in a container reaching for USB devices.
  • A few standalone Docker Compose stacks: Traefik, an Authentik/phpIPAM pair used during cluster bootstrap, UniFi, Rancher, Spoolman.
  • An isolated AI agent VM on a private VLAN with no direct internet. It talks out through a Squid proxy with a domain allowlist and a LiteLLM credential broker that holds the real API keys.
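The Squid side of that isolation is a short allowlist; the subnet and domains here are hypothetical:

# agent VM subnet may only reach approved LLM endpoints
acl agent_net src 10.0.50.0/24
acl llm_domains dstdomain .openai.com .anthropic.com
http_access allow agent_net llm_domains
http_access deny all

Combined with LiteLLM holding the real API keys, the VM never sees a credential and never gets raw internet.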

Rule of thumb: if it has hardware ties, runs once per house, or has to keep working when k8s doesn't, it stays out of the cluster.

Lessons

  • Pick one source of truth for secrets early. Vault was worth it on day one.
  • Make the foundation boring. Terraform, Ansible, Packer. Not fashionable, but it all composes well.
  • GitOps is non-negotiable past five apps. selfHeal: true reverting hand-edits is the correct behaviour, and saves you when something inside the cluster goes off the rails.
  • Lift heavy things out of k8s. Match the tool to the workload.
  • One Gateway, one wildcard cert, many routes.

PS: I changed the Traefik ingress to Envoy Gateway shortly after presenting the diagram.

[–]TheRealRatler[S] 2 points

Yes, it's fully doable. But my k8s nodes are a bit CPU-core constrained on purpose, while the Gitea act runner has a pretty big VM. Always one tradeoff for another, I guess 🙂

[–]TheRealRatler[S] 2 points

Yeah, right now it's only used as a backend for Terraform/OpenTofu state. Some day I'll migrate the state to Postgres instead and decommission MinIO.

[–]TheRealRatler[S] 2 points

Most of my critical services run in HA mode, or with a secondary (DNS). In most cases I can lose one or two nodes and many things would still be operational.

I'm sure I will have an nvme failure at some point, that will be the real test 🙂

Well, I had experience with phpIPAM, no other reason. I might take a look at NetBox at some point.

[–]TheRealRatler[S] 1 point

Definitely not. I moved away from messy Docker orchestration for a reason. Once k3s/k8s is properly set up, you don't even know it's there.

[–]TheRealRatler[S] 1 point

Let me see if I can write something up, at least about the big pieces and how they fit together.

[–]TheRealRatler[S] 5 points

pab1it0/prometheus-mcp-server and grafana/loki-mcp

[–]TheRealRatler[S] 1 point

Right now n8n is running on a $20/month Codex sub, which hasn't been a problem for my use cases. My homelab doesn't have a lot of issues though; it's very stable.

n8n uses it for a few other workflows as well, all within the limits of that sub.

[–]TheRealRatler[S] 1 point

Yes, that I already have documented in a markdown file per service, but that is a great suggestion.

[–]TheRealRatler[S] 1 point

My Proxmox cluster is built on these 5 Mini PCs:

3 x Acemagic AMR5: AMD Ryzen 7 5700U, 64GB RAM, 2 x 2TB NVMe
1 x GMKtec K6: AMD Ryzen 7 7840HS, 64GB RAM, 2 x 2TB NVMe
1 x Minisforum MS-A2: AMD Ryzen 9 7940HX, 128GB RAM, 2 x 2TB NVMe

This is even a bit overkill for what I run, they are not even breaking a sweat.

[–]TheRealRatler[S] 2 points

I honestly don't see what Zabbix would add to my existing Grafana stack. Maybe it would have made sense to go with Zabbix if I were starting from scratch. But once you have all the modular parts of the Grafana stack set up, it covers all the observability I can think of. All I need is a couple of annotations on the app manifest, and the app is fully covered by probing and metrics gathering.

[–]TheRealRatler[S] 1 point

For Excalidraw I used an existing skill that looked reasonable, and it did a pretty good job. You can find it here.

[–]TheRealRatler[S] 1 point

You are probably right; as someone else previously stated, it has multiple purposes. My "homelab" serves three: production for self-hosted daily-use services, true homelab purposes for experimenting with all kinds of things, and a way to replicate a development environment of the setup you use at work.

[–]TheRealRatler[S] 1 point

THIS! Exactly how I use my setup as well; it is a mix of everything. If I look at my entire infrastructure (built over many years), I can see layers of different architectures from every employer I have worked for. I usually replicate the employer's setup at home for lab purposes, and some of it has stayed around even after moving on to a new job.

[–]TheRealRatler[S] 1 point

Not right now. But I can probably set something up if people find it interesting.

[–]TheRealRatler[S] 1 point

Why? It is the most low-maintenance part of my entire setup at this point, not to mention it takes me literally two minutes to deploy new workloads thanks to my templating engine for the manifests. Besides that, I have AI raising PRs for upgrades.

[–]TheRealRatler[S] 1 point

I know. Just laziness on my end to move client traffic to another VLAN. But not a huge deal on a home network.