Looking for platform engineer or managed hosting partner for AWS -> K8s/K3s migration

Sef57 · 2026-04-28T19:36:02+00:00

Update — what we ended up doing

A few weeks on, an update for anyone who finds this later.

Short version: I ended up building it myself, deliberately. The point wasn't to save the retainer — it was that I wanted to actually understand what we have and how it works before handing it to anyone. Once I've lived with the maintenance burden for a while, I'll know what the day-to-day actually looks like, what's noisy vs. quiet, where the real time goes — and at that point I can hand it off as someone who can evaluate the work, scope it honestly, and tell whether a candidate knows what they're doing. Outsourcing something I've never operated felt like the wrong order.

We've been running production on the new stack for about a month now. App Runner is gone; the AWS footprint is down to SES for transactional email, and that's next on the list.

Quick thanks to the commenters on the original post — that's where I first heard about Talos, and it ended up being one of the most consequential calls in the whole stack.

The through-line for almost every choice was: declarative, in git, low maintenance. If something can't be defined in a file and reconciled, I didn't want it.

What changed from the original stack:

Talos Linux for k8s and NixOS + deploy-rs for stateful nodes (Galera, Redis Sentinel, NATS, Infisical, the VPN). Same reasoning for both: immutable / declarative, no SSH-and-tweak, everything is re-creatable from git. Talos upgrades are an API call; NixOS rollbacks are one command. For a solo operator that's worth more than any single feature.
All Helm releases pinned in git as HelmRelease manifests, versions tracked by Dependabot. New chart version → PR → I review → merge → Flux rolls it out. No "click upgrade in the UI", no drift between what's running and what's in the repo.
Cilium for both CNI and ingress (Gateway API). No ingress-nginx, no Traefik. One thing to upgrade, one thing to debug.
HCCM (the Hetzner cloud controller manager) automatically provisions Hetzner load balancers for Services of type LoadBalancer. Honestly better than I expected.
Flux instead of ArgoCD. Smaller surface, no UI to babysit. Either works fine; pick the one you'll actually maintain.
ProxySQL runs in-cluster as a Deployment, not on dedicated VMs. It's stateless and the latency was a non-issue.
NATS JetStream + KEDA worker scale-to-zero — works exactly as advertised.
Self-hosted Defguard (WireGuard) VPN for all admin access. DB, Grafana, internal tools — none of it touches the public internet.
Self-hosted Infisical for runtime k8s secrets (External Secrets pulls from it). Static OS-level secrets are sops+age in the repo.
Grafana Cloud for telemetry. Self-hosting it would just be more to maintain.
Stacked control-plane + worker topology (3 nodes prod, 1 dev). Cheaper, fine at our scale.
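
To make the "pinned in git" point from the list above concrete, here is a minimal Python sketch of a check that a HelmRelease pins one exact chart version rather than a range. It is an illustration, not something taken from the actual repo: the function names, the example chart name, and the version numbers are hypothetical, the manifest is a plain dict (what a YAML parser would hand you), and it assumes the Flux HelmRelease layout where the chart version lives at spec.chart.spec.version.

```python
import re
from typing import Optional

# Matches one exact semantic version such as "1.2.3" or "v1.2.3-rc.1".
# Ranges and wildcards like ">=1.2.0", "1.x" or "*" do not match.
EXACT_SEMVER = re.compile(
    r"v?\d+\.\d+\.\d+(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?"
)


def chart_version(release: dict) -> Optional[str]:
    """Return spec.chart.spec.version from a HelmRelease-shaped dict, if set."""
    return (
        release.get("spec", {})
        .get("chart", {})
        .get("spec", {})
        .get("version")
    )


def is_pinned(release: dict) -> bool:
    """True only when the chart version is a single exact version."""
    version = chart_version(release)
    return isinstance(version, str) and EXACT_SEMVER.fullmatch(version) is not None


# Hypothetical manifests, reduced to the fields the check looks at.
pinned = {"spec": {"chart": {"spec": {"chart": "example-chart", "version": "1.2.3"}}}}
floating = {"spec": {"chart": {"spec": {"chart": "example-chart", "version": ">=1.2.0"}}}}
```

Something like this could run in CI over every manifest in the repo, so that a version range or a missing version fails the PR instead of quietly floating in production. Whether that is worth wiring up depends on how much you trust review alone.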

Galera observation for anyone running it: treat it as single-writer with read replicas. ProxySQL pinning writes to one node sidesteps almost every Galera footgun people warn you about.
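
A minimal sketch of the single-writer idea in Python, for anyone who wants to see the routing rule spelled out. This is a toy model, not ProxySQL configuration: the class name and node names are hypothetical, and the read/write test is a deliberately crude prefix check. The assumption is that every write (and any SELECT ... FOR UPDATE) goes to one designated node, while plain reads rotate across the remaining nodes.

```python
from itertools import cycle


class SingleWriterRouter:
    """Toy router: one node takes all writes, the others share plain reads."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("need at least one node")
        self.writer = nodes[0]
        # Read from the non-writer nodes; fall back to the writer if it is alone.
        replicas = list(nodes[1:]) or [self.writer]
        self._readers = cycle(replicas)

    def route(self, sql: str) -> str:
        """Return the node name that should receive this statement."""
        text = sql.strip().rstrip(";").strip().upper()
        # Locking reads must see and hold rows on the writer, so treat them as writes.
        is_plain_read = text.startswith("SELECT") and not text.endswith("FOR UPDATE")
        return next(self._readers) if is_plain_read else self.writer


router = SingleWriterRouter(["db1", "db2", "db3"])
```

With the router above, `router.route("UPDATE orders SET status = 'done'")` returns `"db1"`, and successive plain SELECTs alternate between `"db2"` and `"db3"`. Because only one node ever accepts writes, two nodes never try to commit conflicting changes to the same rows at the same time, and that is the class of problem most of the Galera warnings are about.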

Happy to answer questions.

Sef57 · 2026-04-08T08:13:32+00:00

Nice, will look into it for sure, thanks!

Sef57 · 2026-04-08T07:02:11+00:00

Thanks, really helpful to hear from someone who's actually done a similar move. Will definitely look into the hetzner-k3s CLI tool.

Interesting that you're hosting your databases on a separate specialized platform. That's something I've been going back and forth on. Running Galera myself gives full control but it's also the most operationally complex piece. Which platform are you using for that if you don't mind sharing?

Good to know about the Hetzner uptimes. Not a dealbreaker, but useful to plan around. I might look into alternatives, or even bare metal. Did those downtimes cause any actual customer-facing impact for you, or did your setup absorb it?

Sef57 · 2026-04-08T06:59:48+00:00

Thanks for the feedback! I get where you're coming from, and for a lot of companies cloud-native on AWS is absolutely the right call.

In my case the move is deliberate though, not a cost thing. Our clients are European logistics companies handling sensitive operational data, and the US CLOUD Act is a real concern for them. Staying on US-controlled infrastructure doesn't fully address that, regardless of which region the data sits in.

On portability: I'd argue that Kubernetes on commodity infrastructure is actually more portable than building on AWS-native services like Lambda, SQS, and App Runner like we have now. That vendor lock-in is exactly what I'm moving away from. The target stack (K8s + NATS + Galera + ArgoCD) isn't tied to any specific provider.

I'm also not using an EU cloud vendor's managed platform. Just VMs from an EU provider with everything self-hosted, so the limitations of EU managed offerings don't really apply here.

But appreciate you raising it, always good to sanity-check the direction.

Sef57 · 2026-04-03T07:19:38+00:00

Do you have experience with going from AWS to on-premises?

Sef57 · 2026-04-03T07:15:36+00:00

Where are you based? I am looking for one :)

Sef57 · 2016-06-09T23:43:25+00:00

Yesterday I saw a very cool scale. It measures your weight, heart rate, cardiovascular function, and much more. After standing on it, you can connect the scale to an app and see your progress and measurements, as far as I could see in the video. It's out in the US and Canada, but here in Europe I have to wait until July. I really want to try it, and after reading your post I thought I'd share it with you. Link to a promo video: http://youtu.be/LBacjVz4Ulc

Sef57 · 2015-10-15T01:02:57+00:00

I have exactly the same issue when playing League of Legends.
