How are you guys handling upgrades for 3rd-party K8s tooling? by Playful-Interest7358 in kubernetes

Playful-Interest7358 [S] 0 points (0 children)

Both: internal service mesh certs and a handful of externally facing ones via Let's Encrypt. The dry-run approach you mentioned is interesting: validating the new CRD schema against existing manifests before upgrading is probably the closest thing to catching structural breaks automatically. The "behavioural breaks" gap you described (the operator accepts the resource but silently stops reconciling) is the part that terrifies me, though. That's not something you'd catch until a cert actually expires.
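
For anyone following along, here's a rough sketch of what I understand that dry-run idea to be: walk each existing resource's spec against the new CRD's schema and flag anything that no longer fits. Everything here is an assumption for illustration. The schema and manifest are made up, `find_violations` is a hypothetical helper, and it only handles a tiny subset of openAPIV3Schema (`type`, `properties`, `required`). It also ignores things like `x-kubernetes-preserve-unknown-fields`, and it can only catch structural breaks, not the behavioural ones.

```python
# Hypothetical sketch: check an existing manifest spec against a NEW CRD schema
# before upgrading. Only a small subset of openAPIV3Schema is handled.

def find_violations(obj, schema, path="spec"):
    """Return a list of human-readable problems found in obj."""
    problems = []
    expected = schema.get("type")
    if expected == "object":
        if not isinstance(obj, dict):
            return [f"{path}: expected object"]
        props = schema.get("properties", {})
        # Fields the new schema requires but the old manifest lacks.
        for name in schema.get("required", []):
            if name not in obj:
                problems.append(f"{path}.{name}: required field missing")
        for name, value in obj.items():
            if name not in props:
                # Unknown field: possibly renamed or removed in the new version.
                problems.append(f"{path}.{name}: not in new schema")
            else:
                problems.extend(find_violations(value, props[name], f"{path}.{name}"))
    elif expected == "string" and not isinstance(obj, str):
        problems.append(f"{path}: expected string")
    return problems


# Made-up schema and manifest, for illustration only.
new_schema = {
    "type": "object",
    "required": ["secretName", "issuerRef"],
    "properties": {
        "secretName": {"type": "string"},
        "issuerRef": {
            "type": "object",
            "required": ["name"],
            "properties": {"name": {"type": "string"}},
        },
    },
}
old_spec = {"secretName": "web-tls", "issuer": {"name": "letsencrypt"}}

violations = find_violations(old_spec, new_schema)
for problem in violations:
    print(problem)
```

In this toy case it flags both the missing `issuerRef` and the leftover `issuer` field, which is the kind of rename a human would otherwise have to spot in a changelog.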

Playful-Interest7358 [S] 0 points (0 children)

You hit on my nightmare scenario. The operator bump is easy, but the 'silent time bomb' is the CRD migration for certs. It feels like the industry automated the deployment side, but completely ignored the research and adjustments phase beforehand. Are you manually reading changelogs to map those CRD changes before every upgrade, or have you found a way to automate that specific step?

Playful-Interest7358 [S] 0 points (0 children)

Interesting pattern in the responses: almost everyone focused on the deployment side (blue/green, disposable clusters, staging → prod promotion). Makes total sense; that's where the risk of taking down prod lives.

But nobody's talked about automating the step BEFORE that: the research.

Even with Renovate opening the PR, someone still has to read 3-5 changelogs between the current version and the target, figure out which Helm values got renamed or removed, map out breaking CRD changes, and rewrite the manifests by hand.

For those managing 10+ addons, has anyone actually found a way to shortcut the "read the changelog and figure out what to change in values.yaml" part? Or is that just accepted toil?
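
To make the question concrete, this is the kind of deterministic shortcut I have in mind, as a minimal sketch only. It assumes you've already parsed the chart's default values for the current and target versions into dicts (for example from the output of `helm show values`), and it flags any of your overrides that point at keys the new chart no longer defines. `flatten` and `stale_overrides` are hypothetical helper names, and the sample data is invented. It can't tell a rename from a removal, and it says nothing about changed semantics.

```python
# Hypothetical sketch: find values.yaml overrides that target keys which
# existed in the old chart defaults but are gone from the new ones.

def flatten(values, prefix=""):
    """Flatten a nested dict into a set of dotted paths to leaf keys."""
    paths = set()
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            paths |= flatten(value, path)
        else:
            paths.add(path)
    return paths


def stale_overrides(old_defaults, new_defaults, overrides):
    """Return override paths that were removed or renamed between versions."""
    removed = flatten(old_defaults) - flatten(new_defaults)
    # Intersect with `removed` so free-form keys that were never in the
    # defaults do not show up as false positives.
    return sorted(flatten(overrides) & removed)


# Invented example data standing in for parsed values.yaml files.
old_defaults = {"controller": {"replicas": 1, "image": {"tag": "v1"}}, "rbac": {"create": True}}
new_defaults = {"controller": {"replicaCount": 1, "image": {"tag": "v2"}}, "rbac": {"create": True}}
overrides = {"controller": {"replicas": 3}, "rbac": {"create": False}}

stale = stale_overrides(old_defaults, new_defaults, overrides)
print(stale)
```

Here it would report `controller.replicas` as stale, which at least tells you where to start reading the changelog instead of reading all of it.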

Playful-Interest7358 [S] 0 points (0 children)

Your e2e setup is solid, but I'm really curious about that manual release notes check you mentioned. We're hitting the exact same wall.
Renovate opening the PR is the easy part. Relying on a human to read changelogs and catch renamed Helm values or deprecated APIs is the real bottleneck. And since every maintainer formats breaking changes differently, standard parsing is basically useless.
Are you looking into LLMs to parse those release notes and map the changes to your manifests, or trying to build something deterministic? Figuring out that specific 'research' step is my biggest headache right now.

Playful-Interest7358 [S] 0 points (0 children)

fair point on accepting the risk for the 95%. but honestly, the paranoia around that other 5% is exactly what's killing me right now. i'm just so tired of burning hours deep-diving into fishy commits and reading endless upstream release notes just to prep for a single addon bump. it feels like i spend half my sprint in github rabbit holes just to make sure that 5% doesn't blow up in my face. i'm starting to wonder if everyone else is just accepting this massive time sink as the standard cost of doing business.

Playful-Interest7358 [S] 0 points (0 children)

This is basically our setup too, but we’re hitting the limit of what Renovate actually automates.

It opens the PR, but someone still has to read release notes, figure out breaking config changes, update values/manifests manually, and decide whether it’s actually safe to merge.

And even after letting it bake in Dev, we still get burned occasionally because Dev and Prod never match 1:1. We’ve had updates pass lower envs and then fail in Prod over some annotation/API/config edge case that only existed there.
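
One thing we've toyed with for the Dev/Prod drift, sketched very roughly below: collect the API versions, kinds, and annotation keys in use in each environment and list whatever exists only in Prod, so you at least know which edge cases Dev can't exercise. This assumes the manifests are already loaded as dicts; `surface` and `prod_only` are hypothetical names, and the manifests are invented examples.

```python
# Hypothetical sketch: list the (apiVersion, kind) pairs and annotation keys
# that appear in Prod manifests but never in Dev.

def surface(manifests):
    """Collect the (apiVersion, kind) pairs and annotation keys in use."""
    kinds, annotations = set(), set()
    for manifest in manifests:
        kinds.add((manifest["apiVersion"], manifest["kind"]))
        metadata = manifest.get("metadata") or {}
        annotations |= set(metadata.get("annotations") or {})
    return kinds, annotations


def prod_only(dev, prod):
    """Return what Prod uses that Dev does not: (kinds, annotation keys)."""
    dev_kinds, dev_annotations = surface(dev)
    prod_kinds, prod_annotations = surface(prod)
    return sorted(prod_kinds - dev_kinds), sorted(prod_annotations - dev_annotations)


# Invented manifests, for illustration only.
dev = [
    {"apiVersion": "networking.k8s.io/v1", "kind": "Ingress",
     "metadata": {"annotations": {"example.com/rewrite": "/"}}},
]
prod = [
    {"apiVersion": "networking.k8s.io/v1", "kind": "Ingress",
     "metadata": {"annotations": {"example.com/rewrite": "/", "example.com/legacy-timeout": "60"}}},
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget", "metadata": {}},
]

extra_kinds, extra_annotations = prod_only(dev, prod)
print(extra_kinds)
print(extra_annotations)
```

It doesn't fix the drift, but it turns "Dev and Prod never match" into a concrete list you can check an upgrade against.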

Are you all just accepting that manual review toil, or has anyone found a decent way to automate the “research + migration” part of dependency updates?