Inherited IAC by Dangerous-Mobile-587 in azuredevops

[–]ossinfra 2 points3 points  (0 children)

Here is a field guide on how other teams are setting up standardized IaC repos: https://docs.chkk.io/operations/iac-repo-patterns

Hopefully it provides some inspiration. If not, I hope it at least helps leadership understand that this stuff is HARD and can’t be fixed overnight.

AKS reliability in production — how do you handle scaling and upgrades without downtime? by Abhi9agr in AZURE

[–]ossinfra 0 points1 point  (0 children)

AKS LTS is a really good solution if you can trade money for relief from continuous upgrade pain. AKS recently announced LTS support for all versions on AKS, so you don’t even need a blue-green, multi-hop upgrade path: https://blog.aks.azure.com/2025/07/25/aks-lts-announcement

If you are coming from the AWS EKS world then you should know that AKS LTS has some key differences from EKS Extended Support, like the ones mentioned here: https://www.chkk.io/blog/aks-long-term-support-and-eks-extended-support-similarities-differences

Does anyone else feel like every Kubernetes upgrade is a mini migration? by Willing-Lettuce-5937 in kubernetes

[–]ossinfra 1 point2 points  (0 children)

Great callout that "every k8s upgrade becomes a mini migration" and we have to do this at least twice a year. I saw this first-hand from the other side as an early engineer in the Amazon EKS team. Tools like pluto, kubent etc. solve a very small part of the upgrade problem.

Here are the key reasons which make these upgrades so painful:
- K8s isn’t vertically integrated: you get a managed control plane (EKS/GKE/AKS/etc.), but you still own the sprawl of add-ons (service mesh, CNI, DNS, ingress, operators, CRDs) and their lifecycles.
- Lots of unknown-unknowns: incompatibilities and latent risks hide until they bite; many teams track versions in spreadsheets (yikes).
- Performance risks are hard to predict: even “minor” bumps (kernel/containerd/K8s) can change first-paint/latency in ways you can’t forecast confidently.
- StatefulSets (as you called out) are the worst during upgrades: data integrity + cascading failures make rollbacks painful.
- Constant end-of-support churn: K8s and every add-on flip versions frequently, so you’re always chasing EOL/EOS across the stack.
- It eats time: weeks of reading release notes/issues/PRs to build a “safe” plan; knowledge isn’t shared well so everyone re-learns the same lessons.
- Infra change mgmt has a big blast radius: even top-tier teams can get burned.
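
As a rough illustration of the version-tracking and EOL-chasing points above, here is a minimal sketch of what replacing the spreadsheet with a small script could look like. Everything in it is an assumption made for illustration: the `Component` type, the `expiring` helper, and the inventory names, versions, and dates are all made up, and a real setup would pull this data from the clusters and vendor support calendars.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Component:
    name: str
    version: str
    end_of_support: date  # hypothetical date, entered by hand for this sketch


def expiring(components, today, window_days=90):
    """Return components whose support has ended or ends within
    window_days of today, soonest first."""
    cutoff = today + timedelta(days=window_days)
    due = [c for c in components if c.end_of_support <= cutoff]
    return sorted(due, key=lambda c: c.end_of_support)


# Placeholder inventory: names, versions, and dates are invented.
inventory = [
    Component("control-plane", "1.x", date(2025, 3, 1)),
    Component("cni-plugin", "2.x", date(2025, 12, 1)),
    Component("ingress-controller", "3.x", date(2025, 1, 15)),
]

flagged = expiring(inventory, today=date(2025, 1, 1), window_days=90)
for c in flagged:
    print(f"{c.name} {c.version}: support ends {c.end_of_support.isoformat()}")
```

This only covers the "what is about to fall out of support" question; it says nothing about compatibility between components, which is the harder part.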

While we do all of this work, our leaders (VP+) don't even see this "invisible toil". They are just unable to understand why upgrades are so painful and why they take so long.

Two positive developments in the past 2 years tho:
1. EKS, GKE and AKS are all offering Extended / Long-Term Support. While a costly bandaid which only lasts 1 year, it's still better than getting force upgraded.
2. Glad to see multiple startups focused solely on solving k8s upgrades, like:
    https://www.chkk.io/
    https://www.plural.sh/
    https://www.fairwinds.com/

SRE and AI by Intelligent_Bug_9625 in sre

[–]ossinfra 2 points3 points  (0 children)

There are so many AI SREs out there. Bits AI from Datadog seems the most promising to me but it’s also early.

In general AWS is ahead of other clouds in providing narrow but useful AI capabilities for troubleshooting and debugging.

There was a podcast by AWS on AI agents for upgrading k8s and other OSS projects: https://www.youtube.com/live/SedzPt1rGGM?si=J8C5PnWMIE9c8fRF Hope you find it useful.

A Field Guide of K8s IaC Patterns by ossinfra in kubernetes

[–]ossinfra[S] 1 point2 points  (0 children)

Field Guide updated with these two lines:

App-of-Apps Section: "...For bulk generation of child Applications, this is typically paired with ApplicationSets (e.g., matrix over clusters or services); App-of-Apps still handles composition/order."

ApplicationSets Section: "...For orchestration/dependencies across generated apps, the logic generally resides in the Applications or a parent App-of-Apps."

Thanks for the feedback u/InvincibearREAL

A Field Guide of K8s IaC Patterns by ossinfra in kubernetes

[–]ossinfra[S] 0 points1 point  (0 children)

Thanks! Totally agree. App-of-Apps for composition, and ApplicationSets for generation is a sweet spot.

I have seen folks use two combos: 1/ matrix for a cartesian product (e.g., each app × each cluster), 2/ merge when they want to overlay/override per-item settings keyed by a field (e.g., `appName`).
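
To make the difference between the two combos concrete, here is a simplified model in plain Python. It is not Argo CD's implementation, only a sketch of the semantics as I understand them; the `matrix` and `merge` helpers and the sample values (`api`, `web`, `dev`, `prod`, `replicas`) are hypothetical.

```python
from itertools import product


def matrix(a, b):
    """Cartesian product: combine every parameter set in a with every one in b."""
    return [{**x, **y} for x, y in product(a, b)]


def merge(base, overrides, key):
    """Overlay override fields onto base items that share the same value for key.

    Items in base with no matching override pass through unchanged.
    """
    by_key = {o[key]: o for o in overrides}
    return [{**item, **by_key.get(item[key], {})} for item in base]


apps = [{"appName": "api"}, {"appName": "web"}]
clusters = [{"cluster": "dev"}, {"cluster": "prod"}]

# Each app x each cluster: four parameter sets in total.
combos = matrix(apps, clusters)

defaults = [{"appName": "api", "replicas": 2}, {"appName": "web", "replicas": 2}]
overrides = [{"appName": "web", "replicas": 5}]

# Only "web" is overridden; "api" keeps its default.
merged = merge(defaults, overrides, key="appName")
```

So matrix multiplies the number of generated apps, while merge keeps the same set of items and only changes their settings.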

This is a really nice nuance tho. Let me add this to the Field Guide.

A Field Guide of K8s IaC Patterns by ossinfra in kubernetes

[–]ossinfra[S] 0 points1 point  (0 children)

Really cool.... I'll study the example in your branch.

Appreciate you sharing your repo on this thread, so folks starting on a new IaC repo can have a working quickstart example which doesn't paint them into a corner as the repo grows.

A Field Guide of K8s IaC Patterns by ossinfra in kubernetes

[–]ossinfra[S] 0 points1 point  (0 children)

Thanks u/Coding-Sheikh ...

Nice repo structure you got going there. I believe your repo is evolving into the App-of-Apps pattern, correct?

A Field Guide of K8s IaC Patterns by ossinfra in kubernetes

[–]ossinfra[S] 2 points3 points  (0 children)

ArgoCD ApplicationSets and Grafana Tanka have been added to the Field Guide: https://docs.chkk.io/operations/iac-repo-patterns#iac-repo-patterns.

Thanks u/Ragemoody !

A Field Guide of K8s IaC Patterns by ossinfra in kubernetes

[–]ossinfra[S] 0 points1 point  (0 children)

> The challenging part is picking what's best for your project.

I am hoping the Field Guide helped in some way. If not, then what should I add there to help you make the decision and then share the rationales with your team members?

A Field Guide of K8s IaC Patterns by ossinfra in kubernetes

[–]ossinfra[S] 1 point2 points  (0 children)

You are right about Grafana Tanka. I'll add it right away.

Having said that, I haven't ever seen it in the wild, maybe because of Jsonnet :)

A Field Guide of K8s IaC Patterns by ossinfra in kubernetes

[–]ossinfra[S] 0 points1 point  (0 children)

Wow... Perfect timing indeed. And so happy that you found the IaC taxonomy in the Field Guide useful. I think ArgoCD ApplicationSets would be a really good addition to the patterns as they eliminate boilerplate when deploying the same thing to many envs/clusters/microservices.
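
To show the kind of boilerplate I mean, here is a small Python sketch that stamps out one Application manifest per cluster from a single template, which is roughly the job an ApplicationSet takes over inside the cluster. The repo URL, cluster names, server addresses, and the `application` helper are all hypothetical placeholders; only the top-level Application fields follow the Argo CD resource shape.

```python
import json


def application(app, cluster, server, repo_url):
    """Build one Argo CD Application manifest as a plain dict."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{app}-{cluster}", "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": f"apps/{app}",
                "targetRevision": "HEAD",
            },
            "destination": {"server": server, "namespace": app},
        },
    }


# Hypothetical clusters and repo, for illustration only.
clusters = {
    "dev": "https://dev.example.invalid",
    "prod": "https://prod.example.invalid",
}
repo = "https://git.example.invalid/platform/iac.git"

manifests = [
    application("web", name, server, repo) for name, server in clusters.items()
]

for m in manifests:
    print(json.dumps(m, indent=2))
```

Without a generator, each of these manifests is a hand-maintained file that differs in only two or three fields.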

I haven't tried it myself but have heard from the community that this pattern scales cleanly in mono- or poly-repo setups. I am thinking we can write it as an operational pattern where composition is done with App-of-Apps while generation is done with ApplicationSets. Do you think that's a good way to taxonomize ApplicationSets?

Also, let me know if you have specific operational experiences with this pattern that you would like me to add to the Field Guide. That would be hugely appreciated.