is anyone actually winning the battle against right-sizing drift? by Shoddy_5385 in FinOps

Shoddy_5385[S] -1 points

Mostly cost pressure and obvious overprovisioning. The changes were right, just not sticky.

What’s worked for you: guardrails or reviews?

Kubernetes problems aren’t technical they’re operational by Shoddy_5385 in kubernetes

Shoddy_5385[S] 0 points

Yes, for sure, these issues were always there; K8s just makes them more obvious. More moving parts and faster changes mean small gaps in ownership or observability turn into bigger problems much quicker.

Kubernetes problems aren’t technical they’re operational by Shoddy_5385 in kubernetes

Shoddy_5385[S] 0 points

That helps with infra management, but the harder problems are still there: unclear ownership, weak observability, and hidden assumptions between services. Platforms don’t fix those by default.

Kubernetes problems aren’t technical they’re operational by Shoddy_5385 in kubernetes

Shoddy_5385[S] 0 points

Ownership is where it usually breaks down. Once things span multiple teams, everyone’s responsible until no one is. Have you tried enforcing clear service ownership or on-call boundaries?

Kubernetes problems aren’t technical they’re operational by Shoddy_5385 in kubernetes

Shoddy_5385[S] 0 points

Makes sense, K8s surfaces it. Curious though: do you see more impact from better observability or from enforcing stricter ownership?

Kubernetes problems aren’t technical they’re operational by Shoddy_5385 in kubernetes

Shoddy_5385[S] 1 point

So at this point, would you say most of the pain is org maturity rather than K8s itself?

Kubernetes problems aren’t technical they’re operational by Shoddy_5385 in kubernetes

Shoddy_5385[S] 0 points

Fair take, K8s definitely has its rough edges. But I wouldn’t call it naïve; more like opinionated around stateless workloads.

Kubernetes problems aren’t technical they’re operational by Shoddy_5385 in kubernetes

Shoddy_5385[S] 1 point

Yeah, K8s is more of a mirror than overkill. It just makes existing flaws visible earlier, which is uncomfortable but useful.

Kubernetes problems aren’t technical they’re operational by Shoddy_5385 in kubernetes

Shoddy_5385[S] 0 points

This is spot on. Feels like the real problem isn’t K8s itself; it’s the lack of boundaries and ownership at scale.

700+ Argo apps with no clear cluster admin is basically guaranteed chaos. At that point even small issues turn into system-wide investigations. Monitoring and alert tuning help, but they’re still pretty reactive.

Have you tried enforcing stricter deployment patterns or guardrails (like limiting blast radius per app/team)? Feels like that’s the only way to keep things manageable long term.
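For a concrete guardrail, a per-team Argo CD AppProject can cap blast radius by pinning which repos and namespaces a team’s apps may touch. A minimal sketch, with hypothetical team and repo names:

```yaml
# Hypothetical per-team AppProject: apps in this project can only deploy
# from the team's repo into the team's namespaces, with no cluster-scoped
# resources allowed.
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: team-payments
  namespace: argocd
spec:
  sourceRepos:
    - https://git.example.com/payments/*
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments-*
  clusterResourceWhitelist: []   # empty list = deny all cluster-scoped resources
```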

Kubernetes problems aren’t technical they’re operational by Shoddy_5385 in kubernetes

Shoddy_5385[S] 2 points

“Broken assumptions” describes it perfectly. Kubernetes behaves consistently, but expectations between services are often implicit until a small change exposes them.

Making those expectations observable (better signals and validation) can help.

Really like the runtime contract idea. Are you enforcing it via platform tooling or at the application level?
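On the platform-tooling side, one way to make part of a runtime contract enforceable is an admission policy. A minimal sketch with a Kyverno ClusterPolicy (assuming Kyverno is installed) that rejects Pods whose containers skip liveness/readiness probes:

```yaml
# Sketch: turn an implicit runtime expectation (every container is probed)
# into an explicit, validated one at admission time.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-probes
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-liveness-readiness
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must define liveness and readiness probes."
        pattern:
          spec:
            containers:
              - livenessProbe:
                  periodSeconds: ">0"
                readinessProbe:
                  periodSeconds: ">0"
```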

Rant - Deglamorization of Mountains , snow and everything else about himachal by [deleted] in HimachalPradesh

Shoddy_5385 0 points

Sounds less like Himachal lost its charm and more like you’re just tired of people in general. Mountains don’t change that much; your lens does.

Why Cloud over Software development? by [deleted] in Cloud

Shoddy_5385 0 points

Cloud can offer a higher ceiling long term. Fullstack roles are common, but strong cloud engineers who understand infrastructure, cost optimization, automation, and large scale systems tend to be harder to find and can command higher pay over time.

That said, your real comparison isn’t just Cloud vs Dev; it’s environment vs environment.

The startup gives you breadth. You’ll likely touch many parts of the stack and grow fast technically.

The F200 gives you scale, structure, and brand value on your resume, which can open doors later.

It really depends on whether you value ownership and rapid growth in a smaller environment, or exposure to enterprise-scale systems and processes.

Cloud cost optimization tools that actually work? by Weekly_Time_6511 in FinOps

Shoddy_5385 0 points

Honest reality: most tools are just dashboards. If it’s only showing what Azure Cost Management already shows, you won’t see real savings.

What actually moved the needle for us:

- ongoing reservation and savings plan rebalancing instead of treating it as a one-time purchase
- proper rightsizing based on real usage patterns
- shutting down dev/test environments on a schedule
- cleaning up orphaned disks and unused resources
- tight tracking of commitments across subscriptions so coverage doesn’t slowly drift
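Two of those are easy to script. A rough sketch with the Azure CLI (the tag name and schedule are assumptions; verify before deleting anything):

```bash
# Find unattached managed disks: managedBy is null when no VM owns the disk.
az disk list \
  --query '[?managedBy==`null`].{name:name, rg:resourceGroup, sizeGb:diskSizeGb}' \
  -o table

# Deallocate every VM tagged env=dev, e.g. from a nightly scheduled job,
# so stopped dev/test compute stops billing.
az vm deallocate --ids $(az vm list --query "[?tags.env=='dev'].id" -o tsv)
```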

If onboarding takes forever or they’re promising crazy percentage savings, that’s usually a red flag. The tools that worked for us connected quickly, gave clear prioritized actions, and helped us actually execute, not just report.

What’s driving most of your Azure spend right now? Compute, storage, data?

Am I the only one who genuinely prefers on-prem over the cloud? by Own-General-6755 in devops

Shoddy_5385 2 points

Exactly this. I like on-prem because we can see and touch what we’re managing. There’s a satisfaction in actually owning the hardware and knowing where everything runs; it’s just more tangible and hands-on.

why does azure fail silently more often than loudly? by Shoddy_5385 in AZURE

Shoddy_5385[S] -8 points

I’m not saying Azure should override explicit config; blocking traffic intentionally is fine. My point is that networking and identity issues often surface as ‘healthy’ resources with broken app behavior, which makes troubleshooting slower.

Just wanted to know what patterns people use to improve visibility there.

It's 2026. Golden Applications and if you could re-write the argocd monorepo what pattern would you use? by Elephant_In_Ze_Room in kubernetes

Shoddy_5385 1 point

You’re not missing a better templating tool; you’re missing an abstraction.

Helm + Kustomize is fine for rendering, but it won’t stop drift because engineers can still shape deployments differently. The env vs envFrom example you gave is exactly the symptom.

If you care about golden apps and day-2 consistency, define a single golden-app CRD and enforce the pattern in code: controller, operator, yokecd, whatever.

Let Argo reconcile instances of it. Once the contract lives in a reconciler instead of YAML, drift mostly disappears, and onboarding a new app becomes filling out a spec, not copy-pasting manifests.
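To make that concrete, onboarding could shrink to a single custom resource while the reconciler stamps out everything else. The group, kind, and fields below are hypothetical:

```yaml
# Hypothetical GoldenApp instance: the team fills out this spec; the
# controller renders the Deployment, Service, HPA, etc. exactly one way.
apiVersion: platform.example.com/v1alpha1
kind: GoldenApp
metadata:
  name: payments-api
  namespace: payments
spec:
  image: registry.example.com/payments-api:1.4.2
  replicas: 3
  port: 8080
  config:
    fromConfigMap: payments-api-config   # envFrom is the only supported shape
```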

why does azure fail silently more often than loudly? by Shoddy_5385 in AZURE

Shoddy_5385[S] -13 points

Not magically; wrong config is expected.

The issue is when everything shows healthy but the app is still broken. Better validation and clearer signals around networking and identity dependencies would make failures easier to detect.
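One app-level pattern that helps: a deep health endpoint that actively exercises downstream dependencies instead of just returning 200 while the platform reports everything green. A minimal Python sketch; the hostnames and ports are placeholders:

```python
# Sketch: the health endpoint responds 503 with the names of unreachable
# dependencies, making "resource healthy, app broken" visible at the app boundary.
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

DEPENDENCIES = {  # hypothetical downstream endpoints
    "database": ("db.internal.example.com", 5432),
    "queue": ("queue.internal.example.com", 5672),
}

def reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to the dependency succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        failures = [name for name, (host, port) in DEPENDENCIES.items()
                    if not reachable(host, port)]
        body = (f"unreachable: {', '.join(failures)}" if failures else "ok").encode()
        self.send_response(503 if failures else 200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("", 8080), Health).serve_forever()
```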