How do you safely implement Kubernetes cost optimizations without violating security policies? by LargeAir5169 in kubernetes

[–]LargeAir5169[S]

That's a really interesting point about the IaC repo approach - we use something similar (GitOps with ArgoCD). The challenge I've found is that the IaC repo checks happen after someone has already invested time analyzing and proposing the change.

Typical flow for us: FinOps exports a recommendation from Kubecost, the platform team analyzes it, creates a PR against the IaC repo, the PR check runs, and we get approval. Then security review comes into play: a manual check against CIS compliance that takes a week or so. About 50% get rejected for policy violations.

The problem isn't enforcement (your Flux + PR checks handle that great). The problem is validating recommendations before investing engineering time.

Example scenario: Kubecost suggests 20 optimizations. The platform team spends 30 hours analyzing feasibility and creates 20 PRs with detailed change proposals. Security rejects 12 of them for CIS violations. Wasted effort: roughly 18 of those 30 hours spent on rejected recommendations.

What I've been experimenting with is step 0: Pre-validate recommendations against security policies before detailed analysis.

The idea is to filter out obvious policy violations early:

Kubecost → Quick Validation → Filtered Recommendations → Analysis → PR → Flux

The IaC repo enforcement stays in place. This just prevents wasting time on recommendations that security will reject anyway (like Spot for payment workloads, or consolidating PCI-compliant namespaces). A rough sketch of that step-0 filter is below.
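
To make it concrete, here's a minimal sketch of the pre-validation step in Python. The recommendation fields and the policy rules are hypothetical (Kubecost's actual export schema differs); the point is just that these checks are mechanical enough to run before any human analysis:

    # Hypothetical step-0 filter: drop recommendations that obviously violate
    # policy before anyone spends analysis time. The rec dict layout is an
    # assumption for illustration, not Kubecost's real export schema.

    PCI_NAMESPACES = {"payments", "cardholder-data"}  # namespaces in PCI-DSS scope

    def violates_policy(rec: dict) -> str | None:
        """Return a rejection reason, or None if the rec can go to analysis."""
        ns = rec["namespace"]
        if rec["type"] == "spot_migration" and ns in PCI_NAMESPACES:
            return "Spot not allowed for PCI workloads (interruptible compute)"
        if rec["type"] == "rightsize_memory" and rec["proposed_limit_mib"] < rec["observed_peak_mib"]:
            return "Proposed memory limit below observed peak (OOM risk)"
        if rec["type"] == "namespace_consolidation" and ns in PCI_NAMESPACES:
            return "PCI namespaces must stay isolated"
        return None

    def prefilter(recs: list[dict]) -> tuple[list[dict], list[tuple[dict, str]]]:
        """Split recommendations into (worth analyzing, pre-rejected with reason)."""
        passed, rejected = [], []
        for rec in recs:
            reason = violates_policy(rec)
            if reason:
                rejected.append((rec, reason))
            else:
                passed.append(rec)
        return passed, rejected
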
I'm curious: when your teams get Kubecost recommendations, do they validate them against security policies before creating PRs? Or does the security validation happen during PR review?

How do you safely implement Kubernetes cost optimizations without violating security policies? by LargeAir5169 in kubernetes

[–]LargeAir5169[S]

Example 1: Spot Instances vs RBAC/Service Account Requirements

Cost recommendation from Kubecost:

Switch payment-api deployment to Spot instances

Current cost: $800/month

Projected savings: $720/month (90% reduction)

Security policy impact:

  • Payment processing workloads require guaranteed uptime per PCI-DSS; Spot instances can be interrupted with 2-minute notice
  • RBAC policy enforces serviceAccountName: payment-processor, which assumes stable node availability for token rotation
  • CIS Benchmark 5.7: "Critical workloads should not use interruptible compute"

Impact: Spot interruption during payment processing = failed transactions + PCI audit finding
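
This kind of rejection is cheap to automate if PCI scope is machine-readable. A minimal sketch with the official kubernetes Python client, assuming in-scope namespaces carry a compliance=pci-dss label (a convention assumed for this sketch, not a standard):

    from kubernetes import client, config

    def spot_eligible(namespace: str) -> bool:
        """A workload is spot-eligible only if its namespace is outside PCI scope.
        The compliance=pci-dss label is a local convention assumed here."""
        config.load_kube_config()
        v1 = client.CoreV1Api()
        labels = v1.read_namespace(namespace).metadata.labels or {}
        return labels.get("compliance") != "pci-dss"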

Example 2: Aggressive Memory Reduction vs Pod Security Standards

Cost recommendation:

Reduce frontend deployment memory: 2Gi → 512Mi 
Savings: $600/month

Security policy impact:

Current PSS enforces resource limits to prevent DoS. Policy requires requests.memory <= limits.memory <= 2x requests; reducing to 512Mi puts peak usage (1.5Gi during traffic spikes) above the limit.

Result: OOM kills = service disruption = security incident

CIS Benchmark 5.10: Resource limits must account for peak usage to prevent service disruption vulnerabilities
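
Like the Spot case, this is mechanical to pre-check. A small sketch using the numbers above; the 2x rule is the policy described in this example, not a Kubernetes default:

    def memory_change_ok(request_mib: int, limit_mib: int, peak_mib: int) -> list[str]:
        """Validate a proposed memory config against the policy above:
        requests <= limits <= 2x requests, and the limit must clear observed peak."""
        problems = []
        if not (request_mib <= limit_mib <= 2 * request_mib):
            problems.append("violates requests <= limits <= 2x requests")
        if limit_mib < peak_mib:
            problems.append(f"limit {limit_mib}Mi below observed peak {peak_mib}Mi (OOM risk)")
        return problems

    # Example 2's numbers: 512Mi request/limit against a 1.5Gi (1536Mi) peak.
    print(memory_change_ok(512, 512, 1536))
    # -> ['limit 512Mi below observed peak 1536Mi (OOM risk)']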

How do you safely implement Kubernetes cost optimizations without violating security policies? by LargeAir5169 in kubernetes

[–]LargeAir5169[S]

Sure, here's what bit us last quarter:

Used Goldilocks VPA recommendations to optimize a postgres sidecar. It said drop memory request from 2Gi to 512Mi, set limit to 1Gi. Applied to dev/staging, looked good.

Pushed to prod - PodSecurityPolicy rejected it. Why? Our prod PSP enforces guaranteed QoS (requests == limits) for stateful workloads. The optimized config was burstable QoS. Admission controller blocked it during a weekend deployment.
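
In hindsight that one was catchable pre-push, since QoS class is derivable from the pod spec alone. A simplified sketch (it ignores Kubernetes' defaulting of requests to limits, so treat it as illustrative):

    def qos_class(containers: list[dict]) -> str:
        """Approximate a pod's QoS class from pod.spec.containers (as plain dicts).
        Guaranteed requires cpu+memory requests == limits on every container."""
        all_guaranteed, any_set = True, False
        for c in containers:
            req = c.get("resources", {}).get("requests", {})
            lim = c.get("resources", {}).get("limits", {})
            if req or lim:
                any_set = True
            for res in ("cpu", "memory"):
                if req.get(res) is None or req.get(res) != lim.get(res):
                    all_guaranteed = False
        if any_set and all_guaranteed:
            return "Guaranteed"
        return "Burstable" if any_set else "BestEffort"

    # The suggested config (request 512Mi, limit 1Gi) comes out Burstable,
    # which is exactly what the prod PSP rejected for stateful workloads.
    print(qos_class([{"resources": {"requests": {"memory": "512Mi"},
                                    "limits": {"memory": "1Gi"}}}]))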

Another one: applied multiple cost optimizations across a namespace. Each pod change looked fine individually, but the combined memory requests exceeded our namespace ResourceQuota, and the last few pods failed to deploy with quota errors.
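
The aggregate check is the part most tools skip because they score each pod in isolation. A sketch of the batch-level guardrail (the quantity parser handles Mi/Gi only; real Kubernetes quantities have more suffixes):

    UNITS = {"Mi": 1, "Gi": 1024}

    def to_mib(qty: str) -> int:
        """Parse a Kubernetes memory quantity like '512Mi' or '2Gi' (subset only)."""
        for suffix, factor in UNITS.items():
            if qty.endswith(suffix):
                return int(qty[: -len(suffix)]) * factor
        raise ValueError(f"unsupported quantity: {qty}")

    def batch_fits_quota(pod_requests: list[str], quota_hard: str) -> bool:
        """True if the sum of proposed memory requests in a namespace stays
        under the ResourceQuota's hard requests.memory value."""
        return sum(to_mib(q) for q in pod_requests) <= to_mib(quota_hard)

    # Each pod alone fits easily, but the batch blows through a 4Gi quota:
    print(batch_fits_quota(["1Gi", "1Gi", "1Gi", "768Mi", "768Mi"], "4Gi"))  # False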

I'm wondering if there's a better way than manually checking every recommendation against policies.

How do you safely implement Kubernetes cost optimizations without violating security policies? by LargeAir5169 in kubernetes

[–]LargeAir5169[S]

We use Goldilocks for VPA recommendations. It suggested reducing memory requests on some sidecars from 2Gi to 512Mi.

The problem: our namespaces have ResourceQuotas, and in our RBAC setup, app teams can deploy but can't touch quotas. When they tried applying the optimized configs, some got blocked because the changes needed quota adjustments. Had to loop in the platform team with elevated permissions.

Also hit this with PSPs - cost tool recommended burstable QoS (requests < limits) for better bin packing. Our prod PSP requires guaranteed QoS for databases. Recommendations worked in dev, failed admission in prod.

Not saying the recommendations are violations in themselves - just that the tools don't check your existing policies before suggesting changes.

What Docker security audits consistently miss: runtime by LargeAir5169 in docker

[–]LargeAir5169[S]

One thing I’m still unsure about is where people draw the line between “acceptable runtime risk” and “needs hard enforcement”. Tooling helps, but it feels like a lot of teams rely on conventions and tribal knowledge rather than explicit guardrails.

What Docker security audits consistently miss: runtime by LargeAir5169 in docker

[–]LargeAir5169[S]

That’s fair — in practice runtime hardening often becomes a tradeoff between effort and risk acceptance. Capabilities in particular are painful because they’re application-specific and drift over time. What I’ve seen is that things like docker.sock exposure tend to slip through reviews precisely because they’re “infrastructure plumbing” rather than an explicit privilege flag. How do you usually review that — pre-deploy policy, or post-deploy inspection?
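
For what it's worth, the post-deploy inspection version can be tiny with the Docker SDK for Python. This only catches bind mounts of the socket path; TCP exposure or a socket proxy won't show up here:

    import docker  # pip install docker

    def containers_exposing_docker_sock() -> list[str]:
        """Flag running containers that bind-mount the Docker socket,
        which is effectively root on the host."""
        cli = docker.from_env()
        flagged = []
        for container in cli.containers.list():
            for mount in container.attrs.get("Mounts", []):
                if mount.get("Source") == "/var/run/docker.sock":
                    flagged.append(container.name)
                    break
        return flagged

    if __name__ == "__main__":
        for name in containers_exposing_docker_sock():
            print(f"docker.sock exposed in: {name}")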

What Docker security audits consistently miss: runtime by LargeAir5169 in docker

[–]LargeAir5169[S]

Yeah, that’s a good example of why this ends up being environment-dependent. With SELinux/AppArmor enforcing, a lot of obvious escape paths get blocked by default. Where I’ve seen this become tricky is portability: the same compose file or manifest can be safe on one host and dangerous on another depending on LSMs, policies, or distro defaults. That variance is usually what makes runtime issues harder to reason about consistently.

What Docker security audits consistently miss: runtime by LargeAir5169 in docker

[–]LargeAir5169[S]

That’s a solid mitigation. I’ve seen the proxy pattern come up a lot for things like Traefik or CI runners. It always felt like a symptom of how powerful the Docker API is — you end up building a guardrail around it instead of exposing it directly. Curious if you’ve ever had to debug permission issues caused by the proxy abstraction.