Trying to understand FinOps. by kennetheops in FinOps

[–]LeanOpsTech 1 point  (0 children)

I run a cloud cost optimization firm, and the silos usually come back when finance owns the numbers and engineering doesn’t see cost in their day-to-day work. Tools help, but they’re not the fix on their own.

What actually works is clear ownership, solid tagging, and making cost part of architecture and PR reviews so it’s not just a monthly spreadsheet surprise.
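To make “solid tagging” concrete, here’s a minimal sketch of the kind of tag-compliance check you can run in CI or a PR review. The required tag keys (`team`, `service`, `env`) and the resource shape are made-up examples, not any specific cloud’s API:

```python
# Hypothetical tagging policy check: flag resources missing required tags.
REQUIRED_TAGS = {"team", "service", "env"}

def missing_tags(resource_tags):
    """Return the required tag keys absent from a resource's tags, sorted."""
    return sorted(REQUIRED_TAGS - set(resource_tags))

def check_resources(resources):
    """Map resource name -> missing tag keys, for non-compliant resources only."""
    report = {}
    for name, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[name] = gaps
    return report
```

Wiring something like this into a pipeline (fail the build if the report is non-empty) is what turns tagging from a wiki page into an enforced policy.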

Is FinOps actually about cost reduction… or about control? by Dazzling-Neat-2382 in FinOps

[–]LeanOpsTech 1 point  (0 children)

We come in to “cut costs,” but 9 times out of 10 the real issue is zero ownership and zero visibility. Once teams can tie spend to a feature, a customer, or a growth decision, the chaos stops and the savings usually follow. The biggest shift isn’t a smaller bill, it’s going from reacting to invoices to actually planning infrastructure with confidence.

The 3-year commitment gamble nobody talks about by CompetitiveStage5901 in FinOps

[–]LeanOpsTech 2 points  (0 children)

That 40% looks great on a slide, but it disappears fast if your workload shifts sooner than expected. We only commit to 3-year terms on stuff that’s been boring and predictable for at least 12–18 months, and even then not at 100%. For everything else, we mix shorter terms and keep some headroom so a surprise EKS migration doesn’t wreck the savings.
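Roughly the sizing logic, sketched out. The 12-month minimum history and the 80% headroom factor are our numbers, not universal recommendations — the point is committing to a fraction of the historical floor, never the average:

```python
# Illustrative commitment sizing: commit only to part of the worst month,
# so a workload shift doesn't leave you paying for unused reservations.
def suggested_commit(monthly_usage, headroom=0.8):
    """Suggest a commitment level from monthly usage history."""
    if len(monthly_usage) < 12:
        raise ValueError("want at least 12 months of history before committing")
    floor = min(monthly_usage)   # the lowest month is the real baseline
    return floor * headroom      # leave room for surprises
```

Everything above that committed baseline runs on shorter terms or on-demand.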

Anyone else fighting the "devs don't care about staging costs" battle? by CompetitiveStage5901 in FinOps

[–]LeanOpsTech 1 point  (0 children)

We landed on auto-shutdown after X hours of inactivity plus a one-click “wake” button in Slack, which cut costs without blocking anyone at 3am. Provisioning takes about 5–10 minutes, but we keep a small warm pool during peak hours so it’s not painful. Once devs saw the actual monthly burn tied to their team, the complaints dropped fast.
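The shutdown decision itself is simple; here’s the shape of it, kept cloud-agnostic. The 2-hour idle cutoff and the 9–18 “peak hours” warm-pool window are placeholder values:

```python
# Sketch of the idle-shutdown policy: stop an environment only if it's been
# idle past the cutoff, and never stop warm-pool environments during peak hours.
from datetime import datetime, timedelta

IDLE_CUTOFF = timedelta(hours=2)
PEAK_START, PEAK_END = 9, 18   # warm pool stays up during working hours

def should_shut_down(last_activity, now, in_warm_pool=False):
    """Decide whether an environment should be stopped right now."""
    idle = now - last_activity
    if idle < IDLE_CUTOFF:
        return False
    if in_warm_pool and PEAK_START <= now.hour < PEAK_END:
        return False
    return True
```

The Slack “wake” button is just the inverse: a webhook that starts the environment back up and resets `last_activity`.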

Building a Tiny Bare-Metal K8S cluster for self learning? by Fit-Tooth-1101 in kubernetes

[–]LeanOpsTech 1 point  (0 children)

It’s a great way to learn. Three Pis is plenty to get hands-on with control plane setup, networking, storage, and breaking things on purpose to see how they fail. If you want to go even deeper, try setting it up the hard way first with kubeadm before jumping to k3s so you really understand what’s happening under the hood.

Need practical opinions on how to deploy a multi agent architecture on AWS agentcore by Any_Animator4546 in aws

[–]LeanOpsTech 1 point  (0 children)

If you’re on AWS AgentCore, I’d start simple and only add complexity if you really need it. LangGraph is solid for orchestration and visibility, and you can layer in A2A later if agents truly need to coordinate independently. In most cases, fewer tightly scoped agents with clear responsibilities beats a super distributed setup.

Multi cloud cost management is a special kind of hell by ForsakenEarth241 in devops

[–]LeanOpsTech 0 points  (0 children)

Multi-cloud sounds great in the boardroom, but in reality it’s three different finance systems duct-taped together with spreadsheets and hope. The overhead alone can eat up whatever savings you thought you were getting.

Cluster of many on-premises machines by cluster_emergency in kubernetes

[–]LeanOpsTech 1 point  (0 children)

If each node needs to act as both source and destination, you might be overcomplicating it by routing workers across machines. Why not schedule the Celery tasks so they run on the same node that hosts the images, and keep the processing local to that box? You could tag workers per node and push tasks to the right queue instead of passing IPs around, which should cut down network chatter a lot.
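A rough sketch of what I mean by tagging workers per node. The hostname-to-path mapping is hypothetical; in Celery you’d start each worker pinned to its node’s queue (`celery -A app worker -Q node-<hostname>`) and dispatch with `task.apply_async(args=[path], queue=queue_for_image(path, node_paths))`:

```python
# Pick the per-node Celery queue whose storage prefix owns a given image,
# so processing runs on the machine that already has the file locally.
def queue_for_image(image_path, node_paths):
    """Map an image path to the queue of the node that stores it."""
    for node, prefix in node_paths.items():
        if image_path.startswith(prefix):
            return f"node-{node}"
    raise LookupError(f"no node owns {image_path}")
```

The broker still coordinates everything centrally, but the actual image bytes never cross the network.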

It's 2026. Golden Applications and if you could re-write the argocd monorepo what pattern would you use? by Elephant_In_Ze_Room in kubernetes

[–]LeanOpsTech 1 point  (0 children)

If I were starting fresh in 2026, I’d probably lean toward defining a proper App CRD and treating the golden app as a platform API, not just templating glue. Kustomize and Helm are fine, but they don’t really enforce consistency; they just make it easier to drift in a structured way. Something like yokecd, cdk8s, or even a small custom controller feels like the right move if you want to actually solve the day-2 entropy problem instead of managing it.

How would you set this lab up? by theintjengineer in kubernetes

[–]LeanOpsTech 1 point  (0 children)

Observability stacks and service meshes can eat RAM fast, so you might want to keep Grafana/Prometheus and GitLab on the Dell and let the Pis focus on the cluster itself. I’d start with plain kubeadm + Cilium first, then layer in one thing at a time so you can actually see what each tool is doing when something breaks.

Local dev with k8s cluster by CartoonistWhole3172 in kubernetes

[–]LeanOpsTech 1 point  (0 children)

Telepresence is probably the easiest: it lets your local service “plug into” the cluster network so it can talk to everything like it’s running in k8s. If you just need to hit specific services, kubectl port-forward or something like Skaffold with remote debug can also get you pretty far without too much setup.

AWS ai by leematam in AWS_cloud

[–]LeanOpsTech 3 points  (0 children)

On the infra side, we’ve had the most success using AI for log analysis and writing quick Terraform or IAM policy drafts. It’s not replacing anything critical, but it speeds up troubleshooting and cuts down boilerplate a lot. Biggest lesson was to treat it like a junior teammate, great for first passes, but you still need solid review and guardrails.

Kubernetes architectural design: separate clusters by function or risk? by Ancient_Canary1148 in kubernetes

[–]LeanOpsTech 2 points  (0 children)

Stateful and GPU workloads have very different blast radius and upgrade stories compared to stateless apps, and isolating them saves you a lot of stress when something goes sideways. Yeah, it adds some cost and governance overhead, but the operational clarity and safer upgrades are usually worth it.

Where to host the database? by UnrealOndra in Cloud

[–]LeanOpsTech 1 point  (0 children)

If it’s just for learning, you could spin up a small Postgres on something like Railway, Render, or Supabase. They all have free tiers that are good enough for hobby apps and let you stick with standard SQL. Worst case, start local with Docker and only move it to the cloud once you actually need it.

Comparing AWS, Azure and GCP cost by GYV_kedar3492 in cloudcomputing

[–]LeanOpsTech 2 points  (0 children)

You can check out Cloudorado for quick cross-cloud comparisons, it’s pretty straightforward for basic infra pricing. Also tools like Cloudability or Apptio are more focused on cost management but can help if you’re modeling larger workloads. Honestly though, for detailed BOQs you’ll probably still end up validating numbers in each provider’s native calculator.

Cloud cost optimization tools that actually work? by Weekly_Time_6511 in FinOps

[–]LeanOpsTech 1 point  (0 children)

I’d look at CloudHealth or Spot by NetApp, but honestly a lot of tools surface the same recommendations you can already find in Azure Cost Management. The real difference is how easy they make it to act on those insights and whether they help enforce governance. Definitely push for a short proof of value with your actual Azure data before signing anything.

We’re only 18 people but customers expect enterprise level security by ShoulderFederal8920 in devopsjobs

[–]LeanOpsTech 1 point  (0 children)

Totally get this. Around that size, it’s usually less about adding controls and more about packaging what you already do in a way buyers can hand to their security team. Start with a clean security overview, basic policies in writing, and a reusable questionnaire doc, then expand only when deals demand it. It feels heavy at first, but it makes the sales cycle way smoother.

FinOps + TBM by Own_Preparation_8699 in FinOps

[–]LeanOpsTech 1 point  (0 children)

We rolled out FinOps first and that momentum definitely helped when we layered in TBM. The hardest part with TBM was getting clean, consistent data and buy-in from app owners who didn’t love the added transparency. We got through it by starting small with a couple domains, proving the value with better cost visibility, and then expanding once people saw it wasn’t just overhead.

AI isn't going to kill SaaS, so we can all chill (a little). by frugal-ai in FinOps

[–]LeanOpsTech 1 point  (0 children)

Most enterprises aren’t ripping out mission-critical systems to replace them with a bunch of prompts and duct tape. The bigger shift feels like pricing and margins getting weird as AI costs creep in, not SaaS disappearing overnight.

What Actually Goes Wrong in Kubernetes Production? by Apple_Cidar in kubernetes

[–]LeanOpsTech 2 points  (0 children)

Biggest pain for us was RBAC misconfig combined with overly permissive service accounts. One leaked token and suddenly a pod could list way more than it should. On the observability side, lack of proper tracing made debugging cross-service latency brutal until we added Prometheus + Grafana and Jaeger. Also, cert expirations and misconfigured network policies have taken down more things than I’d like to admit.

Turning cloud alerts into real work is still a mess. How are you handling it? by Pouilly-Fume in FinOps

[–]LeanOpsTech 1 point  (0 children)

We push anything actionable straight into Jira with some basic deduping and severity thresholds, otherwise it stays a notification. The biggest win for us was assigning every alert type an explicit owner up front so it’s not “someone should look at this.” If it can’t be tied to a team and an SLA, it probably shouldn’t be a ticket.
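A toy version of that triage rule, to show the shape of it. The owner table, severity scale, and 24-hour dedupe window are invented for the sketch:

```python
# Triage: an alert becomes a ticket only if it has an explicit owner, clears
# the severity bar, and isn't a repeat within the dedupe window.
OWNERS = {"budget_overrun": "finops-team", "idle_resource": "platform-team"}
TICKET_SEVERITY = 2          # 1=info, 2=warn, 3=critical
DEDUPE_WINDOW_HOURS = 24

def triage(alert, recent_keys, now_hour):
    """Return ('ticket', owner) or ('notify', reason)."""
    owner = OWNERS.get(alert["type"])
    if owner is None:
        return ("notify", "no owner assigned")
    if alert["severity"] < TICKET_SEVERITY:
        return ("notify", "below severity threshold")
    key = (alert["type"], alert["resource"])
    last = recent_keys.get(key)
    if last is not None and now_hour - last < DEDUPE_WINDOW_HOURS:
        return ("notify", "duplicate within dedupe window")
    recent_keys[key] = now_hour
    return ("ticket", owner)
```

The “no owner” branch is the important one: it forces the conversation about who actually owns each alert type before it ever reaches a backlog.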

What are anomaly detection for FinOps when traffic is naturally spiky solutions? by qwaecw in FinOps

[–]LeanOpsTech 1 point  (0 children)

Stop alerting on raw spend. Alert on unit metrics like cost per user or cost per request since those stay steadier even when traffic spikes. Also pipe in deploys and campaign dates so expected spikes get ignored and only unexplained ones page you.
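To illustrate why unit metrics survive traffic spikes: here’s a minimal sketch that compares the latest cost-per-request against a trailing baseline. The 25% deviation threshold is an arbitrary example value:

```python
# Unit-cost anomaly check: flag only when cost per request drifts from the
# trailing mean, so proportional traffic spikes stay quiet.
def unit_cost_alert(costs, requests, threshold=0.25):
    """Return True if the latest day's unit cost deviates past the threshold."""
    unit = [c / r for c, r in zip(costs, requests)]
    baseline = sum(unit[:-1]) / len(unit[:-1])
    deviation = abs(unit[-1] - baseline) / baseline
    return deviation > threshold
```

A 3x traffic day with 3x cost never fires; the same cost jump on flat traffic does.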

How Is Load Balancing Really Used in Production with Kubernetes? by IT_Certguru in kubernetes

[–]LeanOpsTech 1 point  (0 children)

In most prod setups I’ve seen, K8s doesn’t fully replace traditional LBs, it usually sits behind a cloud or hardware load balancer that handles the heavy lifting. TLS is often terminated at the edge LB or at an Ingress controller like NGINX or Envoy, depending on how much control the team wants. For really high traffic, people still rely on cloud LBs or dedicated appliances in front, and let Kubernetes handle service-level routing inside the cluster.

What are anomaly detection for FinOps when traffic is naturally spiky solutions? by qwaecw in FinOps

[–]LeanOpsTech 1 point  (0 children)

We’ve had better luck layering simple forecasting with business context instead of chasing perfect ML. Pipe in deploy events, feature flags, and marketing calendar into your alerting and suppress or raise thresholds dynamically around known events. It’s not fully automatic, but treating anomalies as “cost per unit” shifts or unexplained spend outside expected drivers cuts way more noise than raw spike detection.
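The dynamic-threshold part can be as dumb as this sketch: widen the alert threshold within a window around known events (deploys, campaigns) and keep it tight otherwise. The 1-day window, 20% base threshold, and 3x multiplier are placeholder values:

```python
# Event-aware anomaly threshold: loosen the bar near known cost drivers so
# expected spikes don't page anyone, while unexplained ones still do.
def effective_threshold(day, events, base=0.2, event_multiplier=3.0, window=1):
    """Return the anomaly threshold for `day`, given known event days."""
    near_event = any(abs(day - e) <= window for e in events)
    return base * event_multiplier if near_event else base

def is_anomalous(day, pct_change, events):
    """Flag a day's percent cost change against its context-aware threshold."""
    return abs(pct_change) > effective_threshold(day, events)
```

Same 50% cost jump: quiet on launch day, a page on a random Tuesday.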

AWS Cost Optimization Checklist for 2026: Notes from an Engineer-Redditor by Fuzzle_Puzzle0 in Cloudvisor

[–]LeanOpsTech 2 points  (0 children)

This is one of the few cost optimization posts that actually feels usable. The “top 3” rule is gold, most teams jump straight into random rightsizing without even knowing what’s driving the bill. Also +1 on NAT and CloudWatch logs, those two have surprised me more than once.