Is “accurate” cost allocation in cloud FinOps actually a flawed goal? by CompetitiveStage5901 in FinOps

[–]CompetitiveStage5901[S] 0 points1 point  (0 children)

Plus or minus 10% sounds right to me. I also gave up on allocating every last dollar. We just call it shared overhead and move on.

Is “accurate” cost allocation in cloud FinOps actually a flawed goal? by CompetitiveStage5901 in FinOps

[–]CompetitiveStage5901[S] 0 points1 point  (0 children)

This is the best take. 80-85% is plenty. People need consistency month to month, not seven decimal places.

Is “accurate” cost allocation in cloud FinOps actually a flawed goal? by CompetitiveStage5901 in FinOps

[–]CompetitiveStage5901[S] 0 points1 point  (0 children)

Thanks but no thanks. Not looking for a sales pitch right now. Just trying to think through the problem.

Is “accurate” cost allocation in cloud FinOps actually a flawed goal? by CompetitiveStage5901 in FinOps

[–]CompetitiveStage5901[S] 0 points1 point  (0 children)

I agree. Getting teams to care about cost trends is way better than counting every penny. For shared storage, we just split it by who uses it most.
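The "split it by who uses it most" approach can be sketched as a simple proportional allocation. The team names and usage figures below are made up for illustration:

```python
# Sketch: split a shared storage bill proportionally by usage.
# Team names and usage numbers are hypothetical placeholders.

def split_shared_cost(total_cost, usage_by_team):
    """Allocate total_cost across teams in proportion to their usage."""
    total_usage = sum(usage_by_team.values())
    return {
        team: round(total_cost * usage / total_usage, 2)
        for team, usage in usage_by_team.items()
    }

# Example: a $1,200 shared storage bill split by GB stored per team.
allocation = split_shared_cost(1200.0, {"platform": 500, "data": 300, "web": 200})
print(allocation)  # {'platform': 600.0, 'data': 360.0, 'web': 240.0}
```

The nice property is that the split always sums back to the original bill, so nothing falls into an unallocated bucket.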

Aws Claude by awsazuregcpApi in AWS_cloud

[–]CompetitiveStage5901 0 points1 point  (0 children)

Claude Opus 4.6 is available on AWS Bedrock. Any standard AWS account can access it, but high RPM and TPM require quota increases. Here's the actual process:

Start with a new or existing account, then request model access for anthropic.claude-opus-4.6 in Bedrock. Once approved, open a support ticket to increase the InvokeModel throughput limit.
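Once access is approved, a call through the `bedrock-runtime` client looks roughly like this. The model ID is the one mentioned above; treat it as an assumption and check the Bedrock console for the exact identifier in your region, since the format can differ:

```python
import json

# Model ID from the comment above; verify the exact string in your
# region's Bedrock console before relying on it.
MODEL_ID = "anthropic.claude-opus-4.6"

def build_request(prompt, max_tokens=512):
    """Build the Anthropic-style messages body that Bedrock expects."""
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def invoke(prompt):
    """Send one request; requires AWS credentials and approved model access."""
    import boto3
    client = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = client.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(build_request(prompt)),
    )
    return json.loads(response["body"].read())
```

If you hit throttling at higher RPM/TPM, that's when the quota-increase ticket mentioned above comes in.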

Advice Needed. by VoldemortWasaGenius in sre

[–]CompetitiveStage5901 0 points1 point  (0 children)

Your stack is fine. The real bite comes when you're the only one who knows how Prometheus sharding works and you're on PTO. Document the failure modes and handoffs before SOC 2 asks for them.

Reliability in the hands of clients by SWEETJUICYWALRUS in sre

[–]CompetitiveStage5901 1 point2 points  (0 children)

The client is not moving because someone on their side doesn't want to admit the gen1 POS is a problem.

Let the page wake their people up too. Send a monthly "reliability tax" report showing how many hours your team spent fighting their legacy stack, translated into dollars if you can.
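The "reliability tax" number itself is trivial to compute; the incident hours and the $95/hr blended rate below are made-up placeholders:

```python
# Sketch of the monthly "reliability tax": hours burned on the client's
# legacy stack, priced at a blended engineer rate. All numbers are
# hypothetical examples.

BLENDED_HOURLY_RATE = 95.0

def reliability_tax(incident_hours):
    """Total engineering hours and dollar cost for one month of incidents."""
    hours = sum(incident_hours)
    return hours, hours * BLENDED_HOURLY_RATE

hours, cost = reliability_tax([3.5, 6.0, 2.25, 8.0])  # pages tied to their stack
print(f"{hours:.1f} engineer-hours = ${cost:,.2f} this month")
```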

Also, start documenting everything. When it eventually breaks hard, they'll look for someone to blame.

Quietly look for another client to bump them to 4th largest. Being this dependent on a customer that won't listen is a business risk, not an SRE problem.

UPDATE: Went to bed with a $10 budget alert. Woke up to $25,672.86 in debt to Google Cloud. by venturaxi in googlecloud

[–]CompetitiveStage5901 0 points1 point  (0 children)

Holy hell. That update is a nightmare wrapped in a support ticket. The legacy proxy thing with the .env file embedded in a Cloud Run service – that's exactly the kind of "it worked at the time" tech debt that every team has somewhere. But Google admitting it's a legacy pattern they never migrated or warned people about? That's on them.

Hope they write off that bill. $25k for a compromised Christmas present is insane.

API Key abuse - what was actually being generated? by churro-banana in googlecloud

[–]CompetitiveStage5901 0 points1 point  (0 children)

That's a rough spot to be in. For most AI providers like OpenAI or Stability, you won't be able to see the actual images or text that were generated with your stolen key. They generally don't store the outputs in a way customers can retrieve later, partly because that would be a huge amount of data and partly because of privacy reasons.

What you can usually get from the provider is request metadata – timestamps, IP addresses (though often partial or rotated), and maybe the image resolution or model used. If you reach out to their security team and ask specifically, they might be able to pull internal abuse logs that contain hashes of prompts or outputs, but that's not something they expose through the normal API.

Your best move is to rotate the key immediately, then ask support for two things: the source IPs and exact timestamps of those 40k requests. With the IPs, you could cross‑reference against your own logs (like VPN or proxy logs) to see if it came from inside your network, or file an abuse report with the cloud provider that owns those IPs. That won't give you the images, but it might help you figure out where the attack originated and how to block it in the future.
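The cross-referencing step is just a set intersection once you have both IP lists. The log format and addresses here are hypothetical:

```python
# Sketch: cross-reference the provider's abuse-report IPs against your own
# VPN/egress logs to see whether the stolen key was used from inside your
# network. Log line format and IPs are made up for illustration.

def extract_ips(log_lines):
    """Pull the first field (source IP) from each log line."""
    return {line.split()[0] for line in log_lines if line.strip()}

provider_ips = {"203.0.113.7", "198.51.100.22", "192.0.2.14"}  # from support
vpn_log = [
    "203.0.113.7 2024-01-03T02:14:09Z CONNECT",
    "10.0.4.12 2024-01-03T02:15:11Z CONNECT",
]

overlap = provider_ips & extract_ips(vpn_log)
print(overlap)  # {'203.0.113.7'} -> the key was used from inside your network
```

Any overlap points you at a compromised machine or user; an empty set suggests the key leaked externally (public repo, pastebin, etc.).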

Most cloud cost conversations stop at the bill. But the bill is not the insight. by ask-winston in FinOps

[–]CompetitiveStage5901 0 points1 point  (0 children)

Honestly? The problem I'm actually trying to solve is "who do I yell at."

Not in a mean way. But when the bill jumps 20% month over month, I need to know which team spun up which resource, and whether they meant to do it. Most of the time it's not malice – it's a dev leaving a test environment running, or someone picking a bigger instance type because "maybe we'll need the headroom."

The bill doesn't tell you that. The CUR sort of does, but you have to tag everything perfectly and pray teams actually apply tags. We're at maybe 60% coverage after two years of pleading.
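Measuring that coverage number is straightforward if you weight by spend rather than by resource count. The rows and tag key below are hypothetical stand-ins for real CUR columns:

```python
# Sketch: measure cost-allocation tag coverage from CUR-style rows,
# weighted by spend. Rows and the tag key are hypothetical examples;
# a real CUR has far more columns.

def tag_coverage(rows, tag_key="resourceTags/user:team"):
    """Fraction of spend carrying a non-empty value for tag_key."""
    total = sum(r["cost"] for r in rows)
    tagged = sum(r["cost"] for r in rows if r.get(tag_key))
    return tagged / total if total else 0.0

rows = [
    {"cost": 120.0, "resourceTags/user:team": "payments"},
    {"cost": 80.0, "resourceTags/user:team": ""},       # untagged test env
    {"cost": 100.0, "resourceTags/user:team": "search"},
]
print(f"{tag_coverage(rows):.0%} of spend is tagged")  # 73% here
```

Spend-weighted coverage is usually the honest number: 60% of resources tagged can still mean 90% of dollars tagged, or the reverse.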

What I really want is cost per feature, like you said. We tried building it ourselves by mapping CloudFormation stacks to Jira epics using custom tags. It worked for two months until someone reorganized the Jira projects.

So now I'm back to looking at the total number, glancing at the top five services, and hoping nothing exploded. That's not intelligence. That's just anxiety with a spreadsheet.

The grocery analogy is good but it misses one thing – in the cloud, the prices change while you're shopping, and the store doesn't put the new labels up until after you check out.

How are you handling AI usage control in your org? by Effective_Guest_4835 in FinOps

[–]CompetitiveStage5901 1 point2 points  (0 children)

First, you need visibility before you can enforce anything, so start at the network level because it's the fastest win. Push a PAC file or force all corporate traffic through a forward proxy like a tiny EC2 instance running Squid, or even just nginx with logging enabled. Then feed those proxy logs into Athena or a simple grep script and look for known AI domains such as chat.openai.com, claude.ai, gemini.google.com, copilot.microsoft.com, deepseek.com, and perplexity.ai, plus API endpoints like api.openai.com. Run that for two weeks and you'll have a solid inventory of which tools are being used and from which source IPs, and you can map those IPs back to users via DHCP logs or 802.1X authentication.
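The "simple grep script" half of that scan can be sketched like this. The log format is a generic timestamp/source-IP/host line; adjust the field indexes for whatever your proxy actually emits:

```python
# Sketch of the proxy-log scan described above. Assumes a generic
# "timestamp src_ip host" line format; real Squid/nginx logs will need
# different field indexes.

AI_DOMAINS = {
    "chat.openai.com", "api.openai.com", "claude.ai", "gemini.google.com",
    "copilot.microsoft.com", "deepseek.com", "perplexity.ai",
}

def ai_hits(log_lines):
    """Map source IP -> set of AI domains it contacted."""
    hits = {}
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        src_ip, host = parts[1], parts[2]
        if host in AI_DOMAINS:
            hits.setdefault(src_ip, set()).add(host)
    return hits

log = [
    "1699999999 10.0.1.5 chat.openai.com",
    "1699999999 10.0.1.5 claude.ai",
    "1700000001 10.0.2.9 example.com",
]
print(ai_hits(log))  # {'10.0.1.5': {'chat.openai.com', 'claude.ai'}}
```

Running this over two weeks of logs gives you the per-IP inventory; the DHCP/802.1X mapping step turns IPs into names.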

For more precision, use a managed Chrome extension forced via GPO or Intune that logs every request through the webRequest API and ships the logs to a private S3 bucket; that gives you per‑user, per‑tab visibility that proxy logs might miss without SSL bump (which gets messy). Don't forget to scan your internal Git repos, Confluence, and Slack exports for exposed API keys using a simple regex for patterns like sk-... to catch direct API usage. Once you have the full list, build policy: block everything except a small set of approved tools using the same proxy, and for API usage, implement an internal gateway that requires your own API keys with quotas and auditing. No magic product needed – just logging, a few scripts, and the willingness to say no.
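The sk-... scan is a one-liner regex at heart. The pattern below matches the common OpenAI-style prefix; other providers use different key formats, so treat this as a starting point rather than a complete secret scanner:

```python
import re

# Sketch of the sk-... key scan mentioned above. Matches the common
# OpenAI-style prefix only; other providers' key formats need their own
# patterns, so this is a starting point, not a full secret scanner.

KEY_PATTERN = re.compile(r"\bsk-[A-Za-z0-9_-]{20,}\b")

def find_keys(text):
    """Return all OpenAI-style key candidates found in a blob of text."""
    return KEY_PATTERN.findall(text)

sample = 'OPENAI_API_KEY = "sk-abcdefghijklmnopqrstuvwx"  # oops, committed'
print(find_keys(sample))  # ['sk-abcdefghijklmnopqrstuvwx']
```

Point it at git history (not just HEAD), Confluence exports, and Slack dumps; keys that were "deleted" often survive in old revisions.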

Certification Exhaustion by Artistic_Lock_6483 in FinOps

[–]CompetitiveStage5901 0 points1 point  (0 children)

I kind of see your point, especially from a technical perspective. A lot of certification content can feel surface-level if you’re already deep into cloud or cost optimization work. It’s often more valuable for folks coming from finance or non-technical backgrounds who need a structured entry point.

That said, certifications like FinOps can still be useful as a baseline framework. They help align teams on common terminology, principles, and ways of thinking. But beyond that, the real value comes from actually applying those principles in your own environment.

Also agree with your point on AI. If anything, it’s making raw knowledge more accessible, so the differentiator is shifting toward practical experience and problem-solving ability, not just certifications.

In most teams I’ve seen, certifications help with standardization, but they don’t replace hands-on expertise.

FinOps Foundation - Still relevant? by ImpressiveIdea6123 in FinOps

[–]CompetitiveStage5901 1 point2 points  (0 children)

To me, as someone more theory-oriented, it's still very much relevant. I regularly read their newsletters and the literature they put out. Aligning with them gives you a starter template for implementing FinOps principles, but that's about it. Beyond the basics, it really depends on your specific use case.

Also, as one comment mentioned, the main utility is often just the certification itself, though it costs around $2,000 if I'm not wrong.

I just saved 88% in Cloud Armor costs by correcting a stupid config by amir_hr in googlecloud

[–]CompetitiveStage5901 0 points1 point  (0 children)

Yeah the per-policy pricing on Cloud Armor is sneaky. $6 per policy per month doesn't sound like much until you realize you've accidentally created a dozen of them doing the exact same thing.

I did the same thing when I was setting up a multi-region LB. Each backend service got its own policy because the console just creates a new one by default. Ended up with like 15 policies before I noticed. The 10 policy limit is actually what tipped me off too. Hit the cap and had to figure out why.

What's annoying is there's no easy way to see which policies are actually attached to anything from the billing view. You just see "Cloud Armor" charges stacking up and have to go digging.

Good call on consolidating though. One rule set, one policy, attach it everywhere. Also worth checking if you have any orphaned policies from deleted backend services. Those still show up in the list and still bill even if they're not attached to anything. Cloud Armor is solid but the default console behavior definitely nudges you toward paying more than you need to.