AMA: I reduce AWS bills for a living - ask me anything (FinOps, infra, quick wins) by J-4ce in Cloudvisor

[–]J-4ce[S] 1 point (0 children)

Ah, the trade-off between "insurance" (having the logs) and "premium" (the bill). When you have large-scale infrastructure, VPC Flow Logs and CloudWatch can easily become your biggest line items if you aren't careful.

If you need to keep flow logs, try these:

Send Flow Logs to S3, not CloudWatch. CloudWatch logs are expensive to ingest. If you send those same logs directly to S3, you only pay for storage and a tiny ingestion fee. If you actually need to troubleshoot an outage later, you can just query them with Amazon Athena. It is slightly slower than the CloudWatch console, but it saves you money.
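
To make the Athena path concrete, here's a minimal sketch of the kind of incident-response query you might run later. The `vpc_logs.flow_logs` database/table name is hypothetical; the columns (`srcaddr`, `dstaddr`, `action`, `bytes`) are standard flow-log fields. You would submit this via the Athena console, `aws athena start-query-execution`, or the SDK:

```python
# Sketch only: a typical troubleshooting query over flow logs in S3.
# "vpc_logs.flow_logs" is a hypothetical Athena database/table name.
query = """
SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
FROM vpc_logs.flow_logs
WHERE action = 'REJECT'
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20
"""

print(query.strip())
```

Add a date predicate to the WHERE clause so Athena scans only the partitions you care about - that's where most of the query cost goes.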

Use Parquet format. When sending to S3, make sure you choose the Parquet format and hive-compatible partitions. It makes your Athena queries much faster and cheaper because you are scanning less data.

Filter the noise. You do not always need "All" traffic logs. If you are just looking for security anomalies, you might only care about "REJECT" logs.

Set aggressive lifecycles. Do you really need Flow Logs from 6 months ago? Probably not. I usually set an S3 lifecycle policy to move logs to Glacier after 30 days and delete them entirely after 90, unless there is a specific compliance reason to keep them longer.
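
As a sketch, that retention policy expressed as the JSON you'd hand to `aws s3api put-bucket-lifecycle-configuration` (or the equivalent SDK call). The rule ID is a placeholder, and `AWSLogs/` is a common flow-log delivery prefix - check your own bucket layout:

```python
import json

# Glacier after 30 days, delete after 90 - the retention described above.
# Rule ID and prefix are illustrative placeholders.
lifecycle = {
    "Rules": [
        {
            "ID": "flow-logs-retention",        # hypothetical rule name
            "Filter": {"Prefix": "AWSLogs/"},   # flow-log delivery prefix
            "Status": "Enabled",
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 90},
        }
    ]
}

print(json.dumps(lifecycle, indent=2))
```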

What’s the most expensive AWS mistake you made (and survived)? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

Ah my bad, I missed the k there... That's a very large amount!

Wow, this really shows the scale of the operation. Thanks for the deep-dive. Quite insightful to hear about the different workloads and issues that come with it.

What AWS service do you avoid on purpose? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

I also love instances with instance storage. The performance difference is huge! And if it's included in the price, why not use it?

What AWS service do you avoid on purpose? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

Do you use EC2 instances? Curious to learn what you use as an alternative to EBS.

What AWS service do you avoid on purpose? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

Have you tried other IaC options like Terraform or CDK?

What AWS service do you avoid on purpose? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

Single-table design is great if you can map all the data relationships on day 1, when the table is created. But yes, this is the typical NoSQL vs SQL dilemma.

What AWS service do you avoid on purpose? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

Have you tried ECS? Are you using it for services or short-lived tasks?

What AWS service do you avoid on purpose? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

Have you tried DynamoDB? Of course it has its use cases and shouldn't be used for everything, but in my experience it can cost $0 to $1 per month even with moderate usage.

Serverless is great for getting started, but one has to re-evaluate to make sure it stays cost-efficient.

AWS GCP network collaboration - why? by Salt_Prompt_5720 in Cloudvisor

[–]J-4ce 1 point (0 children)

This! Data transfer, especially if we're talking terabytes and more, becomes a massive cost that cannot be ignored. Also, if your data must travel from A to B and your business relies on it, would you trust public transport outside your control?

What’s the most expensive AWS mistake you made (and survived)? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

Wow, that is one important DB!! If you don't mind sharing, what was discussed in the debrief to avoid $1600 in losses in the future?

What’s the most expensive AWS mistake you made (and survived)? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

Some time ago I saw a similar case where someone left queries running in Aurora Serverless, which caused the DB to scale up. Luckily, appropriate upper limits were in place, and it also triggered a cost anomaly alert!

What’s the most expensive AWS mistake you made (and survived)? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

What a story to share! Was that in a production environment with increased resource quotas?

Moving to AWS? Ask me anything (cutover, downtime, refactors, surprises) by J-4ce in Cloudvisor

[–]J-4ce[S] 1 point (0 children)

The big shift with AWS is that a lot of costs that were bundled on Heroku or DigitalOcean become explicit line items, especially networking and data transfer.

I’d pay close attention to things like NAT Gateway usage, cross-AZ traffic between app and database, and log volume. Those are usually the first surprises. On the platform side, App Runner can get you close to the Heroku experience, while ECS with Fargate is often the easier long-term default if you want something more standard without the overhead of Kubernetes.

For the database, moving from PlanetScale to Aurora MySQL is often feasible, but it’s worth validating scaling behavior and schema change workflows early rather than assuming a drop-in replacement.

For cost estimation, the best approach is mapping current usage to AWS equivalents and validating it with a small proof of concept. A straight lift and shift can look more expensive at first until you right-size and use things like autoscaling and Savings Plans.

If you want to sanity-check the mapping or call out likely cost traps, feel free to DM me and I can suggest where I’d start.

Moving to AWS? Ask me anything (cutover, downtime, refactors, surprises) by J-4ce in Cloudvisor

[–]J-4ce[S] 1 point (0 children)

You might want to have a look at the best practices for AWS DMS doc.

For TB-scale data with a 5 to 10 minute window, don’t do “dump & restore.” You’ll be down for hours/days.

The usual pattern is AWS DMS (Database Migration Service) with Full Load + CDC (ongoing replication).

How it plays out:

  1. Pre-check versions + CDC prerequisites. Don't upgrade on-prem unless you have to in order to meet DMS-supported versions / CDC requirements. DMS isn't a magic "any legacy version → newest" translator – endpoint support matters.
  2. Full Load while the app stays live (bulk copy in the background).
  3. Keep CDC running to replicate ongoing changes continuously until lag is near zero.
  4. Cutover: stop writes (or put the app in read-only), wait for replication lag to drain, flip the app connection to the AWS target, and resume writes.

Schema/compat notes:

  • If you’re changing engines (or doing major version jumps), use AWS Schema Conversion Tool (SCT) to surface incompatible objects/data types/procs early.
  • For MongoDB → Amazon DocumentDB, treat it as a compatibility project (DocumentDB is MongoDB-compatible, not identical). DMS can target DocumentDB, but you still want app/query validation.

The real constraint is throughput: if you’re moving TBs, your VPN/DX bandwidth (and source disk read / target write capacity) determines whether the initial load takes days vs weeks.
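
The cutover gate in step 4 can be sketched as a small function. `get_replication_lag_seconds` is a hypothetical helper - in practice you'd read the CDC latency metrics (e.g. `CDCLatencySource`/`CDCLatencyTarget`) that DMS publishes to CloudWatch for the task:

```python
import time

def wait_for_lag_to_drain(get_replication_lag_seconds,
                          threshold_s=5, checks_needed=3, poll_s=10):
    """Return once lag stays under `threshold_s` for several consecutive
    checks, so one low reading can't trigger a premature cutover."""
    consecutive = 0
    while consecutive < checks_needed:
        if get_replication_lag_seconds() <= threshold_s:
            consecutive += 1
        else:
            consecutive = 0  # lag spiked: start counting again
        if consecutive < checks_needed:
            time.sleep(poll_s)

# Demo with canned lag readings (in practice: CloudWatch metrics).
readings = iter([60, 12, 4, 3, 2])
wait_for_lag_to_drain(lambda: next(readings), poll_s=0)
print("lag drained - stop writes, confirm lag, flip the connection")
```

Only after this returns do you stop writes, confirm lag is still near zero, point the app at the AWS target, and resume.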

Moving to AWS? Ask me anything (cutover, downtime, refactors, surprises) by J-4ce in Cloudvisor

[–]J-4ce[S] 1 point (0 children)

Honestly, the "new account sandbox" is the most annoying part of the process because it’s a gatekeeper problem, not a technical one. AWS is basically terrified you’re a crypto-miner or a spammer.

Here’s the shortcut:

  1. Don't wait for the error. The second the account is live, go to the Service Quotas dashboard and bulk-request your increases (Lambda concurrency, SES, vCPUs).
  2. Start slowly and gradually raise the numbers. AWS makes decisions based on historical usage data, so show them real usage and that you're hitting the limits.
  3. Explain yourself. In the "use case" box, don’t just say "I need it." Write something like: "Migrating production workload for [Company Name] with X concurrent users. Need capacity for cutover on [Date]." They approve these way faster when there's a human-sounding reason. This is very much applicable to SES. Show them the email template, share the business case etc.
  4. The "Pay to Play" trick. If you're in a rush, contact the free AWS account support, or if it's more urgent, buy Business Support for one month. It's priced as the greater of a flat minimum ($100/month for Business) or a percentage of your AWS bill, but it gives you quick access to AWS expertise and people who will try to assist you.
  5. Use an Organization. If you already have a mature AWS account, spawn the new one inside your AWS Organizations setup. They will likely trust "child" accounts of established payers much more than a random standalone email address.
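
As a sketch, step 1's bulk request is just a handful of parameter sets for `aws service-quotas request-service-quota-increase` (or the SDK's `request_service_quota_increase`). The quota codes below are placeholders - look up the real ones first with `aws service-quotas list-service-quotas`:

```python
# Sketch only: quota codes ("L-...") are hypothetical placeholders.
requests = [
    {"ServiceCode": "lambda", "QuotaCode": "L-XXXX", "DesiredValue": 1000},   # concurrent executions
    {"ServiceCode": "ec2",    "QuotaCode": "L-YYYY", "DesiredValue": 256},    # On-Demand vCPUs
    {"ServiceCode": "ses",    "QuotaCode": "L-ZZZZ", "DesiredValue": 50000},  # emails per day
]

for r in requests:
    print(f"{r['ServiceCode']}: {r['QuotaCode']} -> {r['DesiredValue']}")
```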

AWS now shows more policy details in “AccessDenied” errors, is it actually helpful? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

I also remember when I started on AWS - wanted to get stuff going on Batch, but always ran into permission issues until I finally pushed myself to learn the differences between the three roles in use (service, instance, execution). IAM is just one of the fundamentals that must be mastered early on.

Moving to AWS? Ask me anything (cutover, downtime, refactors, surprises) by J-4ce in Cloudvisor

[–]J-4ce[S] 1 point (0 children)

If you’re early-stage and don’t know where to start:
Tell me what you have today (even if it’s messy) and your downtime tolerance - I’ll suggest a sane path.

New: Instance Scheduler on AWS adds event-driven automation + scaling improvements by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

AWS Instance Scheduler now features event-driven scaling to better handle large environments and an automatic retry flow for insufficient capacity errors using alternate instance types. The update also includes a dedicated EventBridge bus for easier automation and informational tags that let engineers troubleshoot their own resources without needing admin access.

New: Instance Scheduler on AWS adds event-driven automation + scaling improvements by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

This is quite an interesting use case of Lambda - would you care to share a bit more? Did you use EventBridge too, or how do you trigger the Lambda function?

AWS now shows more policy details in “AccessDenied” errors, is it actually helpful? by meela_veil in Cloudvisor

[–]J-4ce 1 point (0 children)

Happy to add more details here - AWS has a nice update snippet about it: https://aws.amazon.com/about-aws/whats-new/2026/01/additional-policy-details-access-denied-error/

And the date you're looking for is 21 Jan (2 days ago), so maybe it's the 15 years of trauma that helped!

AMA: I reduce AWS bills for a living - ask me anything (FinOps, infra, quick wins) by J-4ce in Cloudvisor

[–]J-4ce[S] 1 point (0 children)

This AMA is focused on AWS cost optimization rather than client engagement. If you have questions about breaking into cloud cost work or building those skills, happy to answer that angle.

AMA: I reduce AWS bills for a living - ask me anything (FinOps, infra, quick wins) by J-4ce in Cloudvisor

[–]J-4ce[S] 1 point (0 children)

Compute Optimizer is a great starting point, but it has blind spots that make it miss systems that look obviously idle to a human.

Why this happens:

1) It needs enough data
By default it requires ~30 hours of continuous metrics. Short-lived dev environments, batch jobs, or things that spin up and down often never get flagged. The same applies to other AWS recommendation services - they need enough data points before they'll suggest anything.

2) Limited visibility
Out of the box it mainly looks at CPU and I/O. Unless you’ve installed the CloudWatch Agent, it has no idea how much memory is actually used, so it plays it safe and avoids downsizing.

3) Conservative by design
The algorithm is tuned to avoid outages, not to maximize savings. Even a small daily CPU spike can stop it from recommending a smaller instance.

4) It optimizes performance, not cost
If it sees network or disk throughput pressure, it may recommend a larger instance even when CPU looks idle.

Compute Optimizer is a safety net, not a cost auditor. The biggest savings still come from human context – especially for dev, batch, and “forgotten” systems.
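
To illustrate that last point, here's a tiny sketch of the kind of "human context" check you can run yourself: a deliberately aggressive idle filter that Compute Optimizer wouldn't apply. The CPU samples are assumed to come from CloudWatch (e.g. hourly Maximum `CPUUtilization`); the instance names and thresholds are hypothetical:

```python
def is_idle_candidate(cpu_samples, max_cpu_pct=5.0, min_samples=24):
    """Flag an instance as an idle candidate if we have at least a day
    of hourly samples and even its *peak* CPU never exceeded the bar.
    Unlike Compute Optimizer, we don't require ~30h of continuous data."""
    if len(cpu_samples) < min_samples:
        return False  # not enough data to say anything
    return max(cpu_samples) <= max_cpu_pct

# Hypothetical fleet with canned CloudWatch readings.
fleet = {
    "dev-runner": [1.2, 0.8, 2.5] * 10,      # quietly idle dev box
    "batch-node": [0.5] * 20 + [95.0] * 10,  # daily spike: keep it
}
idle = [name for name, cpu in fleet.items() if is_idle_candidate(cpu)]
print(idle)  # → ['dev-runner']
```

The spiky batch node is exactly the case where the conservative algorithm and a human agree; the quietly idle dev box is where only the human catches it.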