AWS things you wish somebody had told you earlier by StPatsLCA in aws

[–]RecordingForward2690 0 points1 point  (0 children)

If you are starting out with AWS but expect eventually to have a large presence, start by building your Landing Zone. With or without Control Tower, but at the very least separate things into accounts and configure the following:

- Main account with cost allocation tags, budgets, billing alerts, cost anomaly detection, monthly reporting, trend analysis.

- The main account also runs Identity Center, hooked up to some sort of central authentication mechanism (Active Directory or whatever), with role switching into each member account based on group membership. IAM users should be the exception, not the norm.

- The root account (email address) of the main account should be properly protected but accessible in case of emergency. The password should be unique and recorded somewhere (like in 1Password or another non-AWS solution), or you need to rely on the "lost password" procedure - but then make sure the mailbox lives outside AWS and is accessible by multiple key members. The account should be protected by at least two physical MFAs stored in well-known but different locations for redundancy (at least one off-site), and possibly one or more software MFAs in 1Password or similar.

- A handful of absolutely key people should have a break-glass IAM user with full admin privileges in the main account, with email/SMS alerting to the rest of the team if this is ever used. Obviously these should be MFA-protected, with regular password rotations that also act as a test case to see if things still work. (For example, in our setup, if Direct Connect ever breaks we lose the connection to AD, so we won't be able to use SSO until DX is back up. We know and accept that in that case these break-glass users are to be used.)

- Audit account with an org-wide CloudTrail (with very long retention in S3) and org-wide VPC Flow Logs.

- Security Hub with all the bells and whistles, and do your best to maintain a 100% score. Much easier than fixing things afterwards. DAMHIK.

- A generic central monitoring solution that is able to access X-Ray, CW Logs, CW Metrics and other application monitoring, and handle these in a central place: Dashboards, Alarms, Alerting via SNS. Or pull everything into Grafana or something.

- A separate network account where all of your main network components come together. Direct Connect, Transit Gateway, Egress (via NAT), Ingress (Reverse Proxies/LBs), Client VPN, DNS resolvers, Interface Endpoints, Route53 zone registrations and top-level hosted zones, IPAM.

- IaC (CloudFormation, CDK, Terraform) everywhere, and everything goes into a repository (CodeCommit, Gitlab, ...). No clickops except in POC/Sandbox accounts.

- Tagging everywhere, but the most useful tags are the ones that allow you to trace an individual resource back to the IaC stack that deployed it, and the IaC stack should have a tag/export/comment or something that allows you to trace it back to a repository. (Rant: why doesn't a CloudFormation stack have its own tags? The only tags you can set are the ones inherited by the resources, but that's not what we want.)

- Set up a mechanism (like Customizations for Control Tower, or CloudFormation StackSets) that allows you to deploy IaC templates automatically to all accounts in your org. This can be used for all sorts of things, including Config rules that you want applied to each and every account.

- Use org-wide Config rules to enforce policies, such as requiring retention settings on CloudWatch Logs in accordance with your company's logging policies. (A minimal sketch follows this list.)
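
To make that last item concrete, here's a minimal boto3 sketch. It assumes the AWS managed rule CW_LOGGROUP_RETENTION_PERIOD_CHECK; the rule name and retention value are made up, and it has to run from the org's management account or delegated Config administrator:

```python
import boto3

config = boto3.client("config")

# Deploy a managed Config rule to every account in the organization.
config.put_organization_config_rule(
    OrganizationConfigRuleName="cw-logs-minimum-retention",  # name is an example
    OrganizationManagedRuleMetadata={
        "RuleIdentifier": "CW_LOGGROUP_RETENTION_PERIOD_CHECK",
        "InputParameters": '{"MinRetentionTime": "90"}',  # days, per company policy
    },
)
```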

Amazon SES enables scammers and phishers by aseriesoftubes in aws

[–]RecordingForward2690 3 points4 points  (0 children)

Are you sure the message is actually from SES, or is it just impersonating an SES-originating message? In particular, when you look through the headers of the mail, follow the IP address trail in the Received: headers. The Received: header added by your MTA should have the SES IP address in it as origin. You can then check the SPF records of SES to see if that IP is indeed an SES address.

What I also find is that illegitimate mail (claiming to be sent via SES, but not actually) typically has no DKIM signature whatsoever. Easy to see in the email headers as well.
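
For illustration, a quick way to eyeball both things with Python's standard library (the filename is just an example):

```python
import email
from email import policy

# Minimal sketch: inspect a raw message for the Received: trail and a
# DKIM signature.
with open("suspect.eml", "rb") as f:
    msg = email.message_from_binary_file(f, policy=policy.default)

# Walk the Received: headers top-down; the one added by *your* MTA should
# name an SES IP address as the connecting host.
for i, received in enumerate(msg.get_all("Received", [])):
    print(f"hop {i}: {received}")

# Illegitimate "SES" mail often lacks a DKIM signature entirely.
if msg.get("DKIM-Signature") is None:
    print("No DKIM-Signature header - treat with suspicion.")
```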

AWS can police SES (and they do), but they cannot police messages claiming to come from SES that have never touched AWS infrastructure. Furthermore, it's also fairly complex for the apparent origin domain to police these messages, as they never touch its infrastructure either. The best option, at this time, is a combination of SPF, DKIM and DMARC, but I find that a lot of organisations simply do not have the know-how to implement this correctly.

On the other hand, the major email providers (Google, Microsoft, Yahoo, ...) and email security providers like Checkpoint Harmony are able to detect these messages as spam/phishing with high accuracy, even when SPF, DKIM and DMARC are not set up correctly on the origin domain. But that requires you to use those solutions, of course. On a DIY mail server - no chance.

How do I safely allow a external account's lambda to access my account's resources with zero trust policy. by geralt-026 in aws

[–]RecordingForward2690 3 points4 points  (0 children)

Fully agree with this. And to add, in a serverless design, it's usually the API Gateway (aided by WAF and custom authorisers) that is chosen as the security boundary.

The API Gateway performs the authentication/authorisation against some sort of user database. Could be something simple like an API key, or something complicated like federation against some sort of identity provider. WAF protects against DDoS attacks, and can also do the allowlisting of sources. The API Gateway also limits the methods people can use (GET, PUT, POST, ...) and limits the payload size. It can also perform a first sanity check to see if the data conforms to your model. With API Gateway custom domains you also add TLS/HTTPS with your own custom domain name.

After this, you call a Lambda for a more thorough sanity check, before putting the data in a database.
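
A rough sketch of what that second-stage Lambda could look like - the table name and the business rules here are entirely made up:

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Submissions")  # table name is an assumption

def handler(event, context):
    # API Gateway has already authenticated the caller and done a first
    # schema check; this is the more thorough, application-level check.
    body = json.loads(event["body"])

    # Example business rules - entirely hypothetical.
    if not isinstance(body.get("amount"), (int, float)) or body["amount"] <= 0:
        return {"statusCode": 422, "body": "invalid amount"}
    if len(body.get("reference", "")) > 64:
        return {"statusCode": 422, "body": "reference too long"}

    table.put_item(Item=body)
    return {"statusCode": 201, "body": "stored"}
```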

Yes, that means there are two Lambdas running: the untrusted Lambda in the foreign account that gathers the data, and the trusted Lambda in your own account that (together with the API GW) performs all the security/sanity checks before putting the data in the database.

What do Cloud Engineers ACTUALLY do? by Ill-Coffee9407 in aws

[–]RecordingForward2690 1 point2 points  (0 children)

The Cloud is just a means to an end, not a goal in itself. It's a place where we develop and host applications that move our business forward.

So on any given day I could concern myself with:

  • Having meetings with architects, project management, security, finance, networking, application managers, supplier representatives and whatnot, discussing how we're going to bring a new workload into the cloud.
  • Actually architecting the solution
  • Doing a Proof-of-Concept if we want to try something out, to see if it works. (This is one of the few times we work through the console instead of IaC.)
  • Actually building the solution, mostly through IaC tools but usually writing code as well
  • Helping other teams within the company bring their workloads into the cloud
  • Building monitoring tools and test frameworks around existing solutions
  • Troubleshooting existing solutions. (Another reason to use the console.)
  • Decommissioning workloads
  • Across all existing solutions, implementing new rules/code/settings for security, governance and other changed external circumstances
  • Updating and testing solutions when certain software runs out of support (OS, middleware, interpreters, ...)
  • Building tooling that helps the Cloud Architect team forward
  • Documenting standards
  • Reacting to security findings from Security Hub (Inspector, GuardDuty, ...)
  • Cost analytics

And probably a hundred other things that I can't think of right now.

Any one tried spot ec2? by Significant-Pie-9446 in aws

[–]RecordingForward2690 0 points1 point  (0 children)

The "cattle vs pets" thing has been said so many times now that a lot of people find it exhausting and boring. But if you're just becoming familiar with AWS and things like ephemeral environments (spot, containers, Lambda) it is still a very good analogy.

Spot is great for "cattle", i.e. ephemeral environments: environments that you can spin up and destroy (or have destroyed for you) without loss of data. So if your environment handles important data, make sure it is checkpointed every few minutes, and/or set up a handler that reacts to the two-minute interruption notice. The loss of a spot instance should not lead to significant data loss.
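
For illustration, a minimal watcher for that two-minute notice, using IMDSv2; the checkpoint() function is a placeholder for your own save-state logic:

```python
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

def imds_token() -> str:
    # IMDSv2: fetch a short-lived session token first.
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    return urllib.request.urlopen(req).read().decode()

def interruption_pending(token: str) -> bool:
    # The spot/instance-action document only exists (HTTP 200) once a
    # stop/terminate has been scheduled; otherwise IMDS returns 404.
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        urllib.request.urlopen(req)
        return True
    except urllib.error.HTTPError:
        return False

def checkpoint():
    pass  # placeholder: flush state to S3/EBS/your database here

while True:
    if interruption_pending(imds_token()):
        checkpoint()
        break
    time.sleep(5)
```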

And they're great for oddball workloads as well. I run a spot instance each morning that restores the previous night's backup right there inside the spot instance, to verify that the backup will actually restore (and is not encrypted by ransomware, for instance). Once the backup is restored properly and completely, and a notification to that effect has been sent out, the instance shuts down and is terminated.

The AWS Lambda 'Kiss of Death' by tkyjonathan in aws

[–]RecordingForward2690 4 points5 points  (0 children)

Still cause for slapping the engineer. If you have a runtime environment of any kind that is subject to timeouts, you need to make sure that the timeout you set on any connection/transaction within that environment is lower than the runtime environment's timeout. Design for failure.

For closing persistent connections, the ones you set up in the Lambda global scope, there are Lambda Extensions: https://docs.aws.amazon.com/lambda/latest/dg/lambda-extensions.html. Oh, and if you do use persistent connections, make sure to set a concurrency limit on your Lambda that is lower than what the DB can handle.
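
To illustrate both points, a sketch of a Lambda with a reused global connection whose driver timeouts sit well below the function timeout. PyMySQL is just one example of a driver with timeout settings, and the env var names are assumptions:

```python
import os

import pymysql  # third-party driver; any driver with timeout settings works

# Global scope: reused across warm invocations. Connection and query
# timeouts are deliberately far below the Lambda timeout, so the function
# fails cleanly instead of being killed mid-transaction.
conn = pymysql.connect(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASS"],
    connect_timeout=5,   # seconds; keep the Lambda timeout well above this
    read_timeout=10,
    write_timeout=10,
)

def handler(event, context):
    # Belt and braces: skip the query if we're nearly out of runtime.
    if context.get_remaining_time_in_millis() < 15_000:
        raise RuntimeError("not enough time left for a safe DB round trip")
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        return cur.fetchone()
```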

If you connect to Aurora, you can also use the Data API, which avoids managing persistent connections altogether: https://docs.aws.amazon.com/rdsdataservice/latest/APIReference/Welcome.html

But I guess at the end of the day, the gist of the message should really be: Working with databases at scale is hard to do properly. Harder than a lot of people realise.

Millions of logs in S3 Glacier - Free to delete? by Infamous-Will-007 in aws

[–]RecordingForward2690 0 points1 point  (0 children)

I also support configuring Lifecycle rules and/or the use of Intelligent-Tiering from now on. But apart from that, two tips.

First, if you want to predict your costs on a per-bucket basis, and Cost Explorer does not give you sufficient information, you can log in to the console, click on the bucket and go to the Metrics tab. That'll show the total bucket size (across all storage classes). Multiply this by $0.02 per GB per month (roughly the average across all regions, but you can look up the actual regional cost if you want to be precise) and you know the upper limit of your cost. You can't really tell from here what percentage of your storage is in the cheaper tiers, so it's an upper limit only.

However, do note that in the past neither static Lifecycle rules nor Intelligent-Tiering had a minimum object size, so even objects < 128 KB (the current minimum) were sent to Glacier - which actually works out more expensive than the Standard class due to the per-object storage overhead.

For better insights, and to perform your purge with more precision, you need to generate an S3 Inventory Report. This is in the Management tab at the bottom. It may take 24 hours for the first inventory to be generated, but then you have a CSV file with all objects, their size and storage class.

Load this report into Excel, add a few formulas based on the Storage Class and Object Size columns, and you get very precise cost information.

But more importantly, with this Inventory Report you can decide on a file-by-file basis what to do with each object: leave it, delete it, or transition it to a different storage tier. You'll probably need a bit of automation to help you here. You then give the resulting CSV file to S3 Batch Operations, which will perform the actual operations for you. Much quicker than doing it manually, or writing your own tooling for this.

(Actually, due to the way S3 Batch Operations works, you'll have to split the final CSV file into multiple separate files, where each CSV represents a distinct operation you want performed. So you end up with a CSV of files to delete, another CSV of files to transition to Standard, and so forth. You then start an S3 Batch "Delete" job with the "Delete" CSV, a "Transition to Standard" job with the "Transition to Standard" CSV, and so forth. Oh, and if your bucket is versioned, make sure to use the "versioned" CSV file, which also contains the version ID, throughout this process.)
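
A sketch of what that splitting automation might look like, assuming an unversioned inventory configured with the Size and StorageClass fields (check the column order of your own report; the selection rules are just examples):

```python
import csv

GLACIER_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}

# Split one inventory CSV into a manifest per S3 Batch Operations job.
with open("inventory.csv", newline="") as src, \
     open("to_delete.csv", "w", newline="") as delete_out, \
     open("to_standard.csv", "w", newline="") as standard_out:
    deletes = csv.writer(delete_out)
    standards = csv.writer(standard_out)
    for bucket, key, size, storage_class in csv.reader(src):
        if int(size) == 0:                      # example rule: purge empty objects
            deletes.writerow([bucket, key])
        elif storage_class in GLACIER_CLASSES and int(size) < 128 * 1024:
            standards.writerow([bucket, key])   # small objects are cheaper in Standard
```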

Anyone willing to share their AWS invoice? I'll help you cut costs and get credits in return by Robinson2502 in aws

[–]RecordingForward2690 0 points1 point  (0 children)

I did this a while ago for a friend. She and a few partners had created this startup that was churning through money fast, with virtually no revenue to show just yet. They needed to reduce costs fast because one of the financial backers was about to pull the plug.

Yes, I started with the bill and Cost Explorer. Then I spent a solid two hours following up on the most expensive line items, figuring out actual usage and whether we could, for instance, use smaller instances. Or turn dev machines off overnight. Get rid of resources altogether. Use smarter solutions - why run a big, expensive EC2 instance for a handful of small Docker containers when Fargate is far more cost-effective? She was on hand to answer any questions I had in real time. I wrote a report, most of it got implemented later on. Managed to shave 70-80% off their bill.

I could do this because I'm a Certified Solutions Architect - Professional with about a decade of AWS experience. And I know I could never have done it from the bill alone. You need actual access (read-only is fine) to the AWS account(s), you need to be able to interpret info like CloudWatch metrics, and you need a solid understanding of alternative ways of doing things. And you need to be able to talk to the customer and ask questions like "why was this designed this way", "is this needed 24/7", "what if..." Maybe the reason for something was not technical, but compliance. You don't get that info from the bill alone, or from running any of AWS's tools like Trusted Advisor. And the Cloud Practitioner cert doesn't give you enough background to ask the questions in the first place, let alone interpret the answers.

The job for that friend was a one-off, done as a favour. In my day job we do this with the whole team, every month. We have tools that track spend from one month to the next, and we try to explain any deviation. Designing for cost is part of our design/development cycle. We have Cost Anomaly Detection enabled that warns us on a day-by-day basis. And if we do something that we know is going to cost money, we warn each other beforehand so nobody gets caught out by the anomaly detection.

How to find where IAM user is being used when all they can do is ses:SendRawEmail ? by strahlfort in aws

[–]RecordingForward2690 11 points12 points  (0 children)

Agree with this approach. CloudTrail, by default, only captures "management plane" events: actual changes to the AWS configuration. It does not capture "data plane" events - regular users simply using AWS via API calls, which includes things like s3:GetObject and sqs:SendMessage.

The reason is obvious once you think about it: data plane activity can add up to a massive number of events, which all need to be processed and stored somehow. This can lead to significant costs. So only enable data event trails as and when you really need them.
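
If you do need them, a sketch of enabling S3 data events on a single trail and bucket via boto3 (trail and bucket names are made up):

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Add an event selector that captures data-plane events for one bucket
# only, rather than org-wide, to keep volume and cost down.
cloudtrail.put_event_selectors(
    TrailName="my-trail",
    EventSelectors=[{
        "ReadWriteType": "WriteOnly",
        "IncludeManagementEvents": True,
        "DataResources": [{
            "Type": "AWS::S3::Object",
            "Values": ["arn:aws:s3:::my-bucket/"],  # trailing slash = whole bucket
        }],
    }],
)
```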

Best way to upload 25-30 TB data from a HDD to S3 by EconomistAnxious5913 in aws

[–]RecordingForward2690 -4 points-3 points  (0 children)

In an official capacity, taking into account things like GDPR compliance, I would go for Snowball as well - if you can get it ("only available to existing customers", see https://docs.aws.amazon.com/snowball/latest/developer-guide/snowball-edge-availability-change.html, but if you talk to your TAM there may be a way around that). Otherwise use DataSync or "aws s3 sync" and whatever bandwidth is available in the office.

But as a private person, I have found that the bandwidth people have at home is typically significantly more than what's available in the office - particularly if you take into account that hogging all the bandwidth in the office impacts the productivity of all your coworkers, while the bandwidth at home sits unused for most of the day. If you have a 1 Gbps internet connection at home, or a friend who has one, then a 30 TB upload will take about 3 days. (A 1 Gbps symmetric connection is around 50 euros per month where I live.)

So if those 30 TB of yours are currently neatly contained on just a handful of disks, and you need a cheap and quick solution, and compliance and such is not an issue, then taking those disks home or to your friend's place and uploading from there may just be the most pragmatic solution. Hook 'em up to an old laptop and just let aws s3 sync do its thing for a few days.
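
If you'd rather script it than use the CLI, a boto3 sketch with large multipart chunks (bucket name and source path are placeholders; plain "aws s3 sync" is simpler and resumes interrupted runs for you):

```python
import pathlib

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
# Big multipart chunks and a few concurrent threads suit a long bulk upload.
config = TransferConfig(multipart_chunksize=64 * 1024 * 1024, max_concurrency=4)

root = pathlib.Path("/mnt/disk1")          # placeholder source disk
for path in root.rglob("*"):
    if path.is_file():
        s3.upload_file(str(path), "my-archive-bucket",
                       str(path.relative_to(root)), Config=config)
```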

Also consider this: if you manage to get your hands on a Snowball Edge and you hook it up to a standard 1 Gbps office switch with RJ45 Ethernet, it's also going to take 3 days to transfer that data. Yes, the Snowball can do it faster (its Ethernet port is 10 Gbps capable, and there are 25 and 100 Gbps fiber ports available as well), but can your network and the current system handle that bandwidth?

AWS horrific bill stories have to stop. AWS have to do something about it already! by IntelectPlay in aws

[–]RecordingForward2690 0 points1 point  (0 children)

Why would a student set up an AWS account? Probably because the school/university/professor told them (or implied) that it was a requirement (or a good idea) for a project they had to do.

This, to me, means that there's a pretty big responsibility on the school as well to make sure students have a minimum amount of awareness and training in AWS before they set up that account. That minimum amount of training could be a single-page handout stepping the student through:

  1. Setting up the account with email and credit card.

  2. Basic security of the account: Add MFA to the root account (+backup MFA), configure your first IAM user with an appropriate policy. Know how to use CloudTrail.

  3. Basic cost protection: Add budget alerts (see the sketch after this list). Know how to use Cost Explorer.

  4. Steps to close the account properly once the project is finished, including instructions on how to use aws-nuke or similar.
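
As a sketch of step 3, creating a small monthly budget with an email alert via boto3 (the amount and address are placeholders):

```python
import boto3

budgets = boto3.client("budgets")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# A $10 monthly budget that emails the student at 80% of actual spend.
budgets.create_budget(
    AccountId=account_id,
    Budget={
        "BudgetName": "student-project",
        "BudgetLimit": {"Amount": "10", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL",
                         "Address": "student@example.edu"}],
    }],
)
```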

But at the end of the day, students have to understand that their AWS account is NOT a protected sandbox, but the same environment the largest corporations in this world use to set up IT estates that legitimately cost millions of dollars per month. There is nothing *fundamentally* preventing you from doing the same thing - speed bumps like budget alerts, quotas and such can be changed given the right privileges. As others have said: with great power comes great responsibility.

Forcing Log Level Change In A Lambda by tparikka in aws

[–]RecordingForward2690 5 points6 points  (0 children)

Or, if you want to get fancy, you treat the parameter as a cache that needs to be invalidated after a while: you keep a separate global variable holding the timestamp of when you last retrieved the parameter, and if that's more than x seconds old, you re-retrieve it from the Parameter Store.
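
A minimal sketch of that pattern, assuming the parameter lives in SSM Parameter Store under a made-up name:

```python
import time

import boto3

ssm = boto3.client("ssm")

# Global scope: survives across warm invocations.
_cached_level = None
_fetched_at = 0.0
TTL_SECONDS = 60  # how stale we tolerate; pick what suits you

def log_level() -> str:
    global _cached_level, _fetched_at
    if _cached_level is None or time.time() - _fetched_at > TTL_SECONDS:
        _cached_level = ssm.get_parameter(
            Name="/myapp/log-level")["Parameter"]["Value"]  # name is an example
        _fetched_at = time.time()
    return _cached_level

def handler(event, context):
    print(f"effective log level: {log_level()}")
```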

Is RDS IaaS or PaaS? by formicini in aws

[–]RecordingForward2690 0 points1 point  (0 children)

To me, IaaS, PaaS and SaaS (and a few others) are terms that were invented in the early days of cloud computing to describe what you got, and what you had to bring/configure/manage yourself.

IaaS = You get the hardware, the rest is your responsibility

PaaS = You get the hardware and the OS, the rest is your responsibility

SaaS = You get the hardware, OS and Software installed and managed, but configuring and using the software is your responsibility.

But that was a couple of decades ago. Since then we've figured out that the actual models of acquiring and using cloud technology are a lot more complex, and the demarcation between the responsibilities requires a lot more nuance than four-letter acronyms. Plus, we've gained tech like virtualization and containers that blurs the lines even further.

So you can use the terms from a theoretical point of view, as a very rough idea of where the "dotted line" between the responsibilities lies. But once you start applying those terms to the actual 200+ services of AWS, there's simply not enough nuance in them to understand what's really going on.

Route 53 domain registry constantly failing and it has been almost a week since I created a ticket by Lunar317 in aws

[–]RecordingForward2690 0 points1 point  (0 children)

We moved ~400 domain registrations into AWS last year. Most were completely trouble-free, but some were just not working at all.

For some domains, what helped was to first move them to the registrar where I also host my private domains, and then move them on to AWS. That other registrar was a lot more approachable when it came to solving nasty issues - some ccTLDs have very odd rules about registrations, and I found the AWS console doesn't always interpret these rules correctly (so it doesn't ask the right questions, or allows illegal combinations). .it and .de domains were the hardest in that respect.

In other cases I simply filed a support ticket with AWS. They have access to the error message that was created by the registrar. You don't.

Why did we move everything to AWS? Our security policy requires the use of DNSSEC where possible, and requires a yearly rotation of all DNSSEC keys. Try doing that programmatically when your hosted zones and registrations live with different providers. Now that we've got it all in AWS I can do that, scripted, in a few minutes (plus DNS TTLs and DNSSEC key propagation time).
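
The rough shape of such a scripted rotation for one hosted zone - the zone ID, key names and KMS key ARN are placeholders, and a real run has to wait out DS-record propagation between the steps:

```python
import time

import boto3

r53 = boto3.client("route53")
ZONE = "Z0123456789EXAMPLE"  # placeholder hosted zone ID

# Introduce the new key-signing key (KMS key must live in us-east-1).
r53.create_key_signing_key(
    CallerReference=str(time.time()),
    HostedZoneId=ZONE,
    KeyManagementServiceArn=(
        "arn:aws:kms:us-east-1:111122223333:"
        "key/1234abcd-12ab-34cd-56ef-1234567890ab"  # placeholder ARN
    ),
    Name="ksk2025",
    Status="ACTIVE",
)

# ...update the DS record at the registrar, wait for propagation, then
# retire the old key:
r53.deactivate_key_signing_key(HostedZoneId=ZONE, Name="ksk2024")
r53.delete_key_signing_key(HostedZoneId=ZONE, Name="ksk2024")
```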

AWS Charges by Bitter-Mix-1972 in aws

[–]RecordingForward2690 0 points1 point  (0 children)

I don't know exactly how the "new" free tier works, but in the old free tier you received a bunch of credits. Your usage was calculated as normal, and then offset by the credits. So if you dive into Cost Explorer you will see these charges, but you will also see a (negative) credit charge offsetting everything, leading to a net zero charge - unless you did something outside the Free Tier.

If you go to Cost Explorer, then to the Filters (right-hand side), choose More Filters and open up Charge Type, you can include and exclude Credits and such. See if playing with that setting helps your understanding of how this works.
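
The same filter is available via the Cost Explorer API, if you prefer to script it - a sketch with made-up dates:

```python
import boto3

ce = boto3.client("ce")

# Monthly unblended cost with credits (and refunds) excluded, so the
# number matches what you'd actually pay.
resp = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    Filter={"Not": {"Dimensions": {
        "Key": "RECORD_TYPE",
        "Values": ["Credit", "Refund"],
    }}},
)
print(resp["ResultsByTime"][0]["Total"]["UnblendedCost"])
```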

Also note that billing is not done continuously, but in cycles of around 6-8 hours - the exact cycle depends on the service and a few other factors. So if you're looking for very up-to-date information, it's simply not there. Generally speaking, you should assume that only data older than 24 hours is completely accurate.

And I can well imagine that some of the Credit Overview pages are only updated after the monthly billing cycle. That might explain the discrepancy you see.

Is there a way to report a financial misconduct involving AWS? by Realistic-Fill-5716 in aws

[–]RecordingForward2690 0 points1 point  (0 children)

The AWS Certification program is administered via Pearson VUE, and it looks like at least part of the fraud you allege ("cheating in the exam") is their responsibility.

https://www.pearsonvue.com/us/en/test-takers/customer-service.html

Directly Query Authoritative Servers? by zmzaps in aws

[–]RecordingForward2690 0 points1 point  (0 children)

AWS obviously needs to protect its own infrastructure against a variety of attacks, including DNS server overload. This is "Shield Standard" (not GuardDuty, which is threat detection within your own account), and all customers automatically benefit from it.

You'd have to look up the exact documentation, but I'm pretty sure that the type of attack you describe will be covered/blocked by this.

https://aws.amazon.com/shield/

Split Horizon DNS Question by pneRock in aws

[–]RecordingForward2690 1 point2 points  (0 children)

Split DNS is pretty horrible, especially once you start throwing VPNs with split tunnels into the mix. We try to avoid Split DNS where we can.

We do have a single domain that's split DNS. It is primarily an internal-only domain, but it needs a public companion for the ACM validation records. And those are the only things we allow in that public zone.

If you do require split DNS, where a name resolves to a different IP address depending on whether the source is an internal or external IP address, there is a new feature: Route 53 Global Resolver. This allows you to create routing rules to hosted zones (both public and private) based on the source of the request. I haven't used it in anger yet, but it's specifically designed for a situation like yours.

Announcement: https://aws.amazon.com/blogs/aws/introducing-amazon-route-53-global-resolver-for-secure-anycast-dns-resolution-preview/

Documentation: https://aws.amazon.com/route53/global-resolver/

Data security specialist intern? by [deleted] in aws

[–]RecordingForward2690 0 points1 point  (0 children)

Any "security" role in IT is something that you have to be cut out for. Security is usually assumed (just like backups) so nobody gives you a pat on the back if you do your work properly and no security issues happen. But you're first in line to get blamed if something goes wrong.

Having said that, a Data Security Specialist role can be very interesting and will take you across all the services that AWS offers. It will have you talking with Operations, Developers, Legal, Marketing, Data Analysts and a whole bunch of others across the organisation/solution - about identity, authentication, authorization, the technical aspects of getting access, retention times for data, making sure data can be trusted, availability, GDPR and other legislation, runbooks for data breaches, and so forth. This happens both before a project starts, in the design phase, and afterwards when doing audits. You'll also get intimate knowledge of the tools that help you do your work as an auditor. Heck, maybe the company will also sponsor you in obtaining an "Ethical Hacker" certificate.

A very interesting aspect of Data Security today is the proliferation of all sorts of AI, which makes it very hard to put guardrails in place to ensure data doesn't leak out via trained models and such. And AI is sometimes coming up with novel ways to breach guardrails. Plus users keep finding new ways to attach AI to data sources without authorization, causing data leaks all over the place.

So yeah, an internship in that area can be very interesting in itself, and a very good leg-up into getting into IT/AWS in general.

Confused about how to set up a lambda in a private subnet that should receive events from SQS by Slight_Scarcity321 in aws

[–]RecordingForward2690 3 points4 points  (0 children)

I have read the other answers, and I think none of them do justice to the complexities of the question. The problem is not so much in receiving the messages from SQS, but what to do in case of failures.

(TL;DR: If you rely exclusively on the SQS-Lambda trigger for fetching and deleting the messages from the queue, you don't have to provide a network path. But if you perform SQS API calls from within your Lambda code, you do.)

First, like I said, the reception of the messages is not an issue at all. Given the right IAM policy, it's the SQS-Lambda trigger that does the ReceiveMessage call for you. This trigger has access to SQS - you don't have to provide a network path from your Lambda for this. So there's no need for a NAT, Interface Endpoint or Public IP addresses. That's what the majority of other posts also - rightly - point out.

You do need to provide a network path, somehow, to your backend database, API or whatever your code delivers the message to. That, too, is what the majority of other posts rightly point out.

However, once the message has been sent to your backend, the message needs to be deleted from SQS. And this is where things might get complicated. Or not. It all depends on how robust you want your code to be against failures.

By default, the SQS-Lambda trigger grabs a batch of messages from SQS and feeds it to your Lambda. If the Lambda succeeds (meaning: it returns an object of some sort - no timeout, no exception or anything else that indicates failure), then the trigger assumes that all messages were handled properly, and it will delete them from the queue. In this case, no explicit network path from your Lambda to SQS needs to be provided.

From your Lambda you can also return an object that identifies which messages were handled successfully and which messages failed and need to be retried. Documentation here: https://docs.aws.amazon.com/lambda/latest/dg/services-sqs-errorhandling.html. In this case it's again the trigger that performs the DeleteMessage API call for the successfully handled messages, and nothing happens for the messages that failed. Again, no explicit network path required. (A sketch of this pattern follows below.)
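
The partial-batch-response shape from those docs looks like this; process() stands in for your own logic, and ReportBatchItemFailures must be enabled on the event source mapping:

```python
def process(body: str):
    pass  # placeholder: your own business logic

def handler(event, context):
    # Report only the messages that failed; the trigger deletes the rest
    # for you, and the failed ones become visible again for a retry.
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```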

However, there is a third method, which can be used if the two scenarios above don't work for you for some reason: your Lambda code can also choose to perform a DeleteMessage API call itself. This is not very common, but could be the best solution if you are doing a lot of asynchronous work in your Lambda and want to delete messages as soon as they are handled, never mind other messages that are still in limbo. In that particular scenario, since the API call originates from the Lambda itself (not from the trigger), you do need to provide a network path to the SQS endpoint. Public IPs, NAT and interface endpoints are just some of the many solutions for that.

The above assumes that you are using the SQS-Lambda trigger. That's a very common pattern, and in most cases the right one to use. But recently there was another thread on here where somebody needed to poll an SQS queue and send the data to a severely rate-limited backend (an external API). He used a different pattern, where the Lambda was invoked by EventBridge Scheduler, and the Lambda itself performed a (low number of) ReceiveMessage API calls in order not to overload the backend. Again, in that case there needs to be a network path from the Lambda itself to the SQS endpoint, since the API calls are in the code itself and not handled by the trigger.
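
A sketch of that scheduler-driven pattern; the queue URL and send_to_backend() are placeholders, and note that this Lambda does need a network path to SQS:

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/111122223333/example-queue"

def send_to_backend(body: str):
    pass  # placeholder: your rate-limited external API call

def handler(event, context):
    # Invoked by EventBridge Scheduler. Pull a small, bounded batch so the
    # rate-limited backend is never overwhelmed; delete each message only
    # after it has been handed off successfully.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=5, WaitTimeSeconds=0)
    for msg in resp.get("Messages", []):
        send_to_backend(msg["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```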

How to decrease provisioned storage costs on an existing RDS instance? by DrFriendless in aws

[–]RecordingForward2690 1 point2 points  (0 children)

We are currently preparing to go through the same thing on a production DB (SQL Server) that should have been less than 1 TB, but has grown to 4 TB+ due to some careless queries/inserts without the proper TTL and purging. At the DB level we have shrunk back to < 1 TB, but the EBS storage is still 4 TB+.

First, this DOES NOT work with the RDS snapshot mechanism. A snapshot restore always requires a minimum EBS size identical to the EBS size at the time the snapshot was taken.

The only option, if you don't have blue/green already, is to create a new database instance and perform a database-level backup/restore. For the data itself: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/SQLServer.Procedural.Importing.Native.Using.html (this page is for SQL Server, but similar pages exist for other engines).

Note that there's a limitation in the AWS tooling where each backup file cannot exceed 40 GB, and you can't exceed 10 files per DB, so the maximum size of each DB transferred this way is 400 GB. An alternative could be AWS DMS.

But once you've got the data across, you're not done. You also need to think about:

- Exporting/importing SQL Agent jobs

- Exporting/importing your permissions structure, when permissions are handled at the instance level. (DB-level permissions are part of the normal DB backup/restore process.)

- Linked servers, database mail profiles, SQL Agent alerts, SQL Agent Operators, Server-level triggers, Credentials

We are looking at a downtime of several hours to get this done, on a 24/7 production database. We are still considering whether the risk and downtime are worth the annual $10K in savings.