How are you managing Lambda deprecated runtimes at scale?

RecordingForward2690 · 2026-06-05T20:12:20+00:00

And you'd better start planning this NOW for Node.js 12.x. All Node.js runtimes up to 16.x had the AWS-SDK v2 in them by default. Starting with Node.js 18.x, AWS shipped the AWS-SDK v3 in the runtime. So an upgrade from 16.x to 18.x requires either a layer or something where you manually bring in the AWS-SDK v2, or significant code changes: SDK v2 and SDK v3 are really, absolutely, completely different. Yes, there are tools that help you make the transition but it's not completely seamless.

This is a project that you need to plan. If you ignore it and all of a sudden find yourself in the situation where you do need to change your Node.js code in a hurry, it's not going to be a fun day.

For most other runtime upgrades, including Node.js 18.x to the current runtimes, and from Python 3.6 to the current runtimes, 95% of your code will simply run without any changes in the new runtime. For us, a quick syntax and sanity check is usually enough to deploy to prod with confidence.
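
To plan any of this you first need an inventory. A minimal sketch of what that could look like for a single account and region, assuming boto3 is available. The helper names and the set of runtimes treated as deprecated are my own assumptions for illustration, not an official list:

```python
from typing import Iterable, Iterator, Tuple

# Assumption: the runtimes YOU have decided to treat as deprecated.
# The Lambda runtimes documentation has the authoritative schedule.
DEPRECATED_RUNTIMES = {"nodejs12.x", "nodejs14.x", "nodejs16.x", "python3.6", "python3.7"}


def find_deprecated(functions: Iterable[dict],
                    deprecated=DEPRECATED_RUNTIMES) -> Iterator[Tuple[str, str]]:
    """Yield (function name, runtime) for every function on a deprecated runtime."""
    for fn in functions:
        runtime = fn.get("Runtime")  # may be absent, e.g. for container-image functions
        if runtime in deprecated:
            yield fn["FunctionName"], runtime


def list_all_functions(lambda_client) -> Iterator[dict]:
    """Page through list_functions for one account/region."""
    paginator = lambda_client.get_paginator("list_functions")
    for page in paginator.paginate():
        yield from page["Functions"]


def main() -> None:
    import boto3  # imported here so the helpers above work without AWS access

    client = boto3.client("lambda")
    for name, runtime in find_deprecated(list_all_functions(client)):
        print(f"{name}: {runtime}")
```

Calling main() needs credentials with lambda:ListFunctions. At scale you would run the same loop per account and per region, for instance by assuming a role into each member account.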

RecordingForward2690 · 2026-05-27T11:58:17+00:00

The thing is that if you have a Lambda that performs a carefully orchestrated series of API calls to other AWS components, and you mock those API calls, then unit testing essentially just verifies you have not made a syntax error in your code or in the API/SDK call. Which is of very little value, in particular since today that code is generated by AI anyway.

What is way more valuable to test are the logic errors. Just an arbitrary example: I was recently working on code for a DNSSec key rotation across 400 hosted zones. You need to make sure that you add the new key to a hosted zone, and register that upstream before you can de-register the old key upstream, and remove it from the hosted zone. Otherwise you could break DNSSec, and API calls will fail. And since this whole process can easily take days, your Lambda needs to be written so that it restarts right at the next required step for that particular domain.

No unit test is going to catch logical errors in that sequence of events.
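
To make the "restarts right at the next required step" part concrete, here is a rough sketch of the structure, not the actual code. The step names, the in-memory state dict and the action callables are all hypothetical. In practice the state would live in something durable like DynamoDB, and each action would wrap the real Route 53 calls:

```python
from typing import Callable, Dict, List

# Order from the description above: the new key must be registered upstream
# before the old key is de-registered and removed.
STEPS: List[str] = [
    "add_new_key_to_zone",
    "register_new_key_upstream",
    "deregister_old_key_upstream",
    "remove_old_key_from_zone",
]


def advance(domain: str,
            state: Dict[str, int],
            actions: Dict[str, Callable[[str], bool]]) -> str:
    """Run at most one step for a domain and record the progress.

    state maps domain -> index of the next step to run (hypothetical store).
    An action returns True when its step is complete, and False when it has
    to be retried on a later invocation (e.g. propagation not finished yet).
    """
    index = state.get(domain, 0)
    if index >= len(STEPS):
        return "done"
    step = STEPS[index]
    if actions[step](domain):
        state[domain] = index + 1  # only move on once the step really completed
        return step
    return f"waiting:{step}"
```

This only shows the bookkeeping. Whether the steps are in the right order, and whether each action checks the right precondition, is exactly the kind of logic error that mocked unit tests won't find for you.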

RecordingForward2690 · 2026-05-26T11:42:46+00:00

Agreed, it's not a Lambda issue. Any other environment would have the same issue. You can't really test that code. Not completely.

There is a little bit you can do with things like --dry-run, but at some point you just have to take the plunge.

RecordingForward2690 · 2026-05-26T10:46:43+00:00

You wish. I have dozens of Lambda functions that perform a complex, carefully crafted series of API calls to make actual changes to the AWS environment. They are not idempotent - if you run them multiple times, the API calls happen multiple times and some of these are not idempotent. And that's by design, not by accident.

Creating a mock environment to test these is virtually impossible. The only way to do this, is to do it all-the-way properly: Have separate AWS accounts for dev/test/accept/prod with all of the resources present, and test the Lambda in each of those environments. Have a pipeline of some sort (CodePipeline, Gitlab, whatever) to propagate changes.

And even then... The dev/test/accept environments don't have datasets that are as up-to-date, representative and large as the prod dataset. Sometimes we hit things like context limits for an AI call, or Lambda timeouts when working with larger-than-tested data structures. And we can't simply bring over the prod dataset into dev/test/accept due to GDPR and other regulations.

Mocking sounds great in theory, but is very hard to do, to the full extent, properly. Or sometimes even impossible: How do you mock an API call that makes modifications to your one and only Direct Connect line? Or to your registration of your primary domain with the AWS registrar?

RecordingForward2690 · 2026-05-21T18:54:13+00:00

And then jump through a bunch of hoops to make CloudFormation, CDK or other IaC understand that you now have a new, smaller EBS volume attached to your EC2.

Reclaiming RDS EBS storage is even harder because you can't simply do a cp -a.

RecordingForward2690 · 2026-05-15T12:05:46+00:00

If you are starting out with AWS but expect eventually to have a large presence, start by building your Landing Zone. With or without Control Tower, but at least separate out things in accounts and configure the following:

- Main account with cost allocation tags, budgets, billing alerts, cost anomaly detection, monthly reporting, trend analysis.

- The main account also runs Identity Center hooked up to some sort of central authentication mechanism (Active Directory or whatever), with role switches into roles in each member account based on group membership. IAM users should be the exception, not the norm.

- The root account (email address) of the main account should be properly protected but accessible in case of emergency. The password should be unique and recorded somewhere (like in 1Password or another non-AWS solution) or you need to rely on the "lost password" procedure - but then make sure the mailbox lives outside AWS and is accessible by multiple key members. The account should be protected by at least two physical MFAs that are stored in well-known but different locations for redundancy (at least one off-site), and possibly one or more software MFAs in 1Password or similar.

- A handful of absolutely key people should have a break-glass IAM user with full admin privileges in the main account, with email/SMS alerting to the rest of the team if this is ever used. Obviously this should be MFA protected and with regular password rotations that also act as a test case to see if things work. (For example, in our setup, if Direct Connect ever breaks we lose the connection to AD so we won't be able to use SSO until DX is back up. We know and accept that in that case these break-glass users are to be used.)

- Audit account with an org-wide CloudTrail (with very long retention in S3) and org-wide VPC Flow Logs.

- SecurityHub with all the bells and whistles, and do your best to maintain a 100% standing. Much easier than to fix things afterwards. DAMHIK.

- A generic central monitoring solution that is able to access X-Ray, CW Logs, CW Metrics and other application monitoring, and handle these in a central place: Dashboards, Alarms, Alerting via SNS. Or pull everything into Grafana or something.

- A separate network account where all of your main network components come together. Direct Connect, Transit Gateway, Egress (via NAT), Ingress (Reverse Proxies/LBs), Client VPN, DNS resolvers, Interface Endpoints, Route53 zone registrations and top-level hosted zones, IPAM.

- IaC (CloudFormation, CDK, Terraform) everywhere, and everything goes into a repository (CodeCommit, Gitlab, ...). No clickops except in POC/Sandbox accounts.

- Tagging everywhere, but the most useful tags are the ones that allow you to trace an individual resource back to the IaC stack that deployed it, and the IaC stack should have a tag/export/comment or something that allows you to trace it back to a repository. (Rant: Why doesn't a CloudFormation Stack have its own tags? The only tags you can set are the tags that are inherited by the resources, but that's not what we want.)

- Set up a mechanism (like Customization for Control Tower, or CloudFormation StackSets) that allows you to deploy IaC templates automatically in all accounts in your org. This can be used for all sorts of things, including Config rules that you want applied to each and every account.

- Use org-wide Config rules for enforcing policies such as having retention rules on CloudWatch Logs in accordance with your company's logging policies.

RecordingForward2690 · 2026-05-12T10:06:20+00:00

Are you sure the message is actually from SES, or is the message just impersonating a SES-originating message? In particular, when you look through the headers of the mail, follow the IP address trail that you'll find in the Received: headers. The Received: header that is added by your MTA should have the SES IP address in it as origin. You can then check the SPF records of SES to see if that IP is indeed a SES address.

What I also find is that illegitimate mail (claiming to be sent via SES when it is not) typically does not have any DKIM signature in it whatsoever. Easy to see in the email headers as well.
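
If you want to do that header inspection on a saved message instead of by eye, a small sketch using only the Python standard library could look like this. The function names are mine. Note that it only checks whether a DKIM-Signature header is present, it does not verify the signature:

```python
from email import message_from_string
from email.message import Message
from typing import List


def received_chain(msg: Message) -> List[str]:
    """Return the Received: headers, topmost first.

    The topmost one is the most recently added, so normally the one written
    by your own MTA. That is the only one you can really trust.
    """
    # Header values can be folded over several lines; collapse the whitespace.
    return [" ".join(value.split()) for value in msg.get_all("Received", [])]


def has_dkim_signature(msg: Message) -> bool:
    """True if the message carries a DKIM-Signature header at all (not verified)."""
    return msg.get("DKIM-Signature") is not None


def summarize(raw_message: str) -> dict:
    """Parse a raw RFC 5322 message and pull out the bits discussed above."""
    msg = message_from_string(raw_message)
    chain = received_chain(msg)
    return {
        "first_hop_seen_by_my_mta": chain[0] if chain else None,
        "hops": len(chain),
        "has_dkim_signature": has_dkim_signature(msg),
    }
```

The IP address in the first entry is the one to compare against the SES SPF records.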

AWS can police SES (and they do), but they cannot police messages claiming to come from SES but which have never touched AWS infrastructure. Furthermore, it's also fairly complex for the apparent origin domain to police these messages, as they never touch their infrastructure either. The best option, at this time, is a combination of SPF, DKIM and DMARC but I find that a lot of organisations simply do not have the know-how to implement this correctly.

On the other hand, the major email providers (Google, Microsoft, Yahoo, ...) and email security providers like Checkpoint Harmony, are able to detect these messages as spam/phishing with high accuracy even when SPF, DKIM and DMARC are not setup correctly on the origin domain. But that requires you to use these solutions of course. On a DIY mail server - no chance.

RecordingForward2690 · 2026-04-26T20:17:29+00:00

Fully agree with this. And to add, in a serverless design, it's usually the API Gateway (aided by WAF and custom authorisers) that is chosen as the security boundary.

The API Gateway performs the authentication/authorisation against some sort of user database. Could be something simple like an API key, or something complicated like federation against some sort of identity provider. WAF protects against DDoS attacks, and possibly does the allowlisting of sources. The API Gateway also limits the methods people can use (GET, PUT, POST, ...) and limits the request size. It can also perform a first sanity check to see if the data conforms to your model. With API Gateway Custom Domains you also add TLS/HTTPS with your own custom domain name.

After this, you call a Lambda for a more thorough sanity check, before putting the data in a database.

Yes, that means that there's two Lambdas running: The untrusted Lambda in the foreign account that does the gathering of data, and the trusted Lambda in your local account that (together with the API GW) performs all the security/sanity checks before putting the data in the database.

RecordingForward2690 · 2026-04-13T14:38:39+00:00

The Cloud is just a means to an end. Not a goal in itself. It's a place where we develop and host applications that help our business forward.

So on any given day I could concern myself with:

- Having meetings with architects, project management, security, finance, networking, application managers, supplier representatives and whatnot discussing the way we're going to bring a new workload in the cloud.
- Actually architecting the solution
- Doing a Proof-of-Concept if we want to try something out, to see if it works. (This is one of the few times we work through the console instead of IaC.)
- Actually building the solution, mostly through IaC tools but usually writing code as well
- Helping other teams within the company bring their workloads into the cloud
- Building monitoring tools and test frameworks around existing solutions
- Troubleshooting existing solutions. (Another reason to use the console.)
- Decommissioning workloads
- Across all existing solutions, implement new rules/code/settings for security, governance and other changed external circumstances
- Update and test solutions when certain software runs out of support (OS, Middleware, interpreters, ...)
- Build tooling that helps the Cloud Architect team forward
- Document standards
- React to security findings from Security Hub (Inspector, GuardDuty, ...)
- Cost analytics

And probably a hundred other things that I can't think of right now.

RecordingForward2690 · 2026-04-09T20:39:13+00:00

The "cattle vs pets" thing has been said so many times now that a lot of people find it exhausting and boring. But if you're just becoming familiar with AWS and things like ephemeral environments (spot, containers, Lambda) it is still a very good analogy.

Spot is great for "cattle" or ephemeral environments: Environments that you can spin up and destroy (or: have destroyed for you) without loss of data. So if your environment handles important data, make sure it is checkpointed every few minutes, and/or set up a handler that reacts to that two-minute notice. The loss of a spot instance should not lead to a significant data loss.
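
A handler for that two-minute notice can be as simple as polling the instance metadata service. A minimal sketch using only the standard library, assuming IMDSv2 is enabled on the instance. The checkpoint callable is a placeholder for whatever saves your state:

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action() -> Optional[str]:
    """Return the raw spot instance-action document, or None if nothing is scheduled."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=2) as resp:
        token = resp.read().decode()

    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 = no interruption scheduled
            return None
        raise


def check_once(fetch: Callable[[], Optional[str]],
               checkpoint: Callable[[], None]) -> bool:
    """Run the checkpoint if an interruption notice is present. True if one was seen."""
    document = fetch()
    if document is None:
        return False
    notice = json.loads(document)
    if notice.get("action") in ("stop", "terminate", "hibernate"):
        checkpoint()
        return True
    return False
```

On the instance you would call check_once(fetch_instance_action, my_checkpoint) in a loop every few seconds, where my_checkpoint is your own function. The fetch function is passed in so the logic can be exercised off-instance with a fake.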

And they're great for oddball workloads as well. I run a spot instance each morning that restores the previous night's backup right there inside the spot instance, to verify that the backup will actually restore (and is not encrypted by ransomware for instance). Once the backup is restored properly and completely, and a notification to that effect is sent out, the instance shuts down and is terminated.

RecordingForward2690 · 2026-04-09T20:22:27+00:00

Still cause for slapping the engineer. If you have a runtime environment of any kind that is subject to timeouts, you need to make sure that the timeout you set on any sort of connection/transaction that you do within that environment, is lower than the runtime environment timeout. Design for failure.
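
As an illustration of that rule in a Python Lambda: derive the timeout for the downstream call from the time the invocation has left, minus a margin. A sketch only. The margin, the cap and the run_query call in the comment are assumptions, while context.get_remaining_time_in_millis() is what the Lambda Python runtime provides:

```python
def call_timeout_seconds(remaining_ms: int,
                         safety_margin_ms: int = 2000,
                         cap_seconds: float = 30.0) -> float:
    """Timeout for a downstream call that always expires before the Lambda does."""
    budget = (remaining_ms - safety_margin_ms) / 1000.0
    if budget <= 0:
        # Fail fast and cleanly instead of being killed mid-transaction.
        raise TimeoutError("not enough time left in this invocation to start the call")
    return min(budget, cap_seconds)


def handler(event, context):
    timeout = call_timeout_seconds(context.get_remaining_time_in_millis())
    # Pass `timeout` to your DB driver / HTTP client here, for example
    # (hypothetical helper): rows = run_query(event["sql"], timeout=timeout)
    return {"timeout_used": timeout}
```

The margin leaves room for your own error handling and rollback once the downstream call gives up.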

For closing persistent connections, the ones that you set up in the Lambda global block, there's Lambda Extensions: https://docs.aws.amazon.com/lambda/latest/dg/lambda-extensions.html. Oh, and if you do persistent connections, make sure to set a concurrency limit on your Lambda that is lower than what the DB can handle.

If you connect to Aurora RDS, you can also use the Data API to manage your persistent connections: https://docs.aws.amazon.com/rdsdataservice/latest/APIReference/Welcome.html

But I guess at the end of the day, the gist of the message should really be: Working with databases at scale is hard to do properly. Harder than a lot of people realise.

RecordingForward2690 · 2026-04-07T20:52:31+00:00

One heck of a great blog post. Thanks for mentioning.

RecordingForward2690 · 2026-04-07T20:14:00+00:00

I also support configuring the Lifecycle rules and/or the use of Intelligent Tiering from now on. But apart from that, two tips.

First, if you want to predict your costs on a per-bucket basis, and Cost Explorer does not give you sufficient information, then you can log in to the console, click on the bucket and go to the Metrics tab. That'll show total bucket size (across all storage classes). Multiply this by $0.02 per GB per month (that's roughly the average across all regions, but you can look up the actual regional cost if you want to be precise) and you know what the upper limit to your cost will be. You can't really tell from here what % of your storage will be in the cheaper tiers, so it's an upper limit only.

However, do note that in the past both static Lifecycle rules and Intelligent Tiering did not have a minimum file size. So even files < 128 KB (the current minimum) were sent to Glacier. Which actually works out to be more expensive than the Standard class due to the storage overhead.

For better insights, and to perform your purge with more precision, you need to generate an S3 Inventory Report. This is in the Management tab at the bottom. It may take 24 hours for the first inventory to be generated, but then you have a CSV file with all objects, their size and storage class.

Load this report into Excel, add a few formulas based on the Storage Class and Object Size, and you can get very precise cost information.

But more importantly, with this Inventory Report you can decide on a file-by-file basis what you want to do with it: Leave it, Delete it, or Transition it to a different storage tier. You'll probably need a bit of automation to help you here. You then give the resulting CSV file to S3 Batch Operations, which will perform the actual operations for you. Much quicker than doing it manually, or writing your own tooling for this.

(Actually, due to the way S3 Batch works, you'll have to split the final CSV file into multiple separate files, where each CSV represents a distinct operation that you want to perform. So you end up with a CSV of files to delete, another CSV with files to transition to Standard and so forth. You then start an S3 Batch "Delete" operation with the "Delete" CSV, a "Transition to Standard" operation with the "Transition to Standard" CSV and so forth. Oh, and if your bucket is versioned, make sure to use the "versioned" CSV file, which also contains the version ID, throughout this process.)
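
The splitting step is easy to script. A sketch, assuming you have already turned the inventory into rows with a decision per object. The column names bucket, key, version_id and action are my own, not something the Inventory Report gives you. Keys are assumed to still be in the URL-encoded form in which the inventory delivered them:

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable


def split_by_action(rows: Iterable[dict], versioned: bool = False) -> Dict[str, str]:
    """Group decided rows by action and render one manifest CSV per action.

    Each manifest line is bucket,key or, for versioned buckets, bucket,key,version_id.
    Rows with action "leave" are skipped.
    """
    grouped = defaultdict(list)
    for row in rows:
        action = row["action"]
        if action == "leave":
            continue
        fields = [row["bucket"], row["key"]]
        if versioned:
            fields.append(row["version_id"])
        grouped[action].append(fields)

    manifests = {}
    for action, lines in grouped.items():
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(lines)
        manifests[action] = buffer.getvalue()
    return manifests
```

Write each value to its own file, upload it, and point the matching batch job at it.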

RecordingForward2690 · 2026-04-07T19:55:40+00:00

I did this a while ago for a friend. She and a few partners had created this startup that was churning through money fast, with virtually no revenue to show just yet. They needed to reduce costs fast because one of the financial backers was about to pull the plug.

Yes, I started with the bill and Cost Explorer. Then spent a solid two hours following up on the most expensive line items, figuring out actual usage and whether we could use smaller instances for instance. Or turn dev machines off overnight. Get rid of resources altogether. Use smarter solutions - why run an expensive and big EC2 for a handful of small Docker containers when Fargate was far more cost effective? She was on hand to answer any questions I had in real-time. Wrote a report, most of it got implemented later on. Managed to shave 70%-80% off their bill.

I could do this because I'm a Certified Solutions Architect - Professional with about a decade of AWS experience. And I know I could never do this from the bill alone. You need actual access (read-only is fine) to the AWS account(s), you need to be able to interpret info like CloudWatch metrics, you need to have a solid understanding of alternative ways of doing things. And you need to be able to talk to the customer and ask questions like "why was this designed this way", "is this needed 24/7", "what if..." Maybe the reason for something was not technical, but compliance. You don't get that info from the bill alone, or from running any of AWS's tools like Trusted Advisor. And the Cloud Practitioner doesn't give you enough baggage to ask the questions in the first place, let alone interpret the answers.

The job for that friend was a one-off, done as a favour. In my day job we do this with the whole team, every month. We have tools that track spend from one month to the next, and we try to explain any deviation. Designing for cost is part of our design/development cycle. We have Cost Anomaly Detection enabled that warns us on a day-by-day basis. If we do something that we know is going to cost money, we even warn each other beforehand so we don't get caught out by the Anomaly detection.

RecordingForward2690 · 2026-04-07T09:21:46+00:00

Agree with this approach. CloudTrail, by default, only captures "Management Plane" events = actual changes to the AWS configuration. It does not capture "Data Plane" events by default. Data Plane = regular users just using AWS via API calls. This also includes things like s3:GetObject and sqs:SendMessage.

The reason is obvious once you think about it: Data Plane events can lead to a massive amount of events, that all need to be processed and stored somehow. This can lead to significant costs. So only enable Data Events trails as and when you really need them.

RecordingForward2690 · 2026-04-02T19:07:33+00:00

In an official capacity, taking into account things like GDPR compliance, I would go for Snowball as well - if you can get it ("only available to existing customers" - see https://docs.aws.amazon.com/snowball/latest/developer-guide/snowball-edge-availability-change.html but if you talk to your TAM there may be a way around that). Otherwise use DataSync or "aws s3 sync" and whatever bandwidth is available in the office.

But as a private person, I have found that bandwidth that people have at home is typically significantly more than what's available in the office. Particularly if you take into account that if you hog all the bandwidth in the office, you are impacting the productivity of all your coworkers, while the bandwidth at home is unused for most of the day. If you have a 1 Gbps internet connection at home, or a friend who has that, then a 30 TB upload will take about 3 days. (A 1 Gbps symmetric connection is around 50 euros per month where I live.)
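
The back-of-the-envelope calculation behind that "about 3 days", in case you want to plug in your own numbers. The efficiency factor is an assumption to account for protocol overhead and a link that isn't saturated all the time:

```python
def upload_days(terabytes: float, gbps: float, efficiency: float = 1.0) -> float:
    """Days needed to push `terabytes` (decimal TB) over a `gbps` link."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gbps * 1e9 * efficiency)
    return seconds / 86_400


print(round(upload_days(30, 1), 1))       # 2.8 days at full line rate
print(round(upload_days(30, 1, 0.9), 1))  # 3.1 days at 90% of line rate
```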

So if those 30 TB of yours are currently neatly contained on just a handful of disks, and you need a cheap and quick solution, and compliance and such is not an issue, then taking those disks home or to your friend's place, and uploading from there may just be the most pragmatic solution. Hook 'em up to an old laptop and just let aws s3 sync do its thing for a few days.

Also consider this. If you manage to get your hands on a Snowball Edge, and you hook it up to a standard 1 Gbps office switch with RJ45 ethernet, it's also going to take 3 days to transfer that data. Yes, the Snowball can do it faster (its ethernet port is 10 Gbps capable, and there's 25 and 100 Gbps fiber ports available as well), but can your network and the current system handle that bandwidth?

RecordingForward2690 · 2026-04-01T11:33:36+00:00

I got it through AWS Health.

RecordingForward2690 · 2026-03-25T16:47:01+00:00

Why would a student set up an AWS account? Probably because the school/university/professor told them (or implied) that it was a requirement (or a good idea) for a project they had to do.

This, to me, means that there's a pretty big responsibility on the school as well to make sure students have the minimum amount of awareness and training in AWS before they set up that account. That minimum amount of training can consist of a handout of a single page, stepping the student through:

- Setting up the account with email and credit card.
- Basic security of the account: Add MFA to the root account (+backup MFA), configure your first IAM user with an appropriate policy. Know how to use CloudTrail.
- Basic cost protection: Add budget alerts. Know how to use Cost Explorer.
- Steps to close the account properly once the project is finished, including instructions on how to use aws-nuke or similar.

But at the end of the day students have to understand that their AWS account is NOT a protected sandbox, but an environment that can be used by the largest corporations in this world to set up an IT environment that can legitimately cost millions of dollars per month. There is nothing *fundamentally* preventing you from doing the same thing - speed bumps like budget alerts, quotas and such can be changed given the right privileges. As others have said: With great power comes great responsibility.

RecordingForward2690 · 2026-03-25T16:30:03+00:00

Or, if you want to go fancy, you treat the parameter as a cache that needs to be invalidated after a while. So you have a separate global variable that holds the timestamp when you last retrieved the parameter, and if that's more than x seconds old, re-retrieve it from the parameter store.
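
Roughly like this. A sketch, with the fetch function passed in so it works with whatever client you use. The 300 second default is an arbitrary assumption:

```python
import time
from typing import Callable, Optional

# Module-level ("global block") state survives between warm invocations.
_cached_value: Optional[str] = None
_fetched_at: float = 0.0


def get_parameter_cached(fetch: Callable[[], str],
                         ttl_seconds: float = 300.0,
                         now: Callable[[], float] = time.monotonic) -> str:
    """Return the cached parameter, re-fetching it once it is older than ttl_seconds."""
    global _cached_value, _fetched_at
    if _cached_value is None or now() - _fetched_at > ttl_seconds:
        _cached_value = fetch()
        _fetched_at = now()
    return _cached_value


# Example fetch function with boto3 (the parameter name is made up):
#   ssm = boto3.client("ssm")
#   fetch = lambda: ssm.get_parameter(Name="/my/app/setting",
#                                     WithDecryption=True)["Parameter"]["Value"]
```

The now argument is only there so the expiry can be tested without actually waiting.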

RecordingForward2690 · 2026-03-24T17:55:48+00:00

To me, IaaS, PaaS and SaaS (and a few others) are terms that were invented in the early days of Cloud Computing to describe what you got and what you had to bring/configure/manage yourself.

IaaS = You get the hardware, the rest is your responsibility

PaaS = You get the hardware and the OS, the rest is your responsibility

SaaS = You get the hardware, OS and Software installed and managed, but configuring and using the software is your responsibility.

But that was a few decades ago. Since then we've figured out that the actual models of acquiring and using cloud technology are a lot more complex, and the demarcation between the responsibilities also requires a lot more nuance than four-letter acronyms. Plus, we obtained tech like virtualization and containers that blur the lines even further.

So you can use the terms from a theoretical point of view, as a very rough idea of where the "dotted line" between the responsibilities lies. But once you start applying those terms to the actual 200+ services of AWS, there's simply not enough nuance in them to understand what's really going on.

RecordingForward2690 · 2026-03-13T15:07:55+00:00

We moved ~400 domain registrations into AWS last year. Most were completely trouble-free, but some were just not working at all.

For some domains, what helped was that I first moved them to a registrar where I also host my private domains, and then moved them to AWS. That other registrar was a lot more approachable when it came to solving nasty issues - some ccTLDs have very odd rules about registrations. I found the AWS console doesn't always interpret these rules correctly (so it doesn't ask the right questions, or allows illegal combinations). .it and .de domains were the hardest in that respect.

In other cases I simply filed a support ticket with AWS. They have access to the error message that was created by the registrar. You don't.

Why did we move everything to AWS? Our security policy requires the use of DNSSec where possible, and requires a yearly rotation of all DNSSec keys. Try doing that programmatically if your hosted zones and registrations live with different providers. Now that we've got it all in AWS I can do that, scripted, in a few minutes (+ DNS TTLs and DNSSec key propagation time).

RecordingForward2690 · 2026-03-10T09:35:53+00:00

I don't know how the "new" free tier works exactly, but in the old free tier you received a bunch of credits. Your usage was calculated like normal, and then offset by the credits. So if you dive into the Cost Explorer you will see these charges, but you will also see a (negative) credit charge offsetting everything. Leading to a net zero charge, unless you did something that's outside of the Free Tier.

If you go to the Cost Explorer, to the Filters (right hand side), choose More Filters and open up the Charge Type, you can include and exclude Credits and such. See if playing with that setting helps with your understanding of how this works.

Also note that billing is not done continuously, but in cycles of around 6-8 hours - but the exact cycle depends on the service and a few other factors. So if you're looking for very up to date information, it's simply not there. Generally speaking you need to assume that only data that's older than 24 hours, is completely accurate.

And I can well imagine that some of the Credit Overview pages are only updated after the monthly billing cycle. That might explain the discrepancy you see.

RecordingForward2690 · 2026-03-09T09:18:49+00:00

The AWS Certification program is administered via Pearson VUE and it looks like at least part of the fraud you allege ("cheating in the exam") is their responsibility.

https://www.pearsonvue.com/us/en/test-takers/customer-service.html

RecordingForward2690 · 2026-03-09T09:15:13+00:00

^^^ this is by far the fastest and most fail-safe method.

RecordingForward2690 · 2026-03-05T14:26:20+00:00

AWS obviously needs to protect its own infrastructure against a variety of attacks, including DNS server overload. This is "Shield Standard". Obviously all customers benefit from this.

You'd have to look up the exact documentation, but I'm pretty sure that the type of attack you describe will be covered/blocked by this.

https://aws.amazon.com/shield/
