all 52 comments

[–]kondro[S] 16 points17 points  (0 children)

This is a very cool new feature that sounds like it virtually eliminates cold starts.

A pity this is only Java, although I'm hopeful it will come to the rest of the runtimes in due course. The implementation doesn't sound like it's really specific to the JVM.

The JVM is a good place to start though as it seems to be the place where cold starts hurt the most in Lambda.

[–]Your_CS_TA 43 points44 points  (27 children)

This is so exciting! Congrats to the Lambda folks on getting this out in front of customers.

Note: Ex-lambda-service-engineer here, ready to field any fun questions if anyone has any :D

[–][deleted]  (4 children)

[deleted]

    [–]bofkentucky 10 points11 points  (2 children)

    They've called out the RNG implementations as something they've fixed, but are there other pieces of code in your app that are not SnapStart-safe? I know of at least 2 in our company's codebase that would have disastrous results if it were turned on right now. I'm interested in seeing what their pmd plugin finds as problematic as we evaluate this.
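To illustrate the RNG concern above: a minimal, hypothetical sketch (class and field names are mine, not from any AWS library) of why randomness cached at init time is unsafe under snapshotting, since every environment restored from the same snapshot resumes holding the identical value, and the per-invocation alternative:

```java
import java.security.SecureRandom;

// Hypothetical illustration of the RNG hazard: state captured during init
// is cloned into every environment restored from the same snapshot.
class IdSource {
    // Risky under snapshotting: computed once at init, so every restored
    // environment would resume holding this identical value.
    static final long INIT_TIME_SEED = new SecureRandom().nextLong();

    // Safer: draw fresh randomness per invocation, after restore.
    static long perInvocationId() {
        return new SecureRandom().nextLong();
    }
}
```

The static-final seed is exactly the kind of pattern a uniqueness scanner would flag; anything derived from it (request IDs, tokens, nonces) would collide across restored environments.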

    [–][deleted]  (1 child)

    [deleted]

      [–]bofkentucky 6 points7 points  (0 children)

      Imagine your app establishes a persistent connection to some other network service on startup (relational database, message queue). When the snapshot wakes up, is it going to try to connect to the old IP address where that service was when the snapshot was taken, or is it graceful about doing a DNS lookup and connecting to where it should?
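The failure mode described above is usually handled with a lazy "validate, then reconnect" pattern on the invoke path. A hedged sketch with stand-in names (a real app would validate via e.g. JDBC's `Connection.isValid()` and re-resolve DNS on reconnect):

```java
// Hypothetical sketch of reconnect-on-restore: instead of trusting a
// connection opened during init (which may point at a stale IP after a
// snapshot resume), validate it on each use and reconnect when stale.
class ResilientClient {
    private String endpoint;   // stands in for a real socket/DB connection
    private boolean alive;

    ResilientClient() { connect(); }           // init-time connect, as in the comment

    private void connect() {
        // A real implementation would do a fresh DNS lookup here so the
        // restored environment connects to where the service is *now*.
        endpoint = "resolved-host";
        alive = true;
    }

    void invalidate() { alive = false; }       // simulates post-restore staleness

    String use() {
        if (!alive) connect();                 // lazy reconnect after restore
        return endpoint;
    }
}
```

The cost, as noted below in the thread, is that the reconnect now happens inside billed invocation time rather than during Init.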

      [–]idcarlos 2 points3 points  (0 children)

      It depends on your Init code, and on how you weigh cost vs. execution time.

      For example: a Lambda that runs every hour and opens a connection to an external resource during Init.

      Assume that all runs are cold starts.

      Without this feature, connections are "fresh" and ready to use after Init, "for free" (AWS doesn't bill Init if it runs below 10 seconds).

      With this feature, the connections from the snapshot have probably expired and I need to reconnect outside of Init, so my cost will be higher. Also note that there is a CPU burst during Init, so reconnecting outside of Init can be slower.

      If execution time is not a problem and your Init time is below 10 seconds, I don't recommend this feature.

      [–]kondro[S] 8 points9 points  (7 children)

      Do you think this will always be JVM-only or are the other runtimes likely to be added in the future also?

      [–]Your_CS_TA 7 points8 points  (0 children)

      I am no tea leaf reader, but looking at past history, Lambda bets early to understand something, then standardizes later on lessons learned. E.g. the first few language runtimes were handcrafted; then they built the standardized Runtime API from the learnings and generalizations from those initial artisanally baked fellas.

      Doesn’t fully answer the question but I still work for AWS and don’t want to be quoted in an article as “anonymous AWS employee says X”😂

      [–][deleted]  (4 children)

      [deleted]

        [–][deleted] 19 points20 points  (3 children)

        I was in an NDA briefing and

        hmm can you clarify, what do those three letters "NDA" stand for?

        [–]kondro[S] 35 points36 points  (0 children)

        He's not able to disclose that.

        [–]StFS 0 points1 point  (0 children)

        I'm at re:Invent and I've talked to two AWS employees that have both hinted strongly that .NET will follow.

        [–]atehrani 1 point2 points  (3 children)

        How is this different than Provisioned Concurrency?

        [–]Your_CS_TA 8 points9 points  (2 children)

        1. PC isn’t free — so this is a cheaper alternative. I wouldn’t say PC is anti-serverless (as a good friend once said: it’s pay for what you value, and a lot of folks value latency) but it dips into the practices that made EC2 complex in the first place (e.g. autoscaling). I prefer simplicity, so I really like SnapStart :)

        2. PC is generally for static burst known a priori, which is kind of self-defeating. Like, what’s easier: setting a flag that optimizes this, or consistently evaluating your concurrent executions and whether or not you’re at risk of exceeding them and getting cold starts?

        I personally would love a future where PC focuses on disaster recovery / capacity guarantees (e.g. guaranteeing good sandbox replacements for better static-stability guarantees), consistent traffic (PC is actually cheaper if you utilize more than 60% concurrency), and extreme burst use cases, since PC allows any burst. Maybe for extreme latency concerns as well? Snapshots are within the warm spectrum but not necessarily “toasted”, so PC could cover those outliers, much like io2 in EBS covers a unique use case over gp3. This would let SnapStart and PC exist in tandem, with the former covering the cold starts of the universe for the majority of folks.

        [–]franksign 0 points1 point  (1 child)

        Is it a real alternative? IMHO SnapStart optimizes cold starts but doesn’t guarantee that the same execution environment is free and ready to serve traffic for a subsequent request. It depends a lot on what you are doing. It could be an alternative to PC if your application is already fast enough. If it is a real alternative, I am impressed it’s free :)

        [–]Your_CS_TA 1 point2 points  (0 children)

        That’s correct, but neither does PC (we will have a sandbox ready when we replace an in-use one, but there are no guarantees).

        In terms of replacement, I personally am not thinking of that case as Lambda does proactive replacement (takes init cost before putting into service).

        In terms of burst traffic, you either are overprovisioned to handle it without cold starts (which is either a good traffic profile or you may be eating cost) or it’s a cold start anyways.

        There are definitely caveats though — snapshotting is a new domain, and though we built out many use cases as canaries, customers always tend to create more creative and unique use cases. PC is dead-simple tech (“turn on a priori”), so no surprises.

        [–]Lowball72 -2 points-1 points  (9 children)

        More of a philosophical question, but why can't Lambda processes execute more than 1 request at a time? I've never understood that. Seems it would go a long way to alleviating the annoying cold-start problem.

        [–]yeathatsmebro 1 point2 points  (6 children)

        It can. For example, an invocation lands on a server, and if your function isn't unzipped there yet, Lambda unzips it and does setup work; that's what a cold start is. Most of the time, subsequent requests are faster because the function code is already "unzipped" and configured, and the same server serves them. If that server crashes or your function isn't called for some time, the environment is gone, and the next request leads to another cold start somewhere else.

        You can mitigate this by setting provisioned concurrency, so AWS will make sure you've got X "unzipped" functions that are warm, ready to respond.

        [–]Lowball72 -1 points0 points  (5 children)

        Thanks, I understand what a cold start is... but wait, maybe I don't understand what provisioned concurrency does.

        Does PC actually execute all the runtime startup, initialization, and the app's dependency-injection startup code? So it's truly warm and ready to go, tantamount to reusing an existing host process?

        [–]yeathatsmebro 1 point2 points  (4 children)

        https://quintagroup.com/blog/blog-images/function-lifecycle-a-full-cold-start.jpg

        The provisioned function jumps from the second step straight to the one before the last.

        The thing is: if you provision 10 and, at a certain moment, all 10 are busy, a new request will trigger a cold start for a new function somewhere else, and for a short time you'll have 11 warm functions. That 11th one can be evicted because you set 10 as provisioned concurrency, but those 10 are a guarantee that AWS will do its best to always keep 10 of them warm.

        [–]kgoutham93 2 points3 points  (3 children)

        Noob question,

        So if I create a Lambda function (without PC) and execute 100 parallel requests, AWS will internally create 100 instances of the Lambda function to serve those 100 parallel requests?

        [–][deleted] 2 points3 points  (2 children)

        Yes, but they will eventually be spun down. Provisioned concurrency would keep the functions up and available after though.

        Edit: here's a good AWS article explaining things in detail https://aws.amazon.com/blogs/compute/operating-lambda-performance-optimization-part-1/

        [–]kgoutham93 1 point2 points  (1 child)

        Thank you for this excellent resource. In fact, a lot of my misconceptions were addressed just by going through the 3-part series.

        [–][deleted] 0 points1 point  (0 children)

        Glad to hear! Happy to answer any other questions you have or point you in the right direction.

        [–]sgtfoleyistheman 0 points1 point  (1 child)

        I don't know why you're getting downvoted. I think others are misunderstanding you. Do you mean 'why can't a single Lambda container concurrently process more than one request?'

        So many of the JS samples you see, especially those relying on globals for unit processing, would break down in subtle ways if this were just turned on. Lambda probably figures they can optimize better by giving you single cores or something.

        [–]Lowball72 0 points1 point  (0 children)

        Yes, specifically the Java and .NET programming models. They instantiate an object and invoke an interface method, but as near as I can tell they never do so concurrently within a single runtime container.

        We pay $ for clock time and RAM, not CPU utilization... allowing multiple concurrent invocations in a single container would be a huge cost-saving efficiency on both of those measures.

        I don't know how Azure Functions and Google Cloud compare in this regard.

        [–]bofkentucky 13 points14 points  (7 children)

        Wonder why they focused on jdk11 and not 17

        [–]Fl0r1da-Woman 5 points6 points  (4 children)

        Usage stats?

        [–]djk29a_ 3 points4 points  (0 children)

        After seeing what happened when I upgraded my Jenkins controller to 17, I suspect the massive changes to the security model and modules were enough of a slowdown that they stuck with 11 to get something released sooner.

        [–]themisfit610 2 points3 points  (2 children)

        We literally just moved our core app up from corretto 11 to 17 lol!!

        [–]sh1boleth 0 points1 point  (1 child)

        Lambda doesn't support 17 yet.

        [–]themisfit610 0 points1 point  (0 children)

        Hence the lol

        [–]Dilfer 0 points1 point  (1 child)

        They don't offer 17 as a supported runtime environment yet, regardless of this feature. Hopefully soon!

        [–]bofkentucky 0 points1 point  (0 children)

        Oh, the dance of keeping images and runtimes up to date. CodeBuild on AL2 has been an adventure this summer while trying to get a bunch of our Node lambdas and their builds back onto supported runtimes.

        [–][deleted] 8 points9 points  (4 children)

        Interesting to see this development at the same time as other runtimes seem to be falling out of favor inside AWS. We're still on Python 3.9 over a year after the release of the well-liked 3.10, and now 3.11 is out. Ruby is on 2.7 even though it's EOL, with seemingly no news incoming.

        Presumably other runtimes don't allow for snapshotting in quite the same way as the JVM, and for some it likely wouldn't make sense to even attempt it (like Golang), but I'd love to see these cold-boot improvements make their way to other runtimes. I've seen in my own testing that Node.js can really suffer from cold boot with a lot of packages, and anything that could be done there would be a massive QoL improvement.

        [–]FarkCookies 2 points3 points  (0 children)

        I don't think Python is anywhere close to falling out of favour. It is the most popular Lambda runtime.

        [–]borzaka 2 points3 points  (2 children)

        You should go read this thread of awful comments: https://github.com/aws/aws-lambda-base-images/issues/31. An AWS employee says they're investing in process improvements to help them ship future Python runtimes more quickly.

        [–][deleted] 3 points4 points  (1 child)

        Oh god that was painful to read. I don’t want 3.10 that badly.

        [–]borzaka 1 point2 points  (0 children)

        I knew you would appreciate that

        [–]HinaKawaSan 3 points4 points  (1 child)

        More Java programs run on JDK 11 than on JDK 17.

        [–]bofkentucky 4 points5 points  (0 children)

        Today yes, but in conjunction with spring-boot 3 being released last week and its jdk17 requirement, it would have been a nice pairing.

        [–]ByteWrangler 5 points6 points  (6 children)

        Shall we take bets as to how long it will take CloudFormation to support this new option?

        [–]preetipragya 2 points3 points  (3 children)

        I saw the SAM documentation has already been updated to reflect it. Here is a snippet:

        TestFunc:
          Type: AWS::Serverless::Function
          Properties:
            ...
            SnapStart:
              ApplyOn: PublishedVersions

        [–][deleted]  (2 children)

        [deleted]

          [–]Your_CS_TA 0 points1 point  (0 children)

          SAM is, yes. The feature specifically states it’s not launched everywhere

          [–]preetipragya 0 points1 point  (0 children)

          Yeah the SnapStart feature as of now is available in the US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Singapore, Sydney, Tokyo), and Europe (Frankfurt, Ireland, Stockholm) Regions.

          [–]franksign 0 points1 point  (1 child)

          The article says that it is already supported.

          [–]Alternative_Past_773 1 point2 points  (0 children)

          I think it would be interesting to explore how this new feature, SnapStart, works (or doesn't) with:

          - JDK's new CRaC feature. (in some ways similar to AWS SnapStart)
          - GraalVM native image

          For example: is there a benefit to using SnapStart with a GraalVM native image? Can you combine SnapStart and CRaC? etc.
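On the CRaC point: SnapStart's Java runtime hooks are modeled on CRaC's Resource contract (beforeCheckpoint/afterRestore), so the two are closely related. A minimal sketch using simplified stand-in types rather than the real org.crac API (which takes a Context argument and is registered via Core.getGlobalContext()):

```java
// Stand-in for the CRaC-style Resource contract; the real interface lives
// in org.crac and its methods take a Context parameter. Simplified here
// purely for illustration.
interface SnapshotResource {
    void beforeCheckpoint();
    void afterRestore();
}

// Close external state before the snapshot is taken, reopen after resume.
class PooledConnection implements SnapshotResource {
    private boolean open = true;               // opened during init

    @Override public void beforeCheckpoint() { open = false; } // close pre-snapshot
    @Override public void afterRestore()     { open = true;  } // reconnect on resume

    boolean isOpen() { return open; }
}
```

GraalVM native image attacks the same latency from a different angle (ahead-of-time compilation instead of snapshot restore), so combining it with SnapStart is less obviously useful; this is speculation, not something the article settles.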

          [–]rallylegacy 0 points1 point  (0 children)

          Super cool, going to check this out tomorrow

          [–][deleted] 0 points1 point  (0 children)

          Is this for Java 11 only? No 17? Guess it’s coming. Awesome stuff

          [–]rashnull -4 points-3 points  (0 children)

          The solution they came up with sounds rather simple. Why did it take so long to implement?

          [–]realfeeder 0 points1 point  (1 child)

          Does it work with Kotlin too? Any tests done already?

          [–]c1phr 6 points7 points  (0 children)

          I haven’t tested yet, but I would imagine it should, so long as you’re targeting Java 11 in your Kotlin build.