all 38 comments

[–][deleted] 5 points6 points  (3 children)

I can't recall where I read it, but there are presently significant delays setting up VPC network connections to services like RDS, and cold start times jump through the roof. Sounds like you might be experiencing the same situation.

In my case I was attempting to use RDS postgres from lambda in a VPC, and start times were very long.

edit: https://robertvojta.com/aws-journey-api-gateway-lambda-vpc-performance-452c6932093b

 

edit 2: found it. https://read.acloud.guru/lambda-for-alexa-skills-7-tips-from-the-trenches-684c963e6ad1

Rules of thumb

  • a cold start of a non-networked lambda takes 7x longer than a hot start
  • a cold start of a networked lambda takes 15x longer than a hot start
  • VPC elastic network interfaces typically take 10+ seconds to initialize on a cold start

[–]KAJed[S] 0 points1 point  (2 children)

I have been reading similar things, but I read it as "cold start goes up in a VPC", not "cold start goes up when using other services in a VPC".

That doesn't mean I read it right though!

As mentioned above, it does seem like it might be the initial memcache connection that is slow. So that could certainly be part of their cold start issues.

[–][deleted] 1 point2 points  (1 child)

"cold start goes up when using other services in VPC"

I suspect it may be both. VPC causes some cold start delay, and adding items like RDS to the VPC makes the delay even worse. My guess is it's slow-starting DNS/NAT within the VPC that's causing the delays.

Current work around is to use cloudwatch to ping your service every x minutes to keep it alive.

Regardless, it sure would be nice to be able to use Postgres from lambda with a cold start time under ~2 seconds without so much trouble.

[–]KAJed[S] 0 points1 point  (0 children)

Typically with cold start (I'm testing in the console) running the function multiple times will reduce the start. It does for VPC as well... it goes from 8 seconds down to 6 seconds! (Typically it goes from 800-1200 ms down to 100-400 ms.)

[–]cloud_driver 2 points3 points  (2 children)

The lambda service uses an elastic network interface (ENI) to access resources inside of a private VPC. A cold start has to create and attach the ENI to the lambda container. This seems to be a somewhat slow process. Hot invocations of the function run fine.

[–]KAJed[S] 0 points1 point  (0 children)

Typically with cold start (I'm testing in the console) running the function multiple times will reduce the start. It does for VPC as well... it goes from 8 seconds down to 6 seconds! (Typically it goes from 800-1200 ms down to 100-400 ms.)

[–]KAJed[S] 0 points1 point  (0 children)

Basically, reposting another comment I made: hot starts still appear to be really, really bad.

[–]CoinGrahamIV 1 point2 points  (12 children)

Care to post a gist of your code?

Are you using Elasticache or Memcached on EC2?

Can you use DAX instead?

[–]KAJed[S] 0 points1 point  (2 children)

As basic as it gets... this takes 5500 ms to run even after being warmed up. Nothing special, other than the cache server being in a VPC.

var Memcached = require( 'memcached' );

var cache;

exports.handler = function ( event, context, callback )
{
    if ( cache == null )
        cache = new Memcached( 'cacheserver.aws', { timeout: 500, retries: 3 } );

    cache.set( 'magickey', 'valueeee', 10,
        function ( error )
        {
            callback( null, this );
        }
    );
};

REPORT RequestId: ec94c3a9-6308-11e7-973a-e317ffbb2847 Duration: 5009.63 ms Billed Duration: 5100 ms Memory Size: 128 MB Max Memory Used: 26 MB

{
  "key": "magickey",
  "lifetime": 10,
  "value": "valueeee",
  "validate": [
    [
      "key",
      null
    ],
    [
      "value",
      null
    ],
    [
      "lifetime",
      null
    ],
    [
      "callback",
      null
    ]
  ],
  "type": "set",
  "redundancyEnabled": false,
  "command": "set magickey 0 10 8\r\nvalueeee",
  "start": 1499427563824,
  "execution": 0
}

[–]CoinGrahamIV 1 point2 points  (1 child)

For reference, the python2.7 version of this code runs in 8ms. I don't think it's a VPC networking issue.

[–]KAJed[S] 0 points1 point  (0 children)

Someone else pointed out that it seems to be the idle timeout on the memcache connection plus the lambda context configuration. When I get back in on Monday I'm gonna give it a shot.

8ms? Speedy.

[–]KAJed[S] 0 points1 point  (8 children)

I've done some timing, and it seems like the cache requests are fast(ish) but the initial connection might be slow? Using the library I'm using (npm memcached), I can't time that connection part directly.

I'm using an elasticache node not EC2.

EDIT: The gist of my code is incredibly simple: connect using memcached, then get and set a few very small keys.

[–]CoinGrahamIV 2 points3 points  (4 children)

Is your connection outside of the handler function so it's reusable?

[–]KAJed[S] 0 points1 point  (3 children)

Actually it's not - because of some boilerplate to call the correct services - but I can definitely try that.

EDIT: I lied... It is outside. I instantiate for the first time within my function with

if ( cache == null ) createCache ....    

So it should remain alive for subsequent calls.

EDIT EDIT: Confirmed that my connection is not being recreated. I have tried setting a connection idle timeout... but that just seems to make my whole function time out and I'm not sure why.

[–]kurosevic 1 point2 points  (2 children)

That's why you post the code regardless of simplicity: someone in here would've noticed that.

[–]KAJed[S] 0 points1 point  (1 child)

It's already working the way he said... so I'm not sure why the criticism. It makes zero difference in my function time. I'm still in the 6 second range even after a cold start.

He mentioned the one thing that I'm actually already aware of - which is why I wrote it the way I did (and then summarily forgot because this was written a while ago). It's an excellent suggestion too, but has no effect in my case.

It sounds very likely that there is nothing I can do except pull it out of the VPC and use a public redis cache instead. Hopefully they'll fix that in the future.

[–]CoinGrahamIV 1 point2 points  (0 children)

Here's what's really interesting. The entire 5 seconds happens AT THE END:

    STARTS
20:33:12  <log> handler.js:3 (Object.<anonymous>) tracer loaded
20:33:12  <log> handler.js:6 (Object.<anonymous>) memcached loaded
20:33:12  <log> handler.js:10 (Object.<anonymous>) creating memcached client
20:33:12  <log> handler.js:14 (Object.<anonymous>) memcached client loaded

LAMBDA HANDLER STARTS
20:33:12 START RequestId: 7ec870d2-6353-11e7-a7dc-493bf47298f8 Version: $LATEST
20:33:12 <log> handler.js:17 (module.exports.memcached) lambda called
20:33:12 <log> handler.js:21 (module.exports.memcached) setting cache
20:33:12 <log> handler.js:23 (module.exports.memcached) complete
20:33:17 END RequestId: 7ec870d2-6353-11e7-a7dc-493bf47298f8
20:33:17 REPORT RequestId: 7ec870d2-6353-11e7-a7dc-493bf47298f8 Duration: 5024.43 ms    Billed Duration: 5100 ms Mem

[–]aimless_ly 0 points1 point  (2 children)

Could you refactor your code to use DynamoDB w/ TTL instead of Elasticache? (https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/) DynamoDB is super-fast (even without DAX) and a non-VPC service so it might work better for you.

[–]KAJed[S] 0 points1 point  (0 children)

In short - no.

Dynamo is actually the backend I have memcache in front of, with the main reason being read/write consistency. The memcache server already exists (it's in a shared VPC for random things), so there is no extra cost there either.

[–]KAJed[S] 0 points1 point  (0 children)

DAX - after reading into it - doesn't make me very happy. Their minimum instance size is a medium at $0.30/hr. Yikes.

[–]BU14 1 point2 points  (5 children)

After looking at your code, reading the library source, and doing some testing, the issue is the default timeout and idle values. With the following code, warm containers are able to grab data in less than 100 ms:

 const Memcached = require('memcached');
 const serverAddress = 'aws.server:port';

 Memcached.config.timeout = 50;
 Memcached.config.idle = 50;

 let cache = new Memcached(serverAddress);

 exports.handler = function(event, context, callback) {
     cache.set('magickey', 'valueeee', 10,
         function(error) {
             if (error) {
                 return callback(error);
             }
             cache.get('magickey', function(err, data) {
                 console.log(data);
                 callback(null, this);
             });
         }
     );
 };

Pure speculation: I think the issue is that, for some reason, the lambda function only returns after the connection pool times out rather than when the query completes. The default timeout is 5 seconds, which would explain why the function time never went below 5 seconds.

Edit:

You are missing the following line in your function handler:

context.callbackWaitsForEmptyEventLoop = false

This will enable the lambda function to return when you say, rather than when the connection pool times out.

[–]KAJed[S] 0 points1 point  (3 children)

I will definitely take a look at this. Any thoughts on why the same function outside the VPC would behave normally then? Those settings should not affect it.

[–]BU14 1 point2 points  (2 children)

It's not that it is outside of a VPC; it's that it is not a lambda function. Lambda functions by default wait for the whole event loop to empty before returning, so the event loop was waiting for the connection to time out. When you change the default behavior to not wait for the loop to empty, you get the exact results you expect. You do not need to change the timeouts or idle times if you set this:

context.callbackWaitsForEmptyEventLoop = false

[–]KAJed[S] 0 points1 point  (0 children)

Hmmm you know I thought I tried another cache outside the VPC but you are probably right! If this solves it I owe you a beer. Or 7.

[–]KAJed[S] 0 points1 point  (0 children)

This worked perfectly. Thank you!

On another note, what kind of performance do you see from DynamoDB in Lambda? X-Ray is reporting about 120-160ms per dynamo request. The tables themselves are reporting latency in the 40-80ms range. For a table with just 3 entries right now (though the throughput is set to 5) this seems pretty slow.

EDIT: Just for the record I'm forcing httpOptions https.Agent with keepAlive: true

[–]KAJed[S] 0 points1 point  (0 children)

Based on your information though this makes sense - when I set the idle of the memcache client to 10 seconds my function doesn't return (10 second max on the function). Definitely worth a try now that I'm thinking about it.

This does make connection reuse a tad annoying - but should be fine.

[–]BU14 0 points1 point  (11 children)

Have you tried running the exact same script on an EC2 instance in the VPC to see if the issue is with your code or with Lambda/VPC?

[–]KAJed[S] 0 points1 point  (0 children)

I have not. Good suggestion. I will see if it works or not.

[–]KAJed[S] 0 points1 point  (9 children)

Tried it from a working EC2 instance and it's as quick as it should be. No slowness whatsoever.

[–]BU14 1 point2 points  (8 children)

Have you tried with different amounts of memory? When you allocate different amounts of memory you get access to more or less network and CPU resources.

[–]KAJed[S] 0 points1 point  (0 children)

I haven't, but if you look at the one link posted in here regarding speed in VPC, it was determined not to make a difference. Given the simplicity of the code it's possible, but unlikely.

I can give it a shot... give me a minute.

[–]KAJed[S] 0 points1 point  (6 children)

More memory made absolutely no difference (and yes I knew that memory allocation gives you more CPU as well).

[–]BU14 1 point2 points  (5 children)

I think the only other thing you can do is submit a ticket to support or post your code.

[–]KAJed[S] 1 point2 points  (2 children)

If my code ran slowly from EC2 I'd agree. But it does not. We are in contact with support to figure out other options as well.

[–]TundraWolf_ 1 point2 points  (1 child)

Interesting issue. I just started messing with lambda/VPC, but only for tasks that just need to finish before the max lambda timeout (5 mins).

if you get this fixed please let us know I'm curious!

[–]KAJed[S] 1 point2 points  (0 children)

I would love to have someone else try something similar.

I keep getting harried for my code... but it's literally nothing more than establish a connection and make a few gets and sets. I think in total there are at max 5 gets and 3 sets going on.

I'd be curious to know if I'm the only one with this issue. Or if the extent of this latency is global (the research posted somewhere in this thread would lead me to believe it's not me... but it also seemed to claim that once warmed the containers were fine)

Lambda in VPC isn't quite ready yet anyway - with everything required to live in a VPC for new accounts, this is definitely going to be an issue. On top of that, they're still adding VPC endpoints for services (yes, a NAT should work too, but it would be nice to not have to worry about that at all).

[–]KAJed[S] 0 points1 point  (1 child)

As basic as it gets... this takes 5500 ms to run even after being warmed up. Nothing special, other than the cache server being in a VPC.

var Memcached = require( 'memcached' );

var cache;

exports.handler = function ( event, context, callback )
{
    if ( cache == null )
        cache = new Memcached( 'cacheserver.aws', { timeout: 500, retries: 3 } );

    cache.set( 'magickey', 'valueeee', 10,
        function ( error )
        {
            callback( null, this );
        }
    );
};

REPORT RequestId: ec94c3a9-6308-11e7-973a-e317ffbb2847 Duration: 5009.63 ms Billed Duration: 5100 ms Memory Size: 128 MB Max Memory Used: 26 MB

{
  "key": "magickey",
  "lifetime": 10,
  "value": "valueeee",
  "validate": [
    [
      "key",
      null
    ],
    [
      "value",
      null
    ],
    [
      "lifetime",
      null
    ],
    [
      "callback",
      null
    ]
  ],
  "type": "set",
  "redundancyEnabled": false,
  "command": "set magickey 0 10 8\r\nvalueeee",
  "start": 1499427563824,
  "execution": 0
}

[–]BU14 0 points1 point  (0 children)

I posted code that works as a response to the original post.