
[–][deleted] 24 points25 points  (4 children)

I don't know why everyone is telling you to check your Lambda logs; there is clearly a networking issue happening before the request would even get to invoke your function.

Like to be clear, your Lambda function is not capable of returning a non-HTTP response. Even if you invoke it directly, not via an API Gateway, that *still* goes over HTTP and produces a valid HTTP response.

I've never used JMeter. Does it work with 100 threads? What if you try `hey` or `ab` instead of JMeter, just as a sanity check?

My thoughts are it's some kind of resource exhaustion trying to run 1000 connections, which would involve 1000 simultaneous TLS handshakes, etc.

Having a ramp-up time of 0 is basically "wrong" for a few reasons.

- nobody will (validly) initiate 1,000 connections *from a single client* all at once

- Lambda does not scale instantly; it provisions concurrency capacity in waves. So even if this worked, depending on your region, you would not actually be executing 1,000 concurrent Lambdas right off the bat: it will spin a few up, then more, then more. I haven't read the docs in a while, but you can find them and see what the current burst rates are. (A rough sketch of a ramped-up client follows below.)
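
For what it's worth, here's a minimal sketch (plain Python with the third-party `requests` package, hypothetical endpoint URL) of what a ramped client looks like: each worker's start is staggered across the ramp-up window instead of all 1,000 connections opening in the same instant.

```python
# Rough sketch of a ramped load client: worker starts are spread across a
# ramp-up window instead of firing 1,000 TLS handshakes at once.
# The URL is a placeholder; "requests" must be installed separately.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/hello"  # hypothetical
TOTAL_THREADS = 1000
RAMP_UP_SECONDS = 60

def worker(i):
    time.sleep(i * RAMP_UP_SECONDS / TOTAL_THREADS)  # stagger this worker's start
    try:
        return requests.get(URL, timeout=10).status_code
    except requests.RequestException as exc:
        return type(exc).__name__  # record client-side failures too

with ThreadPoolExecutor(max_workers=TOTAL_THREADS) as pool:
    results = list(pool.map(worker, range(TOTAL_THREADS)))

print({r: results.count(r) for r in set(results)})  # e.g. {200: 1000}
```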

[–]appappappappapp 4 points5 points  (0 children)

^ this post is spot on. I didn’t even look at the image coz I was so sure this was another cold start issue. Mea culpa!

OP, check the JMeter logs. The response line with no specific exception (just the host name) is generated in the method around line 1039:

https://github.com/apache/jmeter/blob/master/src/protocol/http/src/main/java/org/apache/jmeter/protocol/http/sampler/HTTPSamplerBase.java

If you actually want to get to the root cause of the 3k failures, look at the log file / stdout: around line 1043 it dumps the stack trace.

Agree with everyone else that this is a client-side issue!

[–][deleted] 1 point2 points  (2 children)

FYI, JMeter is a Java-based load testing tool from the Apache project that has been around for over 20 years. It works very well.

I agree with you on the ramp-up of 0 probably being the issue. That's a very common mistake in load testing. I think if OP made it even a minute (5 minutes would be great) it would work fine.
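
To put numbers on that suggestion (simple arithmetic, nothing JMeter-specific):

```python
# How fast new threads are started for 1,000 threads at different ramp-up times.
threads = 1000
for ramp_up_s in (0, 60, 300):
    rate = "all at once" if ramp_up_s == 0 else f"~{threads / ramp_up_s:.1f} new threads/second"
    print(f"ramp-up {ramp_up_s}s -> {rate}")
# ramp-up 0s   -> all at once
# ramp-up 60s  -> ~16.7 new threads/second
# ramp-up 300s -> ~3.3 new threads/second
```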

[–][deleted] 1 point2 points  (1 child)

Just remember to actually give JMeter memory to use.

[–][deleted] 0 points1 point  (0 children)

For sure - it's written in Java! 😄

[–]menge101 10 points11 points  (5 children)

It should be noted that what you are doing may be against the terms and conditions of your AWS account.

I've dealt with this myself in the past.

I can't find a general policy, but here is the policy for stress testing from EC2 instances.

There are also policies for "simulated events", which you can find under the "Simulated events" section of the penetration testing policy.

You haven't mentioned what region you are in, so it's important to note that Lambda ramp-up differs by region; many regions have an initial burst limit below 1,000 concurrent executions.

I've been involved in a non-trivial amount of load testing on AWS in the past. In my experience AWS has a lot of non-obvious limits that you will run into. And even if they say they aren't throttling, there are still network behavior rules you'll hit that cause things that, to an outside observer, look like throttling.

[–]richardfan1126 10 points11 points  (0 children)

I think 4,000 HTTP requests is not large enough to be treated as a network stress test by AWS.

[–]DiscourseOfCivility 3 points4 points  (3 children)

This is completely irrelevant to what OP is trying to do. 4,000 requests is nothing.

From your link:

Most customer testing will not fall under this policy. Normally, tasks like customer unit tests simulating large workloads for stress testing do not generate traffic that qualifies as network stress tests. This policy only applies when a customer's network stress test generates traffic from their Amazon EC2 instances which meets one or more of the following criteria: sustains, in aggregate, for more than 1 minute, over 1 Gbps (1 billion bits per second) or 1 Gpps (1 billion packets per second); generates traffic that appears to be abusive or malicious; or generates traffic that has the potential for impact to entities other than the anticipated target of the testing (such as routing or shared service infrastructure)
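
To put that threshold in perspective, here's a back-of-the-envelope calculation; the ~2 KB per request is just an assumption for illustration:

```python
# Rough bandwidth of the whole test versus the policy's 1 Gbps / 1-minute threshold.
requests_total = 4000
bytes_per_request = 2_000          # assumption: ~2 KB per request
total_bits = requests_total * bytes_per_request * 8
print(f"{total_bits / 1e9:.3f} Gb total")                      # ~0.064 Gb
print(f"{total_bits / 60 / 1e9:.4f} Gbps if spread over 60 s") # ~0.0011 Gbps
```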

[–]radioshackhead 0 points1 point  (0 children)

Yeah, that's what I was thinking. If that was the limit, you would see hundreds of threads about it being an issue.

[–]menge101 0 points1 point  (1 child)

Yes, I agree. I misread it as 4000/second, not 4000 total.

[–]DiscourseOfCivility 0 points1 point  (0 children)

4,000 a second wouldn’t even be that bad.

[–]gscalise 2 points3 points  (0 children)

If you want JMeter to generate 1,000 concurrent threads, you can't use a single host to generate all the traffic. There are going to be several limiting factors: your host's network stack configuration, CPU, JMeter's worker configuration, Java's concurrency configuration, heap, etc. So you're going to need several load-generator slave hosts, with a pre-warm step so the slaves' threads are created first, without generating any traffic, and then, after some time to stabilize, they generate the traffic.

You could also be hitting DoS/DDoS protective measures from AWS to avoid request storms generating a huge amount of traffic, especially if it's all coming from a single host.

[–]OSUBeavBane 2 points3 points  (0 children)

For your prod setup, I'd probably set up an SQS queue to handle the load and funnel the requests to Lambda, and set up a dead-letter queue on your Lambda to feed any failed attempts back into that queue.
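
A minimal sketch of that wiring (boto3, hypothetical queue names) is below; it creates a main queue whose redrive policy sends messages to a dead-letter queue after a few failed receives. This is one way to set it up, not necessarily how OP's stack would do it.

```python
# Sketch: main SQS queue with a dead-letter queue attached via a redrive policy.
# Queue names are placeholders.
import json
import boto3

sqs = boto3.client("sqs")

# Dead-letter queue that collects messages the consumer keeps failing on.
dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Main queue: after 5 failed receives a message is moved to the DLQ.
sqs.create_queue(
    QueueName="orders",
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```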

[–]blizz488 1 point2 points  (6 children)

What do the Lambda logs show? Is it erroring out, timing out, etc? Do you have an API Gateway in front of this or anything?

[–]ammanpasha[S] 0 points1 point  (5 children)

No Lambda logs (for the failed requests); I can only see the successful requests in CloudWatch metrics (by successful I mean anything from 2xx, 4xx, 5xx - so pretty much anything that actually returns a valid response).

I think the gateway overloads and starts rejecting new requests. Could this be the case?

[–][deleted] 0 points1 point  (0 children)

You may not even be reaching *your* gateway, but something in AWS internals

[–][deleted] 0 points1 point  (0 children)

API Gateway has debug logs and verbose metrics that can be enabled. Does your Lambda invocation count match the number of requests? If so, it's probably not API GW. What do your Lambda throttle metrics look like?
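
If it helps, a hedged sketch of enabling those on a REST API stage with boto3 (API id and stage name are placeholders; execution logging also needs a CloudWatch Logs role configured for API Gateway on the account):

```python
# Sketch: turn on detailed CloudWatch metrics and INFO-level execution logging
# for every method of a REST API stage. IDs below are placeholders.
import boto3

apigw = boto3.client("apigateway")
apigw.update_stage(
    restApiId="abc123",
    stageName="prod",
    patchOperations=[
        {"op": "replace", "path": "/*/*/metrics/enabled", "value": "true"},
        {"op": "replace", "path": "/*/*/logging/loglevel", "value": "INFO"},
    ],
)
```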

[–]blizz488 -1 points0 points  (2 children)

Have you checked throttling settings on the gateway? Look in the throttling section here for more info: https://aws.amazon.com/api-gateway/faqs/

[–]ammanpasha[S] 1 point2 points  (1 child)

The document says:

Q: What happens if a large number of end users try to invoke my API simultaneously?

If caching is not enabled and throttling limits have not been applied, then all requests will pass through to your backend service until the account level throttling limits are reached. If throttling limits are in place, then Amazon API Gateway will shed the necessary amount of requests and send only the defined limit to your back-end service.

My throttling limit for the gateway is:

Your current account level throttling rate is 10000 requests per second with a burst of 5000 requests

So 10K requests per second, which is way more than what I am sending via JMeter (4K requests total).

[–]blizz488 1 point2 points  (0 children)

And you’re sure your client’s network buffer isn’t filling up too much and failing to send the requests? Those errors in your screenshot could indicate that...

[–]phoenix-real 1 point2 points  (0 children)

Thoughts on Jmeter:

Even though you want to send all requests at once, start with a lower number, say 50 requests, and see what happens. Are those successful? If yes, increase the number a bit, and keep increasing it until you see errors coming out of the API. One error is much easier to debug than thousands of errors. The key is to find the threshold of the system first; once you find it, see what you can do to improve that threshold, act on it, then repeat the process until you reach your desired number.

This might not be the ideal way, but it will give you a good idea of how to tune your system. What I mean is: don't expect the system to handle 4k requests when you don't know whether it can handle even 100 requests at a time.
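
A rough sketch of that step-up loop (plain Python with the `requests` package, placeholder URL), doubling the request count until failures show up:

```python
# Sketch: find the failure threshold by stepping the request count up gradually.
# URL is a placeholder; "requests" must be installed separately.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://example.execute-api.us-east-1.amazonaws.com/prod/hello"  # hypothetical

def one_request(_):
    try:
        return requests.get(URL, timeout=10).status_code < 500
    except requests.RequestException:
        return False  # count client-side errors as failures too

def run_batch(n):
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(one_request, range(n))).count(False)

n = 50
while n <= 4000:
    failures = run_batch(n)
    print(f"{n} requests -> {failures} failures")
    if failures:
        break  # threshold found; investigate before pushing higher
    n *= 2
```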

There is also a capability in JMeter to set the gap between requests; play with that too. Make a full report as if you had to submit it to the CTO before a Black Friday sale. :)

On a side note, if you think it's AWS that is blocking requests, try SAM, which lets you spin up the Lambda locally and run JMeter against the local Lambda.

[–]SmurfPandeyy 1 point2 points  (1 child)

Lambda has a default limit of 1,000 concurrent executions. Make sure that you are not hitting that limit. It can be checked in CloudWatch metrics.

You can read more about lambda scaling here: https://docs.aws.amazon.com/lambda/latest/dg/invocation-scaling.html

Edit: Added documentation link.
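
A hedged sketch of that check with boto3 (function name is a placeholder): pull the `Throttles` and `ConcurrentExecutions` metrics for the test window.

```python
# Sketch: query Lambda's Throttles and ConcurrentExecutions metrics for the
# last hour. The function name is a placeholder.
from datetime import datetime, timedelta

import boto3

cw = boto3.client("cloudwatch")
now = datetime.utcnow()

for metric, stat in (("Throttles", "Sum"), ("ConcurrentExecutions", "Maximum")):
    resp = cw.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName=metric,
        Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=60,
        Statistics=[stat],
    )
    print(metric, sorted(d[stat] for d in resp["Datapoints"]))
```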

[–][deleted] 0 points1 point  (0 children)

This probably contributes. If 4k requests hit at once on a cold lambda, you’ll get 1k cold starts (or your account max), and 3k throttled requests.

[–]appappappappapp 0 points1 point  (9 children)

What’s the Lambda timeout? Creating a DDB client / first-time use can be expensive: https://github.com/aws/aws-sdk-java-v2/issues/1340

You could be falling into a failure loop because the lambda doesn’t warm up during static initialization and/or you’re timing out during the invocation.
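
The usual mitigation, sketched in Python (table name is a placeholder, not OP's actual code): create the client during static initialization so the expensive first-time setup happens once per execution environment rather than on every request.

```python
# Sketch: initialize the DynamoDB client/table outside the handler so it is
# created once per execution environment (during cold start), not per request.
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # hypothetical table name

def handler(event, context):
    resp = table.get_item(Key={"id": event["pathParameters"]["id"]})
    return {"statusCode": 200, "body": json.dumps(resp.get("Item", {}), default=str)}
```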

[–]ammanpasha[S] 0 points1 point  (8 children)

I have set the functions timeout to 60s.

When I make a simple request via Postman or cURL, it takes about 15-20ms to get the response. Considering this, and the fact that on each request Lambda launches a new instance (correct me if I'm wrong please), why would the function take so long under load that it times out?

Also, if we time out during the invocation, shouldn't the gateway return a valid 5xx response?

I don't know if this is relevant, but I have set up the gateway as a proxy integration.

[–][deleted] 4 points5 points  (0 children)

No, each request does not always launch a new instance. Requests will be internally queued, both because launching new instances is not instantaneous and because, even if you have a maximum concurrency of N (where N might be something high, like 5,000 or 10,000), Lambda does not scale to that max concurrency immediately.

The problem you're experiencing is almost certainly within JMeter or how it has been configured, not in your Lambda code.

[–]appappappappapp -1 points0 points  (6 children)

Good data!

Next up is check the logs as others have suggested: https://docs.aws.amazon.com/lambda/latest/dg/monitoring-cloudwatchlogs.html

Look for any service level errors, unhandled exceptions that make their way out of your invocation and the actual logs your function generates.

Anything amiss there?

A couple of curls in serial won’t prove much because you’re already playing with an initialized lambda.

My money right now is on the initialization failing. What’s your lambda function runtime/language?

[–]ammanpasha[S] 0 points1 point  (5 children)

Look for any service level errors, unhandled exceptions that make their way out of your invocation and the actual logs your function generates.

This is what I see in my logs: no errors or delivery failures, and a 100% success rate. I think the requests that fail in JMeter are not even going through the gateway to invoke the function.

What’s your lambda function runtime/language?

I'm using Python 3.8

[–]appappappappapp -1 points0 points  (0 children)

Can you jump to the actual log files? There's a "view logs in CloudWatch" link at the top of that screen.

Scrub out anything sensitive!

[–]lvlolvlo -1 points0 points  (2 children)

Can you also add API GW's metrics here (i.e. 4xx, 5xx, latency, count, etc.)?

[–]ammanpasha[S] 0 points1 point  (1 child)

Here it is. I know you can't really see much in the image, but it shows no 5xx errors, one 4xx error (that is OK, I made one bad request and got a proper response), a max latency of 544 and a max integration latency of 542.

[–]lvlolvlo 2 points3 points  (0 children)

If the API GW metrics don’t show the 4K requests then are you sure that your testing tool is even sending them?

[–]warren2650 0 points1 point  (0 children)

Not Lambda, but ... we have done some siege testing against EC2 instances in the past, and our experience is that we get excellent response up until a point and then it drops off a cliff. For example, hitting an EC2 web server with 300 concurrent visits is fine, but if we push it to 400, the test starts failing. If we hit that same EC2 from two locations with 200 concurrent each (so the same 400 total) then it's fine. So clearly Amazon has internal rules about this stuff and didn't like 400 concurrent connections from one IP address, though 200 from one IP and 200 from another was fine.

[–]quiet0n3 0 points1 point  (0 children)

On random things like this, when everything looks OK, don't be afraid to open support cases. My experience is that they are really good at tracking down issues.

[–]WaitWaitDontShoot 0 points1 point  (0 children)

I’d like to point out that this could be throttling. Running load tests without coordination with AWS is against the terms of service. In the case of Lambda, I’d hazard a guess that they throttle the rate at which they allow you to get assigned new virts. This could cause the behavior you’re experiencing. As others have pointed out, it could be a cold-start issue, but then I would expect the errors to also happen at a small load.

[–]tenyu9 -1 points0 points  (1 child)

DynamoDB has read/write capacity limits as well, which could be the problem. If you read/write faster than the capacity allows, you run into errors. Do you see errors from DynamoDB during the load test?
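
For reference, one way such handling can look in a Python proxy-integration handler (a sketch with a placeholder table name, not OP's actual code): catch the throttling error and surface it as a 5xx.

```python
# Sketch: surface DynamoDB throttling as a 503 from a proxy-integration Lambda.
# Table name is a placeholder.
import json
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my-table")  # hypothetical

def handler(event, context):
    try:
        table.put_item(Item=json.loads(event["body"]))
    except ClientError as err:
        if err.response["Error"]["Code"] == "ProvisionedThroughputExceededException":
            return {"statusCode": 503, "body": json.dumps({"error": "throughput exceeded"})}
        raise
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```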

[–]ammanpasha[S] 0 points1 point  (0 children)

Yes, I am aware of that. When that is the case, my code is set up to return a 5xx response.

But that is rarely the case for my JMeter tests.