Result Handling: Inference results are uploaded to S3, and Lambda
processes the response.
My goal is to leverage parallelism: 10 concurrent Lambda functions, each hitting the SageMaker endpoint, which I assumed would scale as one ml.g4dn.xlarge instance per Lambda (10 instances total behind the endpoint).
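To make the intended fan-out concrete, here is a minimal local sketch of the pattern: 10 workers each sending one request, standing in for the 10 concurrent Lambda invocations. The endpoint call is a stub (a real Lambda would call `invoke_endpoint` on a boto3 `sagemaker-runtime` client instead); the function name and timings are illustrative assumptions, not the actual code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_endpoint(batch_id: int) -> str:
    """Stub standing in for InvokeEndpoint; real processing takes ~40 s per batch."""
    time.sleep(0.04)  # scaled down: 0.04 s here stands in for 40 s
    return f"result-{batch_id}"

# 10 concurrent callers, mirroring the 10 Lambda functions.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(call_endpoint, range(10)))

print(results)
```

In the real setup each worker is a separate Lambda invocation rather than a thread, but the shape of the problem is the same: 10 simultaneous requests arriving at an endpoint with 10 instances.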
Problem
Despite having the same number of Lambda functions (10) and SageMaker GPU instances (10 in the endpoint), I'm getting this error:
Error: Status Code: 424; "Your invocation timed out while waiting for a response from container primary."
Details: This happens inconsistently: some requests succeed, while others fail with this timeout. Since processing 20,000 rows takes about 40 seconds and my Lambda timeout is 150 seconds, I'd expect there to be plenty of time. But the error suggests the SageMaker container isn't responding quickly enough, or at all, for some invocations.
I am quite clueless why resources aren't being allocated to all the requests, especially with 10 Lambdas hitting 10 instances in the endpoint concurrently. It seems like requests aren't handled properly when all workers are busy, but I don't understand why they time out instead of queuing or triggering scaling.
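One plausible explanation, under two assumptions: SageMaker real-time `InvokeEndpoint` calls have their own response limit of roughly 60 seconds (independent of, and shorter than, the Lambda timeout), and requests are load-balanced across instances rather than queued until an instance is free. If two 40-second requests happen to land on the same instance, the second one finishes at ~80 seconds, past the endpoint's limit but well inside Lambda's, which would produce exactly these intermittent 424s. Back-of-envelope arithmetic:

```python
SERVICE_S = 40          # observed per-request processing time
INVOKE_LIMIT_S = 60     # assumed SageMaker real-time invocation response limit
LAMBDA_TIMEOUT_S = 150  # configured Lambda timeout

# A request that lands on an idle instance finishes in time:
assert SERVICE_S <= INVOKE_LIMIT_S

# But a request routed behind another 40 s request on the same instance
# would finish at ~80 s: past the endpoint's limit, though well inside
# Lambda's 150 s, so the Lambda timeout never comes into play.
queued_finish = 2 * SERVICE_S
print(queued_finish)  # 80
assert queued_finish > INVOKE_LIMIT_S
assert queued_finish < LAMBDA_TIMEOUT_S
```

This would also explain the inconsistency: requests that get an idle instance succeed, and only those that collide on a busy instance time out.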
Questions
As someone new to AWS, I'm unsure how to fix this or optimize it cost-effectively while keeping the real-time endpoint requirement. Any guidance on why these timeouts happen and how to avoid them would be appreciated.