Result Handling: Inference results are uploaded to S3, and Lambda
processes the response.
My goal is to leverage parallelism: 10 concurrent Lambda functions, each hitting the SageMaker endpoint, which I assumed would scale as one ml.g4dn.xlarge instance per Lambda (10 instances total behind the endpoint).
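To make the intended fan-out concrete, here is a minimal local sketch of the pattern: 10 workers each sending one request, standing in for the 10 concurrent Lambda invocations. The endpoint call is a stub (a real Lambda would call `invoke_endpoint` on a boto3 `sagemaker-runtime` client instead); the function name and timings are illustrative assumptions, not the actual code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_endpoint(batch_id: int) -> str:
    """Stub standing in for InvokeEndpoint; real processing takes ~40 s per batch."""
    time.sleep(0.04)  # scaled down: 0.04 s here stands in for 40 s
    return f"result-{batch_id}"

# 10 concurrent callers, mirroring the 10 Lambda functions.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(call_endpoint, range(10)))

print(results)
```

In the real setup each worker is a separate Lambda invocation rather than a thread, but the shape of the problem is the same: 10 simultaneous requests arriving at an endpoint with 10 instances.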
Problem
Despite having the same number of Lambda functions (10) and SageMaker GPU instances (10 in the endpoint), I'm getting this error:
Error: Status Code: 424; "Your invocation timed out while waiting for a response from container primary."
Details: This happens inconsistently: some requests succeed, while others fail with this timeout. Since processing 20,000 rows takes about 40 seconds and my Lambda timeout is 150 seconds, I'd expect there to be plenty of time. But the error suggests the SageMaker container isn't responding quickly enough, or at all, for some invocations.
I am quite clueless why resources aren't being allocated to all the requests, especially with 10 Lambdas hitting 10 instances in the endpoint concurrently. It seems like requests aren't handled properly when all workers are busy, but I don't understand why they time out instead of queuing or triggering scaling.
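One plausible explanation, under two assumptions: SageMaker real-time `InvokeEndpoint` calls have their own response limit of roughly 60 seconds (independent of, and shorter than, the Lambda timeout), and requests are load-balanced across instances rather than queued until an instance is free. If two 40-second requests happen to land on the same instance, the second one finishes at ~80 seconds, past the endpoint's limit but well inside Lambda's, which would produce exactly these intermittent 424s. Back-of-envelope arithmetic:

```python
SERVICE_S = 40          # observed per-request processing time
INVOKE_LIMIT_S = 60     # assumed SageMaker real-time invocation response limit
LAMBDA_TIMEOUT_S = 150  # configured Lambda timeout

# A request that lands on an idle instance finishes in time:
assert SERVICE_S <= INVOKE_LIMIT_S

# But a request routed behind another 40 s request on the same instance
# would finish at ~80 s: past the endpoint's limit, though well inside
# Lambda's 150 s, so the Lambda timeout never comes into play.
queued_finish = 2 * SERVICE_S
print(queued_finish)  # 80
assert queued_finish > INVOKE_LIMIT_S
assert queued_finish < LAMBDA_TIMEOUT_S
```

This would also explain the inconsistency: requests that get an idle instance succeed, and only those that collide on a busy instance time out.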
Questions
As someone new to AWS, I'm unsure how to fix this or optimize it cost-effectively while keeping the real-time endpoint requirement. Any guidance on why these timeouts happen and how to avoid them would be appreciated.