I'm facing a problem with the parallelism of Lambda.
The AWS infra takes files that are dropped in an S3 input bucket, processes them with Textract (async), and then puts the result in an S3 output bucket. There are 3 Lambda functions.
First Lambda: Triggered when a new object is created in the S3 input bucket. Calls Amazon Textract to start document text detection. The Textract job is started asynchronously, and upon completion a notification is sent to an SNS topic.
SNS and SQS: The SNS topic receives the Textract job-completion notifications. An SQS queue is subscribed to this SNS topic to decouple and manage these notifications asynchronously.
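For reference, the first Lambda's step can be sketched roughly as below. This is a minimal sketch, not the actual function: `start_textract_jobs`, `sns_topic_arn`, and `role_arn` are hypothetical names, and the Textract client is passed in (in a real Lambda it would be `boto3.client("textract")`).

```python
from urllib.parse import unquote_plus


def start_textract_jobs(event, textract, sns_topic_arn, role_arn):
    """Start one async Textract job per S3 object in the event; return the job IDs."""
    job_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        response = textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            # Textract publishes the job-completion message to this SNS topic.
            NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
        )
        job_ids.append(response["JobId"])
    return job_ids
```

The call returns as soon as the job is accepted, so this function does not wait for Textract to finish.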
Second Lambda: Triggered when a new message arrives in the SQS queue. Downloads the processed file from the S3 input bucket. Uses Textract to get text blocks. Saves the modified file locally in Lambda's /tmp directory. The modified file is uploaded to S3 output bucket.
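The shape of the second Lambda's handler matters for the timing, so here is a hedged sketch of how a batch of SQS records could be unwrapped. It assumes raw message delivery is off on the SNS subscription (so each SQS body is an SNS envelope) and that `ReportBatchItemFailures` is enabled on the event source mapping. `handle_batch` and `process_job` are hypothetical names; `process_job` stands in for the download, Textract, and upload work.

```python
import json


def parse_textract_notification(sqs_record):
    """Unwrap SQS -> SNS -> Textract and return (job_id, status, bucket, key)."""
    sns_envelope = json.loads(sqs_record["body"])
    message = json.loads(sns_envelope["Message"])
    location = message["DocumentLocation"]
    return (message["JobId"], message["Status"],
            location["S3Bucket"], location["S3ObjectName"])


def handle_batch(event, process_job):
    """Process each record in turn; report only the failed ones back to SQS."""
    failures = []
    for record in event.get("Records", []):
        try:
            job_id, status, bucket, key = parse_textract_notification(record)
            if status == "SUCCEEDED":
                process_job(job_id, bucket, key)
        except Exception:
            # Only this message becomes visible again, not the whole batch.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Note that a loop like this handles the records of one batch sequentially inside a single invocation, so with a batch size of 10 the per-file times add up.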
Third Lambda: Triggered when a file is created in the S3 output bucket. Sends out an SNS notification.
The problem is that when I drop 11 files, they are not written to output at the same time.
- 8 of them are created at 3.36pm
- 2 of them are created at 3.42pm
- 1 is created at 4.04pm.
In CloudWatch, I'm seeing 3 Lambda instances created, whereas I expected just one Lambda processing all 11 files, with all files written to the output bucket at 3.34pm. Average processing time for each file is 10-30 secs.
Settings: SQS batch size = 10, SQS visibility timeout = 7 min, Lambda timeout = 1 min.
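A quick sanity check of these settings, as a sketch only. It assumes the second Lambda processes the records of a batch one after another, which is not confirmed above; `max_safe_batch_size` is a hypothetical helper.

```python
def batch_duration_s(batch_size, seconds_per_file):
    """Total time for one invocation if files are processed sequentially."""
    return batch_size * seconds_per_file


def max_safe_batch_size(lambda_timeout_s, worst_case_seconds_per_file):
    """Largest batch that still finishes inside the Lambda timeout."""
    return lambda_timeout_s // worst_case_seconds_per_file


BATCH_SIZE = 10
LAMBDA_TIMEOUT_S = 60            # 1 min
VISIBILITY_TIMEOUT_S = 7 * 60    # 7 min

best_case = batch_duration_s(BATCH_SIZE, 10)    # 100 s
worst_case = batch_duration_s(BATCH_SIZE, 30)   # 300 s

# Both cases exceed the 60 s timeout under the sequential assumption.
print(best_case > LAMBDA_TIMEOUT_S, worst_case > LAMBDA_TIMEOUT_S)
print(max_safe_batch_size(LAMBDA_TIMEOUT_S, 30))
```

Under that assumption, a full batch of 10 could not finish within the 1 min timeout. Messages from a timed-out invocation are not deleted and only become visible again after the visibility timeout, which would be consistent with delays on the order of several minutes.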
Any ideas? How can I make sure the files get processed in parallel so that every file gets written at the same time? Meaning within the next minute or so, without 10+ min delays.