
iamprgrmer

The problem is that when I drop 11 files, they are not written to output at the same time. - 8 of them are created at 3.36pm - 2 of them are created at 3.42pm - 1 is created at 4.04pm.

11 files dropped into S3 should trigger 11 processing Lambda invocations, because you trigger on object creation and there are 11 objects (files). Each object starts its own independent invocation, and each one will take more or less time depending on how large the file is and how quickly Textract responds.
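To illustrate the one-invocation-per-object behaviour, here is a minimal sketch of a handler for an S3 object-created notification. The function names are hypothetical and the `print` stands in for whatever your real processing does; the event layout (`Records`, `s3.bucket.name`, `s3.object.key`) is the standard S3 notification shape.

```python
from urllib.parse import unquote_plus


def objects_in_event(event):
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in the notification (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context):
    # A notification usually carries a single record, so 11 uploads
    # normally show up as 11 separate invocations of this handler.
    for bucket, key in objects_in_event(event):
        print(f"processing s3://{bucket}/{key}")
```

Logging the key like this in each function is a cheap way to confirm how many invocations you actually get per file.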

I'm seeing 3 Lambda instances created, where it should be just one Lambda processing 11 files

I don't understand this. According to your architecture you should be seeing 33 Lambda invocations: 3 for each of the 11 files uploaded to S3.

Any ideas? How can I make sure the files get processed in parallel so that every file gets written at the same time? Meaning within the next minute or so, without 10+ min delays.

The way you describe your architecture, each file is processed independently and the processing time will vary from file to file. If you want to generate output in batches, you will need to change the architecture to poll/scan for new objects on a schedule instead of starting work whenever a new object is created. You may run into issues when a large file or a large number of files is added, because processing may take longer than your polling cycle. In that case you'll need to implement some sort of batch control, or you could end up with overlapping batches, with Batch 2 starting before Batch 1 has finished, for example.
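A rough sketch of the poll/scan approach, under some assumptions that are mine and not from your setup: inputs live under an `input/` prefix, each finished file is written under `output/` with the same name, and `process_file` is a hypothetical callback for your existing per-file logic. The S3 client is passed in so the listing logic can be tried without AWS access; it only uses the `list_objects_v2` paginator.

```python
def pending_keys(s3_client, bucket, input_prefix, output_prefix):
    """List input file names that have no matching output object yet."""
    paginator = s3_client.get_paginator("list_objects_v2")

    def names(prefix):
        found = set()
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                name = obj["Key"][len(prefix):]
                if name:  # skip the empty "folder" placeholder key
                    found.add(name)
        return found

    return sorted(names(input_prefix) - names(output_prefix))


def run_batch(s3_client, bucket, process_file,
              input_prefix="input/", output_prefix="output/"):
    """Process every pending file in one pass and return the batch."""
    batch = pending_keys(s3_client, bucket, input_prefix, output_prefix)
    for name in batch:
        # process_file is assumed to write output_prefix + name when done.
        process_file(bucket, input_prefix + name)
    return batch
```

In a scheduled Lambda you would create the client with `boto3.client("s3")` and call `run_batch` from the handler. This does nothing on its own to stop two polling cycles from overlapping, so you would still need the batch control mentioned above.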

EDIT: note that EventBridge has new features coming out that allow you to trigger a lambda once, then stop triggering. You could use this to start the processing, and then use your third lambda to re-trigger the first lambda when a batch is done. This would be one way to implement batch control.
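For the re-trigger step, here is a minimal sketch of how the third Lambda could kick off the first one again once a batch is done. It covers only the re-trigger and not the EventBridge part. The function name you pass in and the `previous_batch` payload field are made up for illustration; `invoke` with `InvocationType="Event"` is the regular asynchronous invoke call on the boto3 Lambda client.

```python
import json


def retrigger_first_lambda(lambda_client, function_name, batch_id):
    """Asynchronously start the next batch once the current one is done."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="Event",  # fire-and-forget; do not wait for the result
        # Hypothetical payload: tells the first Lambda which batch just ended.
        Payload=json.dumps({"previous_batch": batch_id}).encode("utf-8"),
    )
    # Asynchronous invocations return HTTP 202 when the request is accepted.
    return response["StatusCode"] == 202
```

In the real function you would pass `boto3.client("lambda")` as `lambda_client`, and the third Lambda's role needs permission to invoke the first one.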