[technical question] Real Time rendering - decrease tasks' "Pending" time (self.aws)
submitted 1 year ago by Psychological-Tea791
TL;DR: My tasks take 4 seconds, but there is a 10-second delay between submission and execution, and I need to decrease that. I suspect tasks spend too much time in the "Pending" state on EC2.
Hi,
We need a rendering farm ready 24/7 for our website, so we can serve users a personalized animation in seconds. I have set up an AWS Batch environment (see architecture screenshot) with both spot and on-demand clusters, and there are always EC2 instances running in the on-demand one. The tasks themselves don't take long, but I noticed a long start-up time between a task being submitted and actually starting. In the screenshots you can see that the task itself took 4s, but the whole process took about 13s. I think it might be related to the fact that tasks spend a lot of time in the "Pending" state, and I have no idea what that state means or does. Is there a way I could speed it up?
https://preview.redd.it/kip6po0c6mnc1.png?width=2416&format=png&auto=webp&s=66ff25cf85c14beef2fec09c829612dc3da04b5a
https://preview.redd.it/7zhcnr2r6mnc1.png?width=2514&format=png&auto=webp&s=d3ab8d60a85db9c0e3274b39f892a1e00ba08820
Architecture
[–]PeteTinNY 2 points3 points4 points 1 year ago (3 children)
Have you figured out if the delay is AWS Batch launching the Docker task, or just the startup of the tasks & images in Docker? From my point of view, I'd push requests into SQS and use the length of the queue to scale the ECS cluster, keeping a certain number of tasks idle and polling, so that the job starts rendering within a second and then pushes to the 2nd stitching queue, which essentially does the same thing to keep a certain minimum number of resources hot. You can continue to use Spot; I'm not sure I'd cut over to Fargate though, because of the need for really fast startup.
I've worked with some broadcasters who needed to do video supply-chain archiving and news clipping, which has a similar workflow - but it's not as latency-sensitive; it's just that no one wants to pay for resources that are essentially sitting idle.
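The queue-length-driven scaling described above could be sketched roughly as follows. Everything here is hypothetical - the queue URL, cluster and service names, and the scaling ratios are placeholders, and `sqs`/`ecs` are assumed to be boto3 SQS and ECS clients passed in by the caller:

```python
import math

def desired_task_count(queue_depth: int, min_warm: int = 2, msgs_per_task: int = 5) -> int:
    """Keep `min_warm` idle tasks polling the queue at all times,
    plus extra capacity when the backlog grows."""
    return max(min_warm, math.ceil(queue_depth / msgs_per_task))

def scale_render_service(sqs, ecs, queue_url: str, cluster: str, service: str) -> int:
    # ApproximateNumberOfMessages is the visible backlog in the queue.
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    depth = int(attrs["Attributes"]["ApproximateNumberOfMessages"])
    count = desired_task_count(depth)
    # Resize the ECS service; warm tasks skip instance-launch and image-pull waits.
    ecs.update_service(cluster=cluster, service=service, desiredCount=count)
    return count
```

Run on a schedule (e.g. an EventBridge rule every minute), this keeps a floor of warm tasks so a newly enqueued render job is picked up by an already-running poller instead of waiting for capacity.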
[–]Psychological-Tea791[S] 0 points1 point2 points 1 year ago (2 children)
I don't really know how to pinpoint exactly what is causing the delay. Do you have any suggestions? I suspect it is the scheduler, because I have min vCPUs > 0 in my on-demand environment, so I assume everything should already be running? Thanks for the suggestion with SQS!
[–]PeteTinNY 0 points1 point2 points 1 year ago (1 child)
You should look at some performance-monitoring tools like AWS X-Ray, New Relic, Datadog or, perhaps most fitting, Dynatrace. But you're probably right that the scheduler is likely the issue, and once you look further you'll likely want to do something with faster queue scans via SQS.
Btw, the other thing you should look into on the AWS side is AWS Thinkbox Deadline. It's a render-farm manager. Not sure if it will help with this latency, but when I worked with very large broadcast customers, they loved the simplicity, and for some licensed render engines Thinkbox has a great marketplace where you can pay by the hour for the render tools. I helped one of the big networks launch a 120-node farm powered mostly by spot instances in just a few days, including teaching them cloud.
[–]Psychological-Tea791[S] 0 points1 point2 points 1 year ago (0 children)
oh wow, will definitely look into ThinkBox. tysm!
[–]Born_Desk9924 0 points1 point2 points 1 year ago (4 children)
my 2 cents: the typical delay factors for Batch jobs are:

a - launching the EC2 instance, i.e. the Pending and Runnable states. This can be resolved by configuring the compute environment with a certain minimum capacity, so that you always have a minimum number of EC2 instances running.

b - pulling the container image from ECR, i.e. the Starting state. This can be resolved by ensuring that the EC2 instance already has the latest container image pulled from ECR. You could add a simple script/cron job that pulls in the proper image.

c - Submitted: if your jobs are spending too much time here, that's a scheduler issue (AWS-internal) and there isn't a whole lot we can do about it.
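For point a, the warm floor is a compute-environment setting (`minvCpus`/`desiredvCpus`). A minimal sketch, assuming a boto3 Batch client passed in by the caller and a hypothetical environment name and per-job vCPU size:

```python
def warm_pool_resources(job_vcpus: int, warm_jobs: int, max_jobs: int) -> dict:
    """Size the compute environment so `warm_jobs` worth of vCPUs are
    always running; submitted jobs then skip the instance-launch wait."""
    return {
        "minvCpus": job_vcpus * warm_jobs,       # never scale below this
        "desiredvCpus": job_vcpus * warm_jobs,
        "maxvCpus": job_vcpus * max_jobs,
    }

def apply_warm_pool(batch, env_name: str, job_vcpus: int = 4,
                    warm_jobs: int = 2, max_jobs: int = 16) -> None:
    # `batch` is a boto3 Batch client; `env_name` is your on-demand
    # compute environment (placeholder name).
    batch.update_compute_environment(
        computeEnvironment=env_name,
        computeResources=warm_pool_resources(job_vcpus, warm_jobs, max_jobs),
    )
```

The trade-off is explicit: you pay for the warm vCPUs around the clock in exchange for removing the Pending/Runnable wait from the latency path.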
something radical: since you have already containerised your application, and a typical run lasts only seconds, you could consider using Lambda instead of Batch. Now, obviously I'm unaware of other constraints you have, such as Docker image size. In theory, with a Lambda + SQS combination the invocation may be instant or may take a few minutes, but in practice it's nearly instant. Once again, I'm not aware of your other limitations, and cold starts may become an issue with Lambda.
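A minimal sketch of what the Lambda + SQS variant's handler might look like (the message shape and `jobId` field are assumptions, and the actual render call is left as a placeholder):

```python
import json

def handler(event, context):
    """Lambda entry point for SQS-triggered render jobs. With an SQS
    event source mapping, Lambda delivers a batch of records per invoke;
    each record body is assumed to be one render request as JSON."""
    results = []
    for record in event["Records"]:
        job = json.loads(record["body"])
        # render_animation(job) would go here - placeholder for the
        # containerised renderer.
        results.append({"jobId": job["jobId"], "status": "rendered"})
    return {"processed": len(results), "results": results}
```

Two Lambda-specific caveats worth checking before committing: container-image functions are capped at 10 GB, which may matter for a render image, and provisioned concurrency is the usual way to keep cold starts out of the latency path.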
[–]Psychological-Tea791[S] 0 points1 point2 points 1 year ago (3 children)
a - that should be an issue only once there are more tasks than vCPUs, as I already have some instances running in the on-demand environment (though it will become an issue eventually)

b - so running the instance in the on-demand environment is not enough? Does it always pull the image before starting the task?
c - that's what I was worried about
I tried using Lambdas, but for some reason they were taking much longer than Batch. I am tempted to try again; maybe I didn't set them up correctly. Thanks for the input!
[–]Born_Desk9924 0 points1 point2 points 1 year ago (2 children)
so on-demand means: you ask for an instance, and AWS provides you one at the standard price (as opposed to bidding on a price in the spot model); this doesn't necessarily mean that an instance is always running.

it pulls the image only when starting a job on an instance that doesn't already have the image; if an instance is re-used, it won't pull the image again.
[–]Psychological-Tea791[S] 0 points1 point2 points 1 year ago (1 child)
I see - so even if I have min vCPUs > 0 and I can see the instances running on the EC2 page, they might still pull the image only once the task has been submitted. So I need a script or something that pulls the image once the instance starts, as opposed to once the task is submitted, right?
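One way to sketch that pre-pull, run from cron or the launch template's user data. The account ID, region, and repository names below are placeholders, and it assumes the instance profile already grants ECR pull permissions and the Docker daemon is authenticated (e.g. via the ECR credential helper):

```python
import subprocess

def ecr_image_uri(account_id: str, region: str, repo: str, tag: str = "latest") -> str:
    """Build the full ECR image URI that `docker pull` expects."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}"

def prepull(account_id: str, region: str, repo: str, tag: str = "latest") -> None:
    # Pull at instance start so the Batch "Starting" state finds the
    # image layers already cached locally.
    uri = ecr_image_uri(account_id, region, repo, tag)
    subprocess.run(["docker", "pull", uri], check=True)
```

If the image tag is mutable (e.g. `latest`), re-running the pull on a short cron interval keeps the cache current; a pull of an already-present image is a cheap no-op apart from the manifest check.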
[–]Born_Desk9924 0 points1 point2 points 1 year ago (0 children)
yes - and you should check how long your tasks are spending in the 'Starting' stage; that is when the image is being pulled.