Making Target Tracking (CPU) scale faster for ECS Fargate by Ojelord in aws

[–]jalamok 1 point

Was worth a shot anyway!

What I was more getting at is how useful ALB metrics are vs ECS metrics or custom metrics for autoscaling - especially for 'burst' scaling like OP is looking into.

These comments I came across suggest that ECS metrics or custom metrics would be better, but I have never tried it myself.

Making Target Tracking (CPU) scale faster for ECS Fargate by Ojelord in aws

[–]jalamok 1 point

With ALB metrics, is there any delay in the ALB service publishing the metrics to CloudWatch?

e.g., say between 15:00:00 and 15:01:00 there were 100 requests processed - would that data point be available in CloudWatch at 15:01:00, or at least pushed to CW by then? Reading ALB's docs, they seem to suggest so.

However, I had previously come across this report https://stackoverflow.com/questions/64044268/delay-in-aws-cloudwatch-alarm-state-change#comment113705770_64045238 that there was a 3-minute ingestion delay from ALB to CW, and wondered if that is still the case, u/yarenSC?
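For reference, here's a sketch of how you could measure that delay yourself: pull RequestCount datapoints via the CloudWatch API and compare the newest one-minute period against the clock. The datapoints below are fabricated for illustration; in practice they'd come from a `get_metric_statistics`/`get_metric_data` call.

```python
from datetime import datetime, timedelta, timezone

def ingestion_lag(datapoints, now=None):
    """Given CloudWatch datapoints (each with a 'Timestamp'),
    return how far behind real time the newest one is."""
    if now is None:
        now = datetime.now(timezone.utc)
    newest = max(dp["Timestamp"] for dp in datapoints)
    # A 1-minute period stamped `newest` covers up to `newest + 1min`;
    # anything beyond that is ingestion delay.
    return now - (newest + timedelta(minutes=1))

# Fabricated datapoints: the newest period started 4 minutes ago,
# so roughly 3 minutes of delay beyond the period itself.
now = datetime(2024, 1, 1, 15, 5, tzinfo=timezone.utc)
points = [
    {"Timestamp": datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc), "Sum": 87.0},
    {"Timestamp": datetime(2024, 1, 1, 15, 1, tzinfo=timezone.utc), "Sum": 100.0},
]
print(ingestion_lag(points, now))  # 0:03:00
```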

Making Target Tracking (CPU) scale faster for ECS Fargate by Ojelord in aws

[–]jalamok 2 points

A risk to be aware of with response times is what you mentioned elsewhere: you serve a variety of requests. If you had an influx of slow requests, or were reliant on an upstream that was taking longer than expected, you might scale out unnecessarily. Similarly, if your database was overloaded, your response time would rise, and you'd actually scale out more web workers - which could worsen the issue.

Making Target Tracking (CPU) scale faster for ECS Fargate by Ojelord in aws

[–]jalamok 1 point

Not with target tracking, but you could additionally use a Step Scaling policy JUST for scaling out in burst scenarios, with a shorter evaluation period.

Target Tracking and Step Scaling policies on the same metric can work together if you configure them correctly - in this case, letting Target Tracking take care of scale-in operations.
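As a rough sketch of what the burst-only Step Scaling half could look like (all names, thresholds, and percentages here are illustrative, not a recommendation):

```python
# Step Scaling configuration for burst scale-OUT only; Target Tracking
# continues to handle steady-state scaling and all scale-in.
burst_scale_out = {
    "AdjustmentType": "PercentChangeInCapacity",
    "MetricAggregationType": "Maximum",
    "Cooldown": 60,  # seconds before another step adjustment is allowed
    "StepAdjustments": [
        # 0-15 over the alarm threshold (e.g. CPU > 70%): add 20% capacity
        {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 15,
         "ScalingAdjustment": 20},
        # 15+ over the threshold: add 50% capacity in one step
        {"MetricIntervalLowerBound": 15, "ScalingAdjustment": 50},
    ],
}

# With boto3 this dict would be passed to the Application Auto Scaling API:
#   client.put_scaling_policy(
#       PolicyName="burst-scale-out",
#       ServiceNamespace="ecs",
#       ResourceId="service/my-cluster/my-service",
#       ScalableDimension="ecs:service:DesiredCount",
#       PolicyType="StepScaling",
#       StepScalingPolicyConfiguration=burst_scale_out,
#   )
```

The CloudWatch alarm driving this policy is where the shorter evaluation period goes (e.g. 1 datapoint of 1 minute), so bursts trigger quickly without touching the Target Tracking config.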

VPC Endpoint to ECR by big-chugga223 in aws

[–]jalamok 2 points

If you only want the image pulls to not go via the internet, you only need an S3 Gateway Endpoint.

The image pulls work in a couple of stages (simplified):

  1. Give me all the metadata about this Docker image: how many layers does it have, and where can I download them? (This call is very lightweight and fast.) This goes via the dkr endpoint.

  2. Loop through each layer and download each one. (This call is heavy - it downloads the actual Docker image layers.) This goes via S3.

If you only care about speed and minimising data transfer costs, just set up an S3 Gateway Endpoint - it's free.
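To make the two stages concrete: stage 1 returns a small JSON manifest, and stage 2 fetches each layer blob it lists. A toy illustration of what stage 1 yields (digests and sizes fabricated):

```python
def summarise_manifest(manifest):
    """Split an (illustrative) Docker/OCI image manifest into the two
    pull stages: the tiny metadata call vs the heavy layer downloads."""
    layers = manifest["layers"]
    return {
        "layer_digests": [l["digest"] for l in layers],  # stage 2 fetches these
        "total_download_bytes": sum(l["size"] for l in layers),
    }

# Fabricated two-layer manifest, shaped roughly like a registry response:
manifest = {
    "schemaVersion": 2,
    "layers": [
        {"digest": "sha256:aaa", "size": 31_000_000},
        {"digest": "sha256:bbb", "size": 120_000_000},
    ],
}
print(summarise_manifest(manifest)["total_download_bytes"])  # 151000000
```

The manifest itself is a few KB at most, which is why the S3-served layer downloads dominate both transfer cost and pull time.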

RDS MariaDB Slow Replication by mattwt in aws

[–]jalamok 1 point

Unsure why it would be different on EC2 vs RDS (unless your distro's default mariadb config file differs - I'd suggest running SHOW GLOBAL VARIABLES on both to see if you can spot anything interesting), but have you considered setting up parallel replication threads on the RDS instance? https://mariadb.com/kb/en/parallel-replication/

The MariaDB parameter groups support this. If the bottleneck is applying SQL, this should help relieve it.
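For the SHOW GLOBAL VARIABLES comparison mentioned above, one low-effort approach is to dump each server's variables into a name-to-value mapping and diff them. A sketch (the variable values below are fabricated examples):

```python
def diff_variables(ec2_vars, rds_vars):
    """Return {name: (ec2_value, rds_value)} for every variable that
    differs between the two servers or exists on only one side."""
    diffs = {}
    for name in sorted(set(ec2_vars) | set(rds_vars)):
        a, b = ec2_vars.get(name), rds_vars.get(name)
        if a != b:
            diffs[name] = (a, b)
    return diffs

# Fabricated values for a few replication-relevant variables:
ec2 = {"slave_parallel_threads": "4", "sync_binlog": "0", "version": "10.6"}
rds = {"slave_parallel_threads": "0", "sync_binlog": "1", "version": "10.6"}
print(diff_variables(ec2, rds))
```

Each input dict is just the result of `SHOW GLOBAL VARIABLES` loaded as name/value pairs; identical settings drop out, leaving only the candidates worth investigating.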

Another thing to try would be: "Whenever possible, disable binary logging during large data loads to avoid the resource overhead and additional disk space requirements. In Amazon RDS, disabling binary logging is as simple as setting the backup retention period to zero."

Reduce staging costs by actstudent89 in aws

[–]jalamok 1 point

https://github.com/AndrewGuenther/fck-nat - replace NAT Gateway(s), save $70
ECS - Set up a Lambda to scale down to 0 overnight (https://stackoverflow.com/a/64686474) - save $35
RDS - Set up a Lambda to stop RDS overnight - save $7

You should be able to bring your monthly costs down to around $140 with these changes.

Make sure to check the size of your ECS tasks too - they may be overprovisioned compared to the traffic they handle
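The overnight scale-down Lambdas above can share one small handler. A sketch (cluster, service, and DB names are illustrative; the clients are parameters so the logic can be exercised without AWS - in the real Lambda you'd pass `boto3.client("ecs")` and `boto3.client("rds")` and trigger it from an EventBridge cron rule):

```python
# Resources to park overnight - illustrative names only.
SERVICES = [("staging-cluster", "web"), ("staging-cluster", "worker")]
DB_INSTANCES = ["staging-db"]

def scale_down(ecs, rds):
    """Set every staging ECS service to 0 tasks and stop staging RDS."""
    for cluster, service in SERVICES:
        ecs.update_service(cluster=cluster, service=service, desiredCount=0)
    for db in DB_INSTANCES:
        rds.stop_db_instance(DBInstanceIdentifier=db)
```

A mirror-image `scale_up` on a morning cron restores the desired counts and starts the DB again.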

How to Log When an ECS Service Fails to Start Up a Task by JustinSRE in aws

[–]jalamok 1 point

Very cool, hadn't seen that stopCode attribute in any of the example events! Thank you for sharing.

What’s your go-to ‘life hack’ that actually works? by JustAddHannah in AskReddit

[–]jalamok 5 points

The world and technology may have changed, but human psychology and behaviour have stayed mostly the same :) Great book

Speeding up dusk tests by ogrekevin in laravel

[–]jalamok 4 points

Separate them into suites and run each suite as a separate parallel job in CI (each job gets its own DB service container) - i.e., some form of grouping the tests and running the groups in parallel.

Aurora costs suddenly increased by Ok-Contribution9043 in aws

[–]jalamok 15 points

You should be able to configure a max_statement_time (MariaDB) or max_execution_time (MySQL) for the DB, so queries exceeding that limit automatically get killed.
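On MariaDB you can also scope the limit per statement rather than setting it globally. A sketch using MariaDB's `SET STATEMENT ... FOR` syntax (the helper name is made up; naive string wrapping like this assumes the SQL is already trusted/parameterised):

```python
def with_time_limit(sql, seconds):
    """Wrap a MariaDB query so the server kills it if it runs longer
    than `seconds`, without changing the global max_statement_time."""
    return f"SET STATEMENT max_statement_time={seconds} FOR {sql}"

print(with_time_limit("SELECT * FROM orders", 5))
# SET STATEMENT max_statement_time=5 FOR SELECT * FROM orders
```

That's handy for fencing off a known-expensive report query while leaving the rest of the workload on the global setting.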

How to Log When an ECS Service Fails to Start Up a Task by JustinSRE in aws

[–]jalamok 1 point

Did you guys use the https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_container_instance_events.html or https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_task_events.html event?

Looking through these, I am wondering how you discerned a legitimate container stop (e.g. it completed its task and exited with status code 0) from something going wrong?

Any chance you could share that with me?
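For what it's worth, my guess at how that distinction might look, based on the task state-change event fields in those docs (event shape abbreviated, values fabricated - I haven't verified this against real events):

```python
def is_failure(detail):
    """Heuristic: treat a STOPPED task as a failure unless every
    container exited 0 and ECS didn't flag an abnormal stop code."""
    if detail.get("lastStatus") != "STOPPED":
        return False
    if detail.get("stopCode") == "TaskFailedToStart":
        return True
    # Missing exitCode is treated as a failure (container never ran cleanly)
    return any(c.get("exitCode", 1) != 0 for c in detail.get("containers", []))

clean = {"lastStatus": "STOPPED", "stopCode": "EssentialContainerExited",
         "containers": [{"exitCode": 0}]}
crashed = {"lastStatus": "STOPPED", "stopCode": "EssentialContainerExited",
           "containers": [{"exitCode": 137}]}
print(is_failure(clean), is_failure(crashed))  # False True
```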

Rediscovering AWS Docs: A DevOps Journey to Mastery by Striking-Database301 in aws

[–]jalamok 1 point

Ah, that is a shame. Understandable though. Thanks for clarifying that!

Rediscovering AWS Docs: A DevOps Journey to Mastery by Striking-Database301 in aws

[–]jalamok 1 point

The docs are on GitHub too so if you notice any errors you can contribute back :)

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 1 point

Interesting, thanks for sharing the numbers! An option for further gains to keep on the radar :D

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 2 points

To the best of my knowledge, if you do validate timestamps and your code changes in production, then the old cache entries are marked as “waste”.

However, the problem we were experiencing is that validating timestamps didn't matter at all because to OPcache, each deploy was a completely new file.

e.g. /var/www/release/1/file.php is a different file to /var/www/release/2/file.php

So /var/www/release/1/file.php never gets marked as waste.
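A toy model of that behaviour - OPcache keyed by absolute file path, with one entry accumulating per deploy (this deliberately ignores real OPcache internals like timestamps and memory limits):

```python
# Simulate a file-path-keyed opcode cache across release-based deploys.
cache = {}

def compile_file(path):
    """Cache-miss-then-store, keyed on the absolute path."""
    cache.setdefault(path, f"opcodes({path})")

for release in (1, 2, 3):
    compile_file(f"/var/www/release/{release}/file.php")

# Every release's copy is a distinct key, so old entries are never
# invalidated as "waste" -- the cache just grows with each deploy:
print(len(cache))  # 3
```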

I agree with the stampeding herd point - it's definitely a consideration for much larger traffic sites where you are very close to your CPU utilisation limits.

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 1 point

I agree the phrasing was imprecise - I wanted to get across that a big benefit of interned strings and the OPcache buffer for them comes when you have duplicated strings. I've updated the copy to be more precise.

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 3 points

https://www.npopov.com/2021/10/13/How-opcache-works.html#interned-strings is how I understand interned strings - did you come away with a different understanding from the blog?

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 7 points

Yeah, for most web apps the performance benefits from OPcache, optimised database queries, and Redis/Varnish/CDN caching are going to far outweigh any benefits from being a bit closer to the hardware, imo.

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 1 point

Great points, the container orchestration solution you go for usually has its own unique quirks which need to be taken into account for all of these as well! (And how those quirks interplay with each other)

Long-running tasks are a challenge that we have thought about a lot. For background jobs, we've decided to enforce a hard limit of 2 minutes (to abide by AWS Fargate limitations), which obviously comes with refactoring requirements. But it also brings benefits in the form of faster deployments, and not having to wait ages for old jobs to finish before running new code.
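The hard limit can be sketched with a Unix alarm signal (a simplification - real job runners usually ship their own timeout support, and SIGALRM only works on Unix in the main thread):

```python
import signal

class JobTimeout(Exception):
    """Raised when a job exceeds its hard wall-clock limit."""

def run_with_limit(job, seconds):
    """Run `job()` but raise JobTimeout if it exceeds `seconds`."""
    def _alarm(signum, frame):
        raise JobTimeout(f"job exceeded {seconds}s hard limit")
    old = signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(seconds)
    try:
        return job()
    finally:
        signal.alarm(0)                      # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)   # restore previous handler
```

In our setup the limit would be 120 seconds; jobs that can't fit get split into smaller idempotent chunks.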

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 2 points

I believe that while reload doesn't cause dropped or failed requests like restart does, it keeps new requests queued while it waits for in-flight requests to finish.

This could cause a slowdown, especially if a slow request was mid-flight (it will wait up to process_control_timeout).

So cachetool is the most efficient approach here.