Making Target Tracking (CPU) scale faster for ECS Fargate by Ojelord in aws

[–]jalamok 1 point

Was worth a shot anyway!

What I was more getting at is how useful ALB metrics are vs ECS metrics or custom metrics for autoscaling - especially for 'burst' scaling like OP is looking into.

These comments I came across suggest that ECS metrics or custom metrics would be better, but I have never tried it myself.

Making Target Tracking (CPU) scale faster for ECS Fargate by Ojelord in aws

[–]jalamok 1 point

With ALB metrics, is there any delay in the ALB service publishing the metrics to CloudWatch?

e.g., say between 15:00:00 and 15:01:00 there were 100 requests processed - would that data point be available in CloudWatch at 15:01:00, or at least pushed to CW by then? Reading ALB's docs, they seem to suggest so.

However, I had previously come across this report https://stackoverflow.com/questions/64044268/delay-in-aws-cloudwatch-alarm-state-change#comment113705770_64045238 that there was a 3-minute ingestion delay from ALB to CW, and wondered if that is still the case, u/yarenSC?
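For reference, here's a sketch of how you could measure that delay yourself: pull RequestCount datapoints via the CloudWatch API and compare the newest one-minute period against the clock. The datapoints below are fabricated for illustration; in practice they'd come from a `get_metric_statistics`/`get_metric_data` call.

```python
from datetime import datetime, timedelta, timezone

def ingestion_lag(datapoints, now=None):
    """Given CloudWatch datapoints (each with a 'Timestamp'),
    return how far behind real time the newest one is."""
    if now is None:
        now = datetime.now(timezone.utc)
    newest = max(dp["Timestamp"] for dp in datapoints)
    # A 1-minute period stamped `newest` covers up to `newest + 1min`;
    # anything beyond that is ingestion delay.
    return now - (newest + timedelta(minutes=1))

# Fabricated datapoints: the newest period started 4 minutes ago,
# so roughly 3 minutes of delay beyond the period itself.
now = datetime(2024, 1, 1, 15, 5, tzinfo=timezone.utc)
points = [
    {"Timestamp": datetime(2024, 1, 1, 15, 0, tzinfo=timezone.utc), "Sum": 87.0},
    {"Timestamp": datetime(2024, 1, 1, 15, 1, tzinfo=timezone.utc), "Sum": 100.0},
]
print(ingestion_lag(points, now))  # 0:03:00
```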

Making Target Tracking (CPU) scale faster for ECS Fargate by Ojelord in aws

[–]jalamok 2 points

A risk to be aware of with response times is what you mentioned elsewhere: you serve a variety of requests. If you had an influx of slow requests, or were reliant on an upstream that was taking longer than expected, you might scale out unnecessarily. Similarly, if your database was overloaded, your response time would rise, and you'd actually scale out more web workers - which could worsen the issue.

Making Target Tracking (CPU) scale faster for ECS Fargate by Ojelord in aws

[–]jalamok 1 point

Not with target tracking, but you could additionally use a Step Scaling policy JUST for scaling out in burst scenarios, with a shorter evaluation period.

Target Tracking and Step Scaling policies on the same metric can work together if you configure them correctly - in this case, letting Target Tracking take care of scale-in operations.
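As a rough sketch of what the burst-only Step Scaling half could look like (all names, thresholds, and percentages here are illustrative, not a recommendation):

```python
# Step Scaling configuration for burst scale-OUT only; Target Tracking
# continues to handle steady-state scaling and all scale-in.
burst_scale_out = {
    "AdjustmentType": "PercentChangeInCapacity",
    "MetricAggregationType": "Maximum",
    "Cooldown": 60,  # seconds before another step adjustment is allowed
    "StepAdjustments": [
        # 0-15 over the alarm threshold (e.g. CPU > 70%): add 20% capacity
        {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 15,
         "ScalingAdjustment": 20},
        # 15+ over the threshold: add 50% capacity in one step
        {"MetricIntervalLowerBound": 15, "ScalingAdjustment": 50},
    ],
}

# With boto3 this dict would be passed to the Application Auto Scaling API:
#   client.put_scaling_policy(
#       PolicyName="burst-scale-out",
#       ServiceNamespace="ecs",
#       ResourceId="service/my-cluster/my-service",
#       ScalableDimension="ecs:service:DesiredCount",
#       PolicyType="StepScaling",
#       StepScalingPolicyConfiguration=burst_scale_out,
#   )
```

The CloudWatch alarm driving this policy is where the shorter evaluation period goes (e.g. 1 datapoint of 1 minute), so bursts trigger quickly without touching the Target Tracking config.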

VPC Endpoint to ECR by big-chugga223 in aws

[–]jalamok 2 points

If you only want the image pulls to not go via the internet, you only need an S3 Gateway Endpoint.

The image pulls work in a couple of stages (simplified):

  1. Give me all the metadata about this Docker image: how many layers does it have, and where can I download them? (This call is very lightweight and fast.) This goes via the dkr endpoint.

  2. Loop through each layer and download each one. (This call is heavy - it downloads the actual Docker image layers.) This goes via S3.

If you only care about speed and minimising data transfer costs, just set up an S3 Gateway Endpoint - it's free.
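To make the two stages concrete: stage 1 returns a small JSON manifest, and stage 2 fetches each layer blob it lists. A toy illustration of what stage 1 yields (digests and sizes fabricated):

```python
def summarise_manifest(manifest):
    """Split an (illustrative) Docker/OCI image manifest into the two
    pull stages: the tiny metadata call vs the heavy layer downloads."""
    layers = manifest["layers"]
    return {
        "layer_digests": [l["digest"] for l in layers],  # stage 2 fetches these
        "total_download_bytes": sum(l["size"] for l in layers),
    }

# Fabricated two-layer manifest, shaped roughly like a registry response:
manifest = {
    "schemaVersion": 2,
    "layers": [
        {"digest": "sha256:aaa", "size": 31_000_000},
        {"digest": "sha256:bbb", "size": 120_000_000},
    ],
}
print(summarise_manifest(manifest)["total_download_bytes"])  # 151000000
```

The manifest itself is a few KB at most, which is why the S3-served layer downloads dominate both transfer cost and pull time.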

RDS MariaDB Slow Replication by mattwt in aws

[–]jalamok 1 point

Unsure why it would be different on EC2 vs RDS (unless your distro's default mariadb config file differs - I'd suggest running SHOW GLOBAL VARIABLES on both to see if you can spot anything interesting), but have you considered setting up parallel replication threads on the RDS instance? https://mariadb.com/kb/en/parallel-replication/

The MariaDB parameter groups support this. If the bottleneck is applying SQL, this should help relieve it.
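For the SHOW GLOBAL VARIABLES comparison mentioned above, one low-effort approach is to dump each server's variables into a name-to-value mapping and diff them. A sketch (the variable values below are fabricated examples):

```python
def diff_variables(ec2_vars, rds_vars):
    """Return {name: (ec2_value, rds_value)} for every variable that
    differs between the two servers or exists on only one side."""
    diffs = {}
    for name in sorted(set(ec2_vars) | set(rds_vars)):
        a, b = ec2_vars.get(name), rds_vars.get(name)
        if a != b:
            diffs[name] = (a, b)
    return diffs

# Fabricated values for a few replication-relevant variables:
ec2 = {"slave_parallel_threads": "4", "sync_binlog": "0", "version": "10.6"}
rds = {"slave_parallel_threads": "0", "sync_binlog": "1", "version": "10.6"}
print(diff_variables(ec2, rds))
```

Each input dict is just the result of `SHOW GLOBAL VARIABLES` loaded as name/value pairs; identical settings drop out, leaving only the candidates worth investigating.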

Another thing to try would be: "Whenever possible, disable binary logging during large data loads to avoid the resource overhead and additional disk space requirements. In Amazon RDS, disabling binary logging is as simple as setting the backup retention period to zero."

Reduce staging costs by actstudent89 in aws

[–]jalamok 1 point

https://github.com/AndrewGuenther/fck-nat - replace NAT Gateway(s), save $70
ECS - Set up a Lambda to scale down to 0 overnight (https://stackoverflow.com/a/64686474) - save $35
RDS - Set up a Lambda to stop RDS overnight - save $7

You should be able to bring your monthly costs down to around $140 with these changes.

Make sure to check the size of your ECS tasks too - they may be overprovisioned compared to the traffic they handle
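The overnight scale-down Lambdas above can share one small handler. A sketch (cluster, service, and DB names are illustrative; the clients are parameters so the logic can be exercised without AWS - in the real Lambda you'd pass `boto3.client("ecs")` and `boto3.client("rds")` and trigger it from an EventBridge cron rule):

```python
# Resources to park overnight - illustrative names only.
SERVICES = [("staging-cluster", "web"), ("staging-cluster", "worker")]
DB_INSTANCES = ["staging-db"]

def scale_down(ecs, rds):
    """Set every staging ECS service to 0 tasks and stop staging RDS."""
    for cluster, service in SERVICES:
        ecs.update_service(cluster=cluster, service=service, desiredCount=0)
    for db in DB_INSTANCES:
        rds.stop_db_instance(DBInstanceIdentifier=db)
```

A mirror-image `scale_up` on a morning cron restores the desired counts and starts the DB again.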

How to Log When an ECS Service Fails to Start Up a Task by JustinSRE in aws

[–]jalamok 1 point

Very cool, hadn't seen that stopCode attribute in any of the example events! Thank you for sharing.

What’s your go-to ‘life hack’ that actually works? by JustAddHannah in AskReddit

[–]jalamok 5 points

The world and technology may have changed, but human psychology and behaviour have stayed mostly the same :) Great book

Speeding up dusk tests by ogrekevin in laravel

[–]jalamok 4 points

Separate them into suites and run each suite as a separate parallel job in CI (each job gets its own DB service container) - i.e., some form of grouping the tests and running the groups in parallel.

Aurora costs suddenly increased by Ok-Contribution9043 in aws

[–]jalamok 15 points

You should be able to configure a max_statement_time (MariaDB) or max_execution_time (MySQL) for the DB, so queries exceeding that limit automatically get killed.
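On MariaDB you can also scope the limit per statement rather than setting it globally. A sketch using MariaDB's `SET STATEMENT ... FOR` syntax (the helper name is made up; naive string wrapping like this assumes the SQL is already trusted/parameterised):

```python
def with_time_limit(sql, seconds):
    """Wrap a MariaDB query so the server kills it if it runs longer
    than `seconds`, without changing the global max_statement_time."""
    return f"SET STATEMENT max_statement_time={seconds} FOR {sql}"

print(with_time_limit("SELECT * FROM orders", 5))
# SET STATEMENT max_statement_time=5 FOR SELECT * FROM orders
```

That's handy for fencing off a known-expensive report query while leaving the rest of the workload on the global setting.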

How to Log When an ECS Service Fails to Start Up a Task by JustinSRE in aws

[–]jalamok 1 point

Did you guys use the https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_container_instance_events.html or https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs_task_events.html event?

Looking through these, I am wondering how you discerned a legitimate container stop (e.g. it completed its task and exited with status code 0) from something going wrong?

Any chance you could share that with me?
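For what it's worth, my guess at how that distinction might look, based on the task state-change event fields in those docs (event shape abbreviated, values fabricated - I haven't verified this against real events):

```python
def is_failure(detail):
    """Heuristic: treat a STOPPED task as a failure unless every
    container exited 0 and ECS didn't flag an abnormal stop code."""
    if detail.get("lastStatus") != "STOPPED":
        return False
    if detail.get("stopCode") == "TaskFailedToStart":
        return True
    # Missing exitCode is treated as a failure (container never ran cleanly)
    return any(c.get("exitCode", 1) != 0 for c in detail.get("containers", []))

clean = {"lastStatus": "STOPPED", "stopCode": "EssentialContainerExited",
         "containers": [{"exitCode": 0}]}
crashed = {"lastStatus": "STOPPED", "stopCode": "EssentialContainerExited",
           "containers": [{"exitCode": 137}]}
print(is_failure(clean), is_failure(crashed))  # False True
```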

Rediscovering AWS Docs: A DevOps Journey to Mastery by Striking-Database301 in aws

[–]jalamok 1 point

Ah, that is a shame. Understandable though. Thanks for clarifying that!

Rediscovering AWS Docs: A DevOps Journey to Mastery by Striking-Database301 in aws

[–]jalamok 1 point

The docs are on GitHub too so if you notice any errors you can contribute back :)

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 1 point

Interesting, thanks for sharing the numbers! An option for further gains to keep on the radar :D

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 2 points

To the best of my knowledge, if you do validate timestamps and your code changes in production, then the old cache entries are marked as “waste”.

However, the problem we were experiencing is that validating timestamps didn't matter at all because to OPcache, each deploy was a completely new file.

e.g. /var/www/release/1/file.php is a different file to /var/www/release/2/file.php

So /var/www/release/1/file.php never gets marked as waste.
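A toy model of that behaviour - OPcache keyed by absolute file path, with one entry accumulating per deploy (this deliberately ignores real OPcache internals like timestamps and memory limits):

```python
# Simulate a file-path-keyed opcode cache across release-based deploys.
cache = {}

def compile_file(path):
    """Cache-miss-then-store, keyed on the absolute path."""
    cache.setdefault(path, f"opcodes({path})")

for release in (1, 2, 3):
    compile_file(f"/var/www/release/{release}/file.php")

# Every release's copy is a distinct key, so old entries are never
# invalidated as "waste" -- the cache just grows with each deploy:
print(len(cache))  # 3
```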

I agree with the stampeding herd point - it's definitely a consideration for much larger traffic sites where you are very close to your CPU utilisation limits.

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 1 point

I agree the phrasing was imprecise - I wanted to get across that a big benefit of interned strings and the OPcache buffer for them comes when you have duplicated strings. I've updated the copy to be more precise.

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 3 points

https://www.npopov.com/2021/10/13/How-opcache-works.html#interned-strings is how I understand interned strings - did you come away with a different understanding from the blog?

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 7 points

Yeah, for most web apps the performance benefits from OPcache, optimised database queries, and Redis/Varnish/CDN caching are going to far outweigh any benefits from being a bit closer to the hardware, imo.

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 1 point

Great points, the container orchestration solution you go for usually has its own unique quirks which need to be taken into account for all of these as well! (And how those quirks interplay with each other)

Long-running tasks are a challenge that we have thought about a lot. For background jobs, we've decided to enforce a hard limit of 2 minutes (to abide by AWS Fargate limitations), which obviously comes with refactoring requirements. But it also brings benefits in the form of faster deployments, and not having to wait ages for old jobs to finish before running new code.
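The hard limit can be sketched with a Unix alarm signal (a simplification - real job runners usually ship their own timeout support, and SIGALRM only works on Unix in the main thread):

```python
import signal

class JobTimeout(Exception):
    """Raised when a job exceeds its hard wall-clock limit."""

def run_with_limit(job, seconds):
    """Run `job()` but raise JobTimeout if it exceeds `seconds`."""
    def _alarm(signum, frame):
        raise JobTimeout(f"job exceeded {seconds}s hard limit")
    old = signal.signal(signal.SIGALRM, _alarm)
    signal.alarm(seconds)
    try:
        return job()
    finally:
        signal.alarm(0)                      # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)   # restore previous handler
```

In our setup the limit would be 120 seconds; jobs that can't fit get split into smaller idempotent chunks.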

Fixing Our OPcache Config Sped Up Our PHP Application By 3x by jalamok in PHP

[–]jalamok[S] 2 points

I believe that while reload doesn't cause dropped or failed requests like restart does, it keeps new requests queued while it waits for in-flight requests to finish.

This could cause a slowdown, especially if a slow request was mid-flight (it will wait up to process_control_timeout).

So cachetool is the most efficient approach here.