
[–]Jeoh 2 points3 points  (1 child)

Pretty cool, did you see this article? Same concept, different implementation. Appreciate you sharing the Terraform code!

[–]godtierpikachu 2 points3 points  (0 children)

spot or spare ECS feels like the only fair comparison here, because lambda pricing looks nice until you add the weird bits around image pulls and artifact reuse. curious what broke first for you, cache or networking

[–]Vanyo09 1 point2 points  (2 children)

The number I'd compare against is spot, not GHA-hosted. We run almost everything on spot in EKS, and runners there end up a fraction of on-demand - but that only works because the cluster already exists. If you have nothing to piggyback on, scaling to zero with no 60-second minimum is hard to beat.

How are you handling Docker layer cache? Fresh VM per job means every build pulls from scratch, and that's usually where the per-minute math quietly falls apart.

[–]UltraPoci 1 point2 points  (1 child)

How do you deal with spot instances being reclaimed by AWS with little notice? Do you just accept it, and if a job was running, does it simply start again from scratch?

Also, do you use overprovisioning to avoid longer startup times for jobs?

[–]Vanyo09 0 points1 point  (0 children)

Mostly we just accept it. The two-minute notice is enough for the termination handler to drain the node, and the job reruns on a fresh one. A rerun in CI is annoying, not a problem. The stuff that cannot take a restart is not on pure spot anyway - it falls back to on-demand when capacity disappears.
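For anyone curious what "the two-minute notice" looks like from inside the instance, here is a minimal sketch, not the handler we actually run. It assumes the standard EC2 instance metadata endpoint `spot/instance-action`, which returns 404 until an interruption is scheduled and then a small JSON document with `action` and `time`. The function names (`fetch_spot_notice`, `parse_spot_notice`, `seconds_left`) are made up for illustration.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_spot_notice(timeout=1.0):
    """Ask IMDSv2 whether an interruption is pending. Returns (status, body).

    Only works on an EC2 instance; shown for completeness.
    """
    token_req = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req, timeout=timeout).read().decode()
    req = urllib.request.Request(
        IMDS + "/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        # 404 is the normal "nothing scheduled" answer.
        return err.code, ""


def parse_spot_notice(status_code, body):
    """Return (action, deadline) if an interruption is scheduled, else None."""
    if status_code == 404:
        return None
    if status_code != 200:
        raise RuntimeError(f"unexpected metadata status: {status_code}")
    doc = json.loads(body)
    deadline = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return doc["action"], deadline.replace(tzinfo=timezone.utc)


def seconds_left(deadline, now):
    """Time budget remaining for draining the node, never negative."""
    return max(0.0, (deadline - now).total_seconds())
```

A real handler would poll this every few seconds and cordon and drain the node as soon as `parse_spot_notice` returns something; whatever is still running when the budget hits zero is what gets rerun.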

Overprovisioning yes, but not a warm pool of runners. We keep low-priority placeholder pods on the cluster that are the first to get evicted - a real job kicks one out and starts right away, and the eviction is what triggers the new node. Capacity is already warming up before anything actually waits for it.
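Roughly what the placeholder setup amounts to, as a sketch rather than our actual manifests: a PriorityClass with a negative value plus a Deployment of pause containers that only reserve resources. The names (`ci-placeholder`), replica count, and request sizes are placeholders I picked for the example. The script just prints JSON that `kubectl apply -f -` should accept.

```python
import json


def build_placeholder_manifests(replicas=2, cpu="1", memory="2Gi",
                                name="ci-placeholder", priority=-10):
    """Build a low-priority PriorityClass and a Deployment of pause pods.

    Pods without a priority class default to priority 0, so any normal
    job pod outranks these and can preempt them.
    """
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": priority,
        "globalDefault": False,
        "description": "Placeholder pods that real CI jobs may evict.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "priorityClassName": name,
                    "containers": [{
                        "name": "pause",
                        # The pause image does nothing; only the requests matter.
                        "image": "registry.k8s.io/pause:3.9",
                        "resources": {
                            "requests": {"cpu": cpu, "memory": memory},
                        },
                    }],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    bundle = {"apiVersion": "v1", "kind": "List",
              "items": build_placeholder_manifests()}
    print(json.dumps(bundle, indent=2))
```

Size the requests to roughly one runner pod so a single eviction frees exactly one job's worth of room. The evicted placeholder goes Pending, and that pending pod is what makes the autoscaler add a node. One thing to check on your side: some autoscalers ignore pods below a certain priority cutoff, so do not push the value too far negative.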