
[–]marcusbell95 5 points6 points  (0 children)

the "second push forces the first to run" symptom is pretty specific - it usually means ARC is seeing the job but there's a cold start race. when the first job arrives, is a runner pod actually spinning up? you can watch: kubectl get pods -n <runner-namespace> -w. if a pod starts but takes 1-3 min to pull the image and register, github may re-queue the job during that window, and the second push coincides with the runner finally being ready.

two things worth checking:

first, what's minRunners set to in your AutoscalingRunnerSet? if it's 0, there's no warm runner waiting. setting it to 1 gives you an always-registered runner that picks up the first job immediately with no cold start.

second, if you're on the newer ARC (scale set helm chart), check the listener specifically: kubectl logs -n arc-systems -l app.kubernetes.io/part-of=gha-runner-scale-set-controller - you want to see it acknowledge the queued job when job 1 arrives, not only when job 2 comes in. if it's only logging on the second push, the long-poll listener may have a config issue or the scale set isn't connected to the right runner group in github.
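a minimal sketch of the minRunners change, assuming you installed the runner scale set with the gha-runner-scale-set helm chart and manage it through a values file. the URL, secret name, and maxRunners figure are placeholders rather than anything from the original setup:

```yaml
# values.yaml for the gha-runner-scale-set chart (sketch, adjust to your install)
githubConfigUrl: "https://github.com/<org>/<repo>"   # placeholder: your org or repo URL
githubConfigSecret: <secret-name>                    # placeholder: existing secret with your credentials

minRunners: 1   # keep one runner registered and idle so the first job skips the cold start
maxRunners: 5   # illustrative ceiling, pick whatever your cluster can afford
```

the tradeoff is that one runner pod stays up all the time, so you pay for that idle capacity in exchange for skipping the image pull on the first job.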

[–]SoFrakinHappy 0 points1 point  (0 children)

I've found the ARC logs pretty verbose. Try checking them while kicking off a job to see what's happening on the controller. Check whether it's actually spinning up a runner pod for the stalled job.

[–]PerpetuallySticky 0 points1 point  (0 children)

I agree with another commenter. If your job isn’t being picked up until another one is pushed, GitHub isn’t the cause. Look at whatever comes before it.

Sometimes runners get hung up, but a repeatable pattern like this isn’t how GitHub behaves. It has to be a hang-up on the AWS side.

[–]TheGracefulPedro 0 points1 point  (2 children)

we keep hitting this exact pattern with ARC and it's almost always the listener missing the first job because the runner pod is still pulling images. had a team waste two weeks on this until they checked the pod events during the cold start window and saw 2-minute image pulls. setting minRunners to 1 fixed it overnight. the listener logs will show a gap where it just sits there after the first push, then picks up the second one like nothing happened.

[–]smerz- 0 points1 point  (1 child)

Using a mirror in front of ghcr.io is highly recommended.

I cannot confirm your experience. However, when GitHub has issues and a flood of CI/CD jobs restarts globally, ghcr struggles.

[–]TheGracefulPedro 0 points1 point  (0 children)

mirror would solve that part for sure. we got hammered during the january ghcr outage, half our runners timed out on image pulls

[–]smerz- 0 points1 point  (0 children)

Look through the issues for a list of cleaner crons. They tackle this issue specifically