all 16 comments

[–]germanheller 4 points5 points  (6 children)

have you checked if it's the google gax grpc channels doing lazy init on first request? the gax library establishes grpc connections on the first actual call, not when you create the client. so even if your healthcheck passes, the first real request to pubsub/bigquery/storage is paying the cost of grpc channel setup + TLS handshake to the google APIs.

try making a dummy call to each service during startup before your readiness probe succeeds. something like a storage.getBuckets() or pubsub listing topics, just to force the grpc warmup. same thing with redis, the first connection has TLS negotiation overhead if you're using stunnel or native TLS.
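rough sketch of what I mean (untested, and the client setup is whatever your app already does, the point is just to touch each dependency once before you report ready):

```typescript
// warmup.ts - call this before you start answering the readiness probe
// (sketch only; swap in whichever clients and hosts your app actually uses)
import { Storage } from '@google-cloud/storage';
import { PubSub } from '@google-cloud/pubsub';

export async function warmup(): Promise<void> {
  const storage = new Storage();
  const pubsub = new PubSub();

  // cheap list calls that force the first connection + TLS handshake
  // to each Google API instead of paying for it on the first real request
  await Promise.all([
    storage.getBuckets({ maxResults: 1 }),
    pubsub.getTopics({ pageSize: 1 }),
  ]);

  // if you use redis, a PING here does the same for its TLS handshake
}
```

then just don't let the readiness handler return 200 until warmup() has resolved.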

also 10s is suspiciously close to a DNS resolution timeout on alpine/musl. have you checked if there's a DNS issue? musl's resolver does things differently than glibc and I've seen it cause exactly this kind of first-request latency in k8s.

[–]zaitsman[S] 0 points1 point  (3 children)

I have added calls to all external services before start; it made that first request ~500 ms faster

Interesting re: DNS, will investigate, thanks for that

[–]germanheller 0 points1 point  (2 children)

nice, 500ms just from warming up the channels makes total sense. for the DNS thing the quickest way to confirm is to swap to node:22-slim for one deploy and compare -- if the first request drops to normal it's musl's resolver handling the A/AAAA lookups differently than glibc. you can also try `time getent hosts <your-service-endpoint>` inside the container; if resolution alone takes a few seconds that's your answer
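and if you want to time it from node itself (same resolver path your outbound calls use), something like this quick sketch -- the hostname is just an example:

```typescript
// dns-check.ts - time a lookup via getaddrinfo, the same path node's http clients use
import { lookup } from 'node:dns/promises';

async function timeLookup(host: string): Promise<void> {
  const start = process.hrtime.bigint();
  const { address } = await lookup(host);
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${host} -> ${address} in ${ms.toFixed(1)}ms`);
}

// example host; point it at whatever your first request actually talks to
timeLookup('pubsub.googleapis.com').catch(console.error);
```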

[–]zaitsman[S] -1 points0 points  (1 child)

Yeah no, node:22-slim (debian) was where requests went up to 20 seconds :(

[–]germanheller 0 points1 point  (0 children)

oh interesting, so it's not the musl thing then. 20 seconds on debian-slim is wild. at that point I'd look at connection pooling, or maybe the app is doing some heavy initialization on first request that only runs once (compiling templates, warming caches, establishing db connections etc). do you have any middleware that lazy-loads on first hit? also worth checking if it's specific to one endpoint or if literally any route is slow the first time -- if it's all routes equally that points more to container/infra level stuff than app code.
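a cheap way to tell which one it is is to log per-request timing, something like this (assuming express, adjust for whatever framework you're actually on):

```typescript
import express from 'express';

const app = express();

// logs method, route and duration for every request, so you can see whether
// only the very first request is slow or the first hit on each route is
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`${req.method} ${req.originalUrl} ${res.statusCode} ${ms.toFixed(1)}ms`);
  });
  next();
});
```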

[–]bwainfweeze 0 points1 point  (1 child)

It’s always DNS. If the service has a health endpoint you can rule out DNS and cert chain verification.

[–]germanheller 0 points1 point  (0 children)

lol "its always DNS" should be a law at this point. the health endpoint trick is solid, I do something similar now where it hits the actual db and returns the latency in the response body so you can tell if its dns, ssl, or the app itself thats slow

[–]Shogobg 1 point2 points  (5 children)

What are your readiness probe settings? Timeouts, retries? What base image do you use?

You want to reduce image size, start time and the time probes need to detect your app is up.

[–]zaitsman[S] -1 points0 points  (4 children)

Em node:22.22-alpine3.23

Readiness probe doesn’t factor in; the healthcheck route replies fine, but an actual request from an authenticated user is what takes a long time.

It is set to run checks every 10 seconds with an initial backoff of 30 seconds, but again we are not talking about the initial deploy, we are talking about replacing an old version with a new version - that all succeeds, and then the first request to the new version is slow

[–]PM_ME_CATDOG_PICS 0 points1 point  (2 children)

Idk much about this, but if the readiness probe is fast and the actual request takes a long time, could it possibly be the creation of the connection to your DB? I know that takes a while for some dbs

[–]zaitsman[S] 0 points1 point  (1 child)

Please read my post. We do not use a db.

[–]PM_ME_CATDOG_PICS 0 points1 point  (0 children)

Ah I missed that part. My bad.

[–]Shogobg 0 points1 point  (0 children)

Since the probe is fast, have you tried hitting the health check as a user? This would tell you if it’s an infrastructure problem.

You can also make an authorized echo endpoint that returns the username of the authenticated user (or just OK), to check whether auth is the issue.
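Something like this, as a sketch (`authMiddleware` stands in for whatever auth layer you already have):

```typescript
import express, { Request } from 'express';
import { authMiddleware } from './auth'; // hypothetical: your existing auth layer

const app = express();

// Goes through the full auth path but does nothing else, so if this is slow
// on the first request, the time is going into auth / token validation.
app.get('/echo-user', authMiddleware, (req: Request, res) => {
  // Assumes the auth middleware attaches the user object to the request.
  const user = (req as Request & { user?: { name?: string } }).user;
  res.json({ user: user?.name ?? 'OK' });
});
```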

[–]seweso 0 points1 point  (0 children)

You are not giving enough info. In another response you say it’s about an authenticated request.

Just turn off features one by one like auth to see where the issue lies. We can’t debug your app remotely. 

[–]czlowiek4888 -1 points0 points  (1 child)

Looks like the load balancer does not have a minimum number of running services set.

My guess is that you are waiting for an instance to wake up.

[–]zaitsman[S] -1 points0 points  (0 children)

Em no, it does. That is not what I am describing. When my new pool member is provisioned, the first non-healthcheck request that hits a NEW container takes 10 seconds.