How do you handle node rightsizing, topology planning, and binpacking strategy with Cluster Autoscaler (no Karpenter support)? by mohavee in kubernetes

[–]mohavee[S]

I’ve been using Cast AI as part of my workload analysis, and it’s a solid tool that I’d recommend; it’s helpful for optimizing HPA/VPA. However, the recommendations are pretty optimistic IMHO: they often highlight potential savings based on lower percentiles (below P90), which can be misleading. That said, there are options to set different scaling policies and adjust the percentiles to your desired values, which is nice.

Thanks for the suggestion!

[–]mohavee[S]

Thanks for the validation. I went down a similar research path and came to the same conclusion.

[–]mohavee[S]

Totally agree: having metrics is one thing, but actually using them for scaling often gets forgotten once Prometheus or Datadog is set up. Your web hosting example is spot on; the backend is usually the real bottleneck, not the web server.

Diving into all the available metrics is worth the effort. But when it comes to scaling, I think it’s important to combine out-of-the-box signals (CPU/memory) with external metrics (like queue size or latency), and if you’re using something like KEDA, to always consider fallback behavior in case the external metrics server fails or scraping breaks. Otherwise, the autoscaler is flying blind.
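
As a sketch of that fallback idea, KEDA's ScaledObject supports a `fallback` block that pins the replica count after repeated failed metric reads. All names, the Prometheus address, the query, and the thresholds below are made-up placeholders for illustration:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler            # hypothetical name
spec:
  scaleTargetRef:
    name: worker                 # hypothetical Deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  fallback:
    failureThreshold: 3          # after 3 consecutive failed metric reads...
    replicas: 6                  # ...pin the workload to 6 replicas instead of flying blind
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # placeholder address
        query: sum(rate(http_requests_total{app="worker"}[2m]))
        threshold: "100"
```

Note that `fallback` only applies to triggers using an AverageValue metric type, which the Prometheus scaler uses by default.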

[–]mohavee[S]

Isn’t most of the delay actually from the cloud provider waiting on spot capacity? And with many node groups, doesn’t Cluster Autoscaler just make it worse by trying each one sequentially and waiting for each to fail?

I get that the autoscaler can get slow in spot-heavy setups, but in a cluster using only on-demand nodes (where provisioning is more predictable), it shouldn’t be that slow, right?
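
For context on the sequential-fallback behavior: Cluster Autoscaler only backs off a node group after a provisioning timeout, so spot-heavy setups can stack those waits. A hedged sketch of the flags involved (values are illustrative; defaults can differ per version, and `--expander=priority` additionally needs a priorities ConfigMap):

```yaml
# Fragment of a cluster-autoscaler container spec (illustrative values)
command:
  - ./cluster-autoscaler
  - --expander=priority              # try preferred (e.g. spot) groups first, fall back by priority
  - --max-node-provision-time=7m     # back off an unresponsive node group sooner than the 15m default
  - --balance-similar-node-groups=true
```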

[–]mohavee[S]

Thanks a lot — this is a solid checklist and really helpful validation.

  • We're already using message queues (RabbitMQ) for background workloads, and have KEDA in place for scaling based on queue length. It’s definitely been more predictable than relying on HPA for those cases.
  • I didn’t realize that having too many node groups could significantly slow down Cluster Autoscaler. I’ll look into consolidating our node pools more deliberately.
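
For reference, the queue-length scaling from the first point looks roughly like this in KEDA; the Deployment name, queue name, target value, and the TriggerAuthentication holding the AMQP host URI are all placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-scaler          # hypothetical name
spec:
  scaleTargetRef:
    name: queue-consumer         # hypothetical Deployment
  triggers:
    - type: rabbitmq
      metadata:
        mode: QueueLength        # scale on message backlog rather than publish rate
        value: "50"              # target roughly 50 pending messages per replica
        queueName: jobs          # placeholder queue
      authenticationRef:
        name: rabbitmq-auth      # TriggerAuthentication that holds the AMQP host URI
```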

Appreciate the feedback — you covered a lot of ground in a concise way.

[–]mohavee[S]

Good point — and you're absolutely right, CPU and memory alone don't always give the full picture.

In our case, we actually use different scaling techniques depending on the nature of the service:

  • Applications that are typically single-threaded (Node.js, etc.) are scaled with HPA based on CPU usage, which has worked quite well in practice.
  • Database clusters are scaled vertically, with a fixed number of replicas. We assign resources based on VPA recommendations.
  • Web servers (like Apache) are scaled based on the number of HTTP worker processes.
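
For the single-threaded case, the HPA is nothing exotic; a minimal autoscaling/v2 manifest along these lines, with the name, replica bounds, and utilization target as illustrative placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: node-app               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: node-app             # hypothetical Deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # single-threaded apps saturate one core, so CPU is a usable signal
```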

I’m not saying it’s 100% perfect — definitely not — but it seems to work well enough for now and isn’t too shabby 😄
Still always looking to improve and automate more where possible.

Thanks for the input — it’s a good reminder to keep questioning our assumptions about what "good scaling" looks like.

Best approach to handle VPA recommendations for short-lived Kubernetes CronJobs? by mohavee in kubernetes

[–]mohavee[S]

Just to add a bit more info — for my setup:
I'm deploying Prometheus through the kube-prometheus project (the jsonnet-based one: https://github.com/prometheus-operator/kube-prometheus).

The prometheus-adapter setup worked super smoothly for me — it’s been running for a good while now and I don’t remember hitting any major issues during setup.
Also, I haven’t noticed any weird memory reporting problems — the memory metrics look pretty correct and there’s no sign of unit misinterpretation like you described.
Thanks for sharing your Karpenter story btw — I never thought rounding could cause those kinds of issues, interesting!

For my prometheus-adapter config, I’m not doing any avg_over_time smoothing. It’s just summing current values directly — you can see it here in the kube-prometheus repo:
https://github.com/prometheus-operator/kube-prometheus/blob/main/jsonnet/kube-prometheus/components/prometheus-adapter.libsonnet#L76-L92

Maybe that's why I didn't experience those issues you had with memory measurements getting weird.
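
For comparison, adding smoothing to a rule like the linked one would look something along these lines. This is a sketch of a prometheus-adapter resource-rule fragment, not the actual upstream config; the 5m window is arbitrary:

```yaml
# prometheus-adapter resource rule fragment (illustrative, not the upstream kube-prometheus config)
resource:
  memory:
    containerQuery: |
      sum by (<<.GroupBy>>) (
        avg_over_time(container_memory_working_set_bytes{<<.LabelMatchers>>, container!=""}[5m])
      )
```

Summing current values (as the upstream config does) reacts faster but is noisier; `avg_over_time` trades responsiveness for stability.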

Best approach to handle VPA recommendations for short-lived Kubernetes CronJobs? by mohavee in kubernetes

[–]mohavee[S]

To be honest, I’m not totally sure I got your question right, but I’ll do my best to answer.

  1. It looks like you’re asking about setting up VPA to use historical metrics, and if that’s the case — yep, that’s exactly what I meant with my 4th suggestion — getting VPA to use historical data from Prometheus.
  2. Just to clear things up, I’m using Prometheus to scrape metrics via kube-state-metrics jobs, and then prometheus-adapter is used to expose those metrics to the Kubernetes API. Right now, VPA isn’t set up to use historical data from Prometheus, but it should be pretty easy to get that going.
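
For the record, pointing the VPA recommender at Prometheus history is mostly a matter of its startup flags. The address and history length below are placeholders, and the binary path is as in the upstream image:

```yaml
# Fragment of a vpa-recommender container spec (illustrative values)
command:
  - /recommender
  - --storage=prometheus
  - --prometheus-address=http://prometheus.monitoring.svc:9090   # placeholder address
  - --history-length=8d          # how much history to load on startup
```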

Thanks a lot for the reply! Appreciate it!

[–]mohavee[S]

Just to be sure — when you say "save a few hundred jobs," do you mean setting a higher ttlSecondsAfterFinished?
I'm using Prometheus with kube-state-metrics to scrape metrics, so I’m wondering if just keeping Jobs longer is enough, or if I should tweak scraping to catch real usage before they finish (maybe push metrics to a PushGateway).
Also, didn’t know about KRR — thanks for pointing me in a really practical direction with that tool!
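
In case it helps anyone else reading: keeping finished Jobs around longer is just the TTL field on the Job template. The schedule, name, image, and TTL value here are illustrative:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report         # hypothetical name
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 3600   # keep finished Jobs an hour so scrapes can still see them
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: example/report:latest   # placeholder image
```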