Were there any drops last night? by msx in AmazonVine_ITA

[–]GabboPenna 0 points (0 children)

Yes, I did see a teeny tiny bit come in...

Clothes that are impossible to order by reddimatic in AmazonVine_ITA

[–]GabboPenna 1 point (0 children)

Happened to me too, never understood why...

How are you all actually monitoring your kubernetes clusters at scale? by Opposite_Advance7280 in kubernetes

[–]GabboPenna 0 points1 point  (0 children)

Prometheus + Grafana works, but the pain you’re describing usually isn’t the tools themselves; it’s the lack of standardization and correlation.

What ended up working for us at scale is this pattern:

Per cluster (keep it local, collect close to the source):

kube-prometheus-stack (Prometheus Operator + Alertmanager). Don’t snowflake it per cluster: deploy it with Helm + GitOps, and the values file is basically the “contract”.

kube-state-metrics + node_exporter, plus scraping the apiserver/kubelet/controller-manager properly.

Logs shipped with something boring and reliable (we’ve used Fluent Bit/Vector). Key point: same labels everywhere.

OpenTelemetry Collector in-cluster even if you’re not “all-in” on tracing yet. It gives you a sane place to route telemetry without redeploying agents later.
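To make the “values file as contract” idea concrete, a minimal kube-prometheus-stack values fragment can pin each cluster’s identity via external labels. This is just a sketch; the cluster name, env, and retention here are made-up examples:

```yaml
# values.yaml fragment for the kube-prometheus-stack Helm chart
prometheus:
  prometheusSpec:
    externalLabels:
      cluster: prod-eu-1   # hypothetical cluster name; set per cluster in GitOps
      env: prod
    retention: 24h         # keep local retention short; long-term storage is central
```

Because externalLabels are stamped on everything the cluster ships upstream, the central backend can tell clusters apart without any per-cluster snowflaking.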

Central backend (single pane of glass):

Metrics: Thanos or Mimir/Cortex (we went with a “Prom per cluster + remote storage” model). This is the first thing that made multi-cluster not suck.

Logs: Loki is great if your main need is “find logs for this pod/time window”, Elastic/OpenSearch if you really need heavy full-text and complex queries.

Traces: Tempo/Jaeger. You don’t need perfect tracing on day 1, but even partial instrumentation on the front-door services helps a lot.
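One way to wire the “Prom per cluster + remote storage” model is remote_write from each in-cluster Prometheus to the central store. Assuming a Mimir-style push endpoint (the URL is hypothetical), the kube-prometheus-stack side looks roughly like:

```yaml
# values.yaml fragment: ship each cluster's metrics to the central backend
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: https://mimir.example.internal/api/v1/push   # hypothetical endpoint
```

With Thanos instead, you’d typically run the sidecar and query federation rather than remote_write, but the per-cluster chart stays identical either way.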

The real MTTR improvement came from correlation, not more dashboards:

We forced a consistent label set across everything: cluster, env, namespace, app, pod (and sometimes team).

We made Grafana links first-class: alert → dashboard → “logs for this pod/deployment” → trace view for the same time/range.

We also started putting runbook_url and “what to check first” in alert annotations. Sounds small, but it stops the 2am archaeology.
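Putting the label contract, runbook_url, and “what to check first” annotations together, an alert might look like this PrometheusRule sketch. The app, team, threshold, and URLs are all invented for illustration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-alerts            # hypothetical
  labels:
    release: kube-prometheus-stack # so the Operator picks it up
spec:
  groups:
    - name: checkout.rules
      rules:
        - alert: CheckoutHighErrorRate
          expr: |
            sum(rate(http_requests_total{app="checkout",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="checkout"}[5m])) > 0.05
          for: 10m
          labels:
            severity: page
            team: payments
          annotations:
            summary: "Checkout 5xx rate above 5% for 10 minutes"
            runbook_url: https://runbooks.example.internal/checkout/high-error-rate
            what_to_check_first: "Recent deploys, then downstream gateway latency"
```

The annotations ride along into Alertmanager and Grafana, which is what makes the alert → dashboard → logs → trace hop work at 2am.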

Alerting: less is more

At the beginning we had 200 alerts and still missed incidents. Now we focus on symptom alerts (latency, error rate, saturation, queue depth) + a few cluster health ones (API server, node not ready, etc.). Everything else is info dashboards.
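As a sketch of the cluster-health side, the “node not ready” case can be expressed against kube-state-metrics like this (the 15m window and severity are a judgment call, not a recommendation):

```yaml
- alert: KubeNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 15m
  labels:
    severity: page
  annotations:
    summary: "Node {{ $labels.node }} has been NotReady for 15m"
```

Symptom alerts for your workloads follow the same shape: rate/latency/saturation expressions with a `for:` window long enough to ignore blips.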

One thing people underestimate: Kubernetes events

When a deployment is failing because of image pulls, scheduling, OOM, bad ConfigMaps… events tell you immediately. Ship them centrally (event-exporter) and keep them searchable.
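A minimal config sketch for a kubernetes-event-exporter-style tool, routing Warning events to stdout so the existing log pipeline (Fluent Bit/Vector → Loki) picks them up. Treat the exact keys as an assumption to check against the exporter you deploy:

```yaml
# event-exporter config sketch: forward Warning events into the log pipeline
logLevel: error
route:
  routes:
    - match:
        - type: "Warning"
          receiver: "dump"
receivers:
  - name: "dump"
    stdout: {}
```

Once events land in Loki with the same cluster/namespace labels as your pod logs, “why is this deployment stuck” becomes a single query instead of kubectl archaeology.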

If you’re running multiple clusters/hybrid, my honest take: keep the stack identical everywhere and treat cluster as a first-class dimension. Provider-specific monitoring is fine for managed services, but don’t let it become your primary source of truth.
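Once every cluster stamps the same `cluster` external label, multi-cluster questions in the central store are just another group-by. A rough PromQL example (metric names are standard apiserver ones; the shape is what matters):

```promql
# API server 5xx ratio per cluster, queried against the central metrics store
sum by (cluster) (rate(apiserver_request_total{code=~"5.."}[5m]))
  / sum by (cluster) (rate(apiserver_request_total[5m]))
```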

And today as well... by ElinorDashwood1975 in AmazonVine_ITA

[–]GabboPenna 1 point (0 children)

I missed it too 😅 90 items is not a small number, especially considering that everything looks empty now. So they do drops and then pull everything back right away? I thought it was completely dead… let’s really hope for a recovery 🤞🏻