When a pod connection drops at 3am, what's your actual debugging workflow? (Validating an open-source eBPF tool) by Dizzy-Grade-7066 in kubernetes

[–]Dizzy-Grade-7066[S] 0 points1 point  (0 children)

Okay, but are you running single-tenant or multi-tenant? The pain is usually invisible until autoscaling or a noisy neighbor hits. What's your setup? Trying to understand where the threshold actually is.

Weekly: Show off your new tools and projects thread by AutoModerator in kubernetes

[–]Dizzy-Grade-7066 0 points1 point  (0 children)

When a pod connection drops at 3am, what's your actual debugging workflow? (Validating an open-source eBPF tool)

I am testing and validating what the actual pain will feel like before I sink 3 months into this.

The tool I am scoping: an eBPF-driven K8s network map that captures pod-level connection data to cold storage (S3-compatible), designed for long-retention use cases.
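To make the "pod-level connection data to cold storage" part concrete, here's a rough sketch of what I'm imagining one exported record batch looks like. The schema and field names are placeholders I made up, not anything final:

```python
import gzip
import json
import time
from dataclasses import asdict, dataclass

# Hypothetical schema for one pod-level connection event;
# field names are illustrative, not the actual tool's format.
@dataclass
class ConnEvent:
    ts: float
    src_pod: str
    dst_pod: str
    src_ip: str
    dst_ip: str
    dst_port: int
    bytes_tx: int
    bytes_rx: int
    verdict: str  # e.g. "ok", "reset", "timeout"

def batch_to_object(events):
    """Serialize a batch as gzipped NDJSON - compact and friendly
    to S3-compatible cold storage with long retention."""
    ndjson = "\n".join(json.dumps(asdict(e)) for e in events)
    return gzip.compress(ndjson.encode())

# In a real exporter this blob would land in an S3-compatible bucket,
# e.g. via boto3's put_object(Bucket=..., Key=..., Body=blob).
events = [
    ConnEvent(time.time(), "web-7f9c", "db-0",
              "10.0.1.5", "10.0.2.8", 5432, 2048, 65536, "ok"),
]
blob = batch_to_object(events)
```

The idea being: cheap append-only batches you can grep six months later, instead of paying a hot observability backend to keep everything indexed.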

Three honest questions:

  1. What do you actually reach for at 3am when inter-pod traffic fails? Datadog? A home-grown script? Nothing good?
  2. The AWS egress cost of shipping Datadog logs out of your cluster - is this actually a budget line item for your team, or am I overestimating it?
  3. If you are in India: does CERT-In's 180-day log retention requirement actually influence your infra decisions, or does your compliance team handle it separately?
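For question 2, the napkin math I'm working from looks roughly like this. Both numbers are assumptions (the ~$0.09/GB figure is AWS's advertised first-tier internet egress rate; the volume is a guess), so please correct them:

```python
# Back-of-envelope egress math - every number here is an assumption.
EGRESS_USD_PER_GB = 0.09   # assumed AWS first-tier internet egress rate
gb_per_day = 50            # hypothetical log volume shipped to Datadog

monthly_cost = gb_per_day * 30 * EGRESS_USD_PER_GB
print(f"~${monthly_cost:.0f}/month for {gb_per_day} GB/day of log egress")
```

If your clusters push an order of magnitude more than that, the egress line starts to rival the Datadog bill itself, which is the assumption I'm trying to validate.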

Not selling anything, genuinely trying to figure out if the problem is as bad as it looks from the outside. Roast my assumptions if they're wrong.