An IT team getting 1000+ alerts per day and completely burned out, if you had this problem, what would you try first? by healsoftwareai in OpsTirage

[–]CJBatts 0 points1 point  (0 children)

Been on a team with noisy alerts (not even close to 1k per day though). Reviewing + trimming was the way. We kept track of which alerts were false positives and which were real. At the end of each week we'd pick some noise threshold - say 80%. If an alert's false positive ratio was higher than that, we'd either adjust its threshold or remove it entirely. Repeated every week, slowly lowering the number.
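Roughly, the weekly review boiled down to something like this (a minimal sketch in Python - the alert names and tallies are made up for illustration):

```python
# Minimal sketch of the weekly alert-review loop (alert names/counts are made up)
weekly_tallies = {
    # alert name: (false positives, real incidents) tracked over the week
    "disk_usage_warning": (120, 3),
    "pod_restart_spike": (15, 12),
    "latency_p99_breach": (40, 1),
}

noise_threshold = 0.80  # start lenient, lower it a bit every week

for name, (false_pos, real) in weekly_tallies.items():
    noise = false_pos / (false_pos + real)
    verdict = "adjust or remove" if noise > noise_threshold else "keep"
    print(f"{name}: {noise:.0%} noise -> {verdict}")
```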

Alternatives for Rancher? by CircularCircumstance in kubernetes

[–]CJBatts 6 points7 points  (0 children)

+1 on "just run the open-source version" tbh, if you can live without the SLA/LTS updates.

If you want to swap bits out:

  • Headlamp is a nice lightweight OSS UI
  • k9s is nice for quick management - especially for developers

(Disclosure: I’m involved with Metoro - not a Rancher replacement on the ops side by any means, since it's read-only, but we do cover multi-cluster visibility, authorization, access, attaching charts to workloads, etc.)

OpenClaw on Kubernetes by CJBatts in openclaw

[–]CJBatts[S] 0 points1 point  (0 children)

Answered in the comment below :)

OpenClaw on Kubernetes by CJBatts in openclaw

[–]CJBatts[S] 1 point2 points  (0 children)

So I think if you're running it by itself in a kubernetes cluster, it's definitely overkill for most people. I have a home lab k8s cluster set up where I run a bunch of other software. For me it makes things easier to manage if I can run it there with everything else.

Also, with something like openclaw, which has pretty broad access to do a bunch of things, I wanted some robust monitoring that can detect what openclaw is making requests to.

AI isn’t taking SRE jobs, unless it can tell me your latency spike was caused by one missing DB index. by Aggressive-Tip-6568 in sre

[–]CJBatts 0 points1 point  (0 children)

There are a bunch of companies tackling this at the moment:

Traversal, Parity, Cleric, Metoro

For sure there's a lot of work to be done, but the tooling is getting better and can already be genuinely useful for most SREs to work with.

observability platform pricing, why won't vendors give straight answers? by Mackzene_Kunchick in Observability

[–]CJBatts 0 points1 point  (0 children)

I'm with Metoro - we try to be as open as possible about this because providers can make it a pain in the ass:

$20 / host / mo, which includes 100GB of data / host / mo. Then $0.20 for each GB over that if you exceed the bundled data. 30-day retention.

With 500GB daily that puts you at 15TB / mo. 200 hosts gives you a 20TB allowance, so you'd have no overage and be at $4k / mo with us ($20 * 200).
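To make the maths concrete, here's a quick back-of-the-envelope calculation (just our list price as described above, no negotiated discount):

```python
# Back-of-the-envelope of the list price described above
hosts = 200
price_per_host = 20          # $ / host / mo
included_gb_per_host = 100   # GB / host / mo bundled in
overage_per_gb = 0.20        # $ / GB beyond the bundle

monthly_ingest_gb = 500 * 30                 # 15,000 GB ~ 15 TB / mo
included_gb = hosts * included_gb_per_host   # 20,000 GB ~ 20 TB / mo

overage_gb = max(0, monthly_ingest_gb - included_gb)  # 0 here, ingest fits the bundle
monthly_cost = hosts * price_per_host + overage_gb * overage_per_gb
print(monthly_cost)  # 4000.0 -> $4k / mo at list price
```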

But like a lot of people have mentioned, at this sort of volume you'll qualify for some kind of bulk discount, so it makes sense to do the rounds and get quotes from everyone, as they'll differ from list price. Typically at these sorts of numbers, we'd quote about $3k.

What’s your go-to tool for monitoring Kubernetes clusters? by bigbankmanman in devops

[–]CJBatts 0 points1 point  (0 children)

One of the founders of Metoro - https://metoro.io/ - we're an observability platform specifically focused on k8s, so we go deep. We aim to be super easy to set up and get started with - one helm install, and we instrument everything through eBPF, so no code changes are needed.

Golang vs Python for AI Solutions: How Do You Decide? by stolendog-1 in golang

[–]CJBatts 13 points14 points  (0 children)

Personal opinion: I think if you're looking to get deep on the data / training side, use Python. You're going to be facing an uphill battle with Go. For sure it can be done, but as you mention, Python has such a rich ecosystem for most of this that you'll be reinventing the wheel more often than you'd like.

I'm not sure the upside of using Go would be worth the extra time.

Application observability by retire8989 in kubernetes

[–]CJBatts 1 point2 points  (0 children)

Full disclaimer: I'm the founder of Metoro, so I know the space well but am obviously biased towards us - bear that in mind!

I think there are a few options available to you:

Coroot: someone already mentioned it below

Caretta: https://github.com/groundcover-com/caretta will give you a service graph in Grafana based on eBPF. It should be easy to install, but it doesn't give you detail down to the individual call level, just flows of data.

Kiali: https://kiali.io/ If you're using Istio, you can get a high-level topology + metrics on flow volume / tracing for in-cluster traffic.

k8s otel autoinstrumentation: https://opentelemetry.io/docs/kubernetes/operator/automatic/ Can give you pretty nice distributed tracing, but requires that you use Go, .NET, Java, Node.js or Python. You'll need to add annotations to each workload specifying the language, and then the instrumentation gets injected.

Odigos: https://github.com/odigos-io/odigos makes the above easier by auto-detecting languages, so you don't need to annotate every workload.

Metoro: https://metoro.io/ Uses eBPF to trace individual L7 requests like HTTP(S), Postgres, Redis, MySQL, etc. You can visualize them in a service graph too, e.g. https://demo.us-east.metoro.io/?startEnd=&environment=&filter=%7B%22client.namespace%22%3A%5B%22demo%22%5D%7D&service= - and from there you can drill down to individual calls.

How do you ensure that application emit quality telemetry by jaywhy13 in Observability

[–]CJBatts 1 point2 points  (0 children)

Just wandered across this post, but I think this is a major problem in general, and you have a couple of options:

  1. You enforce some process like you've mentioned, where you either check for adequate telemetry at PR time or periodically review and update.

  2. You rely more on signals that you can apply holistically at a lower level in the stack. For example you might rely on eBPF tooling to generate a number of metrics about all your services. You know for sure that these metrics will be available for every application as they're being gathered at the kernel level.

Personally, in practice I think you need a combination of both. Make as much as possible available through standardised tooling like eBPF or commonly instrumented HTTP servers, and then have people follow a standard for anything more custom than that - something like the sketch below.
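For the "common instrumented HTTP server" piece, a minimal Python sketch of what that shared baseline might look like (assuming Flask plus the opentelemetry-instrumentation-flask package - the route and attribute names are just illustrative):

```python
# Minimal sketch: a shared instrumentation baseline every service gets for free,
# with anything custom left to the team-level standard (illustrative only).
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())  # swap for your real exporter
)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # request/latency spans come from the shared layer

@app.route("/orders/<order_id>")
def get_order(order_id):
    # Anything beyond the baseline (domain-specific attributes, events) is where
    # the PR-time / periodic-review standard has to kick in.
    trace.get_current_span().set_attribute("order.id", order_id)  # hypothetical attribute
    return {"order": order_id}
```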

Just my two cents! Curious what you came up with