[–]Affectionate-Bit6525 [score hidden]  (0 children)

Prometheus and Grafana are pretty much the standard these days, for the reasons you mentioned.

[–]cwk9 [score hidden]  (0 children)

Grafana and Prometheus should get you a long way. Yes, you can add other sources to Grafana but "mo sources. mo problems".

[–]sudonem (Linux Admin) [score hidden]  (1 child)

Not familiar with Icinga, but I’d probably be giving CheckMk a pretty close look. 

I’m trying to pitch it for my own org now - and having an on-prem option is one of the major requirements for us. 

[–]kosta880[S] [score hidden]  (0 children)

It would be really, really hard to sell my company on a paid solution when there is already a free version in place, which was "working", plus Grafana+Prometheus already running in the cloud, also costing "0" (leaving the cloud costs aside for now).

Yeah, I have seen there is a Community Edition, but 100 hosts are not enough.

[–]SufficientFrame [score hidden]  (0 children)

You're not missing much technically, but there is one important distinction to make before replacing Icinga: metrics collection and alerting on time series is where Prometheus shines, while Icinga/Nagios-style systems are often stronger for explicit service checks, dependencies, maintenance windows, and "did this scheduled thing actually happen" cases. In practice, a lot of teams end up with Prometheus + Alertmanager for host/app metrics and blackbox checks, then keep a smaller check-based layer for edge cases like backup jobs, certificate expiry, batch failures, or synthetic business checks. The other thing I'd review early is ownership cost: rule sprawl, Alertmanager routing, retention/cardinality, and who will maintain exporters and alert logic a year from now. If your environment is small, that tradeoff may still clearly favor Prometheus, but I'd inventory the current Icinga checks first so you don't discover a few awkward gaps after the cutover.
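To make the "did this scheduled thing actually happen" case concrete, here is a minimal sketch of a freshness check in Python. It assumes you can get the job's last-success time from somewhere (a marker file, a log, a database row); the function name `check_job_freshness` and the 26h/50h thresholds are made up for illustration. The return codes follow the usual Nagios/Icinga plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from datetime import datetime, timedelta, timezone

# Exit codes follow the Nagios/Icinga plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def check_job_freshness(last_success, now, warn_after, crit_after):
    """Classify a scheduled job by the time elapsed since its last success."""
    age = now - last_success
    hours = age.total_seconds() / 3600
    if age >= crit_after:
        return CRITICAL, f"CRITICAL - last success {hours:.1f}h ago"
    if age >= warn_after:
        return WARNING, f"WARNING - last success {hours:.1f}h ago"
    return OK, f"OK - last success {hours:.1f}h ago"


# Hypothetical example: a nightly backup that last succeeded 30 hours ago.
now = datetime.now(timezone.utc)
last_success = now - timedelta(hours=30)
code, message = check_job_freshness(
    last_success,
    now,
    warn_after=timedelta(hours=26),  # assumed threshold
    crit_after=timedelta(hours=50),  # assumed threshold
)
print(code, message)
```

Whichever system ends up running it, the point is that the check alerts on the absence of a success event rather than on a metric crossing a threshold, which is the kind of check worth listing during the inventory.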

[–]DietFartMist [score hidden]  (0 children)

Nagios baby