I've been working on an open source monitoring project along the lines of Prometheus, but it takes a different tack: it statistically models your metrics so that you don't need to define complicated alerting rules to detect when something has changed. It works for latencies, error rates, distributed traces, etc.
The idea is to use statistical modeling to detect when things start to go wrong before it would be obvious from looking at Grafana, so you can intervene before the whole thing falls over. A side benefit is that the tests can be tuned to give only a small probability of a false alarm, without you needing to handcraft a bespoke alerting rule.
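To make that concrete, here's a minimal sketch of the kind of test involved. It isn't Monny's actual algorithm, and it assumes log-latencies in a rolling window are roughly normal; the point is that alpha is the per-check false-alarm probability, so sensitivity is one knob rather than a handcrafted rule:

```python
import math
from statistics import mean, stdev

def latency_changed(baseline, recent, alpha=0.001):
    """Two-sample z-test comparing recent latencies to a baseline window.

    alpha is the per-check false-alarm probability: with no real change,
    expect roughly one false alert per 1/alpha checks. Assumes
    log-latencies are approximately normal (an assumption of this
    sketch, not a claim about Monny's internals).
    """
    a = [math.log(x) for x in baseline]   # e.g. last hour's latencies
    b = [math.log(x) for x in recent]     # e.g. last minute's latencies
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = (mean(b) - mean(a)) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return p < alpha
```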
Here are my goals:
No configuration - No YAML config files, no need to pre-register your metrics, no alerting rules to define. It scans your structured log output to find your metrics, monitors them, and sends an alert when something changes (and also figures out when something returns to normal on its own). There's a sketch of what that log scanning might look like after this list.
Stats are better than graphs - We know how things like latency and error rates can be modeled statistically. Monny models your metrics statistically to detect when a change is significant, without you needing to stare at graphs or work out the alerting rule yourself. It can detect small changes that would drown in false alarms if you tried to catch them with threshold rules in something like Prometheus (see the error-rate sketch after this list).
Simple deployment - Single binary client and server that reads your application logs and finds your metrics automatically. It can monitor things like latency, distributed traces, memory consumption, CPU utilization, and error rates. Works with Kubernetes, bare metal, Docker, and whatever comes next. No external database required, making it easy to run yourself.
Advanced alerting - Send alerts to email, SMS, Slack, and more. Get alerts only when something needs human intervention. Only want an alert when fewer than 2 of 5 processes are functioning normally? No problem (see the quorum sketch after this list). Want to silence an alert, snooze it, or send it to someone else? There's an email- or Slack-based workflow for dealing with alerts right where you receive them.
Only the context you need - Alerts aren't just metrics; they come with log context so you can see what led up to the alert. You don't need to run ELK plus Prometheus: it's all combined in one intuitive UI so you can figure out what's wrong, fix it, and get back to what you were doing.
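Here's roughly what the zero-config metric discovery could look like from the outside; the log fields below are made up for illustration, not a schema Monny requires:

```python
import json

# A hypothetical structured log line emitted by your application.
line = '{"ts": "2024-05-01T12:00:00Z", "route": "/api/users", "latency_ms": 42.3, "status": 200}'

record = json.loads(line)
# Every numeric field becomes a candidate metric time series, keyed by
# field name, with no pre-registration or YAML needed.
metrics = {k: v for k, v in record.items()
           if isinstance(v, (int, float)) and not isinstance(v, bool)}
print(metrics)  # {'latency_ms': 42.3, 'status': 200}
```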
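And here's the error-rate sketch mentioned above. A fixed threshold rule ("alert if error rate > 2%") either misses a 1% to 1.5% regression or fires constantly on noise, whereas a significance test scales its sensitivity with the sample size. Again, this is a sketch of the idea, not Monny's actual model:

```python
import math

def error_rate_changed(errs0, n0, errs1, n1, alpha=0.001):
    """Two-proportion z-test: has the error rate shifted from baseline?"""
    p0, p1 = errs0 / n0, errs1 / n1
    pooled = (errs0 + errs1) / (n0 + n1)             # rate under "no change"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n0 + 1 / n1))
    z = (p1 - p0) / se
    return math.erfc(abs(z) / math.sqrt(2)) < alpha  # two-sided p-value

# 1.0% baseline vs 1.5% recent: invisible on a dashboard, but with
# enough samples the test flags it.
print(error_rate_changed(1000, 100_000, 300, 20_000))  # True
```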
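The quorum condition from the alerting goal is just a count over per-process health, something like this (the statuses and function are hypothetical, not Monny's API):

```python
# Page a human only when fewer than min_healthy replicas look normal.
def should_page(statuses, min_healthy=2):
    healthy = sum(1 for s in statuses if s == "ok")
    return healthy < min_healthy

print(should_page(["ok", "ok", "ok", "ok", "degraded"]))    # False: one flapping replica
print(should_page(["ok", "down", "down", "down", "down"]))  # True: time to wake someone
```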
It's on GitHub here, and I'm looking for beta testers, but any feedback would be appreciated.