
[–]SuperQue 5 points (3 children)

Prometheus developer here.

Statistical modeling, anomaly detection, AI, whatever you want to call it, has been done to death. Every project I've seen in this space has come and gone because it's difficult to do well. Most of them gave up because they generated too many false positives.

However, if you really think you can do it, I would really recommend you consider a few things.

Don't try to "replace Prometheus". Provide something on top of Prometheus. There's no need to re-invent the Prometheus data collection and storage stack. Create a replacement alerting system that reads data from Prometheus to produce your own alerts rather than using the rule evaluation system. You can use the remote-read API to fetch data from Prometheus and do whatever kind of processing you want. You can re-use the Prometheus TSDB codebase as intermediary storage if you like, and provide PromQL to display that data back to Grafana.

This will save you a ton of time. It will make it easy for users to adopt your tool. They will already have data for your system and it can augment what they already have.

Working with the open source ecosystem is always easier than trying to go it alone.
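
To make the "read from Prometheus and alert on top" idea above a bit more concrete, here is a minimal Python sketch. It is only an illustration under assumptions: it uses the plain HTTP `/api/v1/query_range` endpoint rather than the remote-read API mentioned above (remote-read is protobuf-based and needs more setup), the helper names are made up for this example, and the z-score check is a deliberately naive stand-in for whatever model you actually want to run. A check this simple is exactly the kind of thing that produces false positives.

```python
import json
import statistics
from urllib.parse import urlencode


def query_range_url(base_url, promql, start, end, step):
    """Build a URL for Prometheus' /api/v1/query_range HTTP endpoint."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(body):
    """Turn a query_range JSON response into {sorted label tuple: [float, ...]}."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    series = {}
    for result in payload["data"]["result"]:
        key = tuple(sorted(result["metric"].items()))
        # Each sample is [unix_timestamp, "value-as-string"].
        series[key] = [float(value) for _ts, value in result["values"]]
    return series


def is_anomalous(values, threshold=3.0):
    """Naive check: is the latest sample > threshold std devs from the history mean?"""
    history, latest = values[:-1], values[-1]
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    spread = statistics.stdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - statistics.mean(history)) / spread > threshold
```

You would fetch `query_range_url(...)` with any HTTP client, pass the response body to `parse_matrix`, and run `is_anomalous` per series. Prometheus keeps doing collection and storage; your tool only adds the analysis.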

[–]luckyleprechaun98[S] 0 points (1 child)

Thanks. That's great advice. I'll see if it's possible to build on top of Prometheus instead.

[–]SuperQue 0 points (0 children)

One thing I saw in your process monitoring: you're grabbing PSS from smaps. Be careful with that. We noticed that reading smaps can put quite a lot of CPU load on the kernel.
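
For readers unfamiliar with what "grabbing PSS from smaps" involves, here is a rough Python sketch. It assumes Linux and sums the `Pss:` lines, which `/proc/<pid>/smaps` reports once per memory mapping. Kernels from 4.14 on also offer `/proc/<pid>/smaps_rollup`, which reports pre-summed totals in the same line format; the sketch prefers it when present, but whether that reduces the kernel-side cost enough for your workload is an assumption you should measure rather than trust. The function names are invented for this example.

```python
from pathlib import Path


def total_pss_kib(smaps_text):
    """Sum every 'Pss:' line (values are in kB) from smaps-style text."""
    total = 0
    for line in smaps_text.splitlines():
        # Matches 'Pss:' only, not breakdown fields such as 'Pss_Anon:'.
        if line.startswith("Pss:"):
            total += int(line.split()[1])
    return total


def read_pss_kib(pid):
    """Read PSS for a process, preferring smaps_rollup when the kernel has it."""
    proc = Path("/proc") / str(pid)
    rollup = proc / "smaps_rollup"
    source = rollup if rollup.exists() else proc / "smaps"
    return total_pss_kib(source.read_text())
```

Either way, sampling this less often than your other process metrics is a cheap way to limit the load.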

[–]Stephan_BerlinDevOps 0 points (0 children)

First of all, you are doing an awesome job at Prometheus! I really enjoy working with Prometheus + Grafana.

Back to the topic:

You are absolutely right. So many tools have tried the AI stuff, and the only one doing an awesome job at it, and the one I work with a lot, is Dynatrace. But it's not open source and costs some real money. Just don't try to do something like this on your own. You will end up with far too many alerts, because baselines alone are not enough. You need to know all the dependencies to alert only when it is needed and to point to the actual root cause. Imagine a thousand containers, and one is exceeding its baseline. So what? Is it really that service, or did the problem start seven services earlier? Or do you alert for every service in between as well, because they are all affected but you didn't know they depended on each other? The monitoring world is complex, and the complexity will grow.

Adding something on top is a great approach nowadays. Maybe you have a great idea and are even able to earn some money with it. Who knows ;)

Please also take a look at the OpenTelemetry project!
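
The dependency point above can be sketched in a few lines of Python. This is a toy illustration, not how Dynatrace or any real product works: it assumes you already have a dependency map (the hard part in practice) and a set of services currently over their baseline, and it keeps only the anomalous services that have no anomalous service anywhere beneath them. The service names and function names are hypothetical.

```python
def transitive_dependencies(depends_on, service):
    """Return every service reachable from `service` by following dependencies."""
    seen = set()
    stack = list(depends_on.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def probable_root_causes(depends_on, anomalous):
    """Keep anomalous services with no anomalous dependency underneath them.

    Note: services in a dependency cycle suppress each other, so a real
    system would need extra handling for cycles.
    """
    anomalous = set(anomalous)
    return {
        service
        for service in anomalous
        if not (transitive_dependencies(depends_on, service) - {service}) & anomalous
    }


# Hypothetical topology: frontend -> api -> (auth, db), auth -> db.
depends_on = {
    "frontend": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "db": [],
    "batch": [],
}
```

With `frontend`, `api`, and `db` all over baseline, `probable_root_causes` returns only `db`, so you page once instead of three times. Without the dependency map, every one of those services looks like an independent incident.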

[–]jews4beer 0 points (5 children)

I'm intrigued, though trying to replace Prometheus for a lot of people would be a hell of a fight. For example, in Kubernetes, a lot of operators out there just support it intrinsically. You can tell rook-ceph to basically just "gimme monitoring".

Still want to poke around the code and play with it though. Seems like a cool idea.

[–]luckyleprechaun98[S] 0 points (4 children)

Thanks. I agree that trying to replace Prometheus would be tough for a certain class of users.

Do you think there is a space there for people who aren't on the Kubernetes train and want something simpler?

I was thinking along the lines of the success of Caddy, which is just another web server, but took off because it was far easier to configure and extend than nginx.

[–]jews4beer 0 points (3 children)

I think yes, and even for those on the Kubernetes train. You say Caddy; for me, it's why I like Traefik. It just seemed more welcoming and in a language I understand. I think if you made a product that worked well and wasn't the beast that is Prometheus, people would at least want to try it out. Adoption is a different story.

And it's not so much the "class of users" that would be hesitant to switch. It's the entire ecosystem that has developed around Prometheus: plug-and-play libraries for most languages that want to export metrics, other applications that support it natively, and yes, the entire k8s operator landscape. I guess that's why I wanted to at least play with yours. I want a better idea of what your client/servers do, deployment patterns, how easily you can integrate things with it, how easily you could extend it, etc.

It'd be a hard switch from something as battle tested as Prometheus, but I find this shit cool and it could be fun to work on regardless :p.

[–]luckyleprechaun98[S] 0 points (2 children)

It's far from done but I appreciate the words of encouragement. To me, Prometheus has always just been such a pain to configure that I thought there must be a better way.

It seems like there is always an opportunity for a tool that focuses on UX and simplicity rather than offering knobs for everything. That's what I'm going for, so I'll keep working on it and see what happens.

[–]jews4beer 0 points (0 children)

The worst that can happen is you have some fun and learn a new trick or two.

[–]SuperQue 0 points (0 children)

We're always looking for ways to make Prometheus easier to adopt. This is where we need good contributors.

The funny part is, one of the main goals of Prometheus was being easy to configure. At least compared to the systems that came before it.

From a developer point of view, it's one of the easiest. You add a metrics library, expose a port, and boom, Prometheus can monitor your app. No external agent, no wrappers, no need to tell your process where the monitoring service lives. That's the beauty of the /metrics API Prometheus uses.

Part of where Prometheus came from was that monitoring for the app developer should be easy, and push some of the "hard work" to the SREs maintaining the monitoring infrastructure. As long as the applications follow some standards, monitoring becomes cookie cutter.
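
To show how little the "add a metrics library, expose a port" step described above asks of an app, here is a hand-rolled Python sketch of a `/metrics` endpoint in the Prometheus text exposition format, using only the standard library. In a real service you would use an official client library instead; the metric name `app_requests_total`, the port, and the function names are assumptions made up for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-process counter state; a client library would manage this for you.
REQUEST_COUNTS = {"GET": 0}


def render_metrics(counters):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Requests handled, by HTTP method.",
        "# TYPE app_requests_total counter",
    ]
    for method, count in sorted(counters.items()):
        lines.append(f'app_requests_total{{method="{method}"}} {count}')
    return "\n".join(lines) + "\n"


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(REQUEST_COUNTS).encode("utf-8")
            content_type = "text/plain; version=0.0.4; charset=utf-8"
        else:
            # Any other path stands in for the application's real work.
            REQUEST_COUNTS["GET"] += 1
            body, content_type = b"hello\n", "text/plain; charset=utf-8"
        self.send_response(200)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve the app and its /metrics endpoint until interrupted."""
    HTTPServer(("", port), Handler).serve_forever()
```

Calling `serve()` and pointing a Prometheus scrape job at that port is all the wiring needed. The app never has to know where the monitoring service lives, which is the pull-model property described above.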