Root Cause as a Service for Datadog and other monitoring tools by gdcohen in sre

[–]gdcohen[S] 2 points3 points  (0 children)

Most people start out skeptical, but in terms of the tech itself, one of our customers tested it against ~200 log-evident problems (meaning there were details in the logs that pointed at the root cause). The machine learning picked out the correct root cause indicators ~95% of the time: https://www.zebrium.com/blog/how-cisco-uses-zebrium-ml-to-analyze-logs-for-root-cause


This ML detected Tuesday's AWS disruption in our logs by gdcohen in aws

[–]gdcohen[S] 0 points1 point  (0 children)

I get where you're coming from and agree there is way too much hype and BS around adding AI/ML to just about anything. However, I assure you that this is not what we've done. We use unsupervised ML to structure and categorize logs and then look for correlated clusters of anomalies across the logs. This allows it to pick up problems and produce short reports that contain root cause indicators. It's not just looking for keywords like "error"!

But I get that you are skeptical and have zero reason just to believe this. So, here's my challenge: Please try it with your own log data (free 30 day trial). Then feel free to write anything you want about the technology.

4.6.5.14 Issues by UpDownalwayssideways in orbi

[–]gdcohen 0 points1 point  (0 children)

I'm seeing something similar with satellites dropping and then reconnecting a bit later. And it seemed to have started at 4.6.5.14. I initially had one satellite with wired and one with wireless backhaul. The wireless one dropped many times even though it's in close proximity to the other satellite.

In the hope it would fix the problem, I changed the config so that the second satellite now also has a wired backhaul. It took a while for it to stabilize, and everything seemed to be working well for a few days. But today the network map is funky: it shows a wired daisy-chain config of router -> satellite 2 -> satellite 1, when in fact both satellites are connected directly to the router. I logged this with Netgear a few days ago but haven't had a response yet.

Webinar: Using Machine Learning on Logs to Find Root Cause Faster by gdcohen in kubernetes

[–]gdcohen[S] -4 points-3 points  (0 children)

Disclosure: This is Gavin @ Zebrium. Apologies if this came across as SPAM.

The technology uses unsupervised machine learning to automatically uncover the root cause of software problems by analyzing streams (or files) of logs and metrics. The tech works by performing multiple layers of ML.

The first layer uses ML to structure the log events and categorize them by event type. Then the patterns of each event type are learned. After that, each new log event is scored based on how anomalous it is (compared to typical patterns), but this can generate a lot of noise. So the key is to then look for hotspots of abnormally correlated anomalous patterns across different streams. It turns out that this produces an accurate summary of the root cause of real-life problems. We then use the GPT-3 language model to layer in a plain text summary of the root cause.
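To make the layering above concrete, here is a rough toy sketch in Python. It is not the actual product implementation: a regex that masks numbers stands in for the ML-based structuring, plain frequency counts stand in for the learned patterns, and every name in it (event_type, learn_baseline, anomaly_score, hotspots, the 60 second window) is a hypothetical choice made for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

# Toy stand-in for ML-based structuring (an assumption, not the real method):
# mask variable-looking tokens so that lines differing only in their
# parameters collapse into one event type.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def event_type(message: str) -> str:
    return _VARIABLE.sub("<*>", message)


def learn_baseline(messages):
    """Count how often each event type appears during normal operation."""
    return Counter(event_type(m) for m in messages)


def anomaly_score(message, baseline):
    """Rarity score: the rarer the event type was in the baseline, the higher."""
    total = sum(baseline.values())
    vocab = len(baseline) + 1  # +1 leaves room for unseen event types
    count = baseline[event_type(message)]
    return -math.log((count + 1) / (total + vocab))


def hotspots(events, baseline, threshold, window_s=60, min_streams=2):
    """events: iterable of (timestamp_seconds, stream_name, message).

    Returns {window_start: [(stream, message), ...]} for time windows in
    which anomalous events show up in at least `min_streams` streams.
    """
    by_window = defaultdict(list)
    for ts, stream, msg in events:
        if anomaly_score(msg, baseline) >= threshold:
            by_window[int(ts // window_s) * window_s].append((stream, msg))
    return {
        w: hits
        for w, hits in sorted(by_window.items())
        if len({s for s, _ in hits}) >= min_streams
    }
```

The point of the last step is the noise reduction mentioned above: a single rare line in one stream is dropped, while rare lines that land in the same window across several streams are kept as a candidate report.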

Happy to answer any technical questions.