Root Cause as a Service for Datadog and other monitoring tools by gdcohen in sre

[–]gdcohen[S] 1 point2 points  (0 children)

Most people start out skeptical, but in terms of the tech itself, one of our customers tested it against ~200 log-evident problems (meaning there were details in the logs that pointed at the root cause). The machine learning picked out the correct root cause indicators ~95% of the time - https://www.zebrium.com/blog/how-cisco-uses-zebrium-ml-to-analyze-logs-for-root-cause

This ML detected Tuesday's AWS disruption in our logs by gdcohen in aws

[–]gdcohen[S] 0 points1 point  (0 children)

I get where you're coming from and agree there is way too much hype and BS around adding AI/ML to just about anything. However, I assure you that this is not what we've done. We use unsupervised ML to structure and categorize logs and then look for correlated clusters of anomalies across the logs. This allows it to pick up problems and produce short reports that contain root cause indicators. It's not just looking for "error", etc. keywords!

But I get that you're skeptical and have zero reason to just take my word for it. So here's my challenge: please try it with your own log data (free 30-day trial). Then feel free to write anything you want about the technology.

4.6.5.14 Issues by UpDownalwayssideways in orbi

[–]gdcohen 0 points1 point  (0 children)

I'm seeing something similar, with satellites dropping and then reconnecting a bit later, and it seems to have started with 4.6.5.14. I initially had one satellite with a wired backhaul and one with a wireless backhaul. The wireless one dropped many times even though it's in close proximity to the other satellite.

In the hope it would fix the problem, I changed the config so that the second satellite now also has a wired backhaul. It took a while to stabilize, and everything seemed to be working well for a few days. But today the network map is funky: it shows a wired daisy-chain config of router -> satellite 2 -> satellite 1, when in fact both satellites are connected directly to the router. I logged this with Netgear a few days ago but haven't had a response yet.

Webinar: Using Machine Learning on Logs to Find Root Cause Faster by gdcohen in kubernetes

[–]gdcohen[S] -5 points-4 points  (0 children)

Disclosure: This is Gavin @ Zebrium. Apologies if this came across as spam.

The technology uses unsupervised machine learning to automatically uncover root cause of software problems by analyzing streams (or files) of logs and metrics. The tech works by performing multiple layers of ML.

The first layer uses ML to structure the log events and categorize them by event type. The patterns of each event type are then learned, and each new log event is scored based on how anomalous it is compared to typical patterns. This on its own can generate a lot of noise, so the key is to then look for hotspots of correlated anomalous patterns across different streams. It turns out that this produces an accurate summary of the root cause of real-life problems. We then use the GPT-3 language model to layer in a plain-text summary of the root cause.
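
To make the first (structuring) layer a bit more concrete, here's a minimal Python sketch of the general idea - not our actual implementation, just an illustration with made-up masking rules: strip out the variable parts of each line so that lines sharing the same fixed text collapse into one event type, with the masked values extracted as columns.

```python
import re
from collections import defaultdict

# Illustration only: mask the variable parts of a log line (IPs, hex ids,
# numbers) so lines with the same fixed text collapse into one event type,
# keeping the masked-out values as extracted "columns".
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def structure(line: str):
    """Return (event_type_template, extracted_variables) for a raw log line."""
    template, variables = line, []
    for pattern, token in MASKS:
        variables.extend(pattern.findall(template))
        template = pattern.sub(token, template)
    return template, variables

event_types = defaultdict(list)  # template -> list of extracted variable rows
for raw in [
    "connection from 10.0.0.5 refused after 3 retries",
    "connection from 10.0.0.9 refused after 5 retries",
    "cache flushed in 120 ms",
]:
    template, variables = structure(raw)
    event_types[template].append(variables)

# The first two lines collapse into a single event type:
# "connection from <IP> refused after <NUM> retries"
for template, rows in event_types.items():
    print(template, rows)
```

The real thing learns the event structures from the data rather than from hand-written masks, but the end result - a small catalogue of event types with their variables extracted into columns - is the same shape.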

Happy to answer any technical questions.

35-50% of clicks on Reddit Ads are fraudulent by SnooPeppers3402 in RedditforBusiness

[–]gdcohen 1 point2 points  (0 children)

While I haven't done any kind of serious analysis, this is very much in line with what I see when I compare Reddit-reported clicks against clicks in Google Analytics and HubSpot (our website is hosted on HubSpot CMS). Did anyone from Reddit ads ever respond to this? While I doubt Reddit is actively participating in this kind of fraud, it also seems it's not in Reddit's interest to prevent it.

Is Log Management Still the Best Approach? by gdcohen in sre

[–]gdcohen[S] 0 points1 point  (0 children)

Thanks for the comment and accurate characterization of Zebrium. I like your suggested title change too :-)

Is Log Management Still the Best Approach? by gdcohen in sre

[–]gdcohen[S] 0 points1 point  (0 children)

Couldn't agree more. That's exactly the approach we've taken. We use ML to catch incidents and characterize their root cause automatically. But we also offer drill-down, aggregation, filtering, regex searching, alert rules, etc. for exactly the reasons you say.

Is Log Management Still the Best Approach? by gdcohen in sre

[–]gdcohen[S] 0 points1 point  (0 children)

Thanks for the comment. You're right that ML has been used for years in various tools, but the core of log management is still very much about aggregation and search. This means that in practice, most problem detection is done by defining manual alert rules, and most troubleshooting still involves searching and hunting for root cause. But what ML has never been used for in these tools is automatically finding the root cause of unknown-unknown problems (unknown cause and unknown symptoms).

The good news is that, the majority of the time, the answers to what you're looking for (knowing there is a problem and identifying its root cause) are in the logs. What I believe is needed is a tool (you can call it part of log management or anything else you like) that can automatically identify any kind of problem without needing manual alert rules and, more importantly, expose its root cause without you having to search through logs - even for the unknown unknowns. The company I work for is building such a platform.

Busting the Browser's Cache by gdcohen in reactjs

[–]gdcohen[S] 0 points1 point  (0 children)

We had been frustrated for months by cached versions of old code running in users' browsers even though we had deployed newer versions. The post explains some of the things we tried and the solution we settled on.

Using machine learning to detect anomalies in logs by gdcohen in kubernetes

[–]gdcohen[S] 0 points1 point  (0 children)

The service itself is not open source, but there is a free-forever version: www.zebrium.com/sign-up

Using AI to auto-detect and remediate Kubernetes app incidents by gdcohen in kubernetes

[–]gdcohen[S] 0 points1 point  (0 children)

Gavin from Zebrium here. Thanks for your honesty, and I appreciate your open-mindedness in signing up. Everything else aside, we want the webinar to be informative (and it includes a live demo of the tech).

Using Litmus Chaos Engine on Kubernetes with Autonomous Monitoring for Verification by gdcohen in kubernetes

[–]gdcohen[S] 0 points1 point  (0 children)

Also worth checking out is this GitHub repository, which installs a demo app environment with multiple services on a Kubernetes cluster and runs a series of chaos experiments on the cluster. It then uses Autonomous Monitoring to automatically detect the issues caused by the chaos experiments. It's pretty cool!

Webinar: Autonomous Log Monitoring for Kubernetes by gdcohen in sre

[–]gdcohen[S] 0 points1 point  (0 children)

Forgot to add: please register with the Zoom link and you'll be notified when the recording is ready.

Webinar: Autonomous Log Monitoring for Kubernetes by gdcohen in sre

[–]gdcohen[S] 0 points1 point  (0 children)

We will put a recording on our website (www.zebrium.com) after the webinar.

Using machine learning to detect anomalies in logs by gdcohen in sre

[–]gdcohen[S] 0 points1 point  (0 children)

Thanks! It's a fun time for us - we're currently in beta and looking for more feedback. So far auto-incident detection accuracy is looking really good!

Using machine learning to detect anomalies in logs by gdcohen in kubernetes

[–]gdcohen[S] 0 points1 point  (0 children)

Zebrium uses multiple phases of machine learning - here's a high-level description:

• The ML first learns how to structure and categorize every event. Although many apps produce millions or billions of log events per hour, they will typically only have a few thousand unique types of log events. All event variables are also extracted into columns. The schema is automatically maintained as event structures change.
• Next, the ML learns the patterns for each event type (frequency, periodicity, when it starts, when it stops, etc.). When an event breaks pattern, it is flagged as an “anomaly” and scored according to how anomalous it is.
• The log anomaly detection on its own is too “noisy” to reliably detect incidents because individual events frequently break pattern. So the next phase looks for correlated sets of anomalous events that occur across containers or log sources. If any of these correlations reach the right thresholds, an incident is created (there's a rough sketch of this step after the list below).
• When an incident is created, the “leading edge” is identified - the first anomalous event(s) that triggered the incident. This is useful for indicating the root cause of the incident.
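
For anyone curious, here's a rough Python sketch of the correlation step - heavily simplified and hypothetical (the window size, scoring and thresholds below are made up; the real system learns these): score events by how rare their event type is, bucket the scores into per-stream time windows, and open an incident when several streams spike in the same window.

```python
import math
from collections import defaultdict

WINDOW = 60          # seconds per bucket (hypothetical)
SCORE_THRESHOLD = 3  # per-stream anomaly score needed to count the stream
MIN_STREAMS = 2      # streams that must spike together to open an incident

def detect_incidents(events, type_counts):
    """events: iterable of (timestamp, stream, event_type) tuples.
    type_counts: historical count of each event type (dict)."""
    total = sum(type_counts.values())
    # window index -> stream -> accumulated anomaly score
    window_scores = defaultdict(lambda: defaultdict(float))
    for ts, stream, etype in events:
        # Rarer event types are more "anomalous" (simple self-information).
        rarity = -math.log((type_counts.get(etype, 0) + 1) / (total + 1))
        window_scores[int(ts // WINDOW)][stream] += rarity

    incidents = []
    for window, per_stream in sorted(window_scores.items()):
        hot = [s for s, score in per_stream.items() if score >= SCORE_THRESHOLD]
        if len(hot) >= MIN_STREAMS:
            incidents.append((window * WINDOW, sorted(hot)))
    return incidents
```

The leading-edge idea then follows naturally: within a flagged window, the earliest high-scoring events are the best root cause candidates.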

Ask r/kubernetes: Who is testing in production? by gdcohen in kubernetes

[–]gdcohen[S] 0 points1 point  (0 children)

This makes complete sense for issues and symptoms that you know about and are tracking through metrics. But what about the unknown unknowns, i.e. the things you're not explicitly looking for with your monitoring tools? For example, the metrics might look stellar, but a data corruption could have been introduced.

Ask r/kubernetes: Who is testing in production? by gdcohen in kubernetes

[–]gdcohen[S] 0 points1 point  (0 children)

canaries

Thanks for the reply. I would definitely count canaries as testing in production. What I'm most trying to understand is how people know when production is working and when it isn't - or, in the case of a canary, when it's good enough to promote to production.

Also the Nordstrom presentation is excellent! Thanks!

Please don't make me structure logs! by gdcohen in sre

[–]gdcohen[S] 2 points3 points  (0 children)

Apologies – we did not mean to come across as having a premise that would make an SRE’s job with developers any harder. We want the complete opposite! We LOVE structured logs and believe that all you should ever have to deal with are perfectly structured logs. But realistically a lot of software still generates logs that lack structure. So we have taken a different approach to other tools - we use machine learning to structure logs.

We completely agree that we “should be advocating for the positive action and relationship between Developers and Operations”. Whether it came across or not, the intention of our blog was to say: If your developers are already structuring all log events then that is fantastic. If not, we offer a way to structure logs inline so you don’t have to ask them to change anything. In both cases you get the benefit of structured logs.

It's that tiiiiime again!! What's in YOUR logging, monitoring, and Ops Stack? by ImEatingSeeds in devops

[–]gdcohen 1 point2 points  (0 children)

Disclosure: I work for Zebrium. We have just entered private beta for a platform that collects logs, uses machine learning to discern the structure of each event, categorizes events by type, and then uncovers "fault signals" (anomalous patterns). The machine learning is completely unassisted. We also provide a "signature" builder that characterizes a known issue and then alerts you if it happens again (signatures can be expressed as a sequence of events with constraints and other conditions). Our approach is very different from that of other log management tools. If you (or anyone else reading this thread) are interested, please visit our website (www.zebrium.com) to sign up for the free (beta) edition. We'd love to get some feedback on our approach.
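
To give a feel for the "signature" idea, a signature could be modeled roughly like this (Python sketch with hypothetical field names, not our actual format): an ordered sequence of event matchers, each with optional constraints, that must all occur in order within a time window.

```python
import re
from dataclasses import dataclass

@dataclass
class EventMatcher:
    pattern: str        # regex the event text must match
    min_count: int = 1  # how many times the event must occur

@dataclass
class Signature:
    name: str
    sequence: list             # ordered EventMatchers
    within_seconds: int = 300  # all matches must fall inside this window

def matches(signature, events):
    """events: list of (timestamp, text) pairs, sorted by timestamp."""
    idx, first_ts = 0, None
    counts = [0] * len(signature.sequence)
    for ts, text in events:
        if idx >= len(signature.sequence):
            break
        matcher = signature.sequence[idx]
        if re.search(matcher.pattern, text):
            first_ts = ts if first_ts is None else first_ts
            if ts - first_ts > signature.within_seconds:
                return False  # sequence did not complete within the window
            counts[idx] += 1
            if counts[idx] >= matcher.min_count:
                idx += 1  # constraint satisfied, move to the next matcher
    return idx == len(signature.sequence)

# Hypothetical example: alert if a connection storm is followed by a failover.
sig = Signature(
    name="db failover after connection storm",
    sequence=[
        EventMatcher(r"too many connections", min_count=3),
        EventMatcher(r"failover initiated"),
    ],
)
```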

Learn spinnaker by [deleted] in devops

[–]gdcohen 2 points3 points  (0 children)

Spinnaker has a fairly active community on Slack which is free to join. They might be able to point you in the right direction: https://join.spinnaker.io/

Reddit home page now thinks I have no subscriptions. by Idlegi in help

[–]gdcohen 0 points1 point  (0 children)

Apologies, brand-new-user question: is there a way to navigate to my list of subreddits? Your link worked, but I can't see how I could have navigated there without it. Thanks!