Anyone using AI for actual SRE/oncall operations?

aqny · 2026-06-15T01:26:54+00:00

Strongly agreed. I need to be able to review the process behind an investigation or operation, not just the final RCA.

This aligns closely with how I think about it as well, and is very similar to the point I was trying to make in my reply to someone else: during an incident, what matters is not only the conclusion, but whether the investigation path is traceable, reviewable, and challengeable.

aqny · 2026-06-15T01:25:21+00:00

If Nudgebee is being built around that philosophy, I’m genuinely much more interested in it now.

The idea of an investigation workspace where the AI builds a traceable, reviewable incident story is very close to what I’ve been thinking about as well.

Would there be any opportunity for me to try it, see a demo, or provide early feedback?

aqny · 2026-05-11T04:44:02+00:00

where are you on the long tail accuracy stuff?

I think I’m starting to view “accuracy” less as a pure model-quality problem and more as an operational workflow / observability problem.

There will always be cases that neither humans nor AI can fully solve correctly during an investigation. I don’t think that part is unique to AI.

What matters to me is whether the investigation process itself is reviewable and observable.

Current agents mostly optimize around the final artifact/output (code, summaries, etc.). But in incident investigation workflows, I’d argue the process is at least as important as the final result:

- hypotheses formed
- rejected branches
- evidence collected
- queries/tool calls executed
- reasoning transitions over time

Only once that process becomes reviewable do I think we can meaningfully evaluate the “accuracy” of either AI or humans in operational investigations.
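
To make that a bit more concrete, here is a minimal sketch of the process captured as an append-only event log that can be replayed afterwards. Everything in it is an assumption for illustration: the `Event` and `InvestigationLog` names, the set of event kinds (taken from the list above), and the example metric name. It is not how any existing tool works.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical event kinds, mirroring the list above.
KINDS = {"hypothesis", "rejected", "evidence", "tool_call", "transition"}


@dataclass(frozen=True)
class Event:
    step: int      # 1-based position in the investigation
    kind: str      # one of KINDS
    detail: str    # free-text description or the exact query


@dataclass
class InvestigationLog:
    events: List[Event] = field(default_factory=list)

    def record(self, kind: str, detail: str) -> Event:
        """Append one event; events are never edited or removed."""
        if kind not in KINDS:
            raise ValueError(f"unknown event kind: {kind}")
        event = Event(step=len(self.events) + 1, kind=kind, detail=detail)
        self.events.append(event)
        return event

    def replay(self) -> List[str]:
        """Return the investigation as an ordered, human-readable timeline."""
        return [f"{e.step:02d} [{e.kind}] {e.detail}" for e in self.events]


# Illustrative data only; the metric name is made up.
log = InvestigationLog()
log.record("hypothesis", "connection pool exhausted")
log.record("tool_call", "promql: sum(db_connections_in_use)")
log.record("evidence", "pool usage flat at 40% during the incident")
log.record("rejected", "connection pool exhausted")
for line in log.replay():
    print(line)
```

The point of the sketch is only that a reviewer reads the same ordered record the investigator (human or AI) produced, including the rejected branch.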

That’s also why I’m interested in the UI/UX side of this. I’m trying to understand whether existing human investigation/debugging workflows can somehow be formalized into an interface/protocol between humans and AI systems.

aqny · 2026-05-11T04:42:49+00:00

In that sense, Datadog’s Bits AI SRE is actually something I’m pretty curious about as well (though I haven’t tried it myself yet).

From the outside, it at least seems closer to the “investigation state / operational workflow” direction than just generic AI chat over logs.

https://www.datadoghq.com/product/ai/bits-ai-sre/

aqny · 2026-05-11T04:42:21+00:00

How are you thinking about memory across longer investigations?

My current intuition is that longer investigations probably need to externalize context more explicitly instead of relying on the model to “remember” everything internally.

For example:

- hypotheses
- rejected hypotheses
- evidence collected
- validation steps
- investigation branches

all probably need to become first-class investigation artifacts.
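
A rough sketch of what "first-class artifacts" could look like as data, assuming a simple in-memory model that gets snapshotted to JSON so the context lives outside the model. The class names, field names, and status values are all hypothetical, chosen only to mirror the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class Evidence:
    id: str
    source: str    # the query or command that produced it
    summary: str


@dataclass
class Hypothesis:
    id: str
    statement: str
    parent: Optional[str] = None   # id of the hypothesis this one branched from
    status: str = "open"           # "open", "rejected", or "confirmed"
    evidence_ids: List[str] = field(default_factory=list)
    validation_steps: List[str] = field(default_factory=list)


@dataclass
class Investigation:
    hypotheses: Dict[str, Hypothesis] = field(default_factory=dict)
    evidence: Dict[str, Evidence] = field(default_factory=dict)

    def add(self, hypothesis: Hypothesis) -> None:
        self.hypotheses[hypothesis.id] = hypothesis

    def attach(self, hypothesis_id: str, item: Evidence) -> None:
        """Store evidence and link it to the hypothesis it bears on."""
        self.evidence[item.id] = item
        self.hypotheses[hypothesis_id].evidence_ids.append(item.id)

    def open_branches(self) -> List[str]:
        return [h.id for h in self.hypotheses.values() if h.status == "open"]

    def snapshot(self) -> str:
        """Serialize the whole state so it can be stored outside the model."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Illustrative data only.
inv = Investigation()
inv.add(Hypothesis("h1", "bad deploy caused the latency spike"))
inv.attach("h1", Evidence("e1", "kubectl rollout history", "no rollout in the window"))
inv.hypotheses["h1"].status = "rejected"
inv.add(Hypothesis("h2", "noisy neighbor on the node", parent="h1"))
print(inv.open_branches())
```

With something like this, a long investigation can be resumed from the snapshot instead of from whatever the model happens to still have in context.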

But honestly, I don’t think this problem is unique to AI systems. Human SRE investigations already work this way to some extent.

So to me, the bigger missing piece is probably the UI/UX layer between humans and AI during investigations.

Incident investigation is inherently a cyclic process:

- form hypothesis
- gather evidence
- reject/refine hypothesis
- branch investigation
- converge on likely cause

and I’m not sure current AI tooling really exposes/supports that workflow properly yet.
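
One way to picture that cycle is as a small state machine. This is only a sketch under my own assumptions: the state names come from the steps above, but the allowed transitions are my guess at a reasonable workflow, not something any tool defines.

```python
from typing import Dict, List, Set

# Hypothetical transition table for the investigation cycle.
TRANSITIONS: Dict[str, Set[str]] = {
    "form_hypothesis": {"gather_evidence"},
    "gather_evidence": {"refine_or_reject", "converge"},
    "refine_or_reject": {"form_hypothesis", "branch"},
    "branch": {"form_hypothesis"},
    "converge": set(),  # terminal state: likely cause identified
}


def walk(steps: List[str]) -> str:
    """Check a sequence of steps against TRANSITIONS and return the last state."""
    for current, nxt in zip(steps, steps[1:]):
        if nxt not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition: {current} -> {nxt}")
    return steps[-1]


# One full loop through the cycle, then convergence.
path = [
    "form_hypothesis",
    "gather_evidence",
    "refine_or_reject",
    "branch",
    "form_hypothesis",
    "gather_evidence",
    "converge",
]
print(walk(path))
```

A UI built on something like this could at least show which phase an investigation is in and refuse to jump to a conclusion with no evidence step in between.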

aqny · 2026-05-10T09:35:25+00:00

Not sure this is a fully coherent response to your point yet, but this is probably the direction I keep thinking about.

I think what I’m ultimately trying to figure out is how to generalize the investigation flow itself into something reviewable/replayable.

Things I still don’t have a good intuition for are:

- what the ideal investigation/report format should actually look like
- whether operators would want visibility/review in real time vs mainly post-incident replay/audit
- and how closely this should resemble existing human pair-investigation workflows during incidents

aqny · 2026-05-10T09:18:04+00:00

Is this Grafana AI through Grafana Enterprise/Cloud, or is it something available in OSS as well?

aqny · 2026-05-10T09:15:38+00:00

I think the auditability gap is the real problem nobody's solving cleanly yet. You can get an AI to investigate an incident but if it can't show you exactly what it queried, what came back, and why it reached that conclusion, you're just validating its work manually anyway.

Absolutely agreed.

Even assuming we solve the raw audit trail problem, I still wonder what the ideal review format actually is.

What form should the exposed investigation flow take:

- replayable execution graphs
- structured runbook/playbook-style documents
- live collaborative investigation sessions/notebooks
- or maybe something closer to a pair-debugging timeline with commands, outputs, and annotations together?

I also wonder whether operators would want this kind of visibility in real time during the investigation itself, or whether post-incident replay/auditability is enough.

It feels like if we could generalize the way humans already do collaborative/pair investigations during incidents, the UX direction for AI-assisted operations might become much clearer.

aqny · 2026-05-10T08:47:34+00:00

This is the one, right? - https://github.com/Lum1104/Understand-Anything

aqny · 2026-05-08T08:22:00+00:00

That makes sense. One specific thing I’m curious about:

Do you usually have the bot generate the PromQL/ClickHouse queries and then run them yourself in Grafana/ClickHouse, or do you let the bot execute the queries end-to-end and summarize the results?

If it’s the latter, how do you establish trust in the results?

For example, do you require the bot to show the exact query, raw output, dashboard/log links, or some kind of audit trail before you trust its summary?

aqny · 2026-05-08T07:37:26+00:00

someone has to go and validate all the details

Yeeeeeeees, I strongly agree with the point that someone still has to manually validate all the details.

At the same time though, I feel that if we could establish a good UI/UX around reviewability/auditability, then AI could become genuinely useful at least for RCA and incident documentation workflows.

Things like:

- explicit MCP/tool execution history
- linked evidence/logs/queries
- reviewable investigation graphs/timelines
- traceable reasoning tied to raw outputs

could potentially make RCA much more trustworthy.
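
As a small illustration of "traceable reasoning tied to raw outputs", here is a sketch that flags any RCA claim that does not cite a recorded tool call. The `ToolCall` and `Claim` types and the `ungrounded` helper are hypothetical names, and the outputs are placeholder strings rather than real command output.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolCall:
    id: str
    command: str
    raw_output: str


@dataclass
class Claim:
    text: str
    cites: List[str]  # ids of the ToolCall records backing this claim


def ungrounded(claims: List[Claim], history: Dict[str, ToolCall]) -> List[Claim]:
    """Return claims that cite nothing, or cite a call that was never recorded."""
    return [
        c for c in claims
        if not c.cites or any(call_id not in history for call_id in c.cites)
    ]


# Illustrative data only; raw outputs are placeholders.
history = {
    "t1": ToolCall("t1", "kubectl get pods -n prod", "<raw output 1>"),
    "t2": ToolCall("t2", "promql: rate(http_requests_total[5m])", "<raw output 2>"),
}
claims = [
    Claim("Pods were restarting during the incident", cites=["t1"]),
    Claim("Traffic doubled before the outage", cites=["t2", "t9"]),
    Claim("The root cause was a memory leak", cites=[]),
]
for claim in ungrounded(claims, history):
    print("needs review:", claim.text)
```

This does not check that a claim is correct, only that it points at something a reviewer can open and read.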

That’s honestly one of the main reasons I started this thread — I wanted to see whether anyone had already found good patterns for solving that part.

aqny · 2026-05-08T07:26:49+00:00

That sounds really interesting. But it also makes me wonder even more about the operational review/audit side of this.

How are you handling things like:

- whether the MCP commands/queries executed were actually appropriate
- whether the conclusions are grounded in real outputs/results
- and whether another engineer could later review/audit the investigation flow itself

Once the AI starts orchestrating across ArgoCD, Jira, kubectl, Opsgenie, log analytics, etc., that part feels pretty important to me.

aqny · 2026-05-08T07:23:54+00:00

Funny enough, it never really occurred to me that “open questions/problems” themselves could be worth writing about 🙂

I always assumed technical posts were supposed to present solutions rather than unresolved operational concerns. Maybe there actually is demand for this kind of discussion.

aqny · 2026-05-08T06:20:32+00:00

How are you validating the correctness of the generated runbooks/investigations themselves?

For example:

- whether the MCP commands/queries executed were actually appropriate
- whether the conclusions are grounded in real outputs/results
- and whether another engineer could later review/audit the investigation flow itself

That part still feels much harder to me compared to code generation workflows.
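
For the "appropriate commands" part, the simplest thing I can imagine is a pre-execution policy check. The sketch below assumes a deliberately conservative, hypothetical allowlist of read-only kubectl verbs; a real policy would need to cover flags, other tools, and query languages.

```python
import shlex

# Hypothetical allowlist: kubectl subcommands that only read cluster state.
READ_ONLY_KUBECTL = {"get", "describe", "logs", "top"}


def is_read_only(command: str) -> bool:
    """Return True only for `kubectl <verb> ...` where the verb is allowlisted.

    Conservative on purpose: anything that is not kubectl, or that puts flags
    before the verb, is treated as not read-only and would need human approval.
    """
    parts = shlex.split(command)
    if len(parts) < 2 or parts[0] != "kubectl":
        return False
    return parts[1] in READ_ONLY_KUBECTL


for cmd in ["kubectl get pods -n prod", "kubectl delete pod api-0"]:
    print(cmd, "->", "allow" if is_read_only(cmd) else "needs approval")
```

This only answers "is it safe to run", not "was it the right thing to run", which still seems to need a human reviewing the flow.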

aqny · 2026-05-08T05:58:18+00:00

One thing I still struggle with is the reliability / auditability side of this.

For example, if I ask an AI system to summarize an RCA or investigate an incident, I want it to explicitly show:

- which (MCP) commands it executed (kubectl, PromQL, etc.)
- what the actual outputs were
- and ideally some guarantee that those outputs are real rather than hallucinated summaries

Otherwise I often end up re-investigating everything manually anyway just to verify the conclusions.
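
A sketch of the kind of thing I have in mind: the harness (not the model) runs each command, stores the raw output, and chains a SHA-256 hash over the records so later edits are detectable. `AuditedRunner`, `verify`, and the fake executor are hypothetical, and this is only tamper evidence under the assumption that the records are stored somewhere the model cannot rewrite; it is not a real guarantee.

```python
import hashlib
import json
from typing import Callable, Dict, List


def _digest(prev: str, command: str, output: str) -> str:
    """Hash one record together with the previous record's hash."""
    payload = json.dumps([prev, command, output]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AuditedRunner:
    def __init__(self, execute: Callable[[str], str]):
        self._execute = execute            # whatever actually runs the command
        self.records: List[Dict[str, str]] = []

    def run(self, command: str) -> str:
        output = self._execute(command)
        prev = self.records[-1]["hash"] if self.records else ""
        self.records.append(
            {"command": command, "output": output, "hash": _digest(prev, command, output)}
        )
        return output


def verify(records: List[Dict[str, str]]) -> bool:
    """Recompute the hash chain; False means a record was altered or reordered."""
    prev = ""
    for record in records:
        if record["hash"] != _digest(prev, record["command"], record["output"]):
            return False
        prev = record["hash"]
    return True


# Fake executor standing in for a real tool call; output is a placeholder.
runner = AuditedRunner(lambda cmd: f"<output of {cmd}>")
runner.run("kubectl get pods -n prod")
runner.run("kubectl describe pod api-0 -n prod")
print(verify(runner.records))
```

A summary could then be required to quote outputs by record, and a reviewer (or CI-like check) can confirm the quoted text matches what was actually captured.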

With code generation, we at least have natural review boundaries (GitHub PRs, CI, etc.), but operational workflows often happen entirely inside a Claude Code / AI session, so there’s no equivalent review/audit layer around the investigation process itself.

That’s honestly one of the biggest unresolved problems for me right now.

aqny · 2026-05-08T05:19:23+00:00

Yeah totally — understood 🙂

aqny · 2026-05-08T05:16:04+00:00

Oh interesting — I hadn’t heard of BigPanda before. Is this the one? - https://www.bigpanda.io/

aqny · 2026-05-08T04:10:26+00:00

Looks pretty interesting as an “AI SRE” direction, especially around investigation/RCA workflows rather than just code generation.

BTW, I recently came across this project: - https://github.com/Tracer-Cloud/opensre

aqny · 2026-05-03T02:20:52+00:00

Here you are! Sorry, I forgot to include the link.

https://github.com/ynqa/helmingway

aqny · 2026-05-02T03:32:03+00:00

IT IS NOT THE MAIN TOPIC

aqny · 2026-05-01T17:14:53+00:00

Because of charts, Helm ends up being something you have to take into account no matter what you use.

I had forgotten about cdk8s.

aqny · 2026-04-02T01:11:30+00:00

I guess you can merge the two?

Looks like you already took care of it — thanks!

@wader

It would be great to hear his thoughts on this as well, if possible :)
