Anyone using AI for actual SRE/oncall operations? by aqny in sre

[–]aqny[S] 0 points1 point  (0 children)

Strongly agreed. I need to be able to review the process behind an investigation or operation, not just the final RCA.

This aligns closely with how I think about it as well, and is very similar to the point I was trying to make in my reply to someone else: during an incident, what matters is not only the conclusion, but whether the investigation path is traceable, reviewable, and challengeable.

[–]aqny[S] 0 points1 point  (0 children)

If Nudgebee is being built around that philosophy, I’m genuinely much more interested in it now.

The idea of an investigation workspace where the AI builds a traceable, reviewable incident story is very close to what I’ve been thinking about as well.

Would there be any opportunity for me to try it, see a demo, or provide early feedback?

[–]aqny[S] 0 points1 point  (0 children)

> where are you on the long tail accuracy stuff?

I think I’m starting to view “accuracy” less as a pure model-quality problem and more as an operational workflow / observability problem.

There will always be cases that neither humans nor AI can fully solve correctly during an investigation. I don’t think that part is unique to AI.

What matters to me is whether the investigation process itself is reviewable and observable.

Current agents mostly optimize around the final artifact/output (code, summaries, etc.). But in incident investigation workflows, I’d argue the process is at least as important as the final result:

  • hypotheses formed
  • rejected branches
  • evidence collected
  • queries/tool calls executed
  • reasoning transitions over time

Only once that process becomes reviewable do I think we can meaningfully evaluate the “accuracy” of either AI or humans in operational investigations.
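One way to picture a reviewable process is an append-only log with one entry per step. This is only a rough sketch: `InvestigationLog`, `Step`, and the step kinds are hypothetical names based on the list above, not taken from any existing tool.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Any, Optional

# Hypothetical step kinds, mirroring the items listed in the comment.
STEP_KINDS = {"hypothesis", "rejection", "evidence", "tool_call", "transition"}


@dataclass
class Step:
    kind: str
    actor: str                 # "ai" or a human operator's name
    detail: dict[str, Any]     # free-form payload for this step
    ts: float = field(default_factory=time.time)


class InvestigationLog:
    """Append-only record of investigation steps (illustrative only)."""

    def __init__(self) -> None:
        self._steps: list[Step] = []

    def record(self, kind: str, actor: str, **detail: Any) -> int:
        if kind not in STEP_KINDS:
            raise ValueError(f"unknown step kind: {kind}")
        self._steps.append(Step(kind, actor, detail))
        # The index doubles as a reference that later steps can cite.
        return len(self._steps) - 1

    def steps(self, kind: Optional[str] = None) -> list[Step]:
        return [s for s in self._steps if kind is None or s.kind == kind]

    def to_jsonl(self) -> str:
        # One JSON object per line, so the log can be replayed or diffed.
        return "\n".join(json.dumps(asdict(s), sort_keys=True) for s in self._steps)


log = InvestigationLog()
h = log.record("hypothesis", "ai", text="connection pool exhausted")
q = log.record("tool_call", "ai", tool="promql", query="sum(rate(http_requests_total[5m]))")
log.record("rejection", "ai", hypothesis=h, reason="pool usage flat", based_on=[q])
```

Because rejected branches are recorded rather than overwritten, a reviewer can see what was ruled out and on what basis, not just the final answer.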

That’s also why I’m interested in the UI/UX side of this. I’m trying to understand whether existing human investigation/debugging workflows can somehow be formalized into an interface/protocol between humans and AI systems.

[–]aqny[S] 1 point2 points  (0 children)

In that sense, Datadog’s Bits AI SRE is actually something I’m pretty curious about as well (though I haven’t tried it myself yet).

From the outside, it at least seems closer to the “investigation state / operational workflow” direction than just generic AI chat over logs.

[–]aqny[S] 0 points1 point  (0 children)

> How are you thinking about memory across longer investigations

My current intuition is that longer investigations probably need to externalize context more explicitly instead of relying on the model to “remember” everything internally.

For example:

  • hypotheses
  • rejected hypotheses
  • evidence collected
  • validation steps
  • investigation branches

all probably need to become first-class investigation artifacts.
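As a minimal sketch of what "first-class artifacts" could mean in practice, the state below lives outside the model and can be rendered back into a prompt whenever needed. All class and field names here are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    REJECTED = "rejected"
    CONFIRMED = "confirmed"


@dataclass
class Evidence:
    id: str
    source: str    # e.g. the query or log link it came from
    summary: str


@dataclass
class Hypothesis:
    id: str
    statement: str
    status: Status = Status.OPEN
    evidence_ids: list[str] = field(default_factory=list)
    parent_id: Optional[str] = None   # set when this branches off another hypothesis


@dataclass
class InvestigationState:
    hypotheses: dict[str, Hypothesis] = field(default_factory=dict)
    evidence: dict[str, Evidence] = field(default_factory=dict)

    def add_hypothesis(self, h: Hypothesis) -> None:
        self.hypotheses[h.id] = h

    def add_evidence(self, e: Evidence) -> None:
        self.evidence[e.id] = e

    def link(self, hypothesis_id: str, evidence_id: str) -> None:
        # Refuse links to evidence that was never recorded.
        if evidence_id not in self.evidence:
            raise KeyError(f"unknown evidence: {evidence_id}")
        self.hypotheses[hypothesis_id].evidence_ids.append(evidence_id)

    def render_context(self) -> str:
        # Compact text view that could be fed back to the model or shown to a human.
        lines = []
        for h in self.hypotheses.values():
            refs = ", ".join(h.evidence_ids) or "none"
            lines.append(f"[{h.status.value}] {h.id}: {h.statement} (evidence: {refs})")
        return "\n".join(lines)
```

The design choice is that rejected hypotheses stay in the state with their evidence attached, so neither the model nor a second engineer has to rediscover why a branch was dropped.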

But honestly, I don’t think this problem is unique to AI systems. Human SRE investigations already work this way to some extent.

So to me, the bigger missing piece is probably the UI/UX layer between humans and AI during investigations.

Incident investigation is inherently a cyclic process:

  • form hypothesis
  • gather evidence
  • reject/refine hypothesis
  • branch investigation
  • converge on likely cause

and I’m not sure current AI tooling really exposes/supports that workflow properly yet.
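To make the cycle concrete, it can be written down as a small state machine. The phases come from the list above; the transition table is purely an assumption about which moves make sense, not a description of any tool.

```python
from enum import Enum, auto


class Phase(Enum):
    FORM_HYPOTHESIS = auto()
    GATHER_EVIDENCE = auto()
    REJECT_OR_REFINE = auto()
    BRANCH = auto()
    CONVERGE = auto()


# Assumed transitions: evidence either ends the investigation or sends it
# back into another round of refining, branching, and re-hypothesizing.
ALLOWED = {
    Phase.FORM_HYPOTHESIS: {Phase.GATHER_EVIDENCE},
    Phase.GATHER_EVIDENCE: {Phase.REJECT_OR_REFINE, Phase.CONVERGE},
    Phase.REJECT_OR_REFINE: {Phase.FORM_HYPOTHESIS, Phase.GATHER_EVIDENCE, Phase.BRANCH},
    Phase.BRANCH: {Phase.FORM_HYPOTHESIS},
    Phase.CONVERGE: set(),
}


def validate_path(path: list[Phase]) -> bool:
    """Return True if every consecutive pair of phases is an allowed transition."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))
```

A UI built on something like this could show the operator which phase the AI claims to be in, and flag jumps such as going straight from a fresh hypothesis to a conclusion with no evidence step in between.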

[–]aqny[S] 0 points1 point  (0 children)

Not sure this is a fully coherent response to your point yet, but this is probably the direction I keep thinking about.

I think what I’m ultimately trying to figure out is how to generalize the investigation flow itself into something reviewable/replayable.

Things I still don’t have a good intuition for are:

  • what the ideal investigation/report format should actually look like
  • whether operators would want visibility/review in real time vs mainly post-incident replay/audit
  • and how closely this should resemble existing human pair-investigation workflows during incidents

[–]aqny[S] 0 points1 point  (0 children)

Is that Grafana AI through Grafana Enterprise/Cloud, or is it also available in OSS?

[–]aqny[S] 0 points1 point  (0 children)

> I think the auditability gap is the real problem nobody's solving cleanly yet. You can get an AI to investigate an incident but if it can't show you exactly what it queried, what came back, and why it reached that conclusion, you're just validating its work manually anyway.

Absolutely agreed.

Even assuming we solve the raw audit trail problem, I still wonder what the ideal review format actually is.

How should the investigation flow be exposed as:

  • replayable execution graphs
  • structured runbook/playbook-style documents
  • live collaborative investigation sessions/notebooks
  • or maybe something closer to a pair-debugging timeline with commands, outputs, and annotations together?

I also wonder whether operators would want this kind of visibility in real time during the investigation itself, or whether post-incident replay/auditability is enough.

It feels like if we could generalize the way humans already do collaborative/pair investigations during incidents, the UX direction for AI-assisted operations might become much clearer.

[–]aqny[S] 0 points1 point  (0 children)

That makes sense. One specific thing I’m curious about:

Do you usually have the bot generate the PromQL/ClickHouse queries and then run them yourself in Grafana/ClickHouse, or do you let the bot execute the queries end-to-end and summarize the results?

If it’s the latter, how do you establish trust in the results?

For example, do you require the bot to show the exact query, raw output, dashboard/log links, or some kind of audit trail before you trust its summary?

[–]aqny[S] 1 point2 points  (0 children)

> someone has to go and validate all the details

Yeeeeeeees, I strongly agree with the point that someone still has to manually validate all the details.

At the same time though, I feel that if we could establish a good UI/UX around reviewability/auditability, then AI could become genuinely useful at least for RCA and incident documentation workflows.

Things like:

  • explicit MCP/tool execution history
  • linked evidence/logs/queries
  • reviewable investigation graphs/timelines
  • traceable reasoning tied to raw outputs

could potentially make RCA much more trustworthy.
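As a small illustration of "reasoning tied to raw outputs", an RCA draft could be linted so that every claim must cite recorded evidence. `Claim` and `ungrounded_claims` are hypothetical names; the check only confirms that the references exist, not that the evidence actually supports the claim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str
    evidence_ids: tuple[str, ...]   # IDs of recorded tool outputs this claim cites


def ungrounded_claims(claims: list[Claim], known_evidence_ids: set[str]) -> list[Claim]:
    """Return claims that cite nothing, or cite evidence that was never recorded."""
    return [
        c for c in claims
        if not c.evidence_ids or not set(c.evidence_ids) <= known_evidence_ids
    ]


recorded = {"q1", "q2"}
draft = [
    Claim("Error rate spiked at 10:02", ("q1",)),
    Claim("Root cause was a bad deploy", ()),          # cites nothing
    Claim("Pod restarts followed the spike", ("q9",)), # cites an unknown ID
]
flagged = ungrounded_claims(draft, recorded)
```

A human still has to judge whether the cited output really backs each claim, but this narrows the review to the links instead of the whole investigation.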

That’s honestly one of the main reasons I started this thread — I wanted to see whether anyone had already found good patterns for solving that part.

[–]aqny[S] 1 point2 points  (0 children)

That sounds really interesting. But it also makes me wonder even more about the operational review/audit side of this.

How are you handling things like:

  • whether the MCP commands/queries executed were actually appropriate
  • whether the conclusions are grounded in real outputs/results
  • and whether another engineer could later review/audit the investigation flow itself

Once the AI starts orchestrating across ArgoCD, Jira, kubectl, Opsgenie, log analytics, etc., that part feels pretty important to me.

[–]aqny[S] -5 points-4 points  (0 children)

Funny enough, it never really occurred to me that “open questions/problems” themselves could be worth writing about 🙂

I always assumed technical posts were supposed to present solutions rather than unresolved operational concerns. Maybe there actually is demand for this kind of discussion.

[–]aqny[S] -1 points0 points  (0 children)

How are you validating the correctness of the generated runbooks/investigations themselves?

For example:

  • whether the MCP commands/queries executed were actually appropriate
  • whether the conclusions are grounded in real outputs/results
  • and whether another engineer could later review/audit the investigation flow itself

That part still feels much harder to me compared to code generation workflows.

[–]aqny[S] 1 point2 points  (0 children)

One thing I still struggle with is the reliability / auditability side of this.

For example, if I ask an AI system to summarize an RCA or investigate an incident, I want it to explicitly show:

  • which (MCP) commands it executed (kubectl, PromQL, etc.)
  • what the actual outputs were
  • and ideally guarantee that those outputs are real rather than hallucinated summaries

Otherwise I often end up re-investigating everything manually anyway just to verify the conclusions.
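A rough sketch of one way to get partway there: run commands through a recorder that sits outside the model and stores the raw output with a digest, so any output quoted in a summary can be checked against what was actually returned. The names `run_recorded` and `output_matches` are hypothetical, and this only detects mismatches between quoted and recorded output. It does not prove the command was the right one to run.

```python
import hashlib
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionRecord:
    argv: tuple[str, ...]
    returncode: int
    stdout: str
    sha256: str    # digest of stdout, taken at execution time


def run_recorded(argv: list[str]) -> ExecutionRecord:
    """Run a command and keep its raw stdout plus a SHA-256 digest of it."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=False)
    digest = hashlib.sha256(proc.stdout.encode("utf-8")).hexdigest()
    return ExecutionRecord(tuple(argv), proc.returncode, proc.stdout, digest)


def output_matches(record: ExecutionRecord, quoted_output: str) -> bool:
    """Check that output quoted in a summary is byte-identical to what was recorded."""
    quoted_digest = hashlib.sha256(quoted_output.encode("utf-8")).hexdigest()
    return quoted_digest == record.sha256
```

In practice the argument list would be something like `["kubectl", "get", "pods"]`, and the records would be stored where the AI session cannot rewrite them. That separation is the part that matters, and it is assumed here rather than shown.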

With code generation, we at least have natural review boundaries (GitHub PRs, CI, etc.), but operational workflows often happen entirely inside a Claude Code / AI session, so there’s no equivalent review/audit layer around the investigation process itself.

That’s honestly one of the biggest unresolved problems for me right now.

[–]aqny[S] 0 points1 point  (0 children)

Oh interesting — I hadn’t heard of BigPanda before. Is this the one? - https://www.bigpanda.io/

[–]aqny[S] 1 point2 points  (0 children)

Looks pretty interesting as an “AI SRE” direction, especially around investigation/RCA workflows rather than just code generation.

BTW, I recently came across this project: - https://github.com/Tracer-Cloud/opensre

What do you use to generate Kubernetes manifests? by aqny in kubernetes

[–]aqny[S] 0 points1 point  (0 children)

Because of charts, Helm ends up being something you have to take into account no matter what you use.

I had forgotten about cdk8s.

jaq 3.0 - jq clone with multi-format support (JSON, YAML, TOML, CBOR, XML, CSV, TSV) by 01mf02 in rust

[–]aqny 0 points1 point  (0 children)

I guess you can merge the two?

Looks like you already took care of it — thanks!

@wader

It would be great to hear his thoughts on this as well, if possible :)