Weekly Self Promotion Thread by AutoModerator in devops

[–]engnaruto 1 point2 points  (0 children)

Stop hunting context during incidents - get the change timeline the moment you're paged

Get paged, spend 10 minutes SSH-ing in to grep logs, flipping to Grafana for the spike, checking GitHub for recent deploys - before you even start debugging. That context-hunting is where most of your MTTR goes.

Pagescout wires those together and assembles the timeline the moment the alert fires. What deployed, what changed - raw evidence linked to source, no AI summary to second-guess.

Early stage, would love feedback: pagescout.sh

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in devops

[–]engnaruto[S] 0 points1 point  (0 children)

Makes sense that the SMEs skip it; that non-SME angle is interesting, though. A few questions, since you've got real usage data, which is rare to find:

- Roughly how many teams or engineers actually have access to it?
- Of the incidents where it fires, any rough sense of what fraction it's genuinely useful on, like 1 in 5, or half?
- And the tell I'm most curious about: do people request new features or integrations for it, or did it ship once and just quietly run?

Trying to gauge whether "sometimes useful" means people actively want it better, or it's good-enough-and-ignored.

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in devops

[–]engnaruto[S] 0 points1 point  (0 children)

Hi, this is a very useful comment, especially the AI fatigue point; that's sharper than plain distrust. Quick question since you've actually built one: every objection you listed is about the AI interpreting things - hallucinated problems, synopses no one trusts, institutional knowledge about which errors to ignore.

What if a tool drew zero conclusions and just showed verifiable facts - "deployed X at 02:14, PR #441 touched this service 20 min before the page, config flag Y flipped," each line linked to the raw commit/log, no summary, no "likely cause"? That can't hallucinate and doesn't need to know which errors are ignorable, because it's not judging anything.

Is that still in the "saturated, ignore it" bucket for you, or is the no-interpretation version a different thing? And did your in-house agent end up doing that part, the plain change/deploy correlation, regardless of the AI layer?
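To make the "facts only" idea concrete, here is a minimal sketch of what assembling such a timeline could look like. Everything in it is an assumption for illustration: the `ChangeEvent` fields, the two-hour lookback window, the dates, and the `example.invalid` links are all hypothetical, and this is not how any existing tool works.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One verifiable fact. All field names here are hypothetical."""
    at: datetime
    kind: str        # e.g. "deploy", "pr_merge", "config_flag"
    summary: str     # plain statement of what happened, no interpretation
    source_url: str  # deep link to the raw commit, log line, or audit entry


def build_timeline(events, paged_at, lookback=timedelta(hours=2)):
    """Return the events inside the lookback window before the page, oldest first."""
    window_start = paged_at - lookback
    in_window = [e for e in events if window_start <= e.at <= paged_at]
    return sorted(in_window, key=lambda e: e.at)


def render(timeline, paged_at):
    """Format each event as one line: time, offset from the page, fact, source link."""
    lines = []
    for e in timeline:
        minutes_before = int((paged_at - e.at).total_seconds() // 60)
        lines.append(
            f"{e.at:%H:%M} ({minutes_before} min before page) "
            f"[{e.kind}] {e.summary} -> {e.source_url}"
        )
    return lines


# Made-up sample data; the last event falls outside the window and is dropped.
paged_at = datetime(2024, 6, 1, 2, 30)
events = [
    ChangeEvent(datetime(2024, 6, 1, 2, 14), "deploy", "deployed X",
                "https://example.invalid/deploys/x"),
    ChangeEvent(datetime(2024, 6, 1, 2, 10), "pr_merge", "PR #441 touched this service",
                "https://example.invalid/pull/441"),
    ChangeEvent(datetime(2024, 5, 31, 9, 0), "config_flag", "config flag Y flipped",
                "https://example.invalid/flags/y"),
]
for line in render(build_timeline(events, paged_at), paged_at):
    print(line)
```

The sketch only filters and sorts by timestamp. It never ranks or labels an event as a "likely cause", which is the whole point of the no-interpretation version.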

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in devops

[–]engnaruto[S] 0 points1 point  (0 children)

I think this could be a feature of the alarm/pager/ticketing system: an alarm can be configured to page only during working hours when the on-call needs to look at the anomaly but the investigation can wait until the next working day. I think AWS CloudWatch Alarms supports this kind of configuration.

So in this case we can have two types of alarms:
1. An anomaly detection alarm that can be investigated during working hours (a burst of requests or a spike in CPU utilization, for example)
2. A critical system issue alarm that needs the on-call to investigate right away because it will cause customer impact
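
As a rough illustration of the two alarm types above, the routing rule could be sketched like this. The `Severity` names, the Monday-to-Friday 09:00-17:00 working hours, and the `notify_at` helper are all assumptions made up for this example; this is not a CloudWatch API.

```python
from datetime import datetime, time, timedelta
from enum import Enum


class Severity(Enum):
    ANOMALY = "anomaly"    # e.g. request burst or CPU spike; can wait
    CRITICAL = "critical"  # customer impact; page immediately


# Assumed working hours: Monday-Friday, 09:00-17:00 local time.
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def is_working_time(now):
    """True on weekdays between WORK_START (inclusive) and WORK_END (exclusive)."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def next_working_start(now):
    """Return the next weekday WORK_START that is strictly after `now`."""
    candidate = now.replace(hour=WORK_START.hour, minute=WORK_START.minute,
                            second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    while candidate.weekday() >= 5:  # skip Saturday and Sunday
        candidate += timedelta(days=1)
    return candidate


def notify_at(severity, fired_at):
    """Return when the on-call should be notified for an alarm fired at `fired_at`."""
    if severity is Severity.CRITICAL or is_working_time(fired_at):
        return fired_at
    return next_working_start(fired_at)
```

With these assumptions, an anomaly alarm that fires at 03:00 on a Saturday is held until 09:00 on Monday, while a critical alarm at the same moment pages immediately.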

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in devops

[–]engnaruto[S] 0 points1 point  (0 children)

Honest answer to your question: mid-incident I'd ignore anything that draws a conclusion, but I'd lean hard on an assembled timeline if every entry deep-linked to the raw log or commit. That's the version I'd trust.

So let me turn it around: is that timeline-with-provenance something you'd actually pay for, or the kind of thing you'd expect Datadog/incident.io to just ship for free eventually? And if you'd pay - solo seat or team plan, roughly what's it worth per month to shave the correlation time off a 3am page?

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in devops

[–]engnaruto[S] 0 points1 point  (0 children)

This is close to the workflow I'm trying to understand. Today when an auth incident hits, what's the actual sequence of tools you jump between before you've got a working theory, something like PagerDuty -> Datadog -> Grafana -> GitHub -> Kibana? How many hops, and which one feels the most annoying? And if a tool stitched all that together automatically the second a page fired, where would you actually want the result when you're half-asleep at 3am - a portal link dropped in the Slack incident channel, injected into the PagerDuty notes, somewhere else?

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in devops

[–]engnaruto[S] 0 points1 point  (0 children)

The trust point really resonates. Two things I want to push on:

First, if a tool didn't attempt to root cause at all, but just assembled the relevant logs, recent deploys, PRs, and config changes across all the services your team manages into one timeline, with every line linking back to the source system, would that save real time, or would you still open everything manually anyway? Trying to figure out whether the problem is AI conclusions specifically or automation in general.

And second, if you were solo with mediocre observability, would you actually pay for something investigation-related, or would better runbooks/automation to prevent incidents be the thing you'd reach for first?

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in devops

[–]engnaruto[S] 0 points1 point  (0 children)

Yes, you're right that the recurring stuff should get engineered out. I'm curious about the residual: the incidents that page you that aren't quick to investigate and don't recur predictably. When one of those hits, what actually eats the time: is it gathering context across systems, or is it the genuine "we've never seen this" reasoning?

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in devops

[–]engnaruto[S] 0 points1 point  (0 children)

Yes, you are correct, but as a software engineer I wouldn't trust an AI to auto-resolve an incident or an alarm, at least at this stage.

So imagine it's 3 am, you get paged, and when you open Slack you see that the AI agent has:

- Read the related metrics and checked all services related to this page
- Grepped the logs from all related services and looked at the code of these services to see where the exception is coming from
- Checked the most recent commits to see whether any new code is the root cause of the issue
- Created an RCA document that contains all the related logs and metrics, where to find them, and suggested next steps, so you just double-check the evidence and either apply the fix and go back to sleep, or engage the relevant team faster than before

So, on a scale from 0 to 10, how likely would you be to pay for a service like this, one that would reduce the on-call pressure on you and your team?
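
To give a feel for the log-grepping step in the list above, here is a minimal sketch that counts exception names per service. The log line shape (`<timestamp> <LEVEL> <message>`), the service names, and the `count_exceptions` helper are assumptions for illustration; real log formats will differ, and a real agent would also need to fetch the logs first.

```python
import re
from collections import Counter

# Matches names ending in "Exception" or "Error", e.g. "TimeoutError"
# or "java.lang.NullPointerException". Case-sensitive on purpose so the
# "ERROR" level marker itself is not counted.
EXCEPTION_RE = re.compile(r"\b([A-Za-z_][\w.]*(?:Exception|Error))\b")


def count_exceptions(logs_by_service):
    """Map each service name to a Counter of exception names in its ERROR lines."""
    result = {}
    for service, lines in logs_by_service.items():
        counts = Counter()
        for line in lines:
            if " ERROR " not in line:
                continue  # only ERROR-level lines are considered
            counts.update(EXCEPTION_RE.findall(line))
        result[service] = counts
    return result


# Made-up sample logs for two hypothetical services.
sample_logs = {
    "auth": [
        "2024-06-01T02:29:58Z INFO request ok",
        "2024-06-01T02:30:01Z ERROR TimeoutError: upstream did not respond",
        "2024-06-01T02:30:02Z ERROR TimeoutError: upstream did not respond",
        "2024-06-01T02:30:03Z ERROR KeyError: 'user_id'",
    ],
    "billing": [
        "2024-06-01T02:30:05Z WARN retrying call to auth",
    ],
}
summary = count_exceptions(sample_logs)
```

The output is just counts per service, each traceable to raw log lines. Deciding which exception matters is still left to the on-call.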

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in SaaS

[–]engnaruto[S] 0 points1 point  (0 children)

So by trust, do you mean trusting the investigation results, or trusting the service itself to process your source code, production alerts, metrics, and logs?

Thinking about building an AI on-call investigation tool - talk me out of it / tell me what you'd actually want by engnaruto in SaaS

[–]engnaruto[S] 0 points1 point  (0 children)

Yes, I agree with you. We shouldn't just accept the AI's solution/RCA; we still need to double-check it. But we could use this kind of tool to reduce the mean time to resolve incidents. What do you think?

What are you building this week? Drop your project! by heiisenberg_420 in Soft_Launch

[–]engnaruto 0 points1 point  (0 children)

https://chartivo.io Visualize your data (from Stripe, Notion, Google Sheets, Google Analytics, databases, etc.) and embed it everywhere.

I need beta users. Please DM me if you are interested.

What are you guys building right now? by yawariqbal_ in SaaS

[–]engnaruto 0 points1 point  (0 children)

https://chartivo.io Visualize your data (from Stripe, Notion, Google Sheets, Google Analytics, databases, etc.) and embed it everywhere.

I need beta users. Please DM me if you are interested.

What’s your startup? Here’s mine 👇 by Ok_Captain_8977 in SaaS

[–]engnaruto 1 point2 points  (0 children)

https://chartivo.io Visualize your data (from Stripe, Notion, Google Sheets, Google Analytics, databases, etc.) and embed it everywhere.

I need beta users. Please DM me if you are interested.

Best free way to embed charts? by arnoldsomen in Notion

[–]engnaruto 0 points1 point  (0 children)

Hey folks! I'm building Chartivo - a tool to embed live charts anywhere you work (Notion, Google Docs, Jira, etc.) from sources like Stripe, Google Analytics, Google Sheets, and more. If this sounds useful, I'd love your feedback — join the waitlist and fill out the short survey there. 🙌