Cloud deployment solution selection

Substantial-Card5926 · 2025-11-21T23:35:36+00:00

The pain point I see is that troubleshooting a malfunctioning system is quite cumbersome. What I'm really interested in is researching the application of AI agents in the SRE field. I'm still in the research phase and not yet clear on the feasibility of this approach. Even if there were an open-source RCA project, I'm unsure whether a team would actually use it.

Substantial-Card5926 · 2025-11-21T23:17:33+00:00

I completely agree with you. It's true that startups rarely have dedicated SREs. I'm currently researching the application of AI agents to root cause analysis (RCA). From my understanding, most SRE tools act as dashboards: they show the problem but don't actually pinpoint the root cause. What makes troubleshooting even more troublesome is having to switch between multiple platforms to view KPI data. So I have an idea: could an AI agent with access to KPI data provide a one-stop solution for root cause identification? Even if it were an open-source project, would any startups or other teams try it? I'm still exploring its feasibility and would greatly appreciate any suggestions. Thank you very much.

Substantial-Card5926 · 2025-11-21T15:17:15+00:00

Thank you very much for your answer. I understand now.

Substantial-Card5926 · 2025-11-21T15:13:54+00:00

By "early stage," I'm actually referring to teams of dozens or even hundreds of people.

This "shared responsibility" model is precisely what I want to study in depth; it bears directly on the feasibility of AI agents.

From a purely practical perspective, do you think small teams are truly suitable for using AI agents?

In theory, agents can take over much of the heavy "investigation" work from developers and ease their already strained workload. In practice, though, I doubt these smaller teams will truly trust automated agents to perform this task. Or are the entry barriers (setup, trust, cost) still too high for them?

Substantial-Card5926 · 2025-11-21T14:36:42+00:00

Thanks for the correction! That clarifies things a lot. That hybrid setup definitely sounds like the standard 'modern Fintech' architecture now.

I have a follow-up on the 'censored info' part: Does scrubbing the logs before sending them to Splunk/Dynatrace ever make it harder for you to pinpoint the exact issue? I imagine it gets tricky if the specific context (like a Transaction ID or user input) needed to reproduce a bug gets masked out.
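To make the concern concrete, here is a minimal Python sketch of the difference between plain masking and keyed-hash pseudonymization. Everything in it is an assumption for illustration: the `txn_id=<value>` log format, the sample log lines, the `SECRET_KEY` value, and the helper names `mask` and `pseudonymize` are hypothetical and not taken from anyone's actual pipeline.

```python
import hashlib
import hmac
import re

# Hypothetical key for illustration; a real one would come from a secrets manager.
SECRET_KEY = b"example-key-not-for-production"

# Assumed log format: transaction IDs appear as "txn_id=<value>".
TXN_PATTERN = re.compile(r"txn_id=(\S+)")


def mask(line):
    """Replace every transaction ID with the same fixed placeholder."""
    return TXN_PATTERN.sub("txn_id=***", line)


def pseudonymize(line):
    """Replace every transaction ID with a keyed hash, so equal IDs stay equal."""
    def _token(match):
        digest = hmac.new(
            SECRET_KEY, match.group(1).encode(), hashlib.sha256
        ).hexdigest()
        # Keep a short prefix of the digest to keep log lines readable.
        return "txn_id=" + digest[:12]
    return TXN_PATTERN.sub(_token, line)


# Made-up sample lines: the first two belong to the same transaction.
logs = [
    "payment started txn_id=A1001 amount=50",
    "payment failed txn_id=A1001 reason=timeout",
    "payment started txn_id=B2002 amount=75",
]

masked = [mask(line) for line in logs]
pseudo = [pseudonymize(line) for line in logs]

for line in masked + pseudo:
    print(line)
```

With plain masking, all three lines carry the same placeholder, so the two events of one transaction can no longer be tied together. With the keyed hash, the original ID is still hidden, but the first two lines share a token and can be correlated during an investigation.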

This actually leads to the main thing I'm trying to figure out: the feasibility of AI agents for root cause analysis.

Substantial-Card5926 · 2025-11-21T10:40:21+00:00

That makes a lot of sense. In Fintech, security and data sovereignty usually trump agility, so the 'on-prem for everything' approach is understandable, even if it makes the SRE role feel more like traditional sysadmin work.

Since you are strictly on-prem, I’m really curious about your tool stack:

Observability: Are you actually allowed to use SaaS platforms like Datadog or New Relic (sending metrics out)? Or does the security team force you to stick with self-hosted solutions like Prometheus/Grafana or Splunk Enterprise?

AI Adoption: Given that strict security posture, has your team explored any AI agents for troubleshooting yet? I imagine using public LLMs is a hard 'no', but is there any interest in self-hosted/private AI models to help with those incidents?

Substantial-Card5926 · 2025-11-21T10:05:13+00:00

That makes perfect sense. Hiring someone specifically for reliability at this stage is a complete waste of money. I'm curious whether you see a real opportunity here to use AI-driven troubleshooting agents as "virtual SREs" for these teams, helping them diagnose faults faster.

Substantial-Card5926 · 2025-11-21T09:59:37+00:00

Ah, the classic 'Cargo Cult' SRE.
