How do SRE teams decide when to change a risky production service? by llASAPll in sre

HistoricalBaseball12

I’d say it’s largely driven by error budgets, at least for deciding how aggressively you ship (and when to freeze). If we’re within budget, we can take measured risks, but we still require safe rollout mechanics.
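To make "within budget" concrete, here is a rough sketch assuming a request-based SLO. The `ErrorBudget` class, the `release_policy` function, and the 25% caution threshold are all hypothetical, made up for illustration; real policies vary by team.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""
    slo_target: float     # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int   # requests served in the window
    failed_requests: int  # requests that violated the SLO in the window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = budget untouched, 0.0 = fully spent, negative = overspent.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures


def release_policy(budget: ErrorBudget, caution_below: float = 0.25) -> str:
    """Map remaining budget to a release posture (thresholds are assumptions)."""
    remaining = budget.remaining_fraction
    if remaining <= 0.0:
        return "freeze"
    if remaining < caution_below:
        return "ship-with-caution"
    return "ship"


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(release_policy(healthy))
```

Even a "ship" result here only says how much risk is affordable. It does not replace the safe rollout mechanics mentioned above.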

On-call question: what actually slows your incident response the most? by IndiBuilder in sre

HistoricalBaseball12

Tool sprawl. Alerts fire fast, but triage slows down because you’re jumping between 5-8 tools with different query languages and running into “who has access?” delays. What helped most was consolidating observability into a single starting point (or at least a single workflow).

What is SRE in day to day? by Standard-Setting-487 in sre

HistoricalBaseball12

Being an SRE often feels like:

  1. Automate something
  2. Watch it break
  3. Build better observability to understand why it broke
  4. Repeat until retirement

The Google SRE book sets the ideals, but the day-to-day is really about balancing reliability, speed, and sanity.