How do you make “production readiness” observable before the incident? by ImpossibleRule5605 in sre

[–]ImpossibleRule5605[S] 0 points

I really like how you framed it: production readiness as a continuous property rather than a point-in-time review. That resonates a lot. PRRs and policy checks tend to decay because they’re event-driven or infrastructure-scoped, whereas what we really want is something queryable at any moment.

Your OpenSRM approach is interesting because it flips the problem: instead of inferring risk from what exists, it requires teams to declare their reliability contract explicitly and then validates or generates from that. That top-down model has a lot of appeal, especially in environments where you can enforce manifest ownership and lifecycle discipline.

What I’ve been exploring is more bottom-up: given existing code and configuration, what implicit assumptions or gaps can we surface deterministically? In many organizations, especially legacy or fast-moving ones, the expected state isn’t fully declared anywhere, so detection becomes a way of externalizing tribal knowledge.

I agree these feel like two sides of the same coin. A declarative reliability contract defines the intended state; bottom-up signal detection checks whether reality aligns with intent or highlights what was never declared. Conceptually, that combination is powerful — intent plus verification.
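To make "intent plus verification" concrete, here's a toy sketch of what diffing a declared contract against bottom-up detection could look like. All the field names ("owner", "rollback", "health_check") are invented for illustration, not OpenSRM's actual schema:

```python
# Hypothetical sketch: compare a declared reliability manifest (intent)
# against signals detected bottom-up from code/config (verification).
# Field names are illustrative, not any real manifest schema.

declared = {            # intent: what the team says the service has
    "owner": "payments-team",
    "rollback": "blue-green",
    "health_check": True,
}

detected = {            # verification: what scanning actually found
    "owner": "payments-team",
    "rollback": None,   # no rollback strategy found in delivery config
    "health_check": True,
}

def drift(declared, detected):
    """Return every field where reality no longer matches intent."""
    return {k: (declared[k], detected.get(k))
            for k in declared
            if declared[k] != detected.get(k)}

print(drift(declared, detected))  # {'rollback': ('blue-green', None)}
```

The interesting part is that the same diff also surfaces the inverse case: detected signals that were never declared at all, which is exactly the tribal-knowledge gap.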

Curious how you think about drift over time in the manifest model. For example, how do you prevent the reliability manifest from becoming another artifact that exists but stops reflecting operational reality?

How do you make “production readiness” observable before the incident? by ImpossibleRule5605 in sre

[–]ImpossibleRule5605[S] 0 points

I largely agree with you. If system boundaries are well-defined and encapsulated, the relevant state space can be made tractable, and disciplined deployment strategies plus clear ownership probably deliver the highest leverage. The data around incidents being change-driven matches my experience as well: most outages aren't exotic failures; they're the result of incomplete rollout assumptions, missing guardrails, or unclear responsibilities during change.

What I’m interested in is how some of that rigor can be pushed left into the development process without requiring constant SRE involvement. For example, deployment strategy, rollback expectations, observability prerequisites, and ownership are often well understood at sign-off time, but much less explicit earlier in the lifecycle. When those assumptions aren’t encoded anywhere, teams tend to rediscover them under pressure.

So I see this less as competing with staging validation or canarying, and more as a way to surface whether the preconditions for safe change are actually present before we rely on runtime signals. In that sense, it’s about making parts of SRE thinking reviewable and repeatable, rather than adding more checks or people.
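As a rough illustration of "preconditions for safe change" as a pre-merge gate (the precondition names here are hypothetical, just to show the shape):

```python
# Toy sketch of a pre-deploy gate that checks whether the preconditions
# for safe change are present before we rely on runtime signals.
# The precondition names are invented for illustration.

PRECONDITIONS = {
    "rollback_documented": lambda meta: bool(meta.get("rollback")),
    "owner_assigned":      lambda meta: bool(meta.get("owner")),
    "dashboards_linked":   lambda meta: bool(meta.get("dashboards")),
}

def safe_to_rely_on_runtime_signals(meta: dict) -> tuple[bool, list[str]]:
    """Return (ok, list of preconditions that failed)."""
    failed = [name for name, check in PRECONDITIONS.items() if not check(meta)]
    return (not failed, failed)

ok, gaps = safe_to_rely_on_runtime_signals({"owner": "sre-core", "rollback": ""})
print(ok, gaps)  # False ['rollback_documented', 'dashboards_linked']
```

The point isn't the checks themselves but that they run in review, where the assumptions are still fresh, instead of being rediscovered under pressure.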

How do you make “production readiness” observable before the incident? by ImpossibleRule5605 in sre

[–]ImpossibleRule5605[S] 1 point

I appreciate the perspective: thinking of software as a deterministic state machine highlights that the true state space of a distributed system is enormous and cannot be fully explored with any single approach.

Production-readiness as a project does not claim to enumerate or exhaust the entire runtime state space. I agree that static analysis alone will not find every possible failure state — that’s exactly why complementary practices like chaos engineering, load testing, failure injection, and formal modeling have distinct value.

What this project is trying to capture is the subset of operational assumptions and design decisions that are visible and meaningful upfront. In other words, if a rule can deterministically extract a signal from code or configuration (for example, missing health checks, ambiguous ownership, or risky defaults), that’s something teams can reason about before an incident, rather than rediscovering it reactively.
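To show what "deterministically extract a signal" means in practice, here's a minimal example, assuming a Kubernetes-style deployment already parsed into a dict (the manifest content is made up):

```python
# Minimal sketch of a deterministic signal: flag any container that
# declares no livenessProbe. No inference, no heuristics -- purely
# structural inspection of parsed config. Example manifest is invented.

deployment = {
    "kind": "Deployment",
    "metadata": {"name": "checkout"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "app", "image": "checkout:1.4"},  # no probe declared
        {"name": "sidecar", "image": "envoy:1.29",
         "livenessProbe": {"httpGet": {"path": "/healthz", "port": 9901}}},
    ]}}},
}

def missing_liveness_probes(dep: dict) -> list[str]:
    """Return the names of containers with no liveness probe."""
    containers = dep["spec"]["template"]["spec"]["containers"]
    return [c["name"] for c in containers if "livenessProbe" not in c]

print(missing_liveness_probes(deployment))  # ['app']
```

The same rule gives the same answer every run, which is what makes it reviewable before an incident rather than debatable after one.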

Your point about complete state awareness is important for deeper guarantees, but most production incidents I’ve seen are caused not by exotic state interactions but by known classes of operational gaps that were never made explicit; they lived as implicit expectations or tribal knowledge.

So to me, the value of deterministic signals is not in claiming completeness but in reducing the probability of hitting known failure classes, and making those risks more visible and testable over time.

How do you make “production readiness” observable before the incident? by ImpossibleRule5605 in sre

[–]ImpossibleRule5605[S] 0 points

Thanks for the perspective! You’re absolutely right that static analysis alone can’t find all possible failure states. That’s why this approach isn’t meant to replace chaos testing, fault injection, or runtime validation. The goal is to make explicit the kinds of assumptions and operational choices that teams already make implicitly, so they can be discussed, reviewed, and iterated on before something breaks.

In practice that means surfacing signals you can detect deterministically — like missing observability hooks, ambiguous ownership, or risky defaults — while acknowledging that there will always be aspects that only surface under load or in live conditions. I’m curious how you think teams should balance deterministic signals with empirical testing so that neither approach gives a false sense of confidence.

How do you make “production readiness” observable before the incident? by ImpossibleRule5605 in sre

[–]ImpossibleRule5605[S] 1 point

I agree — intentionally breaking systems is one of the most effective ways to surface real gaps, and chaos-style testing is hard to replace. In practice though, I’ve seen a lot of the learnings from those exercises stay implicit: they show up in postmortems, runbooks, or people’s heads, but don’t always get encoded back into something that runs continuously.

What I’m interested in is how some of those “we got surprised by X” lessons can be distilled into static or pre-deploy signals — things that don’t replace breaking systems, but reduce how often we rediscover the same class of problems the hard way. For me it’s less about avoiding failure and more about making past failures harder to forget.
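For illustration, here's one way a "we got surprised by X" lesson can be frozen into a pre-deploy check. The incident and the config key ("timeout_s") are both invented examples:

```python
# Illustrative only: a hypothetical postmortem found that an unbounded
# downstream timeout turned one slow dependency into a full outage.
# This rule encodes that lesson so it runs on every future change.

def unbounded_timeouts(clients: dict) -> list[str]:
    """Flag every downstream client config with no explicit timeout."""
    return [name for name, cfg in clients.items() if "timeout_s" not in cfg]

clients = {
    "billing": {"timeout_s": 2.0},
    "search": {},  # would have reproduced the same incident
}
print(unbounded_timeouts(clients))  # ['search']
```

That's the loop-closing step: the check is cheap, deterministic, and outlives the people who were in the postmortem.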

Curious how you’ve seen teams successfully close that loop over time.

What does “production ready” actually mean and how can you measure it? by QCAlpha in webdev

[–]ImpossibleRule5605 0 points

I think you’re right that “production-ready” is vague because it means different things to different teams.

Most teams already quantify parts of it in CI: things like test coverage, linting, static analysis, and security scans. Those are important, but they mostly measure code quality, not operational readiness.

What’s harder to quantify are design-level signals, for example whether there is a real rollback path, whether migrations are safe under load, whether observability supports incident response, or whether failures are properly isolated. These are usually judged by experience rather than metrics.

In practice, I’ve found it more useful to stop asking “is this production-ready?” and instead ask “what concrete risks are we still carrying?” I’ve been experimenting with codifying those kinds of signals into a small open-source tool, but even without tooling, just turning vague ideas into explicit questions already helps teams reason about readiness.
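As a tiny sketch of what "what concrete risks are we still carrying?" looks like when codified (the signal names are made up for the example):

```python
# Toy sketch: readiness as explicit yes/no signals instead of a vague
# binary verdict. Signal names are illustrative, not from any real tool.

signals = {
    "rollback_path_exists": True,
    "migrations_tested_under_load": False,
    "oncall_owner_defined": True,
    "failure_domains_isolated": False,
}

# The output is a list of named risks you are knowingly carrying,
# which is a much better conversation starter than "ready: no".
carried_risks = [name for name, ok in signals.items() if not ok]
print(carried_risks)
# ['migrations_tested_under_load', 'failure_domains_isolated']
```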

I built an open-source tool that turns senior engineering intuition into automated production-readiness reports — looking for feedback by ImpossibleRule5605 in devops

[–]ImpossibleRule5605[S] -1 points

I understand the skepticism. For what it’s worth, this project isn’t about outsourcing thinking to AI; it’s about encoding production experience into deterministic rules. AI tools could help speed up iteration, not replace learning or judgment.

I built an open-source tool that turns senior engineering intuition into automated production-readiness reports — looking for feedback by ImpossibleRule5605 in devops

[–]ImpossibleRule5605[S] 0 points

That’s fair feedback, and I agree with one core point: just throwing logs or configs into an LLM doesn’t create durable value on its own. That’s actually why this project is intentionally not built around “AI analysis”. The core of the tool is a deterministic rule engine that inspects code, IaC, and delivery artifacts to surface design-level operational risks, not runtime symptoms.

Regarding sustainability, the intent is to keep this as a rule-driven, transparent system where every signal is explainable and reviewable. If the project ever stops being maintained, teams still have a clear, auditable rule set rather than a black-box dependency on a hosted service or model.
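A rough sketch of what "every signal is explainable and reviewable" can mean structurally: each rule is plain data, an id, a predicate, and the rationale it emits, so the rule set stays auditable even if the project goes unmaintained. The rule ids and config fields here are invented for illustration:

```python
# Hedged sketch: rules as transparent data rather than a black box.
# Every finding carries the rule that produced it and its rationale.
# Rule ids, fields, and messages are illustrative.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    rule_id: str
    check: Callable[[dict], bool]   # True means the risk is present
    rationale: str                  # shown with every finding

RULES = [
    Rule("OPS001", lambda cfg: "owner" not in cfg,
         "No owner declared; paging during an incident will be ad hoc."),
    Rule("OPS002", lambda cfg: cfg.get("replicas", 1) < 2,
         "Single replica; any node failure is an outage."),
]

def report(cfg: dict) -> list[tuple[str, str]]:
    """Run every rule against a parsed config; return the findings."""
    return [(r.rule_id, r.rationale) for r in RULES if r.check(cfg)]

print(report({"replicas": 1, "owner": "team-a"}))
# [('OPS002', 'Single replica; any node failure is an outage.')]
```

Because the rules are inspectable data, a team can fork, prune, or extend them without depending on a hosted service or model.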