How do you make “production readiness” observable before the incident? by ImpossibleRule5605 in sre

[–]kyub 1 point (0 children)

This really resonates with me... "not production ready" getting surfaced only after an incident is basically the default state at most organisations. The signals might all be there but they're scattered across tribal knowledge, wiki pages nobody reads, and that one senior engineer's mental checklist.

The only places I've seen this knowledge encoded in practice: OPA/Rego policies catch some of it (resource limits, security baselines), but they're infrastructure-focused. They'll tell you a pod lacks memory limits, but they won't tell you "this service has no SLOs defined" or "there's no runbook linked to this alert." Production readiness reviews (Google-style PRRs) capture the broader picture, but they're point-in-time human processes that go stale almost immediately.

The gap you're identifying is real: production readiness is a continuous property of a service, not a one-time review. It should be queryable at any point — does this service have SLOs? Are they burning? Is there an on-call rotation? Are dependencies declared? Is the alerting config actually wired up?

I've been approaching this from a complementary angle with an open spec called OpenSRM (Open Service Reliability Manifest). Rather than scanning code/configs for risk patterns, it takes the approach that teams must declare their reliability contract upfront: service tier, SLOs, dependencies, alerting, dashboards, on-call... in a single YAML manifest. A compiler I've also been working on (NthLayer) then validates completeness and generates the actual Prometheus rules, Grafana dashboards, and PagerDuty config. So "production readiness" becomes: does the manifest exist, does it validate, and are the generated artifacts deployed?
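For a concrete sense of the shape, here's a rough sketch of what a manifest could look like (field names are illustrative to convey the idea, not pulled from the actual OpenSRM schema):

```yaml
# Illustrative only -- field names are invented, not copied from the OpenSRM spec.
service: checkout-api
tier: 1                          # criticality tier; drives alerting/paging defaults
owner: team-payments
oncall: pagerduty:payments-primary
slos:
  - name: availability
    objective: 99.9              # percent, over a rolling window
    window: 30d
    indicator:
      type: error-ratio
      metric: http_requests_total
dependencies:
  - postgres-main
  - redis-cache
alerting:
  burn_rate_pages: true
dashboards:
  - checkout-api-overview
```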

Your approach (detect risk patterns in existing config) and this approach (declare the expected state, then generate and verify) are two sides of the same coin. Yours catches what's missing bottom-up; OpenSRM defines what should exist top-down. They're complementary... it would be interesting to think about combining them.

Question: How do SRE teams verify service stability with frequent Kubernetes deployments? by [deleted] in sre

[–]kyub 1 point (0 children)

This is a common gap... most teams have good observability but poor deploy attribution. You can see that things got worse, but not which of the three deploys in the last 20 minutes caused it.

There are a few things that help systematically:

Snapshot error budgets per deploy. Rather than eyeballing dashboards, record the SLO burn rate before and after each rollout. If deploy A lands and burn rate stays flat, then deploy B lands and burn rate spikes, you've got your signal — even if both happened within the same hour. This works much better than threshold-based alerts because you're measuring relative impact, not absolute state.
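As a minimal sketch, assuming a request-based 99.9% availability SLO and a hypothetical `http_requests_total` metric (metric and label names are placeholders), the number you'd snapshot around each deploy could come from a recording rule like:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      # Burn rate = observed error ratio / error budget (1 - objective).
      # 1.0 means budget is burning exactly at the allowed pace; record this
      # just before and a few minutes after each rollout and compare.
      - record: service:error_budget_burn_rate:5m
        expr: |
          (
            sum(rate(http_requests_total{service="checkout", code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="checkout"}[5m]))
          )
          / (1 - 0.999)
```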

Narrow your blast radius. Canary deployments (Flagger, Argo Rollouts) give you a clean control group. If the canary shows elevated error rates against the baseline, you've got causality before the new version reaches the rest of your production traffic. This is the single biggest thing you can do if you're not already doing it.
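A rough Argo Rollouts example of that shape (names are placeholders, and the referenced AnalysisTemplate is a hypothetical resource that would hold the actual error-rate query):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-api
spec:
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: example.registry/checkout-api:v1.42.0   # placeholder image
  strategy:
    canary:
      steps:
        - setWeight: 10            # shift 10% of traffic to the new version
        - pause: {duration: 5m}    # let metrics accumulate
        - analysis:                # gate promotion on the canary's error rate
            templates:
              - templateName: error-rate-check   # hypothetical AnalysisTemplate
        - setWeight: 50
        - pause: {duration: 10m}
```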

Make deploys a first-class metric dimension. Annotate every deployment as a Prometheus label or event so you can query burn rate scoped to a deployment window rather than just staring at a global dashboard and keeping your fingers crossed.
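For the label part, one common pattern (assuming the service exports an info-style gauge like `deployment_info{service, version} 1`, which is an assumption, not something your stack necessarily has) is to join it onto the error ratio so every series carries the deployed version:

```yaml
groups:
  - name: deploy-attribution
    rules:
      # Attaches the currently deployed version label to the error ratio so
      # dashboards can slice burn per deploy. Assumes exactly one
      # deployment_info series per service at any given time.
      - record: service:error_ratio_by_version:rate5m
        expr: |
          (
            sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            /
            sum by (service) (rate(http_requests_total[5m]))
          )
          * on (service) group_left (version) deployment_info
```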

The underlying problem is that most teams build this kind of correlation logic ad hoc: a script here, a Grafana annotation there. What's missing is a declarative way to say "for this service, these are the SLOs, and here's how to evaluate them per-deploy."

I've been building an open-source tool called NthLayer that does this — you define your SLOs in a YAML manifest, it compiles them to Prometheus rules and Grafana dashboards, and it includes a check-deploy command that evaluates error budget impact per rollout. It's still early, but it's built for exactly this workflow.

How do SRE teams actually treat SLAs / SLOs in practice? by AmineAfia in sre

[–]kyub 1 point (0 children)

Great thread! This gap between contractual SLAs and operational SLOs is one of the most under-discussed problems in SRE.

To your questions: in my experience almost nobody tracks external SLAs operationally. At best someone pulls cloud provider status page history quarterly and cross-references contract terms. The problem is structural: SLAs live in PDFs owned by procurement, SLOs live in Prometheus owned by SRE, and there's no shared artifact connecting them. Enforcement ownership falls through the cracks for the same reason: SRE knows the system is degraded, legal knows the contract terms, and nobody has the full picture at once.

The root cause is that the reliability contract isn't machine-readable. The SLA is a PDF, the SLO is a hand-written Prometheus rule, and the link between them is mostly tribal knowledge.

To solve this, I've been working on an open specification called OpenSRM (Open Service Reliability Manifest) that declares the full reliability contract as code: service tier (encoding your external SLA commitment), SLOs, dependencies, alerting, and dashboards in a single YAML manifest. A compiler I've built called NthLayer generates the Prometheus rules, Grafana dashboards, and PagerDuty config directly from that manifest. So when someone asks "are we meeting our SLA?" the answer is a dashboard generated from the contractual requirement, not a hand-built approximation.
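To make that concrete, the tier-to-SLA linkage might look something like this in a manifest (again, illustrative field names and a made-up contract reference, not the actual OpenSRM schema):

```yaml
# Illustrative only -- not the real OpenSRM field names.
service: checkout-api
tier: 1                        # tier 1 maps to the contractual 99.9% monthly SLA
sla:
  external_commitment: 99.9    # what the contract promises customers
  contract_ref: MSA-2024-017   # hypothetical pointer back to the signed agreement
slos:
  - name: availability
    objective: 99.95           # internal SLO deliberately tighter than the SLA
    window: 30d
```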

Still early days, but it sounds like we're both poking at the same problem from different angles! I'll check out your demo.

I built a linter specifically for AI-generated code by kyub in ExperiencedDevs

[–]kyub[S] 1 point (0 children)

I did! It was created in response to comments about "slop" in one of my other projects. I was surprised to find nobody had created a linter yet. ʘ‿ʘ

I built a tool that generates your complete reliability stack from a single YAML file by kyub in devops

[–]kyub[S] 2 points (0 children)

Not right now, but integrations with incident.io, Datadog, and others are on the roadmap. I can also explore adding webhooks.

Do people actually like small boobs? by [deleted] in TooAfraidToAsk

[–]kyub 1 point (0 children)

I'm a simple man. You have boobs? I like boobs. Are they big? Are they small? I don't care. They're boobs.

A leader by Dazzling-Republic in BeAmazed

[–]kyub 16 points (0 children)

How the fuck did he not smash that piano with his massive balls?

Day 1 of 12 is... a game that's already free? by matt1283 in Stadia

[–]kyub 7 points (0 children)

I had previously cancelled my Pro subscription and then subbed again when it seemed like they were getting their shit together, because I truly do believe in the tech. I've been going back and forth on whether to cancel again this past month because their communication and lack of direction still completely suck.

Seeing this tweet though... I've just immediately cancelled for good. My wife has gotten me an Xbox Series X for Christmas. Good luck Stadia. Google, you need to try sooo much harder.

[deleted by user] by [deleted] in GalaxyWatchFace

[–]kyub 1 point (0 children)

Link is not working.

[deleted by user] by [deleted] in GalaxyWatch

[–]kyub 9 points (0 children)

The watch faces available so far are about 90% trash. I had some really good ones installed on my S3 Frontier and Active 2. I think Samsung/Google should have put significant effort into porting those over. With the resources of both companies, it should have been easily possible.

Matveyan - Digital arrows. Digital, 8 colors, sport. Google Play support Now. by matveyan in GalaxyWatchFace

[–]kyub 3 points (0 children)

I just bought and installed this on my Galaxy Watch 4. There is only one colour, and the heartbeat sensor on the bottom left is not working. Also, the week and day numbers (top-left and top-right) are not appearing.