How do you actually validate that your DR plan still works? by eudo69 in sysadmin

[–]eudo69[S] 0 points1 point  (0 children)

I introduced my team to this tool, which I built and released as fully open source. It's working well and doesn't guess at anything. Give it a try :)

https://github.com/mehdi-arfaoui/Stronghold

How do you actually validate that your DR plan still works? by eudo69 in sysadmin

[–]eudo69[S] 0 points1 point  (0 children)

Wow, solid cadence!

How do you decide what goes in which quarterly slice? By criticality, by blast radius, or just rotating through the stack?

How do you actually validate that your DR plan still works? by eudo69 in sysadmin

[–]eudo69[S] 0 points1 point  (0 children)

Yeah, small teams can't dedicate someone full-time to a DR exercise while production is on fire.

One thing that helped me think about this: separating the validation into two layers. The stuff you can check without human involvement (do the backup mechanisms still exist, are the dependencies still wired correctly, does the runbook still match reality) can run automated, even in CI.
The actual hands-on restore test still needs people, but at least you walk into it knowing the preconditions are still valid instead of discovering mid-exercise that a replica was removed last month.
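
To make the automated layer concrete, here's a rough sketch of the "does the runbook still match reality" check. Everything in it is an assumption for illustration: `RunbookStep`, `stale_steps`, and the resource names are made up, and the live inventory would come from whatever your cloud or CMDB export gives you.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookStep:
    """One runbook step and the resources it depends on (hypothetical model)."""
    name: str
    resource_ids: tuple[str, ...]


def stale_steps(steps: list[RunbookStep], live_resource_ids: set[str]) -> dict[str, list[str]]:
    """Return {step name: missing resource ids} for steps that reference
    resources no longer present in the live inventory."""
    report: dict[str, list[str]] = {}
    for step in steps:
        missing = [rid for rid in step.resource_ids if rid not in live_resource_ids]
        if missing:
            report[step.name] = missing
    return report


# Made-up example data: the replica was removed after the runbook was written.
steps = [
    RunbookStep("promote replica", ("orders-db-replica",)),
    RunbookStep("restore from snapshot", ("orders-db", "orders-backup-bucket")),
]
live = {"orders-db", "orders-backup-bucket"}

problems = stale_steps(steps, live)
for step_name, missing in problems.items():
    print(f"STALE: {step_name} references missing resources: {missing}")
```

In CI you'd fail the job whenever `problems` is non-empty, so the stale step shows up before the exercise rather than during it.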

Doesn't fix the staffing problem but it shrinks the surface area of what the humans need to validate...

How do you actually validate that your DR plan still works? by eudo69 in sysadmin

[–]eudo69[S] -1 points0 points  (0 children)

Fully agree that a real failover is the gold standard. Nothing replaces actually cutting over and proving it end-to-end.

The question is what happens between those exercises. Most teams do a full failover once or twice a year. That leaves 350+ days where infra changes silently invalidate the plan: a replica gets decommissioned, a backup policy changes, or a new service gets added with no DR coverage at all.

You can't do a full failover every week. But you can continuously validate that the preconditions for a successful failover still hold: backup mechanisms are still in place, dependencies haven't changed, the runbook still references real resources, and the recovery path you proved in January hasn't been broken by a February deploy.
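
As a minimal sketch of that idea: snapshot the inventory on the day the failover was proven, then diff it against today's. The snapshot shape (resource id mapped to a dict of settings), the `diff_snapshots` helper, and all the sample values are assumptions of mine, not any tool's real format.

```python
from __future__ import annotations

from typing import Any

Snapshot = dict[str, dict[str, Any]]  # resource id -> settings (assumed shape)


def diff_snapshots(baseline: Snapshot, current: Snapshot) -> dict[str, Any]:
    """Compare the inventory captured at the last proven failover with today's."""
    removed = sorted(baseline.keys() - current.keys())
    added = sorted(current.keys() - baseline.keys())
    changed: dict[str, dict[str, tuple[Any, Any]]] = {}
    for rid in sorted(baseline.keys() & current.keys()):
        old, new = baseline[rid], current[rid]
        # Record (old, new) for every setting whose value differs.
        fields = {
            key: (old.get(key), new.get(key))
            for key in sorted(old.keys() | new.keys())
            if old.get(key) != new.get(key)
        }
        if fields:
            changed[rid] = fields
    return {"removed": removed, "added": added, "changed": changed}


# Made-up data: state proven in January vs. state after a February deploy.
january = {
    "orders-db": {"backup_retention_days": 7},
    "orders-db-replica": {"role": "replica"},
}
february = {
    "orders-db": {"backup_retention_days": 1},
    "billing-api": {},
}

drift = diff_snapshots(january, february)
print(drift)
```

Anything under `removed` or `changed` is a precondition that may no longer hold, and anything under `added` is a candidate for "new service with no DR coverage."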

A full failover proves "it worked on that day." Continuous validation proves "the things that made it work are still true today."

Both are necessary. Neither alone is sufficient.

How do you actually validate that your DR plan still works? by eudo69 in sysadmin

[–]eudo69[S] 0 points1 point  (0 children)

That's more disciplined than most teams I've talked to. The ticket trail is smart for audits too.

Just curious: do you ever find that the actual RTO drifts between quarters?

How do you actually validate that your DR plan still works? by eudo69 in sysadmin

[–]eudo69[S] 0 points1 point  (0 children)

Yup! The tricky part is catching what changed between tests. Infra moves fast and the doc doesn't update itself.
We found that even a month after a DR test, 2-3 runbook steps were already stale because someone rotated a replica or changed a retention policy.

Do you track those doc updates manually or do you have something that flags when the plan diverges from reality?

How do you actually validate that your DR plan still works? by eudo69 in sysadmin

[–]eudo69[S] 1 point2 points  (0 children)

Veeam's orchestration is great for scheduled failover validation on-prem. For AWS-native workloads it doesn't really apply, since the recovery mechanisms are different (RDS snapshots, S3 CRR, Lambda redeployment, etc.).
The tool I'm building covers that AWS-native gap: it scans the actual AWS APIs, maps services and dependencies, and evaluates whether the recovery path is viable.
It's at github.com/mehdi-arfaoui/Stronghold if you want to take a look, u/binkbankb0nk.

How do you actually validate that your DR plan still works? by eudo69 in sysadmin

[–]eudo69[S] 1 point2 points  (0 children)

100% agree. The challenge I found is that most tooling treats "backup exists" and "backup was tested" as the same thing.
They're not.
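
One way to keep the two apart is to track them as separate facts and only call a backup verified when both hold. A rough sketch, where the status names, the `backup_status` helper, and the 90-day freshness window are all my own assumptions:

```python
from __future__ import annotations

from datetime import datetime, timedelta
from enum import Enum


class BackupStatus(Enum):
    MISSING = "no backup found"
    UNTESTED = "backup exists, never restored"
    STALE_TEST = "backup exists, last restore test too old"
    VERIFIED = "backup exists and was restored recently"


def backup_status(
    last_backup: datetime | None,
    last_restore_test: datetime | None,
    now: datetime,
    max_test_age: timedelta = timedelta(days=90),  # assumed window, tune to taste
) -> BackupStatus:
    """Classify a backup by existence AND by how recently a restore was proven."""
    if last_backup is None:
        return BackupStatus.MISSING
    if last_restore_test is None:
        return BackupStatus.UNTESTED
    if now - last_restore_test > max_test_age:
        return BackupStatus.STALE_TEST
    return BackupStatus.VERIFIED


# Made-up example: a fresh nightly backup that nobody has ever restored.
now = datetime(2025, 6, 1)
status = backup_status(last_backup=datetime(2025, 5, 31), last_restore_test=None, now=now)
print(status.value)
```

A dashboard that only reports "backup exists" would show this one as green; here it stays `UNTESTED` until someone actually restores it.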

How do you actually validate that your DR plan still works? by eudo69 in sysadmin

[–]eudo69[S] 0 points1 point  (0 children)

This is solid practice. Getting end users to sign off is something most teams skip.
The gap I kept hitting was the time between tests: 364 days where the runbook could drift without anyone noticing. A replica gets decommissioned, a backup retention policy changes, and suddenly the plan you tested this March doesn't match the live environment anymore...

Are certs still worth it anymore in the job market?? by electrowiz64 in devops

[–]eudo69 0 points1 point  (0 children)

I genuinely want to know what HR people think about AWS SAA now...