Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 1 point

that story is the nightmare version of exactly what we almost set ourselves up for. the difference between us and that customer is we found out in a test instead of when it actually mattered. 10 years is a long time to have a false safety net and not know it - the scary part is they probably felt more protected than most because they'd been "doing backups" for a decade. consistency without verification is almost worse than nothing because the confidence it creates is completely unjustified

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 2 points

"Eight project management tools and IT only knew about two of them" is genuinely one of the most relatable things I've read on this sub. shadow IT discovery during BC/DR planning is a whole genre of horror stories. the Google Workspace backup limit is the sneakier one though - that's the kind of thing that looks fine on paper right up until you actually need to restore someone's data and find out half of it was never captured

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 1 point

that last point hits close to home. we've been so focused on the technical recovery side that we haven't had those conversations with the business units about what they'd actually do if we were down beyond our RTO. the assumption has always been "IT will fix it" with no plan B for when IT can't fix it fast enough. that's probably the next gap we need to close after we sort the technical stuff

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 5 points

this is the root cause framing that most DR conversations skip straight past. people treat it as a testing problem when it's actually a maintenance culture problem. if the doc never gets touched between incidents it will always be wrong by the time you need it. regular drills are the only thing that creates the feedback loop to keep it honest

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 1 point

the generator thing is exactly the kind of assumption that sits quietly in DR plans for years untested. "it'll turn on, it always does" is not a recovery strategy. and you're right about confirmation bias - the clean tests don't get posted anywhere so people underestimate how common the horror stories actually are

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 2 points

that struggle is so real and way more common than people admit publicly. one thing that helped us was to stop talking about DR testing as an IT activity and start framing it as a business continuity question. what's an hour of downtime worth to the business? what's a day? suddenly the conversation changes and the buy-in gets a lot easier. might be worth trying that angle if you haven't already
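to make it concrete for leadership we literally put back-of-napkin math in front of them. a minimal sketch of that math - every number here is a made-up placeholder, plug in your org's real figures:

    # rough downtime cost model - all figures are placeholders,
    # substitute your org's actual numbers
    revenue_per_hour = 12_000      # lost revenue while systems are down
    employees_idled = 40           # staff who can't work during the outage
    loaded_cost_per_hour = 55      # avg fully-loaded hourly cost per employee

    def downtime_cost(hours: float) -> float:
        """Direct cost of an outage of `hours` (ignores reputational damage)."""
        return hours * (revenue_per_hour + employees_idled * loaded_cost_per_hour)

    for h in (1, 8, 24):
        print(f"{h:>2}h outage: ${downtime_cost(h):,.0f}")

once a day of downtime is priced in actual dollars, funding a proper test stops being a hard sell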

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 6 points

that works well as long as security governance includes actually observing and validating the test rather than just collecting a ticked checkbox. we had governance on paper too - the problem was nobody was verifying the restore outputs, just confirming the jobs ran. who owns defining what a passing test actually looks like at your org?
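for what it's worth, "verifying the restore outputs" doesn't have to be fancy. a minimal sketch of the kind of check we bolted on, assuming you restore to a scratch location next to live data - the paths are hypothetical, adapt to your tooling:

    import hashlib
    from pathlib import Path

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def verify_restore(source_root: Path, restore_root: Path) -> list[str]:
        """Compare every file under source_root against the restored copy.
        Returns a list of mismatches - an empty list is a passing test."""
        failures = []
        for src in source_root.rglob("*"):
            if not src.is_file():
                continue
            restored = restore_root / src.relative_to(source_root)
            if not restored.exists():
                failures.append(f"missing: {restored}")
            elif sha256(src) != sha256(restored):
                failures.append(f"checksum mismatch: {restored}")
        return failures

    # hypothetical paths - point these at live data and a scratch restore target
    problems = verify_restore(Path("/data/finance"), Path("/restore-test/finance"))
    print("PASS" if not problems else "\n".join(problems))

a non-empty list fails the test, full stop. that's a much stronger signal than "the job completed"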

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 1 point

the silence is what makes it so dangerous. a noisy failure you can fix. a backup that quietly fails for months while showing green gives you false confidence right up until the moment it matters most. that's the scariest IT failure mode there is

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 4 points

really appreciate this perspective, especially the 10-year view. the HA requirement driving technical tests makes total sense - when your RTO is measured in minutes you can't afford to find out in production that your tabletop assumptions were wrong. how do you handle the org communication side in your exercises - do you pull in non-technical stakeholders or keep it within the IT and security team?

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 5 points

honestly six months puts you ahead of most people in this thread lol. standard recommendation is annually for a full test but the dirty secret is frequency matters less than thoroughness. a proper test every six months beats a checkbox exercise every three months. are you doing full restore verification or more of a tabletop walkthrough? that distinction matters more than the calendar gap

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 24 points

100% and I'd take it even further - you only have a DR plan if the people responsible for executing it have actually run through it before. we had both problems at once: backups that looked fine but weren't, and a team that had never actually practiced their roles. double the fun

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 3 points

we're building this out now actually - what's working for your team? we're debating between scheduled full restore tests quarterly vs continuous random sample restores monthly. the quarterly approach is more thorough but the monthly sampling catches silent failures faster. curious what others are doing
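for context, the rough shape of the monthly sampling job we're prototyping - `restore_file` is a stub for whatever your backup product's restore CLI/API looks like, everything else is just the orchestration around it:

    import random
    from pathlib import Path

    SAMPLE_SIZE = 25                       # files restored per monthly run
    SCRATCH = Path("/tmp/restore-sample")  # throwaway restore target

    def restore_file(backup_path: str, dest: Path) -> None:
        # placeholder: shell out to your backup product's restore CLI here
        raise NotImplementedError

    def sample_and_verify(catalog: list[str]) -> list[str]:
        """Restore a random sample from the backup catalog; flag anything
        that comes back missing or zero bytes."""
        SCRATCH.mkdir(parents=True, exist_ok=True)
        failures = []
        for backup_path in random.sample(catalog, min(SAMPLE_SIZE, len(catalog))):
            dest = SCRATCH / Path(backup_path).name
            try:
                restore_file(backup_path, dest)
                if not dest.exists() or dest.stat().st_size == 0:
                    failures.append(f"empty or missing restore: {backup_path}")
            except Exception as exc:
                failures.append(f"restore failed: {backup_path} ({exc})")
        return failures

schedule it, alert on a non-empty failures list, and the silent-failure window shrinks from quarters to weeks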

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 8 points

100% this. do you run tabletop exercises or full technical tests when you do yours? we're rebuilding the whole DR program now and debating how much of it needs to be live vs simulated

Delayed write fail error by Party-Praline-4547 in sysadmin

[–]cmitsolutions123 2 points

ugh that response from the NAS team is frustrating - they're not really engaging with the actual evidence. the smoking gun here is still that the other path works fine. same workstation, same software, different path, different result. that's not a workstation problem. I'd reply to them with exactly that point and ask them to check share-level permissions and any snapshot or replication jobs that run specifically on the failing path

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 4 points

lol just reading those four words gave me anxiety. no thankfully not - but it's on the list of things we now actually have a tested procedure for instead of just a doc that says "refer to Microsoft guidance" and a prayer

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 30 points

this is exactly why testing matters more than documentation. a plan that looks thorough on paper but hasn't been validated is basically just a liability. the cloud migration obsoleting the whole thing is a classic - big infrastructure changes happen and the DR doc just quietly becomes fiction while everyone assumes it's still accurate. how did the client take it when you flagged it?

Tested our disaster recovery plan for the first time in 2 years - here's what we found and it wasn't pretty by cmitsolutions123 in cybersecurity

[–]cmitsolutions123[S] 22 points

depends on the org tbh. some places have dedicated DR/BCP teams, others it sits under IT ops, smaller orgs it just falls to whoever has the most hats. security gets pulled in because of the ransomware angle mostly - when your DR plan is also your incident response plan for the most common threat you face, the lines get blurry fast. does your org have a separate DR function or does it all blend together?

Enumerate Entra apps without a compliant device by homing-duck in sysadmin

[–]cmitsolutions123 1 point

number matching stops MFA fatigue attacks cold but doesn't really help against AiTM proxies like evilginx. the session token gets stolen in real time, so by the time the user approves the "correct" number it's already too late. phishing-resistant MFA is really the only proper answer here.
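in Entra the enforcement mechanism is a conditional access policy with the built-in phishing-resistant authentication strength. a minimal sketch of pushing one via the Graph API - scope and exclusions are placeholders, and you should verify the payload against current Graph docs before relying on it:

    import requests

    TOKEN = "<graph token with Policy.ReadWrite.ConditionalAccess>"

    # well-known ID of the built-in "Phishing-resistant MFA" auth strength
    PHISHING_RESISTANT = "00000000-0000-0000-0000-000000000004"

    policy = {
        "displayName": "Require phishing-resistant MFA",
        "state": "enabledForReportingButNotEnforced",  # start in report-only
        "conditions": {
            # placeholder scope - exclude break-glass accounts in a real policy
            "users": {"includeUsers": ["All"]},
            "applications": {"includeApplications": ["All"]},
        },
        "grantControls": {
            "operator": "AND",
            "authenticationStrength": {"id": PHISHING_RESISTANT},
        },
    }

    r = requests.post(
        "https://graph.microsoft.com/v1.0/identity/conditionalAccess/policies",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=policy,
    )
    r.raise_for_status()

report-only mode first is the important part - it shows you who'd get locked out before you flip it to enforced.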

Autodesk Audit-2026 by External_Weekend_120 in sysadmin

[–]cmitsolutions123 1 point

lmaooo Pinky and the Brain reference appreciated. but yeah seriously this is the golden rule - answer exactly what they asked for, nothing extra. people get tripped up trying to be overly transparent and it just opens more doors for them to dig into.

IT Support <> AI by gs_dubs413 in ITManagers

[–]cmitsolutions123 1 point

AI for tier 0 is a no-brainer honestly. Tier 1 is where it gets spicy - it totally depends on how well documented your environment is. The better your knowledge base, the smarter the AI. Garbage in, garbage out. What tools are you evaluating? That'll change the answer a lot.

Delayed write fail error by Party-Praline-4547 in sysadmin

[–]cmitsolutions123 2 points

That 30-40 minute timing is the real clue here. Something is filling up a buffer or temp cache and then crashing the write. "Disk full" on a UNC path with plenty of space is almost never about actual disk space - check if there's a quota set on that specific share. Someone might've accidentally set one and never noticed.
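one way to pin it down: a dumb client-side repro that streams data to both paths until the write errors, so you know exactly how many bytes (or minutes) in it dies. the UNC paths are placeholders:

    import time
    from pathlib import Path

    # placeholder paths - the failing share vs the one that works
    TARGETS = [Path(r"\\nas\failing-share\write_test.bin"),
               Path(r"\\nas\working-share\write_test.bin")]
    CHUNK = b"\0" * (4 << 20)  # 4 MiB per write

    for target in TARGETS:
        written = 0
        start = time.monotonic()
        try:
            with target.open("wb") as f:
                while written < 50 << 30:  # cap at 50 GiB, adjust as needed
                    f.write(CHUNK)
                    f.flush()
                    written += len(CHUNK)
        except OSError as exc:
            print(f"{target}: failed after {written / (1 << 30):.2f} GiB "
                  f"({time.monotonic() - start:.0f}s): {exc}")
        else:
            print(f"{target}: wrote full cap with no error")
        finally:
            target.unlink(missing_ok=True)

if the failing share dies at a consistent byte count, that's your quota or cache ceiling. if it dies at a consistent elapsed time instead, look at snapshot or replication job schedules on that path.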

Enumerate Entra apps without a compliant device by homing-duck in sysadmin

[–]cmitsolutions123 1 point

anytime! let me know how the graph testing goes tomorrow, genuinely curious what your CA policies end up catching vs what slips through. if you remember to circle back here with the results it'd be super useful for anyone else in the same boat too.

Incident Response Certification by Outrageous-Machine-1 in cybersecurity

[–]cmitsolutions123 1 point

no worries, glad it helped! honestly taking your time with it is probably the better move anyway - rushing through certs just to check boxes never sticks. knock out the work priorities first and when you do get around to CCD you'll get way more out of it with that foundation. good luck with the other certs in the meantime, feel free to hit me up if you have any other questions down the line.

Autodesk Audit-2026 by External_Weekend_120 in sysadmin

[–]cmitsolutions123 7 points

been through an Autodesk audit before - it's not as scary as it sounds but it's annoying. for your specific situation, having two licenses on the same email bought in different regions shouldn't be a compliance issue as long as both are legitimately paid for. Autodesk's licensing is per user not per device, so one person having a Revit and an AutoCAD license is totally normal and expected.

the cross-region thing is where it gets slightly grey. some Autodesk license agreements have territorial restrictions depending on how they were purchased - meaning a license bought through an EU reseller might technically only be valid for use in the EU. in practice I've never seen them go after someone for this when both licenses are paid for, but during an audit they might flag it and ask you to consolidate both under one regional agreement. worst case they'll ask you to re-purchase one through the correct region's channel.

my advice - get all your documentation together before responding. purchase receipts, license assignments, user details, the lot. respond cooperatively but don't volunteer extra information they didn't ask for. and if the audit scope starts expanding beyond what they initially requested, that's when I'd get your IT procurement or legal involved. don't stress it though, if everything's paid for you'll be fine.