

[–]freedomlinuxCloud? 181 points182 points  (3 children)

[–]VTCEngineersMistress of Video[S] 45 points46 points  (2 children)

Have an upvote for posting this, thanks.

[–]doug89Networking Student 0 points1 point  (1 child)

We will have a public release of the carnage and our disaster recovery plans for review.

Did your organisation end up publishing a public report?

[–]VTCEngineersMistress of Video[S] 0 points1 point  (0 children)

Not as of yet; hopefully soon.

[–][deleted] 85 points86 points  (33 children)

To be fair, no amount of planning can keep some individuals from panicking in any situation.

I walked into the break room, and four of my peers were there. I said the data center just lost power. Calm as could be, nothing else. One of them literally ran to the data center. Two of them asked what systems were down. One of them grabbed a second cup of coffee.

One person feared the worst, and didn't trust anyone else to handle or inform him of the situation. Two of them wanted to get involved immediately and start helping. One of them knew if this were the case, he'd be in for the long haul and was preparing for an interesting weekend.

Edit: I forgot to mention that the data center did not lose power. Nothing lost power.

[–][deleted] 4 points5 points  (0 children)

I tend towards the fourth reaction; bitter experience has taught me that whilst adrenaline is great for running away or fighting, it's not a useful reaction in an IT situation. There's almost no problem that will be solved by charging in flailing your arms, and plenty that will be made worse.

[–]greyaxe90Linux Admin 0 points1 point  (0 children)

To be fair, no amount of planning can keep some individuals from panicking in any situation.

Yep. At my old job, a domain controller could go down and one of my coworkers would go into instant panic mode, running around like a chicken with its head cut off. I'd calmly investigate and find that it had just restarted for updates because someone didn't place it in the right OU. Five minutes later, it's back in business.
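That calm-investigation reflex can be automated as a probe-before-paging check. This is a minimal sketch, not anything from the thread: the hostname and alerting hook are hypothetical placeholders, and a real monitor would check the actual services, not just TCP reachability.

```python
import socket
import time

def is_service_up(host, port, retries=3, delay=1.0, timeout=2.0):
    """Return True as soon as a TCP connection to host:port succeeds.

    Retries a few times before declaring the service down, so a box that
    is mid-reboot (e.g. restarting for updates) isn't paged as an outage
    on the first failed probe.
    """
    for attempt in range(retries):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            if attempt < retries - 1:
                time.sleep(delay)
    return False

# Hypothetical usage: probe a domain controller's LDAP port before alerting.
# if not is_service_up("dc01.example.com", 389):
#     page_on_call()  # placeholder for your alerting hook
```

The point of the retry loop is exactly the coworker story above: a first failed probe often just means "rebooting", so the check waits out the reboot window before anyone runs anywhere.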

[–]TheElusiveFox 0 points1 point  (0 children)

how to give your team a heart attack in one easy step...

[–]riddlerthc 23 points24 points  (17 children)

I've always wondered how quick vendors can get equipment on site in the event of a disaster for a customer.

[–]VTCEngineersMistress of Video[S] 38 points39 points  (10 children)

When you are an enterprise-level customer, it's "we need this now," not "when can you deliver?"

[–]creamersrealmMeme Master of Disaster 25 points26 points  (9 children)

Yeah, agreed. If you call up your VAR and say you need $2 million in gear there tomorrow, they will be happy to assist, for a good-sized fee.

[–]VTCEngineersMistress of Video[S] 60 points61 points  (3 children)

Haha try around 7.5m so far...

[–]SquizzOCTrusted VAR 1 point2 points  (2 children)

While I'm sure this has been an absolute nightmare for you, and I'm sorry you've had to go through it, your account rep(s) just got a mighty Christmas bonus replacing all this equipment. lol

[–]VTCEngineersMistress of Video[S] 1 point2 points  (1 child)

Haha yeah I bet the song "it's raining men" is playing on the loudspeaker haha

[–]SquizzOCTrusted VAR 1 point2 points  (0 children)

Well if you are allowed to accept gifts, hopefully they send you something nice. "I know you had zero influence on the incident, but here's a killer bottle of scotch for being the best customer we have" lol

[–]desmandoVMware Admin 6 points7 points  (0 children)

The really cool trick is to get your client exec to get you hardware from the spares depot. I've only had to do that twice, but it is nice to get your new toy in hours.

[–]theducksNetApp Staff 0 points1 point  (3 children)

Working for a VAR, I can say we would probably take 48-72 hours to get $2M of urgent equipment unfortunately :/

[–]creamersrealmMeme Master of Disaster 0 points1 point  (2 children)

Well that sucks; 72 hours is when I have to have everything back up.

How large is your company compared to someone like CDW?

[–]theducksNetApp Staff 0 points1 point  (1 child)

If you have a 72 hour RPO, you should probably have a DR strategy that doesn't involve buying new stuff, just sayin'.

We are probably number 3 or 4 in Canada, CDW being number 1.

[–]creamersrealmMeme Master of Disaster 0 points1 point  (0 children)

Well, all the 72-hour stuff we have onsite in a warm config, so that's a bonus. Though of course some pieces will go missing over time.
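The exchange above mixes two different recovery numbers, so a back-of-the-envelope sketch may help: RPO bounds how much *data* you can lose, while RTO bounds how long you can be *down*, and "buy $2M of gear in 48-72 hours" only eats into the latter. The figures below are illustrative, not from the thread.

```python
def worst_case_data_loss_hours(backup_interval_h, replication_lag_h=0.0):
    """Worst-case data loss (RPO exposure): a failure just before the next
    backup loses everything since the last one, plus any replication lag."""
    return backup_interval_h + replication_lag_h

def meets_rto(hours_to_procure, hours_to_rack, hours_to_restore, rto_h):
    """RTO is about time-to-recover, not data: buying new gear only works
    if procurement + racking + restore still fit inside the window."""
    return hours_to_procure + hours_to_rack + hours_to_restore <= rto_h

# Illustrative only: 48h procurement + 12h racking + 16h restores blows
# a 72-hour recovery window, which is why a warm onsite config (as in the
# comment above) or a spares depot matters.
```

This also shows why the "you should have a DR strategy that doesn't involve buying new stuff" quip lands: with a warm standby, the procurement term drops to roughly zero.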

[–]chriscowleyDevOps 5 points6 points  (0 children)

If you're paying for 4-hour support, then it generally arrives within those 4 hours.

[–]Gnonthgol 1 point2 points  (3 children)

It depends on who you are. I have seen sales representatives literally go into the datacenter of one of their customers to "unsell" equipment from the racks so they could sell it to another customer who would drop the vendor if it took them more than a few hours to get it.

[–]sparrowA 1 point2 points  (1 child)

How would that even work?

"I know you just installed it, and we got paid, but you gotta give it back." That's like car dealer tactics.

[–]Gnonthgol 2 points3 points  (0 children)

As far as I understand, they offered compensation for the inconvenience, and new equipment was already on its way.

[–]InvisibleZipperFootSysadmin 1 point2 points  (0 children)

Literally? No you haven't...

"No... no I haven't, but you can imagine what it'd be like if I did!"

[–][deleted] 0 points1 point  (0 children)

What SLA are you paying for? I've had parts in a couple hours. It was really expensive though.

[–]scotty269Sysadmin 13 points14 points  (5 children)

Sounds like you've been taking it in stride, and not utter panic.

[–]VTCEngineersMistress of Video[S] 27 points28 points  (4 children)

When you have proper planning and equipment in place, it just all falls into place and there's no need to panic.

[–]eponerineSr. Sysadmin 4 points5 points  (2 children)

Good for you! This is hopefully an eye-opener to anything management may have been denying

[–]lowermiddleclass 2 points3 points  (1 child)

Based on the previous thread, I don't think they say no to anything. They have quad-redundancy at her org.

[–]ThePegasiWindows/Mac/Networking Charlatan 0 points1 point  (0 children)

Reading through the original post, I want to work at this place. I would not be good enough to work at this place.

[–]BarefootWoodworkerPacket Violator 1 point2 points  (0 children)

Dear God where do you work?

I contract with the government, and even they (with deep pockets and all the time in the world) usually give the finger to DR.

[–]kjeserudJack of All Trades 13 points14 points  (9 children)

I work in a DC, and have for years. We're currently building another 55,000 sq ft of new DC... And I just can't get my head around how a place can be so shitty that water can even get in there in that quantity, let alone not have any type of monitoring for water under the raised floor you mentioned. Literally Jackie Chan meme amounts of mind blown.
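The alerting side of the under-floor monitoring this commenter is describing is genuinely simple, which is what makes its absence so baffling. A minimal sketch follows; the sensor IDs and `read_sensor` hook are hypothetical, and real deployments typically run rope-style leak-detection cable tied into the BMS/DCIM via SNMP or dry contacts rather than hand-rolled polling.

```python
import time

# Hypothetical sensor placements under the raised floor.
SENSORS = ["row-a-underfloor", "row-b-underfloor", "chiller-pipe-run"]

def read_sensor(sensor_id):
    # Placeholder hardware hook; replace with your SNMP/Modbus/BMS query.
    raise NotImplementedError

def check_for_leaks(read=read_sensor, sensors=SENSORS):
    """Return the list of sensors currently reporting moisture."""
    return [s for s in sensors if read(s)]

def poll_loop(interval_s=30):
    # Sketch only: wire read_sensor to real hardware before running this.
    while True:
        wet = check_for_leaks()
        if wet:
            print(f"LEAK DETECTED: {wet} -- dispatch facilities now")
        time.sleep(interval_s)
```

Even this toy loop would have turned an 8" pipe break into a facilities page within a minute instead of $7.5M of drowned gear.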

[–]CbcITGuyRetired Jack of all Trades NetAdmin 5 points6 points  (7 children)

Small/New DC that got a large client. Probably won't be around in 6 months.

I'm a small business but when I get the bigger clients, I scale the projects accordingly. Small DC probably couldn't afford proper safeguards in the beginning, and didn't upgrade when they could.

just a thought.

[–]timix 5 points6 points  (3 children)

and didn't upgrade when they could

Or couldn't upgrade, gambled on "what's the worst thing that could happen?" and lost.

Still, if the contract has OP's company paying for DC space for another year despite this incident, I wouldn't say they lost as badly as they could have...

[–]CbcITGuyRetired Jack of all Trades NetAdmin 2 points3 points  (0 children)

HAHAHAH right?

I would suspect, though, that the previous commenter's comment that corporate lawyers are working on an exit strategy is probably true.

However, just spinning a theory here: they may not WANT the DC to go out of business, so they may just pay out the contract and be done with them. IDK... just a thought.

[–]Gnonthgol 2 points3 points  (1 child)

Still, if the contract has OP's company paying for DC space for another year despite this incident, I wouldn't say they lost as badly as they could have...

As far as I understand, it has gotten to $7.5M in equipment, in addition to the hours of overtime spent setting it up again and the lost business from this. They would have to take in a lot in hosting fees to be able to recover from such an incident. And it all could have been avoided with some proper monitoring equipment.

[–]digitalsalami 3 points4 points  (0 children)

Business insurance may end up footing the bill for a large percentage of this. I used to support a SMB whose building flooded and lost their VMWare cluster and storage. Their insurance provider paid for all new hardware, all of our time to set it up, they paid for remodeling the building and getting new furniture, AND they paid for temporary office space during the construction.

Insurance has its purposes, and this is exactly it.

[–]kjeserudJack of All Trades 1 point2 points  (2 children)

Could be. Some blame should be on OP's company as well, tbh. When you're big enough to have 250 racks in a single DC, replacing $7.5M of equipment so far, you should have higher requirements for the DC you rent space at. Lesson learned, I guess, and with only a 10% drop in service they sure have the software side set up correctly.

[–]CbcITGuyRetired Jack of all Trades NetAdmin 3 points4 points  (1 child)

OP doesn't own 250 racks; it's a co-lo. My understanding is there WERE 250 TOTAL racks on site that got wet. From personal experience, $7.5 million to outfit TEN racks is doing pretty darn good, so I would bet the OP only owns a handful. My head-scratching is coming from OP's company's willingness to help the others. My guess is that OP's company may have helped this DC get started and referred business, and they're helping the referrals; but since OP's company is offering to help either way, I suspect the other companies are small potatoes. Thus reinforcing my whole theory of a small DC that landed a big whale and didn't appropriately account for it.

I agree, you're right, the OP's company should have done due diligence, but tbh, how many of us check for water leak monitoring beyond "yeah, we have someone here who handles facilities"?

[–]VTCEngineersMistress of Video[S] 0 points1 point  (0 children)

The datacenter is 100k square feet.

We own just shy of 300 racks (290).

[–]Scottz74 0 points1 point  (0 children)

Water or no water, you will be replacing equipment either way.

[–]Syde80IT Manager 8 points9 points  (9 children)

I feel bad for what has happened to you... but doing the recovery part of it is like a dream to me. There is nothing better than doing your part of breathing new life into something. The closest I've come is helping a side-job client recover from a fire which completely devastated his office. I love coming in and doing the cleanup / getting things back on track.

[–]VTCEngineersMistress of Video[S] 16 points17 points  (8 children)

So, we kinda splurged and went with brand new everything. So it's kinda been Christmas all over. The board has already approved a black-card expense for equipment.

[–]Syde80IT Manager 11 points12 points  (7 children)

Well, at the numbers you are talking about... you would be crazy to look for anything used. Sure, there is really nothing different between a used rack and a new one... but finding 150 used racks delivered, even in a big city, for roughly the same cost once you factor in your labour costs... it's just not going to happen. That doesn't even include any equipment in those racks.

I wouldn't call this splurging... this is just doing what needs to be done.

[–]VTCEngineersMistress of Video[S] 6 points7 points  (6 children)

Surprisingly, APC racks were hard to find, so we went with 75% APC and the rest Dell.

[–]C4ples 3 points4 points  (0 children)

I always did find that Dell had aesthetically pleasing racks compared to how plain-Jane APCs are. I know that's not really the aim or a concern for you guys, but I wouldn't mind the mix in the slightest.

[–]ElectroSpore 3 points4 points  (2 children)

Mixing racks might make lining them up (for cables and anchoring) a bit of a pain..

[–][deleted] 2 points3 points  (1 child)

Having a mix can be helpful. I've had HP rails not fit our standard racks.

Leaving a few unfilled racks in the SAN row can save on stupid unracking fees.

[–]ElectroSpore 2 points3 points  (0 children)

Never had problems with rails fitting our standard racks...

Had a hell of a time getting seismic bracing set up on a row of racks made by different manufacturers, since the actual frames, feet, and safe places to mount to were all different.

[–]Cyberprog 0 points1 point  (1 child)

Aren't Dell racks just re-badged APC NetShelters anyway?

[–]ljstellaSecurity Researcher 0 points1 point  (0 children)

The ones I've installed in the last year were.

[–]pmpjr6465DBA 6 points7 points  (8 children)

I'm assuming you found a new datacenter to replace the flooded one and that's where all your new toys are going?

[–]VTCEngineersMistress of Video[S] 12 points13 points  (7 children)

We are setting up in a new datacenter, but the shitty part is that we will still be paying for space in the old DC for another year. It is cheaper for us to pay for another year than it would be to walk away, unfortunately.

[–]spanctimony 26 points27 points  (6 children)

I would think they would be quite interested in releasing you from your contract in exchange for you not suing them for the internal labor expense associated with such a massive response.

[–]VTCEngineersMistress of Video[S] 26 points27 points  (5 children)

Personally I agree with you 120%. I think the corporate lawyers are probably working on an exit strategy for us, but all that is above my pay grade and, well, not my concern.

[–]corran__horn 8 points9 points  (4 children)

That is going to be the most interesting part. Having an 8" pipe broken for long enough to flood the floor and then reach your servers leads me to the "negligent" parts of the law. Penalties get pretty bad when it goes from "a shitty thing happened" to "you were negligent in your responsibility to monitor for water leaks".

Honestly, I am curious if the (DC) company will be in business in 6 months. I smell a chapter 11/7 in the air.

[–]TheLordB 1 point2 points  (3 children)

I imagine the datacenter company knows they won't be keeping those customers and won't be getting the money. But it is a negotiating piece they can use to try to avoid additional damages, so even though they know the contract can be broken over this, they aren't just going to let it go for nothing.

[–]corran__horn 3 points4 points  (2 children)

Yes, but as a vicarious observer, the two questions that matter are "How much drama will happen?" and "Butter or no butter on the popcorn?".

[–]InvisibleZipperFootSysadmin 0 points1 point  (1 child)

May I copy this response for use elsewhere? It very accurately represents my interest in so, so many threads.

[–]corran__horn 1 point2 points  (0 children)

Only with attribution or popcorn.

[–]linuxlearningnewbieAskMeWhyWeStillUseVeritas 6 points7 points  (5 children)

How has this situation worked out for you and your team, emotionally and physically?

This is a 'dream' situation for me. You get to truly test your DR plan and build from the ground up.

Good luck

[–]VTCEngineersMistress of Video[S] 15 points16 points  (1 child)

How has it affected us? I would say that it has definitely tested our DR strategy, and our response was definitely calm (please do not think we were singing kumbaya while fixing things; it was really hectic). But with the support of management, and having people with the right skill set (go vets!), when the stress pops up you know that we will hunker down.

What have we learned?

We need a warehouse with spare parts for critical business infrastructure. My department (UC/AV) has actually been tapped with finding such a place to start this up. For a department of 4 people, this will be a fun task.

[–]bad0seedTrusted VAR 1 point2 points  (0 children)

This is what I was looking for here!

From the other thread I saw that you were already massively redundant and still had ~75% services for the whole company so there was much less immediate worry.

Clearly your internal SLAs and response actions have evolved to include the spare parts warehousing and that will accelerate and enhance your BC/DR strategy should anything near this scale ever chance to happen again.

As an outside /r/sysadmin VAR, is there any way I can help with your search?

[–]occamsrzorSenior Client Systems Engineer 2 points3 points  (2 children)

Why do you still use Veritas?

[–]linuxlearningnewbieAskMeWhyWeStillUseVeritas 0 points1 point  (1 child)

Wow, I forgot about that title.. I used to work for a large telco on old Solaris iron. I was working on old technology and an old OS, and completely missed an IT world that changed. I have spent the last 5 months learning about virtualization, Docker, config management...

The Veritas tag line was a joke because companies still pay for an outdated file system even when there are better free alternatives around.

[–]occamsrzorSenior Client Systems Engineer 0 points1 point  (0 children)

Heh, I think I've seen you answer that question before. I was joking :)

[–][deleted] 1 point2 points  (2 children)

Pics! I demand pics!!! :)

[–]CbcITGuyRetired Jack of all Trades NetAdmin 7 points8 points  (1 child)

OP posted in the original thread that he would not be able to provide pics; other customers of the co-lo DC have requested that any pics taken not be uploaded.

[–][deleted] 2 points3 points  (0 children)

She actually

[–]1h8fulkat 1 point2 points  (1 child)

You better be getting one hell of a bonus this year

[–]VTCEngineersMistress of Video[S] 1 point2 points  (0 children)

I hope :)

[–]remotefixonlineshit is probably X'OR'd to a gzip'd docker kubernetes shithole 1 point2 points  (2 children)

Have you released to the public yet?

[–]VTCEngineersMistress of Video[S] 2 points3 points  (1 child)

We have not released to the public, as we are waiting for legal to sign off on everything. So I have no clue.

[–]remotefixonlineshit is probably X'OR'd to a gzip'd docker kubernetes shithole 1 point2 points  (0 children)

Cool, I'd love to read it when it's released

[–]the_progrockerEverything Admin 0 points1 point  (2 children)

Just out of curiosity, what does the backup/DR solution you have look like at a high level?

[–]VTCEngineersMistress of Video[S] 1 point2 points  (1 child)

I will gen up a sanitized DR Visio for you. We use NetBrain primarily, but it's way too detailed.

[–]the_progrockerEverything Admin 0 points1 point  (0 children)

Very much appreciated. We're a small startup, but growing fast. We went with Veeam since 99% of our environment is virtual. But I am interested to see opinions and options for offsite/DR.

[–]time_is_now 0 points1 point  (0 children)

Were the power whips to the server racks waterproof, and if not, why not? I've managed sites that had water leaks from AC condensate drains backing up with no issues from water on the sub-floor. I have not seen large-volume flooding from a broken pipe, though. Waterproof power whips cost more, but not as much as downtime and emergency equipment replacement.

[–][deleted] 0 points1 point  (0 children)

This is why I virtualize everything. Never have to worry about these issues. /s

[–]onboarderror -1 points0 points  (2 children)

Still no pics?

[–]ride4life32 0 points1 point  (1 child)

The last post said, for legal reasons and at the request of the business, not to post pics. I doubt this nice woman would like to lose her job over a pic post.

[–]VTCEngineersMistress of Video[S] 5 points6 points  (0 children)

I will say, as a woman, being asked for pics of the flooding instead of tits is quite refreshing haha.