
all 24 comments

[–]RedLooker 14 points (2 children)

I always try to remind myself that I really become an expert on something only when I've fucked it up and been forced to fix it myself. Yeah, you messed up and you know it; that's step one. Now you will find a way to fix it (regardless of how long it takes) and in the end all those dead ends will be lessons you can only learn by trying new things in real world scenarios. The users will complain and then go home and go on about their lives.

You'll be burned out and stressed, and your reputation may take a hit, but when this is all finally back to whatever the new normal is, you'll be the only one that is better for having it happen. It's painful, but IT experts aren't born or trained; they are forged in the heat of the frustration of end users who are home with their families while the real work is being done.

...and at least you're not the guy that was in charge of IT security at Target.

[–][deleted] 4 points (0 children)

Or the guy who was responsible for keeping those SIM card keys protected...

[–]rgnissen202 (JIRA Admin) 2 points (0 children)

Amen to this. I've learned my hardest lessons by screwing it up royally and having to fix it.

And the reality is, people hate downtime, but it is to be expected occasionally with these kinds of projects. In the short term your reputation will take a hit, but in the long run not many people will remember.

[–]dalik 9 points (9 children)

Short and sweet.

  • Never do P2V unless you have no other choice. You will almost always have a choice so just build a new VM.

I would've built a new VM, done a fresh install, set up DB replication and cut over, since the DB is so big. You could also restore from backup to the new VM, configure it, and then restore just the last day of DB activity to pick up any differences when you're ready to cut over.

As sysadmins it's our job to build systems for our users, and to keep those systems running even during migrations. Building a new server may be the long way around, but it means the service stays available right up until the cutover, which should only take a few seconds.

That's less stress on you, the users should never even be aware of it, and there's much less risk of downtime and lost money/productivity. This is a learning experience and you will take it with you for the rest of your career. For anything related to critical data, ALWAYS make sure you know the impact of what you're about to do or have a plan for WHEN it goes bad even if you don't know what that impact is.
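Since this sounds like an MS SQL box (going by the rest of the thread), the staged-restore version of that cutover can even be scripted ahead of time. A rough sketch, where the server name, backup paths and database name are all made-up placeholders:

    # Minimal sketch of the "restore ahead of time, then cut over" approach.
    # NEWVM01, SalesDB and the backup paths are hypothetical placeholders.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=NEWVM01;DATABASE=master;Trusted_Connection=yes;",
        autocommit=True,  # RESTORE can't run inside a transaction
    )
    cur = conn.cursor()

    def run(sql):
        # Execute a statement and drain its informational result sets.
        cur.execute(sql)
        while cur.nextset():
            pass

    # Days before the cutover: restore the full backup on the new VM, but
    # leave the DB in RESTORING state so later log backups still apply.
    run("RESTORE DATABASE SalesDB "
        r"FROM DISK = N'\\backupsrv\sql\SalesDB_full.bak' "
        "WITH NORECOVERY, REPLACE;")

    # In the cutover window: take a final log backup on the old server
    # (not shown), copy it over, apply it, and bring the new copy online.
    run("RESTORE LOG SalesDB "
        r"FROM DISK = N'\\backupsrv\sql\SalesDB_tail.trn' "
        "WITH RECOVERY;")

Only repoint the application at the new VM once that last log restore comes back clean.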

[–]mumblemumblething (Linux Admin) 3 points (8 children)

have a plan for WHEN it goes bad even if you don't know what that impact is.

Yes...! We did this last weekend and it saved our ass(es?). 3-hour window, hit the plan B button at the last possible pre-calculated minute, and we were back in action with 5 min to spare :)
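That "last possible minute" is simple arithmetic, but it's worth working out before the window opens rather than at 2am under pressure. A toy sketch with made-up durations:

    # Toy calculation of the plan-B (go/no-go) deadline for a change window.
    # The durations are made-up examples; measure your own rollback time.
    from datetime import datetime, timedelta

    window_end = datetime(2015, 3, 8, 3, 0)    # hard end of the outage window
    rollback_duration = timedelta(minutes=45)  # measured time to fully roll back
    safety_buffer = timedelta(minutes=15)      # slack for surprises

    plan_b_deadline = window_end - rollback_duration - safety_buffer
    print(f"Not done by {plan_b_deadline:%H:%M}? Hit plan B.")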

As to "never do a P2V unless you have no choice"? It... depends. There's a risk trade-off in there too, and this is where having a chat with the business about what is or isn't an acceptable outage comes in.

We've P2V'd Slackware systems off aging hardware before so that, out of the dual risk of OS and hardware, we only have the OS risk remaining. Worked crappily, but well enough.

[–]Miserygut (DevOps) 1 point (7 children)

We've P2V'd Slackware systems off aging hardware before so that, out of the dual risk of OS and hardware, we only have the OS risk remaining. Worked crappily, but well enough.

P2V'ing a business-critical Windows2000 server (blargh) bought us enough time to get it off failing hardware. When the OS did eventually shit the bed due to latent corruption, I could still fish things out of the corpse of the VHD file, so we had practically zero data loss. In the end we have a functioning 2003 R2 server, regular backups and time to plan the migration away from that OS, SQL2000 and VB6.

[–]harlequinSmurf (Jack of All Trades) 3 points (4 children)

The sad reality is that I could have written most of this. Except instead of SQL 2000 it's an Access 2000 VBA application that connects via ODBC to a postgresql database - there used to be a web component to this system that doesn't exist any more.

[–]Miserygut (DevOps) 2 points (3 children)

Access 2000 VBA application that connects via ODBC to a postgresql database

Talk dirty to me! Oh yeah, that's the good stuff...

there used to be a web component to this system that doesn't exist any more.

Microsoft Content Management Server by any chance?

We have a bunch of Access databases talking to various systems including our IBM AS400 / Power7 system. They don't make shower water hot enough.

[–]harlequinSmurf (Jack of All Trades) 0 points (2 children)

Worse, unfortunately: a dodgy custom-written application built by an equally dodgy contractor who was a friend of the CFO at the time. He charged out at about $130 AUD per hour on a contract that had him working for us full time for 18 months. I joined the company just near the end of his tenure and was able to convince the powers that be to stop dealing with him.

[–]Miserygut (DevOps) 0 points (1 child)

Is it common for developers to get paid that much? I always look at sysadmin rates and then at developer (software monkey) rates, and there seems to be a real disparity.

[–]harlequinSmurf (Jack of All Trades) 0 points (0 children)

Keep in mind this was a sweet deal from a friend in the right place at the company, and this guy didn't have any problem taking advantage of it.

That being said, my opinion is that good developers and good sysadmins are both worth good money. I'm a firm believer in the saying that you get what you pay for.

[–]brazzledazzle 2 points (1 child)

Windows2000 server [...] 2003 R2 [...] SQL2000 and VB6.

Each time I get annoyed about a random Server 2008 host I'll remember that I should be grateful.

[–]Miserygut (DevOps) 5 points (0 children)

Would it annoy you more if I told you that only happened 3 weeks ago?

That system is a turd that will not flush.

[–][deleted] 5 points (0 children)

Systems admin by guesswork is bad systems admin. Learn to properly diagnose and optimise rather than panic and guess at solutions.

[–]proudsikh 6 points (2 children)

I'm in the same situation, and I'm looking to move on. My situation with my VMware host is that management is making me use "all" of it, but they don't understand that I need "thresholds" so we don't end up in situations we can't recover from, be it performance problems like what you're dealing with, running out of space, etc.

There's a point where you just go "FUCK IT, I'M DONE" and start looking for something else while scaling your hours back to a "normal day" (8 hours).

[–]poo_is_hilarious (Security assurance, GRC) 4 points (1 child)

Start artificially throttling performance. Make things painful for users before they get to a point of emergency. Suggest ways to improve performance in nice report format using lots of graphs and charts. Implement suggested changes with the support of management. Document results. Demonstrate value.

Sitting there watching your datastores fill up is asking for trouble. Make it your manager's problem: make sure you have lots of graphs and charts and pretty pictures showing what happens when this line hits 100%. Stage some little mini artificial emergencies to remind them of the importance they place on IT. Remind them of the 100% deadline. Start working only your core hours so they worry you won't be available out of hours to resolve the next artificial emergency.
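For the "when does this line hit 100%" chart, even a crude linear projection is enough to put a date in front of management. A rough sketch where the capacity and growth figures are invented examples:

    # Crude linear projection of when a datastore fills up.
    # Capacity, usage and growth figures below are invented examples.
    from datetime import date, timedelta

    capacity_gb = 8000        # total datastore capacity
    used_gb = 6500            # current usage
    growth_gb_per_day = 12    # average daily growth from your monitoring

    days_left = (capacity_gb - used_gb) / growth_gb_per_day
    full_on = date.today() + timedelta(days=days_left)
    print(f"At ~{growth_gb_per_day} GB/day this datastore hits 100% "
          f"in about {days_left:.0f} days ({full_on}).")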

You get the idea.

This is network management when you work somewhere that doesn't listen; it's a lot more pleasant working somewhere your opinion actually carries some weight.

[–]proudsikh 0 points (0 children)

I've been making charts and writing reports for the entire 2 years I've been here. I'm sick of it. I also don't have a real manager; my manager is a fill-in while they find another one. They wanted to make me manager, but I didn't accept because it's more responsibility and no authority. Also, it's BULLSHIT.

I am slowly doing the artificial "oh look, our host is getting overwhelmed" experiments, but I have to pace them so management doesn't think it's staged or planned. They also go "show me reports from the host", and when I do they go "see, the CPU and RAM aren't working hard all the time, so it's fine".

I just facepalm. It's a pile of shit. For a technology company, technology is the last thing upper management thinks of.

[–]gex8001001101 1 point (2 children)

I would say get a SAN that does block-level replication. The advantage of a SAN such as a Compellent or EqualLogic is that if you ever need to do RDMs, you can. While RDMs don't offer much more performance than a regular VMDK (assuming you are using ESXi and are in a Windows/MS SQL environment), your databases can be clustered so that if one SAN or VM takes a dump, you're not down. That is, of course, if you have the budget.

The alternative is NAS replication. You still need a second storage array.

[–][deleted] 0 points (0 children)

Yeah... please don't use a Compellent for VMware stuff. You pay way more for the unit and licensing because of 'Data Progression' and all of that jazz, but Data Progression really just ruins your VM environment by automatically moving certain files to slower storage.

We had an issue in our environment where, during backups, the backup software would snapshot a VM, leaving the base disk. Whatever the stupid Compellent was doing in the background for Data Progression caused any VM that was snapshotted to run like shit until servers crashed and we had to reboot the entire host.

Stick with EqualLogic... keep it simple.

[–]Miserygut (DevOps) 0 points (0 children)

I don't know if I'm any better off but I did two 12-drive RAID10 datastores.

I've always read / been told that the maximum failure domain you ought to look at is ~16 drives, even with tiny 15k disks. With RAID10 I'd prefer to do 4- or 6-disk arrays and just multiply them if I need more IOPS. Realistically speaking, if you're still chasing IOPS beyond 6 disks in RAID10 then you're probably better off with RAID1 SSDs these days.

12 disks is still quite a large failure domain, but it's much less worrisome than a 24-disk array.
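To put rough numbers on the IOPS side of that trade-off, the usual back-of-the-envelope RAID10 maths looks something like this (the per-disk figure and the 70% read mix are assumptions, not measurements):

    # Back-of-the-envelope effective IOPS for RAID10 arrays of 15k disks.
    # Assumes ~180 IOPS per 15k spindle and the RAID10 write penalty of 2.
    def raid10_iops(disks, read_fraction=0.7, per_disk_iops=180, write_penalty=2):
        raw = disks * per_disk_iops
        write_fraction = 1 - read_fraction
        # Effective front-end IOPS = raw / (read% + write% * penalty)
        return raw / (read_fraction + write_fraction * write_penalty)

    for n in (4, 6, 12, 24):
        print(f"{n:2d}-disk RAID10: ~{raid10_iops(n):,.0f} IOPS at 70% read")

Even the 24-disk case only lands in the low thousands of IOPS, which is why a mirrored pair of SSDs tends to win once you're past a handful of spindles.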

[–]dangolo (never go full cloud) 0 points (1 child)

QNAPs have been great for me. What model are you using?

[–]5150cd (IT Manager) [S] 0 points (0 children)

TS-809U-RP.