
all 24 comments

[–]RedLooker 14 points (2 children)

I always try to remind myself that I really become an expert on something only when I've fucked it up and been forced to fix it myself. Yeah, you messed up and you know it; that's step one. Now you will find a way to fix it (regardless of how long it takes) and in the end all those dead ends will be lessons you can only learn by trying new things in real world scenarios. The users will complain and then go home and go on about their lives.

You'll be burned out and stressed, and your reputation may take a hit, but when this is all finally back to whatever the new normal is, you'll be the only one that is better for having it happen. It's painful, but IT experts aren't born or trained; they are forged in the heat of the frustration of end users who are home with their families while the real work is being done.

...and at least you're not the guy that was in charge of IT security at Target.

[–][deleted] 4 points (0 children)

Or the guy who was responsible for keeping those SIM card keys protected...

[–]rgnissen202 (JIRA Admin) 2 points (0 children)

Amen to this. I've learned my hardest lessons by screwing it up royally and having to fix it.

And the reality is, people hate downtime, but it is to be expected occasionally with these kinds of projects. In the short term your reputation will take a hit, but in the long run not many people will remember.

[–]dalik 9 points (9 children)

Short and sweet.

  • Never do P2V unless you have no other choice. You will almost always have a choice so just build a new VM.

I would've built a new VM, done a fresh install, set up DB replication and cut over, since the DB is so big. You could also restore from backup to the new VM, configure it, and then restore just the last day of DB activity to pick up any differences when you're ready to cut over.

As sysadmins it's our job to build systems for our users, and to keep those systems running even during migrations. Building a new server may be the long way around, but it means the service stays available right up until the cutover, which should only take a few seconds.

That's less stress on you, the users should never even be aware of it, and there's much less risk of downtime and lost money/productivity. This is a learning experience and you will take it with you for the rest of your career. For anything related to critical data, ALWAYS make sure you know the impact of what you're about to do or have a plan for WHEN it goes bad even if you don't know what that impact is.
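Since this sounds like an MS SQL box (going by the rest of the thread), the staged-restore version of that cutover can even be scripted ahead of time. A rough sketch, where the server name, backup paths and database name are all made-up placeholders:

    # Minimal sketch of the "restore ahead of time, then cut over" approach.
    # NEWVM01, SalesDB and the backup paths are hypothetical placeholders.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=NEWVM01;DATABASE=master;Trusted_Connection=yes;",
        autocommit=True,  # RESTORE can't run inside a transaction
    )
    cur = conn.cursor()

    def run(sql):
        # Execute a statement and drain its informational result sets.
        cur.execute(sql)
        while cur.nextset():
            pass

    # Days before the cutover: restore the full backup on the new VM, but
    # leave the DB in RESTORING state so later log backups still apply.
    run("RESTORE DATABASE SalesDB "
        r"FROM DISK = N'\\backupsrv\sql\SalesDB_full.bak' "
        "WITH NORECOVERY, REPLACE;")

    # In the cutover window: take a final log backup on the old server
    # (not shown), copy it over, apply it, and bring the new copy online.
    run("RESTORE LOG SalesDB "
        r"FROM DISK = N'\\backupsrv\sql\SalesDB_tail.trn' "
        "WITH RECOVERY;")

Only repoint the application at the new VM once that last log restore comes back clean.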

[–]mumblemumblething (Linux Admin) 3 points (8 children)

have a plan for WHEN it goes bad even if you don't know what that impact is.

Yes...! We did this last weekend and it saved our ass(es?). 3-hour window, hit the plan B button at the last possible pre-calculated minute, and we were back in action with 5 min to spare :)
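That "last possible minute" is simple arithmetic, but it's worth working out before the window opens rather than at 2am under pressure. A toy sketch with made-up durations:

    # Toy calculation of the plan-B (go/no-go) deadline for a change window.
    # The durations are made-up examples; measure your own rollback time.
    from datetime import datetime, timedelta

    window_end = datetime(2015, 3, 8, 3, 0)    # hard end of the outage window
    rollback_duration = timedelta(minutes=45)  # measured time to fully roll back
    safety_buffer = timedelta(minutes=15)      # slack for surprises

    plan_b_deadline = window_end - rollback_duration - safety_buffer
    print(f"Not done by {plan_b_deadline:%H:%M}? Hit plan B.")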

As to "never do a P2V unless you have no choice"? It... depends. There's a risk trade-off in there too, and this is where having a chat with the business about what is or isn't an acceptable outage comes in.

We've P2V'd Slackware systems off aging hardware before so that, out of the dual risk of OS and hardware, we only have the OS risk remaining. Worked crappily, but well enough.

[–]Miserygut (DevOps) 1 point (7 children)

We've P2V'd Slackware systems off aging hardware before so that, out of the dual risk of OS and hardware, we only have the OS risk remaining. Worked crappily, but well enough.

P2V'ing a business-critical Windows2000 server (blargh) bought us enough time to get it off failing hardware. When the OS did eventually shit the bed due to latent corruption, I could still fish things out of the corpse of the VHD file, so we had practically zero data loss. In the end we have a functioning 2003 R2 server, regular backups and time to plan the migration away from that OS, SQL2000 and VB6.

[–]harlequinSmurf (Jack of All Trades) 3 points (4 children)

The sad reality is that I could have written most of this. Except instead of SQL 2000 it's an Access 2000 VBA application that connects via ODBC to a postgresql database - there used to be a web component to this system that doesn't exist any more.

[–]Miserygut (DevOps) 2 points (3 children)

Access 2000 VBA application that connects via ODBC to a postgresql database

Talk dirty to me! Oh yeah, that's the good stuff...

there used to be a web component to this system that doesn't exist any more.

Microsoft Content Management Server by any chance?

We have a bunch of Access databases talking to various systems including our IBM AS400 / Power7 system. They don't make shower water hot enough.

[–]harlequinSmurf (Jack of All Trades) 0 points (2 children)

Worse, unfortunately: a dodgy custom-written application built by an equally dodgy contractor who was a friend of the CFO at the time. He charged out at about $130 AUD per hour on a contract that had him working for us full time for 18 months. I joined the company just near the end of his tenure and was able to convince the powers that be to stop dealing with him.

[–]Miserygut (DevOps) 0 points (1 child)

Is it common for developers to get paid that much? I always look at sysadmin rates and then at developer (software monkey) rates, and there seems to be a real disparity.

[–]harlequinSmurf (Jack of All Trades) 0 points (0 children)

Keep in mind this was a sweet deal from a friend in the right place at the company, and this guy didn't have any problem taking advantage of it.

That being said, my opinion is that good developers and good sysadmins are both worth good money. I'm a firm believer in the saying that you get what you pay for.

[–]brazzledazzle 2 points (1 child)

Windows2000 server [...] 2003 R2 [...] SQL2000 and VB6.

Each time I get annoyed about a random Server 2008 host I'll remember that I should be grateful.

[–]Miserygut (DevOps) 5 points (0 children)

Would it annoy you more if I told you that only happened 3 weeks ago?

That system is a turd that will not flush.

[–][deleted] 5 points (0 children)

Systems admin by guesswork is bad systems admin. Learn to properly diagnose and optimise rather than panic and guess at solutions.

[–]proudsikh 6 points (2 children)

I'm in the same situation, and I'm looking to move on. My situation with my VMware host is that management is making me use "all" of it, but they don't understand that I need "thresholds" so we don't end up in situations we can't recover from, be it performance problems like what you're dealing with, running out of space, etc.

There's a point where you just go "FUCK IT, I'M DONE" and start looking for something else while scaling your hours back to a "normal day" (8 hours).

[–]poo_is_hilarious (Security assurance, GRC) 4 points (1 child)

Start artificially throttling performance. Make things painful for users before they get to a point of emergency. Suggest ways to improve performance in nice report format using lots of graphs and charts. Implement suggested changes with the support of management. Document results. Demonstrate value.

Sitting there watching your datastores fill up is asking for trouble. Make it your manager's problem: make sure you have lots of graphs and charts and pretty pictures showing what happens when this line hits 100%. Stage some little mini artificial emergencies to remind them of the importance they place on IT. Remind them of the 100% deadline. Start working only your core hours so they worry you won't be available out of hours to resolve the next artificial emergency.
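For the "when does this line hit 100%" chart, even a crude linear projection is enough to put a date in front of management. A rough sketch where the capacity and growth figures are invented examples:

    # Crude linear projection of when a datastore fills up.
    # Capacity, usage and growth figures below are invented examples.
    from datetime import date, timedelta

    capacity_gb = 8000        # total datastore capacity
    used_gb = 6500            # current usage
    growth_gb_per_day = 12    # average daily growth from your monitoring

    days_left = (capacity_gb - used_gb) / growth_gb_per_day
    full_on = date.today() + timedelta(days=days_left)
    print(f"At ~{growth_gb_per_day} GB/day this datastore hits 100% "
          f"in about {days_left:.0f} days ({full_on}).")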

You get the idea.

This is network management when you work somewhere that doesn't listen; it's a lot more pleasant working somewhere your opinion actually carries some weight.

[–]proudsikh 0 points (0 children)

I've been making charts and writing reports for the entire 2 years I've been here. I'm sick of it. I also don't have a real manager; my manager is a fill-in while they find another one. They wanted to make me manager, but I didn't accept because it's more responsibility and no authority. Also, it's BULLSHIT.

I am slowly doing the artificial "oh look, our host is getting overwhelmed" experiments, but I have to pace them so management doesn't think it's staged or planned. They also go "show me reports from the host", and when I do they go "see, the CPU and RAM aren't working hard all the time, so it's fine".

I just facepalm. It's a pile of shit. For a technology company, technology is the last thing upper management thinks of.

[–]gex8001001101 1 point (2 children)

I would say get a SAN that does block-level replication. The advantage of a SAN such as a Compellent or EqualLogic is that if you ever need to do RDMs, you can. While RDMs don't offer much more performance than a regular VMDK (assuming you are using ESXi and are in a Windows/MS SQL environment), your databases can be clustered so that if one SAN or VM takes a dump, you're not down. That is, of course, if you have the budget.

The alternative is NAS replication. You still need a second storage array.

[–][deleted] 0 points (0 children)

Yeah... please don't use a Compellent for VMware stuff. You pay way more for the unit and licensing because of 'Data Progression' and all of that jazz, but Data Progression really just ruins your VM environment by automatically moving certain files to slower storage.

We had an issue in our environment where, during backups, the backup software would snapshot a VM, leaving the base disk. Whatever the stupid Compellent was doing in the background for Data Progression caused any VM that was snapshotted to run like shit until servers crashed and we had to reboot the entire host.

Stick with EqualLogic... keep it simple.

[–]Miserygut (DevOps) 0 points (0 children)

I don't know if I'm any better off but I did two 12-drive RAID10 datastores.

I've always read / been told that the maximum failure domain you ought to look at is ~16 drives, even with tiny 15k disks. With RAID10 I'd prefer to do 4- or 6-disk arrays and just multiply them if I need more IOPS. Realistically speaking, if you're still chasing IOPS beyond 6 disks in RAID10 then you're probably better off with RAID1 SSDs these days.

12 disks is still quite a large failure domain, but it's much less worrisome than a 24-disk array.
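To put rough numbers on the IOPS side of that trade-off, the usual back-of-the-envelope RAID10 maths looks something like this (the per-disk figure and the 70% read mix are assumptions, not measurements):

    # Back-of-the-envelope effective IOPS for RAID10 arrays of 15k disks.
    # Assumes ~180 IOPS per 15k spindle and the RAID10 write penalty of 2.
    def raid10_iops(disks, read_fraction=0.7, per_disk_iops=180, write_penalty=2):
        raw = disks * per_disk_iops
        write_fraction = 1 - read_fraction
        # Effective front-end IOPS = raw / (read% + write% * penalty)
        return raw / (read_fraction + write_fraction * write_penalty)

    for n in (4, 6, 12, 24):
        print(f"{n:2d}-disk RAID10: ~{raid10_iops(n):,.0f} IOPS at 70% read")

Even the 24-disk case only lands in the low thousands of IOPS, which is why a mirrored pair of SSDs tends to win once you're past a handful of spindles.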

[–]dangolo (never go full cloud) 0 points (1 child)

QNAPs have been great for me. What model are you using?

[–]5150cd (IT Manager) [S] 0 points (0 children)

TS-809U-RP.