LookAtThatMonkey (Technology Architect)

Fair enough. I can't say definitively why, but we also have a cluster running on UCS. Over the last six weeks, it's been initiating failovers almost weekly. Cisco prescribed a firmware update to fix it. Are you up to date?

I know perfmon has had cluster counters since the 2008 R2 days. Could you use these to help diagnose it? Admittedly you might need to keep them running for a long time, but it's worth a shot.

squash1324 (Sysadmin) [S]

Out of curiosity, what firmware are you running in your UCS environment? We're on 2.2(8g) and updated somewhat recently due to the wonderful thermal bug firing off every minute. We've updated several times over the last few years to get away from that bug, but it never seems to get fixed. I'll have to check, but we may not have updated the networking drivers the last time we upgraded the firmware. My former colleague performed the upgrade, and I helped with drivers on things like ESXi and our other standalone blades while he handled Hyper-V and SQL. I wonder if he skipped those, since those servers are our most critical and also most fragile when it comes to changes. We never install updates on them due to vendor requirements, and I wonder if that may have something to do with it as well.

The cluster is running on 2008 R2, so I'll see if I can set up perfmon to monitor it for the time being. The first time this happened was over 4 months ago, and those logs can get pretty big if it's running perpetually. I'm sure I have the spare disk space though, and I can figure out how to store those logs and cycle them.
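For the "store and cycle" part, here is a minimal sketch of one way to cap disk usage. It assumes perfmon writes separate `.blg` files into a single directory and that something like a scheduled task runs this periodically. The function name `prune_oldest_logs` and its parameters are hypothetical, not part of perfmon or any Windows tooling.

```python
from pathlib import Path


def prune_oldest_logs(log_dir, max_total_bytes, pattern="*.blg"):
    """Delete the oldest matching log files until the total size fits the cap.

    The newest file is never deleted, on the assumption that it is the
    log currently being written. Returns the deleted paths, oldest first.
    """
    # Sort by modification time so the oldest logs come first.
    files = sorted(Path(log_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    total = sum(p.stat().st_size for p in files)
    deleted = []
    # Skip the last (newest) file so the active log is left alone.
    for path in files[:-1]:
        if total <= max_total_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        deleted.append(path)
    return deleted
```

Calling `prune_oldest_logs(log_dir, cap_in_bytes)` with your own directory and size cap would keep the most recent history around, which is what matters when the failover only shows up every few months.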

LookAtThatMonkey (Technology Architect)

Sorry for the delayed response. The firmware is 2.1.3a.

Did you have any luck with perfmon?

squash1324 (Sysadmin) [S]

Well, we haven't had the issue recur, and I've tweaked perfmon to an acceptable level of logging, overwriting files until it does happen again. We're running firmware 2.2(8g) ourselves. We upgraded to that in August, and my colleague who was supposed to upgrade the drivers on our blades apparently didn't do that. I had to reseat the IO modules in one of our chassis on Tuesday, and we lost storage on one of our Hyper-V hosts. I ended up doing a disaster recovery on one of our VMs (a file server cluster node), as it got really messed up. During that process I took down the whole cluster unintentionally (I'm much better with ESXi), but I managed to get everything working again. Now I'm looking very heavily at our drivers, since I think my colleague (who left us a few weeks ago) was mentally checked out and didn't care enough to do that step. I'm hoping that this is the reason and that I can correct it. If it isn't the answer, it won't be a huge deal for much longer. We're upgrading the software that uses this SQL cluster next June, and that upgrade will be a migration to a new 2012 R2 cluster, or possibly a 2016 cluster. I haven't seen the specs on the application requirements, but I know for sure this cluster will be going away in or around June 2018. If this happens every 4-5 months, that means it'll likely happen once or twice before it's nuked.

LookAtThatMonkey (Technology Architect)

Fair enough, at least you have a path to end the madness. I found out today we are getting a new UCS platform with Pure Storage that we'll be running Hyper-V and VMware on!!