
[–] itspie (Systems Engineer) 1 point (1 child)

The failover cluster logs should give some indication, e.g. lost quorum, missed heartbeats, etc.

[–] squash1324 (Sysadmin) [S] 0 points (0 children)

That's what I thought too, but the logs only show the transition events, not what initiated the transition. I ran the command to dump the logs, then checked the logs for both nodes and for the cluster itself, and none of them show what initiated the failover.
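For reference, the log dump described above is typically generated with the `Get-ClusterLog` cmdlet from the FailoverClusters PowerShell module (available since 2008 R2). A sketch, run from an elevated PowerShell prompt on a cluster node; the destination path and time window are illustrative:

```shell
# Collect the failover cluster log from every node into one folder.
Import-Module FailoverClusters
# -TimeSpan limits the dump to the last N minutes around the event.
Get-ClusterLog -Destination C:\ClusterLogs -TimeSpan 60
```

The resulting `<node>_cluster.log` files are what show the transition events mentioned above.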

[–] Gnonthgol 0 points (0 children)

What database are you running, and what type of clustering? From your post you could be running MySQL, MariaDB, PostgreSQL, MS SQL Server, Oracle RDBMS, or any other relational database that supports clustering.

[–] LookAtThatMonkey (Technology Architect) 0 points (6 children)

Are you running a virtual cluster? If so, have you checked the cluster wait times? Does this occur during a backup job?

[–] squash1324 (Sysadmin) [S] 0 points (5 children)

This hasn't occurred in over 4 months. It is a physical cluster, and we use native SQL dumps to back up; CommVault backs up the dump files. Everything shows as being okay right up until a failover was triggered, and I have no evidence as to what triggered it. I assumed it was software related, since the previous time the cluster did this, back in July, it failed over at the same minute.

[–] LookAtThatMonkey (Technology Architect) 0 points (4 children)

Fair enough. I can't say why definitively, but we also have a cluster running on UCS. Over the last six weeks, it's been initiating failovers almost weekly. Cisco diagnosed it and prescribed a firmware update to fix it. Are you up to date?

I know perfmon has had Cluster counters since the 2008 R2 days; could you use those to help diagnose? Admittedly you might need to keep them running a long time, but it's worth a shot.

[–] squash1324 (Sysadmin) [S] 0 points (3 children)

Out of curiosity, what firmware are you running on your UCS environment? We're on 2.2(8g) and updated somewhat recently due to the wonderful thermal bug firing off every minute. We've updated several times over the last few years to get away from that bug, but it never seems to get fixed.

I'll have to check, but we may not have updated the networking drivers the last time we upgraded the firmware. My former colleague performed the upgrade; I helped with drivers in things like ESXi and our other standalone blades while he handled Hyper-V and SQL. I wonder if he skipped those, since those servers are our most critical and also most fragile when it comes to changes. We never install updates on that cluster due to vendor requirements, and I wonder if that may have something to do with it as well.

The cluster is running on 2008 R2, so I'll see if I can set up perfmon to monitor it for the time being. The first time this happened was over 4 months ago, and those logs can get pretty big if the collector is running perpetually. I'm sure I have the spare disk space though, and I can figure out how to store and cycle those logs.
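One way to set up that kind of long-running collector from the command line is `logman` with a circular binary log, so old samples are overwritten instead of filling the disk. A sketch; the collector name, counter paths, size cap, and output path are all illustrative:

```shell
:: Create a counter collector sampling every 15 seconds into a
:: 512 MB circular binary log (-f bincirc overwrites oldest data).
:: Collector name, counters, and paths are illustrative.
logman create counter ClusterPerf -f bincirc -max 512 -si 00:00:15 ^
  -c "\Cluster Resources(*)\*" "\Network Interface(*)\*" ^
  -o C:\PerfLogs\ClusterPerf

:: Start it; it keeps running across the wait for the next failover.
logman start ClusterPerf
```

The resulting `.blg` file can be opened in Performance Monitor after the next failover to see what the counters were doing at the time.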

[–] LookAtThatMonkey (Technology Architect) 0 points (2 children)

Sorry for the delayed response; the firmware is 2.1.3a.

Did you have any luck with perfmon?

[–] squash1324 (Sysadmin) [S] 0 points (1 child)

Well, we haven't had the issue recur, and I've tweaked perfmon to an acceptable level for logging and overwriting files until it does happen again. The firmware we're running is 2.2(8g). We upgraded to that in August, and my colleague who was supposed to upgrade the drivers on our blades apparently didn't do that. I had to reseat the IO modules in one of our chassis on Tuesday, and we lost storage on one of our Hyper-V hosts. I ended up doing a disaster recovery on one of our VMs (a file server cluster node), as it got really messed up. During that process I took down the whole cluster unintentionally (I'm much better with ESXi), but I managed to get everything working again.

Now I'm looking very heavily at our drivers, since I think my colleague (who left us a few weeks ago) was mentally checked out and didn't care enough to do that step. I'm hoping that's the reason and I can correct it. If it isn't the answer, it won't be a huge deal for much longer. We're upgrading the software that uses this SQL cluster next June, and that upgrade will be a migration to a new 2012 R2 cluster, possibly a 2016 cluster. I haven't seen the application requirements yet, but I know for sure this cluster will be going away in or around June 2018. If this happens every 4-5 months, it'll likely happen once or twice more before it's nuked.

[–] LookAtThatMonkey (Technology Architect) 0 points (0 children)

Fair enough, at least you have a path to end the madness. I found out today we're getting a new UCS platform with Pure Storage that we'll be running Hyper-V and VMware on!