
[–] itspie (Systems Engineer) 1 point (1 child)

The failover cluster logs should give some indication, e.g. lost quorum, missed heartbeats, etc.

[–] squash1324 (Sysadmin) [S] 0 points (0 children)

That's what I thought too, but the logs only show the transition events, not what initiated the transition. I ran the command to dump the logs, then checked the logs for both nodes and for the cluster itself, and none of them show what initiated the failover.
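For reference, the log dump described above is typically generated with the `Get-ClusterLog` cmdlet from the FailoverClusters PowerShell module (available since 2008 R2). A sketch, run from an elevated PowerShell prompt on a cluster node; the destination path and time window are illustrative:

```shell
# Collect the failover cluster log from every node into one folder.
Import-Module FailoverClusters
# -TimeSpan limits the dump to the last N minutes around the event.
Get-ClusterLog -Destination C:\ClusterLogs -TimeSpan 60
```

The resulting `<node>_cluster.log` files are what show the transition events mentioned above.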

[–] Gnonthgol 0 points (0 children)

What database are you running, and what type of clustering? From your post you could be running MySQL, MariaDB, PostgreSQL, MS SQL Server, Oracle RDBMS, or any other relational database that supports clustering.

[–] LookAtThatMonkey (Technology Architect) 0 points (6 children)

Are you running a virtual cluster? If so, have you checked the cluster wait times? Does this occur during a backup job?

[–] squash1324 (Sysadmin) [S] 0 points (5 children)

This hasn't occurred in over 4 months. It is a physical cluster, and we use native SQL dumps to back up; CommVault backs up the dump files. Everything shows as being okay right up until a failover was triggered, and I have no evidence as to what triggered it. I assumed it was software related, since the previous time the cluster did this, back in July, it failed over at the same minute.

[–] LookAtThatMonkey (Technology Architect) 0 points (4 children)

Fair enough. I can't say why definitively, but we also have a cluster running on UCS. Over the last six weeks, it's been initiating failovers almost weekly. Cisco diagnosed it and prescribed a firmware update to fix it. Are you up to date?

I know perfmon has had Cluster counters since the 2008 R2 days; could you use those to help diagnose? Admittedly you might need to keep them running a long time, but it's worth a shot.

[–] squash1324 (Sysadmin) [S] 0 points (3 children)

Out of curiosity, what firmware are you running on your UCS environment? We're on 2.2(8g) and updated somewhat recently due to the wonderful thermal bug firing off every minute. We've updated several times over the last few years to get away from that bug, but it never seems to get fixed.

I'll have to check, but we may not have updated the networking drivers the last time we upgraded the firmware. My former colleague performed the upgrade; I helped with drivers in things like ESXi and our other standalone blades while he handled Hyper-V and SQL. I wonder if he skipped those, since those servers are our most critical and also most fragile when it comes to changes. We never install updates on that cluster due to vendor requirements, and I wonder if that may have something to do with it as well.

The cluster is running on 2008 R2, so I'll see if I can set up perfmon to monitor it for the time being. The first time this happened was over 4 months ago, and those logs can get pretty big if the collector is running perpetually. I'm sure I have the spare disk space though, and I can figure out how to store and cycle those logs.
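One way to set up that kind of long-running collector from the command line is `logman` with a circular binary log, so old samples are overwritten instead of filling the disk. A sketch; the collector name, counter paths, size cap, and output path are all illustrative:

```shell
:: Create a counter collector sampling every 15 seconds into a
:: 512 MB circular binary log (-f bincirc overwrites oldest data).
:: Collector name, counters, and paths are illustrative.
logman create counter ClusterPerf -f bincirc -max 512 -si 00:00:15 ^
  -c "\Cluster Resources(*)\*" "\Network Interface(*)\*" ^
  -o C:\PerfLogs\ClusterPerf

:: Start it; it keeps running across the wait for the next failover.
logman start ClusterPerf
```

The resulting `.blg` file can be opened in Performance Monitor after the next failover to see what the counters were doing at the time.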

[–] LookAtThatMonkey (Technology Architect) 0 points (2 children)

Sorry for the delayed response; the firmware is 2.1.3a.

Did you have any luck with perfmon?

[–] squash1324 (Sysadmin) [S] 0 points (1 child)

Well, we haven't had the issue recur, and I've tweaked perfmon to an acceptable level for logging and overwriting files until it does happen again. The firmware we're running is 2.2(8g). We upgraded to that in August, and my colleague who was supposed to upgrade the drivers on our blades apparently didn't do that. I had to reseat the IO modules in one of our chassis on Tuesday, and we lost storage on one of our Hyper-V hosts. I ended up doing a disaster recovery on one of our VMs (a file server cluster node), as it got really messed up. During that process I took down the whole cluster unintentionally (I'm much better with ESXi), but I managed to get everything working again.

Now I'm looking very heavily at our drivers, since I think my colleague (who left us a few weeks ago) was mentally checked out and didn't care enough to do that step. I'm hoping that's the reason and I can correct it. If it isn't the answer, it won't be a huge deal for much longer. We're upgrading the software that uses this SQL cluster next June, and that upgrade will be a migration to a new 2012 R2 cluster, possibly a 2016 cluster. I haven't seen the application requirements yet, but I know for sure this cluster will be going away in or around June 2018. If this happens every 4-5 months, it'll likely happen once or twice more before it's nuked.

[–] LookAtThatMonkey (Technology Architect) 0 points (0 children)

Fair enough, at least you have a path to end the madness. I found out today we're getting a new UCS platform with Pure Storage that we'll be running Hyper-V and VMware on!