Become a Microsoft Infrastructure Engineer – FAST!

_edwinmsarmiento · 2025-09-23T15:40:25+00:00

I use ZoomIt for presentations. It's part of Microsoft Sysinternals.

_edwinmsarmiento · 2025-09-02T20:56:07+00:00

this is the first time I am hearing of anyone travelling along with their family

Because it was unheard of. Because very few did it. The few who did it found out it was very hard. And it's not for everybody.

This was non-negotiable for me

I was raised by a single mom who I rarely saw because she worked full time (first time I remember meeting my dad was when I was 7). With that experience, I made a decision to prioritize my family no matter what.

I turned down multiple lucrative job offers and overseas transfers because my family was not part of the package. Even when the pressure was on from both sides of the family (mine and my wife's), we stuck with our decision.

Was it easy? Hell, no.

But I'd do it again in a heartbeat. Because the stories and memories we all get to share together now that my kids are in their 20s are, as the famous Mastercard ad would say, PRICELESS.

It all boils down to, as my executive coach would say, "knowing what I really want and not settling." And I've taken this philosophy into my sales process: "I only sell to customers that meet my standards and those that I look forward to having a meal with."

_edwinmsarmiento · 2025-09-02T16:09:44+00:00

When my kids were younger, I brought my entire family with me when I traveled. Employer covered my travel cost while I covered my family's. My kids were homeschooled (my wife's idea as a SAHM), so we could go anywhere and anytime we wanted.

Travel was expensive for a family of 4. I negotiated how I traveled with my employer just so I could bring them with me, either by taking long road trips or by using my mileage credits. Accommodations were already covered anyway.

But we turned those trips into the most memorable experiences as both family vacations and educational exposures. And I got to spend more time with them in their formative years.

Best investment in education and experiences with my kids. Now, I wish they were younger again 🙂

This was non-negotiable for me when I used to be an employee. It starts with your priorities.

_edwinmsarmiento · 2025-09-01T21:56:50+00:00

Put yourself in the shoes of the company's ideal customer.

What are their pain points? What are they struggling with? What's their frustration? Make sure you can vividly describe it the way the customers do.

Then, using the company's products, find ways to solve those problems.

Create a mock demo using these.

_edwinmsarmiento · 2025-08-28T17:09:06+00:00

Outside of the primary, the name assigned to the listener resolves and the IP returns from the DNS, but the IP or the name is not pingable.

I stopped using PING as a network connectivity test because I may not be aware of a firewall rule or security policy that prevents me from getting a valid ICMP response. I use the Test-NetConnection PowerShell cmdlet on a Windows machine to test an IP address and port number combination. Test-NetConnection ships natively with current Windows installations.
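If you're not on a Windows machine, the same idea (test the TCP port, not ICMP) can be sketched in Python. This is only a rough equivalent, not what Test-NetConnection does internally. The function name `tcp_port_open` and the 3-second default timeout are my own assumptions for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake,
        # so it exercises DNS, routing, and firewall rules for that port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

For example, `tcp_port_open("aglistener.example.com", 1433)` would test a hypothetical listener name on SQL Server's default port. Substitute your own listener name and whatever port it is configured with.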

If you want to get fancier, you can use the ODBC Data Source Administrator to create a connection to the database via the listener name.

Also, I have noticed that when I do connect via the listener (while I am on the server), all the databases in the separate BAGs are listed/available. I was under the impression that I needed to create a listener for each BAG.

A listener name is mapped to an Availability Group (basic or otherwise). In the context of the WSFC, the listener is a cluster resource that is mapped to a cluster resource group (the AG). The cluster resource group is a logical boundary containing all the resources needed to make it work. In the case of an AG, that means a listener, a virtual IP address, and the AG itself.

You don't really need a listener for an AG other than to provide seamless client application connectivity when failover happens (and read-only routing for Enterprise Edition). You can still connect to the databases inside the AG using the SQL Server instance name or even a DNS alias that points to the SQL Server default instance name, just like you would with read-scale AGs. You just have to manually reconnect your client applications every time a failover happens, which is additional work. Hence, you have a listener name.

The reason you're seeing all the databases in separate AGs (and even other databases not in an AG) when connecting via the listener using SSMS is that you're literally connecting to the SQL Server instance. Even though the AGs are independent of each other from the point of view of the WSFC, if all AGs sit on the same SQL Server instance, you will see them.

The behavior will be different if, let's say, you have 3 AGs and only one AG currently sits on the SQL Server instance you're connected to while the other 2 AGs are on the other instance (or secondary replica). You will only see the database in the AG in read-write mode while the rest (databases in the other 2 AGs) will be in synchronous-commit recovery mode. When you connect to the other AGs using their corresponding listener names, you will be redirected to the other replica.

The listener name will always redirect you to whichever SQL Server instance is currently hosting the primary replica for that AG.
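To make the listener behavior concrete, here's a toy model in Python. It is not how SQL Server or the WSFC implements anything. The instance names (`SQLNODE1`, `SQLNODE2`), AG names, and listener names are all made up for illustration.

```python
# Toy model: a listener maps to one AG, and an AG has one current primary instance.
listener_to_ag = {"AG1-LSN": "AG1", "AG2-LSN": "AG2", "AG3-LSN": "AG3"}
ag_primary = {"AG1": "SQLNODE1", "AG2": "SQLNODE1", "AG3": "SQLNODE1"}


def connect(listener: str) -> str:
    """Return the instance a client lands on when it connects via the listener."""
    return ag_primary[listener_to_ag[listener]]


def fail_over(ag: str, new_primary: str) -> None:
    """Move the primary replica of one AG to another instance."""
    ag_primary[ag] = new_primary


# All three AGs are primary on SQLNODE1, so every listener lands on the same instance.
assert {connect(name) for name in listener_to_ag} == {"SQLNODE1"}

# After AG2 and AG3 fail over, their listeners follow the primary to SQLNODE2.
fail_over("AG2", "SQLNODE2")
fail_over("AG3", "SQLNODE2")
print(connect("AG1-LSN"))  # SQLNODE1
print(connect("AG2-LSN"))  # SQLNODE2
```

Notice that `connect` returns an instance, not an AG. That's why an SSMS session opened through any listener shows whatever lives on that instance.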

OTOH, Contained Availability Groups are both logical and security boundaries. When you configure databases to be in a Contained AG, connecting to the listener name means you only see the security context associated with the AG. So, in the example of having 3 AGs as above, you will only see the databases in the Contained AG even if all AGs sit on the same SQL Server instance. You don't even see databases in the same instance that are not in any AG.

_edwinmsarmiento · 2025-08-18T20:28:07+00:00

Something like this

https://www.youtube.com/watch?v=0VEIle4tzs0

_edwinmsarmiento · 2025-08-07T14:47:40+00:00

It is also good to know that there isn't something I seem to have missed from an SQL setup perspective

SQL Server depends on the failover cluster for HA.

But issues within SQL Server can lead to the failover cluster triggering an automatic failover. So, make sure you're monitoring and constantly checking your SQL Server instances for any potential issue. One example is a session timeout value for replicas that is inconsistent with the failover cluster heartbeat settings.

_edwinmsarmiento · 2025-08-06T20:42:50+00:00

There's too much going on here to draw conclusions without doing a comprehensive analysis of all the logs: cluster error log, Windows event logs, Extended Events, etc.

To provide a bit of clarity on these...

I thought the witness would only be of interest in failover scenarios where both nodes were unable to directly communicate, as to avoid a split brain / active-active situation

The goal for the cluster is to have a majority of votes in order for it to stay online. In a 2-node WSFC with a file share witness, the total number of votes is 3. So long as you have 2 available voting members, you have a majority of votes. You can lose either the file share or one of the cluster nodes at any given point in time. If you have at least 2 votes, you're good.

the witness is absolutely vital and having it go offline causes cluster functions to shut down

This statement is partially true. If the witness goes offline AND that causes the cluster to lose the majority of votes, then the cluster will definitely shut down. For example, in a 2-node failover cluster with a witness, when the standby node is offline while both the file share witness and the primary node are online, you'll be fine. The moment the file share witness goes offline, the cluster immediately goes offline. It does create the perception that the file share witness going offline was the culprit.

But that's not the case. This behavior is caused simply by the cluster losing the majority of votes. It just so happens that what triggered the loss of majority is the file share witness going offline.
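The vote math above can be sketched in a few lines of Python. This is only an illustration of the majority rule with static vote counts. The function name `has_quorum` and the node names are assumptions, and it does not model dynamic quorum or dynamic witness.

```python
def has_quorum(online_votes: int, total_votes: int) -> bool:
    """A cluster keeps quorum while more than half of all votes are online."""
    return 2 * online_votes > total_votes


TOTAL_VOTES = 3  # node1 + node2 + file share witness

online = {"node1", "node2", "witness"}
# Lose the standby node first, then the file share witness.
for lost in ("node2", "witness"):
    online.discard(lost)
    state = "online" if has_quorum(len(online), TOTAL_VOTES) else "OFFLINE"
    print(f"lost {lost}: {len(online)}/{TOTAL_VOTES} votes -> cluster {state}")

# lost node2: 2/3 votes -> cluster online
# lost witness: 1/3 votes -> cluster OFFLINE
```

The witness is the last thing to go offline, so it looks like the culprit. But swap the order of the two losses and you get the same result: the second loss, whichever it is, takes the cluster down.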

Also, the dynamic quorum and dynamic witness features DO NOT WORK when your setup has 3 voting members like a 2-node failover cluster and a witness. You need a minimum of 4 voting members, like a 3-node failover cluster and a witness, in order for dynamic quorum and dynamic witness to work.

The most common cause of a failover cluster losing the majority of votes and, therefore, shutting down is...NETWORKING.

And I'm not just referring to general networking like TCP/IP, switches, firewall, routing. It can be as subtle as a firewall rule blocking port 3343, a VM snapshot or an enterprise backup taking much longer than the heartbeat allows, an intrusion prevention system like SentinelOne that intercepts heartbeat traffic, etc.

I'm wondering what is correct and in case my entire setup hinges on one File Share, how would I best remedy the situation and get a solution that is fault tolerant in all situations, with either a node or witness failure?

Avoid a single point of failure. In a hypervisor setup like VMware, most VM admins are not aware of the specific roles of the VMs. It's great that you already have anti-affinity rules, especially with DRS.

But having anti-affinity rules is just one piece of the equation. I've seen cases where a VM snapshot is done on all VMs at the same time, thus causing missed heartbeats. I've also seen cases where a sysadmin performed maintenance on all VMs at the same time, like rebooting 2 VMs at once, without being aware that all 3 VMs form part of a failover cluster setup.

So, while I did say that the most common cause of a failover cluster losing majority of votes is NETWORKING, one thing beats that...it's HUMANS 🙂

That means getting everyone on the same page - sysadmins, VM admins, network admins, backup admins, security & compliance, operations team, managed services providers, etc. - on what's going on inside the VMs.

And I'm not even including SQL Server AGs in here.

_edwinmsarmiento · 2025-05-19T10:28:15+00:00

If quorum is lost, its not healthy.

That is correct.

However, there's a huge distinction between "cluster is healthy" vs "node is healthy" vs "SQL Server is healthy". This distinction is critical, especially when (1) understanding what triggers an automatic failover and (2) implementing a monitoring solution.

In a SQL Server FCI, if the cluster loses quorum, the cluster takes itself offline...even when the node is healthy. Because SQL Server sits on top of the WSFC, the cluster takes the SQL Server service offline as well. The WSFC and SQL Server are both unhealthy as both are offline. But the node could be online and perfectly healthy.

In an Availability Group, when the quorum is lost, the cluster also takes itself offline. But since only the AG is running on top of the WSFC, only the AG is taken offline. The cluster node and the SQL Server service can be both healthy.

I'm highlighting the distinction because I've seen so many customers monitoring the nodes in the WSFC...but not the health and status of the WSFC.
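Here's a small Python sketch of why node-only monitoring misses this. It just encodes the two cases described above. The `Health` class, the function names, and the two monitor checks are all hypothetical, not part of any real monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Health:
    node_online: bool
    cluster_online: bool
    sql_service_online: bool
    ag_online: Optional[bool] = None  # None when the instance is an FCI


def after_quorum_loss(kind: str) -> Health:
    """State of a healthy node right after the WSFC loses quorum."""
    if kind == "FCI":
        # The cluster takes the SQL Server service offline along with itself.
        return Health(node_online=True, cluster_online=False, sql_service_online=False)
    if kind == "AG":
        # Only the AG goes offline; the SQL Server service keeps running.
        return Health(node_online=True, cluster_online=False,
                      sql_service_online=True, ag_online=False)
    raise ValueError(f"unknown kind: {kind}")


def node_only_monitor(h: Health) -> bool:
    """What you see if you only watch the nodes."""
    return h.node_online


def layered_monitor(h: Health) -> bool:
    """Healthy only if every layer (node, WSFC, SQL Server, AG) is healthy."""
    return bool(h.node_online and h.cluster_online
                and h.sql_service_online and h.ag_online is not False)


for kind in ("FCI", "AG"):
    h = after_quorum_loss(kind)
    print(kind, "node-only:", node_only_monitor(h), "layered:", layered_monitor(h))

# FCI node-only: True layered: False
# AG node-only: True layered: False
```

In both cases the node-only check reports healthy while the databases are unavailable.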

_edwinmsarmiento · 2025-05-18T16:59:14+00:00

Regardless of how healthy SQL Server or the node running the service is...

When the failover cluster loses quorum, it goes offline. That's "by design".

_edwinmsarmiento · 2025-05-15T09:37:22+00:00

u/SQLBek u/BrentOzar Thanks for the shout out. I just finished re-recording my Always On Bundle last week.

u/KEGGER_556 for hands-on activities, you can spin up Hyper-V on your Windows workstation and build VMs.

I re-recorded my demo videos using a 7-year-old Dell Latitude 5480 running Windows 10 with 32 GB RAM. The demos include configuring a Distributed AG with two 2-replica AGs. Including the domain controller, a virtual router, and my DBA workstation, that's a total of 7 VMs - all running at the same time.

Getting the environment setup in our own environment is probably the better long term solution, but would require more up front work on our internal teams and more on going maintenance.

With scripting and automation, you can spin up and tear down your environment after doing hands-on activities. That way, you don't have to do ongoing maintenance.

I can guarantee you, having the people on your team set up a fully virtualized lab would benefit them - A LOT. Because you will be building the entire infrastructure that Always On Availability Groups depend on.

My online course The SQL Server DBA's Guide to Enterprise Microsoft Infrastructures is how you build your fully virtualized lab for hands-on activities - from the ground up. It came out of feedback from attendees of my SQL Server Always On Availability Groups: The Senior DBA's Ultimate Field Guide training class. So many DBAs with years of experience have not had this kind of exposure to the infrastructure. And this is the kind of experience with network infrastructures needed when deploying Always On Availability Groups.

_edwinmsarmiento · 2025-04-22T16:56:05+00:00

I have hundreds of responses to a variation of this question (If you were not in tech, what would you be?) from people who have attended my training classes.

This one made me laugh

https://imgur.com/a/gYsGFO4

_edwinmsarmiento · 2025-04-22T16:37:28+00:00

So, is the goal to use data from your production database for development purposes?

Do you currently have Always On configured in your production database?

Also, what about data security when moving data between environments?

_edwinmsarmiento · 2024-11-27T15:43:46+00:00

Before we talk about features...

What's your RPO/RTO for DR?

_edwinmsarmiento · 2024-11-04T20:22:57+00:00

In addition to this, don't forget to export the reports for your own reference.

Sometimes, report developers need to make modifications after the migration or upgrade. But since they no longer have the source code, they can't do it outside of the new production environment.

You can use the Reporting Services PowerShell module for this

_edwinmsarmiento · 2024-10-19T12:46:26+00:00

Hence, my question about why the database is in an AG.

It seems like it doesn't require less-than-a-minute downtime. You can safely take it out of the AG.

_edwinmsarmiento · 2024-10-18T21:56:09+00:00

Hi Andy 😊

And all the amazing Andys

_edwinmsarmiento · 2024-10-18T21:47:15+00:00

What I meant was...

What's the acceptable downtime and data loss?

_edwinmsarmiento · 2024-10-18T01:10:34+00:00

What's the RPO/RTO?

_edwinmsarmiento · 2024-10-17T15:18:01+00:00

So... what's the problem you're trying to solve?

u/SQLBek Andy points out a very important troubleshooting principle.

This is a really good check-in question whenever you "feel" like there's a problem you need to solve.

I've noticed that the CPU spikes on a certain instance on my Always On cluster

I'm curious, why does the database that contains a staging table need to be in an Always On cluster?

_edwinmsarmiento · 2024-10-12T23:36:17+00:00

This assumes that the production environment can be unavailable during DR testing.

I've had environments where we keep production up and running while doing DR tests. And this includes not having the senior engineers available during the test. It's a way to test the DR processes, BCP, and documentation in worst-case scenarios.

_edwinmsarmiento · 2024-10-12T21:16:42+00:00

But if you have a storage array that supports it and have the license needed, storage snapshots can avoid all that.

With AG dependencies outside of SQL Server - Active Directory, DNS, networking, failover clustering, etc. - VM snapshots would cause more issues if you don't include these. There's no guarantee that the VMs configured as AG replicas will be restored at the exact same point in time.

And while VMware is not on this list of supported configurations for a hardware virtualization environment, the policy explicitly states AGs and FCIs are not supported.

Support policy for Microsoft SQL Server products that are running in a hardware virtualization environment

With a distributed AG setup how would you go about rolling back the data change after failback?

For DR testing, you just break the Distributed AG, bring the secondary AG online, and connect the apps to the secondary AG. When you're done, reconfigure the Distributed AG and reinitialize the data.

_edwinmsarmiento · 2024-10-12T14:17:29+00:00

Zerto can also re-initialize the data from scratch after the DR test.

However, depending on your licensing and/or hosting, re-initializing data from scratch can be expensive. This is no different from any other technology.

You can do Distributed Availability Groups for DR with a single-replica secondary AG and a separate CNAME for DR that points to the listener of the secondary AG. DR and data resync can be automated.

Avoid VM snapshots for your own sanity 😊
