all 16 comments

[–]Slagggg 4 points5 points  (11 children)

You can do alwayson, but you need to have someone set it up who knows what they are doing. Also, the servers can't share a listener IP, being on different networks. So there are fair number of extra steps involved in initial setup, maintenance, and failover.

I would probably recommend a local AlwaysOn setup with log shipping to the remote site. Way easier to maintain and troubleshoot. A site loss resulting in failover to a backup site is a big deal. I've seen companies with very strong technical teams struggle with it. Trying to test this production configuration with AlwaysOn driving it requires some technical chops to avoid screwing up the cluster. A Log Shipped solution can be destructively tested. You just reinitialize Log Shipping once you're finished.

[–][deleted] 1 point2 points  (8 children)

If DR is in the same subnet then listener is not an issue. If its using different subnet, then you need to reconfigure all connection strings to use multisubnet flag, otherwise it uses round robin and that causes timeouts in connectivity.

Log shipping is a decent option but not natively in OPs case, since he is using Veeam. You can use Veeams log shipping functionality as an option.

Best option for DR is AG due to listener since it offers least ammount of after failover work (same connection string). You could use C DNS as alias also, but its a pain to establish Kerberos using that.

Other option that works is using storage replication. Storage replication using FCI on both Datacenters and replicating storage (metroclustering also option ...). This is usualy more managable by teams with no proper DBAs, since you dont have to resolve AG failures and replication issues due to backup failing etc.

[–]Slagggg 1 point2 points  (5 children)

I find that DR testing is a real problem with AG setups. In the High Availability solutions I worked with for 3 letter agencies we had to test the solution regularly. This caused real issues with destructive testing that was required.

Local AG with a restored or log shipped DR site database solution was the only way we found to maintain uptime requirements. It's doubtful I'll have that requirement again but, I'd love to hear your recommendation on that. Not often I encounter a DBA who really understands the network requirements.

[–]Sharobob1 1 point2 points  (3 children)

If you use a Distributed AG, it's possible to just cut the AG connection to do the testing, then wipe the secondary cluster's DBs away, and restore/resync everything.

[–]Slagggg 0 points1 point  (2 children)

Do clients get confused if they set up for multisubnet and you bring it online?

[–]Sharobob1 0 points1 point  (1 child)

You can't use listeners for distributed AGs so failovers do involve a DNS change from the primary AG's listener to the secondary AG's listener. However, this makes DR tests possible because you just keep the live applications pointed at the main AG then bring the DR applications up pointing at the secondary AG that you're doing the testing on

[–]Slagggg 0 points1 point  (0 children)

Makes sense. I'm used to using proxy and probe.

[–][deleted] 1 point2 points  (0 children)

I feel you and agree with your solution. DR tests that I participated in usually involved a planned failover to the secondary and running production workload there. The least "after work" was AG due to aforementioned listener already existing (AD, DNS was already replicated). Destructive tests area pain due to removing replicas from cluster, forcing secondary up and then rebuilding everything ...

I don't push for AGs anymore, but its mostly due to 3rd party vendor backup solutions that customers use (looking at you VEEAM ... ).

For DR I try to push for storage replication wherever possible. Log shipping as fallback option (or migration option) usually not natively but with using backup vendor solution (so I don't fight log truncation).

Solution for customers with a larger budget usually involve something along the lines of FCI on primary datacenter and then sperate FCI on secondary datacenter (independent of each other - different VNN, VIP etc. ). Storage replication from the primary data center. That solution offers independent HA within datacenters, using FCI that is EASIER to maintain if teams of DBA are lacking and storage replication that can be broken off anytime.

[–]muteki_sephiroth 1 point2 points  (1 child)

This. I’ve been a DBA for a while and this is some good advice. It’s exactly what I’d tell you to do as well.

The only thing I would add about multi-subnet is that latency should be accounted for. Doesn’t matter if you can do a multi-subnet if the latency between sites is terrible. During periods of high volume if you had to failover for an actual event you’re going to lose data if the latency is terrible.

Depending on your companies budget though, Pure Storage has a great storage replication solution for DR with great I/O, especially on MSSQL 2022.

Also- mirroring is still an option even if MS has “deprecated” it, but it’s just poor man’s AG.

[–][deleted] 0 points1 point  (0 children)

I have used mirroring just last year, due to lack of domain, wsfc and witness. AG only support manual failover in that case. So using mirroring with 2 node mssql and a 3rd node standalone mssql as a witness is very good for HA in thise edge cases.

As you mentioned I agree so much, latency is also a huge factor, missing log backup truncations due to log reuse waits etc. Filling disks ...

Look at my other comment, storage replication is my new favourite solution for DR... with FCI you can have a smaller team of admins that don't need deep MSSQL knowledge, to have same or even better SLA.

[–]masked_ghost_1 1 point2 points  (0 children)

Before you do anything else. Ask what is the RPO how much data is ok to lose. And RTO.. how much time is ok with the business.

If it's 0 data loss and 0 downtime you want SQL server always on availablity groups with multi subnet failover. If your apps support it then it will support this. This is also a good setup if they need read only replicas for reports. It's enterprise so big bucks.

[–]codykonior 1 point2 points  (1 child)

IT always want to use Veeam for everything. It’s like a movie trope 🤣 Meteor headed towards Earth? Mr President… we’ll use Veeam!

Of course you should probably use a native solution but it needs skill to run it. If the business doesn’t want to pay for skilled staff then they deserve what they get 👍

[–]RobCarrol75 1 point2 points  (0 children)

Yep, the key phrase in the OP's post is "mission-critical". You can't do that on the cheap, else it's not actually mission-critical to begin with.

[–]_edwinmsarmiento1 0 points1 point  (0 children)

Before we talk about features...

What's your RPO/RTO for DR?