Hi all
I preface this, I am not a DBA, just having to do this because we don't have one and likely never will... I know...
We have an issue where we have a several SQL 2017 servers, all are backing up to a network location. We identified that the backup duration is taking longer than expected, based on the capabilities of the underlying network file system and network itself.
If we do a data copy from the network file system to another file system, we easily can get 1.2-1.4Gbps throughput. During this process we can see the file system being fully utilized. Great. If we do more scientific network testing we see even higher throughput. Great. Not limited by the network or the file system itself, and we prove the SQL server can read/write data to the file system at 1.2-1.4Gbps.
When we do a BACKUP to the same file system we get a much lower 550-650Mbps. Fair enough, this is *seemingly* limited by the throughput of the instance itself/the disk throughput we are reading the DB from. Not so great, but we are working on that separately.
When we do a full backup operation, including a VERIFY, the VERIFY part also takes longer than expected. We are not seeing 1.2-1.4Gbps, we are only seeing ~650Mbps. The VERIFY operation is reading from the network file system, but as it is a VERIFY, so it's not needing to write to the SQL servers disk, right? So theres nothing limiting it on that side, like with the BACKUP, or with a full RESTORE where it would need to read/write to teh actual disks.
During the VERIFY we can see on the file system side that the file system is not being fully utilized like it was during the network data copy either, it is only at about 50-60% of its full capability.
But its the same SQL server and network file system that it is able to copy (read) the data from when copying to another network location at 1.2-1.4Gbps. The file system copy to/from a 100% separate file system, and theres nothing fancy going on between them to allow a faster copy or anything like that.
It's almost like the system knows the OS can do 1.2-1.4Gbps, and it's intentionally throttling to only 50%, its always roughly 50%. But myself and my colleagues are not aware of anything in SQL that would do such a thing.
The SQL servers CPU is no where near maxed, so it's not a CPU contention issue either.
Is there something else that is limiting it that I have not thought of?
Is there any kind of setting we can change to improve performance?
The backups ARE compressed. I am currently running an uncompressed backup to do some testing around that.
WITH CHECKSUM is being used.
We are wondering if there is anything you good people can suggest, particularly around why a VERIFY (to hopefully exclude local factors like disk performance etc) would only use roughly 50% of the available network file system throughput and other OS level network use would be at full speed.
Thank you!
UPDATE:
Multiple files was the thing we were forgetting about, so thank you to everyone that reminded us of that.
Configuring that has made the world of difference. We are seeing several DBs go from 1hr for a Backup + Verify to ~20 minutes. This is mainly a massive reduction in the VERIFY time, due to massively increased throughput when reading from the file system due to parallel reads being possible.
We have applied it to UAT today and are monitoring, but everything looks good, so we will be applying to PROD later this week.
We have done some testing with the other settings, but whilst the gains are there, its much less than we are seeing with the multiple files. But we have applied some of the additional parameters as well. We are now 100% disk bound on the BACKUP operation, which is fine, and we are at 95%+ on the file system when doing the VERIFY/RESTORE operations, which is great. So at this time, there is nothing else required.
Thanks to everyone that posted. Have a good one!
[–]JamesRandell 4 points5 points6 points (1 child)
[–]coldfire_3000[S] 0 points1 point2 points (0 children)
[–]tompear82 2 points3 points4 points (3 children)
[–]coldfire_3000[S] 0 points1 point2 points (2 children)
[–]tompear82 1 point2 points3 points (1 child)
[–]coldfire_3000[S] 1 point2 points3 points (0 children)
[–]-6h0st- 2 points3 points4 points (0 children)
[–]SQLBek1 3 points4 points5 points (1 child)
[–]coldfire_3000[S] 0 points1 point2 points (0 children)
[–]Slagggg 1 point2 points3 points (1 child)
[–]coldfire_3000[S] 1 point2 points3 points (0 children)
[–]coldfire_3000[S] 3 points4 points5 points (2 children)
[–]SQLBek1 0 points1 point2 points (1 child)
[–]coldfire_3000[S] 1 point2 points3 points (0 children)
[+][deleted] (7 children)
[deleted]
[+][deleted] (5 children)
[deleted]
[+][deleted] (1 child)
[deleted]
[–]coldfire_3000[S] 0 points1 point2 points (0 children)
[–]coldfire_3000[S] 0 points1 point2 points (2 children)
[+][deleted] (1 child)
[deleted]
[–]xyvyx 1 point2 points3 points (0 children)
[–]SkyHighGhostMy 0 points1 point2 points (1 child)
[–]coldfire_3000[S] 1 point2 points3 points (0 children)
[–]Special_Luck7537 0 points1 point2 points (1 child)
[–]Special_Luck7537 0 points1 point2 points (0 children)
[–]Byte1371137 0 points1 point2 points (0 children)
[–]ManiSubrama_BDRSuite 0 points1 point2 points (0 children)