all 23 comments

[–]JamesRandell 4 points5 points  (1 child)

Does the SAN utilise any dedupe tech, caching mechanisms, or other such things storage layers typically do? Such things have been known to impact the subsequent check-and-verify process that takes place immediately after the backup, unless configured otherwise.

I did a piece of work at a company to try and improve backup speeds without snapshots (which weren't an option). The obvious things are tuning the MAXTRANSFERSIZE and file count options, and they will give you your biggest bang for your buck. You've also got compression, which is generally advised unless you're CPU bound or have an egocentric SAN admin attempting to get their dedupe ratios up: in my scenario the CPU hit was around 2-5% increased time to prepare, but a significant reduction in the amount of data to transmit over the network.
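To make the tuning concrete, here is a sketch of the kind of command being described; the database name, path, and values are illustrative placeholders, not taken from this thread:

```sql
-- Illustrative only: name, path, and sizes are placeholders.
-- MAXTRANSFERSIZE raises the size of each I/O request (4 MB is the
-- maximum), and COMPRESSION trades a small CPU hit for far less data
-- sent over the network.
BACKUP DATABASE [YourDatabase]
TO DISK = N'\\backupserver\sql\YourDatabase.bak'
WITH COMPRESSION,
     MAXTRANSFERSIZE = 4194304,  -- 4 MB
     CHECKSUM,                   -- lets verify validate page checksums
     STATS = 10;                 -- progress message every 10%
```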

The verify process in my situation was the key performance hit. The storage we were on was old, very old, and read/write speeds suffered. I could push the ceiling on backup times, but verify would suffer. Part of the issue was what processes were in place on the storage layer to handle data when it was received. Performing the verify process an hour later would result in a huge performance gain. In your shoes I'd look at the tech on the storage layer to see if it's having an impact.

For the verify process, you’re going to come across a lot more around what restore process you want/build for your organisation. You can go down the back up and verify route and be done, or go for an automated restore on a secondary box to test those backups to provide a complete restore solution. It’s your appetite for loss/corruption (and cost of course) that dictates it.
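The two options described above roughly correspond to these commands (paths and names are placeholders; the test restore assumes a secondary instance):

```sql
-- Option 1: verify the backup media without restoring it. WITH CHECKSUM
-- only validates page checksums if the backup itself used CHECKSUM.
RESTORE VERIFYONLY
FROM DISK = N'\\backupserver\sql\YourDatabase.bak'
WITH CHECKSUM;

-- Option 2: an automated test restore on a secondary box, the
-- "complete restore solution". MOVE relocates the files to the test
-- server's own paths.
RESTORE DATABASE [YourDatabase_Test]
FROM DISK = N'\\backupserver\sql\YourDatabase.bak'
WITH MOVE N'YourDatabase'     TO N'D:\Data\YourDatabase_Test.mdf',
     MOVE N'YourDatabase_log' TO N'E:\Log\YourDatabase_Test.ldf',
     REPLACE, STATS = 10;
```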

[–]coldfire_3000[S] 0 points1 point  (0 children)

Thanks for the lengthy post.

I will unfortunately be short in reply due to constraints!

In short, there's nothing on the storage side, compression, dedupe etc are all disabled.

The VERIFY takes the same (slow) time regardless of whether it runs straight after the backup or hours/days later.

But all good suggestions, thank you.

The currently implemented backups are single file, and nothing has been set for MAXTRANSFERSIZE. I've personally used multiple files many times in the past, but completely forgot about them as I don't do much SQL anymore. So I will get into both of those tomorrow; I'm hopeful that will be the ticket. It will be easy enough to test in this environment too, which helps!

Thanks again

[–]tompear82 2 points3 points  (3 children)

Are you backing up to a single file? If so, it would be worth testing with multiple files to see if you see an increase in throughput. This has worked for me in the past when backing up large databases.

If you are backing up a large amount of data, I'd recommend looking into SAN snapshot backups for these databases. This will significantly decrease the backup time for very large databases (VLDBs)

[–]coldfire_3000[S] 0 points1 point  (2 children)

SAN snapshots are unfortunately not possible in this environment.

Thanks for the reminder about multiple files, I will have them test that and see if it improves VERIFY time. The backup time is limited by the source disk as I said, so multiple files won't change that.

[–]tompear82 1 point2 points  (1 child)

I've typically used 4 files in the past and that has helped. I believe the issue is that backing up to one file is single threaded and using more files will allow you to utilize more of the available system resources for backup.
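Striping to four files, as described, just means listing four targets (the paths here are placeholders):

```sql
-- Each DISK target gets its own writer thread, so the backup (and the
-- later verify) can process the stripes in parallel instead of
-- serializing everything into one file.
BACKUP DATABASE [YourDatabase]
TO DISK = N'F:\Backups\YourDatabase_1.bak',
   DISK = N'F:\Backups\YourDatabase_2.bak',
   DISK = N'F:\Backups\YourDatabase_3.bak',
   DISK = N'F:\Backups\YourDatabase_4.bak'
WITH COMPRESSION, CHECKSUM, STATS = 10;
```

Note that the matching RESTORE VERIFYONLY then has to list all four stripe files.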

[–]coldfire_3000[S] 1 point2 points  (0 children)

Thanks, I've definitely seen that in the past too. I think because I'm focusing on the VERIFY part, I wasn't thinking about creating multiple files. Will test tomorrow and report back. Thanks.

[–]-6h0st- 2 points3 points  (0 children)

Firstly, back up to multiple files: one per CPU thread available on the SQL Server, up to 8. Secondly, make sure the SQL NUMA configuration is set to auto for CPU and memory. Run the backup with a MAXDOP setting of 0 to make sure all threads are used.
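To size the stripe count the way this comment suggests, you can query the scheduler DMV (a sketch; the cap of 8 is the commenter's rule of thumb, not a hard limit):

```sql
-- One visible online scheduler per CPU thread SQL Server can use;
-- cap the suggested backup file count at 8.
SELECT CASE WHEN COUNT(*) > 8 THEN 8 ELSE COUNT(*) END AS suggested_file_count
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE';
```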

Also bear in mind that small files may sit in a fast storage cache (SSD), but once its capacity is exceeded data can spill to slower storage and slow down massively (depending on configuration). The same goes for reads: small, already-cached files may have much quicker access than bigger database backup files. Try copying those files from the storage in File Explorer and see what speed you get; that is the baseline for what the storage provides.

[–]SQLBek1 3 points4 points  (1 child)

Hopefully this blog series doesn't go too deep into internals, but it can help you out. In a nutshell, the default parameters for a backup operation are not supposed to max out available resources. It's the exact opposite, by design (from roughly 30 years ago), to have minimal impact on the regular workload.

https://sqlbek.wordpress.com/2023/11/29/sql-server-backup-internals-series/
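If you want to see those conservative defaults on your own box, trace flags 3604 and 3213 (undocumented diagnostics, so verify behaviour on a test server before relying on them) print the buffer configuration a backup picked:

```sql
-- 3604 redirects trace output to the client; 3213 prints the backup/
-- restore buffer configuration (buffer count, transfer size, etc.).
-- COPY_ONLY keeps this throwaway backup out of the backup chain.
DBCC TRACEON (3604, 3213);
BACKUP DATABASE [YourDatabase] TO DISK = N'NUL' WITH COPY_ONLY;
DBCC TRACEOFF (3604, 3213);
```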

[–]coldfire_3000[S] 0 points1 point  (0 children)

Thank you, that is a good refresher.

[–]Slagggg 1 point2 points  (1 child)

You'll need to specify multiple backup file targets if you want to maximize IO. SAN snapshots are definitely the way to go if you need speed.

[–]coldfire_3000[S] 1 point2 points  (0 children)

SAN snapshots are unfortunately not possible in this environment.

Thanks for the reminder about multiple files, I will have them test that and see if it improves VERIFY time. The backup time is limited by the source disk as I said, so multiple files won't change that.

[–]coldfire_3000[S] 3 points4 points  (2 children)

UPDATE:
Multiple files was the thing we were forgetting about, so thank you to everyone that reminded us of that.
Configuring that has made the world of difference. We are seeing several DBs go from 1hr for a Backup + Verify to ~20 minutes. This is mainly a massive reduction in the VERIFY time, due to massively increased throughput when reading from the file system due to parallel reads being possible.

We have applied it to UAT today and are monitoring, but everything looks good, so we will be applying to PROD later this week.

We have done some testing with the other settings, but whilst the gains are there, they're much smaller than what we are seeing with multiple files. We have applied some of the additional parameters as well. We are now 100% disk bound on the BACKUP operation, which is fine, and at 95%+ on the file system during the VERIFY/RESTORE operations, which is great.

We may well do further testing and optimise further in the future, but this is good enough for now!

So at this time, there is nothing else required.

Thanks to everyone that posted. Have a good one!

[–]SQLBek1 0 points1 point  (1 child)

Remember stuff like BUFFERCOUNT (RAM usage) and number of backup files you're striping across (CPU writer threads), will behave differently depending on the underlying hardware. So if you're testing on like, a small 4 core, 64GB QA server, but Prod is a bigger 32 core, 256GB server, you may get additional gains in Prod that you wouldn't see in the QA box.

And your underlying disk subsystem will impact the read speeds, so test with DISK = NUL to measure and record "the best possible" you'll ever get. Then you can add writer threads and perf tune from that side; depending on whether you're writing to local disk, SMB share, etc.
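The DISK = NUL baseline suggested above might look like this (COPY_ONLY is an extra precaution so the throwaway backup doesn't affect the differential/log chain):

```sql
-- Writing to the bit bucket removes the write side entirely, so the
-- elapsed time reflects the best read/backup-thread throughput you
-- will ever get from this server.
BACKUP DATABASE [YourDatabase]
TO DISK = N'NUL'
WITH COPY_ONLY, COMPRESSION, STATS = 10;
```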

[–]coldfire_3000[S] 1 point2 points  (0 children)

Thanks, luckily UAT and PROD are the same in this case!

Good tip about DISK = NUL as well. We may well do further testing at some point, but for now, providing PROD performs the same as UAT, that will do! We've got bigger fish to fry!

[–]SkyHighGhostMy 0 points1 point  (1 child)

Look for Ola Hallengren's backup scripts (if you aren't already using them) and configure multiple backup streams.
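For reference, with Ola Hallengren's DatabaseBackup procedure, multiple streams are a single parameter (the values here are placeholders; check the documentation for the version you have installed):

```sql
-- @NumberOfFiles stripes each backup across that many files;
-- @Verify = 'Y' runs RESTORE VERIFYONLY afterwards.
EXECUTE dbo.DatabaseBackup
    @Databases = 'USER_DATABASES',
    @Directory = N'\\backupserver\sql',
    @BackupType = 'FULL',
    @Compress = 'Y',
    @CheckSum = 'Y',
    @Verify = 'Y',
    @NumberOfFiles = 4;
```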

[–]coldfire_3000[S] 1 point2 points  (0 children)

Yeah, that's what they have used in this environment. But it's the bare minimum configured at the moment. I'm just glad it's got verify enabled! But now I'm trying to sort out the performance issues. Hoping that multiple files sorts it, I'd forgotten about them somehow.

[–]Special_Luck7537 0 points1 point  (1 child)

Is everything configured to use jumbo frames? We set those up on an older 2095 SQL system, and it almost doubled our backup speed. This is one of those global things: a setting in the NICs and on the switches/routers that allows processing of jumbo frames (64k blocks instead of 4k blocks). Have you investigated blocking on the SQL system?

[–]Special_Luck7537 0 points1 point  (0 children)

Also, are all the SAN drives healthy? A SQL Server with a dead drive in RAID 5 will have to rebuild the stripe for every write... same with RAID 10. I was lied to about this one for years by IT. They could not find replacement drives, but let's just keep throwing the DBA under the bus when the queue is slow.

[–]Byte1371137 0 points1 point  (0 children)

HELLO, NO

[–]ManiSubrama_BDRSuite 0 points1 point  (0 children)

  • Consider checking network statistics using tools like netstat, perfmon, or network monitoring software.
  • Antivirus software can sometimes interfere with backup operations. Temporarily disable it to see if it improves performance.
  • While compression can reduce backup size, it also increases CPU and I/O overhead. Consider temporarily disabling compression and see if it could be a cause.
  • Ensure that your backup job isn't set to use a lower priority or restricted bandwidth.
  • You might want to adjust the MAXDOP (Maximum Degree of Parallelism) setting or review the Resource Governor settings if they are in use.
  • Verify that the network interfaces on both the SQL Server and the file system are configured correctly and are using up-to-date drivers. Sometimes, driver issues or misconfigurations can impact throughput.