5 reasons restores can take longer than backups
It comes as a big surprise to many people when their restores are slower than their backups, but it should be no surprise at all. In fact, everyone should plan for this disparity and build it into their backup design.
There are a number of reasons why restore speeds are typically slower than backup, and here is an explanation of five of them.
The RAID write penalty
Most modern disk arrays are built using parity-based redundant array of independent disks (RAID), meaning RAID levels 3 to 6. Others are built using erasure coding, which poses a challenge similar to that of parity-based RAID.
Parity-based RAID requires the calculation of the parity information when writing data to an array. This calculation does not happen when reading data from that same array, thus allowing for much faster reads than writes. The effect of the write penalty on performance can be minor or it can be major depending on the RAID level and/or the settings used in erasure coding. But all such arrays will have some write penalty, and you need to find out what yours is.
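As a rough illustration of why the RAID level matters, the sketch below uses the commonly cited rule-of-thumb penalties for small random writes: 2 back-end I/Os per host write for mirroring, 4 for single parity such as RAID 5, and 6 for dual parity such as RAID 6. These figures, the disk count, and the per-disk IOPS are assumptions for illustration, not measurements of any particular array; full-stripe sequential writes and controller caches can reduce the penalty considerably.

```python
# Rule-of-thumb back-end I/Os per small random host write (assumed values).
WRITE_PENALTY = {"raid10": 2, "raid5": 4, "raid6": 6}


def effective_read_iops(disks, iops_per_disk):
    """Reads need no parity calculation, so every disk contributes fully."""
    return disks * iops_per_disk


def effective_write_iops(disks, iops_per_disk, level):
    """Estimate host-visible random write IOPS after the write penalty."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


if __name__ == "__main__":
    disks, per_disk = 8, 150  # hypothetical array
    print("reads:", effective_read_iops(disks, per_disk))
    for level in WRITE_PENALTY:
        print(level, "writes:", effective_write_iops(disks, per_disk, level))
```

Under these assumptions, the same eight disks that serve 1,200 read IOPS deliver only 300 write IOPS with single parity and 200 with dual parity, which is the gap a restore runs into.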
Copy-on-write snapshots
Arrays and NAS filers that use copy-on-write snapshots impose a penalty similar to the RAID write penalty. Creating a copy-on-write snapshot simply puts a stake in the ground as a reference point. Almost no I/O occurs when the snapshot is initially created; all the hard work happens afterwards. When a write attempts to overwrite a block that must be preserved for a snapshot, that block is first copied to a snapshot area, and only then is the write allowed to proceed. (This is why it’s called copy-on-write.)
Like the RAID write penalty, this cost is incurred only on writes. The snapshot penalty can also be severe, because it depends on the number of snapshots being kept on that particular volume. More snapshots increase the chances that an individual block will need to be copied before a write can proceed; therefore, the more snapshots you have on a copy-on-write volume, the worse the performance is when writing new data.
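The copy-before-overwrite step can be sketched with a toy model. The class below is hypothetical and heavily simplified: it tracks only the most recent snapshot, counts one I/O per block read or written, and does not model how the penalty scales with the number of snapshots, which depends on the implementation.

```python
class CowVolume:
    """Toy copy-on-write volume that counts back-end I/Os (illustrative only)."""

    def __init__(self, num_blocks):
        self.blocks = [b""] * num_blocks
        self.snapshot_area = None  # block index -> preserved old contents
        self.io_count = 0

    def snapshot(self):
        # Creating the snapshot only records a reference point; nothing is copied.
        self.snapshot_area = {}

    def write(self, index, data):
        if self.snapshot_area is not None and index not in self.snapshot_area:
            # First overwrite since the snapshot: read the old block and
            # write it to the snapshot area before the new write proceeds.
            self.snapshot_area[index] = self.blocks[index]
            self.io_count += 2
        self.blocks[index] = data
        self.io_count += 1


if __name__ == "__main__":
    vol = CowVolume(4)
    vol.write(0, b"old")       # no snapshot yet: 1 I/O
    vol.snapshot()             # no I/O
    vol.write(0, b"new")       # copy-on-write: 3 I/Os
    print(vol.io_count)
```

In this model a restore that overwrites every block of a snapshotted volume performs three I/Os per block instead of one.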
Writing into a file system
The next write penalty comes when writing into a file system, especially a dense one with millions of files. When you restore a file, the file system must first create the file that will receive the data. Creating that file is a separate operation that takes time regardless of the file’s size. If there are millions of files to restore, creating the files can take longer than writing their data.
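One way to see the per-file cost on your own system is a small timing comparison like the sketch below, which writes the same number of bytes once as many small files and once as a single large file. The function names, file counts, and sizes are arbitrary choices for illustration, and the results will vary widely with the file system, caching, and storage underneath.

```python
import os
import tempfile
import time


def time_small_files(directory, count, size):
    """Write `count` files of `size` bytes each; return elapsed seconds."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        with open(os.path.join(directory, f"f{i:06d}"), "wb") as fh:
            fh.write(payload)
    return time.perf_counter() - start


def time_one_file(directory, total_size):
    """Write a single file of `total_size` bytes; return elapsed seconds."""
    payload = b"x" * total_size
    start = time.perf_counter()
    with open(os.path.join(directory, "big"), "wb") as fh:
        fh.write(payload)
    return time.perf_counter() - start


if __name__ == "__main__":
    count, size = 2000, 1024  # same total bytes in both tests
    with tempfile.TemporaryDirectory() as d:
        many = time_small_files(d, count, size)
        one = time_one_file(d, count * size)
    print(f"{count} small files: {many:.3f}s, one large file: {one:.3f}s")
```

Running this against the kind of file server you expect to restore gives a first estimate of how much of a restore is file creation rather than data transfer.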
Overburdened transaction logs
Relational databases have transaction logs that keep track of all changes to the database. In most database designs, how quickly the database can record transactions in its log is not something anyone has to think about. However, a large restore may generate many more transactions per second than a typical workday does, putting a much higher load on the transaction logs than they normally see. Therefore, transaction logs can also slow down restores.
Multiplexing backup streams
The final reason restores can be slower than backups is the necessary evil that is multiplexing. The good news is that this particular penalty applies only if you are restoring directly from tape. If your backup system is based on disk, this problem won’t arise. This problem is actually the primary reason why many people moved away from tape in the last two decades.
To understand this issue, consider the main problem with tape drives: they are much faster than they need to be. Modern streaming tape drives are 10 to 20 times faster than a typical incremental backup. To address this mismatch, the industry created multiplexing, which interleaves multiple backup streams into a single stream that is fast enough to keep the tape drive happy. When multiplexing was created 20 years ago, most people in the field felt there was no other choice, as they had to make the tape drive happy in order to conduct successful backups. However, there is a significant penalty when it comes to restore.
If you are restoring from a multiplexed tape, the backup software must read the entire tape and throw away all the streams except the one you need. If your multiplexing setting is 10, your tape drive has to read all 10 streams and throw away nine of them. This has a significant impact on restore speed. If you combine it with one of the write penalties above, it can make things even worse. If the disk drive is unable to write data as fast as the tape drive can read it, the tape drive will have to stop and start so the disk drive can keep up.
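The read-everything, keep-one-stream behavior can be sketched with a toy model. The functions and client names below are hypothetical and do not reflect any backup product's tape format; the tape is simply a list of (stream name, block) pairs.

```python
def multiplex(streams):
    """Interleave blocks from several backup streams into one tape image.

    `streams` maps a stream name to its list of blocks.
    """
    tape = []
    longest = max(len(blocks) for blocks in streams.values())
    for i in range(longest):
        for name, blocks in streams.items():
            if i < len(blocks):
                tape.append((name, blocks[i]))
    return tape


def restore(tape, wanted):
    """Read the whole tape, keeping only blocks from the wanted stream."""
    kept = []
    blocks_read = 0
    for name, block in tape:
        blocks_read += 1  # the drive reads every block, wanted or not
        if name == wanted:
            kept.append(block)
    return kept, blocks_read


if __name__ == "__main__":
    # Ten clients multiplexed together, 100 blocks each (made-up sizes).
    streams = {f"client{i}": [bytes([i])] * 100 for i in range(10)}
    tape = multiplex(streams)
    kept, blocks_read = restore(tape, "client3")
    print(f"read {blocks_read} blocks to restore {len(kept)}")
```

With a multiplexing setting of 10 and equal-sized streams, the drive reads 1,000 blocks to deliver 100, so the useful restore rate is roughly one-tenth of the drive's read rate.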
Assess restore delays and set expectations
It’s important to find out what restore-speed penalty your environment has and then build that into your backup design. Perform test restores of each of your different types of data on each type of system that you would restore them to. This includes every type of RAID that you’re using in your data center, every large file server, and so on. Measure the speed of each restore and then ask your vendor if there’s anything you can do to improve it.
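Once you have timed a test restore, turning it into an estimate for a full restore is simple arithmetic. The helper names and numbers below are made up for illustration, and the sketch assumes throughput scales linearly from the test to the full data set, which may not hold for millions of small files or a busy array.

```python
def throughput_mb_per_s(bytes_moved, seconds):
    """Throughput of a timed test restore, in decimal megabytes per second."""
    return bytes_moved / 1_000_000 / seconds


def estimate_restore_hours(dataset_bytes, measured_mb_per_s):
    """Extrapolate linearly from a measured rate to a full restore."""
    return dataset_bytes / 1_000_000 / measured_mb_per_s / 3600


if __name__ == "__main__":
    # Hypothetical test: 50 GB restored in 500 seconds.
    rate = throughput_mb_per_s(50e9, 500)
    # Hypothetical file server holding 10 TB.
    hours = estimate_restore_hours(10e12, rate)
    print(f"{rate:.0f} MB/s -> about {hours:.1f} hours for 10 TB")
```

In this example, a test restore that runs at 100 MB/s implies roughly 27.8 hours to restore 10 TB, which is the kind of number to bring to the expectations meeting.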
If there isn’t anything, then accurately set expectations of what would happen during a large restore. Have a meeting to discuss how long it will take to restore your important file server, and explain to those whom it affects why this is the case. Your vendor can help explain if there is nothing that can be done, and you can either accept that or look into a completely different backup technology.
The important thing is to do all of this well in advance of needing to restore anything. This is not the kind of thing that you want to be finding out at 10 p.m. on a Friday. Do as much restore testing as you can do now, to see the degree to which your restores are slower than your backups, and adjust your design and expectations accordingly.
Copyright © 2022 IDG Communications, Inc.