Enterprise Backup/Restore from A to Z
In the second of a two-part series, we examine performance, scalability, management, and disaster recovery.
By Allan Scherr
As databases get larger, the performance of backup and restore operations becomes more critical. It is useful to examine the basic performance characteristics of backup/restore as a way of seeing the limitations of various configurations on ultimate performance and scalability.
Backup and restore are throughput-oriented operations, which typically strain the bandwidth of system components. In contrast, data-oriented applications are usually interactive and response-time oriented; data accesses are essentially random and can be accelerated with caching. Backup/restore disk and tape operations are generally sequential.
Running backup/restore operations on system components designed for interactive applications can lead to serious performance issues. For instance, on-line backup operations may consume too much bandwidth for the system to respond to application requirements. And system components that are perfectly adequate for application requirements may not be able to handle the demands of backup and restore operations.
Local and network backup/restore also has its disadvantages. For one, disparate backup and application needs can lead to impractical system configuration requirements. Compare the performance characteristics of backup to those of typical applications such as on-line transaction processing, data warehousing, and other interactive software (see table). Problems arise because backup activity can load the application host's internal I/O paths so heavily that application activity is significantly slowed. The same type of interference can also occur on the network.
Because of the above problems, systems and networks that are perfectly adequate for the execution of typical applications are sometimes unsuitable for running backup operations. The direct mode of data movement described in the first part of this article (see InfoStor, August 1998, pp. 38-44) resolves these problems by relegating backup activity to storage backup subsystems that have been optimized for backup operations.
With the high-speed capabilities made possible by direct backup techniques, many administrators are now focusing on the performance of restore and restart processes. If disk mirrors are used for backup, the restore process is comparatively instantaneous. However, if the mirror is remote from the application host, data restoration may take considerably more time.
It is possible to re-connect a remote mirror to the storage local to the application host and use the mirror both to restore the disks in the local storage and to supply requested data areas directly to applications. This way, the application can be restarted immediately and run, more slowly than normal, while the remote mirror is used to refresh the local data store.
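The restore-while-running scheme above can be sketched as a read-through copy: blocks the application touches are restored on demand from the remote mirror, while a background pass restores everything else. This is a minimal illustrative sketch with hypothetical names, not any vendor's actual implementation.

```python
# Illustrative sketch: serving application reads while a remote mirror
# refreshes the local store. All names here are hypothetical.
class ReadThroughRestore:
    def __init__(self, remote_mirror):
        self.remote = remote_mirror      # block_id -> data, the surviving copy
        self.local = {}                  # blocks restored to local storage so far

    def read(self, block_id):
        """Serve the application: use the local copy if already restored,
        otherwise fetch from the remote mirror and restore it on the way."""
        if block_id not in self.local:
            self.local[block_id] = self.remote[block_id]  # restore on demand
        return self.local[block_id]

    def background_restore(self):
        """Sequentially copy any blocks the application has not yet touched."""
        for block_id, data in self.remote.items():
            self.local.setdefault(block_id, data)
```

Reads are slower than normal only until the block in question has been copied, which matches the article's point that the application runs degraded rather than remaining down.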
Once the data is restored, the application must be restarted. If a fuzzy backup was performed, the application must do special processing using logs or journals to bring the data into a defined state.
Special interfaces between the backup and recovery facility and the application and/or database manager are typically created to effect this processing. Once a defined state is achieved, the application can be restarted.
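The log-based processing described above can be illustrated with a toy redo pass: updates journaled since the fuzzy backup began are reapplied, but only for transactions known to have committed, leaving the data in a defined state. This is a simplified sketch under assumed data shapes, not a real database recovery interface.

```python
# Hypothetical sketch: bringing a fuzzy backup image to a defined state
# by replaying a redo log and discarding uncommitted work.
def recover(fuzzy_image, redo_log, committed):
    """fuzzy_image: dict of key -> value as captured mid-flight.
    redo_log: list of (txn_id, key, value) updates made since the backup began.
    committed: set of transaction ids known to have committed."""
    state = dict(fuzzy_image)
    for txn_id, key, value in redo_log:
        if txn_id in committed:          # reapply only committed updates
            state[key] = value
    return state
```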
Given the size and complexity of today's databases, the management of backup and recovery requires substantial attention. Issues include scheduling, dealing with volumes of tape cartridges, and handling exceptions.
The process of backing up data is generally the most data-intensive operation conducted in a data center, further stressing systems already straining to keep up with normal loads. And in large installations, the number of application hosts and database elements that must be individually backed up may run into the hundreds. Thus, the ability to schedule backups is very important. Some backups are more critical than others and must be treated accordingly, particularly if the backup window is tight. Backup frequencies may also differ from one application to another. An adequate scheduling system must be able to deal with these issues and be able to reschedule failed backups.
The need to back up terabyte-size databases will mean hundreds or even thousands of tapes to keep track of. And keeping track of the location of all these tapes and recycling expired media can prove to be an arduous administrative task. Other aspects of managing tape media include retiring old tapes and using special tape cartridges to clean tape read/write heads after a predetermined number of uses.
Another administrative problem is ensuring that all relevant data is actually being backed up. Typically, backups are administered by different people than the databases and application data, so a facility that automatically discovers extensions or changes to existing data is useful.
Other management capabilities include monitoring and reporting successful and failed backups, equipment status, media availability, performance, and resource utilization. A typical large enterprise will back up hundreds, if not thousands, of items every day. Since backups are often done overnight, a summary report is helpful each morning. Another useful feature is to have a backup system that automatically e-mails or pages the system administrator in the event of failures of certain types or frequencies.
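The failure-driven alerting described above can be sketched as a per-type threshold: rather than paging the administrator on every failed backup, the system fires once when failures of a given type exceed a tolerated count. The names and thresholds here are hypothetical.

```python
from collections import Counter

# Hypothetical sketch: notify the administrator when failures of a given
# type exceed a per-night threshold, rather than on every failure.
class FailureAlerter:
    def __init__(self, thresholds, notify):
        self.thresholds = thresholds     # failure type -> max tolerated count
        self.notify = notify             # e.g. an e-mail or pager hook
        self.counts = Counter()

    def record(self, failure_type):
        self.counts[failure_type] += 1
        limit = self.thresholds.get(failure_type, 0)
        if self.counts[failure_type] == limit + 1:   # fire once per threshold
            self.notify(failure_type, self.counts[failure_type])
```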
Integrating all of these features into a graphical user interface is desirable. For instance, all of the objects involved in backup processing (i.e., application hosts, databases, backup servers, tape libraries) could be shown as connected icons; color could be used to depict status. Other useful information includes start and estimated or actual completion times, throughput, and size (number of bytes) of backup.
Reports that summarize this data over selected periods are also useful. Trend analysis can help determine such things as additional media or bandwidth requirements. Ideally, it should be possible to formulate various queries of historical backup statistical data.
Disaster recovery strategies are designed to protect businesses against the catastrophic destruction of their computing facilities. The usual approach is to create duplicate facilities at remote sites. Sometimes the primary systems are duplicated; in other cases, smaller systems may be set up to execute only the most critical business applications. The crux of recovering from disasters is the ability to access the backup copies of crucial data and to restore these backups to backup systems that will take over the operation of critical applications.
When executing a disaster recovery plan, data is restored to a backup system. Because this system typically has a different configuration than the original, the recovery process is more complex than a simple restoration to the original system, and elaborate coordination with applications is necessary.
Backing Up the Backup
Simplistic backup approaches often lead to unnecessary or ineffective measures. For instance, some installations have a "no single point of failure" policy for high-availability systems. Applying this policy to the disk subsystem leads to the conclusion that it is necessary to use duplicate, mirrored disks. However, if you were to apply the policy to the individual disks again, you would wind up mirroring the mirrors, etc.
Clearly, there are elements of a backup system that should be protected from a single point of failure. However, it should always be kept in mind that ultimately a set of tapes representing a backup is essentially a mirror of the data.
To maximize data availability and minimize the possibility of lost data, it is necessary to look at the overall installation. Probabilities should be assigned to the various types of failures that can occur. Then, all of the protection mechanisms, each with its own particular time to effect recovery and its own probabilities for failure, can be added to the equation to determine the overall expected data availability and the expected downtime when there is a failure. In this way, the most effective places to apply additional redundancy and/or increased performance resources for recovery can be determined.
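The expected-downtime calculation described above can be illustrated with a back-of-the-envelope model: each failure mode has a yearly probability and a recovery time, and each protection mechanism can itself fail, forcing a slower fallback such as a tape restore. The figures and structure are purely illustrative assumptions.

```python
# Back-of-the-envelope sketch of the expected-downtime calculation:
# each failure mode contributes its probability times the recovery time,
# weighted by whether the primary protection mechanism itself works.
def expected_downtime(failure_modes):
    """failure_modes: list of (p_failure_per_year, recovery_hours,
    p_mechanism_fails, fallback_hours). Returns expected hours lost per year."""
    total = 0.0
    for p_fail, recover, p_mech, fallback in failure_modes:
        # with probability p_mech the primary mechanism also fails and
        # recovery takes the slower path (e.g. restore from tape)
        total += p_fail * ((1 - p_mech) * recover + p_mech * fallback)
    return total
```

Comparing the totals before and after a proposed change (say, adding a second mirror, or faster tape drives) shows where additional redundancy or recovery bandwidth buys the most availability.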
Virtually all of the techniques described in this article and in the first part (InfoStor, August 1998) are in use in real environments, often in combination, to optimize the availability of data. Clearly, as both database sizes and the number of operationally oriented applications grow, more and more of these techniques will find their way into enterprise backup strategies.
Dr. Allan Scherr is senior vice president, software engineering, at EMC Corp., in Hopkinton, MA.