Enterprise Backup/Restore from A to Z
In the first of a two-part series, we examine the advantages and disadvantages of various backup techniques.
By Allan L. Scherr
There is probably no one reading this article who has not had first-hand experience with losing data stored in a computer. If you were one of the lucky ones, the experience had no long-lasting consequences. But for some companies, both large and small, lost data has put them out of business for hours, days, and in some cases permanently.
The operational importance of data in business has grown substantially over the years as computer technology has proliferated throughout the enterprise, enabling companies to make more money, faster. There are new database-intensive applications for on-line transaction processing, decision support, data warehousing, and data mining and analysis. Important collaborative workgroup activities, such as email and Lotus Notes, are used widely throughout the enterprise.
In parallel, the amount of data involved is growing exponentially. Moreover, operational data in many organizations is used on a 24x365 basis.
Critical data include sales orders and analysis, accounts receivable and payable, general ledger, payroll and human resources, customer information, manufacturing inventory, and product design data. These factors in combination--that is, network computing, a variety of computing platforms, very large databases, and mission-critical operations--make data protection both more complex and more crucial.
Three factors motivate the use of backup and recovery tools:
- Protection from physical failures (hardware failures, typically in the form of a disk/head/media failure, unrecoverable hardware errors, and physical disasters ranging from power surges to earthquakes, floods, sabotage, etc.). In the event of a physical failure, data becomes immediately inaccessible and must be recreated on new or different hardware. The most desirable outcome is the ability to immediately recreate the data.
- Protection from logical failures caused by errors or defects in application programs or underlying software, data entry errors, etc. Generally, these failures are detected well after the fact, even days later. The media containing the data remains operational; it is possible that only a fraction of the data may be incorrect.
The most desirable outcome is to be able to immediately return the data to its previous state, that is, immediately before the error. Because significant amounts of correct work are often done between the time the logical error occurs and when the error is detected, the ideal solution instantaneously re-processes this work against the correct data content.
- Storage and retrieval of archived data. Applications include responding to audits, re-creating an inadvertently deleted file, gathering historical data for a new data warehouse application, and retrieving records associated with a particular project, customer, user, etc.
Tools and Techniques
In essence, all backup techniques involve creating a copy of the data to be protected. In the simplest case, a backup copy is made on the same disk that stores the original file or database. The protection afforded by this approach is limited, but certainly not insignificant.
If the copy is made to another disk device, protection is increased. Obviously, data is further protected if the copy is made to a disk in another system or at another site. There are two difficulties with simple copies:
- The copy is made at a point in time. Therefore, recovery can return the data only to that point in time, not to any other moment.
- Programs that modify the original data must be inactive. Copying a file or database that is actively being changed can produce a copy that cannot be used because it may be in an undefined state.
One technique used to overcome the point-in-time limitation of simple copies is to create a dynamic mirror of the source data, or propagate changes to the copy. Propagation can be synchronous (done in parallel with changes made to the original) or asynchronous (after changes are made to the original).
RAID-1 configurations provide synchronous mirroring with two physically identical copies of the data, while RAID-5 provides equivalent synchronous protection through parity, such that the contents of a failed disk must be derived by processing all of the remaining disks in the RAID set. Remote mirroring, either synchronous or asynchronous, is increasingly being used as a strategy to protect against physical disasters.
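The RAID-5 derivation described above can be sketched in a few lines. This is a minimal illustration, not a real RAID implementation: the stripe uses three hypothetical data blocks plus a parity block computed as their byte-wise exclusive-OR, and a failed disk's block is recovered by XOR-ing all the surviving blocks.

```python
from functools import reduce

def parity(blocks):
    """XOR the corresponding bytes of several equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

# A hypothetical 4-disk RAID-5 stripe: three data blocks and their parity.
d0, d1, d2 = b"sales ok", b"ledgerAB", b"payroll9"
p = parity([d0, d1, d2])

# If the disk holding d1 fails, its contents are derived by
# processing all of the remaining disks in the set.
recovered = parity([d0, d2, p])
assert recovered == d1
```

The same property is what makes RAID-5 cheaper than RAID-1 in capacity (one parity disk protects many data disks) but slower to reconstruct, since every surviving disk must be read.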
The term backup is usually used to denote a point-in-time copy of data made to tape. Thus, data is backed up to portable media that is less expensive than disk and whose performance characteristics are appropriate to bulk data transfer operations.
In the real world of complex, operational applications, data is not a simple monolithic entity. Application software makes use of files, file systems, disk partitions, and physical disk extents in a variety of ways. Moreover, to capture a consistent set of data that can be restored so that the application software can restart correctly often requires dealing with a collection of these items all copied at the same point in time.
Making copies of the entire collection of data every time a new set of backups is required is often unnecessary. Rather, an incremental backup is required--that is, only the elements that have changed since the last backup need to be copied. If a full backup exists, the equivalent of a more recent one can be synthesized by overlaying the incremental backups.
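The synthesis of a recent full backup from an older one plus incrementals can be sketched as follows. The file names and "contents" here are hypothetical stand-ins; each incremental is modeled as a map of only the elements changed since the previous backup, overlaid oldest-first.

```python
def synthesize(full_backup, incrementals):
    """Overlay incremental backups (oldest first) on a full backup to
    produce the equivalent of a more recent full backup."""
    synthesized = dict(full_backup)      # copy; the original full backup is untouched
    for increment in incrementals:       # each maps changed element -> new contents
        synthesized.update(increment)
    return synthesized

full = {"orders.db": "v1", "ledger.db": "v1", "payroll.db": "v1"}
monday = {"orders.db": "v2"}
tuesday = {"orders.db": "v3", "ledger.db": "v2"}

latest = synthesize(full, [monday, tuesday])
# latest now reflects the state as of Tuesday's incremental,
# without ever having copied payroll.db a second time.
```

Note that the unchanged element rides along from the full backup; only the changed elements were ever re-copied.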
At the extreme end of the spectrum is continuous backup and recovery. Imagine a journal in which changes to a set of data are recorded with time stamps. With such a journal, the current version of the data could be processed back to any instant in time. In this way, the effect of a logical error can be undone. If the current data is not available because of a physical error, a full backup can be processed with updates from the continuous backup journal, creating a new up-to-date version of the data. For this to work, however, at least one full backup must have been performed sometime in the past. The decision about the number of full backups done in combination with continuous backups involves a performance tradeoff: the time to do the full backups versus the time saved by starting with a more recent full backup.
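The journal mechanism above can be sketched as a roll-forward over time-stamped entries. The timestamps, keys, and values are invented for illustration; the point is that choosing the stopping instant lets recovery halt just before a logical error.

```python
def restore_to(baseline, journal, as_of):
    """Roll a full backup forward through a time-stamped change journal,
    stopping at the requested instant (e.g. just before a logical error)."""
    state = dict(baseline)
    for timestamp, key, value in journal:   # entries are in time order
        if timestamp > as_of:
            break
        state[key] = value
    return state

baseline = {"balance": 100}                 # the full backup taken at t=0
journal = [
    (1, "balance", 150),
    (2, "balance", 175),
    (3, "balance", -999),                   # t=3: the logical error
]

# Recover the state as it stood immediately before the error.
good = restore_to(baseline, journal, as_of=2)
assert good == {"balance": 175}
```

Replaying from a more recent full backup simply shortens the journal segment that must be processed, which is exactly the performance tradeoff described above.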
While continuous backup journals afford a great deal of flexibility, the technique has some serious performance drawbacks.
For example, if all the data has to be fully recreated after a physical disaster, the time to restore a full backup and then process changes against it to bring the data up-to-date can be substantial.
On the other hand, the fastest method for undoing a recent logical error is usually to restore only those areas of the data that changed in the time interval since the error. To be fully functional, incremental backup must be combined with either mirrors or full backups. Moreover, incremental backups are generally not used for archival purposes because full backups are generally more efficient in both retrieval time and data storage usage.
On-line and Off-line Backups
Once the backup mode is chosen, the backup window, or the length of time the data is inaccessible due to the backup, needs to be addressed. In the past, backups were typically performed at night or on the weekend. That is no longer the case. More and more applications are approaching full 24-hour operation, every day of the year, which means the backup window must be as small as possible.
There are two ways to minimize the backup window. One way is to bring the application down for the duration of the backup and do the backup so quickly that the time taken from the application is negligible. Bringing the application down or "quiescing" it is generally required to ensure that the application data is in a defined state.
Another way is to have an extra mirror that is disconnected after the application is stopped. Then the application is quickly restarted. This so-called coffee-break window essentially spins off a point-in-time copy of the data that can either be maintained as a separate mirror or used as the source of a point-in-time backup to tape, or both.
This backup approach is essentially an off-line backup in that the application must be off-line to ensure data integrity for the duration of the backup. Some application software and most database managers have specific interfaces defined for use by backup programs to support data integrity for a point-in-time backup while the application continues to run its normal function. This is called an on-line backup.
In some cases, the data is backed up as the application runs, creating a "fuzzy backup" (so called because the copy is in a somewhat undefined state). While a fuzzy backup is being created, some applications record a journal of changes alongside it to facilitate returning the fuzzy data to a defined state. For example, the Oracle database manager on various UNIX platforms supports this type of backup.
Moving the Bytes
There are three fundamental paths for copying data for backup from the source disk to tape. Data can be read from the disk using the application host and then written to tape via directly-attached tape drives. This is the local mode of data movement.
The second alternative is to read the data using the application host and to send it across the network to a backup server, to be written onto attached tape subsystems. This is network data movement. Finally, the backup tapes can be created by moving data directly from disk to tape without using the application host. This is the direct mode of data movement.
Typically, the restore process uses the reverse of the backup path, although there are exceptions. For instance, a direct backup, in the event of a disaster, may be restored to a different system over the network.
One special consideration about direct backup must be discussed. Since the data storage system generally deals with physical elements such as blocks, tracks, cylinders and disks, the direct backup will deal only with these physical elements unless some special provisions are made. On the other hand, because the data is read on the application host in the local and network backup approaches, the backups easily deal with logical elements such as files, tables, indexes, catalogs, etc.
Dealing with logical elements when doing a direct backup requires that the physical location of the logical elements be determined and made known to the backup server. This job is eased if the backup server runs data access software identical to that in the application hosts. However, when there are multiple application hosts to back up that are different in one aspect or another, a more creative solution is required.
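The logical-to-physical translation needed for a direct backup can be sketched as below. The file names and extent numbers are hypothetical; the idea is that the application host resolves each logical file into (starting block, block count) extents, and the backup server then copies exactly those physical blocks straight from disk to tape.

```python
# Hypothetical mapping from logical files to physical disk extents,
# as an application host might hand it to a direct-backup server.
extent_map = {
    "orders.db": [(0, 4), (12, 2)],   # (starting block, block count)
    "index.idx": [(4, 3)],
}

def blocks_for(files, extent_map):
    """Expand the logical files chosen for backup into the sorted list of
    physical block numbers the backup server must copy to tape."""
    blocks = []
    for name in files:
        for start, count in extent_map[name]:
            blocks.extend(range(start, start + count))
    return sorted(blocks)

assert blocks_for(["orders.db"], extent_map) == [0, 1, 2, 3, 12, 13]
```

When the backup server runs the same data access software as the application host, it can build this map itself; otherwise the map must be produced on the host and shipped to the server, which is the harder, more creative case noted above.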
Allan Scherr is senior vice president, software engineering, at EMC Corp., in Hopkinton, MA. Part 2 of this article will run in the September issue and will cover backup/restore performance, scalability, and management.