New methods of backup and restore are needed to meet today's complex requirements.
By David Freund
Just about all business executives know that their IT systems must have something called "backup." But what does that really mean? When their IT managers say "backup," they're usually talking about a copy of entire disks, usually stored on tape media, that's created by "backup software." This meaning evolved during computing's early days. Once upon a time, it simply meant the creation of a spare copy of data that can be used if the original is lost or damaged. Now the term is usually used to mean a specific type of software that uses a specific method to make copies of data. Since the need for backup has been ingrained into the consciousness of every IT and business manager, that translates into the mandated use of such software and methods on just about every business server in use today.
Common wisdom, however, is seldom either common or wise, and in this case it is simply misguided. The need for backup copies of data that can be used to replace lost or damaged information is greater than ever. But today's layered, multi-tier, multi-system, distributed applications simply cannot be properly protected using traditional backup methods.
New technologies and techniques have entered the market. However, many of them do not yet provide the level of protection needed for today's complex information infrastructures—ranging from the ability to restore single pieces of information, such as an e-mail message, to restoring an entire application-service stack to a previous "known-good" state. In other words, today's backup requires a more sophisticated approach to protecting systems and their data.
The point of backup
The point of backing up information is to be able to restore it if necessary. This crucial notion—the point of backup is recovery—is all too often lost. Ensuring that backups are done has been baked into the IT and corporate consciousness. For years, there has been tremendous focus on issues like shrinking or non-existent "backup windows," with vendors proposing various ways to speed up the backup process. Far too many companies, however, have never actually tested their backups to see if they can properly restore entire systems. Even more fail to do so on a regular and consistent basis. When asked, most CIOs admit that their backup-and-recovery strategies have significant vulnerabilities. While the 9/11 attacks caused some action—roughly one-third of US companies have since modified their backup habits in some way—the majority of IT managers must shift their focus from "making backups" to verifying the recoverability of their systems and data.
One major vulnerability is restoring full application-service stacks. For one thing, restoring from full and incremental (or "differential") backups is both time-consuming and prone to error. For another, sequential-access media such as tape create a bottleneck in restore operations. The use of tape for backup has been justified by the fact that the cost per megabyte for the media has been significantly lower than that of random-access disk drives. However, that cost difference shrinks when you factor in the cost of tape drives and libraries, which have limited life spans due to moving parts.
Yet the tape-versus-disk war rages on. Tape does still have some distinct advantages, such as portability and the fact that tape cartridges don't have head crashes. Tape, therefore, remains attractive for meeting long-term, off-site data-archiving requirements; it's less expensive for sending large volumes of data than using the Internet; and—let's face it—IT shops, especially those of large corporations, tend to resist changing long-term practices.
More important, the ability of traditional backup to capture an image of a system at a specific point in time is based on two fundamental assumptions: (1) no changes will occur in any of the files scheduled for backup between the moment the process begins and the moment it completes, and (2) the state of the entire system can be captured from disks that the backup software can access. Today's applications, however, are increasingly inter-dependent, communicating with each other and with database systems on several networked servers.
Each application typically has its own sense of "state," such as how many steps of a given process have been completed, what requests from users or other applications remain to be processed, what requests made to other applications or databases are still outstanding, etc. What's on disk at any given time has no necessary or guaranteed correlation with the complete logical state of the application. Think about that for a moment, for therein lies the rub: A fundamental divergence from historical realities about how application information is represented demands a fundamental rethink in backup strategies.
Fulfilling the mission
Solutions to these fundamental problems—saving a consistent "point-in-time" state of entire application systems and providing timely recovery of data ranging from individual data objects to entire systems—fall into three major classes:
Incremental solutions take the approach of enhancing existing backup methods. Examples include virtual tape products, disk-to-disk-to-tape backup, and backup from file-system or volume snapshots. One particularly interesting enhancement is the addition of "application awareness," giving the backup utility the ability to detect—and even control—application state, enabling it to store a known, consistent point-in-time image of the application. This is usually done using APIs provided by the application. Some major database and application packages, such as Oracle's and SAP's, ship with tightly coupled backup utilities, but those utilities handle only their associated applications and are not easily integrated with the backup of the rest of the system of which the application is only one part. Other utilities, such as Dantz' Retrospect, Veritas' NetBackup and BackupExec, and Legato's NetWorker, are meant to back up full systems, and they include "awareness" of applications and databases such as Lotus Notes/Domino, Microsoft Exchange, Oracle (DBMS and/or applications), SAP R/3, and more.
Fundamental solutions use completely different techniques for providing extra copies of data. Some of the techniques include the following:
- Mirroring—also known as RAID 1—produces a block-for-block replica of a disk volume. Once the mirrored volume is established by copying the original disk, the mirror is maintained by repeating all write operations to both the original and the copy (or, in some cases, multiple copies). Mirroring can be performed "synchronously," where both the original and copy devices must acknowledge that the write is completed before the next write can occur. This slows application performance but keeps the mirrored volumes perfectly synchronized as mirror images of each other. In "asynchronous" mirroring, on the other hand, the original and copies do not have to synchronize their writes; they are allowed to complete independently. Asynchronous mirroring is thus faster from the perspective of applications, but the copies can be out-of-synch with the primary copy at any given point in time. In practice, copies within a single data center are rarely more than a minute behind the original volume. Both techniques can also be used over long distances, but the greater the distance the slower the performance in synchronous mode.
Mirroring is used for many critical applications and provides the fastest way to recover data. Restore operations involve simply using the mirrored copy instead of the original, which is usually automatic and instantaneous. However, only the entire volume can be restored. Individual files cannot be selectively recovered unless a mirror set has previously been "split," stopping the duplication of writes to one of the replicas and freezing that replica as an image of the original at the moment of the split; selected files can then be copied manually from the stale replica. Another drawback is that each copy must be at least as large in capacity as the original, regardless of how much of that capacity is actually being used.
Products like EMC's SRDF and MirrorView, Hewlett-Packard's StorageWorks Data Replication Manager, Hitachi's ShadowImage, and IBM's PPRC are examples of sophisticated mirroring software. Most RAID arrays also contain built-in mirroring of directly attached disks.
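The synchronous/asynchronous distinction can be sketched with a toy model in Python. This is illustrative only: real mirroring is implemented in volume managers, array firmware, or device drivers, and the class and method names here are invented for the sketch.

```python
import queue

class MirroredVolume:
    """Toy block device with a primary and one mirror copy."""

    def __init__(self, nblocks, mode="sync"):
        self.primary = [None] * nblocks
        self.mirror = [None] * nblocks
        self.mode = mode
        self._pending = queue.Queue()  # writes not yet applied to the mirror

    def write(self, block, data):
        self.primary[block] = data
        if self.mode == "sync":
            # Synchronous: the write "completes" only once both copies
            # have it, so the mirror never lags the primary.
            self.mirror[block] = data
        else:
            # Asynchronous: acknowledge immediately; the mirror is
            # updated later, so it can be out of sync at any instant.
            self._pending.put((block, data))

    def flush(self):
        # Drain the pending writes to bring the mirror up to date.
        while not self._pending.empty():
            block, data = self._pending.get()
            self.mirror[block] = data

vol = MirroredVolume(8, mode="async")
vol.write(0, b"hello")
print(vol.mirror[0])   # None -- the mirror lags until flushed
vol.flush()
print(vol.mirror[0])   # b'hello'
```

Running the asynchronous case shows the defining trade-off: the write is acknowledged while the mirror still holds stale data, exactly the out-of-synch window described above.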
- Snapshots represent an image of a disk volume or file system as it was at an instant in time. This is usually accomplished using a "copy-on-write" technique; after the "point in time" of the "snap," any further writes (including updates to existing data blocks) do not overwrite existing data, but are transparently redirected to another location, keeping the original data intact. The software or hardware that created the snapshots keeps track of which versions of the data blocks belong to which snapshots, as well as the "live volume." Each snapshot can be individually mounted, providing access to the volume as it was at the instant the snap was taken. Some products also allow post-snap writes (updates) to the snapshots' contents.
If the master data isn't extensively modified, multiple snapshots take up very little additional physical space. Rolling back an entire volume to a point in time involves simply using the snapshot instead and declaring it to be the "live volume" going forward—if the software or appliance supports that feature. Restoring data involves either replacing the master with a snapshot or mounting the copy and manually transferring files. Individual files can also be recovered from any point in time for which a snapshot exists. On the downside, snapshots take up more space as they diverge farther from the master over time. Accessing mounted snapshots can degrade application performance because they share storage with the master version. Finally, there is usually no way to determine if a given snapshot represents a point in time when all the data is in a consistent state for the various applications that use that volume.
BakBone's NetVault, CommVault's Quick Recovery, Computer Associates' BrightStor High Availability Manager, EMC's SnapView and TimeFinder, FalconStor's IPStor Snapshot Copy, and Network Appliance's SnapRestore are all examples of snapshot products.
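The copy-on-write idea behind snapshots can be illustrated with a toy model. The names are hypothetical; real products track block ownership in file-system or array metadata, not Python dictionaries.

```python
class CowVolume:
    """Toy copy-on-write volume: snapshots share unmodified blocks."""

    def __init__(self):
        self.live = {}        # block number -> data for the live volume
        self.snapshots = []   # each snapshot is a frozen block map

    def write(self, block, data):
        # Post-snap writes never destroy data a snapshot still references:
        # the live map simply points at the new version of the block.
        self.live[block] = data

    def snap(self):
        # Taking a snapshot copies only the *map*, not the data, which is
        # why snapshots of a mostly-unchanged volume cost little space.
        view = dict(self.live)
        self.snapshots.append(view)
        return view

vol = CowVolume()
vol.write(0, "v1")
snap0 = vol.snap()
vol.write(0, "v2")     # the live volume moves on...
print(snap0[0])        # v1 -- the snapshot still sees the old data
print(vol.live[0])     # v2
```

The sketch also shows the space behavior noted above: each snapshot grows only as the live volume diverges from it.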
- Replication maintains a perfect copy of a file, a related set of files, a file system, or a database. Unlike RAID-1 mirroring, which operates on disk volumes (though it, too, is sometimes called replication), replication uses application awareness to maintain self-consistent copies of a specific application's data. It can be a feature of the application or database itself (typically implemented by exchanging network messages with an instance of the application on another server) or provided through an application's APIs. DB2, Oracle, SQL Server, and other databases have their own replication abilities, as do applications such as Microsoft Exchange and Lotus Domino.
Database replication has the obvious advantage of maintaining a replica's read consistency, since it repeats master-database changes on the copies as full transactions, either in real time or in batches. The master and replica databases can also run different versions of DBMS software or operating systems. Individual tables, or even records, can be selectively recovered. However, replicas contain only the most recent version of the data.
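Why transaction-granular replication preserves read consistency can be sketched with a toy model. All names here are invented; real DBMS replication ships logs or messages between servers rather than sharing objects.

```python
class ReplicatedTable:
    """Toy transactional replication: each change ships to the replicas
    as a whole transaction, so a replica is never left half-updated."""

    def __init__(self):
        self.rows = {}        # record key -> value
        self.replicas = []    # other ReplicatedTable instances

    def commit(self, txn):
        # txn is a dict of key -> new value, applied atomically here
        # and replayed as a single unit on every replica.
        self.rows.update(txn)
        for replica in self.replicas:
            replica.rows.update(txn)

master = ReplicatedTable()
replica = ReplicatedTable()
master.replicas.append(replica)

# A funds transfer replicated as one transaction, never as two
# independent block writes that could be seen half-applied:
master.commit({"acct:1": -100, "acct:2": +100})
print(replica.rows == master.rows)   # True
```

Contrast this with block-level mirroring, which copies whatever happens to be on disk mid-transaction.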
- With journaling, every write and update operation on a volume, database, or file is recorded to a separate file, area, or device. Instead of a working copy of the original, a journal is a sequential history of write events. Usually performed asynchronously, journaling adds information to each recorded entry to associate the write operation with the original location of the data and the date and time the write operation occurred. Unlike mirroring and replication, which would dutifully copy any corruption of data that takes place on the original volume, a journaled volume—or even an individual file—can be "rolled back" to a point in time prior to the corruption event. However, like snapshots, without application awareness at any given point in time the system may not have an application's data in a consistent state.
In typical commercial environments, approximately 20% of all I/O operations are writes, so a journal consumes ever more capacity over time (unless the journaling software is designed to "drop" journal entries older than some threshold). Normally, journaling is used in combination with mirroring or replication as a means to completely recover the volume, database, or application in the event the master fails. Some implementations, such as StorageTek's EchoView and Vyant's RealTime, are platform-independent—at the price of being unaware of application or OS state.
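The journal-and-roll-back idea can be sketched in a few lines. The names are hypothetical; a real implementation journals block writes beneath the file system, not Python tuples.

```python
import time

class JournaledVolume:
    """Toy write journal: every write is logged with a timestamp, so the
    volume can be reconstructed as of any earlier moment."""

    def __init__(self):
        self.journal = []   # (timestamp, block, data), append-only

    def write(self, block, data, ts=None):
        self.journal.append((ts if ts is not None else time.time(), block, data))

    def as_of(self, ts):
        # Replay history up to ts -- e.g. just before a corruption
        # event -- to "roll back" the volume to that point in time.
        state = {}
        for when, block, data in self.journal:
            if when <= ts:
                state[block] = data
        return state

    def current(self):
        return self.as_of(float("inf"))

vol = JournaledVolume()
vol.write(0, "good", ts=1)
vol.write(0, "corrupt", ts=2)   # e.g. a bad application write
print(vol.as_of(1))             # {0: 'good'}
print(vol.current())            # {0: 'corrupt'}
```

The sketch makes the trade-off above concrete: the journal can undo the corrupting write, but nothing in it says whether the state at any chosen timestamp is consistent from the application's point of view.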
Hybrid solutions combine these methods. Sequentially copying files from a snapshot to tape, for example, provides a true point-in-time copy while eliminating the "backup window" problem. Products like XOsoft's Data Rewinder combine application awareness with file-system journaling to provide known-good snapshots of an application's state, along with the rest of the data associated with the application service.
EMC plans to provide a hybrid solution that adds an interesting twist to this approach. By combining VMware's virtual-machine snapshot capabilities with its other data-protection products, the state of an entire server—and all the applications running on it—exists as simply another piece of data to protect. For example, a VMware snapshot could be contained in a multi-volume snapshot taken at the same instant. If an application were in the middle of a multi-part transaction when this hybrid snapshot was taken, it would simply complete the transaction when the virtual-machine snap was resumed—solving the "application awareness" problem by redefining it. Veritas plans to do much the same with its Ejasent product acquisitions.
Suggestions and recommendations
No one solution fits all scenarios. But here are a few principles that should guide every organization's backup strategy:
The point of backup is recovery. Any backup strategy must be tested to verify recovery. Test early, and test often!
Think beyond disks, files, and servers. The ability to recover data blocks and files is a good start. But it's only a start. Properly protecting today's business systems requires that complex, multi-part applications and services be recoverable in their entirety.
Think about recovery from the application's viewpoint. How does it store its data? What other applications does it interact with? How would those applications react if this application were restored to a previous state and restarted? How well would this application react to similar acts by other applications? Any applications that depend on others' "keeping up" with them should be protected as a single, aggregate entity.
Think both large and small. Recovering a single e-mail message, text document, or voicemail message that has been lost or damaged can be as crucial as recovering entire services. And recovering an entire data center, however unlikely the need, is as crucial as it gets.
Think error, not just failure. Everyone understands the need to protect against data loss due to hardware failure. But there's also a frequent need to "undo" something that's been done—usually by a human being or by something programmed by a human. It could include something simple like restoring a file that was accidentally deleted. Or restoring a range of database records wiped out by a programming error. It could even include reversing damage done maliciously by disgruntled employees, viruses, etc.
Consider your strategy for information archiving and retention. This could include document- and content-management systems (handling "fixed content," "live content," or both), and even archiving/data-protection systems for meeting regulations such as Sarbanes-Oxley, HIPAA, DoD 5015.2, Rule 17a-4, etc.
Think "portfolio." Similar to the way different financial instruments are used to provide a balanced investment portfolio, different backup techniques and technologies should be used to provide a balanced backup solution. Similarly, each organization has a unique mix of needs, risk tolerance, and budget constraints—all of which can change over time. Setting priorities and deciding which applications and data are more central to daily operations than others remain a common-sense part of selecting backup solutions. Doing so in the context of a portfolio protects against overlooking other assets that are less obvious, but no less important to be able to recover.
Be specific. For each asset to be protected (from complex, multi-part application services to desktops), choose products based on how well they match the protection needs of those assets along the following dimensions:
- Recovery-time objective (RTO)—How quickly can applications, systems, and/or data be recovered and operational? Operational, accessible copies of those assets, using technologies like mirroring (and other forms of RAID) or clustering, for example, provide the fastest—and most expensive—RTO. Magnetic media stored off-site provides the slowest method.
- Recovery-point objective (RPO)—How close to the last possible instant prior to a failure can applications, systems, and/or data be restored?
- Recovery-object granularity—What size of objects needs to be recovered? An entire system? Disk volume? File? E-mail message? Database record?
- Recovery-time granularity—How finely can you "turn back the clock" to recover an asset? Must it be restored to the way it was at the end of a given business day? At a given hour, minute, or second? Are quantities other than time, such as transactions, needed?
- Self-consistency—What is the level of guarantee that applications, systems, and/or data will be restored with their states intact? This directly translates into how useful those assets are when restored. Can related objects be grouped and protected as a single entity? How easily can such groups be defined—and changed over time?
- Resiliency—How well does the product tolerate failures? This aspect is actually quite broad, ranging from detailed events like device, software, and network errors, to macro-level failures like power outages, fires, or floods that can affect an entire site. And let's not forget human error!
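Matching techniques to per-asset objectives along the first two dimensions can be made concrete with a small sketch. All class names and figures below are hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Objective:
    rto_minutes: float   # how long recovery may take (RTO)
    rpo_minutes: float   # how much recent data may be lost (RPO)

@dataclass
class Technique:
    name: str
    rto_minutes: float
    rpo_minutes: float

def qualifies(tech, obj):
    """A technique fits an asset if it recovers fast enough (RTO)
    and recent enough (RPO)."""
    return (tech.rto_minutes <= obj.rto_minutes
            and tech.rpo_minutes <= obj.rpo_minutes)

# Hypothetical figures: a mail server that must be back within an
# hour, losing no more than 15 minutes of messages.
mail_server = Objective(rto_minutes=60, rpo_minutes=15)
candidates = [
    Technique("synchronous mirroring", rto_minutes=1, rpo_minutes=0),
    Technique("nightly tape backup", rto_minutes=480, rpo_minutes=1440),
]
print([t.name for t in candidates if qualifies(t, mail_server)])
# ['synchronous mirroring']
```

The remaining dimensions (granularity, self-consistency, resiliency) resist simple numeric comparison, which is precisely why they need to be evaluated per asset.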
Regardless of what technology or technologies are used, a good backup strategy requires that IT managers shift their focus from the act of "making backups," managing media, etc., to planning how they would recover multi-server, multi-application services to a logically consistent state, in their entirety, when needed. This involves considering application interdependencies, as well as how those applications handle various kinds of errors such as getting data from a network peer that is suddenly out of sequence, network peers' disappearing and re-appearing, and so on. The goal should be to form a comprehensive backup plan that protects each application system as an aggregate entity—including each component that will not tolerate a loss of synchronization with others.
Cost, as always, is a factor. No one solution fits all needs. A good backup plan will usually be a blend of techniques, ranging from traditional tape backup to journaling, mirroring, snapshots, and application-aware software agents, covering everything from the restoration of single pieces of information, such as a file or an e-mail message, to the restoration of an entire application-service stack to a "known-good" state. Thinking of backup as a portfolio of products, processes, and procedures can help organizations set priorities and allocate their backup expenditures to protect all of their IT assets.
Taking that a logical step further, backup should be considered as part of an organization's overall IT portfolio. Each firm's need for new software or hardware features and functionality can be more explicitly balanced with its tolerance for exposure to risk. Risk tolerance usually invokes thoughts of regulatory-compliance issues, or preparing for natural disasters or terrorism. However, an inability to recover vital information lost due to human error can pose a greater—and more probable—risk.
IT has become the nervous system of business, critical to many a company's health and well-being. We all hope, and even expect, not to need health and life insurance. But most of us purchase insurance policies because we are unwilling to risk the consequences should something happen. When something does happen, how quickly and effectively an insurance company deals with—and pays—its claims can have a dramatic impact on people's lives. For that reason, savvy consumers buy their insurance policies with that in mind. Backup is very much like insurance.
Just having any sort of policy might be enough for some to sleep at night. Knowing for certain that one's policy will actually pay off—recover application services and data, including the complex infrastructure that supports them reliably and quickly—is a much wiser choice.
David Freund leads the Information Architectures practice at Illuminata Inc. (www.illuminata.com) in Nashua, NH. This article was excerpted from a longer research report that is available on Illuminata's Website.
A backup plan for SMBs
Consider the example of a small business with 10 employees, a Microsoft Exchange Server, a few databases, a file/print server, and some test systems. The company relies heavily on Exchange: it does business primarily through e-mail, and it maintains its contacts and collaborates on much of its work using that platform. The business is not a 24/7 shop; after-hours interruptions of service can be tolerated, but interruptions lasting more than an hour during the workday would have painful consequences. Files on the file/print server are vitally important, but most of the company's work product is created collaboratively and kept on the Exchange server. A portfolio for such a company could include the following:
- External RAID for the Exchange and database servers, including data and boot volumes. This protects against hardware failures in both disks and servers. In the event of a server failure, a lower-priority server (such as one of the test machines) can be quickly cabled to the array(s) and started.
- Use of Exchange- and database-aware, file-system-based journaling (with snapshot capability) on the Exchange and database servers. This provides an ability to "undo" data loss or damage to any point in time, with a guarantee of self-consistency. Ideally, the solution would be "Exchange-aware" enough to restore individual mailboxes, folders, and even messages, but that's unlikely to be available (at least within the organization's budget).
- Exchange-aware traditional backup software with message-level restore. Although both RPO and RTO are less than ideal, they satisfy the firm's recovery-granularity need.
- Periodic, host-software-based snapshot on the file/print server. Excellent RTO and an RPO that can be varied according to the directory tree concerned. Another option is to replace the file server with a NAS appliance with similar data-protection capabilities.
- Traditional, network-based backup for all servers and desktops. This is used to cover all the remaining information, as well as to provide a means for maintaining an off-site copy of all the company's data, including self-consistent snapshots of the Exchange and database servers.
- Optional: If money and disk space permit, journaling of desktop file systems could be added for an extra level of "undo" protection—with volume- and file-level restore granularity to any point in time—for those users who are prone to "tinkering" with their systems.