Due to decreasing disk costs, replication is becoming affordable to more and more companies.
By Marcia Reid Martin
Until recently, data replication was, oddly enough, both a luxury and a necessity. Only large enterprises could afford it—and only for their most critical applications. Today, however, few companies—regardless of size—can afford downtime.
As falling disk prices allow IT administrators to re-evaluate replication, they are finding a confusing plethora of alternatives. This article examines how some replication technologies work and helps end users to assess their relative value.
How replication creates value
The value of information technology comes from the primary business processes it supports. To create business value, applications must be available, recoverable, and secure. Replication technology can enhance these value propositions.
By using replication, you can ensure that applications stay available through events such as site disasters, equipment failure, secondary processing tasks (such as backup), and data loss due to human error or malicious interference. The last of these is the most difficult to deal with: the best we can hope for today is that when human error or malice stops business processes, they can be restarted. This brings up the next class of value creators: data security.
Data security means knowing that when something happens to the main data copy, there's another copy that's safe. Ideally, "safe" should mean that preserved data can be quickly deployed to return the application to an operational state. Replication can help to:
- Create safe offline copies (i.e., backup);
- Reduce the risk window (the elapsed time between data creation and completion of the first safe copy);
- Improve copy validity. A big problem with backups is that by the time you learn that they're unreadable, inconsistent, or incomplete, it's usually too late; and
- Improve recovery. Recovering an application from backups can be like putting together a puzzle. Data isn't safe unless you know that you can get it back and working in a reasonable amount of time.
Traditional disk-to-tape backup fails to meet many of these requirements. Although it creates "safe" copies, recovery is slow and uncertain, and the risk window is large. Ensuring that backups copy valid data may also mean shutting down other applications.
Adding point-in-time technology addresses several of these problems. It lets backup (on a consistent, stable copy) and primary applications run concurrently. As a bonus, sometimes lost or damaged data can be "recovered" directly from the point-in-time view. And recovering files from a disk-based replica is much faster than recovering them from tape.
Point-in-time technology began as an extension of RAID 1. In this approach, the disk controller stops updating one of a mirrored group of disks, allowing it to be accessed as a separate logical unit. Because the "frozen mirror" is a complete, independent copy of the original, data protection starts as soon as the mirror is created.
Applications use frozen mirrors just as they do the original copy. If something happens to the original data after a frozen mirror is created, the application can "fall back" to the mirror copy in seconds or, at worst, minutes.
Frozen mirrors are writable. If the primary application wasn't in a consistent state when separation occurred, the frozen mirror can be crash-recovered. Differential point-in-time methods, in contrast, are read-only or have limited write tolerance.
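The frozen-mirror idea can be illustrated with a minimal Python sketch. It is a toy in-memory model, not any controller's actual implementation; the class name `MirroredVolume` and its methods are hypothetical names chosen for illustration.

```python
class MirroredVolume:
    """Toy RAID 1 group (hypothetical model): every write goes to all members."""

    def __init__(self, num_blocks, copies=2):
        # Each member is a complete, independent block-for-block copy.
        self.members = [[b""] * num_blocks for _ in range(copies)]

    def write(self, block, data):
        # Duplicate the write on every member still in the group.
        for member in self.members:
            member[block] = data

    def read(self, block):
        return self.members[0][block]

    def freeze(self):
        """Detach one member; it stops receiving updates and can be
        used as a separate, writable logical unit."""
        if len(self.members) < 2:
            raise RuntimeError("need at least two members to split one off")
        return self.members.pop()


vol = MirroredVolume(num_blocks=4, copies=2)
vol.write(0, b"v1")
frozen = vol.freeze()      # point-in-time copy exists immediately
vol.write(0, b"v2")        # the primary keeps changing
frozen[1] = b"scratch"     # the frozen mirror is independently writable
```

After the freeze, block 0 of the primary holds the new value while the frozen member still holds the old one, which is what makes it usable as a fallback point.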
So what's the downside of mirroring?
- Cost—Only high-end (expensive) disk controllers can do mirror rotations, and the disk storage behind them is also relatively expensive;
- Limited viewpoints—Today, the number of copies any mirroring controller can manage concurrently is fewer than 10. And since each copy takes as much storage as the first copy, the incremental cost of adding another copy is high; and
- Resilvering—As a frozen mirror ages, it becomes less and less useful as a fallback point. Although the controller can bring any frozen mirror back to currency without interrupting the availability of the mirror group, there is a negative effect on performance while it does so. (Resilvering is the process of bringing a replica volume that has been frozen back to currency with its primary volume. This can mean copying the entire contents of the primary back to the replica, or copying only the portions of the primary that changed while the replica was frozen.)
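The two resilvering strategies described above (copying everything versus copying only what changed) can be sketched as follows. This is an illustrative model under stated assumptions: the `FrozenPair` class and the set-based change tracking are hypothetical, and the sketch assumes the replica itself was not written to while frozen.

```python
class FrozenPair:
    """Hypothetical primary volume plus one frozen replica, with change tracking."""

    def __init__(self, blocks):
        self.primary = list(blocks)
        self.replica = list(blocks)   # frozen at construction time
        self.dirty = set()            # blocks changed on the primary since the freeze

    def write(self, block, data):
        self.primary[block] = data
        self.dirty.add(block)

    def resilver_full(self):
        """Copy the entire primary to the replica; returns blocks copied."""
        self.replica = list(self.primary)
        self.dirty.clear()
        return len(self.primary)

    def resilver_incremental(self):
        """Copy only the blocks that changed while frozen; returns blocks copied."""
        for block in self.dirty:
            self.replica[block] = self.primary[block]
        copied = len(self.dirty)
        self.dirty.clear()
        return copied
```

In this model the incremental path does work proportional to the number of changed blocks rather than to the size of the volume, which is why it disturbs primary performance for a shorter time.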
Some businesses have applications so critical that maintaining eight frozen copies of their data, spaced a few hours apart, seems like a good investment. Tape backups and other secondary processes such as reporting and data extraction are always run from the frozen mirrors, typically using redundant servers. The primary application is affected only during resilvering.
The chart compares the dominant point-in-time technologies (snapshots and mirror rotation) to un-enhanced disk-to-tape backup and a future replication technique.
Nearly all companies have applications so critical that extended downtime is unacceptable, and even short outages incur major costs. Not all of these companies can afford multi-million-dollar disk solutions to prevent such downtime. Snapshots are a less-expensive way for smaller businesses (or less-critical applications in big businesses) to enjoy many of the benefits of mirror rotation.
Snapshots are only "virtual" copies. Only one full copy of the primary data exists on-disk at a time. When a snapshot is taken, almost nothing happens. But subsequent write events are handled differently—with the old value of affected disk blocks being saved, either in a separate partition reserved for that purpose or in "unallocated" areas of the snapped partition.
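A minimal sketch of this behavior, assuming a copy-on-write design in which each snapshot keeps its own private store of old block values (real products may share saved blocks among snapshots or use reserved partitions). The `CowVolume` class is a hypothetical illustration, not a vendor API.

```python
class CowVolume:
    """Toy copy-on-write volume: snapshots store only the old values of changed blocks."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # the single full copy of the data
        self.snapshots = []          # one {block: old_value} map per snapshot

    def snapshot(self):
        # Taking a snapshot copies no data; it only opens an empty difference map.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block, data):
        # Before overwriting, preserve the old value for every snapshot
        # that has not already saved this block (the extra write).
        for saved in self.snapshots:
            if block not in saved:
                saved[block] = self.blocks[block]
        self.blocks[block] = data

    def read_snapshot(self, snap_id, block):
        # Redirection: use the saved old value if the block has changed,
        # otherwise read the block shared with the primary view.
        saved = self.snapshots[snap_id]
        if block in saved:
            return saved[block]
        return self.blocks[block]
```

Only the first write to a block after a snapshot triggers a save; later writes to the same block overwrite the primary without adding to the snapshot's store.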
While storage applications can handle snapshots as if they were actually frozen mirrors, there are tradeoffs:
- Safety limitations—Snapshots provide immediate protection against logical data loss, but not equipment failures;
- Contention—80% or more of the blocks in a snapshot are shared with the primary view. Hence, while a secondary process and a primary process are both running, they are in contention for the same storage. Neither will perform as well as either would alone;
- I/O overhead—Most snapshots use a technique called copy-on-write. Changes to the "real" copy of the volume generate extra writes. Reading from the snapshot invokes special redirection logic; and
- Short lifetime—While it is merely inadvisable to allow frozen mirrors to age considerably, a snapshot that is allowed to age too long will use all the available space to store differences, and thus can fail. With some methods, this causes the primary copy to run out of space as well.
Also consider limitations on the number of available snapshots. Controller-based snapshot mechanisms have low limits on the number of snapshots per volume (typically four to eight). Software-based mechanisms claim more (sometimes hundreds), but the practical limitations of the copy-on-write mechanism make keeping many snapshots problematic.
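To make the storage tradeoff between the two approaches concrete, here is a rough back-of-the-envelope sketch. It assumes that each frozen mirror consumes a full copy of the volume and that each snapshot independently stores a fixed percentage of changed blocks. Both function names and the 20% change rate (the complement of the 80% sharing figure cited above) are illustrative assumptions, not measured values.

```python
def mirror_storage_gb(volume_gb, frozen_copies):
    """Primary plus one full-size copy per frozen mirror."""
    return volume_gb * (1 + frozen_copies)


def snapshot_storage_gb(volume_gb, snapshots, changed_pct):
    """Primary plus, per snapshot, only the changed share of the volume.

    Assumes every snapshot saves its own differences (no sharing between them).
    """
    return volume_gb * (100 + snapshots * changed_pct) / 100


# Hypothetical example: a 1,000 GB volume with four point-in-time views.
mirrors = mirror_storage_gb(1000, 4)           # 5000
snaps = snapshot_storage_gb(1000, 4, 20)       # 1800.0
```

The gap widens with each additional view, but the snapshot figure grows as the views age and more blocks change, which is the short-lifetime risk noted above.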
Point-in-time technology can deliver major benefits. It's important to evaluate the cost-benefit tradeoffs of competing point-in-time technologies—mirror rotation, software snapshots, and controller-based snapshots.
However, snapshots and mirror rotation are not the end of the data replication story. Soon, advanced replication technologies will offer all the advantages of snapshots and mirroring, with fewer disadvantages (see table).
Marcia Reid Martin is a consulting software engineer at StorageTek (www.storagetek.com) in Louisville, CO.
Glossary of replication terms
There's little consistency in how replication terms are used, but this article relies on the following definitions:
Point-in-time view: Any consistent, stable disk image. The properties of a point-in-time view are that it looks like a complete image of a disk volume or logical unit and that it can appear private to a single process—so the view cannot change unexpectedly.
Mirror: A block-for-block replica of another disk volume or logical unit. Once established (such as by copying the original unit), a mirror is maintained by duplicating all write operations to the primary on the mirror and by allowing no other writes to the mirror.
Frozen mirror: A point-in-time view made by creating a mirror, then at some point in time (when a specified event occurs), ceasing to update the mirror. Updates to the original continue, but the mirror is "frozen" as it was when the event occurred.
Snapshot: A differential point-in-time view. There are several ways to manage a snapshot, but all involve keeping two or more copies of only the areas within a storage unit that have actually changed since the view was created.