Technical view of disk journaling for secondary disk

Posted on November 01, 2003

Using disk journaling in a disk-to-disk-to-tape architecture can solve a lot of backup/restore issues.

By Marcia Reid Martin

Previous articles in this series have shown how introducing a secondary disk layer between the primary storage (application) layer and the tape layer can enhance the utility and value of the data protection process in many ways, including

  • Speeding up recovery;
  • Eliminating or reducing the backup window (the time it takes to make a backup);
  • Eliminating or reducing the risk window (the time between a change to data and the creation of the first backup copy);
  • Improving the reliability of data protection (e.g., ensuring data is copied accurately);
  • Reducing the impact of data protection processes on primary business processes and improving the availability and performance of the primary processes; and
  • Reducing the administrative complexity of managing data protection.

Not all secondary disk architectures provide all these benefits to the same extent and at the same cost. This article introduces a secondary disk architecture that promises to provide high value at a moderate cost. This architecture is called disk journaling, a technique used in various ways by vendors that claim to provide the ability to retroactively restore to any point in time.

What is disk journaling?

In disk journaling, every write or update event that affects a primary storage device (disk partition or volume) is also written to another, secondary destination. So far, this is the same as simple replication or RAID 1 mirroring. However, in journaling, the secondary destination is not a simple, one-to-one replica of the primary destination, but an actual sequential history of write events. Figure 1 illustrates how such a journal is created.

The journaling interposer is the first critical mechanism for initiating journaling. The interposer is invisible to both the storage application and the primary disk storage. The interposer passes reads and writes without affecting them in any way. Writes, however, are also queued to a second device, the journal storage. The journal processes write operations differently than the primary store does.


Figure 1: In a basic disk journaling scheme, the secondary destination is a sequential history of write events (as opposed to a one-to-one replica of the primary disk storage).

A journaling interposer can be implemented in a variety of ways: It can reside on the server that hosts the application, as a layered device driver, or it can reside in a network switch or storage controller. The interposer must implement a protocol, typically asynchronous, that allows it to ensure that, in the event of a communications failure, the journal can be reconciled with the primary store.

On the primary store, every time the storage application writes to a particular logical address, the prior value of that address is overwritten. The primary disk thus only represents the current state of the application data. In the journal, however, the data from each write event is written to a new physical location in the history. Metadata is associated with each event to record where the data would belong if it were in the primary store—its logical address in the primary address space.
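The contrast between the two stores can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names JournalEntry, Journal, and Interposer are hypothetical, a plain dictionary stands in for the primary disk, and the journal write is made inline, whereas a real interposer would queue it to the journal storage.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class JournalEntry:
    seq: int          # position in the sequential history
    timestamp: float  # when the write event was observed
    lba: int          # logical address in the primary address space
    data: bytes       # payload of the write


@dataclass
class Journal:
    entries: list = field(default_factory=list)

    def append(self, timestamp, lba, data):
        # Every write lands in a new location at the end of the history.
        entry = JournalEntry(len(self.entries), timestamp, lba, data)
        self.entries.append(entry)
        return entry


class Interposer:
    """Passes reads and writes through; also copies writes to the journal."""

    def __init__(self, primary, journal):
        self.primary = primary  # dict: lba -> bytes (stand-in for the disk)
        self.journal = journal

    def read(self, lba):
        return self.primary.get(lba)

    def write(self, lba, data, timestamp):
        self.primary[lba] = data                   # prior value is overwritten
        self.journal.append(timestamp, lba, data)  # history keeps every value


primary = {}
journal = Journal()
disk = Interposer(primary, journal)
disk.write(lba=7, data=b"v1", timestamp=1.0)
disk.write(lba=7, data=b"v2", timestamp=2.0)
```

After the two writes, the primary holds only the latest value for address 7, while the journal holds both values in the order they were written.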

Theory of journaling

Suppose a disk journal were begun the first time an application wrote anything to its primary store, and that this journal could be maintained for the life of the application. The journal could then be used to reconstruct the application's disk image as it was at any time in the history of the application. It would be, in many ways, a perfect backup. By applying the journal entries in reverse order, the primary disk can be "rolled back" to any prior state. If the primary is physically destroyed, a replica can still be constructed by replaying the journal entries in order ("rolling forward").
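As a rough sketch of that idea, the function below rebuilds a disk image for any point in time by replaying a complete journal. It assumes the journal is a time-ordered list of (timestamp, logical address, data) tuples that starts with the very first write; the function name image_at and that tuple layout are illustrative choices, not part of any product.

```python
def image_at(journal, t):
    """Return the disk image (address -> data) as it stood at time t."""
    image = {}
    for timestamp, lba, data in journal:
        if timestamp > t:
            break              # the journal is in time order, so stop here
        image[lba] = data      # a later write to the same address wins
    return image


history = [
    (1, 0, b"alpha"),
    (2, 1, b"beta"),
    (3, 0, b"gamma"),   # overwrites address 0 on the primary
]

state_at_2 = image_at(history, 2)   # image before the overwrite
state_now = image_at(history, 3)    # image matching the primary today
```

The cost of this replay grows with the length of the journal, which is the practical weakness of the "pure" scheme.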

Of course, this basic journaling scheme is too simple to be practical. First, a complete journal would continue to grow indefinitely, so that the amount of space it required would soon get out of hand. Second, as the journal grew, the roll-forward and roll-back recovery techniques would take longer and longer to accomplish. Even if stored on disk, a "pure" journal would not provide a rapid-recovery mechanism.

Fortunately, it is not necessary to maintain a complete, "pure" journal to obtain most of the benefits of journaling. A journal structure has two important characteristics—I/O efficiency and "severability"—that make it an ideal foundation for a policy-based data-protection system that yields all the data-protection goals identified at the beginning of this article.

I/O efficiency: Sequential write operations are the most efficient kind for almost any storage device. This, coupled with the fact that only about 20% of the total I/O operations in most business applications are writes, allows a low-cost journaling device to manage data protection for one demanding application, or for a number of less demanding ones.

Severability: Once the interposer has divided the write stream, the journal management process is entirely offline with respect to the primary application. The journal can be processed independently of the primary storage and application, without contending for any primary resources and without creating a need to stop or even slow down any primary business process.

A practical journaling architecture

The drawbacks associated with the "theoretical" journaling model are its excesses, not its limitations.

To make journaling practical, all that is needed is to pare away the excess, leaving behind a data structure that meets data management needs without being too large to store and process. The "severability" property of the journal makes it easy to do this, without creating the complexity associated with backup and without impacting the performance or availability of the primary application tier.

To understand a "practical" journaling architecture, it is necessary to assess what we already know about backup and recovery. The key questions are the following:

  • How long does it take to detect data corruption?
  • When (in the life cycle of data) is the probability of needing to recover a particular data object the highest?
  • When does the probability of needing to recover a particular data object on an emergency basis become small enough that it no longer justifies the cost of keeping it online?
  • Are some points in the "history" of an application more significant than others?

The answers to these questions form a simple but complete policy for managing a journal. While the answers vary from application to application, studies repeatedly have shown that most recovery requests occur within the first 24 hours of changing a data object, and that the probability of restoring a version of data to use it as the primary production instance nears zero after 30 days (far less for critical databases). Additionally, most application stores—both databases and file systems—do have "points of significance."

Adding the ability to mark these points of significance in a historical journal is a minor enhancement to a journaling protocol. Starting the journal with a full replica of the primary store allows journaling protection to start at any time in the history of a data store, rather than requiring that it start with the first write.

Keeping the replica fairly current (say, within 24 hours) yields additional benefits when performing recovery operations. Finally, putting intelligence in the journaling store to carry out a journal management policy based on the four characteristics of recovery patterns listed above ensures that the growth of the journal can be controlled.

In Figure 2, an interposer located either on a storage area network (SAN) switch or on the application server replicates writes to an intelligent journaling device, which manages the resulting journal to a policy. The device maintains a full replica of the primary store and two journals—a continuous journal of recent events, and a "sparse" or thinned-out journal of more distant events. The sparse journal contains only enough information to reconstruct points in the history that have been marked as "significant" to the application, but the continuous journal records every event.
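One way to picture the thinning step is shown below. This is a sketch under assumptions, not a description of any vendor's method: the journal is a time-ordered list of (timestamp, logical address, data) tuples, the marked points of significance are given as a sorted list of timestamps, and the function name thin is hypothetical. For each marked point, only the last write to each address since the previous mark is kept, which is enough to reconstruct the image at every mark.

```python
from bisect import bisect_left


def thin(entries, marks):
    """Reduce a continuous journal to a sparse one.

    entries: time-ordered list of (timestamp, lba, data)
    marks:   sorted timestamps of the "significant" points
    """
    kept = {}  # (index of mark, lba) -> last entry for that address
    for entry in entries:
        timestamp, lba, _ = entry
        i = bisect_left(marks, timestamp)  # first mark at or after the write
        if i == len(marks):
            continue  # later than the last mark: no marked point needs it
        kept[(i, lba)] = entry  # a later write in the same interval replaces it
    return sorted(kept.values())


continuous = [
    (1, 0, b"a"), (2, 0, b"b"), (3, 1, b"c"),   # before the mark at 4
    (5, 0, b"d"), (7, 1, b"e"),                 # before the mark at 8
    (9, 0, b"f"),                               # after the last mark
]
sparse = thin(continuous, marks=[4, 8])
```

In this example the write at time 1 is dropped because the write at time 2 superseded it before the first mark, so the image at either mark is unchanged.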


Figure 2: An interposer located either on a SAN switch or on the application server replicates writes to an intelligent journaling device, which manages the resulting journal based on policies.

A sensible length for the continuous journal is the longer of (a) the period during which the probability of a recovery request is highest and (b) the expected time needed to detect data corruption. Typically, this is between eight and 48 hours. During this interval, if the entire database becomes corrupted, there is a significant advantage in being able to crash-recover to the point in time immediately before the corruption occurred. Additionally, for individual data objects (end-user files, for example), a user is likely to remember the difference between one recent version and the next and to consider it important. That recollection, and with it the significance of the differences, diminishes over time.

A sensible length for the sparse journal is the period of time after which it no longer makes sense to return the data to production status. This interval typically varies between one and four weeks. Data older than that is archival and doesn't need to be retained in a "hot" storage format. A policy of this kind can be enforced with secondary disk capacity equal to a small multiple of the primary storage size, typically between 1.5x and 5x. In most cases, this is a smaller multiple than a typical library of full-plus-incremental backup tapes requires, even though the recovery capability provided is much greater.

It is worth noting that this simple policy (journal this volume, keep n hours of continuous history followed by m days of sparse history) is the only "management" required to set up automatic data protection with a journaling device. No operator intervention is required until a recovery is needed. The journaling device can manage its own resources and, because it accumulates data about the pattern of the write stream, can accurately warn when additional secondary disk capacity will be needed to satisfy the policy.
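Such a policy is small enough to express as two numbers. The sketch below is illustrative only: the class name JournalPolicy, its field names, and the tier labels are hypothetical, and the 24-hour and 14-day values are simply examples from within the ranges discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JournalPolicy:
    continuous_hours: float  # n: hours of continuous history to keep
    sparse_days: float       # m: days of sparse history kept after that

    def tier(self, age_hours):
        """Say where a write event of the given age belongs."""
        if age_hours <= self.continuous_hours:
            return "continuous"
        if age_hours <= self.continuous_hours + self.sparse_days * 24:
            return "sparse"
        return "expired"   # archival; no longer kept in the journal


policy = JournalPolicy(continuous_hours=24, sparse_days=14)
recent = policy.tier(3)        # "continuous"
last_week = policy.tier(100)   # "sparse"
last_month = policy.tier(720)  # "expired"
```

Everything else, including the thinning of aging entries and the release of expired ones, follows from these two values.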

Recovering from a historical journal

We have already discussed the idea of using a journal to "roll back" an entire data volume to a previous state. An intelligent journaling device can speed up this process by computing the minimal set of blocks that must be moved, rather than reading the journal sequentially. The same is true when an entire volume must be recovered: The journaling device pre-processes the journal metadata and moves the minimum number of data blocks needed to create a functional replica. However, another powerful method of journal recovery involves a form of disk virtualization.
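A minimal sketch of such a computation follows. It assumes a baseline replica taken when journaling began (a dictionary covering every address the journal touches) and a time-ordered journal of (timestamp, logical address, data) tuples; the function name rollback_plan is hypothetical. Only blocks written after the target time are considered, and blocks whose contents ended up unchanged are skipped.

```python
def rollback_plan(baseline, journal, t):
    """Return {lba: data} for the blocks that must be rewritten on the
    primary to return it to its state at time t."""
    target = dict(baseline)    # contents of each block as of time t
    current = dict(baseline)   # contents of each block now
    touched_after = set()      # blocks written since time t
    for timestamp, lba, data in journal:
        if timestamp <= t:
            target[lba] = data
        else:
            touched_after.add(lba)
        current[lba] = data
    # Skip blocks that were rewritten with the value they already had.
    return {
        lba: target.get(lba)
        for lba in touched_after
        if target.get(lba) != current.get(lba)
    }


baseline = {0: b"x", 1: b"y", 2: b"z"}
journal = [
    (1, 0, b"a"),
    (2, 1, b"b"),
    (3, 0, b"c"),   # block 0 changes after the target time
    (4, 2, b"z"),   # block 2 is rewritten with identical contents
]
plan = rollback_plan(baseline, journal, t=2)
```

Here only block 0 needs to be moved, even though three blocks exist and two were written after the target time.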

As Figure 3 illustrates, in a virtual recovery the journaling appliance performs the pre-processing needed to locate the precise set of blocks needed to create a physical replica of the protected volume at the desired point in time. Then, instead of moving the data, the journaling device emulates a new disk target, presenting a virtual volume that appears to be an exact copy of the original disk, as it was at the specified past time.
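The sketch below illustrates the idea of presenting a point-in-time view without moving data. It rests on assumptions of its own: the class name VirtualReplica is hypothetical, the baseline is a dictionary holding the volume as it stood when journaling began, and the journal is a time-ordered list of (timestamp, logical address, data) tuples. The pre-processing step builds only an index into the journal; block contents are looked up when they are read.

```python
class VirtualReplica:
    """Read-only view of a volume as it was at time t."""

    def __init__(self, baseline, journal, t):
        self._baseline = baseline
        self._journal = journal
        # Pre-processing: for each address, remember which journal entry
        # held its contents at time t. No block data is copied.
        self._index = {}
        for i, (timestamp, lba, _) in enumerate(journal):
            if timestamp > t:
                break
            self._index[lba] = i

    def read(self, lba):
        i = self._index.get(lba)
        if i is None:
            return self._baseline.get(lba)  # never written before time t
        return self._journal[i][2]


baseline = {0: b"x", 1: b"y"}
journal = [(1, 0, b"a"), (2, 1, b"b"), (3, 0, b"c")]

view = VirtualReplica(baseline, journal, t=1)
block0 = view.read(0)   # served from the journal entry written at time 1
block1 = view.read(1)   # not yet written at time 1, served from the baseline
```

Because construction touches only metadata, the view can be prepared quickly regardless of how much data the volume holds.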


Figure 3: In a virtual recovery, the journaling appliance performs the pre-processing needed to locate the set of blocks needed to create a physical replica of the protected volume at the desired point in time.

Creating a virtual replica takes a few seconds, or minutes at most. When it is ready, a system administrator can mount it using the storage application (file system or database) that created the data. The virtual replica can then be browsed, searched, and filtered using the tools the application provides for managing data. Individual files, embedded objects, rows, or tables can be located in and extracted from the virtual replica without moving any other data. Further, since data on the virtual replica is "live" and viewable, recovery isn't hit-or-miss, as is often the case with recovery from backup sets on disk or tape. No data needs to be moved until it has been verified to be what is wanted.

Other applications of disk journaling

A journaling device can present virtual replicas to any application server—not just the original one. The journal can also be processed directly. These capabilities will eventually support a range of data management applications that are typically not feasible today.

Backup—or archive—consolidation is only the first and most obvious such application. In this scenario, the journaling device presents views from many application servers to a single media server to make tape copies of the data to be retained for longer than the journal duration. The same topology applies equally well to data extraction for decision support. In both cases, significant simplifications derive from the fact that an online point-in-time image can be retained for many days or even weeks, as opposed to the matter of hours that is feasible with today's mirroring and snapshot methods.

Increasing emphasis on national security and business accountability will also foster a need for forensic applications that process the journal directly. Because a block-level journal is captured at such a low level of abstraction, most of the techniques used by malicious invaders, human and programmatic, to cover their tracks can be exposed by analysis of the disk journal. Journal analysis will be used to reconstruct how viruses propagate, to locate inconsistent transactions, and to provide early warning of threats to data before damage becomes extensive.

Marcia Reid Martin is a consulting software engineer at StorageTek (www.storagetek.com) in Louisville, CO.

