Data replication at the file-system level between two independent systems is a high-availability configuration that has advantages over shared-storage clustering.
BY ADAM STEVENSON
The majority of existing high-availability installations use a shared-storage clustering architecture, where multiple servers each connect to external RAID subsystems. However, this architecture presents several hurdles to eliminating data loss and achieving high availability. A high-availability system based on data replication over TCP/IP between two completely independent systems does not have the same shortcomings. This replication architecture, not widely used until recently, can potentially offer higher availability at lower cost.
An architecture based on data replication contains no single points of failure and enables sub-second fail-over because file-system recovery time is eliminated. Furthermore, the cache and lock management issues associated with shared-storage clustering are not present in data-replication architectures.
Drawbacks of shared-storage clustering
Shared-storage clustering configurations have multiple nodes attached to the same external RAID subsystems (see Figure 1). In the event that one server fails, a second server provides access to the shared storage.
Figure 1: Shared-storage clustering configurations have multiple nodes attached to the same external RAID subsystems.
There are two variations of the shared-storage clustering architecture. In the first, called "single-system access," each multi-host disk is accessed through a single, primary node. Only if this primary node fails does a second node take over control of the multi-host disk array. In the second, called "multi-system access," two or more nodes access the disk at the same time. If one of the nodes fails, the remaining nodes continue to access the disk. Both variations have shortcomings that can reduce availability and increase costs.
The first drawback is that the shared storage must be an external, dual-ported, dual-controller, multi-disk-channel RAID array. These subsystems are significantly more expensive than internal storage or external, single-controller, single-ported RAID subsystems. Shared RAID subsystems also have several single points of potential failure, including the disk channel, channel controller chip, backplane, and the communications link between the redundant RAID controllers.
The other potential drawback to shared-storage subsystems is the single geographic point of vulnerability that they present. A fire, flood, or other disaster can eliminate the disk subsystem and the data on it. In addition, the limited distance of a SCSI or Fibre Channel connection limits the distance between the clustered servers. As such, this model does not provide disaster recovery.
In both single-system and multi-system access, server failure can result in lost cache data. When a server fails with committed data in its cache that has not yet been written to disk, that data is unavailable at best, or permanently lost at worst. To prevent this, all file-system writes must be synchronously passed to the disk subsystem before the file system commits the write. This synchronization requirement reduces system performance.
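The cost of that requirement is visible even at the application level. As a minimal sketch (the function name, path, and data are illustrative), a write that must reach stable storage before it can be considered committed looks like this:

```python
import os

def synchronous_write(path, data):
    """Write data and block until the disk subsystem confirms it.

    Sketch of the synchronous-write discipline that shared-storage
    clustering requires: without the fsync(), a server crash could
    leave committed data stranded in the failed server's cache.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # wait for the data to reach stable storage
    finally:
        os.close(fd)
```

The application stalls at os.fsync() for the full disk latency on every committed write, which is the performance penalty described above.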
The multi-system access variation of shared-storage clustering presents several difficult issues that must be dealt with by the clustering software and application. Lock management and cache consistency must be provided. Because multiple systems can directly access all the data in a multi-system access configuration, the clustering software must deal with the intricacies of lock management. Before accessing any data, each system must determine whether any other system is already accessing that data. When one system fails, the other nodes must be aware of which locks belonged to that system so that they can remove those locks and access the data.
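To make the bookkeeping concrete, the following hypothetical in-memory lock table (the names are illustrative, not drawn from any particular clustering product) records which node holds each lock, so that when a node fails the survivors can identify and release its locks before touching the data:

```python
class ClusterLockTable:
    """Illustrative lock table for a multi-system-access cluster."""

    def __init__(self):
        self.locks = {}  # resource -> node currently holding the lock

    def acquire(self, resource, node):
        # Before accessing any data, a system must check whether
        # another system is already accessing it.
        holder = self.locks.get(resource)
        if holder is not None and holder != node:
            return False
        self.locks[resource] = node
        return True

    def release(self, resource, node):
        if self.locks.get(resource) == node:
            del self.locks[resource]

    def recover_failed_node(self, failed_node):
        # Remove every lock that belonged to the failed system so
        # the remaining nodes can access its data.
        stale = [r for r, n in self.locks.items() if n == failed_node]
        for resource in stale:
            del self.locks[resource]
```

In a real cluster this table must itself be distributed and kept consistent across nodes, which is precisely what makes the clustering software complex.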
In the case of cache consistency, some mechanism must ensure that data recently written by any of the cluster members to its cache is signaled to other cluster members. It is necessary either to immediately write all data to the shared-storage subsystem (disk synchronization) or to alert other cluster members to what blocks on the shared storage are invalid.
These lock management and cache consistency issues are difficult to solve because they involve the entire data path from file system to disk subsystem. The cache and lock management issues increase the total cost of high availability by requiring sophisticated configuration and system design.
Possibly the most significant drawback of shared-storage clustering is that upon a system failure, a file-system recovery operation must be run on each file system of the failed system. This recovery makes rapid (i.e., sub-second) fail-over impossible even if the other two components of the fail-over time budget (failure detection and application recovery) are instantaneous. A lengthy process must identify and roll back any data partially written by the failed server to prevent corruption and lost data. Although a journaled file system simplifies and speeds up this data integrity checking, sub-second fail-over is still not possible. In the case of single-system access, the standby server must also mount the file systems previously controlled by the downed server after conducting a data integrity check. Where nearly continuous service is required, shared-storage clustering is not appropriate because of the required file-system recovery operation.
Advantages of data replication
Synchronous data replication between two independent systems (see Figure 2) offers an alternative high-availability architecture that can deliver advantages over shared-storage clustering. It enables sub-second fail-over in a configuration that is easy to implement and maintain.
Figure 2: Data replication creates a second copy of all data onto an independent standby system, with its own separate storage. Replication takes place over a TCP/IP network.
Data replication creates a second copy of all data onto a completely independent standby system, with its own separate storage. Note that replication differs from mirroring, in that the data duplication takes place over a TCP/IP network to an independent system instead of within a RAID array or within a storage area network (SAN). Until recently, data replication has not been widely used as a high-performance solution for high availability. Instead, it has been used to provide periodic snapshots for limited disaster recovery. But, recent advances have improved the architecture for high-availability applications.
There are two methods of data replication: block-level and file-system-level replication. Block-level replication functions at the disk or volume manager level and replicates blocks of data (see Figure 3, left). Block-level replication sits below the cache.
Figure 3: Two methods of data replication are block-level (left) and file-system level.
File-system-level replication replicates data to the standby system as it enters the file system (see Figure 3, right). File-system-level replication sits above the cache, which can provide advantages over block-level replication. Because block-level replication functions below the cache, all data must be synchronously passed to the volume manager so that it can be replicated to the standby system. In file-system-level replication, the replication occurs above the cache layer, so disk synchronization is not required. More importantly, rapid fail-over is not possible with block-level replication because the standby file system must be quiesced, checked, and mounted upon fail-over. With file-system-level replication, the standby file system is always ready to take over, making file-system recovery instantaneous.
File-system-level replication involves synchronously replicating data across TCP/IP at the file-system level to an independent system. The file system does not commit a write until the data has been replicated to the other system. In the event the active system fails, the standby system has immediate access to all data that was written to the failed system. There is no possibility of lost data due to any single failure, and because replication occurs at the file-system level, no file-system recovery operation is required during a fail-over. The data replication function can take place completely within the operating system kernel, making it transparent to applications, volume managers, and storage systems. This transparency makes high availability easier to implement and maintain, which can reduce the total cost of high availability.
Although some people have expressed concern that synchronous data replication will adversely affect system performance, file-system-level replication can actually improve the performance of the local file system. This performance improvement is due to the ability to control synchronous disk writes. In many applications, synchronous writes are used to ensure that data written by the application is actually written to disk or to an independent disk subsystem. In the case of shared-storage clustering, synchronous writes are required to eliminate cache consistency issues and to prevent lost data in the case of a failure. These systems require synchronous writes to ensure that data is flushed from the cache of the server out to the storage system. Otherwise, a failure of the server leaves the cached data lost or unavailable. These synchronous writes decrease performance by making the application wait for the data to be written to disk.
In a file-system-level replication environment, synchronous disk writes can be replaced by network synchronous writes. The active system commits a write, and the application is allowed to continue, once it receives confirmation (e.g., a TCP acknowledgment) that the standby system has received the data. Thus, although the data may not yet be written to either disk, the active file system can commit the write. Because both systems have the data, no single fault will prevent the data from being written to disk. This method therefore provides the same level of safety as synchronous disk writes without the penalty of waiting for the disk.
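A minimal sketch of this commit-on-acknowledgment protocol might look like the following, assuming an illustrative wire format of a fixed header (offset and length) followed by the data; a real implementation would live in the kernel's write path:

```python
import struct

def replicated_write(sock, offset, data):
    """Network synchronous write: commit only after the standby acks.

    Sketch under assumed conventions: a 12-byte header (8-byte offset,
    4-byte length, network byte order) precedes each block, and the
    standby replies with a single b"A" acknowledgment byte.
    """
    header = struct.pack("!QI", offset, len(data))
    sock.sendall(header + data)  # ship the write to the standby
    ack = sock.recv(1)           # wait for the standby's acknowledgment
    if ack != b"A":
        raise IOError("standby did not acknowledge; cannot commit")
    # Both systems now hold the data, so the write can be committed;
    # the local disk write may proceed asynchronously, since no single
    # fault can now lose the data.
```

The application waits only for a network round trip to the standby rather than for disk latency, which is the source of the performance advantage described above.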
Synchronous file-system-level replication also provides high-availability features. First, the redundancy of the architecture guarantees that there are no single points of failure. Second, file-system-level replication enables sub-second fail-over for any application. Because the standby system is a completely independent system, file-system recovery is not required, eliminating it from the fail-over time budget. The standby system does not need to check the integrity of the file system, and the standby system already has the storage mounted. Application recovery time can also be eliminated by having a hot standby application ready to take over for a failed application. With file-system recovery and application recovery occurring almost instantly, the time required for complete fail-over reduces to failure detection. Therefore, the replication configuration provides the sub-second fail-over required by the most demanding environments.
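With the recovery steps eliminated, the fail-over budget is dominated by failure detection, which is commonly implemented as a heartbeat timeout. An illustrative monitor follows; the half-second timeout is an example chosen to fit a sub-second budget, not a value from the article:

```python
import time

class HeartbeatMonitor:
    """Illustrative heartbeat-based failure detector."""

    def __init__(self, timeout=0.5):  # example sub-second detection budget
        self.timeout = timeout
        self.last_beat = time.monotonic()

    def beat(self):
        # Called each time a heartbeat arrives from the active system.
        self.last_beat = time.monotonic()

    def peer_failed(self):
        # The peer is declared failed once no heartbeat has arrived
        # within the timeout; fail-over can then begin immediately.
        return time.monotonic() - self.last_beat > self.timeout
```

Because detection is the only remaining component of the budget, the timeout chosen here effectively sets the fail-over time.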
Because both the active and the standby systems have their own independent storage, internal server storage or external single-ported, single-controller RAID subsystems can be used instead of expensive dual-ported, dual-controller RAID subsystems. Even though up to twice the storage capacity may be required, this less-expensive storage can substantially reduce overall costs. And because the two independent systems require only a TCP/IP connection, there are no distance limitations between the systems.
Finally, data replication at the file-system level does not suffer from cache and lock management problems. Because file-system-level replication takes place above the caching process and because replication is synchronous, cache consistency and lost-cache-data issues are eliminated. The lock management issues that arise with shared-storage clustering are not present in replication configurations because each system has its own independent storage. By eliminating these issues, replication architectures can reduce the complexity of implementing and maintaining high-availability applications.
File-system-level replication offers a high-availability architecture at relatively low cost. The architecture has no single points of failure, enables sub-second fail-over, and is well suited to disaster recovery because no geographic limitations exist. Replication is transparent to applications, volume managers, and storage, which reduces the complexity and cost of implementing and maintaining high-availability applications.
Adam Stevenson is a product marketing manager for Continuous Computing (www.ccpu.com) in San Diego, CA.