Fault tolerance versus high-availability failover
Choosing a failover configuration involves weighing passive and active redundancy and asymmetric versus symmetric designs.
Lost revenue, angry customers, liability exposure, missed opportunities. The precise cost of computer downtime may vary, but for today's global, Internet-reliant businesses, nonstop availability and protection of mission-critical data are essential. Employees need continuous access to corporate databases. Customers expect 24-hour sales and service from your Web site. And if a company's systems can't deliver, the stock market analysts take no prisoners.
The issues IT managers face in selecting technologies to maximize system availability are critical. One key challenge is to implement effective failover solutions that meet high-availability requirements without undue costs and complexity.
There are many approaches to failover, but they typically involve redundant hardware or software that, in the event of a hardware or software failure, will automatically take over and perform all of the necessary tasks of the failed system. Since failover technology can be expensive, it is wise to do some homework. Armed with an understanding of the basic concepts, you will be better prepared to determine which approaches best meet your business needs.
One way to ensure availability is hardware failover, or fault tolerance. Fault-tolerant systems rely on redundant processors, memory, buses, power supplies, and disk storage. The active components on the system are continuously monitored, and in the event of a failure, the faulty component is deactivated and a backup system takes over.
Passive vs. active redundancy. In a fault-tolerant system, the redundant hardware is either passive or active. In a passively redundant system, standby systems are not used until a failure occurs. A noticeable service interruption and some loss of information will occur during the failover and restart processes. Actively redundant systems, on the other hand, operate in parallel with the primary systems to provide continuous service without noticeable interruptions.
In an actively redundant system, the results of operations performed by both systems are continuously compared. If a mismatch is detected, a diagnostic program is run to determine which component has failed. Once the fault is located, the failed component is isolated and removed from system operation.
Familiar examples of fault tolerance are mirrored disks and RAID. In a mirrored system, if one disk crashes, the data can be retrieved from its copy. Redundant RAID arrays are similar in that if one disk in an array crashes, it can be replaced and the array will rebuild the contents of that disk.
The benefits of a fault-tolerant system include:
No single point of failure or repair
Errors or failures are identified before data can be corrupted
Errors or failures are isolated to provide continuous operation
Failed components are repaired without system downtime
The failed subsystem can be brought back into the configuration, restoring full functionality with little or no service interruption.
In actively redundant systems, failures are transparent to users and applications (requiring a method of notification so repairs can be made).
The most common approach for an actively redundant fault-tolerant system is triple modular redundancy (TMR), in which the executions of three processors are passed through a "voter" and the majority result is the one used by the system. In another type of configuration, called "pair and spare," two pairs of processors are configured such that each pair backs up the other pair.
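The voting step in TMR can be illustrated with a short sketch. This is only a software illustration of the idea, not how any particular fault-tolerant product implements it (real voters are typically built in hardware); the function name tmr_vote and its return shape are assumptions made for the example.

```python
from collections import Counter


def tmr_vote(results):
    """Majority-vote over the outputs of three redundant modules.

    Returns (majority_value, faulty_indices), where faulty_indices lists
    the positions of any module whose output disagreed with the majority.
    Hypothetical helper for illustration only.
    """
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")

    # Find the most common output and how many modules produced it.
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        # All three modules disagree, so no majority exists.
        raise RuntimeError("no majority: all three modules disagree")

    # Any module that disagrees with the majority is flagged for repair.
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty


if __name__ == "__main__":
    value, faulty = tmr_vote([42, 42, 41])
    print("result used by the system:", value)
    print("modules flagged as faulty:", faulty)
```

Returning the indices of the disagreeing modules reflects the point made in the benefits list: because the failure is masked from users, the system needs some way to report it so that repairs can be made.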
Though they do a good job of failover, one of the biggest disadvantages of fault-tolerant systems is cost, since they rely on duplicate hardware that, in passively redundant designs, sits idle until its counterpart breaks down. Furthermore, actively redundant systems often involve proprietary hardware and applications that drive up the initial investment as well as the life-cycle cost.
The term high availability (HA) applies mainly to software failover, which eliminates or reduces the need for inactive redundant hardware. In an HA environment, resources are pooled, shared, and remapped in the event of a hardware or software failure. For example, an application that was running on a failed system is transferred to a working system. An HA solution can be a more efficient and less costly failover alternative, since the substitute machine is a functional, working part of the system.
High-availability systems are clustered into resource groups, and software becomes responsible for the duplication, update, and synchronization of information across redundant hardware components. Each computer monitors the others in the group by tracking a signal, or heartbeat, provided by each computer. If a particular machine's heartbeat is not heard for a certain time period, that machine is considered to have failed. When that happens, another machine in the group takes over the functions of the down machine.
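The heartbeat timeout rule described above can be sketched in a few lines. This is a minimal, single-process illustration under stated assumptions: the class name HeartbeatMonitor, the node names, and the 10-second timeout are all hypothetical, and a real HA package would receive heartbeats over a network or disk link rather than through method calls.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each machine in a group.

    Hypothetical sketch: a machine is considered failed when no heartbeat
    has been recorded from it for longer than timeout_seconds.
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # node name -> time of last heartbeat

    def record_heartbeat(self, node):
        """Note that a heartbeat from `node` has just been heard."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        """Return the sorted names of nodes whose heartbeat has timed out."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=10)  # assumed timeout
    monitor.record_heartbeat("node-a")
    monitor.record_heartbeat("node-b")
    print("failed so far:", monitor.failed_nodes())
```

The timeout is the main tuning decision: too short a value risks declaring a busy machine dead, while too long a value delays the takeover.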
HA systems are controlled by specialized software packages designed to manage resource groups of two or more computers. Since a shared disk subsystem and the network are crucial pieces of a resource group, it is common for HA software to use both a network heartbeat link and a shared disk heartbeat link to determine whether or not a system has failed. Once failover is triggered, the resource group is "in transition," and a variety of transition scripts are used by the HA software to transfer applications and data to a working machine, bring a repaired system back into operation, and perform other failover-related processes. In general, the length of time for failover to be completed can range from 30 seconds to 5 minutes.
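One plausible way to combine the two heartbeat links is sketched below. The policy shown, in which a machine is declared failed only when both links are silent, is an assumption made for illustration; the article does not specify how any particular HA package weighs the two signals, and the names Verdict and classify are hypothetical.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    LINK_FAULT = "one heartbeat link is down, machine still alive"
    FAILED = "machine failed"


def classify(network_heartbeat_ok, disk_heartbeat_ok):
    """Combine the two heartbeat signals into a single verdict.

    Assumed policy: trigger failover only when BOTH heartbeats are missing.
    If just one is missing, the machine is probably still running and the
    fault is more likely in that link than in the machine itself.
    """
    if network_heartbeat_ok and disk_heartbeat_ok:
        return Verdict.HEALTHY
    if network_heartbeat_ok or disk_heartbeat_ok:
        return Verdict.LINK_FAULT
    return Verdict.FAILED


if __name__ == "__main__":
    for net, disk in [(True, True), (False, True), (False, False)]:
        print(f"network={net}, disk={disk} ->", classify(net, disk).name)
```

Under this policy, a second, independent heartbeat path guards against a needless failover when only the network link has broken.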
Asymmetric vs. symmetric HA. High-availability resource groups can be either asymmetric or symmetric. An asymmetric configuration uses primary and stand-by machines in combination with high-availability software. When a failure occurs, the HA software initiates the necessary steps for the stand-by device to take over processing from the primary device. An asymmetric configuration typically has fewer implementation problems and is easier to establish.
In a symmetric configuration, each machine in the resource group runs its own application and is used regularly as a primary device. At the same time, each machine monitors others in the system to verify operation and takes over when a failure occurs.
Choosing the right HA configuration depends on business requirements, the systems and applications being used, and, of course, cost. While a symmetric configuration usually costs less and uses resources more efficiently than an asymmetric configuration, it is more complicated to deploy. Also note that in a symmetric configuration, post-failover performance may be hampered, since one machine is doing the work of two. Therefore, total available resources must be sufficient to provide an acceptable level of performance, even when a system goes down.
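The capacity concern in a symmetric configuration comes down to simple arithmetic, sketched here. The example assumes the failed machine's load is spread evenly over the survivors and that utilization can be expressed as a single percentage per machine; both are simplifications, and the function name and sample figures are hypothetical.

```python
def post_failover_utilization(utilization_pct, failed_index):
    """Estimate each survivor's utilization after one machine fails.

    utilization_pct: list of per-machine utilization figures, in percent.
    failed_index: position of the machine that has gone down.
    Assumes the failed machine's load is divided evenly among survivors.
    """
    survivors = [
        u for i, u in enumerate(utilization_pct) if i != failed_index
    ]
    if not survivors:
        raise ValueError("no surviving machines to take over the load")

    # Share of the failed machine's load that each survivor picks up.
    extra = utilization_pct[failed_index] / len(survivors)
    return [u + extra for u in survivors]


if __name__ == "__main__":
    # Two-machine symmetric group: the survivor does the work of two.
    print(post_failover_utilization([40, 45], failed_index=0))
    # Same per-machine load in a three-machine group.
    print(post_failover_utilization([60, 60, 60], failed_index=0))
```

In a two-machine group, the survivor's load is the sum of both machines' loads, so the combined figure must stay below 100 percent, with headroom, if post-failover performance is to remain acceptable.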
In both environments, additional scripting may be required, and the scripts for a symmetric configuration tend to be more complex. Another consideration when choosing an HA system is software transparency, since the level of transparency affects the product's complexity and portability.
Each enterprise system is unique, and failover designs must be tailored to meet the specific performance goals of your business. The goal is to strike a balance between design complexity, cost, and availability.
Jeff Wells is product manager, Storage Management Division (Rancho Cordova, CA), at Sterling Software (www.sterling.com), in Dallas, TX.