Watch out for points of failure and performance penalties in dual-controller RAID arrays.
BY TOMLINSON G. RAUSCHER
In the design of RAID controllers, it has been argued that "active-active fail-over is the cornerstone of high availability." When you consider RAID system availability, however, it helps to look beyond the RAID controller configuration at the total disk storage system. With this perspective, we find that most active-active RAID systems contain multiple single points of failure. As a result, these systems are susceptible to common failures that can cause the loss of data availability. In addition, the performance improvements provided by most active-active systems often fall short of user expectations.
An active-active RAID system uses two RAID controllers that simultaneously process I/O requests from host computers (see Figure 1). The two controllers communicate with each other, so that when one controller fails, the surviving controller
- Takes over the identity of the failed controller;
- Takes over communication to the disks to which the failed controller communicated; and
- Takes over processing all the I/O operations for the RAID system.
After this automatic fail-over process, the failed RAID controller can be hot-swapped (i.e., replaced with a functional controller while the system keeps running). The controllers then perform a failback operation and restore the system to its original configuration. Thus, just as redundant disks enable a RAID system to continue operation after a disk fails, redundant controllers in an active-active RAID system let the system continue to operate after a controller fails.
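The fail-over and failback sequence described above can be sketched as a small state model. This is an illustrative sketch only, not any vendor's firmware logic: the `Controller` class, the `fail_over` and `fail_back` functions, and the identity and channel names are all hypothetical, and the sketch assumes the two controllers normally drive disjoint sets of disk channels.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical model of one RAID controller in an active-active pair."""
    name: str
    healthy: bool = True
    identities: set = field(default_factory=set)  # host-visible identities it answers for
    channels: set = field(default_factory=set)    # disk channels it currently drives


def fail_over(failed: Controller, survivor: Controller) -> None:
    """Survivor takes over the failed controller's identity, disks, and I/O."""
    failed.healthy = False
    survivor.identities |= failed.identities
    survivor.channels |= failed.channels
    failed.identities, failed.channels = set(), set()


def fail_back(replacement: Controller, survivor: Controller,
              identities: set, channels: set) -> None:
    """After the hot-swap, return the original share to the replacement.

    Assumes the two controllers' normal channel sets do not overlap.
    """
    replacement.healthy = True
    replacement.identities = set(identities)
    replacement.channels = set(channels)
    survivor.identities -= replacement.identities
    survivor.channels -= replacement.channels


c1 = Controller("C1", identities={"id-1"}, channels={"ch-A"})
c2 = Controller("C2", identities={"id-2"}, channels={"ch-B"})

fail_over(c1, c2)                       # C1 fails; C2 now serves everything
fail_back(c1, c2, {"id-1"}, {"ch-A"})   # C1 replaced; original layout restored
```

During the window between `fail_over` and `fail_back`, the surviving controller carries the entire I/O load, which is why the model moves both the identities and the channels to it.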
Figure 1: An active-active RAID controller system can survive the failure of a disk drive or RAID controller (top). Figure 2: Active-active RAID controller configurations have multiple points of potential failure (bottom).
While an active-active RAID system can survive the failure of a disk or a RAID controller, there are several other system components whose failure causes loss of data. This is a fundamental problem with many active-active systems. For example, when a disk channel fails, the disks attached to that channel become unavailable. For RAID systems that have two disk channels and use parity RAID (such as RAID 5), the loss of the disks on a channel means the loss of data.
There are a variety of problems that can cause disk channel failure:
- The disk channel controller chip in a disk or a RAID controller fails and locks the disk channel.
- The physical disk channel itself fails (e.g., as a result of the failure of a cable, a trace, a connector, or a terminator).
In addition to these hardware failures, firmware in the disk channel controller chips can lock a disk channel and cause system failure. For the RAID system illustrated in Figure 1, for example, there are at least 30 single points of failure associated with the disk channels that can cause the RAID system to fail. Figure 2 shows some of these potential points of failure.
Typical of a large number of RAID systems, the RAID systems shown in Figures 1 and 2 have two disk channels. For more-sophisticated RAID systems that have more disk channels, the previous point can be generalized as follows:
An active-active RAID system that uses parity RAID (such as RAID 5) with n disks in a stripe set (including parity) and fewer than n disk channels has multiple single points of failure that can cause data loss.
Figure 3: In some cases, a dual-controller implementation can provide only nominal performance improvements over a single-controller configuration.
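The generalization about stripe sets and disk channels follows from a counting argument: with n disks spread over fewer than n channels, at least one channel must carry two or more disks of the stripe set, and single-parity RAID can rebuild only one missing disk. The check below is a minimal sketch of that argument; the function name and the list-of-channel-labels representation are assumptions made for illustration.

```python
from collections import Counter


def channel_single_points_of_failure(stripe_channels, tolerated_losses=1):
    """Return the channels whose failure would lose data for this stripe set.

    stripe_channels: one channel label per disk in the stripe set
                     (parity disk included).
    tolerated_losses: disks the RAID level can rebuild (1 for RAID 5).
    """
    disks_per_channel = Counter(stripe_channels)
    return sorted(ch for ch, count in disks_per_channel.items()
                  if count > tolerated_losses)


# Five-disk RAID 5 stripe set on two channels: either channel failing
# removes more than one disk, so both channels are single points of failure.
print(channel_single_points_of_failure(["A", "A", "A", "B", "B"]))  # ['A', 'B']

# Four-disk stripe set with one disk per channel: no channel is critical.
print(channel_single_points_of_failure(["A", "B", "C", "D"]))       # []
```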
While it has been written that "active-active fail-over and failback capabilities ensure nonstop online I/O operations," this is clearly not true. It is critically important to look beyond active-active controllers to determine if a RAID system will survive single failures.
In addition to disk channel failures, there are other potential points of failure in many RAID systems (e.g., the backplane into which the RAID controllers are inserted). Although the figures above do not illustrate a backplane, in the design of most active-active systems each RAID controller plugs into a common backplane. There are many ways in which a backplane can fail, causing the system to fail. Although some backplanes have been designed with only passive components to reduce the probability of failure, in most designs, an active-active RAID system that uses a single backplane has multiple single points of failure that can cause data loss.
Another design issue in an active-active RAID system that can be a source of problems is the communication link between controllers. RAID controllers use this link (sometimes called a heartbeat connection) to inform each other of their status. Should one RAID controller fail to send (or respond to) a signal, the other controller initiates fail-over activities. If the heartbeat connection fails while both controllers are operating properly, the system can become dysfunctional as each controller attempts to take over the identity of the other controller and its disks.
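The heartbeat problem can be illustrated with a toy decision model. The sketch assumes the simplest possible policy, in which a controller starts fail-over whenever it stops hearing its peer; real controllers may use timeouts, retries, or additional arbitration, and the names `simulate` and `is_split_brain` are hypothetical.

```python
def decide(heard_from_peer: bool) -> str:
    """Naive policy: take over as soon as the peer goes silent."""
    return "continue" if heard_from_peer else "take_over"


def simulate(c1_alive: bool, c2_alive: bool, link_up: bool) -> dict:
    """Return the action each live controller chooses.

    A controller hears its peer only if the peer is alive AND the
    heartbeat link works, so it cannot tell a dead peer from a dead link.
    """
    actions = {}
    if c1_alive:
        actions["C1"] = decide(c2_alive and link_up)
    if c2_alive:
        actions["C2"] = decide(c1_alive and link_up)
    return actions


def is_split_brain(actions: dict) -> bool:
    """True when more than one controller tries to take over at once."""
    return list(actions.values()).count("take_over") > 1


# Controller 2 really fails: Controller 1 takes over, as intended.
print(simulate(c1_alive=True, c2_alive=False, link_up=True))

# Only the heartbeat link fails: both healthy controllers try to take over.
conflict = simulate(c1_alive=True, c2_alive=True, link_up=False)
print(conflict, is_split_brain(conflict))
```

The second case shows why the link itself is a point of failure: from each controller's side, a broken link looks identical to a failed peer.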
From the viewpoint of reliability, active-active RAID configurations offer advantages like those of active-passive RAID systems. The additional advantage of an active-active RAID system is the potential for increased performance because both RAID controllers actively process I/O commands at the same time. Indeed, adding another controller to a system offers the opportunity to double system performance, which is what many users expect. However, in practice, the performance of an active-active RAID system is rarely double the performance of a single-controller system. An active-active system typically delivers only 30% to 40% more performance than a single-controller RAID system.
Figure 4: A RAID design with multiple disk channels per controller can avoid data loss caused by disk channel failures.
In an active-active system, like the one shown in Figure 1, the system can be configured so that Controller #1 uses different disk channels from Controller #2. In this case, the disks in the RAID sets for Controller #1 are on different disk channels from the disks in the RAID sets for Controller #2. Alternatively, Controller #1 can use some (or all) of the disk channels that Controller #2 uses. In this case, some (or all) of the disks in the RAID sets for Controller #1 are on the same disk channels as the disks in the RAID sets for Controller #2.
In the first case, each RAID controller uses only some of the disk channels available in the system, limiting the opportunity to use all the disk channels concurrently and therefore limiting performance potential. In the second case, each RAID controller uses the same disk channel to access different disks. Thus, when one controller uses a disk channel, the other controller must wait until the channel is available, limiting performance potential. In fact, as an active-active RAID system becomes increasingly busy, the conflict for disk channels increases, and the controllers spend more and more of their time resolving conflicts. As a result, the percentage performance improvement of an active-active RAID system compared to a single-controller RAID system can decrease as system utilization increases (see Figure 3).
As noted earlier, a disk channel failure is a typical single point of failure in active-active RAID systems that use parity RAID. To overcome this problem, a RAID controller should have several disk channels so that the loss of all the disks on one channel will not cause the system to lose data. For example, a controller with four disk channels can have RAID sets in which the number of disks is an integer multiple of four. In each of these RAID sets, one-fourth of the total disk space will be devoted to parity information, which in RAID-5 configurations is striped across all the disks. Should a disk channel fail and all its disks become unavailable, the RAID controller can continue to operate without losing data by using the parity information and data on the surviving disks (see Figure 4).
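The recovery step relies on the XOR property of parity: any one missing block in a stripe equals the XOR of all the others. The sketch below shows this for a single stripe with one block on each of four channels. The block contents and channel numbering are made up for illustration, and the sketch ignores the rotation of parity across disks that RAID 5 performs from stripe to stripe.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block, one per channel.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(data)
stripe = {0: data[0], 1: data[1], 2: data[2], 3: parity}  # channel -> block


def rebuild(stripe, failed_channel):
    """Recompute the block on a failed channel from the surviving channels."""
    survivors = [block for channel, block in stripe.items()
                 if channel != failed_channel]
    return xor_blocks(survivors)


# Channel 1 fails: its data block is recovered from the other three channels.
assert rebuild(stripe, failed_channel=1) == data[1]
```

If two blocks of the same stripe sat on one channel, losing that channel would remove two blocks and the XOR could not recover either of them, which is why each parity group needs one disk per channel.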
A RAID controller with several disk channels can reduce or eliminate disk channel problems that cause active-active RAID systems to lose data. Similarly, a RAID controller designed with several disk channels can significantly improve system performance. First, a RAID controller with several disk channels will have greater bandwidth than a controller with fewer disk channels, thus providing the opportunity for greater system performance. Second, comparing two systems with the same number of disks, the system with the greater number of disk channels can have a smaller number of disks per channel. This can reduce the opportunity for channel conflict and thus improve performance.
Contrary to popular belief, most active-active RAID systems have multiple single points of failure that can cause data loss. Furthermore, active-active RAID systems can provide performance that is far less than twice the performance of a single controller. RAID controllers that spread redundancy across several disk channels are a partial solution to these problems.
Tomlinson G. Rauscher is president of Digi-Data Corp. (www.digidata.com) in Jessup, MD.