A review of RAID levels, implementations, and RAID 5 write operations.
BY PAUL LUSE
A solid understanding of how RAID works, particularly RAID 5 write operations, will help IT professionals see how the various features of a particular RAID product affect its operation and performance.
The most common RAID implementations are host-based, hardware-assisted, and intelligent RAID. Host-based RAID (sometimes called software RAID) runs on the host CPU, does not require special RAID hardware, and uses native drive interconnect technology. In host-based RAID, the server's available bandwidth for application processing is reduced because the host CPU sacrifices cycles to handle RAID operations, including eXclusive OR (XOR) calculations, data-mapping, and interrupt processing.
Hardware-assisted RAID combines a drive interconnect protocol chip with a hardware ASIC, located on either an add-in card or the motherboard, that typically performs the XOR operations. Think of hardware-assisted RAID as an accelerated host-based solution, because the actual RAID application still executes on the host CPU. Hardware-assisted RAID enhances host-based RAID by offloading the XOR calculations to the ASIC. However, the host CPU is still responsible for all other RAID operations, so overall server performance still suffers.
With intelligent RAID, the host CPU is not a part of the RAID subsystem. The RAID application and XOR calculations execute on a separate I/O processor. Intelligent RAID implementations cause fewer host interrupts because they offload RAID processing from the host CPU.
There are several types, or levels, of RAID, each offering different performance and data-protection characteristics. A key concept in RAID is abstraction, which is the practice of "hiding" the details of an implementation to provide simplicity at a higher layer.
Figure 1: In a RAID 0 configuration, all of the strips across a row are referred to as a stripe.
In RAID, multiple disk drives are combined into a disk array via the RAID controller and appear to the host as a single logical disk, or an "abstraction" in RAID terminology. The most popular RAID levels are 0, 1, and 5.
RAID 0 is commonly referred to as "disk striping," because the RAID subsystem actually lays out the data across a number of disks in "stripes" to take advantage of parallel processing over all of the disks. Figure 1 illustrates this concept.
A RAID 0 array is much faster for both reads and writes than a single disk because all of the disks can work in parallel. RAID 0 is typically used in applications where performance requirements outweigh data-protection requirements. In Figure 1, each Dn represents a chunk of data, often referred to as a strip. All of the strips across a row are referred to as a stripe. So, D1 represents a strip and D1...D4 collectively represent a stripe.
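The strip-to-stripe relationship can be sketched in a few lines of Python. This is only an illustration, assuming a simple round-robin layout across four disks in which consecutive strips land on consecutive disks; the function name `raid0_locate` and the 0-based numbering are hypothetical, and real controllers may map data differently.

```python
def raid0_locate(strip_index: int, num_disks: int) -> tuple[int, int]:
    """Map a 0-based logical strip index to (disk, stripe_row), both 0-based.

    Assumes plain round-robin striping: consecutive strips go to
    consecutive disks, and a new stripe row starts after the last disk.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    if strip_index < 0:
        raise ValueError("strip_index must not be negative")
    stripe_row, disk = divmod(strip_index, num_disks)
    return disk, stripe_row


# On a four-disk array, D1...D4 share the first stripe row; D5 starts the next.
for i in range(6):
    disk, row = raid0_locate(i, num_disks=4)
    print(f"D{i + 1}: disk {disk}, stripe row {row}")
```

Because each strip in a row sits on a different disk, a request that spans a whole stripe can keep every disk busy at once.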
Figure 2: A RAID 1 array provides faster reads than a single disk but slightly lower write performance.
RAID 1 is commonly referred to as "mirroring" because data is essentially duplicated over two or more disks. RAID 1 is typically used in applications where data protection is more important than performance. The RAID 1 array in Figure 2 provides faster read capability than a single disk but slightly lower write performance. RAID 1 might be used to mirror the boot volume of a server, where protecting the operating system drive is critical.
RAID 5 protects the data on all of the disks in an array using the capacity of just one disk, equal in size to the smallest disk in the array. For example, let's say you have a Web server with five disks in an application where failure of one disk must not cause server downtime. If each disk is 72GB, the total usable capacity for a five-disk RAID 5 array is 288GB (RAID 5 usable capacity equals s * (n - 1), where s is the capacity of the smallest disk in the array and n is the total number of disks in the array).
Figure 3: In RAID 5, parity data is on a different stripe on each disk, which is referred to as parity rotation.
In this example, the capacity of a single 72GB disk guarantees that any one of the five disks making up the 288GB array can fail and all of the data will be safe. Now let's say you have a 15-disk array of 72GB disks. The capacity of a single 72GB disk still protects the entire 1,008GB (14 * 72GB) array. Not only does a RAID 5 array offer this efficient method of protecting data, but it also has read performance similar to that of a RAID 0 array, while write performance is only slightly lower than that of a single disk. For these reasons, RAID 5 is very popular for general-purpose servers such as file and Web servers.
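The capacity arithmetic for both arrays can be checked with a short Python sketch. The function name `raid5_usable_capacity_gb` is hypothetical, and the three-disk minimum it enforces is an assumption based on common practice rather than something stated here.

```python
def raid5_usable_capacity_gb(disk_sizes_gb: list[int]) -> int:
    """Usable RAID 5 capacity: s * (n - 1).

    s is the smallest disk in the array and n is the number of disks.
    The three-disk minimum is an assumption for this sketch.
    """
    n = len(disk_sizes_gb)
    if n < 3:
        raise ValueError("this sketch assumes at least three disks")
    s = min(disk_sizes_gb)
    return s * (n - 1)


print(raid5_usable_capacity_gb([72] * 5))   # five 72GB disks
print(raid5_usable_capacity_gb([72] * 15))  # fifteen 72GB disks
```

The two calls print 288 and 1008, matching the five-disk and 15-disk examples.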
How can a single disk protect the data on any number of other disks? The primary calculation is based on the very simple Boolean XOR operation. XOR is both an associative and commutative operation, meaning that neither the order of the operands nor the grouping affects the outcome of the operation. XOR is also a binary operation and only has four possible combinations of two operands. Simply put, two operands have a true XOR result when one and only one operand, exclusively, has a value of 1.
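For readers who prefer to see it spelled out, the following Python sketch prints the four operand combinations and checks the commutative and associative properties by brute force over single bits. The variable names are illustrative only.

```python
from itertools import product

# The four possible combinations of two one-bit operands.
xor_table = {(a, b): a ^ b for a, b in product((0, 1), repeat=2)}
for (a, b), result in xor_table.items():
    print(f"{a} XOR {b} = {result}")

# Commutative: the order of the operands does not matter.
commutative = all((a ^ b) == (b ^ a) for a, b in product((0, 1), repeat=2))

# Associative: the grouping of the operands does not matter.
associative = all(
    ((a ^ b) ^ c) == (a ^ (b ^ c)) for a, b, c in product((0, 1), repeat=3)
)

print("commutative:", commutative)
print("associative:", associative)
```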
Implementing the XOR function in dedicated hardware (an XOR ASIC or an I/O processor with an integrated XOR function) greatly increases the throughput of data requiring this operation. Every byte of data stored to a RAID 5 volume requires XOR calculations. Understanding how an XOR works is critical to understanding how RAID 5 can protect so much data with such limited extra capacity.
Figure 3 represents a data map of a typical four-disk RAID 5 application.
Note that we have introduced a new data element called Pn. P stands for parity data, which is simply the result of an XOR operation on all other data elements in its stripe. To find the XOR result of multiple operands, one would start by simply performing the XOR operation of any two operands. Then, perform an XOR operation on the result with the next operand, and so on, with all of the operands until the final result is reached.
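The pairwise procedure just described maps directly onto a fold. Below is a minimal Python sketch, assuming each strip is held in memory as a `bytes` object of equal length; the helper name `xor_strips` and the two-byte sample strips are made up for illustration.

```python
from functools import reduce


def xor_strips(strips: list[bytes]) -> bytes:
    """XOR equal-length strips together, two operands at a time."""
    if not strips:
        raise ValueError("need at least one strip")
    length = len(strips[0])
    if any(len(s) != length for s in strips):
        raise ValueError("all strips must be the same length")

    def xor_pair(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    # reduce() XORs the first two strips, then XORs that result with the
    # next strip, and so on until every operand has been folded in.
    return reduce(xor_pair, strips)


# Hypothetical two-byte strips standing in for D1, D2, and D3.
d1, d2, d3 = b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"
p1 = xor_strips([d1, d2, d3])
print(p1.hex())
```

Because XOR is commutative and associative, passing the strips in any order produces the same parity.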
Figure 4: If a disk in an array fails, the missing data for any stripe is determined by performing an XOR operation on all of the remaining data elements for that stripe.
Note how the location of the parity data is on a different stripe on each disk. This is called parity rotation and is done for performance reasons.
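One way to picture parity rotation is to generate the data map row by row. The Python sketch below assumes that parity starts on the last disk and moves one disk to the left with each stripe, which matches the rows named in this article (D1, D2, D3, P1 and D4, D5, P2, D6). The function name `raid5_stripe_layout` is hypothetical, and other rotation schemes exist.

```python
def raid5_stripe_layout(row: int, num_disks: int) -> list[str]:
    """Return the element labels for one stripe row (row is 0-based).

    Assumption: parity starts on the last disk and rotates one disk
    to the left on each successive stripe.
    """
    if num_disks < 3:
        raise ValueError("this sketch assumes at least three disks")
    parity_disk = (num_disks - 1 - row) % num_disks
    next_data = row * (num_disks - 1) + 1  # first data label in this row
    labels = []
    for disk in range(num_disks):
        if disk == parity_disk:
            labels.append(f"P{row + 1}")
        else:
            labels.append(f"D{next_data}")
            next_data += 1
    return labels


for row in range(4):
    print(" ".join(f"{label:>4}" for label in raid5_stripe_layout(row, 4)))
```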
Before we dive into how the write operation works, let's look at how a RAID 5 volume can tolerate the loss of any one disk without data loss. Figure 4 is the same as Figure 3, except we've added some arbitrary data values to the elements. Assume that each element represents a single bit. In real implementations, each data element would represent the total amount of data in a strip; typical strip sizes range from 32KB to 128KB. Recall how parity is calculated. For the first stripe, P1 = D1 XOR D2 XOR D3. In this example, the XOR result of D1 and D2 is 1, and the XOR result of 1 and D3 is 0. Thus P1 is 0.
The dark shaded disk in Figure 4 represents a failed disk. In this situation, the disk array is typically considered degraded. The missing data for any stripe is easily determined by performing an XOR operation on all of the remaining data elements for that stripe. If the host requests a RAID controller to retrieve data from a disk array that is in a degraded state, the RAID controller must first read all of the other data elements on the stripe, including the parity data element. It then performs all of the XOR calculations before it returns the data that would have resided on the failed disk. All of this happens without the host being aware of the failed disk, and array access continues.
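A degraded read can be sketched in Python as follows. The bit values are illustrative only, chosen to agree with the parity arithmetic given earlier (P1 = 0) rather than read from the figure, and the function name `reconstruct_missing` is hypothetical.

```python
from functools import reduce
from operator import xor


def reconstruct_missing(surviving_elements: list[int]) -> int:
    """Rebuild the element on the failed disk by XORing everything
    that remains on the stripe, parity included."""
    return reduce(xor, surviving_elements)


# Illustrative one-bit values for the first stripe (not taken from Figure 4).
d1, d2, d3 = 1, 0, 1
p1 = d1 ^ d2 ^ d3            # parity for the stripe
stripe = [d1, d2, d3, p1]

# Fail each disk in turn and rebuild its element from the other three.
for failed in range(len(stripe)):
    survivors = stripe[:failed] + stripe[failed + 1:]
    rebuilt = reconstruct_missing(survivors)
    print(f"disk {failed} failed: rebuilt {rebuilt}, original {stripe[failed]}")
```

The same XOR recovers a lost parity element just as it recovers lost data, which is why any single disk in the array can fail.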
Figure 5: Diagram illustrates a four-disk RAID 5 read-modify-write operation.
However, a second disk failure will result in total failure of the logical array, and the host will no longer have access. Most RAID controllers will rebuild the degraded array automatically if there is a spare disk available, returning the array to normal. Most RAID applications include applets or system management hooks to notify a system administrator when such a failure occurs. This notification allows the administrator to rectify the problem before another disk fails.
It should now be clear how RAID 5 protects a disk array from failure of a single disk and how the XOR operation provides the capability to reconstruct data from the remaining disks. But how does the parity data get generated to begin with? The RAID 5 write operation, typically referred to as a read-modify-write operation, is responsible for generating parity data.
Consider a stripe composed of four strips of data and one strip of parity. Suppose the host wants to change just a small amount of data that takes up the space on only one strip within the stripe. The RAID controller cannot simply write that small portion of data and consider the request complete. It must also update the parity data. Remember that the parity data is calculated by performing XOR operations on every strip within the stripe. So, when one or more strips change, parity needs to be recalculated.
Figure 5 shows a typical read-modify-write operation where the data that the host is writing to disk is contained within just one strip and identified in position D5. To perform the read-modify-write operation:
1. Read new data from host: The host operating system is requesting that the RAID subsystem write a piece of data to location D5 on disk.
2. Read old data from target disk for new data: By reading only the data in the location that is about to be written to, we've eliminated the need to read all of the other disks. We also have taken care of the scalability problem because the number of operations involved in the read-modify-write is the same regardless of the number of disks in the array.
3. Read old parity from target stripe for new data: From the previous step we have the old data sitting in memory. We now pull in the old parity. This read operation is independent of the number of physical disks in the array.
4. Calculate new parity with an XOR calculation on the data from steps 1, 2, and 3: The XOR calculation of steps 2 and 3 gives us the resultant parity of the stripe if it were totally absent of the target data element's contribution. Calculating the new parity for the stripe containing the new data is simple: Perform the XOR calculation on the new data with the result of the XOR procedure performed in 2 and 3.
5. Handle coherency issue: This step isn't documented in Figure 5 because its implementation varies greatly from vendor to vendor. Ensuring coherency essentially means dealing with the time period from the start of step 6 to the end of step 7. For the disk array to be considered coherent, or clean, the subsystem must ensure that the parity data block is always current for the data on the stripe. Since it's not possible to guarantee that the new target data and new parity can be written to separate disks at exactly the same instant, the RAID subsystem must take some measure to identify that the stripe being processed is inconsistent, or dirty, in RAID vernacular.
6. Write new data to target location: The data was received from the host and the RAID mappings determine which physical disk and where on the disk the data is to be written.
7. Write new parity: The new parity was calculated in step 4; now the RAID subsystem just writes it to disk.
8. Handle coherency: (See disclaimer on step 5.) Once the RAID subsystem is assured that both steps 6 and 7 have been successfully completed and the data and parity are both on disk, the stripe is considered coherent.
In this example, assume that D5new = 0, D5old = 1, and P2old = 0. Processing step 4 on this data yields 0 XOR 1 XOR 0 = 1. This is the resultant parity element P2new. The second row in Figure 5 following the read-modify-write procedure is D4 = 1, D5 = 0, P2 = 1, and D6 = 0.
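The numbered steps and the worked values above can be replayed in a short Python sketch. It models one stripe as a list of bits in disk order and leaves out the vendor-specific coherency handling entirely. The function name `read_modify_write` and the list-based "disk" are assumptions made for illustration.

```python
def read_modify_write(stripe: list[int], data_index: int,
                      parity_index: int, new_data: int) -> int:
    """Update one data element and its parity in place; return the new parity.

    The coherency steps are omitted because they vary by vendor.
    """
    old_data = stripe[data_index]                   # step 2: read old data
    old_parity = stripe[parity_index]               # step 3: read old parity
    new_parity = new_data ^ old_data ^ old_parity   # step 4: XOR all three
    stripe[data_index] = new_data                   # step 6: write new data
    stripe[parity_index] = new_parity               # step 7: write new parity
    return new_parity


# Second row before the write, in disk order: D4=1, D5=1, P2=0, D6=0.
row2 = [1, 1, 0, 0]
p2_new = read_modify_write(row2, data_index=1, parity_index=2, new_data=0)
print("P2new =", p2_new)
print("row after write:", row2)
```

Note that D4 and D6 are never read, yet the updated parity still equals D4 XOR D5 XOR D6 for the new contents of the row.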
This optimized method has several interesting characteristics, the key one being that it is fully scalable. The number of read, write, and XOR operations is independent of the number of disks in the array. Also notice that the parity disk is involved in every write operation (steps 3 and 7). This is why parity is rotated to a different disk with each stripe. If the parity were all stored on the same disk all of the time, that disk would most likely become a performance bottleneck.
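To make the scalability point concrete, the Python sketch below counts operations for a one-strip write under two approaches: the read-modify-write described here, and a hypothetical alternative that recomputes parity by reading every other data strip in the stripe. The second approach and both function names are assumptions introduced only for comparison.

```python
def read_modify_write_ops(num_disks: int) -> dict[str, int]:
    """Operation counts for read-modify-write: the same for any array size."""
    if num_disks < 3:
        raise ValueError("this sketch assumes at least three disks")
    # Read old data and old parity, XOR twice, write new data and new parity.
    return {"reads": 2, "xors": 2, "writes": 2}


def recompute_from_stripe_ops(num_disks: int) -> dict[str, int]:
    """Operation counts if parity were rebuilt from the whole stripe
    (a hypothetical alternative, shown only for comparison)."""
    if num_disks < 3:
        raise ValueError("this sketch assumes at least three disks")
    other_data_strips = num_disks - 2  # every data strip except the target
    return {"reads": other_data_strips, "xors": other_data_strips, "writes": 2}


for n in (4, 8, 15):
    print(n, read_modify_write_ops(n), recompute_from_stripe_ops(n))
```

Under these assumptions the two approaches cost the same on a four-disk array, but only the read-modify-write counts stay flat as disks are added.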
An interrupt is simply a request from a system component for CPU time. I/O subsystems generate a host CPU interrupt upon completing an I/O transaction. So, consider a write operation to a four-disk RAID 5 array employing host-based RAID, hardware-assisted RAID, and intelligent RAID applications. Let's again assume that we have the simplest of transactions: a one-bit write.
In the case of host-based RAID, the host is responsible for mapping the data to various disks. So, the host must generate each read and write required to perform the read-modify-write operation. If you add them up, the host CPU should get four completion interrupts from the subsystem-two reads and two writes (steps 2, 3, 6, and 7 in the example).
A hardware-assisted RAID solution would also generate four completion interrupts, because it offloads only the XOR calculations to an ASIC and the host still issues every disk read and write. The I/O processor in an intelligent RAID subsystem typically has the ability to "hide" the interim read and write operations from the host via various integrated peripherals.
In an I/O processor-based subsystem, only a single completion interrupt is sent to the host. The I/O processor handles all of the others, freeing the host CPU to perform other non-RAID-related tasks.
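The interrupt tally for the three implementations can be summarized in a small Python sketch. It simply encodes the counts described above for a single read-modify-write; the names `RMW_DISK_IOS` and `host_completion_interrupts` are hypothetical, and real controllers may coalesce or otherwise change these numbers.

```python
# The four disk I/Os in one read-modify-write (steps 2, 3, 6, and 7).
RMW_DISK_IOS = (
    "read old data",
    "read old parity",
    "write new data",
    "write new parity",
)


def host_completion_interrupts(implementation: str) -> int:
    """Host CPU completion interrupts for one read-modify-write."""
    if implementation in ("host-based", "hardware-assisted"):
        # The host issues every disk I/O itself, so it sees every completion.
        return len(RMW_DISK_IOS)
    if implementation == "intelligent":
        # The I/O processor absorbs the interim completions and signals once.
        return 1
    raise ValueError(f"unknown implementation: {implementation}")


for kind in ("host-based", "hardware-assisted", "intelligent"):
    print(f"{kind}: {host_completion_interrupts(kind)} interrupt(s)")
```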
With a detailed understanding of a RAID 5 operation, you should be able to better evaluate how important interrupt offload is to your application. If you want to keep the host CPU free of the additional interrupts needed to perform a read-modify-write operation, you'll need intelligent RAID. By understanding RAID 5, you have a big advantage when choosing a solution.
Paul Luse is a senior software architect at Intel Corp. (www.intel.com) in Santa Clara, CA.