Controllers are the "brains" of RAID subsystems. Understanding how they work is key to choosing the best array for your applications.
By Steve Garceau
Traditionally, high-end RAID systems have targeted enterprise-class mission-critical applications requiring high-availability features, while lower-priced configurations have provided basic levels of fault tolerance for workstation, department, and small-server environments. Today, RAID systems are bridging the gap between high- and low-end markets. Technologies, terminology, and implementations previously only applicable to the high end are making their way downstream, making a review of RAID technology and definitions useful. Here are some questions and answers to get you started.
How does parity work?
RAID controllers protect data by withstanding the failure of any single disk drive within a set of drives. Although actual implementations of RAID concepts are complex, they can be illustrated in a simple example.
RAID controllers use a parity disk to ensure that if a disk drive fails, the controller can reconstruct missing data.
Assume four disks are being managed as one RAID group, or set (see figure). Each write from the host computer requires an update to one or more disk drives with a corresponding update to the parity location or drive. The term "location or drive" is used because RAID 3 and RAID 4 use a dedicated-parity methodology: one drive contains the parity information for the entire RAID set. In a RAID-5 implementation, however, parity is distributed across the RAID set members.
Every time data is updated on disk one, two, or three, the RAID controller updates the corresponding location on the parity disk drive to reflect the sum of the three disks. In this example, it means that the parity drive contains a value of 13 in the location corresponding to the value of four on disk one, six on disk two, and three on disk three (i.e., 4 + 6 + 3 = 13).
The host system has no knowledge about the information on the parity disk. The RAID controller uses the parity disk to ensure the controller can reconstruct missing data if a disk fails. The RAID controller subtracts the surviving disk values from the parity disk to come up with the value of the failed drive. For example, if disk two fails, the RAID controller can figure out its value by using the following equation:
Parity - disk one - disk three = disk two (13 - 4 - 3 = 6).
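The arithmetic above can be sketched in a few lines of Python. This follows the article's simplified sum-based parity; real controllers use bitwise XOR, which behaves the same way (combining parity with the surviving strips recovers the missing one).

```python
# Illustration of the sum-based parity example above. Real RAID
# controllers use bitwise XOR (parity = d1 ^ d2 ^ d3), but the
# reconstruction principle is identical.

disks = {"disk1": 4, "disk2": 6, "disk3": 3}
parity = sum(disks.values())  # 4 + 6 + 3 = 13

# Simulate the failure of disk two and reconstruct its value by
# subtracting the surviving disks from parity.
failed = "disk2"
survivors = [v for k, v in disks.items() if k != failed]
reconstructed = parity - sum(survivors)  # 13 - 4 - 3 = 6
```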
What is a RAID write hole?
To write to a disk array, a server makes the RAID controller perform many steps in a read-modify-write sequence. Some of these steps are usually performed in cache memory for added performance. Because this procedure involves multiple steps, it is vulnerable to RAID write holes, i.e., a partial update caused by unexpected interruptions (e.g., electrical power loss, controller fault, etc.).
Consider the following example. In a disk array, the sum of the individual disk data adds up to the parity value: 4 + 6 + 3 = 13 (see above figure). Suppose a host write requires an updated disk value, necessitating a recalculated parity value. If the disk value of 4 is updated to 5, but the process is interrupted before the parity is recalculated, the disk array would be left with the invalid expression: 5 + 6 + 3 = 13. If a drive in this array failed, the reconstructed data would be incorrect and undetectable by the host.
To maintain data integrity, most RAID controllers keep track of operations in progress so that, following an interruption, the controller knows exactly which steps of the update sequence have and have not been completed so the sequence can be completed without corrupting data.
Why is mirrored cache important?
Almost all hardware RAID controllers use cache memory to enhance performance. Cache is used as a staging area for information destined for the disk drives and for information coming from the drives to the host. Cache memory also houses vital "housekeeping" information.
Redundant RAID controller configurations are designed to provide continuous data access in the event of a controller failure, with the surviving controller assuming the entire workload.
But how does the surviving controller pick up where the failed controller left off if the cache memory of the failed controller is inaccessible? Remember: The cache memory has in-progress housekeeping information and unstaged data resident in it. If a copy of cache memory is not kept somewhere (e.g., mirrored), it is a single point of failure.
Mirrored cache is a fault-tolerant technique typically used by storage systems with dual RAID controllers. Cache can be mirrored in two ways: a) each controller can keep two copies of its cache data or b) each controller can keep its cache data and a copy of the other's cache data.
For maximum fault tolerance, a redundant RAID controller ensures that two copies of each controller's cache data exist before acknowledging to the host that the write operation is complete. Should a memory module fail, another copy of its data is available. If a controller fails, the mirrored contents of its cache can be used to write to disk the data stored in the cache of the failed controller. Mirroring cache ensures high availability; the system can withstand a memory or controller failure, without loss of data.
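The sequence can be pictured with a small simulation. This is a hedged sketch under the second mirroring scheme above (each controller keeps its own cache plus a copy of its peer's); the class layout and method names are illustrative, not an actual firmware design:

```python
# Sketch of mirrored write caching with dual controllers: a write is
# acknowledged only after a second copy exists in the peer's mirror,
# so the survivor can destage the failed controller's data.

class Controller:
    def __init__(self, name):
        self.name = name
        self.cache = {}    # this controller's dirty write data
        self.mirror = {}   # copy of the peer controller's dirty data
        self.peer = None

    def write(self, block, value):
        self.cache[block] = value
        self.peer.mirror[block] = value  # second copy made first...
        return "ack"                     # ...only then acknowledge the host

    def take_over(self, disk):
        # Peer failed: flush its mirrored cache contents to disk.
        disk.update(self.mirror)
        self.mirror.clear()

a, b = Controller("A"), Controller("B")
a.peer, b.peer = b, a
disk = {}
a.write(100, "payload")
b.take_over(disk)  # A fails; B destages A's data from the mirror copy
```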
Why is battery or UPS backup important?
Among other things, cache memory in a RAID controller can improve performance. The host performs a write to the storage system, and the RAID controller accepts the write into its cache memory and tells the host the write is complete, enabling it to attend to the next request. Eventually, the data in cache memory is written to the disk drives.
But what if a power failure occurs before all the data in the cache has been written to the disk drives? Remember: The host believes the write is complete, even though the data is still sitting in volatile cache memory, that is, memory that cannot retain its data without power. Without a battery or uninterruptible power supply (UPS), that data will be lost and the host will deliver an I/O error the next time that data is requested.
Most RAID vendors recommend or even require battery or UPS protection. Should primary power fail, a battery or UPS provides the RAID controller with ample time to write the data from volatile cache memory to non-volatile disk drives or at least enough power for the cache to maintain its data contents until primary power is restored.
What is write gathering?
When a host writes to a RAID 5 disk array, multiple I/O steps are performed to disk drives, including data and parity updates. Breaking down the I/O, a stripe of data consists of strips of data on each drive and associated parity information (see figure below). When a host wants to update data, it may only be updating a single strip on a drive.
A stripe of data consists of strips of data on each drive and associated parity information.
In an active system, the host may subsequently issue multiple drive updates in the same general addressing area so that the other drives of the stripe are also updated.
Most RAID controllers use write gathering to group multiple strips in cache until the assembled strips make a stripe. Once the entire stripe is assembled, it is written along with the parity data to the drives. Eliminating the need to recalculate and rewrite the parity strip for every individual strip update minimizes the number of I/Os and maximizes cache use. The end result is improved performance.
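A toy model of write gathering, assuming a three-data-drive RAID-5 stripe and XOR parity (the cache structure and names are illustrative):

```python
# Sketch of write gathering: strip updates accumulate in cache, and
# only when a full stripe is assembled is parity computed once and
# the whole stripe flushed in a single pass.

STRIPS_PER_STRIPE = 3  # assumed stripe width for this illustration

class GatheringCache:
    def __init__(self):
        self.pending = {}   # stripe -> {strip index: data}
        self.flushed = []   # (stripe, strips, parity) written to disk

    def write_strip(self, stripe, strip_index, data):
        strips = self.pending.setdefault(stripe, {})
        strips[strip_index] = data
        if len(strips) == STRIPS_PER_STRIPE:
            # Full stripe assembled: one parity calculation, one flush.
            parity = 0
            for d in strips.values():
                parity ^= d
            self.flushed.append((stripe, dict(strips), parity))
            del self.pending[stripe]

cache = GatheringCache()
cache.write_strip(0, 0, 4)
cache.write_strip(0, 1, 6)  # still gathering, nothing flushed yet
cache.write_strip(0, 2, 3)  # stripe complete: flushed with its parity
```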
What is a distributed lock manager?
As storage area networks (SANs) and clustering gain popularity, sharing storage and associated data among like and unlike servers becomes increasingly necessary. To effectively implement multi-server data sharing, each server in the SAN must contain distributed lock manager (DLM) software to manage access to the RAID storage system. The software protects against data corruption if two or more servers try to update the same data concurrently.
What is LUN mapping?
Logical unit number (LUN) mapping is a SAN-enabling technique that is key to sharing storage among multiple network servers. With a RAID storage system, RAID sets are often subdivided into groups known as LUNs, which the OS acknowledges when addressing usable storage.
With LUN mapping, each server can only see designated LUNs in the storage pool. In a homogeneous OS environment, one or more LUNs can be designated to a particular server on the network, or SAN, eliminating the need for a distributed lock manager.
In a heterogeneous Unix/NT environment, LUN mapping also eliminates the need for common file systems.
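The mapping itself amounts to a per-server visibility table. A hedged sketch, with hypothetical server names and LUN numbers:

```python
# Illustration of LUN mapping: the storage system keeps a view table,
# and a server's I/O is permitted only to LUNs present in its view.

lun_map = {
    "unix-server": {0, 1},  # sees LUNs 0 and 1 in the storage pool
    "nt-server":   {2},     # sees LUN 2 only
}

def can_access(server, lun):
    # Unknown servers see no LUNs at all.
    return lun in lun_map.get(server, set())
```

Because each server sees only its designated LUNs, the two operating systems never contend for the same storage, which is why no distributed lock manager or common file system is needed.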
What is the role of a hub?
Fibre Channel hubs provide fault tolerance in SAN environments. On a Fibre Channel Arbitrated Loop (FC-AL), each node works as a repeater for all other nodes on the loop. This means that if one node fails, the entire loop fails.
Hubs provide routing via loop resiliency circuitry (LRC) to automatically bypass failed or unused ports. This functionality allows the loop to be self-healing.
Fibre Channel hubs can simplify cabling on a loop. Rather than breaking the loop to add another node, hubs can be used in a star configuration while attached directly to the loop. Nodes can be added or removed from the loop by plugging them directly into the hub.
Furthermore, Fibre Channel hubs feature loop management functionality such as signal re-timing to eliminate signal jitter and invalid transmissions, as well as signal regeneration to facilitate cascading. SANs requiring high availability may use multiple hubs to ensure no single point of failure.
When is a hub required?
Hot swapping a controller physically breaks the loop. Access to any device on the loop, and associated data, is then lost. One solution is to place a Fibre Channel hub between the storage system and the loop. The hub's LRCs (see above) are designed to sense when a device is removed from the loop and automatically bypass the device, thus maintaining loop integrity. On the downside, while cost per port is dropping, the approach can be costly.
Alternatively, some RAID systems now incorporate LRCs on the controllers. The embedded LRCs allow users to hot-swap a failed controller without breaking the loop. This approach is also significantly less expensive.
Do benchmarks gauge RAID performance?
Speed can be a confusing issue when comparing RAID controllers and subsystems due to a number of misleading performance claims. Generally speaking, many performance claims require qualification, even though most (expressed in I/Os per second) are based on the de facto standard Iometer benchmark.
Some RAID vendors claim astounding data transfer rates that magically exceed the capability of the interface, thanks to a number of tricks. One way is to simply read the same data from cache over and over without ever actually reading from disk. Obviously, cache is faster than disk (nanoseconds vs. milliseconds), and with optimized benchmarking software, I/O rates can be impressive.
However, "your mileage may vary." Some RAID controllers have an architecture that makes them better at reading than writing or dealing with random data access vs. contiguous (sequential) access, for example.
If possible, the best way to assess performance is to benchmark the RAID system in your environment running your applications or to ask for references at similar sites. The way an application or operating system reads and writes to storage and the number of users or data streams determine RAID performance.
Steve Garceau is storage product manager at CMD Technology (www.cmd.com), in Irvine, CA.