Do your RAID controllers pass the test?

In addition to performance, data availability and protection are critical.


RAID has become the methodology of choice for attaching storage to all but the smallest systems. The popularity of RAID can primarily be attributed to three key user requirements: increased performance, data integrity, and data availability. While a RAID controller's performance capabilities are relatively easy to demonstrate, data availability and data integrity capabilities are another story.

Currently, there are no benchmark programs available for end users to quantify a RAID controller's robustness in these areas. Predicting how a controller will behave when disaster strikes requires an understanding of the controller's hardware and firmware architecture.

RAID controller technology was developed to provide data integrity and data availability. Therefore, all RAID controllers provide high levels of functionality in these two areas, right? Product literature abounds with claims of no single point of failure, automatic fail-over, parity checking, and so on. However, these claims are not always true. Not all RAID controllers are created equal, and, as with any purchase, it is the customer's responsibility to fully understand the capabilities of the product.

Mirroring RAID controller cache using....

The goal of any high-availability storage system is to combine RAID techniques with a hardware/firmware implementation to ensure the highest degree of data availability and integrity. Most high-availability RAID arrays provide the protection necessary to survive the failure of a major component such as a controller, cache memory, or power supply, the high-level components whose failure could interrupt data accessibility. However, there are significant differences among RAID systems in more subtle areas. These differences determine whether a system has complete protection against data access interruption and, more importantly, loss of data integrity.


While knowing all of the "ones and zeros" of a RAID controller's architecture is unnecessary, key questions will help determine how well a RAID controller will perform in adverse conditions:

How does the RAID controller deal with media errors on the disk?

This is one of those not-so-obvious differences between RAID controllers: the management of disk media errors (bad areas on a disk surface that cannot reliably store data for subsequent retrieval). Dealing effectively with disk media errors is critical to guaranteeing high data integrity. RAID suppliers employ several techniques to keep track of data that cannot be reconstructed.

The different methods involve trade-offs among performance, capacity, and complexity. Consider both the RAID system architecture and the specific environment in which it will be used to determine which method is most appropriate. Regardless of method, however, choose a RAID system that uses some technique for disk media error management, because not all do. Whether the RAID solution is software- or hardware-based, bus-based or external, disk media errors must be addressed.

Does the RAID controller perform proactive subsystem diagnosis and corrective action?

It is always better to find a problem and correct it before it adversely affects access to data or the data itself. Many RAID controllers, but not all, include scheduled media scans and error management implemented in a task scheduler.
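A scheduled media scan can be pictured with a minimal sketch. It assumes a simple two-disk mirror in which an unreadable block is represented by `None`; the function name `scrub` and the list-based disk model are illustrative assumptions, not any vendor's implementation. The scan repairs what it can from the redundant copy and records the blocks it cannot reconstruct.

```python
def scrub(primary, mirror):
    """Scan every block of `primary`; repair unreadable blocks from `mirror`.

    Disks are modeled as lists indexed by block address, with None standing
    in for a media error (an assumption made for this sketch).
    Returns (repaired, unrecoverable) lists of block addresses.
    """
    repaired, unrecoverable = [], []
    for lba, block in enumerate(primary):
        if block is not None:
            continue                      # block read back cleanly
        if mirror[lba] is not None:
            primary[lba] = mirror[lba]    # rewrite from the redundant copy
            repaired.append(lba)
        else:
            unrecoverable.append(lba)     # must be tracked, not silently dropped
    return repaired, unrecoverable


primary = [b"a", None, b"c", None]
mirror = [b"a", b"b", b"c", None]
repaired, unrecoverable = scrub(primary, mirror)
```

Running the scan on a schedule means a bad block is found while its redundant copy is still readable, rather than during a rebuild when it is too late.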

Does the RAID controller provide an effective method for protecting data in the event of a power failure?

Virtually all RAID controllers have a feature called write-back caching to enhance write performance. This feature allows the controller to notify the hosts that write commands are complete as soon as the data is written into the controller cache memory, without waiting for the data to actually be written to the disk(s). Write data in a controller's cache is vulnerable to power losses and/or controller failures until the data is written to disk. If write-back data is lost, host applications will be oblivious to this loss since they were informed that the write was completed successfully. Therefore, unwritten data must be maintained and, in the case of a controller failure, be available to the surviving controller until it can be written to disk.
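The exposure described above can be shown with a minimal sketch of a write-back cache. The class `WriteBackCache`, its method names, and the dictionary used as a stand-in for the disk are hypothetical, chosen only for illustration.

```python
class WriteBackCache:
    """Toy write-back cache: acknowledges writes before they reach the disk."""

    def __init__(self, disk):
        self.disk = disk    # dict of block address -> data, stands in for media
        self.dirty = {}     # acknowledged data that is not yet on disk

    def write(self, lba, data):
        self.dirty[lba] = data
        return "ack"        # the host is told the write is complete

    def flush(self):
        self.disk.update(self.dirty)   # only now is the data on the media
        self.dirty.clear()

    def power_loss(self):
        """Simulate an unprotected power failure; return the lost addresses."""
        lost = sorted(self.dirty)
        self.dirty.clear()
        return lost


disk = {}
cache = WriteBackCache(disk)
cache.write(0, b"x")
cache.flush()
cache.write(1, b"y")        # acknowledged, but never flushed
lost = cache.power_loss()
```

Block 1 was acknowledged to the host yet never reached the disk, which is exactly the loss the host application cannot detect.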

RAID controllers normally address this risk with a battery-backed data cache. However, merely retaining unwritten data with batteries does not protect against data loss. The RAID controller must also keep track of what it was doing at the time of the power failure and which microcode step it had reached when the event occurred. This is often referred to as operational checkpointing. And even if the RAID controller keeps track of all of this information in its memory, batteries can protect unwritten data in the controller's memory for only a limited time. If the batteries become exhausted before power to the storage subsystem is restored, the data in the controller's memory will be lost. Further data may be lost on the disk drives because of partially written, and therefore incomplete, records. Therefore, it is vital to know how long the RAID controller retains unwritten data and operational checkpointing information, to be sure it can ride out a lengthy power outage.

Does the RAID controller provide redundant dedicated fail-over or cache-mirroring paths between redundant pairs of controllers for the purposes of mirroring data?

All high-availability RAID subsystems that incorporate redundant active-active or active-passive controller pairs must maintain redundant copies, or mirrors, of unwritten cache data and operational checkpoint data to allow the surviving controller to complete writing the data. Regardless of where these mirrored copies reside, both copies of cache data must be successfully created before the host is notified that the controller has completed the I/O. Some controllers use the disk drive channels to move this data between the two controllers. While this process is a proven design, it can have a negative performance impact because the transfer of cache data between the two controllers must share the limited bandwidth available on the disk channel. A RAID controller architecture that provides dedicated redundant data paths between the two controllers does not burden the disk channels with non-disk related I/O and therefore can provide greater overall I/O performance.
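The ordering rule above (both cache copies exist before the host is notified) can be sketched as follows. The `Controller` class and the functions `mirrored_write` and `fail_over` are hypothetical names for this illustration, and the mirroring path itself, whether a dedicated link or the disk channel, is not modeled.

```python
class Controller:
    """One half of a redundant controller pair (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.cache = {}     # unwritten blocks: block address -> data


def mirrored_write(owner, partner, lba, data):
    """Return True (safe to acknowledge the host) only if both copies exist."""
    if not (owner.alive and partner.alive):
        return False        # no mirror possible, so no acknowledgment from cache
    owner.cache[lba] = data
    partner.cache[lba] = data
    return True


def fail_over(survivor, disk):
    """The surviving controller writes every mirrored block to disk."""
    disk.update(survivor.cache)
    survivor.cache.clear()


ctl_a, ctl_b = Controller("A"), Controller("B")
disk = {}
acked = mirrored_write(ctl_a, ctl_b, 7, b"data")
ctl_a.alive = False         # controller A fails before flushing
fail_over(ctl_b, disk)
```

Because the acknowledgment waits for the partner's copy, the surviving controller can complete every write the host believes was finished. What a controller does when no partner is available is left to the caller in this sketch.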

Can you update the RAID controller firmware without taking the controller offline or restarting it?

RAID controllers need firmware updates from time to time, either to add new features as they become available or to fix problems. IT administrators, however, find it impractical to take RAID subsystems offline for periodic maintenance. Therefore, be sure that the RAID controller offers non-disruptive firmware updates.

Does the RAID controller disable the write-back caching capability of all of the disk drives attached to it?

When disk drives perform write-back caching, they inform the RAID controller that the data has been safely written as soon as all of it is in the drive's cache buffer, before it is actually written to the media. Enabling write-back caching on disk drives attached to RAID controllers can increase write performance. In fact, many RAID controller manufacturers activate this feature when conducting performance benchmarks to improve their write performance numbers. However, enabling write-back caching in disk drives attached to RAID subsystems can place data at risk. If a disk drive fails before the data is actually written to its media, data will be lost or corrupted. If only one disk drive in a RAID set fails, the RAID controller can recover the data through its normal RAID algorithms. But if more than one drive fails, unrecoverable data corruption will occur. The controller must either automatically disable all write-back caching on the drives attached to it or give the administrator an option to do so manually.
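The one-drive versus two-drive distinction can be illustrated with a small sketch. It assumes a single-parity (RAID 5-style) stripe, which the article does not specify, and the helper names `xor_blocks` and `rebuild` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, parity):
    """Rebuild a stripe in which a lost block is marked None.

    With single parity, one missing block is the XOR of the survivors and
    the parity block; two or more missing blocks cannot be recovered.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if not missing:
        return list(stripe)
    if len(missing) > 1:
        raise ValueError("more than one block lost: unrecoverable")
    survivors = [block for block in stripe if block is not None] + [parity]
    rebuilt = list(stripe)
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)
recovered = rebuild([data[0], None, data[2]], parity)
```

With one block missing, the stripe comes back intact; passing a stripe with two `None` entries raises `ValueError`, mirroring the unrecoverable case described above.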

RAID technology has evolved to the point where differences between products are no longer obvious to most end users. Many RAID systems look the same, but in actuality, not all RAID controllers are created equal.

Stephen T. Ferrari is a product marketing manager at CMD Technology Inc. (www.cmd.com) in Irvine, CA.

This article was originally published on July 01, 2001