How to Manage Disk Media Errors
RAID does a good job of managing drive failures, but can your array manage disk media errors?
By Roger J. Klein
The basic function of RAID is to provide protection against disk-drive failure. The goal of high-availability storage systems is to combine RAID techniques with a hardware and firmware implementation that ensures the highest degree of data accessibility possible. RAID is the cornerstone of high-availability storage systems, but there's a lot more to the picture.
The storage system's hardware and firmware implementation, or architecture, takes into consideration the broader implications of failures and their effect on accessing data. At the highest level, the architecture of a high-availability storage solution protects against the failure of major components, such as the controller, cache memory, or the power supply. Most high-availability RAID solutions do a good job of addressing the high-level items that can interrupt data access; however, there are significant differences among RAID system suppliers in more subtle areas--differences that at a minimum can interrupt data access or, worse yet, can result in the loss of data integrity.
One of the not-so-obvious areas is the management of disk media errors. A RAID implementation that deals with disk media errors can spell the difference between an environment with high data integrity and a nagging state of uneasiness about the data.
All disk drives have media errors, regardless of the manufacturer. Media errors are defects, bad spots, or damaged areas of the disk surface that cannot reliably store data. These areas are mapped out in all new disk drives; however, media errors are not limited to new drives. They occur over the course of a disk drive's life, so drive manufacturers incorporate spare blocks into the media. The premise is that blocks with media errors can be replaced (reallocated or reassigned) with spares either automatically by the drive or by instruction from a controller or host. When a block goes bad, it can be "fixed" using a spare location that has been set aside expressly for that purpose.
Today, most high-end drives boast mean time between failure (MTBF) ratings of more than 1,000,000 hours. Nonetheless, drives are still susceptible to media errors. In fact, spare blocks can amount to as much as two percent of the drive's quoted capacity. For a 9.1GB drive with over 17.7 million blocks, that's over 350,000 spares, or almost 180MB of spare blocks! It's not a question of whether your disk drives will experience a media error; it's a question of when, and of how your RAID system deals with them.
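The arithmetic behind those figures can be checked in a few lines. The sketch below assumes 512-byte blocks, a value the article does not state but one that is consistent with 17.7 million blocks adding up to roughly 9.1GB; the variable names are illustrative only.

```python
# Assumption: 512-byte blocks (not stated in the text, but consistent
# with 17.7 million blocks adding up to roughly 9.1GB).
BLOCK_SIZE = 512
TOTAL_BLOCKS = 17_700_000   # "over 17.7 million blocks"
SPARE_PERCENT = 2           # "as much as two percent"

capacity_gb = TOTAL_BLOCKS * BLOCK_SIZE / 1e9
spare_blocks = TOTAL_BLOCKS * SPARE_PERCENT // 100   # integer math, no rounding drift
spare_mb = spare_blocks * BLOCK_SIZE / 1e6

print(f"capacity:     {capacity_gb:.2f} GB")
print(f"spare blocks: {spare_blocks:,}")
print(f"spare space:  {spare_mb:.0f} MB")
```

Under these assumptions the result is about 354,000 spare blocks and roughly 180MB of spare space, in line with the rounded figures quoted above.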
Two general areas of the RAID system need to be considered: the handling of media errors for non-degraded RAID sets, for which properly implemented systems can reconstruct the data, and the handling of media errors on degraded RAID sets, for which media errors pose the greatest threat.
A non-degraded RAID set can withstand the failure of any single disk member without loss of data availability. Unrecoverable media errors are manageable and have little effect on non-degraded sets because they can be handled as if the entire disk drive failed--that is, the missing data can be electronically regenerated from the remaining members of the RAID set.
In Figure 1, assume the RAID set has encountered an unrecoverable media error while attempting to read drive one (red oval). The RAID controller can regenerate the missing data from drive one by reading drives two through four and performing the appropriate XOR calculations. However, before the data can be permanently corrected (i.e., rewritten), the bad media on drive one needs to be relocated using the SCSI REASSIGN command. When the bad media has been successfully relocated using spare blocks provided by the drive manufacturer (e.g., yellow oval), the regenerated data can be recorded and the RAID set returns to normal.
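The sequence just described (regenerate, relocate, rewrite) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor's firmware: the `Drive` class, its `reassign` method, and `read_with_repair` are hypothetical names invented for illustration, and a single parity block per stripe is assumed.

```python
from functools import reduce


class MediaError(Exception):
    """Raised when a block cannot be read (unrecoverable media error)."""


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class Drive:
    """Toy drive: a list of blocks plus a set of blocks with bad media."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.bad = set()

    def read(self, lba):
        if lba in self.bad:
            raise MediaError(lba)
        return self.blocks[lba]

    def reassign(self, lba):
        # Stand-in for SCSI REASSIGN: the block now maps to a spare.
        # The media is good again, but its contents are not valid yet.
        self.bad.discard(lba)
        self.blocks[lba] = bytes(len(self.blocks[lba]))

    def write(self, lba, data):
        if lba in self.bad:
            raise MediaError(lba)
        self.blocks[lba] = data


def read_with_repair(drives, target, lba):
    """Read one block; on a media error regenerate, reassign, rewrite."""
    try:
        return drives[target].read(lba)
    except MediaError:
        others = [d.read(lba) for i, d in enumerate(drives) if i != target]
        data = xor_blocks(others)          # regenerate from the other members
        drives[target].reassign(lba)       # relocate the bad media first
        drives[target].write(lba, data)    # then record the regenerated data
        return data


# Hypothetical four-member stripe: three data blocks plus their parity.
d1, d2, d3 = b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"
drives = [Drive([d1]), Drive([d2]), Drive([d3]), Drive([xor_blocks([d1, d2, d3])])]

drives[0].bad.add(0)                       # media error on drive one
recovered = read_with_repair(drives, target=0, lba=0)
```

After the call, `recovered` equals the original contents of drive one, and a later read of the same block succeeds without regeneration because the data has been rewritten to the reassigned location.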
A more serious case involves a media error while the RAID set is in degraded mode. A degraded RAID set has already lost one member, so it cannot withstand the failure of any additional disk member without loss of data availability. Unrecoverable media errors are a serious threat to degraded sets because they have the same effect on the stripe as an entire disk failure, and data is lost. The data cannot be electronically regenerated from the remaining members of the RAID set because there simply is not enough information to complete the regeneration. This point is illustrated in Figure 2.
Assume that drive number four (red X'ed drive) in Figure 2 has failed. While the set is degraded, the system receives a host request for data from the now-failed drive four. In the process of regenerating the missing data from the remaining drives, an unrecoverable media error occurs while attempting to read drive three (red oval). The RAID controller no longer has sufficient data to perform the XOR calculations. As a result, the data located on drive four (red X'ed drive) and the data lost in the unrecoverable media error on drive three (red oval) are lost and cannot be regenerated. The RAID system could correct the media by doing a SCSI REASSIGN, but it has no valid data. So, the RAID system has no option but to report a SCSI CHECK status to the host, indicating an inability to satisfy the request with data.
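Why the controller runs out of options can be seen in a small sketch. It assumes single-parity protection, models each member's strip as a byte string, and uses `None` for a strip that is missing, whether because the whole drive failed or because of an unrecoverable media error. The function name `degraded_read` and the status strings are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def degraded_read(strips, wanted):
    """Return (status, data) for one member's strip in a single stripe.

    strips: one entry per member, parity included; None means the strip
    is unavailable (failed drive or unrecoverable media error).
    """
    if strips[wanted] is not None:
        return "GOOD", strips[wanted]
    others = [s for i, s in enumerate(strips) if i != wanted]
    if any(s is None for s in others):
        # Two strips missing in one stripe: XOR cannot recover either.
        return "CHECK", None
    return "GOOD", xor_blocks(others)


# Hypothetical stripe contents; the last member's strip is the XOR of the rest.
a, b, c = b"\x01\x02", b"\x10\x20", b"\x0a\x0b"
d = xor_blocks([a, b, c])

# Drive four has failed, the other three are readable: regeneration works.
ok = degraded_read([a, b, c, None], wanted=3)

# Drive four has failed AND drive three hits a media error: nothing to XOR.
lost = degraded_read([a, b, None, None], wanted=3)
```

In the first call the strip from the failed drive is regenerated; in the second the function can only return the error status, which mirrors the situation in Figure 2.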
Let's explore this example a little further, assuming a rebuild is in process. In Figure 3, assume the RAID set has suffered a disk failure to drive number four (red X'ed drive) and that a spare disk (yellow spare drive) is being used to reconstruct it. In the behind-the-scenes process of automatically reconstructing the failed drive, an unrecoverable media error is encountered while reading drive three (red oval), and the RAID controller no longer has sufficient data to perform the XOR calculations. As a result, both the original lost data and the data lost in the unrecoverable media error on drive three cannot be regenerated for those strips.
What should a RAID system do if an unrecoverable media error is encountered during the rebuild of a failed disk drive? Clearly, the RAID system can't stop because aborting the rebuild leaves the remaining data unprotected, exposing it to potential loss. Requiring some kind of immediate human intervention is inconsistent with the premise of high-availability storage systems, for which the objective is to maintain the highest level of data accessibility possible. However, if the rebuild continues, it must keep track of the location of the unreconstructable data.
In Figure 4, the best the subsystem can do is relocate the bad media on drive three using the SCSI REASSIGN command, leaving it with good media on drives three and four (the spare drive), but with no valid data to write to them. Remember: good media, no valid data. Also keep in mind that only the RAID controller knows that there is no valid data in these two locations. At the point of discovery in this example, the host has not asked for the data.
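A rebuild that continues past a media error while recording what it could not reconstruct might look like the sketch below. It is one possible approach under simplifying assumptions, not a description of any particular product: single parity, strips held in memory as byte strings, `None` standing for an unrecoverable media error, and zero-filled bytes standing in for a freshly reassigned block. The names `rebuild` and `invalid` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(survivors, failed_drive, strip_size):
    """Rebuild a failed member onto a spare without stopping on media errors.

    survivors: {drive_number: [strip, ...]}; a None strip models an
    unrecoverable media error. Returns (spare_strips, invalid), where
    invalid is a set of (drive_number, stripe) locations that hold good
    media but no valid data.
    """
    stripe_count = len(next(iter(survivors.values())))
    spare, invalid = [], set()
    for stripe in range(stripe_count):
        bad = [n for n, strips in survivors.items() if strips[stripe] is None]
        if bad:
            for n in bad:
                # Stand-in for SCSI REASSIGN: good media, no valid data.
                survivors[n][stripe] = bytes(strip_size)
                invalid.add((n, stripe))
            spare.append(bytes(strip_size))       # nothing valid to write
            invalid.add((failed_drive, stripe))
        else:
            spare.append(xor_blocks([s[stripe] for s in survivors.values()]))
    return spare, invalid


# Hypothetical three-stripe set; drive 4 (holding the XOR of the others) failed.
d1 = [b"\x01", b"\x02", b"\x03"]
d2 = [b"\x10", b"\x20", b"\x30"]
d3 = [b"\x0a", None, b"\x0c"]                     # media error in stripe 1
expected = [xor_blocks([d1[i], d2[i], d3[i]]) for i in (0, 2)]

spare, invalid = rebuild({1: d1, 2: d2, 3: d3}, failed_drive=4, strip_size=1)
```

Stripes 0 and 2 are rebuilt normally, while stripe 1 ends up with two recorded locations, one on drive three and one on the spare: good media, no valid data.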
If the host has not asked for the data, why keep track of the unreconstructable location? Because at some point, the data at this location may need to be used. It may be needed to satisfy a host READ request or it may be needed to regenerate missing data in the stripe caused by a subsequent drive failure. Whatever the reason, the underlying architecture must deal with the fact that the disk drive will deliver data from the location regardless of its logical validity. If the disk media is good, the drive will deliver data. Keeping track of the unreconstructable locations allows the RAID system to distinguish between good data and invalid data.
In the case of the host READ request, permanently keeping track of unreconstructable data allows the RAID system to correctly provide a SCSI CHECK status to that READ. This check tells the host that the subsystem cannot reliably satisfy the request. Remember, since the system fixed the media using a SCSI REASSIGN, it can read the location, but unfortunately it would not contain valid data. If the RAID system did not keep track of that fact, it could provide the host with invalid data with no error indication.
Another event that may affect the unreconstructable location is a WRITE. In effect, a WRITE fixes the problem because we now have valid contents to record to the media. The invalid data problem is fixed and data access is not interrupted. It is also possible that there will never be a READ or WRITE to the affected location. In all circumstances, however, it is essential to keep track of the unreconstructable locations for the reasons cited above.
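The READ and WRITE behavior described above can be illustrated with a minimal sketch. It assumes the controller keeps a set of invalid block addresses; the class name `TrackedVolume` and the status strings are hypothetical, parity updates are left out for brevity, and the set lives only in memory here, whereas a real system would need to preserve it across power failures and controller restarts.

```python
GOOD, CHECK = "GOOD", "CHECK"


class TrackedVolume:
    """Toy volume that remembers which blocks hold no valid data."""

    def __init__(self, blocks, invalid):
        self.blocks = list(blocks)
        self.invalid = set(invalid)   # good media, no valid data

    def read(self, lba):
        if lba in self.invalid:
            # The drive would happily return bytes; the controller must not.
            return CHECK, None
        return GOOD, self.blocks[lba]

    def write(self, lba, data):
        # New contents from the host make the location valid again.
        self.blocks[lba] = data
        self.invalid.discard(lba)
        return GOOD


# Block 1 was reassigned during a rebuild and holds meaningless bytes.
vol = TrackedVolume(blocks=[b"aa", b"??", b"cc"], invalid={1})

status_before = vol.read(1)    # error status, no data handed to the host
vol.write(1, b"bb")            # host overwrites the location
status_after = vol.read(1)     # now a normal read
```

Without the `invalid` set, the first read would have returned `b"??"` with a good status, which is exactly the silent corruption the tracking is meant to prevent.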
So, how do you keep track of unreconstructable data? RAID suppliers use several different techniques--the trade-offs are performance, storage capacity, and complexity. Whether a given method is good or bad depends on the RAID system architecture and the specific environment in which the architecture is used. What's important is that the RAID system uses some technique to manage disk media errors because, believe it or not, not all systems do. Those that don't are either weak with respect to high-availability data access or, worse, they put the system's data integrity at risk.
Whether it's a software- or hardware-based solution, bus-based or external, the RAID solution needs to deal with disk media errors. When evaluating high-availability storage systems, get answers to these questions:
- Does the RAID implementation have an effective media management scheme?
- Can the supplier describe the implementation and answer specific questions about it?
- Does the RAID solution survive hardware failures, power failures, controller restarts, redundant controller fail-overs, etc.?
- Does the scheme allow for continuous operations?
- How are media errors handled during rebuilds?
- If you're running a 7x24 system and a disk fails at midnight and a rebuild initiates, what happens if a media error occurs?
- Will the scheme ensure that bad data in good media is not confused with good data in good media?
There are many well-designed RAID products in the market today. However, RAID technology has evolved to the point where differences between products are not obvious. On the surface, many RAID systems look the same. But not all RAID systems that look alike are alike. Know the product you're going to trust your data to, ask questions, and insist on answers. Your data depends on it.
Roger J. Klein is director of storage marketing at CMD Technology Inc. in Irvine, CA.