The RAB Guide to Nonstop Data Access
Continuous data availability is an increasingly important goal at 7x24 IT shops. Here's a checklist to help you evaluate the ability of RAID subsystems to withstand failures.
By Joe Molina
To support nonstop data access for retrieving, storing, and modifying data, a disk system must provide uninterrupted, timely on-line access to reliable data despite abnormal occurrences. In the terminology of the RAID Advisory Board (RAB), a disk system that circumvents one or more abnormal occurrences is said to have extended data availability and protection (EDAP) attributes.
An abnormal occurrence is anything that could destroy or corrupt data, anything that could prevent timely on-line access to data, or any period during which a disk system's EDAP capability to circumvent a fault is diminished, thus making the disk system susceptible to further failures. These conditions include:
- Internal failures within the disk system.
- Failure of external equipment attached to the disk system, including host computers and host I/O buses.
- Failures resulting from abnormal environmental conditions, including external power or temperatures that are out of normal operating ranges, natural disasters such as floods and earthquakes, accidental disasters such as fires, and unlawful acts such as sabotage, arson, and terrorism.
- Replacement periods, or intervals required for replacement of a failed component. If "hot swap" is not supported by the disk system, then the component replacement period is tantamount to downtime. If "hot swap" is supported, then downtime due to a replacement period is eliminated. However, until the failed component is replaced, the disk system is in a vulnerable period.
- Vulnerable periods, which occur when the disk system has invoked its ability to circumvent a failure, rendering the system vulnerable to additional failures and causing the system to operate at less than optimum performance until the fault is corrected.
EDAP attributes of a disk system vary from circumventing an internal disk failure to protecting against any internal, external, or environmental failure. The lowest level of EDAP--prevention of the loss of timely on-line access to reliable data due to a disk failure--is also referred to as RAID.
How Much EDAP?
Determining how much EDAP to purchase is a function of application requirements and budget. A few applications do not require EDAP; others require high levels. It's important to select a disk system that meets the needs of your specific application--too much capability is a waste of money; too little puts users at risk of either losing vital data or not being able to access data in a timely manner.
The cost of increasing EDAP is nonlinear. For example, a disk system that withstands a full range of abnormal environmental conditions is roughly five times the cost of one that only withstands a single disk failure.
Disk systems that allow users to add or substitute components to increase EDAP are referred to in RAB terminology as "EDAP scalable." Users who expect future applications to demand more capability may want to limit their procurement analysis to disk systems that can be easily expanded and upgraded or to those that allow substitution of components and functions. Users who do not consider future EDAP requirements may end up replacing, rather than expanding, their disk systems.
Internal Failure Criteria
Because the data resides on disks, EDAP capability starts there, with either mirroring or parity RAID protecting data in the event of a disk failure. EDAP capability then extends to other functional elements within the disk system, including controllers, power supplies, cache memory, and device channels.
The RAID Advisory Board's EDAP criteria for determining the degree to which a disk system exhibits EDAP capability for identifying and circumventing internal failures are:
- Data access and data protection--protection against data loss due to a disk failure, and protection against loss of data access due to a disk failure or multiple concurrent disk failures in a disk-array redundancy group.
- Data protection--protection against data loss due to inconsistency between data and related redundant data (e.g., a write hole), to disk-system component failure (controller, device channel, or power supply), or to cache component failure.
- Data access--protection against loss of data access due to the failure of the device channel, controller, cache, or power supply.
External Failure Criteria
The RAB EDAP criteria for determining the degree to which a disk system exhibits EDAP for external equipment failures are:
- Data protection--protection against data loss due to host and host-I/O bus failures.
- Data access--protection against loss of data access due to host and host-I/O bus failures.
Environmental Failure Criteria
The RAB EDAP criteria for determining the degree to which a disk system exhibits EDAP for identifying and circumventing failures caused by abnormal environmental conditions are:
- Data protection--protection against data loss due to external power failure or to temperatures that are out of normal operating ranges.
- Data access--protection against loss of data access due to external power failure, replacement of a field-replaceable unit (FRU), or failure of one zone of a disk system.
Replacement Period Criteria
A disk system is "down" if it does not provide timely on-line access to reliable data in response to application requests. The minimization or elimination of downtime by replacing failed components is an important goal and can be achieved through:
- Disk hot-swapping. To attain all but the lowest of the seven RAB EDAP classifications, the system must be capable of disk hot-swapping, i.e., disks can be replaced without shutting off power or affecting the system's ability to satisfactorily service application requests.
- FRU replacement. To attain the four highest RAB EDAP classifications, a disk system must allow any FRU to be replaced without disrupting data access.
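The two replacement requirements above act as ceilings on the classification a disk system can reach. A minimal sketch follows, assuming the seven classifications are simply ranked 1 (lowest) through 7 (highest). That numbering and the function name max_edap_rank are illustrative assumptions, not RAB terminology, and the result is only an upper bound, because the other EDAP criteria still apply.

```python
def max_edap_rank(disk_hot_swap: bool, fru_hot_replace: bool) -> int:
    """Highest classification rank (1 = lowest, 7 = highest; numbering assumed)
    that the replacement features alone would still permit."""
    if not disk_hot_swap:
        # Disk hot-swapping is required for all but the lowest classification.
        return 1
    if not fru_hot_replace:
        # Non-disruptive replacement of any FRU is required for the four highest.
        return 3
    # Replacement features impose no ceiling; other criteria decide the rank.
    return 7


ceiling_without_hot_swap = max_edap_rank(disk_hot_swap=False, fru_hot_replace=False)
ceiling_with_disk_hot_swap = max_edap_rank(disk_hot_swap=True, fru_hot_replace=False)
ceiling_with_both = max_edap_rank(disk_hot_swap=True, fru_hot_replace=True)
```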
Vulnerable Period Criteria
When a disk system invokes its capability of circumventing a failure, the system becomes vulnerable to additional failures until the first fault is rectified. Vulnerable periods include the time required to write the contents of a failed disk to its replacement disk and the time between a failure and its rectification.
A disk system is vulnerable in any one of the following states:
- Disk replacement. A disk system is in a disk replacement state from the instant a disk failure occurs until reconstruction begins. If the disk system supports on-line sparing, reconstruction commences immediately after a disk failure, and the disk-replacement period is zero. If on-line sparing is not supported, the disk replacement period equals the time it takes for the system administrator to heed the failure indication and replace the failed disk.
- Disk reconstruction. The disk reconstruction period is the time required to regenerate (for parity RAID) and record the contents of a failed disk onto the replacement disk. Regeneration of data residing on the failed disk from check data (parity) and surviving data on other disks is not required for mirrored disk systems (RAID level 1), which simply copy from the surviving mirror, but it is required for parity-RAID disk systems (RAID levels 3, 4, 5, and 6).
- Component failure--from the instant a component (other than the disk) fails until the moment it is replaced, assuming that the disk system supports full-functional redundancy.
- Equipment failure--from the instant any equipment (host, host-I/O bus, external power source, etc.) attached to the disk system fails until it is repaired or replaced, assuming additional equipment can be attached to the disk system.
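The regeneration step named under disk reconstruction can be illustrated for the single-parity case. The sketch below assumes the check data is a plain XOR of the data blocks in a stripe, as in RAID levels 3, 4, and 5. RAID level 6 adds a second, independent check value that is not modeled here. The block contents and the helper name xor_blocks are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks (made-up contents) plus their parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# The disk holding data[1] fails. Regenerate its block from the surviving
# data blocks and the parity block, as would be written to the replacement.
survivors = [data[0], data[2], parity]
regenerated = xor_blocks(survivors)
```

Because every surviving disk must be read to rebuild each stripe, reconstruction of a parity array generally takes longer and loads the system more heavily than copying from a mirror.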
A disk system is placed in a vulnerable state by a single fault in any one redundancy group. A redundancy group can be two or more disks for a mirrored disk system, three or more disks for a parity RAID disk system, any pair of elements (controllers, power supplies, hosts, host I/O buses, etc.) within or external to the disk system, or two disk-system zones. When a disk system is in a vulnerable state, it cannot circumvent another fault in the same redundancy group.
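The rule in the preceding paragraph can be modeled in a few lines. This is a sketch only, assuming each redundancy group tolerates exactly one fault, as the article describes. The class, function, and state names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RedundancyGroup:
    """A set of interchangeable elements assumed to tolerate a single fault."""
    name: str
    members: int      # e.g., 2 mirrored disks or a pair of controllers
    failed: int = 0

    def fail_one(self) -> None:
        if self.failed < self.members:
            self.failed += 1

    def repair_one(self) -> None:
        if self.failed > 0:
            self.failed -= 1

    @property
    def state(self) -> str:
        if self.failed == 0:
            return "protected"
        if self.failed == 1:
            return "vulnerable"   # fault circumvented, no margin left
        return "failed"           # second fault in the same group


def system_state(groups) -> str:
    """The worst state of any group determines the state of the disk system."""
    states = {g.state for g in groups}
    if "failed" in states:
        return "failed"
    if "vulnerable" in states:
        return "vulnerable"
    return "protected"


mirror = RedundancyGroup("mirrored disks", members=2)
controllers = RedundancyGroup("controllers", members=2)
groups = [mirror, controllers]

mirror.fail_one()
after_first_fault = system_state(groups)    # "vulnerable"
controllers.fail_one()
after_other_group = system_state(groups)    # still "vulnerable"
mirror.fail_one()
after_same_group = system_state(groups)     # "failed"
```

The second fault lands in a different group and is still circumvented. Only the repeat fault in the mirrored pair defeats the protection.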
Minimizing a system's vulnerable period is important. Therefore, users who are thinking about purchasing disk systems with some degree of EDAP should consider the following features:
- On-line disk sparing. To eliminate any delay in reconstructing the disk system after a failure, the system must support on-line disk sparing (also referred to as disk "hot sparing"). Disk sparing enables immediate reconstruction after a disk failure, without user intervention.
- Failure indication. In addition to circumventing failures, the disk system must indicate failures to users. Without this warning, the system remains exposed to further failures until the failed unit is replaced. With the exception of on-line disk sparing, user intervention is required to minimize a disk system's vulnerable period. For this reason, a disk system's ability to warn users of a fault is critical.
- Failure warning. The disk system must warn users of environmental conditions that could be detrimental to the disk system, including power and temperatures that are out of normal operating ranges.
- Reconstruction period. Because of wide variations in acceptable reconstruction periods, RAB does not specify a maximum reconstruction period as an EDAP criterion. However, users are advised to include comparative data on reconstruction periods in their disk-system procurement analysis.
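For comparing candidate systems, the vulnerable period after a disk failure can be roughed out as the disk replacement period plus the reconstruction period. The sketch below assumes reconstruction proceeds at a constant sustained rate, which real systems under application load will not match exactly. The function name and all figures are hypothetical, not RAB or vendor data.

```python
def vulnerable_hours(capacity_gb, rebuild_mb_per_s, replacement_delay_hours=0.0):
    """Estimate the vulnerable period after a disk failure, in hours.

    replacement_delay_hours is zero when an on-line spare is available;
    otherwise it is the time the administrator takes to notice the failure
    and replace the disk.
    """
    if rebuild_mb_per_s <= 0:
        raise ValueError("rebuild rate must be positive")
    reconstruction_hours = (capacity_gb * 1000) / rebuild_mb_per_s / 3600
    return replacement_delay_hours + reconstruction_hours


# Hypothetical 9-GB disk rebuilt at a sustained 5 MB/s.
with_hot_spare = vulnerable_hours(9, 5)                                # 0.5 hours
without_spare = vulnerable_hours(9, 5, replacement_delay_hours=4.0)    # 4.5 hours
```

In this made-up case, the administrator's response time rather than the rebuild itself dominates the exposure, which is why on-line sparing and prompt failure indication matter.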
Joe Molina is chairman of the RAID Advisory Board. He can be reached at (507) 931-0967. For more information, go to www.raid-advisory.com.