How to Evaluate RAID for VLDB
In the last of our three-part series, two IT administrators evaluate key RAID attributes, including performance, redundancy, scalability, portability, configurability, and manageability.
By Edwin Lehr and Christopher Schultz
As covered in part two of this series (InfoStor, February), RAID 0+1 provides the best performance and protection in very large database (VLDB) applications, but its cost is often prohibitive. Instead, the two most popular RAID levels for VLDBs are RAID-3 and RAID-5. RAID-3 stripes data across several drives and stores parity on a single dedicated drive. For random write operations, this single drive can become a bottleneck; for very large sequential write applications (such as video imaging), however, RAID-3 excels. RAID-5, on the other hand, distributes both data and parity across all drives in the array. In a five-drive array, for example, the parity block for the first stripe resides on one drive, the parity for the next stripe on another, and so on, rotating through all five drives. This configuration tends to work best when the host computer executes a great deal of random I/O operations.
Which configuration is best for you depends on your application. RAID-3 is most applicable when you need to stream a lot of sequential data to and from an array. When you read data from a RAID-3 array, it looks like a simple stripe of drives to the controller, so RAID-3 has a slight read advantage over RAID-5. As you stream data back to the array, you fill an entire stripe and then calculate its parity. Because the array knows it has just filled the data drives with fresh information, it doesn't need to read old data blocks or old parity values; it simply calculates and writes the parity on the fly. Random writes nullify this advantage.
For most VLDB applications, RAID-5 is optimal. When you need to randomly update a database file, RAID-5 is the better choice because bottlenecks are less likely. Striped parity matters because changing a data block in a parity-based RAID scheme involves multiple disk accesses: the array reads the old data block, reads the old parity, calculates the new parity, and then writes both the new data and the new parity back to the drives. The parity is accessed twice for each write operation; by striping it across multiple drives, the array spreads that load and functions more efficiently.
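The read-modify-write sequence above can be sketched in a few lines. This is an illustrative model of XOR-based parity (the scheme RAID-3 and RAID-5 use), not any vendor's implementation; the function names and two-byte blocks are invented for the example.

```python
# Illustrative sketch of the RAID-5 "read-modify-write" parity update:
# two reads (old data, old parity) and two writes (new data, new parity).
# Block contents and names here are hypothetical.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

def update_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Return the new parity after replacing old_data with new_data.

    new_parity = old_parity XOR old_data XOR new_data -- this is why the
    array must read the old block and old parity before writing.
    """
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)

# A stripe's parity is the XOR of all its data blocks:
stripe = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6])]
parity = stripe[0]
for block in stripe[1:]:
    parity = xor_blocks(parity, block)

# Updating one block incrementally matches recomputing parity from scratch:
new_block = bytes([7, 8])
incremental = update_parity(stripe[0], parity, new_block)
recomputed = xor_blocks(xor_blocks(new_block, stripe[1]), stripe[2])
assert incremental == recomputed
```

A full-stripe sequential write, by contrast, computes parity directly from the blocks in hand, which is the RAID-3 streaming advantage the previous paragraphs describe.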
Most of the RAID vendors we evaluated support both RAID-3 and RAID-5 (e.g., Digital's Clariion unit). MTI, however, only provides RAID-5, but its RAID-5 implementation is fast and competes well with most RAID-3 implementations. Ciprico, on the other hand, only supports RAID-3, primarily because it serves the high-end visualization and imaging industries. EMC takes a different approach: RAID 0+1 for maximum performance, or a proprietary implementation called RAID-S.
The parity calculation that allows a RAID array to survive a drive failure compromises performance by requiring additional I/O operations. This is most noticeable in certain RAID software implementations (e.g., the Sun SPARC Storage Array 100). In such cases, the host computer performs the additional I/O operations and calculates parity values.
A fast RAID array requires the host computer to do very little processing. However, these arrays can be fairly complicated. All of the manufacturers we evaluated place additional I/Os and parity calculations onto separate processors in the array, and they add hardware to optimize RAID performance. When cache, multiple internal SCSI buses, and striping schemes are used properly, a good RAID implementation can move data faster (and more safely) than a similar set of striped drives.
Another critical component of a fast RAID implementation is a good write-back cache. This memory resides on the array's I/O controller, which acknowledges the write to the host as soon as the data is safely in cache, giving the controller flexibility in timing the actual disk writes. In a simple scheme, the data is written during periods of idle time, which may be few and far between in VLDB environments.
Another alternative is to use a write-gathering cache. In this case, the array controller optimally arranges the writes in its cache based on the number of drives in the RAID stripe. When the controller flushes the cache, it maximizes the use of each I/O operation internally in the array.
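A write-gathering cache can be modeled as a buffer that groups pending writes by stripe before flushing, so that full stripes can be written with freshly computed parity and no extra reads. This is a minimal sketch under assumed parameters (four data drives per stripe, block-addressed writes); real controllers are far more sophisticated.

```python
# Hypothetical write-gathering cache: host writes are acknowledged
# immediately and grouped by stripe at flush time. Full stripes avoid
# the read-modify-write parity penalty. Names and sizes are invented.

STRIPE_WIDTH = 4  # data drives per stripe (assumption)

class WriteGatheringCache:
    def __init__(self):
        self.pending = {}  # block address -> data

    def write(self, addr: int, data: bytes):
        """Host write: committed to the host at once, flushed later."""
        self.pending[addr] = data

    def flush(self):
        """Group pending blocks by stripe number before writing."""
        stripes = {}
        for addr, data in self.pending.items():
            stripes.setdefault(addr // STRIPE_WIDTH, {})[addr] = data
        full, partial = [], []
        for stripe_no, blocks in stripes.items():
            (full if len(blocks) == STRIPE_WIDTH else partial).append(stripe_no)
        self.pending.clear()
        return full, partial

cache = WriteGatheringCache()
for addr in range(4):       # fills stripe 0 completely
    cache.write(addr, b"x")
cache.write(9, b"y")        # lone block lands in stripe 2
full, partial = cache.flush()
assert full == [0] and partial == [2]
```

Only the partial stripe would incur the extra reads described earlier, which is the efficiency the write-gathering approach buys.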
RAID vendors do not like to talk about the specifics of their caching algorithms since they are a key differentiating feature among products. Some designs throw in a lot of cache (up to 4GB in the case of EMC's Symmetrix) to accommodate the I/O. Other vendors focus on caching algorithms and sometimes deliver equal or better performance with less cache. Data General typically uses 32MB or 64MB, and MTI is almost always deployed with 16MB.
RAID manufacturers introduced redundancy to counter the component failures that come with increasingly complex arrays and to eliminate single points of failure. Duplicating components negates the effect of any single component failure, while hot-swappable components allow uninterrupted use of the array while a failed part is replaced.
Most RAID manufacturers provide multiple, hot-swappable components. Additionally, the internal SCSI bus should have redundant pathways to each drive.
Write-back cache also needs to be redundant. Since write-back cache immediately commits to the host after a write has been made, it cannot afford to fail. Most RAID manufacturers dedicate an equal amount of memory to mirroring the primary cache. A fully redundant RAID array does not commit a write to the host until the data is written to both caches.
To keep the contents of the cache "alive" in the event of a power outage, administrators need to be sure their systems have adequate battery backup, which allows the cache to be flushed to the drives once power is restored. You may also want to supplement battery backup with an uninterruptible power supply (UPS). Some vendors design power systems that keep the entire array operational for a few minutes after a power failure; that window gives the controller time to flush the cache to disk, so the array can ride out an outage of any duration without losing data.
If the cache is not redundant, the RAID unit has a very serious single point of failure. A typical scenario involves the failure of a SIMM chip during heavy usage. In this situation, the array has already told the host computer that the last 16MB to 4GB (depending on cache size) of data were written to disk. In reality, some of that data may never have reached the drives before the cache failed; it is lost in the defective memory.
For VLDBs, scalability may be a less important issue than it used to be. Virtually all RAID vendors recognize the need to provide hundreds of gigabytes of space for databases. A typical RAID vendor offers around 60GB of usable space in a single array, and arrays can usually be grouped for more storage capacity.
The total amount of RAID that can be attached to a single server is limited by the number of available SCSI ports. For optimal performance, you should avoid daisy chaining devices, but you can attach multiple arrays to a single SCSI port.
An array may also have multiple ports so it can be attached to more than one server. Typically, the limit is two computers (though "enterprise-class" RAID solutions often provide connectivity to more than two). The two hosts do not share the files, but the drives can be logically divided between the two. In a fail-safe configuration, one host may own all of the drives until that system fails, at which time the second host assumes ownership of the failed computer's RAID arrays. This type of availability greatly reduces total downtime, but it does not eliminate it entirely.
Since a RAID array is basically a virtual drive that communicates with a host computer via a standard SCSI port, you should be able to easily connect it to different types of machines. As long as the server supports SCSI-2 connections (and most do) and RAID management software is available for the host platform, then a transition from Unix to Windows NT, for example, should not be an issue.
Part of scalability and portability is the ability to accommodate forthcoming technologies. Today, most RAID vendors use standard SCSI-2 connections to the host computer. An ideal RAID array would allow you to upgrade to Ultra SCSI and later to Fibre Channel.
A RAID array is composed of a set number of drives. These drives appear to the host computer as a single large virtual drive or they can be divided into LUNs. The separate LUNs can be mounted as different file systems on different hosts.
There are two ways to divide a RAID array. Most arrays require you to choose a subset of drives in the array and then a drive for parity (even though the parity may be striped across all of the drives). So, if you want two LUNs in an array of 20 drives, you would have the equivalent of two parity drives and two sets of nine data drives. Some vendors allow you to assign different RAID levels to each LUN.
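The capacity arithmetic in the 20-drive example can be made concrete. The per-drive capacity below (4GB) is an assumption for illustration, not a figure from any vendor discussed here.

```python
# Back-of-the-envelope capacity arithmetic for the 20-drive, two-LUN
# example above. The 4GB per-drive size is hypothetical.

DRIVE_GB = 4            # assumed per-drive capacity
TOTAL_DRIVES = 20
LUNS = 2

parity_drives = LUNS                      # one parity equivalent per LUN
data_drives = TOTAL_DRIVES - parity_drives
usable_gb = data_drives * DRIVE_GB
parity_gb = parity_drives * DRIVE_GB

# With two LUNs of ten drives each, each LUN offers nine drives of data.
print(f"{data_drives} data drives, {usable_gb}GB usable, "
      f"{parity_gb}GB consumed by parity")
```

Adding LUNs to a fixed set of drives therefore trades usable capacity for independent parity domains, one drive's worth per LUN.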
In this context, the MTI configuration is more manageable than the other arrays we evaluated. The array does not require you to select certain drives for individual LUNs. The GUI allows you to specify any combination of up to nine LUNs. The array controller then balances the I/O across the available drives and ensures the proper redundancy.
Also important is the ability to monitor the status of your array. In a Unix environment, important events should be captured in the SYSLOG or messages file for centralized monitoring. SNMP traps should also be available for more robust monitoring environments. These features matter because the sooner a failed drive is hot-swapped, the sooner the array returns to fully protected operation.
Most vendors have a feature called "hot spare" that dedicates a drive that is automatically switched into the array when a drive fails. The array then starts rebuilding the contents of the failed drive on the "hot spare" without waiting for the failed drive to be removed and replaced. Since it may not be possible to swap disks right away in an unattended environment, this feature helps shorten the length of time that the array is vulnerable to a second drive failure. "Hot spares" may need to be dedicated to each LUN. Some suppliers offer a "global hot spare" that can be used automatically by any LUN, reducing the number (and cost) of hot spares.
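What the rebuild onto a hot spare actually computes, for a parity-protected array, is the XOR of each stripe's surviving blocks. The sketch below is illustrative only; the stripe contents are invented.

```python
# Hypothetical sketch of rebuilding one stripe onto a hot spare in a
# parity-protected array: the failed drive's block is the XOR of all
# surviving data blocks and the parity block.

def rebuild_block(surviving: list, parity: bytes) -> bytes:
    """Reconstruct the missing block from the rest of the stripe."""
    result = parity
    for block in surviving:
        result = bytes(a ^ b for a, b in zip(result, block))
    return result

# Stripe of three data blocks and their XOR parity:
d0, d1, d2 = bytes([1, 2]), bytes([3, 4]), bytes([5, 6])
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

# The drive holding d1 fails; the array rebuilds it on the hot spare:
assert rebuild_block([d0, d2], parity) == d1
```

Until that reconstruction finishes on every stripe, the array has no parity protection left, which is why shortening the window before a rebuild starts matters so much.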
At the time this article was written, Edwin E. Lehr was the Oracle database administrator and Christopher Schultz was the Unix system administrator at a major bank in eastern U.S. The bank has since been acquired. Lehr is now a principal consultant at Oracle and Schultz is a systems engineer at Silicon Graphics.