Pros and cons of caching controllers in RAID arrays
Controller-based data caching can double or triple I/O performance, but "your mileage may vary" depending on the access patterns of your applications. Aside from corner-case applications like video broadcasting, caching can significantly increase I/O performance. But by how much and at what risks? This article explores the data reliability question.
What separates mission-critical arrays from the rest is how well they perform under adverse conditions: in the face of errors and component failures along the I/O path, which includes host bus adapters, hubs, switches, controllers, drives, and cables. When I/O subsystems are fully operational, you can rely on cached writes making it safely to disk and trust the currency of cached reads.
But what happens if components in the I/O path fail or, worse yet, your applications are oblivious to the problem? In these situations, the robustness of your array's error recovery mechanisms determines whether your database is corrupted or stale data is delivered to your applications when a path failure occurs.
Caching: how it works
Most RAID controllers provide read caching and two user-selectable flavors of write caching: write-back and write-through.
Read caching. Read caching is based on the locality of reference principle: Applications are more likely to reference data that is stored in proximity to previously referenced data than data stored on distant areas of the disk. In other words, the next block an application reads has a high probability of being stored in the same data stripe as the last block read. While read-caching algorithms vary in sophistication, the basic notion is the same: With each read operation, grab the requested block and the rest of the blocks in the stripe (hence the term "read-ahead" caching). The requested block is returned to the application and the rest are stored in the read cache. If a subsequent read I/O wants a block stored in the cache (a "cache hit"), then a long trip out to the disk pool is avoided.
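The read-ahead idea can be illustrated with a short Python sketch. This is a toy model, not any vendor's algorithm: the class name ReadAheadCache, the four-block stripe size, and the list standing in for the disk pool are all assumptions made for the example.

```python
class ReadAheadCache:
    """Toy read-ahead cache: a miss loads the whole stripe holding the block."""

    def __init__(self, disk, blocks_per_stripe=4):
        self.disk = disk                        # list standing in for the disk pool
        self.blocks_per_stripe = blocks_per_stripe
        self.cache = {}                         # block number -> data
        self.hits = 0
        self.misses = 0

    def read(self, block):
        if block in self.cache:                 # cache hit: no trip to disk
            self.hits += 1
            return self.cache[block]
        self.misses += 1
        # Cache miss: fetch the requested block plus the rest of its stripe.
        start = (block // self.blocks_per_stripe) * self.blocks_per_stripe
        end = min(start + self.blocks_per_stripe, len(self.disk))
        for b in range(start, end):
            self.cache[b] = self.disk[b]
        return self.cache[block]


disk = [f"data-{n}" for n in range(16)]
cache = ReadAheadCache(disk, blocks_per_stripe=4)
for n in (4, 5, 6, 7, 8):                       # sequential reads across two stripes
    cache.read(n)
print(cache.hits, cache.misses)                 # prints: 3 2
```

Reading block 4 misses and pulls in blocks 4 through 7, so the next three reads are hits; block 8 starts a new stripe and misses again. A workload with little locality would see far fewer hits.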
Write caching. Write caching is based on another simple principle: It takes a few microseconds to store write data in a controller's cache versus a half dozen milliseconds to store it on disk. Writing to (or reading from) cache is more than 1,000 times faster than writing to (or reading from) disk. There are two types of write caching: write-back and write-through.
With write-back caching, a write is written to cache, the I/O is acknowledged as "complete" to the server that issued the write, and some time later the cached write is written or flushed to disk. When the application receives the I/O complete acknowledgement, it assumes the data is permanently stored on disk. With write-through caching, sometimes referred to as conservative cache mode, writes are written to both the cache and the disk before the write is acknowledged as complete. Write-through caching improves I/O performance with applications that frequently read recently written data.
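The difference between the two modes comes down to when the write is acknowledged relative to when it reaches disk. The sketch below models that ordering; CachingController and its dictionaries for cache and disk are hypothetical stand-ins, not a real controller interface.

```python
class CachingController:
    """Toy controller showing when each write mode makes data permanent."""

    def __init__(self, mode="write-back"):
        if mode not in ("write-back", "write-through"):
            raise ValueError(f"unknown cache mode: {mode}")
        self.mode = mode
        self.cache = {}      # block -> data held in controller memory
        self.disk = {}       # block -> data that is permanent
        self.dirty = set()   # cached blocks not yet written to disk

    def write(self, block, data):
        self.cache[block] = data
        if self.mode == "write-through":
            self.disk[block] = data      # on disk before the acknowledgement
        else:
            self.dirty.add(block)        # acknowledged now, flushed later
        return "complete"

    def flush(self):
        """Background task: make all dirty cached writes permanent."""
        for block in sorted(self.dirty):
            self.disk[block] = self.cache[block]
        self.dirty.clear()

    def read(self, block):
        # Recently written data is served from cache in either mode.
        if block in self.cache:
            return self.cache[block]
        return self.disk.get(block)


wb = CachingController("write-back")
wb.write(7, "new")
print(7 in wb.disk)      # prints: False (acknowledged, but only in cache)
wb.flush()
print(wb.disk[7])        # prints: new
```

In write-back mode the window between the acknowledgement and the flush is where data is at risk. In write-through mode the window does not exist, but every write pays the disk latency.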
Caching is a cost-effective way to improve I/O performance. However, unless the RAID controllers are configured in dual-active pairs, and designed with cache coherency and robust recovery mechanisms, caching can cause incorrect data to be delivered to applications and corrupt databases when elements in the I/O path fail.
Cache mirroring. One element in the I/O path that will obviously jeopardize data integrity if it fails is the RAID controller. Data written to a write-back cache is vulnerable until it is made permanent on disk, which is done later as a background task when spare cycles are available. If a controller with write-back cache enabled fails, the writes in its cache may be lost, and since the controller has already acknowledged the I/Os as complete, the application is unaware of the data loss. In database parlance, this type of data corruption is called the "lost write" problem. The application thinks the writes were written to disk, but they never made it past the controller's data cache.
RAID array vendors provide battery back-up units (BBUs) that preserve cache contents during power failures. BBUs do not protect data against controller failures unless 1) the battery-backed memory is transportable; 2) the battery circuit or the cache memory is not the cause of the failure; and 3) the failure did not propagate to the memory, corrupting the cache before the controller shut down. Even if these conditions are met, transportable battery-backed memory technology comes up short in mission-critical environments that cannot wait for a field engineer to show up with a replacement controller.
Uninterruptible power supplies (UPSs) provide essentially the same function as BBUs, but if properly sized can keep the entire array alive at least long enough for the controller to flush its write cache to disk. If operations continue on the UPS's battery, RAID controllers will switch to conservative cache mode, ensuring that writes are stored on disk before the I/Os are acknowledged.
In single-controller array configurations, no dependable cache recovery mechanism protects cached writes against controller failures. However, external storage arrays with dual-active RAID controllers can provide a reliable cache recovery mechanism called mirrored caching. During normal operations, the dual-active controllers share the I/O workload; however, if one controller fails, its partner assumes the entire workload.
In dual-active RAID configurations with cache mirroring, writes are written to the caches in both controllers before the write is acknowledged as complete. In some controller designs, the memory is logically partitioned with a "write buffer" reserved for mirrored writes. When both controllers are operational, writes are mirrored to the other controller's write buffer. If a controller fails, its partner completes the write operations that were in process at the time of the failure by flushing its write buffer to disk, restoring the database to a consistent state. The surviving controller then transparently (to the servers) fails over the host port address of the failed controller, updates its configuration files, and assumes the workload of the failed controller in addition to its own.
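A minimal Python sketch of mirrored caching follows, assuming a simplified model in which a shared dictionary stands in for the drives. The names MirroredController, write_buffer, fail, and take_over are invented for illustration, and the host port failover and configuration updates described above are left out.

```python
class MirroredController:
    """Toy dual-active controller that mirrors cached writes to its partner."""

    def __init__(self, name, disk):
        self.name = name
        self.disk = disk          # shared dict standing in for the drives
        self.cache = {}           # this controller's own unflushed writes
        self.write_buffer = {}    # mirror copies of the partner's unflushed writes
        self.partner = None
        self.failed = False

    def write(self, block, data):
        if self.failed:
            raise RuntimeError(f"{self.name} has failed")
        self.cache[block] = data
        # Mirror to the partner's write buffer before acknowledging.
        if self.partner is not None and not self.partner.failed:
            self.partner.write_buffer[block] = data
        return "complete"

    def flush(self):
        """Make cached writes permanent; the mirror copies are then obsolete."""
        self.disk.update(self.cache)
        self.cache.clear()
        if self.partner is not None:
            self.partner.write_buffer.clear()

    def fail(self):
        """Simulate a controller failure: local cache contents are lost."""
        self.failed = True
        self.cache.clear()

    def take_over(self):
        """Run on the survivor: flush the failed partner's mirrored writes."""
        self.disk.update(self.write_buffer)
        self.write_buffer.clear()


disk = {}
c0 = MirroredController("C0", disk)
c1 = MirroredController("C1", disk)
c0.partner, c1.partner = c1, c0

c0.write(10, "debit")
c0.write(11, "credit")    # both writes acknowledged, neither on disk yet
c0.fail()                 # Controller 0 dies with the writes still cached
c1.take_over()
print(disk)               # prints: {10: 'debit', 11: 'credit'}
```

Without the mirror copy, both acknowledged writes would have disappeared with Controller 0, which is exactly the lost write problem.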
Cache coherency. High-availability computing environments require I/O subsystems with "no single point of failure." The only way to achieve this objective is with redundancy built in throughout the subsystem, including host bus adapters, hubs or switches, RAID controllers, power supplies and fans, and data paths from servers to controllers and from controllers to disks. In the event of a host-side path failure, both RAID controllers in a dual-active pair must be able to respond to I/O requests with the current state of stored data, regardless of the I/O path to the controller. High-end RAID controllers typically solve this problem with an expensive low-latency memory bus between mirrored caches in the two controllers. A more cost-effective approach for NT environments is a controller-to-LUN access control strategy or reservation system that locks LUNs or parts of LUNs before I/Os are serviced (see above figure). A LUN is the SCSI protocol terminology for a logical disk.
Consider the simple case of a server with redundant paths to dual-active controllers. The server issues I/O requests over Path A (solid arrow in Figure 2) to Controller 0 for LUN 0. Controller 0 reserves LUN 0 and services the I/O requests. Then Path A fails and the server switches to Path B connected to Controller 1. Controller 1 negotiates with its partner to release its LUN 0 reservation. Since Controller 0 does not have any outstanding I/O requests for LUN 0, it flushes its write cache, updating LUN 0 with the current state of all data blocks, and then releases its reservation. Controller 1 can now reserve LUN 0 and begin servicing the I/O requests.
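The handoff just described can be sketched in Python. This is a simplified model under stated assumptions: the Controller class, the shared reservations table, and the (lun, block) keys are hypothetical, and the releasing controller is assumed to have no outstanding I/O requests for the LUN.

```python
class Controller:
    """Toy controller that must hold a LUN reservation before servicing I/O."""

    def __init__(self, name, disk, reservations):
        self.name = name
        self.disk = disk                  # shared: (lun, block) -> data on disk
        self.reservations = reservations  # shared: lun -> owning controller
        self.cache = {}                   # unflushed writes: (lun, block) -> data

    def release(self, lun):
        """Flush cached writes for the LUN, then give up the reservation."""
        for key in [k for k in self.cache if k[0] == lun]:
            self.disk[key] = self.cache.pop(key)
        if self.reservations.get(lun) is self:
            del self.reservations[lun]

    def reserve(self, lun):
        """Take the reservation, asking the current owner to release first."""
        owner = self.reservations.get(lun)
        if owner is not None and owner is not self:
            owner.release(lun)
        self.reservations[lun] = self

    def write(self, lun, block, data):
        self.reserve(lun)
        self.cache[(lun, block)] = data   # write-back: cached, not yet on disk
        return "complete"

    def read(self, lun, block):
        self.reserve(lun)
        key = (lun, block)
        return self.cache[key] if key in self.cache else self.disk.get(key)


disk, reservations = {(0, 5): "v1"}, {}
c0 = Controller("Controller 0", disk, reservations)
c1 = Controller("Controller 1", disk, reservations)

c0.write(0, 5, "v2")      # Path A: Controller 0 reserves LUN 0, caches the write
# Path A fails; the server retries over Path B to Controller 1.
print(c1.read(0, 5))      # prints: v2
```

Because Controller 0 flushes before releasing, Controller 1 reads the current value rather than the stale "v1" that was on disk when the path failed.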
To minimize reservation contention between controllers, reservation algorithms can lock LUNs at the stripe level of granularity, allowing multiple servers shared access to a LUN.
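To illustrate why finer granularity reduces contention, the sketch below keys reservations by (LUN, stripe) instead of by LUN. The StripeReservations class, the four-block stripe size, and the controller names are assumptions for the example, not a description of any shipping controller.

```python
class StripeReservations:
    """Toy reservation table keyed by (LUN, stripe) rather than whole LUN."""

    def __init__(self, blocks_per_stripe=4):
        self.blocks_per_stripe = blocks_per_stripe
        self.owners = {}                  # (lun, stripe) -> controller name

    def _key(self, lun, block):
        return (lun, block // self.blocks_per_stripe)

    def reserve(self, controller, lun, block):
        """Return True if the controller now holds the block's stripe."""
        owner = self.owners.setdefault(self._key(lun, block), controller)
        return owner == controller

    def release(self, controller, lun, block):
        key = self._key(lun, block)
        if self.owners.get(key) == controller:
            del self.owners[key]


table = StripeReservations(blocks_per_stripe=4)
print(table.reserve("C0", 0, 1))   # prints: True  (C0 holds LUN 0, stripe 0)
print(table.reserve("C1", 0, 9))   # prints: True  (same LUN, stripe 2: no conflict)
print(table.reserve("C1", 0, 2))   # prints: False (stripe 0 is held by C0)
table.release("C0", 0, 1)
print(table.reserve("C1", 0, 2))   # prints: True
```

With whole-LUN locking, the second request would have forced a flush and handoff even though the two controllers were touching unrelated stripes.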
Features like cache mirroring and cache coherency have been available in high-end RAID controllers for mainframes and Unix systems for some time; however, a new generation of low-cost end-to-end fibre controllers designed specifically for NT clusters and SANs is beginning to hit the market.
In some controllers, memory is logically partitioned with a write buffer that is reserved for mirrored writes.
A controller-to-LUN access control design, with a reservation system, locks LUNs (or parts of LUNs) before I/Os are serviced.
Kevin Smith is senior director of business management and marketing for external products at Mylex Corp. (www.mylex.com), in Boulder, CO.