How to Optimize RAID for VLDB
In the second of a three-part series, two IT administrators give tips for tuning RAID subsystems for very large databases; in this case, Oracle.
By Edwin Lehr and Christopher Schultz
From a performance perspective, the best RAID level for Oracle databases is RAID-0. From a performance and reliability perspective, RAID 0+1 is optimal, and from a performance, reliability, and cost perspective, RAID-3 or RAID-5 is best. Most very large databases (VLDBs) use RAID-3 or RAID-5 as a tradeoff for performance, reliability, and cost. Only VLDBs requiring high sequential throughput, such as multimedia databases and video applications, use RAID-3. Because most Oracle databases can be concurrently accessed by a number of users or processes for which the I/O pattern is random access, RAID-5 is most often the choice at commercial sites.
When placing Oracle segments in RAID, database administrators must be aware that different segments have different I/O patterns and therefore should be configured independently. For example, Oracle redo logs are written and read sequentially by a single process, while data segments and sort segments exhibit random-access patterns with higher concurrent operations. Read-only tablespaces may be segregated to advantage in some scenarios, particularly when other data segments are write-intensive. Some I/O operations in Oracle occur synchronously and halt database operations until the I/O is complete (e.g., archiving). These operations require special planning and configuration.
For these reasons, it is helpful if your RAID solution supports configurable RAID levels. For optimal configuration of your Oracle database, your RAID solution must support RAID-0, RAID 0+1, and RAID-5 (or, in some cases, RAID-3) across its logical unit numbers (LUNs). It is possible, with logical volume managers from your operating system, to implement RAID-0 and RAID 0+1 external to your RAID cabinet. Instead of presenting you with a matrix of reliability, performance, and RAID-level tradeoffs for each of Oracle's segments, Table 1 lists the configurations that we have successfully used at our company.
You may be surprised by our use of RAID-5 in such write-intensive segments as rollbacks. We have found that the "write penalty" associated with RAID-5 can be mitigated by the improved caching and sorting algorithms of vendors such as Clariion and MTI Technology. Still, you should benchmark the write performance of a vendor's RAID-5 with your own application. A critical component of a fast RAID-5 is a good write-back cache or write-gathering cache. Write performance matters most in transaction-oriented applications.
The closer the cache is to where the data is needed, the more effective it is--thanks to the so-called "skimming effect" of cached I/O chains. The application (in this case, Oracle) caches the most frequently used data, requesting blocks from Unix file system buffers when the data is not found in its own buffers. Blocks that aren't already in Unix buffers are requested from the RAID controllers. One cache "skims" the hottest blocks from the cache beneath it. This means that the controller cache, at the end of the chain, has to be quite large to be useful.
Vendors who solve performance problems with cache may be wasting your money. Our tests show that a RAID controller with 64MB of cache performs no better than one with 16MB in VLDB environments. So, when RAID salespeople tout all the cache they can put on controllers, ask them what good 4GB of cache will do when your users do a full-table scan of a 20GB table. We have found that MTI's read-ahead caching algorithms produce effective hit ratios in 16MB of cache when performing full-table scans. MTI has not crammed a 20GB table into 16MB; instead, they have made their software more intelligent. It recognizes a sequential read and makes intelligent guesses about which blocks will be needed in the future.
Array Size, Stripe Depth, I/O Size
Apart from placing Oracle segments in appropriate RAID configurations and tuning the buffer space, the most effective tuning opportunities lie in the RAID solution itself and the optimization of the individual arrays. The relevant parameters are array size, stripe depth, and I/O size. How you tune them depends on the level of concurrent access you expect.
To understand the relationship between array size and stripe depth, consider a "concatenated" array. In any array, a number of drives are defined as being in a logical volume and appear as one large drive to the operating system. In a concatenated array, the first drive in the array must be filled before data can be written to the next drive. If you have eight 4GB drives, for example, the first 4GB block is written to the first drive, the next 4GB to the next drive, and so on. Think of this eight-drive array as having a stripe depth of 4GB.
With this arrangement you achieve the goal of simplified storage. Though many drives are now seen as one drive, the throughput never exceeds that of a single drive because the largest I/O request from your application is 64KB to 128KB. Since this small chunk of data is always found in one drive, any one request is always serviced by one drive.
Suppose you want better service time than one drive can provide, or you need all the drives in this eight-drive array to work on a single I/O request of 64KB so that the I/O request is serviced in a fraction of the time.
To accomplish this, you create a striped array with a stripe depth of 8KB. That way, when you write a file to the array, 8KB is written to drive one, 8KB to the next drive, and so on. Now a single I/O request would scream. Stripe depth and the size of the I/O requests determine how many drives service a single request. The more drives servicing the request, the faster the service time for a single request--to a point.
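The striping layout described above can be sketched in a few lines of code. This is not from the article; it is a minimal illustration, assuming the standard round-robin RAID-0 chunk placement, of how stripe depth determines how many drives service one request.

```python
# Sketch (our illustration, not the article's): round-robin striping maps
# consecutive stripe-depth-sized chunks onto consecutive drives.

def drives_servicing(io_offset_kb, io_size_kb, stripe_kb=8, n_drives=8):
    """Return the set of drives that hold some part of the request."""
    first_chunk = io_offset_kb // stripe_kb
    last_chunk = (io_offset_kb + io_size_kb - 1) // stripe_kb
    return {chunk % n_drives for chunk in range(first_chunk, last_chunk + 1)}

# A 64KB request with an 8KB stripe depth touches all eight drives...
print(sorted(drives_servicing(0, 64)))  # [0, 1, 2, 3, 4, 5, 6, 7]

# ...while a 4GB stripe depth (the concatenated array) keeps it on one drive.
print(sorted(drives_servicing(0, 64, stripe_kb=4 * 1024 * 1024)))  # [0]
```

With eight drives sharing the transfer, the single request is serviced in roughly an eighth of one drive's transfer time (seek and rotational latency are not divided).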
Concurrency Is Key
Recall our concatenated array of eight 4GB drives. This time, suppose eight users are reading concurrently from the array. If they all want data stored on the same drive, response time will be poor because their requests are queued. However, if each user requests data that resides on a different drive, there is no contention and therefore no queuing--one drive serves each user. If the access pattern is truly random and the number of concurrent requests never exceeds eight, this arrangement is all right. However, no request gets the benefit of the throughput provided by multiple drives servicing it. To eliminate contention, you have sacrificed service time.
But if you use an eight-drive striped array with a stripe depth of 8KB and eight users are accessing data concurrently (see figure on top of next page), each request is serviced faster because eight drives are working on the request. However, contention is now guaranteed and the average wait time is higher (due to queuing). The challenge of tuning drive arrays is achieving an optimal balance of average wait time and average service time in an array of a given size with a given concurrency.
Tuning Arrays for Oracle VLDBs
At our bank, we devoted over 180 hours of machine time to determine the optimal configuration of drive arrays at varying array sizes, stripe depths, and concurrency. With Oracle, you can change the I/O size to affect the balance between average wait time and average service time. In this study, our goal was to optimally tune Oracle sorts on arrays of raw drives configured under the operating system logical volume manager (LVM). The techniques presented below address tuning unintelligent drive arrays, such as those under the LVM.
To the extent that vendor software does not manage queuing for you, these techniques can be applied to intelligent RAID arrays. In arrays managed by intelligent controllers, vendors' algorithms do a better job of managing average waits and average service times. In such cases, you should make your arrays as large as possible. Most vendors still allow you to configure stripe depth. If so, use the following ideas to pick an appropriate depth.
To see how we can get to that goal, let's start with Oracle's I/O size (IOS). You should not alter Oracle's IOS to tune sort arrays, and you should not let the effect of IOS on data arrays determine its setting without first carefully considering the parameter's other effects. We do not go into tuning IOS here, except to say that VLDBs with primarily select activity (particularly full-table scans) should probably use the maximum I/O size available (64KB or 128KB, depending on the Unix port).
Transaction-oriented databases should probably be configured with a smaller IOS. If you are doing enough sorting activity to warrant special efforts on tuning sort arrays, your database is probably doing enough reads to justify a large IOS. We assume an IOS of 64KB.
Concurrency is the next most-intractable variable, though there are some things you can do to control it. Concurrency is determined not only by the number of user processes, but also by the number of parallel query servers configured, if you are using them. (VLDBs generally should use the PQ option.)
As with IOS, PQ servers should be tuned separately with the goal of using as many as you can within the system`s resource limits. PQ server processes and user processes represent work being done, so it is advisable to treat your concurrency as a given and seek to limit it only if it will enhance the total throughput of your system.
Concurrency is the number of I/O requests outstanding at a given time. Let's assume you expect a peak of 20 concurrent requests to be active in your sort space. Referring to models A and B in the figure above, you can see the two opposing configurations: "A" stripes each I/O across all the drives for the fastest service time; "B" stripes each I/O across one drive at a time for maximum concurrency. If we set our stripe depth equal to our IOS of 64KB (model B), we eliminate queuing for service. However, in doing so, we ensure that only one drive services a single I/O, thereby limiting service time to the speed of one drive. If we set our stripe depth to the IOS divided by the number of drives (model A), the service time is excellent, but the average wait for service is higher. Model C shows a possible compromise between these two extremes. As long as the total I/O time does not exceed 50ms, we consider the compromise a good one.
"C" allows each request to be serviced by four drives, improving service time by a factor of 25%. With 20 concurrent I/Os hitting this configuration, the average queue at any one time is four. Whether or not this queue is acceptable depends on the performance of the drives as reflected in the average wait and average service times for each job in the queue. Since average service times should drop by 25% in this configuration, a job can wait in queue for about 7ms without feeling the difference between this configuration and one served by a single drive. The first job in the queue will wait 21ms; the second, 42ms; the third, 63ms; and the fourth, 84ms. The average wait will be 42ms ((0+21+42+63+84)/5). A wait of 42ms plus 21ms for service means a response time of 63ms. You have increased the queue by four, but the total response time by only two. Is this compromise worth it? You save four drives, but at peak loads, this array performs only half as well as 20 drives, and 63ms is over the acceptable limit of 50ms for total I/O time.
Is there a better compromise--one that gets your response time below 50ms with fewer than 20 drives?
In model D, each 64KB I/O takes about 23ms (14ms + 5ms + 4.3ms). The average queue is 2.5 requests (20 requests/8 paths), and the average response time is 40.25ms. To implement this configuration, follow these steps:
- Let AS = array size = # of drives
- Let IOS = Oracle IO size = (64, 32, 16, 8, 4, 2) in kilobytes
- Let S = stripe depth where stripe depth is a factor of IOS
- Let C = concurrency = the expected concurrent operations
- Let IOP = IO paths = AS/(IOS/S)
- Let AV = average service = seek time + rotational latency + transfer time of IOS (in milliseconds). Unless actual drive values are known, use 14 + 5 + ((S/204) * 0.27).
- Let AQ = average queue = C/IOP
Create a Queue Table (see Table 2) by filling IOR (I/O request) cells from IOP 1 down to IOP 8 and back again until total Q equals C (Q is the number of cells marked "io" in each row).
Next, for each IOP with a distinct Q, complete a Wait Time Table (see Table 3). The average in Table 3 is the total job wait time divided by the highest Q# (69/3 = 23).
Now, create a wait time table for the IOPs with a Q of 2 (see Table 4). The average of the two average job wait times ((11.5 + 23)/2 = 17.25) is the average wait time for a job in this array. Adding this number to the AV gives the total response time for the average job in this array: 17.25 + 23 = 40.25ms. The performance of this array is acceptable and will improve under less-than-peak loads.
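The three-table procedure above can be sketched as code. AV = 23ms is model D's stated average service time (14 + 5 + 4.3, rounded); the deal-out of requests across paths follows Table 2:

```python
# Queue/wait-table procedure for model D: 20 requests over 8 I/O paths
# at 23ms average service time.
AV, C, IOP = 23, 20, 8

# Table 2: deal C requests across the paths until the total equals C;
# 20 over 8 paths leaves four paths with Q=3 and four with Q=2.
queues = [C // IOP + (1 if p < C % IOP else 0) for p in range(IOP)]

# Tables 3 and 4: in a queue of length Q, jobs wait 0, AV, 2*AV, ...,
# so the average job wait for that Q is AV * (Q - 1) / 2.
avg_waits = [AV * (q - 1) / 2 for q in sorted(set(queues))]  # [11.5, 23.0]
avg_wait = sum(avg_waits) / len(avg_waits)                   # 17.25ms
response = avg_wait + AV                                     # 40.25ms

print(avg_wait, response)   # 17.25 40.25
```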
Each of the models (A, B, C, and D) has an Oracle database application. Models C and D, where IOP is more than one and less than the array size, are applicable to random I/O patterns with high concurrency. Model A, where IOP to the array is one, is applicable to low concurrency applications for which service time must be minimized. For example, model A on RAID 0+1 (mirrored) would be ideal for Oracle online redo logs in an OLTP environment. In this scenario, the online redo logs would be a bottleneck for the whole database and therefore should be configured for the fastest response time. Model B would be ideal for high concurrency and sequential I/O patterns.
All of these models are for RAID-0 or RAID 0+1, and perhaps RAID-3 and RAID-5. With Oracle, the most useful application of large RAID-0 arrays is for sort segments. These should be configured on raw devices, assuming the sort segments will not be part of any backup scheme. Note that using RAID-0 for Oracle sort segments assumes the application can afford to lose the sort segments in the event of a failure. If it cannot, consider RAID 0+1 for sort space.
Note: The values for seek time, rotational latency, and transfer time reflect the average "off-the-shelf" drive. You should determine the actual values for your drives, either by de-rating the manufacturer's claims by 20% or by monitoring them in use with sar or iostat.
At the time this article was written, Edwin E. Lehr was the Oracle database administrator and Christopher Schultz was the UNIX system administrator at a major bank in eastern U.S. The bank has since been acquired. Lehr is now a principal consultant at Oracle Corporation and Schultz is systems engineer at Silicon Graphics, Inc.