Windows NT I/O performance boosts
New applications, such as data warehousing, are putting exponential pressure on both the capacity and the performance of I/O subsystems. IT managers have long searched for a software silver bullet to instantly boost I/O throughput.
By Keith Walls
Useful computer system performance is largely dependent on the I/O subsystem. Today's CPUs run at speeds of up to 600MHz. If just 8 bytes of data were processed during each cycle, the resulting throughput would be nearly 5GBps. Unfortunately, today's fastest individual disks are capable of delivering only something on the order of 10MBps.
While disks and storage subsystems have become faster, they have not achieved the rate of improvement that we have seen in CPUs, memory, and bus throughput. As a result, for decades CPUs, operating systems, and other hardware and software products have all been designed around a relative deficit in I/O.
So it is no surprise that the quest for increased throughput from disks has occupied many minds. We would all like to gain significant I/O throughput for the smallest possible cost. The idea of being able to add software to a system and magically improve its I/O performance by a factor of two or more is quite attractive.
One of the earliest ways to improve I/O throughput was to focus on the issue of file fragmentation. In the days of single-user DOS machines, it made sense to eliminate file fragmentation for two principal reasons. First, under DOS almost all files were accessed sequentially--files were opened and read from beginning to end. Second, while CPUs were slow, disks spun slowly by today's standards.
That made the penalty for missing a read on a portion of a file and waiting for that data to spin around again a significant problem. The result was the birth of a class of "defragmentation" software that restructured files into as small a number of fragments as possible.
However, Windows NT is not DOS, and the pattern of I/O on a server is quite different from that of a desktop workstation. With the rise of databases, a great deal of the file I/O on a server is far from sequential.
When a file is being used as a database container, the logical adjacency of records, tables, rows, or columns within that database is unlikely to bear any resemblance to the physical layout of the file on disk.
Furthermore, a server is not used by just a single person. Servers typically support tens or even hundreds of clients. What's more, modern disk systems have become sophisticated in terms of read-ahead and seek optimization just for that situation. To handle multiple simultaneous requests from multiple independent clients optimally, I/O subsystems insert each request into a processing queue.
To optimize data transfer, the electronics of a modern I/O subsystem reorders the queued I/O requests to minimize head-movement and revolution-time stalls.
As a result, the order in which the requests are serviced depends not on the order in which they were received, but on their proximity to other requests.
If that alone were not enough to complicate matters, Windows NT adds one more level of indirection: a file-based cache. This cache is able to maintain copies of Master File Table (MFT) entries, file data blocks, and many other types of frequently accessed data. So, once we introduce the notion that there are multiple consumers of the disk's data, the old-line arguments for file and disk defragmentation are not quite so compelling. While a disk may look bad aesthetically because of all the fragmented files on it, remember: Computers are very good at keeping track of where they left things.
The effects of file fragmentation on read-ahead and file placement with respect to disk caching are interesting. The most effective way to boost disk performance is to avoid doing the I/O in the first place. If the caching process can be made more effective, either by placing files in such a way that favorably biases caching or by manipulating the caching algorithms, I/O throughput can be improved not by a factor of one or two, but by a factor of one or two orders of magnitude.
A Windows NT disk is self-describing, in that it contains sufficient data in a known place to find all of the other information about the volume. As Windows NT boots, it looks for all disk devices and tries to read a particular data block on that disk. That data block contains a set of pointers, one of which is the disk address of the beginning of the master file table (MFT).
When a program opens a file, it does so through a directory file. A directory file translates file names into file addresses. A file address is used by the file system to locate further information about the file. First, the file's summary information data block is located directly in the MFT. This contains all security and access information for the file. Once it has been determined that the user is allowed to access the file, the starting location for each fragment of the file is extracted from the MFT.
The storage map contains descriptors that essentially consist of a disk block address and a length counter. Once this map is loaded into memory, the user program can gain access to the data blocks of the file. For example, a map might look like this:
Length    Disk Address
200       122880
102       1105920
This tells the file system that the file is 302 blocks long and is broken up into two extents or fragments. If the user program asks to read block number 70, the operating system will calculate the offset from the start of the file, address 122880. If the user reads block 250, the operating system will calculate the offset from the start of the second file fragment, address 1105920.
Now consider the case of a client or user I/O that specifies that a read be issued for file block 198 and that the length of the read be eight data blocks. The operating system has sufficient data to calculate the starting disk address from the first extent's base address, 122880. However, the operating system must also detect that the remaining length in the first extent is too small to satisfy the entire request. As a result, the operating system must break the user's I/O request into two parts.
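The extent-to-disk translation described above can be sketched in a few lines of Python. This is an illustration, not NTFS code; the 4096-byte block size and the 200/102-block split between the two extents are assumptions chosen to be consistent with the article's addresses and 302-block total.

```python
BLOCK_SIZE = 4096  # assumed block size; the example addresses divide evenly by it

# Extent map from the example: (length_in_blocks, disk_byte_address).
# The 200/102 split is assumed, consistent with an 8-block read at
# file block 198 crossing the extent boundary.
EXTENTS = [(200, 122880), (102, 1105920)]

def map_read(start_block, num_blocks, extents=EXTENTS):
    """Translate a file-relative read into one physical I/O per extent crossed."""
    ios = []
    base = 0  # file block number where the current extent begins
    for length, disk_addr in extents:
        end = base + length
        if start_block < end and start_block + num_blocks > base:
            first = max(start_block, base)
            count = min(start_block + num_blocks, end) - first
            ios.append((disk_addr + (first - base) * BLOCK_SIZE, count))
        base = end
    return ios

print(map_read(70, 1))   # [(409600, 1)]  -- single I/O in the first extent
print(map_read(250, 1))  # [(1310720, 1)] -- single I/O in the second extent
print(map_read(198, 8))  # [(933888, 2), (1105920, 6)] -- a divided I/O
```

The third call shows the divided I/O from the text: two blocks come from the tail of the first extent and the remaining six from the head of the second.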
Now, let's look at what is happening on the physical disk. Let's assume that the directory file is in memory. We must now move the disk head to the MFT, offset by internal file number. We read the data at that location and check access against the security descriptor stored there. When the user issues the read that crosses extents, the operating system must divide the I/O into multiple parts corresponding to each of the extents into which the I/O falls.
The big question is: How much of an effect does the fragmentation of files have on performance in a multiuser environment that is strongly cached? This turns out to be a difficult question to answer, because there are many dependencies.
Suppose a file is 100MB in length and has 400 extents. The old logic would predict a significant measurable impact on performance. In a Win NT server environment, however, performance loss will depend on the proportion of user I/Os that cross file extents.
If only 2% of the I/Os cross extents, then fragmentation will have minimal impact. If 98% cause divided I/Os, then the effect will be harmful. Consider a disk that has a cluster size of 8192 bytes (8KB). If all I/Os directed to the file have sizes of 2^n KB, where n is 0, 1, 2, or 3 (1KB to 8KB), and begin on an 8KB boundary, then no divided I/Os will be generated.
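The alignment claim can be checked with a one-line predicate: since extents are built from whole clusters, a read that never spans a cluster boundary can never span an extent boundary. This sketch assumes byte offsets and the 8KB cluster size from the example above.

```python
CLUSTER = 8192  # 8KB cluster size, per the example in the text

def crosses_cluster(offset, size):
    """True if a read of `size` bytes at byte `offset` spans a cluster boundary."""
    return offset // CLUSTER != (offset + size - 1) // CLUSTER

# Cluster-aligned reads of 1KB, 2KB, 4KB, or 8KB never cross a boundary...
print(any(crosses_cluster(CLUSTER * blk, 1024 * 2**n)
          for blk in range(4) for n in range(4)))   # False
# ...but even a tiny misaligned read can.
print(crosses_cluster(8191, 2))                     # True
```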
Compounding the potential problems associated with file fragmentation is a time dependency issue that leads to a second problem: disk fragmentation. After a sufficient length of time, it is likely that the free space on a disk will also become fragmented.
When a file is created, it occupies data blocks on the disk. When the file is deleted, those data blocks are marked as free and another file can allocate them. Consider the simple case of creating three 8KB files. Now delete the second file. You have just left an 8KB gap between the data blocks allocated for files one and three. When a new file is created, the file system has the choice as to whether this newly freed 8KB segment is used.
If this fourth file is 8KB or smaller, the fragment left over from the previous delete is a good choice. If the new file will be larger than 8KB, however, using that remnant will result in more file fragmentation.
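A toy first-fit allocator illustrates the choice. Create three 8KB files, delete the middle one, and the resulting 8KB hole is reused only by a file that fits; a larger file must skip past it, leaving the hole behind. This is a sketch with invented names; real NTFS allocation is considerably more elaborate.

```python
def first_fit(free, need):
    """free: list of (start, length) runs in 8KB clusters; allocate `need` clusters.
    Returns the starting cluster, or None if no single run is large enough."""
    for i, (start, length) in enumerate(free):
        if length >= need:
            if length == need:
                free.pop(i)
            else:
                free[i] = (start + need, length - need)
            return start
    return None  # would have to fragment the file or fail

# Clusters 0, 1, 2 hold files one, two, three; deleting file two frees cluster 1.
small = [(1, 1), (3, 97)]
print(first_fit(small, 1))  # 1 -- an 8KB file reuses the hole exactly
print(small)                # [(3, 97)]

large = [(1, 1), (3, 97)]
print(first_fit(large, 3))  # 3 -- a 24KB file skips the too-small hole
print(large)                # [(1, 1), (6, 94)] -- the hole lingers as free-space fragmentation
```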
This is the classic problem that occurs with log files. Typically, chunks of data are added to the end of a log file on an infrequent basis. Most likely, the file is opened when a particular application starts and is closed when that application exits.
It is likely that the file will become highly fragmented in this scenario. Similarly, when that log file is deleted, it will likely leave behind many small fragments of unused space on the disk.
Cache on demand
The most complicating factor to this scheme, however, is the presence of a truly effective cache built into the Win NT operating system. The major principle of caching is a simple one: Data that has been read or written is likely to be read again, and the best measure of that likelihood is how recently the access occurred. Since memory access is at least 100 times faster than disk access, caching can significantly reduce disk I/O traffic and mitigate or possibly eliminate the second-order effects of file fragmentation.
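The recency principle can be sketched with a minimal least-recently-used (LRU) block cache. This is illustrative only; NT's cache manager is far more sophisticated than a single LRU list.

```python
from collections import OrderedDict

class LRUBlockCache:
    """Minimal least-recently-used cache over disk block numbers."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = self.misses = 0

    def read(self, block):
        if block in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(block)       # mark as most recently used
        else:
            self.misses += 1
            if len(self.blocks) >= self.capacity:
                self.blocks.popitem(last=False)  # evict the least recently used
            self.blocks[block] = True

cache = LRUBlockCache(capacity=8)
for _ in range(100):          # a hot 8-block working set, re-read repeatedly
    for block in range(8):
        cache.read(block)
print(cache.hits, cache.misses)   # 792 8 -- only the first pass touches the disk
```

Once the working set fits in the cache, 99% of the reads never reach the disk at all, which is exactly the throughput multiplier the article describes.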
Windows NT performs look-ahead caching when and where it can. In general, files fall into one of three categories in relation to the patterns in which they are read. In the simplest case, a file is read sequentially, starting from a given point and moving through the file data in order. Files can also be read in random order. In this case, by definition it is not possible to predict the order in which the data will be read. Finally, the database case indicates one of two conditions. Either there is an application "supervisor" that will perform its own caching or there are hot spots in the file where indexes reside.
Windows NT gives the application developer the option to specify a bias to the read-ahead cache. When an application opens a file, a FILE_FLAG_SEQUENTIAL_SCAN, FILE_FLAG_RANDOM_ACCESS, or FILE_FLAG_NO_BUFFERING flag can be set within the application, if the developer knows that the file will be read sequentially, accessed randomly, or cached by a database supervisor. For example, SQL Server maintains its own internal cache that is tuned to the access patterns for the database. As a result, the Windows NT file system cache does not cache SQL Server files.
Piecing the puzzle
CTO Labs set about taking a new approach to disk performance measurement. We wanted to accurately measure I/O performance of files in such a way that we could compare the performance of files and disks under a multiuser load and adequately assess the effects of Win NT caching. We already have a multiuser raw-disk load benchmark and a linear file I/O performance benchmark. We therefore designed version 1.0 of our Fileload benchmark as a companion to our existing benchmark suite.
First, we wanted to be able to analyze file-access performance on the basis of varying access patterns. The character of file I/O performance varies greatly, depending on whether that file is being accessed sequentially, randomly, or as a database file. In the purely sequential case, the I/O pattern is relatively easy to predict. In the truly random I/O case, access patterns cannot be discerned and prefetching can only succeed if the entire file is brought into memory. The database access pattern is difficult to predict from the perspective of the file system, but is easily understood from the database-management perspective.
Tightly intertwined with the access pattern of a file will be the effects introduced by Win NT's caching scheme. As a result, we also wanted to be able to select Win NT's caching and prefetch algorithms, so that we could understand their effects on file throughput. Finally, we wanted to lay the foundation for distributing the benchmark in such a way that we could run multiple series of tests across several disks and files within a LAN.
Nonetheless, our initial goal with the Fileload benchmark was to determine the nature of the effects file fragmentation has on server I/O performance. Our initial expectation was that file fragmentation would affect only sequential file access and perhaps random access with large read sizes, where the likelihood of divided I/O was greater. We further expected that once a file's data blocks were resident in the Windows NT file cache, file fragmentation would have no meaning at all. Finally, there was the question of what file fragmentation intrinsically means in a RAID set, where physical disk blocks are distributed across multiple disks.
To answer these questions, we set up shop on a four-processor DELL PowerEdge 6100 server with 512MB of RAM. This gave us a sufficiently high-end platform to run multiuser tests with no fear of running out of CPU or memory resources for caching. We then created several utility programs to be used in conjunction with our new Fileload benchmark. One of these utilities generates an intertwined set of highly fragmented files of any given size. For our tests, we created several sets of 100MB fragmented files. The analyze utility in Executive Software's DISKEEPER, which is being bundled with Windows NT 5.0, reported these files as having approximately 400 (394 to 470) fragments each.
We then made several contiguous copies of our 100MB file.
We decided to start with the obvious and began by sequentially reading the contiguous and fragmented files in single-user mode from a single NTFS-formatted disk. While Fileload provides for varying the read sizes, we chose to focus on 8KB reads when reading sequentially, which is the typical I/O size found in many Win32 applications. For database reads, we decided to focus on 8KB and 64KB reads, as SQL Server 7.0 accesses data in 64KB chunks for many internal functions.
Water is wet (usually)
In setting up the baseline, everything went according to Hoyle. I/O throughput jumped 30% when reading a contiguous file sequentially, as compared to our highly fragmented version of the same file. However, when we introduced Win NT caching into the test by enabling the benchmark to set FILE_FLAG_SEQUENTIAL_SCAN, the performance differential dropped to 17%.
From this point, the Win NT cache was consistently used in every test, since our goal was to measure system throughput, not disk performance. To avoid getting trapped in the cache in sequential read mode--in database mode our goal was quite the opposite--we used another of the utilities that we created to flush the NT cache on demand.
Having proven that water is wet, we now wanted to get to the heart of this scenario--understanding the effects of file fragmentation when multiple users access multiple files on the same disk. To achieve this, we ran four copies of the benchmark in competition with each other. Now the fun began.
With multiple users sequentially accessing individual files on the same disk, there was a marked degradation in throughput reading the contiguous files. Throughput on our fragmented files was consistently about 10% better than with the contiguous files in test after test. File fragmentation appeared to offer quite a performance advantage!
The truth about what was happening could be found in the details of the individual file results. Consistently, the highest single test throughput would come from a contiguous file. The answer to the anomaly we were encountering had to do with the position of the files on the disk. Our fragmentation utility had created four highly fragmented but tightly intertwined files in the center of the disk. Our contiguous files lay at either end of the fragmented files.
As a result, the I/O subsystem's optimization routine was actually biasing reads in favor of the fragmented files. It doesn't have a notion of file structure. It only knows about physical disk blocks. The four processes accessing the fragmented files were generating requests with greater locality of reference than the four processes accessing the contiguous files. As a result, the seek optimization mechanism was pushing "far" accesses for our contiguous files further down the I/O queue.
Whether or not the files were contiguous was of far less consequence than whether the files were located near each other. As soon as we reconstructed the disk so that the contiguous files were consolidated in one region, the pendulum swung in favor of the consolidated files, which consistently benefited from about 7% faster throughput.
The important observation here is the effect of file proximity for frequently accessed, and especially simultaneously accessed, files. While there is an internal logging feature in Windows 98 to identify frequently accessed files, there is no such mechanism in Windows NT 5.0. This means that most Windows NT disk defragmentation utilities will simply blindly pack all files at the outer edge of the disk, if you are lucky.
The ideal situation would be to have frequently accessed files grouped by folder location at the outer edge, where for very high-capacity disks throughput may be faster by upwards of 30%. Grouping by folder location ensures the best locality of reference for files opened simultaneously by a particular application. This is precisely the scheme used by Norton Utilities for Windows 95 and 98. It is not yet available for Windows NT.
Worse, PerfectDisk for Windows NT properly identifies frequently accessed files, but currently maps these files to two locations at opposite ends of the disk--the scheme is supposed to change in a future release. There is logic in such madness, but unfortunately the logic only makes sense in the OpenVMS world. OpenVMS puts its equivalent to Windows NT's MFT in the center of a disk and does not have a caching facility to keep that file metadata in memory. As a result, putting active files on both sides of the MFT facilitates all of the MFT lookups that occur with heavy disk access under that OS. Under Windows NT, such a scheme is less than perfect.
Database cache out
In all cases of competing non-sequential access to files, we primed the Windows NT cache with data from each file. Database I/O is the most complex pattern we designed for the Fileload benchmark. In a database scenario, we needed to simulate the distinguishing characteristic of database access--most of the I/O is directed to index areas within the file.
To meet this requirement, version 1.0 of CTO Labs Fileload benchmark divides the underlying container file into eight regions. Four of those regions are used to simulate index areas and four simulate data areas. The benchmark then executes a pattern of I/O that directs 90% of the file access towards the index parts. The remaining 10% of I/O is spread evenly but randomly across the data areas. For our simulated database access, all file I/O is queued asynchronously and notification of I/O completion is achieved through an I/O completion port.
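The access-pattern generator at the heart of that design can be sketched as follows. The alternating index/data region layout and all names here are our own illustrative guesses, not Fileload's actual code.

```python
import random

FILE_BLOCKS = 12800            # a 100MB container file in 8KB blocks (assumed sizes)
REGION = FILE_BLOCKS // 8      # eight equal regions
INDEX_REGIONS = (0, 2, 4, 6)   # four regions simulate indexes (assumed layout)
DATA_REGIONS = (1, 3, 5, 7)    # four regions simulate data

def next_block(rng):
    """90% of reads target the index regions; 10% spread randomly over data regions."""
    region = rng.choice(INDEX_REGIONS if rng.random() < 0.9 else DATA_REGIONS)
    return region * REGION + rng.randrange(REGION)

rng = random.Random(42)
reads = [next_block(rng) for _ in range(10000)]
index_share = sum((b // REGION) in INDEX_REGIONS for b in reads) / len(reads)
print(round(index_share, 2))   # ~0.9
```

Because 90% of the traffic keeps hitting the same four index regions, a cache that retains those regions absorbs most of the I/O, which is why caching dominates the database-mode results below.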
In addition, we set FILE_FLAG_RANDOM_ACCESS for Win NT caching.
As expected, the differences between fragmented and consolidated files under a database access profile were marginal--within about 5%. With large 64KB accesses of the type used in version 7.0 of SQL Server, throughput on the fragmented files was slightly higher, indicating the power of locality of reference in keeping the Win NT cache populated.
Even more interesting was the result of running the database access mode tests on a RAID10 disk set. Because of the effect of the Windows NT Cache, there was statistically negligible difference in throughput from the same test on a simple disk. This was in direct contrast to running a sequential single-user test on the two data sources. In the latter case, throughput off of the RAID10 volume was on the order of four times greater.
These results should be of obvious interest for those architecting database implementations. While RAID10 offers a performance edge over RAID5, since it eliminates the need to calculate and write parity bits, that performance edge comes at a stiff storage penalty--it requires twice the storage to provide full data redundancy. Given the likelihood that any database system configuration will have a generous pool of RAM for caching, the performance edge offered by RAID10 will be restricted to write operations. Thus, knowing the proportion of reads to writes with reasonable accuracy will be of critical importance for making a good price-performance decision.
Finally, with all of our tests pointing to cache improvements as having by far the most dramatic effects on performance, we turned to an add-on caching product, EEC's SuperCache, to boost performance. This is a tricky performance issue. Layered caches can compete with each other for both memory and CPU resources if care is not taken.
Suppose we have two levels of caching for an HTML text file. It is possible to implement the Internet service to cache recently used HTML pages. In doing this, it will burn CPU cycles, allocating memory and populating the cache with the text file data. If that file is also cached in the Win NT file system cache, there will be two copies of the data residing in memory. Put another way, the two copies will occupy different memory segments and consume twice the space required.
Given this inherent problem with layered caches, we had a number of reservations when we started out to measure the performance of SuperCache. Frankly, we expected to be somewhat disappointed. After all, the native Win NT cache enjoys a closer relationship with the NT file system and should therefore be able to make more informed decisions about how best to use the memory space available as a cache.
The key difference between SuperCache and the Win NT cache, however, is the ability to dedicate specific memory resources for the cache. EEC Systems provides a tuning utility that allows the system administrator to change the granularity and overall size of the cache. There is, however, a major limitation: SuperCache can only cache one disk on a system at a time. What's more, reconfiguring SuperCache for another disk requires a reboot of the system.
On the DELL PowerEdge 6100, SuperCache immediately ran into what should have been its worst-case scenario. SuperCache can offer no significant performance benefit when Windows NT has clearly cached most of the file contents. In our test case, the Fileload processes were using few CPU resources and that left the OS with the ability to dedicate significant resources--often over 100MB--to its file system cache.
Nonetheless, disk subsystem performance can be improved substantially with caching software that can grow larger than Windows NT allows the native cache to grow or that can hold on to its contents for a longer time.
With SuperCache's stickier cache, we were hardly disappointed. Using four fragmented files in database mode, I/O throughput increased on the order of 400% to 500%. This did not, however, come without some significant deviation from the standard SuperCache configuration. With four or more processors, a page size of 8KB as opposed to the default of 32KB makes all the difference in the world.
In the idyllic world of infinite cache size, we could keep all disk data in cache and never need to wait for the disk to deliver data. In such a world, physical file organization would be irrelevant. This picture, however, is far from reality. The size of system memory is still quite small relative to the size of aggregate disk storage. The best we can do is to strike a balance between performance and the consumption of memory by optimizing the filling of the cache.
Performance results were far less predictable when CTO Labs introduced the added complexity of multiple users. In this case, performance actually declined with contiguous files. As demonstrated by the performance of consolidated files, file proximity is a much more important key to I/O performance.
In single-user sequential mode, the Fileload benchmark behaved in a perfectly predictable fashion. With Windows NT caching on and off, throughput was substantially increased with contiguous files.
In database mode, the file simulates a database container file and 90% of the I/O requests are mapped to four regions that represent indices. In our tests, caching was the dominant determining performance characteristic. In tests of both 8KB and 64KB data-access sizes, there was little difference in the performance of fragmented and consolidated files.
When it came to database-mode tests, even underlying RAID architecture had a secondary effect on performance. While performance on a RAID10 volume set showed a 350% improvement for sequential reads, in database mode this was entirely masked by caching performance.
In tests on a four-way DELL PowerEdge 6100 server with 512MB of memory, after a bit of tuning for a four-way SMP environment, SuperCache boosted throughput in database mode on the order of 400% to 500%.