Three of the latest Linux journaling file systems show their merits on a fast RAID appliance from Adaptec.
By Jack Fegreus
As Linux servers continue to expand beyond specialized Web and e-mail serving and into the heart of database-driven IT applications, the need for serious data storage options grows in tandem. Spurring this trend is the nearly universal adoption of 24x7 IT operations for business continuity. But that's only half the story. Linux is a very young and evolving operating system. As a result, taking full advantage of the latest in storage devices can be a challenge.
A perfect example of coming to terms with the right hardware/software synergy can be found in our evaluation of Adaptec's DuraStor RAID array. The premise of this RAID appliance is to maximize options for rack-mounted server farms by moving RAID management out of the server and into a rack-mounted 1U RAID appliance that can provide access to as many as 36 drives for multiple servers.
The base configuration is the DuraStor 6220SS, which comprises a DuraStor 6200SR controller module and one DuraStor 312R storage module. The 2U storage module can hold 12 drives. From this starting point, the configuration options grow fast. The 6200SR module can be configured with either one or two internal controllers for single- or dual-port host connections.
The DuraStor 312R module by default is configured with two disk I/O channels. The net result of all these options is the ability to configure the RAID appliance in four basic configurations with various levels of device redundancy: standalone single-port, standalone dual-port, active-active single-port, and active-passive dual-port. Setup and management of these options can be accomplished either through the front control panel or via Adaptec's Java-based Storage Manager Pro software. However, the software only runs on Windows.
Nonetheless, most sites have a mix of Windows NT/2000 servers along with Linux. Given the wide range of configuration options for the 6200, in theory it should be possible to configure a dual-port scheme with a Linux server and a Windows 2000 server. In fact, it works very well: Just don't tell Adaptec you're doing that.
The differences in the role that I/O caching plays in Linux and Windows 2000 are reflected in the performance profiles that result when our OBLdisk benchmark is run. The benchmark reads data sequentially from a disk file in progressively larger block-size requests. On the standard Linux ext2 file system, small reads are delayed and bundled into large reads. As a result, throughput does not vary as our programs issue larger and larger reads. On the other hand, Linux transaction processing is bound by the speed of cache hits. This puts a double whammy on the Adaptec DuraStor configuration. In the case of the DuraStor RAID appliance and the HP NetRAID-2M controller, system memory and cache size were identical. The big difference between the two is that the HP NetRAID-2M sits internally on a 64-bit PCI bus while the DuraStor RAID controller is at the end of an Ultra160 SCSI bus.
When Adaptec talks about dual-porting scenarios, their assumption is that a single host system is connected via two internal SCSI host bus adapters (HBAs). For Adaptec, the heart of the dual-porting problem is that the first version of Storage Manager Pro deals only with configuring and managing the RAID appliance. That means configuring logical RAID arrays (including RAID 10 and 50) and monitoring the health of the physical drives. There is nothing in the software that virtualizes the array volumes at the host level.
Currently, systems administrators must have clustering or virtualization software to prevent data corruption incidents in a dual-porting scenario. This is not a difficult stumbling block for well-seasoned systems administrators, especially in a mixed operating system environment. Windows 2000 is blind to Linux file partitions, so its greedy device-gobbling ways are never an issue. So in reality, very careful dual-port configuration on two servers is not a difficult task. An experienced systems administrator should be able to mask and insulate all of the potential pitfalls from users.
On a single logical volume, we were easily able to create four primary logical drive partitions. We installed QLogic Ultra160 SCSI HBAs in two HP Netserver LP 1000r servers, which were running SuSE Linux 7.3 and Windows 2000 Server, respectively. We chose SuSE 7.3 because it gave us direct out-of-the-box access to all of the major Linux file systems: ext2, ext3, JFS, and ReiserFS.
At the BIOS level, the QLogic HBAs reported both the DuraStor 6200SR unit and the logical "DuraStor disk" that we had created and configured as having four primary partitions. Both SuSE Linux and Windows 2000 saw the disk, and we were able to format the partitions to suit our testing. At the operating system level, Windows 2000 formatted one partition as NTFS and ignored the Linux-formatted partitions. When we loaded the Storage Manager Pro software on the Windows 2000 system, it recognized the DuraStor 6200SR attached to the QLogic HBA, and we were able to launch the management browser.
The management browser supplies all of the necessary basics. The systems administrator can drill down on the devices. This view allows checking on the physical aspects (e.g., power, temperature, and hardware faults) of the RAID appliance. From there, any array associated with the drives in a DuraStor 312R can be accessed and configured. What is lacking, however, is real-time performance monitoring and tuning. For example, there is currently no provision for collecting caching statistics. For Linux hosts, this would be a powerful add-on module.
For version 2.4 of the Linux kernel, cache is king when it comes to I/O. This is clearly reflected in the baseline performance results of our OBLdisk and OBLload benchmark suites (see figures). The fundamental assumption for disk I/O under Linux is that the data should be in cache. As a result, the operating system does everything necessary to make that so. When that strategy fails, however, the performance hit taken by the Linux operating system can be severe.
Data being in cache is a real plus for Windows NT/2000, too, but it is in no way necessary for good performance. Unlike Linux, the Windows NT/2000 I/O subsystem anticipates that needed data won't be in cache. As a result, Windows NT/2000 follows a strategy of launching volumes of asynchronous I/O requests to go about its processing tasks until the responses come back from the storage devices.
Our OBLdisk benchmark, which reads data sequentially from a disk file in progressively larger block-size requests, reveals two different I/O throughput profiles. To optimize cache utilization, the Linux ext2 file system bundles I/O requests in order to issue large-block read requests. Such requests have the added advantage of triggering large-block look-ahead requests, which serve to populate the cache for likely future hits. With 512MB of RAM in the server and a dedicated 128MB RAM device cache, sequentially reading files from the DuraStor volume (even files up to 256MB in size) with tiny I/O requests still streamed data at full bus speed.
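The OBLdisk source is not reproduced in this article, so the following Python sketch only illustrates the access pattern described above: one file read sequentially from start to finish, once per request size, with the request size doubling on each pass. The names `sequential_read_profile` and `make_test_file`, the 4MB file size, and the range of request sizes are illustrative assumptions. On a small temporary file like this one, the numbers will mostly reflect the operating system's cache rather than the disk.

```python
import os
import tempfile
import time


def sequential_read_profile(path, block_sizes):
    """Read `path` start to finish once per block size.

    Returns {block_size: (bytes_read, elapsed_seconds)}.
    """
    results = {}
    for block_size in block_sizes:
        total = 0
        start = time.perf_counter()
        # buffering=0 disables Python's own buffering, so each read()
        # asks the operating system for up to block_size bytes.
        with open(path, "rb", buffering=0) as f:
            while True:
                chunk = f.read(block_size)
                if not chunk:
                    break
                total += len(chunk)
        results[block_size] = (total, time.perf_counter() - start)
    return results


def make_test_file(size_bytes):
    """Create a temporary file of the given size and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    path = make_test_file(4 * 1024 * 1024)
    try:
        sizes = [2 ** n for n in range(9, 18)]  # 512 bytes up to 128KB
        profile = sequential_read_profile(path, sizes)
        for size, (nbytes, secs) in profile.items():
            rate = nbytes / (1024 * 1024) / secs if secs > 0 else float("inf")
            print(f"{size:>7} bytes/request: {rate:10.1f} MB/s")
    finally:
        os.remove(path)
```

A flat curve across request sizes would correspond to the Linux profile described here, while a curve that climbs with request size would correspond to the Windows 2000 profile.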
In contrast, streaming throughput performance under Windows follows a fast ramp-up curve as I/O request sizes get larger. The Windows 2000 operating system does not intervene for an application that is making small I/O requests. To reap the advantages of large-block reads under Windows 2000, the application must explicitly issue large-block reads.
The Windows I/O processing strategy does have a decided advantage in a high transaction-processing environment. Here the asynchronous approach pays major dividends. In a database-driven application with hundreds of simultaneous users, the I/O pattern is made up of a complex mix of localized high-activity areas such as index tables and essentially random access over the remaining areas of the disk. In such a scenario, robust asynchronous I/O is essential so as not to be held hostage by localized caching performance. This is currently the one really bright spot for Windows 2000 in any benchmark comparison with Linux. One of the hot areas of Linux kernel development is the effort to dramatically improve asynchronous I/O.
In the OBLdisk benchmark, write performance was consistently higher on all of the Linux journaling file systems than on ext2FS. Read performance was slightly lower. As expected, with multiple threads, total read throughput increased while total write throughput declined. Of particular interest, performance differences were most evident with ext3FS. More importantly, the results of our OBLdisk benchmark were consistent with our file-structure copy test in all cases except ext3FS, which was slower than ext2FS.
Until these changes are implemented, Linux transaction processing will remain bound by the speed of cache hits. This puts a double whammy on the Adaptec DuraStor configuration. In previous InfoStor Labs tests using PCI-based RAID controllers, the size of cache and system memory proved to be dominant variables in the performance equation. In the case of the DuraStor RAID appliance and the recently tested HP NetRAID-2M controller, system memory and cache size were identical. The big difference between the two is that the HP NetRAID-2M sits internally on a 64-bit PCI bus while the DuraStor RAID controller is at the end of an Ultra160 SCSI bus.
This is a relatively insignificant hurdle for Windows 2000, which blithely went on fulfilling 3,500 8KB I/O requests per second. For Linux, however, this configuration proved a major stumbling block. Throughput fell from 1,500 I/Os per second using the HP NetRAID-2M card to 500 I/Os per second on the DuraStor. This of course raises the question of just how many applications need to process more than 500 I/Os per second.
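To put these I/O rates in terms of bandwidth, the short Python sketch below converts I/Os per second into MB/s. The 8KB request size is stated above only for the Windows 2000 figure; applying it to the two Linux figures is an assumption, and the helper name `iops_to_mb_per_s` is purely illustrative.

```python
def iops_to_mb_per_s(iops, request_kb=8):
    """Convert an I/O rate to bandwidth in MB/s (1 MB = 1024 KB)."""
    return iops * request_kb / 1024


# I/O rates quoted in the text; the 8KB request size for the Linux
# rows is an assumption carried over from the Windows 2000 figure.
rates = {
    "Windows 2000 on DuraStor": 3500,
    "Linux on HP NetRAID-2M": 1500,
    "Linux on DuraStor": 500,
}

for label, iops in rates.items():
    print(f"{label}: {iops} I/Os per second = {iops_to_mb_per_s(iops):.1f} MB/s")
```

Under that assumption the three rates work out to roughly 27MB/s, 12MB/s, and 4MB/s. All three are far below the nominal 160MB/s of an Ultra160 SCSI bus, which suggests that the cost on Linux is per-request delay rather than raw bus bandwidth.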
The importance of the DuraStor RAID appliance lies in its capabilities for supporting large numbers of disks, configuration flexibility for supporting numerous variations in hardware redundancy, and dual porting for use in high-availability clusters.
The importance of the role that caching plays in a Linux environment for overall system performance in general, and file-system performance in particular, makes file-system architecture an important aspect of any Linux distribution. This is reflected in the interest being paid to the performance of the journaling file systems that are now included in the latest distributions of Linux. SuSE 7.3 includes three of these file systems: ReiserFS, JFS, and ext3FS.
The new file systems are designed to provide more-robust file structures through the introduction of journaling techniques pioneered in high-end relational database systems. To understand the significance of these new file systems, it is first necessary to understand the problem they are designed to solve. Traditional Unix file systems date back to when disk storage was very costly. As a result, they were designed primarily to minimize wasted disk blocks. To minimize the number of empty disk blocks associated with a file, these file systems allocated a minimal number of disk blocks to a file at creation. When these blocks were fully utilized, the file system continued to add new blocks in minimal amounts as necessary.
Naturally, such an allocation scheme only serves to fragment files into scattered clusters of disk blocks. This makes it essential to have additional metadata that describes the attributes of files and maps the scattered physical disk blocks onto the sequential logical blocks of a file. A single logical write will therefore typically involve multiple physical writes of metadata. A system crash in the midst of one of these multiple writes can easily leave the system in a corrupted state. The solution to this problem has been the dreaded fsck utility, which scavenges all of a volume's metadata to restore its structure to a consistent state in what can be a very time-consuming process.
The new journaling file systems borrow constructs from high-end transaction-processing databases to simplify the task of maintaining structural consistency. These file systems log operations performed on the file system's metadata as atomic transactions. In the event of a system failure, simply replaying a finite set of log records that represent all of the file-system updates since the last file-system checkpoint restores the entire volume to a consistent state. In many instances, these metadata-only transaction logs actually reduce the total amount of data involved in a write operation.
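The recovery idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the on-disk format of ReiserFS, JFS, or ext3FS: metadata is a plain dictionary, a transaction is a group of key/value updates, and the class name `ToyJournaledMetadata` and the sample keys are hypothetical.

```python
class ToyJournaledMetadata:
    """Toy model: metadata is a dict; the journal is a list of transactions."""

    def __init__(self):
        self.on_disk = {}   # metadata state as of the last checkpoint
        self.journal = []   # transactions committed since that checkpoint

    def commit(self, updates):
        """Log a group of metadata updates as one all-or-nothing record."""
        self.journal.append(dict(updates))

    def checkpoint(self):
        """Apply every logged transaction to the on-disk state, then clear the log."""
        for txn in self.journal:
            self.on_disk.update(txn)
        self.journal.clear()

    def recover(self):
        """After a crash, replay the log on top of the last checkpoint."""
        state = dict(self.on_disk)
        for txn in self.journal:
            state.update(txn)
        return state


fs = ToyJournaledMetadata()
fs.commit({"inode 12 size": 4096, "block 900 owner": 12})
fs.checkpoint()
fs.commit({"inode 12 size": 8192, "block 901 owner": 12})
# Imagine a crash here, before the next checkpoint.
print(fs.recover())
```

Recovery touches only the records logged since the last checkpoint, which is why it is bounded in a way that a full fsck scan of a volume's metadata is not.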
In addition, these file systems extend a number of I/O enhancement techniques introduced with the current de facto file system for Linux: ext2FS. By the time Linux was introduced, the explosion in disk capacity was well under way. The ext2 file system therefore attempts to maximize performance rather than minimize disk space. To do this, ext2 was modified to delay issuing writes so that they can be bundled whenever possible into more-efficient large-block operations.
This construct has evolved further in the new file systems, which automatically allocate new disk blocks to files in large-block extents. Extents also serve to speed reads. When a read request is issued, that request can be expanded to put the entire extent in cache on the reasonable assumption that all of the data will eventually be accessed. The net result is that these new file systems should often deliver faster writes than ext2FS and equivalent reads.
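To make the notion of an extent concrete, the Python sketch below collapses a block-by-block map into runs of contiguous blocks and then looks up a logical block through those runs. It is a simplified illustration only; the `Extent` record, the function names, and the block numbers are hypothetical and do not reflect the actual data structures of any of the file systems tested.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Extent(NamedTuple):
    logical_start: int   # first logical block covered by this extent
    physical_start: int  # physical block where the contiguous run begins
    length: int          # number of contiguous blocks in the run


def build_extents(physical_blocks: List[int]) -> List[Extent]:
    """Collapse a per-block map (list index = logical block) into extents."""
    extents: List[Extent] = []
    for logical, physical in enumerate(physical_blocks):
        last = extents[-1] if extents else None
        if last is not None and physical == last.physical_start + last.length:
            # This block continues the current run, so grow the extent.
            extents[-1] = last._replace(length=last.length + 1)
        else:
            extents.append(Extent(logical, physical, 1))
    return extents


def lookup(extents: List[Extent], logical: int) -> int:
    """Translate a logical block number into its physical block number."""
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical) - 1
    if i < 0:
        raise IndexError("logical block out of range")
    e = extents[i]
    if logical >= e.logical_start + e.length:
        raise IndexError("logical block out of range")
    return e.physical_start + (logical - e.logical_start)


block_map = [100, 101, 102, 103, 500, 501, 900]
extents = build_extents(block_map)
print(extents)
print(lookup(extents, 5))
```

In this example seven per-block entries shrink to three extent records, and a read that lands anywhere in the first extent could fetch all four of its blocks in one large request.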
In our first look at the new file systems, we used our OBLdisk benchmark. To minimize the effects of caching, we wrote to and read from a 512MB file with a single thread and with multiple threads accessing the same file at different locations. We then followed up by copying a file hierarchy with 14,648 files in 1,645 folders representing 3GB of data. We used the system defaults for each of the file systems and did not attempt any manual tuning in this first analysis.
For the most part, the test results went according to Hoyle. In the OBLdisk benchmark, write performance was consistently higher on the journaling file systems than on ext2FS. Read performance was slightly lower. As expected, with multiple threads, total read throughput increased while total write throughput declined. Performance differences were most evident with ext3FS, which can be considered a superset of ext2FS.
One of the unique advantages of this file system is the ability to easily upgrade an existing volume in place. In our tests we used both the default configuration, which logs only metadata, and the fstab option data=journal, which logs both the metadata and the physical data changes to the file system. The latter option slowed writes to a fraction of ext2FS performance.
The most intriguing results, however, were reserved for the file-structure copy test. Consistent with the OBLdisk benchmark results, the copy operation was faster on ReiserFS and JFS (135 seconds in both cases) than on ext2FS (147 seconds). Directory copy performance on the ext3FS volume, however, was consistently slower (170 seconds). With both metadata and real data journaling, the time to copy jumped to 203 seconds.
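For readers who prefer percentages, the following Python sketch expresses each copy time relative to the ext2FS baseline. The timings are the ones reported above; the dictionary labels and the helper name `relative_to_baseline` are illustrative.

```python
# Copy times in seconds, as reported in the text.
copy_seconds = {
    "ext2FS": 147,
    "ReiserFS": 135,
    "JFS": 135,
    "ext3FS (metadata journaling)": 170,
    "ext3FS (data=journal)": 203,
}


def relative_to_baseline(times, baseline="ext2FS"):
    """Return each time as a percentage change from the baseline time."""
    base = times[baseline]
    return {name: (secs - base) / base * 100 for name, secs in times.items()}


for name, pct in relative_to_baseline(copy_seconds).items():
    print(f"{name}: {pct:+.1f}% copy time vs. ext2FS")
```

By this measure ReiserFS and JFS finished the copy in about 8% less time than ext2FS, while ext3FS took about 16% longer with its default metadata journaling and about 38% longer with data=journal.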
We'll be following up on these tests with more-detailed examinations of tuning options in upcoming reviews.
Jack Fegreus can be contacted at firstname.lastname@example.org.