Sporting a native (uncompressed) throughput of 80MBps, an LTO-3 tape drive can be either a blazing backup device or a spotlight on performance bottlenecks in the other I/O subsystems attached to the server. With this level of performance comes an important caveat: To sustain native throughput of 80MBps, the source of the data directed at the drive must also sustain an overall I/O throughput rate of 80MBps.
That’s a fairly tall order. I/O specs for disk drives are invariably derived from aggressive asynchronous scatter-gather reads that bypass the file system. That benchmark technique reveals a lot about a disk subsystem, but virtually nothing about single-threaded file-by-file I/O, which is how a backup program works.
Case in point: Using a four-drive RAID-0 array in our SAN, we ran our asynchronous I/O benchmark and a file system benchmark. To avoid issues of caching by Windows, we measured I/O at the SAN switch as it came from the array. Our asynchronous I/O benchmark pegged a single LUN at 189MBps. We then ran a file system benchmark that opened and read a series of files. Throughput off the same LUN measured just 73MBps. At best, that level of file-by-file throughput translates into backup throughput of about 60MBps.
For this lab review, we tested Hewlett-Packard’s StorageWorks Ultrium 960 LTO-3 tape drive. LTO-3 provides a huge jump in performance over LTO-2, but in other ways HP’s Ultrium 960 is an evolutionary step forward from HP’s Ultrium 460. In fact, the Ultrium 960 is both read- and write-compatible with LTO-2 cartridges.
LTO-3 cartridges are also available in write-once, read-many (WORM) format. WORM functionality is implemented by a fail-safe mechanism that prevents anyone from tampering with WORM media without detection. A unique identifier is written to the 4K cartridge memory embedded within the media cartridge. That identifier triggers unique servo code for WORM media in the drive, which prohibits overwriting existing data. Since there is no difference in tape format, throughput is identical with either a WORM or a normal read/write LTO cartridge.
At the heart of the Ultrium 960 is the same Data Rate Matching (DRM) circuitry that has provided continuous dynamic support for adjusting linear tape speed since the introduction of the first generation of LTO drives. The idea is to match the speed of the tape to the speed of data coming from the drive’s internal data buffers to keep the drive streaming for optimal performance.
Plotting raw data throughput measured with our benchmark (left) gives a macro view of relative performance. Normalizing the data (right) reveals distinctive differences in drive characteristics between the LTO-3 and SDLT 600.
The difference in throughput for the Ultrium 960 comes from two design changes: 50% higher bit density and doubling the number of recording heads from 8 to 16. That makes the Ultrium 960 capable of writing uncompressed data at a maximum 80MBps compared to 30MBps for the Ultrium 460. Nonetheless, that’s only half the story. DRM effectively makes the native throughput rate of an LTO-3 drive variable. Those adjustments in tape speed vary the effective native throughput rate for uncompressed data from 27MBps to 80MBps. As a result, optimizing throughput performance is significantly more complex than for any previous generation of LTO drive.
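The effect of DRM on effective native throughput can be sketched as a simple clamp: the drive matches tape speed to the incoming data rate as long as that rate falls within the supported span. A toy Python model follows; the 27MBps-to-80MBps span is the LTO-3 figure quoted above, while the linear clamp is our simplification of the real servo behavior.

```python
def drm_tape_rate(host_mbps, min_rate=27.0, max_rate=80.0):
    """Toy model of Data Rate Matching (DRM).

    The drive adjusts linear tape speed so its native write rate tracks
    the rate of data arriving in its buffers, within the supported span.
    The 27-80MBps range is the LTO-3 figure from the article; the clamp
    is our simplification, not HP's servo algorithm.
    """
    return max(min_rate, min(max_rate, host_mbps))

# A host feeding 55MBps keeps the tape streaming at 55MBps instead of
# forcing halt/reposition cycles at the full 80MBps rate.
print(drm_tape_rate(55.0))   # 55.0
print(drm_tape_rate(10.0))   # 27.0 -- drive cannot go slower
print(drm_tape_rate(120.0))  # 80.0 -- drive cannot go faster
```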
We designed a test configuration to reflect a midrange IT site with storage requirements of about 1TB. With this volume of data, it is almost certain that a site will have moved beyond JBOD disk configurations and implemented direct-attached, hardware-based RAID arrays. Furthermore, as the total volume of data approaches 1TB, IT can take considerable control over storage management costs by implementing a basic SAN.
We therefore set up a test scenario using a typical midrange server: a dual-processor (2.8GHz Intel Xeon) HP ML350 G3 running Windows Server 2003. We started testing with 512MB of RAM and incrementally increased this to 2GB. We attached the server to an entry-level SAN via a QLogic 2342 host bus adapter (HBA). Using an nStor 4520 Storage Server, we provisioned three RAID-0 arrays, each created with four Seagate 15K Barracuda Fibre Channel drives.
Next, we mapped one LUN from each RAID-0 array to our test server. For our backup tests, we placed 10GB of data in a mix of Microsoft Office files along with a mix of HTML and image files from Websites on each LUN.
For this assessment, we compared the performance of the StorageWorks Ultrium 960 LTO-3 drive to the Ultrium 460, as well as Quantum's SDLT 600. We attached all three tape drives via an HP 64-bit/133MHz dual-channel Ultra320 SCSI adapter, an OEM version of the LSI Logic LSI22320. This controller is built on LSI's Fusion-MPT architecture, which supports 256KB I/O transfers. On Windows Server, enabling transfers that large requires changing the MaximumSGList registry parameter. This adjustment turns out to be very important for getting maximum performance from the LTO-3 tape drive.
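For reference, the registry change involved looks something like the following sketch. This is hypothetical: the service key name depends on the SCSI miniport driver actually installed (shown here as symmpi, a common name for LSI Fusion-MPT drivers), and the value follows the usual formula of (largest transfer in bytes / 4,096) + 1, which works out to 65 (0x41) for 256KB transfers. Verify the correct key for your adapter before applying anything.

```ini
Windows Registry Editor Version 5.00

; Hypothetical .reg sketch -- the service name ("symmpi" here) depends
; on the installed SCSI miniport driver.
; MaximumSGList = (256KB / 4KB) + 1 = 65 = 0x41
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\symmpi\Parameters\Device]
"MaximumSGList"=dword:00000041
```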
An LTO-3 drive is very susceptible to the impact of any processing overhead. This starts with the formatting of data blocks before any data is written to tape. On Windows, the default maximum block size for I/O transfers is 64KB. Formatting a tape with 64KB data blocks on the Ultrium 960, however, costs about 7% in lost throughput performance. In our tests, native (uncompressed) throughput for a tape formatted at 64KB averaged 69MBps. Increasing the block size to 128KB raised throughput to 72MBps. At a block size of 256KB, throughput averaged 74MBps.
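As a quick sanity check on that roughly 7% figure, using the throughput numbers reported above:

```python
# Native throughput we measured at each tape block size (MBps), as
# reported above; the penalty is computed relative to the 256KB rate.
measured = {64: 69, 128: 72, 256: 74}
best = measured[256]
for kb in sorted(measured):
    penalty = 100.0 * (best - measured[kb]) / best
    print(f"{kb:>3}KB blocks: {measured[kb]}MBps ({penalty:.1f}% below 256KB)")
```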
To test throughput on the tape drives, we used our openBench Labs tape benchmark, obltape. The benchmark generates two types of data: purely random and patterned, which is generated from a fixed set of characters in a distribution tuned to provide a 2-to-1 compression ratio under the Digital Lempel-Ziv (DLZ) algorithm. All data is streamed directly to the device from memory to avoid any disk bandwidth issues.
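A toy stand-in for obltape's patterned stream (not the actual tool): drawing bytes uniformly from a 16-symbol alphabet yields about 4 bits of entropy per byte, which a Lempel-Ziv compressor squeezes roughly 2-to-1. Here zlib's deflate stands in for the drive's DLZ hardware.

```python
import random
import zlib

def patterned_block(size, alphabet_bits=4, seed=0):
    """Generate pseudo-random bytes drawn from a small symbol set.

    A 16-symbol alphabet carries ~4 bits of entropy per byte, so a
    Lempel-Ziv compressor should achieve roughly a 2:1 ratio.
    """
    rng = random.Random(seed)
    symbols = bytes(range(1 << alphabet_bits))
    return bytes(rng.choices(symbols, k=size))

block = patterned_block(1 << 20)  # 1MB of patterned data
ratio = len(block) / len(zlib.compress(block))
print(f"compression ratio ~ {ratio:.2f}:1")
```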
The data can be streamed in block sizes of 2^n KB, where n ranges from 0 to 7. This simulates the differences in the way backup applications read data off a disk drive. In particular, high-end backup applications tend to use 64KB reads when reading from disk on Windows and often can use data blocks as large as 256KB when writing to tape.
Both patterned and random data can be intermixed in the same stream to simulate the variances in compressibility from file to file, which is typical of a backup operation. In this way, the benchmark can be used to model a major problem faced by drive electronics: how to keep the drive from halting and having to reposition the tape during a backup operation.
The variation measured between the upper and lower throughput boundaries can often be on the order of 3:1. That variation is reflected in real-world backup performance. The throughput observed when writing data to tape is highly dependent on the characteristics of the data being sent to the drive as well as the ability of the drive’s electronics to handle fluctuations in data compressibility.
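The idea of a heterogeneous stream can be sketched by mixing incompressible and patterned blocks in a chosen proportion. Again, this is a toy model of the technique described above, not obltape itself; it requires Python 3.9+ for Random.randbytes().

```python
import random
import zlib

def mixed_stream(blocks, frac_random, block_size=65536, seed=1):
    """Interleave incompressible (random) and ~2:1-compressible
    (patterned) blocks in a given proportion -- a sketch of the
    heterogeneous data streams described above, not obltape itself.
    """
    rng = random.Random(seed)
    symbols = bytes(range(16))
    out = []
    for _ in range(blocks):
        if rng.random() < frac_random:
            out.append(rng.randbytes(block_size))           # incompressible
        else:
            out.append(bytes(rng.choices(symbols, k=block_size)))
    return b"".join(out)

for frac in (0.0, 0.5, 1.0):
    data = mixed_stream(16, frac)
    ratio = len(data) / len(zlib.compress(data))
    print(f"{int(frac * 100):>3}% random blocks: ratio {ratio:.2f}:1")
```

As the share of incompressible blocks rises, the overall ratio collapses toward 1:1, which is exactly the fluctuation the drive's buffering must absorb.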
Measuring throughput and overhead while running file system benchmarks demonstrates that streaming I/O at a rate of 73MBps from a single LUN and 165MBps from three LUNs simultaneously imposed little overhead on our ProLiant ML350 G3 server.
During normal backup operations, differences in data compressibility from file to file make it more difficult to keep the drive’s buffer full, which makes the drive prone to halting. Also, throughput can degrade when a drive attempts to compress data that is not compressible. In such an event, the drive’s electronics can actually expand a file with ineffective metadata, cause the drive to halt as buffers are emptied, and slow throughput to less than native (uncompressed) performance.
When a drive halts, it must reposition the tape, which has been moving at up to 295 inches per second, before it can resume writing. The time lost while repositioning the tape rapidly adds up and dramatically lowers the average throughput rate of a backup operation. To test the capability of the electronics to keep the drive streaming under various data input conditions, obltape generates heterogeneous streams of data that combine compressible and non-compressible data in varying proportions.
Plotting actual benchmark results provides a good macro view of throughput performance; however, plotting normalized data presents a micro view of relative drive compression handling. In the SDLT 600, Quantum uses digital circuitry in a technique dubbed Digital Data Rate Agent (DDRA). Among its key characteristics, DDRA minimizes command overhead to maximize bandwidth for data. With highly compressible data, this technique gives the SDLT 600 an advantage.
The HP Ultrium drive, by contrast, uses digital and analog circuitry to solve the same problem in an approach dubbed Adaptive Lossless Data Compression (ALDC). One of the unique aspects of ALDC is its use of two compression schemes. The first is LZ1-based and uses a history buffer to achieve data compression. The second is a pass-through scheme designed to pass non-compressible data through in its native form.
To do this, the drive’s circuitry compares the size of a record after compression: If it has expanded, the original record is written to tape. As a result, the drive’s native throughput rate is the baseline for performance, even when attempting to compress an already-compressed file. Throughput for the SDLT 600, on the other hand, degrades significantly with compressed data, such as zip archives.
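The pass-through decision itself is simple to sketch in software. The one-byte scheme flag below is our invention, and zlib stands in for the drive's LZ1 hardware; the real logic lives in the drive's electronics.

```python
import zlib

def aldc_record(record):
    """Sketch of ALDC's pass-through decision: compress the record, but
    write the original when compression would expand it. The one-byte
    scheme flag is a hypothetical format, and zlib stands in for LZ1.
    """
    compressed = zlib.compress(record)
    if len(compressed) >= len(record):
        return b"\x00" + record      # pass-through: data did not compress
    return b"\x01" + compressed      # LZ-compressed record
```

Because the worst case is the original record plus a flag, throughput on already-compressed data never falls below the native rate, which is the behavior described above.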
While super tape drive prices cluster rather closely, throughput performance for super drives now breaks into two distinct strata. The results of our benchmark put the Ultrium 960's native throughput at 74MBps. With highly compressible data, throughput rose to 144MBps. These results make the Ultrium 960 about 110% faster than its closest rival, the Quantum SDLT 600, and give it roughly a 150% performance edge over the previous-generation, LTO-2 Ultrium 460.
Although throughput of 144MBps is very impressive, it was accomplished by streaming data directly from memory, which is hardly reflective of a real-world backup.
As noted earlier, we measured I/O throughput using a single-threaded file-by-file process from a single SAN-based LUN at 73MBps. Only by simultaneously reading from three LUNs representing independent four-drive RAID-0 arrays were we able to scale file system I/O to 165MBps. This is precisely the I/O throughput level needed to keep the Ultrium 960 streaming with data that is nominally compressible at a 2:1 ratio.
While the presence of three disk volumes on a midrange server is nothing extraordinary, accessing three volumes simultaneously during a backup isn't very likely. Many backup packages support parallel backups, but they use parallelism to solve a bottleneck caused by slow tape drives: streaming simultaneous single-threaded backup processes to multiple independent tape drives.
With an LTO-3 tape drive such as the Ultrium 960, the problem is reversed: It takes I/O from multiple disk drives to keep up with one tape drive. Fortunately, HP has long recognized the potential problems the LTO technology road map would introduce. Packaged with the StorageWorks Ultrium 960 is a single-server license for version 5.5 of Data Protector. This single-server license does permit network-based clients, but only if they run the same OS as the central server. Running Linux or Unix clients from a Windows server requires a license upgrade.
Data Protector can launch a separate backup process, or “Disk Agent,” for each logical disk volume in a backup job.
While each agent reads the file data to be backed up on its particular volume, Data Protector’s Media Agent tags the data from each disk agent and interleaves that data into a single data stream that is sent to the tape drive.
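The tag-and-interleave step can be sketched as a round-robin multiplex. The (agent_id, block) tuples here are a hypothetical format; Data Protector's actual on-tape layout differs.

```python
from itertools import zip_longest

def interleave(streams):
    """Round-robin multiplex of blocks from several disk agents into a
    single tagged stream. The (agent_id, block) tuple is a hypothetical
    stand-in for the media agent's real on-tape record format.
    """
    sentinel = object()
    tagged = []
    for row in zip_longest(*streams, fillvalue=sentinel):
        for agent_id, block in enumerate(row):
            if block is not sentinel:   # an agent's stream may end early
                tagged.append((agent_id, block))
    return tagged

print(interleave([["a1", "a2"], ["b1"]]))
# -> [(0, 'a1'), (1, 'b1'), (0, 'a2')]
```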
At our SAN switch, we monitored the combined data traffic generated by disk agents reading from the SAN-based LUNs during a backup job. With our server configured with 512MB of RAM, we launched backup jobs with one, two, and three disk agents.
With one disk agent, throughput frequently exceeded 70MBps, with an average throughput of 60MBps. With two disk agents, throughput reached 115MBps; however, overall backup throughput increased much more modestly to 78MBps. Similarly, three disk agents frequently delivered in excess of 140MBps, while average backup throughput rose to 94MBps.
Examining the combined I/O generated by all of the disk agents, the problem was not in reaching a peak I/O throughput level; rather, it was in maintaining that throughput level over the course of an entire backup. When the three disk agents remained in sync and evenly balanced, throughput from the disk drives kept pace with the tape drive. Unfortunately, the agents tended to fall into a race condition as one or more agents were allocated an unequal share of processing resources by the server. The remaining agents then could not fill the tape drive's buffer fast enough to keep it streaming at full speed, so the DRM circuitry would engage and lower the linear tape speed.
That effectively lowered the throughput rate for uncompressed data, which sets the ceiling for all compressible data. To prevent this situation, we had to increase the amount of memory in our server. Starting from a base configuration of 512MB of RAM, we found that adding 512MB for each disk agent launched in a parallel backup provided sufficient resources to keep the processes in sync. When we configured our server with 2GB of RAM, following our rule of thumb that the server needs a 512MB base plus 512MB more for each I/O process that will be interleaved, we sustained backup throughput of 124MBps, more than twice the throughput of the closest competitor, the SDLT 600.
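Our sizing rule of thumb reduces to a one-liner. The 512MB figures are the ones we measured on our ML350 G3 test configuration; other servers and backup applications will differ.

```python
def backup_ram_mb(disk_agents, base_mb=512, per_agent_mb=512):
    """Rule of thumb from our testing: a 512MB base plus 512MB for each
    disk agent interleaved into a parallel backup job. Figures are
    specific to our ML350 G3 test configuration.
    """
    return base_mb + disk_agents * per_agent_mb

for agents in (1, 2, 3):
    print(f"{agents} disk agent(s): {backup_ram_mb(agents)}MB")
```

Three disk agents thus call for 2,048MB, the 2GB configuration that sustained 124MBps in our tests.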