PCI-X and PCI Express versions of QLogic’s 4Gbps SANblade HBAs yield impressive performance results.
By Jack Fegreus
Fibre Channel SANs have evolved from 1Gbps to 2Gbps and now to 4Gbps. Higher performance is always good, particularly when it doesn’t come with a price premium, as is the case with Fibre Channel. However, the relatively rapid evolution in speeds places a burden on SAN administrators, who must manage multiple generations of switches, host bus adapters (HBAs), and disk arrays. That burden contradicts the original rationale for acquiring a SAN as a means to cut management overhead and costs.
This makes it particularly important to deploy a scalable HBA architecture that can address a full spectrum of issues, including cost, performance, backward-compatibility, and future scalability. Backward-compatibility of 4Gbps Fibre Channel infrastructure with 2Gbps and 1Gbps equipment gives SAN administrators the freedom to align costs with performance based on business metrics.
Given the importance of performance and scalability in SAN environments, openBench Labs recently tested QLogic’s 4Gbps PCI-X and PCI Express (PCIe) SANblade HBAs in terms of their ability to scale along both I/O dimensions: the volume of data as measured in megabytes per second (MBps) and the load of data requests as measured in I/Os per second (IOPS). In particular, we tested the performance of the SANblade QLE2462 for PCIe and the SANblade QLA2462 for PCI-X.
The HBAs were installed on an IBM eServer xSeries 366 running Microsoft Windows Server 2003. The IBM server featured a single 3.6GHz Xeon processor and support for the PCI-X 2.0 specification, which double-clocks data transfers to and from the server. Double-clocking makes a 133MHz HBA function like a 266MHz card, which is critical for active-active dual-port performance.
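The reasoning behind that claim can be sketched with some back-of-the-envelope arithmetic. The Python sketch below assumes a 64-bit PCI-X data path and a nominal 400MBps per direction for each 4Gbps Fibre Channel port; both figures are assumptions for illustration rather than measurements, and the function names are hypothetical.

```python
# Rough peak-bandwidth comparison: PCI-X bus vs. dual-port 4Gbps Fibre Channel.
# Assumptions (not measured values): 64-bit bus, 400 MB/s per FC port per direction.

BUS_WIDTH_BYTES = 8          # assumed 64-bit PCI-X data path
FC_PORT_MBPS = 400           # assumed nominal 4Gbps FC rate, one direction


def pcix_peak_mbps(clock_mhz: float) -> float:
    """Theoretical peak bus bandwidth in MB/s at the given effective clock."""
    return clock_mhz * BUS_WIDTH_BYTES


def fc_demand_mbps(ports: int, full_duplex: bool = True) -> int:
    """Aggregate bandwidth the HBA could ask of the bus."""
    directions = 2 if full_duplex else 1
    return ports * directions * FC_PORT_MBPS


if __name__ == "__main__":
    for label, clock in (("133MHz", 133), ("double-clocked 266MHz", 266)):
        peak = pcix_peak_mbps(clock)
        for ports in (1, 2):
            need = fc_demand_mbps(ports)
            verdict = "fits" if need <= peak else "exceeds bus"
            print(f"PCI-X {label}: peak {peak:.0f} MB/s, "
                  f"{ports} port(s) need {need} MB/s -> {verdict}")
```

Under these assumptions, two active full-duplex ports could ask for as much as 1,600MBps, which is more than a 133MHz bus can theoretically deliver but within the reach of a double-clocked one.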
We managed our SAN fabric via a stackable 16-port QLogic SANbox 5602 switch, which comes bundled with QLogic’s SANsurfer Switch Manager software. Using the SANsurfer configuration wizards, we easily added the SANbox 5602 to an existing SAN fabric based on SANbox 5200 switches. We monitored I/O throughput at the switch’s ports that were connected to the 4Gbps Fibre Channel HBAs. From the port perspective, “bytes transmitted” correspond to bytes requested by a server read command and “bytes received” correspond to bytes sent by a write command.
With disk array throughput the critical factor in stressing the server’s HBA, we used a RamSan-400 solid-state disk (SSD) subsystem from Texas Memory Systems. With no rotational latency, an SSD exhibits no differences when accessing data randomly or sequentially, which simplifies modeling I/O profiles of applications such as Exchange.
To avoid any overhead associated with file structures, logical drives were created on the RamSan-400 as raw devices. These logical drives were then distributed across the four Fibre Channel interfaces on the RamSan-400. With such a configuration, the RamSan-400 could sustain a load of 400,000 IOPS and a throughput of 3GBps. That easily exceeds the maximum sustainable throughput level of a single HBA.
We used the Iometer benchmark to set up and manage full-duplex I/O streams associated with each test. These streams were equally divided between reads and writes. Each test followed a sequence of 12 iterations, over which every I/O process continuously read or wrote data for a fixed time period in a block size double that of the previous iteration. Block sizes started at 512 bytes and rose to 1,024KB. Over the test’s duration, we measured throughput scalability (bytes per second) and load scalability (operations per second).
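The block-size sweep is easy to reproduce. The following Python sketch simply enumerates the 12 doubling steps described above; it is an illustration of the sequence, not the Iometer configuration itself, and the helper names are hypothetical.

```python
# Block-size sweep: 12 iterations, each double the previous, starting at 512 bytes.

START_BYTES = 512
ITERATIONS = 12


def block_sizes(start: int = START_BYTES, count: int = ITERATIONS) -> list[int]:
    """Return the block size in bytes for each iteration of the sweep."""
    return [start << i for i in range(count)]


def label(size_bytes: int) -> str:
    """Format a size as bytes below 1KB, otherwise as whole kilobytes."""
    if size_bytes < 1024:
        return f"{size_bytes} bytes"
    return f"{size_bytes // 1024:,}KB"


if __name__ == "__main__":
    print(", ".join(label(s) for s in block_sizes()))
```

Eleven doublings of 512 bytes give 1,048,576 bytes, which is the 1,024KB upper bound of the sweep.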
Working with an SSD, we were able to scale both the volume of data and the load of data operations by increasing the number of I/O processes and the number of logical drives each process accessed. Specifically, we repeated each HBA stress test four times while doubling the number of processes on each run. Starting with one process reading and one process writing, we repeated the tests with four, eight, and 16 processes.
We followed this testing scheme with both PCI-X and PCIe HBAs using single-port and dual-port connections. While an active-active, dual-port configuration offers the greatest throughput, many sites prefer an active-passive configuration with a single active port to ensure no performance degradation in a fail-over.
Over the full range of block sizes, the test results have different meanings for different operating systems. Unix and Linux began in a synchronous I/O world. As a result, they bundle small block I/Os into large block requests (128KB for Linux) as an optimization strategy, so performance at the upper end of the I/O block spectrum will be of paramount concern. Windows, on the other hand, began as being asynchronous I/O-centric, which dramatically changes I/O optimization. Key to asynchronous I/O is a good scatter-gather strategy, which means firing off multiple I/O requests and reassembling their results. Windows does not bundle small requests into larger ones; it defaults to a maximum I/O block size of 64KB. This means Windows can concentrate on avoiding wasted disk space, which occurs when disk space is allocated and filled with empty data blocks. As a result, Windows Server applications cluster I/O around four critical block sizes: 4KB, 8KB, 32KB, and 64KB.
Windows uses small block I/O to avoid over-allocating disk space. Since e-mail is characterized by a preponderance of short messages, Exchange uses an embedded Jet database to store messages using 4KB data blocks to minimize the mailbox footprint. Most applications, however, default to 8KB blocks. On the other hand, a number of performance-sensitive file system operations use 32KB data blocks. Server applications such as backup, logging, and SQL Server OLAP need to move larger amounts of data and often use 64KB blocks.
After completing the stress tests, we generated a more complex set of data streams in a pattern that simulated I/O for a heavily used server running Microsoft Exchange. Using a heuristic developed by Microsoft, our I/O model had two components: message traffic (90%) and transaction log traffic (10%). Message traffic changes its complexion with respect to reads and writes as e-mail traffic scales from light to heavy.
In a low-traffic environment, as much as 90% of message I/Os will be associated with reading archived messages. In a heavy-use scenario, however, a significant increase in sending and receiving new e-mail messages can skew the read-write ratio to 67% reads and 33% writes. This shift toward a more balanced I/O profile makes full-duplex I/O performance all the more important for a server hosting Exchange.
To represent this model, we structured Iometer around 20 processes, each having its own logical drive. To represent message I/O, we launched 12 processes to read 4KB blocks and six processes to write 4KB blocks. To represent transaction log I/O, we launched two processes to write data in 64KB blocks. We then ran this pattern in a steady state for a fixed time period with the QLogic PCI-X HBAs. Over that time period, we measured the total byte traffic and the total I/O operations processed by the server.
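The process counts follow directly from the ratios in the model. The Python sketch below derives them from the 90/10 split between message and log traffic and the heavy-use 67/33 read-write ratio; the function and key names are hypothetical and are not part of Iometer.

```python
# Derive the process mix from the ratios quoted in the text.
# Function and key names are illustrative only, not an Iometer interface.

def exchange_process_mix(total: int = 20,
                         message_share: float = 0.90,
                         read_share: float = 2 / 3) -> dict[str, int]:
    """Split `total` processes into message readers, message writers, and log writers."""
    message = round(total * message_share)   # 90% message traffic
    readers = round(message * read_share)    # heavy use: about 67% reads
    writers = message - readers              # remaining 33% writes
    log_writers = total - message            # 10% transaction-log traffic
    return {
        "read_4KB": readers,
        "write_4KB": writers,
        "write_64KB_log": log_writers,
    }


if __name__ == "__main__":
    print(exchange_process_mix())
```

With the defaults, the split works out to 12 readers, six writers, and two log writers, matching the 20 processes used in the test.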
We began by running our stress tests using PCI-X HBAs in a server that supported the PCI-X 1.0 spec. On this server we ran with one active Fibre Channel port in an active-passive configuration. This will likely be a common configuration at sites upgrading to a 4Gbps SAN. When we repeated the tests in a server supporting PCI-X 2.0, dual-port active-active test results were precisely twice the result measured using a single port on a PCI-X 1.0 server.
Since the I/O engines on the PCI-X and PCIe HBAs are identical, throughput differences are the result of differences in the PCI interface, which are significant. PCI Express replaces the underlying shared-bus architecture of the older PCI-X with a switch-based architecture. To take advantage of that difference, however, multiple devices must be installed in the system. With only one HBA in each of our test systems, differences between the PCI-X and PCIe cards were minimal.
We started our base performance test with two simultaneous I/O processes: one reading data and one writing data. With those two I/O processes running, both QLogic SANblade HBAs performed precisely as their specifications project. As we increased the number of bytes transferred in read-and-write operations with Iometer, total I/O throughput climbed to about 790MBps. The two streams reading and writing data tracked very closely as each scaled to the near-wire speed of 395MBps.
Next, we repeated this test and increased the number of read-and-write processes from one to two, and then to eight (two, four, and 16 total processes). As in our base test, the SANblade QLA2462 and QLE2462 scaled to the same near-wire-speed limits. As the number of read-and-write processes increased, the IOPS rate for 4KB data requests scaled from 48,000 to 78,000, and then to 103,000 on the QLA2462. At the same time, throughput scaled from 188MBps to 305MBps, and then to 404MBps.
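The IOPS and throughput figures are two views of the same measurement, linked by the block size. The Python sketch below shows the conversion for 4KB requests. It assumes that MBps means 2^20 bytes per second, a convention the text does not state but which tracks the reported numbers closely.

```python
# Convert an IOPS rate at a fixed block size into throughput.
# Assumption: "MBps" means 2**20 bytes per second (not stated in the text).

BLOCK_BYTES = 4 * 1024       # 4KB requests
MB = 2 ** 20                 # assumed binary megabyte


def throughput_mbps(iops: float, block_bytes: int = BLOCK_BYTES) -> float:
    """Throughput in MB/s implied by an IOPS rate and a block size."""
    return iops * block_bytes / MB


if __name__ == "__main__":
    for iops in (48_000, 78_000, 103_000):   # rounded IOPS figures quoted above
        print(f"{iops:,} IOPS x 4KB = {throughput_mbps(iops):.1f} MB/s")
```

The rounded IOPS figures convert to roughly 187.5, 304.7, and 402.3MBps, in line with the measured 188MBps, 305MBps, and 404MBps; the small gap at the top end is consistent with the IOPS values having been rounded.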
Following our PCI-X and PCIe tests of an active-passive configuration, we then tested an active-active configuration with two live ports on each PCI-X HBA. To prevent the system’s PCI bus from becoming the prime bottleneck, these tests were conducted on the IBM eServer xSeries 366. This server supports version 2.0 of the PCI-X specification, which provides for double-clocking of the card. As a result, our 133MHz cards were now effectively working at 266MHz.
In theory, a double-clocked active-active HBA configuration should provide twice the throughput measured with a single-port, active-passive configuration. That’s precisely what we measured when we doubled the number of I/O processes to 32 and dedicated 16 processes to each port to generate full-duplex I/O. More importantly for both HBAs, the dual-port performance is a 2x mirror image of the single-port performance with 16 processes, including the throughput anomalies. From these results it is clear that the ports are well-isolated and the scaling is completely dependent upon the PCI-X 2.0 implementation on the server.
We completed the testing by running our MS Exchange I/O load profile on the QLA2462. Results from the previous HBA stress tests pointed to excellent scaling of 4KB reads, 4KB writes, and 64KB writes, which both Exchange and MS SQL Server use to minimize I/O operation overhead associated with internal logging.
Our results modeling a large-scale Exchange environment are of particular importance with regard to server resource utilization:
- Sustained 68,971 transactions per second;
- Average I/O response time was 0.289 seconds;
- Read throughput was sustained at 167MBps;
- Write throughput was sustained at 352MBps; and
- HBA utilization was pegged at 66%.
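As a rough cross-check, the utilization figure is consistent with the combined read and write throughput taken as a fraction of the roughly 790MBps full-duplex ceiling seen in the stress tests. The Python sketch below works through that arithmetic. The formula is an assumption made for illustration; the measurement tool may define utilization differently.

```python
# Cross-check: combined Exchange throughput relative to the full-duplex ceiling.
# Assumption: utilization = (read + write throughput) / peak full-duplex throughput.

READ_MBPS = 167              # sustained read throughput from the Exchange run
WRITE_MBPS = 352             # sustained write throughput from the Exchange run
PEAK_FULL_DUPLEX_MBPS = 790  # approximate ceiling from the stress tests


def utilization(read_mbps: float, write_mbps: float, peak_mbps: float) -> float:
    """Fraction of peak throughput consumed by the combined read and write streams."""
    return (read_mbps + write_mbps) / peak_mbps


if __name__ == "__main__":
    share = utilization(READ_MBPS, WRITE_MBPS, PEAK_FULL_DUPLEX_MBPS)
    print(f"Combined throughput is {share:.0%} of the full-duplex ceiling")
```

Under that assumption, 519MBps of combined traffic against a 790MBps ceiling comes to about 66%.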
With server virtualization a leading strategy for improving IT resource utilization, 4Gbps HBAs can prove to be critical resources. With many virtual servers sharing a few physical storage resources, those resources need to be highly scalable. Your mileage may vary, but it’s clear that if you have the need for speed, then 4Gbps, the next generation of Fibre Channel, is the way to go.
Jack Fegreus is CTO at Strategic Communications (www.stratcomm.com). He can be reached at email@example.com.
openBench Labs Scenario
4Gbps Fibre Channel SAN
WHAT WE TESTED
QLogic SANblade QLA2462 (PCI-X 2.0) and QLE2462 (PCI Express) HBAs
- Dual 4/2/1Gbps Fibre Channel interfaces with auto-negotiation
- HBA and target-level fail-over
- Persistent binding
- LUN masking
QLogic SANbox 5602 Fibre Channel switch
- Stacking of up to four switches via 10Gbps ISL ports
- 16 auto-detecting ports (4Gbps/2Gbps/1Gbps)
- 4-port license increments
- Dual hot-swap power supplies
HOW WE TESTED
Texas Memory Systems RamSan-400 SSD
- Four dual-port 4Gbps Fibre Channel connections
- Up to 400,000 IOPS
- Up to 3GBps throughput
IBM eServer xSeries 366
- 3.6GHz Intel Xeon CPU
- PCI-X 2.0
- Microsoft Windows Server 2003
KEY FINDINGS
- With one read and one write process, full-duplex performance scaled to near-wire speed (400MBps) for both reads and writes using the QLA2462 and QLE2462 HBAs.
- Adding read-and-write processes increased read-and-write IOPS; throughput scaled to near-wire speed at smaller block sizes and, more importantly, remained at that level for larger block sizes using the QLA2462 and QLE2462 HBAs.
- In a simulation of a heavily trafficked Exchange Server, the QLA2462 sustained 68,971 transactions per second with an average response time of less than 0.3 seconds.