How To Jumpstart SAN + LAN Convergence

The leading solutions for CIOs struggling to achieve maximum data center efficiency come down to resource utilization, consolidation, and virtualization. Often these solutions involve a full-scale IT transformation that requires re-designing basic data center infrastructure. As a result, SAN and LAN technologies are now in the IT limelight along with the notion of a converged network infrastructure using 10GbE, which goes entirely against the traditional IT strategy of maintaining separate networks to lower infrastructure risks.

Thanks to multi-core processors, virtual operating environment (VOE) servers typically host eight or more VMs, and each VM must be provisioned with storage volumes and logical Gigabit Ethernet NICs. With multiple host servers, the number of potential TCP/IP connections among VMs grows roughly with the square of the VM count, which further stresses Ethernet networking. That networking is essential for IT to leverage features such as VMware VMotion and Distributed Resource Scheduler (DRS), which provide load balancing and high availability for VMs.
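As a rough illustration of that growth, the sketch below counts potential VM-to-VM pairs. It assumes eight VMs per host and that any VM may open a connection to any other VM; both the function name and the all-to-all assumption are illustrative, not measurements from our tests.

```python
def potential_vm_pairs(hosts: int, vms_per_host: int = 8) -> int:
    """Count distinct VM-to-VM pairs, assuming any VM may talk to any other."""
    vms = hosts * vms_per_host
    return vms * (vms - 1) // 2


for hosts in (1, 2, 4, 8):
    print(f"{hosts} host(s): {potential_vm_pairs(hosts)} potential VM pairs")
```

Doubling the number of hosts roughly quadruples the number of potential pairs, which is why network complexity climbs so quickly in a VOE.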

IT has long maintained separate networks to lower infrastructure risks; however, escalating management costs due in part to VOE network complexity are forcing IT to rethink the separation of SAN and LAN topologies. That leaves many IT managers focused on Fibre Channel over Ethernet (FCoE). Nonetheless, FCoE is not synonymous with converged networking. For small to medium-sized enterprises (SMEs), there is a simpler and more immediate strategy to begin leveraging 10GbE in a data center while keeping tight control over costs. We recently examined QLogic’s QLE3142 Intelligent Ethernet Adapter (IEA), which provides a cost-effective way for SMEs to implement VMware’s Virtual Infrastructure with iSCSI and VMotion over 10GbE while leveraging existing resource infrastructure and server technologies.

To offload network overhead and enable hosts to achieve higher throughput, greater scalability, and better application performance with lower CPU utilization, QLogic’s family of IEAs is built on a programmable architecture. This architecture allows QLogic to optimize performance and manageability by combining hardware offload for performance with host stack control of connections. As a result, openBench Labs could immediately take Intelligent Ethernet Adapters from box to host in an environment with servers running the latest VMware hypervisors or the latest Windows Server OS.

Provisioning for high I/O throughput

With server virtualization the major driver of iSCSI adoption by IT departments, openBench Labs set up a 10GbE test scenario using a converged iSCSI SAN and LAN that extended an existing SAN fabric into a VOE. To implement this test plan, we set up two servers (one running Windows Server 2008 R2 and one running the VMware ESX 4 hypervisor) to act as iSCSI clients using a native software initiator in conjunction with a QLogic dual-port QLE3142 10GbE SFP+ Intelligent Ethernet Adapter.

For desktop systems, iSCSI over 1GbE is perfectly adequate as I/O throughput is governed by small-block reads and writes. That keeps typical desktop I/O under 10MBps. Only when it is streaming reads on a backup will a desktop system likely begin to encroach on the upper bounds of 1GbE throughput.

That is not the case for server I/O. Server applications, such as data warehousing and online analytical processing (OLAP), stream I/O in very large blocks. As a result, those I/O streams often reach and exceed three times the maximum throughput rate of a 1GbE iSCSI link.

While a well implemented MPIO storage interface on Fibre Channel can split and aggregate an I/O stream from a single process over multiple SAN paths without any special configuration by a storage administrator, link aggregation—aka trunking—on a LAN is far from automatic and is heavily dependent upon support by switches. In particular, each switch can introduce additional limits. Case in point: Our 1GbE Ethernet switch restricts trunking to physically contiguous ports and limits each trunk to no more than four ports.

Validating the flexibility of the firmware architecture of their Intelligent Ethernet Adapters, QLogic provides a teaming configuration service for Windows-based installations. Using the NxTeaming UI, an IT administrator can group adapter ports into Failsafe teams, as well as static and dynamic 802.3ad teams for throughput aggregation. When we grouped the two 10GbE ports on the QLE3142 in our server as a dynamic 802.3ad team, a new logical device appeared and replaced the two previous network connections. What differentiated the new device from typical teaming configurations was the recognition within Windows Server 2008 R2 that the new virtual device had double the throughput of either physical device.

With all of the restrictions on teaming, iSCSI and link aggregation in a 1GbE infrastructure have not proven to be a viable alternative to a high-throughput Fibre Channel SAN for large data centers. On the other hand, lower equipment costs combined with an order-of-magnitude leap in network throughput make iSCSI over 10GbE a very cost-effective way to extend an existing FC SAN. More importantly, this strategy sets the stage for converged SAN and LAN networking by using existing proven technologies, and can facilitate the introduction of emerging technologies, such as FCoE, at a future date.

To extend our existing Fibre Channel fabric, we provisioned a server with a QLogic QLE3142 10GbE adapter and a QLE2462 dual-port 4Gbps Fibre Channel HBA. The dual-port QLE2462 enabled us to leverage our dual-port Xiotech Emprise 5000 storage array, which our benchmarks pegged as capable of handling bidirectional data at 1.3GBps —a throughput rate sufficient to saturate one port of our 10GbE adapter in a failover configuration. To provide the explicit bridging of Emprise 5000 volumes as targets on our iSCSI SAN fabric, we ran StarWind Enterprise 5.2 software on Windows Server 2008 R2.
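The balance between the FC back end and the 10GbE front end can be sanity-checked with nominal figures. The sketch below is only a back-of-the-envelope comparison: the 400MBps-per-direction rating for a 4Gbps FC port is the usual nominal figure rather than a number from our tests, and all names are illustrative.

```python
# Nominal ratings (assumptions, not measured values).
FC_PORT_MBPS_PER_DIRECTION = 400   # usual nominal rating of a 4Gbps FC port
FC_PORTS = 2                       # dual-port HBA and dual-port array


def ethernet_raw_mbytes_per_sec(gbps: float) -> float:
    """Raw one-direction capacity of an Ethernet port in MBps (decimal units)."""
    return gbps * 1000 / 8


# Ceiling for bidirectional traffic across both FC ports.
fc_bidirectional_ceiling = FC_PORT_MBPS_PER_DIRECTION * FC_PORTS * 2
# Raw capacity of one 10GbE port in one direction, before protocol overhead.
ten_gbe_one_direction = ethernet_raw_mbytes_per_sec(10)
# Array throughput quoted in the text (1.3GBps).
measured_array_mbps = 1300

print(f"FC bidirectional ceiling: {fc_bidirectional_ceiling} MBps")
print(f"10GbE one direction:      {ten_gbe_one_direction:.0f} MBps")
print(f"Array exceeds one 10GbE direction: {measured_array_mbps > ten_gbe_one_direction}")
```

Under these assumptions, the 1.3GBps back end sits just above the raw one-direction capacity of a single 10GbE port and comfortably below the nominal ceiling of the two FC ports.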

Before testing the QLE3142 in our iSCSI configuration, we utilized NTttcp from Microsoft to calibrate TCP and UDP throughput between servers under controlled conditions. In particular, we specified the size and number of packets to test the effects of using standard and jumbo frames on network throughput and CPU overhead.

While 10GbE increases theoretical network throughput by an order of magnitude, it also increases overhead processing with a higher volume of TCP/IP traffic packets, which makes host offload support critical for achieving maximal throughput potential. Offloading TCP segmentation, the calculation of checksums, interrupt coalescence, along with the sending and receiving of large-block I/O can lower overhead significantly. Jumbo packet support can also have a measurable impact on the throughput of converged SAN data by lowering interrupt processing. Using NTttcp, we were able to examine these effects and set the stage for the performance of the QLogic QLE3142 in a mix of application-based protocol traffic in our 10GbE network.

Packets and frames work

In our initial tests of QLogic’s QLE3142, we utilized NTttcp, which is a Winsock-based port of ttcp. Our goal was to measure both throughput and the impact on host overhead. To make those assessments, we controlled the size and number of TCP packet streams under various adapter settings related to host offload.

We tracked all NTttcp benchmark tests with Windows Task Manager and Resource Monitor. In each test, we recorded network throughput, which we attempted to converge on 100%, and the overall processing load, which was essentially characterized by the CPU load of the benchmark and interrupt processing. Running with all of the QLE3142 offload functions enabled, processing overhead for three threads transmitting 99% of the theoretical bandwidth was 15% of our quad-core CPU and thread throughput varied by only 0.3%.

To conduct these tests, we needed two servers: One server sent multiple data streams using NTttcps, while the other server received data streams using NTttcpr. Both test systems ran Windows Server 2008 R2. On each of these servers, we set up a dual-port QLE3142 in a fail-over mode and updated the driver software included in the OS distribution package with the latest release on the QLogic web site.

To ensure that all tests were assessed in a steady state, we allotted each test run 30 seconds to warm up, 60 seconds to collect data, and 30 seconds to cool down. We then collected the total amount of data sent and received by each NTttcp thread to ensure that packets were not being dropped before we calculated network throughput. We also collected the average CPU usage over the testing period. With system monitoring and other concurrent processes representing less than two percent of the total processing, average overall CPU usage provided an excellent estimate of the overhead imposed by running the benchmark and handling network interrupts.
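The bookkeeping described above can be sketched in a few lines. This is a minimal illustration, not the tooling we used: the function names and the sample byte counts are hypothetical, and it assumes per-thread byte totals are available for the 60-second collection window.

```python
def throughput_mbps(total_bytes: int, seconds: float) -> float:
    """Convert a byte count over a time window into megabits per second."""
    return total_bytes * 8 / seconds / 1_000_000


def summarize_run(sent_bytes, received_bytes, collect_seconds=60.0):
    """Check that no data was dropped, then compute per-thread throughput."""
    if len(sent_bytes) != len(received_bytes):
        raise ValueError("sender and receiver thread counts differ")
    for thread, (sent, received) in enumerate(zip(sent_bytes, received_bytes)):
        if sent != received:
            raise ValueError(f"thread {thread}: {sent - received} bytes unaccounted for")
    per_thread = [throughput_mbps(b, collect_seconds) for b in received_bytes]
    mean = sum(per_thread) / len(per_thread)
    return {
        "per_thread_mbps": per_thread,
        "total_mbps": sum(per_thread),
        # Spread between fastest and slowest thread, as a percent of the mean.
        "spread_pct": (max(per_thread) - min(per_thread)) / mean * 100,
    }


# Hypothetical byte counts for three threads over a 60-second window.
counts = [25_000_000_000, 25_000_000_000, 25_000_000_000]
print(summarize_run(counts, counts))
```

Validating the byte counts before computing throughput keeps a run with dropped packets from being reported as a clean result.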

We analyzed all benchmark test results with respect to three use cases:

• Standard frames (1,500 bytes) with host offload functions enabled
• Jumbo frames (9,600 bytes) with host offload functions enabled
• Jumbo frames without host offload functions

In these tests, we used five fixed sizes of IP packets: 4KB, 8KB, 16KB, 32KB and 64KB. These packet sizes are the default sizes used by most applications when reading and writing to disk. As a result, these packet sizes provide significant insight into potential converged SAN and LAN networking and iSCSI performance in particular.

With jumbo frames and the host offload functions enabled, a single QLE3142 port sustained an average bidirectional throughput of 13,674Mbps using three read and three write NTttcp threads. Even with 8KB IP packets, we were able to exceed 10,000Mbps, which is an important watermark for iSCSI throughput in a Windows environment. Equally important, with bidirectional data, we were able to process upwards of 120,000 4KB IP packets per second.

QLogic’s family of Intelligent Ethernet Adapters allows IT administrators to extend the Ethernet frame size from the standard 1,500 bytes to a maximum of 9,600 bytes. By definition, a jumbo frame is any frame greater than 1,500 bytes, and 9,000 bytes is a common implementation. While 9,600 bytes is “jumbo” by LAN standards, SAN data frames are significantly larger. Nonetheless, 9,600 bytes is large enough to encapsulate the standard I/O request size (8KB) utilized by most applications running on Windows. A frame of that size also holds two 4KB I/O requests, which is the I/O size that Microsoft Exchange Server uses to access mailboxes.
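To make the frame arithmetic concrete, the sketch below counts how many frames a single I/O request needs at each frame size. It assumes that the 1,500-byte and 9,600-byte figures are the bytes available to IP in each frame and that every frame carries 40 bytes of IPv4 and TCP headers with no options; iSCSI headers are not modelled, and the function name is illustrative.

```python
import math

TCP_IP_HEADER_BYTES = 40  # assumption: IPv4 (20 bytes) + TCP (20 bytes), no options


def frames_needed(io_bytes: int, frame_bytes: int) -> int:
    """Number of frames required to carry one I/O request of io_bytes."""
    payload_per_frame = frame_bytes - TCP_IP_HEADER_BYTES
    return math.ceil(io_bytes / payload_per_frame)


for label, size in (("4KB", 4096), ("8KB", 8192), ("64KB", 65536)):
    standard = frames_needed(size, 1500)
    jumbo = frames_needed(size, 9600)
    print(f"{label}: {standard} standard frames vs {jumbo} jumbo frame(s)")
```

Under these assumptions an 8KB request drops from six standard frames to one jumbo frame, which is where the reduction in interrupt processing comes from.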

When we ran NTttcp using 4KB and 8KB IP packets with jumbo frames, the QLE3142 delivered 90% greater throughput for 4KB IP packets and 16% greater throughput for 8KB IP packets when compared to the throughput measured using standard frames. More importantly, our measured performance for 8KB packets was completely in line with the heuristic that jumbo frames provide a 15% to 20% boost in I/O throughput for applications running on iSCSI storage volumes.

As we continued to increase IP packet size, raw throughput with jumbo frames no longer commanded an advantage. On the other hand, CPU overhead for large IP packets using jumbo frames with all host offload functions enabled was minimal: 11% for a single thread sending 64KB IP packets with host offload, compared with 24% without host offload. The minimal host CPU overhead incurred with jumbo frames streaming large IP packets has a profound implication for server applications, such as backup and data warehousing, that stream reads and writes using 32KB to 128KB data blocks.

Running Iometer with a Xiotech volume accessed via the StarWind iSCSI server, streaming 128KB block I/O exhibited a substantial throughput improvement using jumbo frames. With all TCP/IP offload functions enabled, streaming throughput doubled as it rose to 474MBps and average access time fell to 26ms.

In a converged SAN and LAN environment, the very large range in data block sizes that applications can use to access storage puts a unique strain on network infrastructure. In particular, a database-driven application running on a server is likely to invoke small-block reads and writes in the 4KB to 8KB range, as well as large-block reads and writes in the 64KB to 128KB range. What’s more, these applications rely on parallel operations to enhance scalability. That means application scalability can be stymied by insufficient CPU cycles, as well as by insufficient I/O bandwidth.

That made the ability of the QLE3142 to offload network processing from the host CPU the most significant aspect of our benchmark tests. As expected from the results of the NTttcp benchmark, we measured a boost in 8KB I/O of about 16% with jumbo frames. More importantly, using jumbo frames while streaming large block I/O, we doubled iSCSI throughput for a single Iometer process from 223MBps to 474MBps and cut average I/O response time in half.
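The size of that improvement follows directly from the two throughput figures quoted above. The short sketch below simply restates them as a ratio and a percentage gain; the helper names are illustrative.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the new throughput to the old throughput."""
    return after / before


def percent_gain(before: float, after: float) -> float:
    """Improvement expressed as a percentage of the old throughput."""
    return (after - before) / before * 100


# Single Iometer process, streaming large-block I/O over iSCSI (MBps).
standard_frames, jumbo_frames = 223, 474
print(f"speedup: {speedup(standard_frames, jumbo_frames):.2f}x")
print(f"gain:    {percent_gain(standard_frames, jumbo_frames):.0f}%")
```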

iSCSI fabric configuration

With servers hosting eight or more VMs, a storage networking HBA can no longer be regarded as a simple commodity product. As the number of VMs sharing an HBA in a host server increases, SAN—whether Fibre Channel or iSCSI—fanout becomes a server issue as well as a switch issue. In a VOE, HBAs have to play the role of virtual switches for virtual fabrics created by virtual HBAs.

With the common drivers for the QLogic Intelligent Ethernet Adapters included in ESX 4 distributions, VMware hosts immediately recognize any QLogic 10GbE or 1GbE adapter. Within the VMware management GUI, each adapter port appears as a distinct virtual adapter, which an IT administrator can assign to any port on a virtual switch. We created a VMkernel port, which is used for iSCSI and VMotion, and assigned one QLE3142 port for our Iometer benchmark tests.

Virtual LAN switches are already an integral part of the VMware environment. IT administrators create virtual switches in order to connect VMs on a host to external resources via specialized ports—all iSCSI connections are made through VMkernel ports. More importantly, IT administrators assign virtual network addresses to virtual switch ports rather than adapter ports. In this way, an IT administrator is able to assign multiple virtual adapters to a port as a team for load balancing and high availability.

While VMware provides a very easy way to implement teams that virtualize connections that will be automatically balanced over multiple physical links, a scheme designed for universal automatic load balancing precludes the aggregation of links into single, virtual, multi-gigabit pipes. As a result, a team of 1GbE adapters will still limit I/O throughput for a single process to a peak of 120MBps.
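The reason a team does not speed up a single stream can be shown with a toy model. The sketch below assumes a generic hash-based policy that pins each flow to one uplink; it is not VMware's actual algorithm, and the uplink names and flow labels are hypothetical.

```python
import zlib


def pick_uplink(flow_id: str, uplinks: list) -> str:
    """Hypothetical policy: hash the flow identifier onto exactly one uplink."""
    index = zlib.crc32(flow_id.encode("utf-8")) % len(uplinks)
    return uplinks[index]


def links_used(flows: list, uplinks: list) -> set:
    """Set of uplinks that carry traffic for the given flows."""
    return {pick_uplink(flow, uplinks) for flow in flows}


team = ["uplink0", "uplink1", "uplink2", "uplink3"]  # four teamed 1GbE ports
one_stream = ["vm1->iscsi-target:3260"]
many_streams = [f"vm{n}->iscsi-target:3260" for n in range(1, 33)]

print("links used by one stream:  ", len(links_used(one_stream, team)))
print("links used by many streams:", len(links_used(many_streams, team)))
```

In this model a single stream always lands on one physical link, so its ceiling stays at the rate of that link no matter how many adapters join the team; only a mix of many flows spreads across the team.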

Given the CPU power of today’s servers and the I/O throughput of today’s FC SAN disk arrays, 120MBps will be a bottleneck. Even if none of the business applications on a VM streams data, that limited level of throughput will adversely impact backup and replication software.
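The 120MBps figure is close to what simple framing arithmetic predicts for 1GbE. The sketch below is an estimate under stated assumptions: 38 bytes of Ethernet overhead per frame (preamble, header, frame check sequence, and inter-frame gap), 40 bytes of IPv4 and TCP headers with no options, and no allowance for iSCSI headers. The function name is illustrative.

```python
ETHERNET_OVERHEAD_BYTES = 38  # preamble/SFD 8 + header 14 + FCS 4 + inter-frame gap 12
TCP_IP_HEADER_BYTES = 40      # assumption: IPv4 (20) + TCP (20), no options


def payload_rate_mbytes(line_rate_mbps: float, mtu: int) -> float:
    """Estimate the best-case TCP payload rate in MBps for a given line rate and MTU."""
    payload = mtu - TCP_IP_HEADER_BYTES
    on_wire = mtu + ETHERNET_OVERHEAD_BYTES
    return line_rate_mbps / 8 * payload / on_wire


print(f"1GbE, 1,500-byte frames:  {payload_rate_mbytes(1_000, 1500):.1f} MBps")
print(f"10GbE, 9,600-byte frames: {payload_rate_mbytes(10_000, 9600):.1f} MBps")
```

Under these assumptions a 1GbE link tops out just under 120MBps of payload in one direction, while a 10GbE link with jumbo frames leaves room for more than ten times that.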

Breaking bottlenecks

We focused our testing of the QLE3142 in a virtualized environment on the ability of IT to use it to build a low-cost, high-throughput iSCSI SAN fabric that is comparable to a typical Fibre Channel SAN fabric. In our test configuration, we attempted to extend the reach of an existing 4Gbps FC SAN fabric to additional systems over Ethernet. The cornerstone of this test scenario would therefore be a central server with well balanced Fibre Channel and Ethernet I/O capabilities to maximize end-to-end converged fabric throughput.

At the heart of our assessment, we configured a quad-core server with FC SAN access through a dual-port QLogic QLE2462 HBA and Ethernet iSCSI access through a dual-port 10GbE QLogic QLE3142 and a quad-port 1GbE QLE3044. That server had direct access to 8TB of storage on a Xiotech Emprise 5000 array, which features dual active-active 4Gbps FC ports. Thanks to Xiotech’s MPIO module for Windows, the performance improvements in the MPIO subsystem of Windows Server 2008 R2, and QLogic’s HBA drivers for Windows Server 2008 R2, we had a backend FC SAN device that could deliver 1.3GBps of data in near-perfect balance.

To virtualize and present that storage via iSCSI, we ran the StarWind Enterprise Edition server software. To maximize the functionality of our iSCSI volumes, we created and exported image files from volumes provisioned on the Emprise 5000. Alternatively we could also use StarWind’s software to simply reroute an Emprise volume as a disk bridge device. With well balanced I/O throughput to our FC SAN and the means to provide virtualized iSCSI access to storage resources, we were left with the knotty Ethernet delivery problems of load balancing and link aggregation.

With workstation applications, nearly all I/O uses 8KB I/O requests, which pegs a typical I/O throughput load at under 10MBps and peak streaming I/O under 40MBps. A 1GbE link easily satisfies those requirements. All that is necessary to scale that model with multiple clients is a good load balancing scheme and a large number of physical 1GbE connections.

For a workstation scenario, QLogic’s teaming utilities for the Intelligent Ethernet Adapter family make the QLE3044 an excellent choice. In conjunction with an Ethernet switch that supports 802.3ad trunking, an IT administrator can use the QLogic teaming software to configure a virtual trunk with 4Gbps aggregate throughput between the server and the switch. The switch will then automatically handle load balancing for all 1GbE workstation connections.

On the other hand, server applications often involve I/O streaming using block sizes that range from 32KB to 128KB in size. This large-block I/O generates data streams with throughput that is an order of magnitude greater than that of a workstation. Without the ability to implement 802.3ad trunking for link aggregation on a VOE host, a 10GbE adapter such as the QLogic QLE3142 is the only way to deliver the level of I/O throughput required by servers.

Streaming server data

We began testing the QLE3142 with an examination of bidirectional I/O streaming. Our primary goal was to assess the ability of a single VM application to break through the 120MBps limit of iSCSI over 1GbE. Given our previous controlled results with NTttcp, we fully expected to see results that would be comparable to running our VOE host directly on a Fibre Channel SAN.

We tested I/O streaming over 10GbE in two phases. In the first phase, one VM streamed large-block (64KB) reads from an RDM volume. In this phase I/O throughput held steady at 225MBps. Next, we launched Iometer on a second VM and streamed 64KB writes to another RDM volume, while the first VM continued reading. During this phase, we monitored perfect bidirectional I/O balance from the FC backend to the 10GbE front end. Read and write requests for the Emprise 5000 array were perfectly balanced across both FC ports of the QLogic QLE2462 HBA in the StarWind server. That balance was matched on the QLE3142 adapters connecting the iSCSI server with the ESX host as simultaneous reads and writes scaled to 280MBps.

We provisioned all storage on our VOE host with volumes from our iSCSI server. For our tests, we configured two dual-processor VMs. Each VM ran Windows Server 2008 on a fully virtualized system disk and used Raw Device Mapped (RDM) work volumes. RDM volumes are physically formatted by the VM’s OS—NTFS in our case. For full VMFS support, we assigned each RDM volume a mapping file that was collocated with the VM’s system disk in an ESX datastore.

We conducted our streaming I/O tests in two phases. During the first phase, we used a single Iometer worker process to stream 64KB reads from an RDM volume. Throughput for that worker was sustained at 225MBps.

With the first VM in a steady state reading data, we launched a second Iometer process on another VM. That second process introduced bidirectional I/O by streaming 64KB writes. With two VMs, throughput for reads and writes remained in a perfect 50/50 balance for the entire test period as throughput scaled to 280MBps.

To test support for transaction processing, which includes hosting MS Exchange on a VM, we launched Iometer processes that read and wrote data in an 80/20 ratio using 4KB data blocks. Using two VMs, I/O levels were maintained in tight balance at 4,100 to 4,200 IOPS for each VM. More importantly, host processing overhead for both VMs consumed about 15% of the available CPU cycles, with the lion’s share expended on the user process.

Next, we turned our attention to the ability of the QLE3142 to support transaction processing (TP) applications on a 10GbE iSCSI SAN. In this test we made random rather than sequential I/O requests, following a database rather than a file access pattern. Each VM process mixed read and write requests in an 80/20 ratio of reads to writes.

Minimization of I/O latency is key for this test, and was evident in testing NTttcp with jumbo frames and small data packets. Our precursor to these tests, small-block reads using a StarWind RAM disk, was also in line with the NTttcp test.

With one VM, I/O throughput averaged about 5,000 IOPS, which is right in line with what we have measured with direct FC SAN access to the Emprise 5000. With two VMs, throughput scaled to about 8,200 IOPS with both VMs in near lock step. Given these results, our 10GbE iSCSI configuration with the QLogic QLE3142 Intelligent Ethernet Adapter should be able to support the installation of Microsoft Exchange Server on a VM with upwards of 5,000 mailboxes.
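It helps to translate those IOPS figures into bandwidth. The sketch below does the conversion for 4KB requests using the approximate results quoted above; the function names are illustrative.

```python
def iops_to_mbytes_per_sec(iops: float, block_bytes: int = 4096) -> float:
    """Convert an I/O rate into data throughput in MBps (decimal units)."""
    return iops * block_bytes / 1_000_000


def scaling_factor(one_vm_iops: float, two_vm_iops: float) -> float:
    """How much total IOPS grew when the second VM was added."""
    return two_vm_iops / one_vm_iops


one_vm, two_vms = 5_000, 8_200  # approximate IOPS figures from the text
print(f"one VM:  {iops_to_mbytes_per_sec(one_vm):.1f} MBps")
print(f"two VMs: {iops_to_mbytes_per_sec(two_vms):.1f} MBps")
print(f"scaling: {scaling_factor(one_vm, two_vms):.2f}x")
```

Even at 8,200 IOPS, the 4KB workload moves only about 34MBps, a small fraction of 10GbE bandwidth, which is consistent with latency rather than raw throughput governing transaction processing.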

As a result, the simplest and most immediate strategy for IT to begin leveraging 10GbE in a data center, especially when dealing with a VOE, is to begin implementing iSCSI along with LAN networking for critical data management services. To facilitate full 10GbE networking over a wide range of servers, QLogic’s Intelligent Ethernet Adapters offload network-related processing from the host, which our benchmarks show significantly improves the host’s ability to deliver maximal 10GbE throughput.

Jack Fegreus is CTO of openBench Labs.

OPENBENCH LABS SCENARIO

UNDER EXAMINATION: 10GbE converged SAN & LAN networking

WHAT WE TESTED:
QLogic QLE3142 Intelligent Ethernet Adapter
— Two 10GbE ports
— SFP+ copper connector
— 9,600-byte jumbo packet support
— Full range of TCP/IP host offload functions
— 802.3ad teaming support for link aggregation

HOW WE TESTED:
(3) Dell PowerEdge Servers
— (2) Windows Server 2008 R2
    — QLogic QLE3142 10GbE adapter
    — QLogic QLE2462 4Gbps HBA
    — StarWind Enterprise Edition iSCSI server
    — Iometer benchmark
    — NTttcp
— (1) VMware ESX Server 4
    — QLogic QLE3142 10GbE adapter
    — (2) VMs running Windows Server 2008 R2
    — Iometer

Xiotech Emprise 5000 storage system
— (2) Balanced ISE DataPacs
— (2) 4Gbps Fibre Channel ports
— (2) Managed Resource Controllers
— Active-Active MPIO
— Web Management GUI

KEY FINDINGS:

— Hosts supporting 10GbE with QLogic Intelligent Ethernet Adapters exhibit minimal overhead processing as QLogic drivers and programmable adapters offload functions such as TCP and UDP checksums, stateless offloading of large sends and large receives, TCP segmentation, and interrupt coalescing.

— Teaming software for adapters supports 802.3ad teaming with full link throughput aggregation in conjunction with switches that support 802.3ad trunking.

— Common drivers and utilities support all Intelligent Ethernet Adapters across major operating systems and are included in server distributions such as Windows Server 2008 R2 and VMware Hypervisors.

— TCP/IP-centric benchmark results: With jumbo frames, multiple NTttcp threads streamed bidirectional data at 13,674Mbps using 64KB IP packets.

— Comparable iSCSI and FC SAN storage benchmarks: A single process streamed data from an iSCSI virtual volume using 128KB reads at 474MBps and accessed 4KB blocks of data from random locations at 5,700 IOPS.