The lab's first foray into storage area networks posed some minor challenges and yielded some surprising performance results.
BY JACK FEGREUS
"You've got to send the cable back." That caveat sums up OpenBench Labs' first foray into the world of storage area networks (SANs). When we set out to build a Linux SAN, we had no appreciation for the "through-the-looking-glass" world we were about to enter. After all, SANs have been the coming IT panacea for nearly three years, and the first major revision of the Fibre Channel hardware-from 100MBps to 200MBps-is about to be unleashed on the market.
In addition, Linux has been a no-show when it comes to SANs. Sun Solaris has maintained a hold over the industry, but in recent months, Windows NT has made inroads and now battles Solaris for the honor of being the first operating system for which to write SAN software. Now that Linux has become the fastest growing operating system, OpenBench Labs set out to prove that a Linux SAN was feasible.
As our first criterion, we wanted to start building the OpenBench Labs SAN on a distribution of Linux that was 2.4-ready. With the new kernel set to dramatically improve a number of key characteristics of Linux for use in the data center, we wanted to be able to move to the new kernel as quickly as possible. What we discovered was a modicum of drivers and software available for Red Hat 7.0 and a greater amount of software, still only in beta, that worked exclusively with Red Hat 6.2. For example, QLogic has drivers that support its QLA2200 Fibre Channel host bus adapters (HBAs) on both Red Hat distributions, but the software application to enable automatic fail-over for systems with two HBAs is only in beta for the 6.2 version of Red Hat.
Drilling down on the SANbox icon brings up a diagram of the switch showing the GBIC connections and the performance of each port on the switch.
After much internal debate, we decided to move ahead in this first phase of our SAN project with all Linux systems running version 7.0 of Red Hat. In addition, we had intended to introduce a Windows 2000 server into our SAN later in the project; however, we found it was essential for the configuration of one of our storage arrays right from the start.
For many people, Fibre Channel is synonymous with fiber optics, but it isn't necessarily so. While fiber-optic cables with SC connectors are the de facto standard, twinaxial copper cables with HSSDC and DB9 connectors are also quite common. Fibre Channel deals with this dichotomy by using Gigabit Interface Converters (GBICs), once called Gigabit Link Modules. While many devices, such as the Sony GY 8240FC tape drive that we tested, come with a standard SC connection, Fibre Channel switches and many HBAs come with a slot that requires a module that provides either a copper or fiber-optic interface. So in addition to the pricey cables, you'll need to factor in several hundred more dollars to put GBICs on one or both ends of the cables. For OpenBench Labs, dealing with the connectors was the only difficulty we faced in setting up our SAN.
In our tests, we used 64-bit QLogic QLA2200 PCI HBAs and QLogic SANbox-8 Fibre Channel switches, which support Class 2 and Class 3 connectionless services with a maximum frame size of 2,148 bytes. Any or all of the eight ports may be Fabric Ports, which connect to Fibre Channel public devices and device loops; Segmented Loop Ports (SL_Ports), which divide a Fibre Channel Private Loop into multiple segments; Translated Loop Ports (TL_Ports), which connect a private loop and devices "off-loop"; or Trunk Ports (T_Ports), which allow the interconnection of multiple chassis to form larger fabrics.
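The 2,148-byte maximum frame size is consistent with the standard Fibre Channel frame layout, in which fixed framing fields surround a data field of up to 2,112 bytes. The short sketch below simply adds up those fields; the dictionary and function names are illustrative and are not taken from any QLogic software.

```python
# Field sizes, in bytes, of a maximum-length Fibre Channel frame.
# The names below are illustrative labels, not vendor identifiers.
FRAME_FIELDS = {
    "start_of_frame": 4,
    "frame_header": 24,
    "data_field_max": 2112,
    "crc": 4,
    "end_of_frame": 4,
}


def max_frame_size(fields=FRAME_FIELDS):
    """Total bytes on the wire for a frame carrying a full data field."""
    return sum(fields.values())


def payload_efficiency(fields=FRAME_FIELDS):
    """Fraction of a maximum-length frame occupied by the data field."""
    return fields["data_field_max"] / max_frame_size(fields)


if __name__ == "__main__":
    print(max_frame_size())                 # 2148
    print(round(payload_efficiency(), 3))   # 0.983
```

In other words, framing overhead on a full-size frame is under 2%, so it does not by itself explain streaming rates well below the nominal link speed.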
As is customary with all such devices, the QLogic SANbox switch has a Web-based GUI, dubbed SANsurfer, for administration. The GUI is quite intuitive; however, we found performance to be sluggish with versions of Netscape prior to 6.0. Using a classic drill-down paradigm, the SAN administrator can view dynamic performance statistics for each online port. Performance data includes frames-in (green), frames-out (blue), dropped frames (yellow), and errors (red). The administrator can also view name server data for each device connected to the selected chassis, the type of GBIC installed in each port, and the address, WWN, FC-4 type, and status of each loop device connected to any port on the selected chassis.
We configured an SGI Total Performance (TP) 9100 RAID array with 12 Seagate 10,000rpm Cheetah drives on two internal FC-AL loops, a Mylex DACFFX internal RAID controller with dual i960 microprocessors, and 128MB of cache. The SGI RAID array is configured with DB9 connectors for twinaxial copper Fibre Channel cables. On plugging the array into the SAN, both servers running Red Hat 7.0 recognized and were able to use the two logical drives that the array was presenting to the SAN.
The only problem was that there was no way to administer the array. The only way to configure the array is with Mylex's Global Array Manager (GAM), and the client side of GAM runs only on Windows systems. There is a GAM server for Red Hat 6.2, but this still requires a networked Windows PC to run the client GUI. We therefore found it necessary to add a Windows 2000 server into the SAN. In addition to providing a means of configuring the TP9100, this would allow us to compare Fibre Channel performance between Linux and Windows 2000.
Our main desire in configuring the TP9100 was to ensure the array was also configured as RAID 0+1 with two logical drives in the same manner as our FlashDisk array from Winchester Systems. For sites that don't need a full-blown SAN, the FlashDisk array provides a Power PC-driven RAID device that can be ported to multiple SCSI HBAs with no special SCSI cabling.
In our configuration, performance differences between the two arrays under Red Hat 7.0 were minimal. It was only with Windows 2000 and I/O loading that performance differed dramatically. Because of its asynchronous I/O model, Windows 2000 Server was able to drive I/O rates on the order of 4,500 I/Os per second on the FlashDisk array and 7,500 I/Os per second on the TP9100 array. Both RAID arrays topped out at 1,500 I/Os per second on Red Hat.
Streaming throughput from the SCSI-attached FlashDisk array was comparable to that of the Fibre Channel SGI TP9100. With four threads, however, the TP9100 demonstrated greater headroom.
More applicable to the real world is the need for high-speed cache and streaming data. To deal with this problem, OpenBench Labs tested a Texas Memory Systems RAM-SAN multi-ported solid-state disk. Memory capacity for the RAM-SAN ranges from 4GB to 64GB in 4GB increments. The unit we tested had 16GB that we configured as two 8GB drives on separate Fibre Channel ports. The device has four parallel memory buses, each of which can have its own Fibre Channel connection.
With our 100MBps SAN, streaming throughput measured on both Red Hat 7.0 and Windows 2000 Server was virtually identical. Interestingly, for both systems, writes were significantly faster than reads: 90MBps to 93MBps for writes, compared to 66MBps to 68MBps for reads. Once again, the conventional wisdom about Linux performance in a SAN vis-à-vis Windows 2000 was shown not to be exactly correct. Nonetheless, our biggest surprise was about to come as we increased the complexity of our SAN topology.
Having laid down the foundation of a SAN with a single 8-port switch, a collection of disk devices, and two servers (which were already clustered over a shared SCSI bus), we set out to begin building a more representative SAN fabric. As its name implies, the purpose of a SAN is to create a network fabric of storage devices. The goal is to provide multiple high-speed paths to optimally access devices and maintain a high level of availability. To achieve this, multiple switches are absolutely necessary.
As with most sites that begin building a SAN, our immediate need was simply to expand the number of user ports beyond the eight ports available in our initial QLogic SANbox switch chassis. When planning for expansion, note that three basic multi-chassis topologies can be built using SANbox switches. These topologies are the basic cascade and mesh, along with what QLogic dubs "Multistage."
The critical caveat is that you cannot mix the topologies in the same fabric. As a result, expansion needs to be planned. Here, as in any network, the issues are bandwidth between switches, routing over a minimum number of switched paths to minimize latency, and efficient use of the physical ports.
Remarkably, the performance of the RAM-SAN actually improved when the two logical devices were split across the two switches. In this configuration, total throughput using two servers scaled linearly.
The simplest multi-switch topology to implement is a cascade. As the name implies, in a cascade configuration, switch chassis are conceptually connected in a row "one-after-the-next," much like Ethernet hubs and switches are cascaded. Not surprisingly for a Fibre Channel SAN, the cascade configuration can optionally sport a connection from the last switch back to the first to form a continuous loop. Among its advantages, a loop provides an alternate fail-over path when only single-port connections are used between switches.
The problem for a site implementing a cascade topology, which is only partially alleviated with a looped cascade, is dealing with the latency that can be induced by excessive routing. In a cascade topology, each switch will route traffic in the direction of the least number of switch hops. Latency to any port on the same switch is defined as one switch hop. Latency to any port on an adjacent switch is two hops, again counting the source switch.
As a result, the furthest device in a fabric with n cascaded switches may require n hops from switch-to-switch. Adding a simple loop reduces that number to (n+1)/2 or (n/2)+1, depending on whether n is odd or even. Nonetheless, with a large number of switches, even that reduced number of hops could easily introduce some complicated latency issues.
To overcome these routing issues, a mesh fabric can be woven by connecting each switch to every other switch. In a mesh topology, the maximum number of routing hops to any device is always two. It should be noted that in a fabric with only two or three chassis, a looped cascade and a mesh topology are exactly the same. This was the approach taken by OpenBench Labs.
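The hop counts quoted above for the three arrangements can be sketched as a small function. It follows the article's convention of counting the source switch as the first hop; the function name and topology labels are illustrative assumptions, not QLogic terminology.

```python
def max_hops(topology, n):
    """Worst-case switch hops between two user ports in a fabric of n
    switches, counting the source switch as the first hop."""
    if n < 1:
        raise ValueError("a fabric needs at least one switch")
    if topology == "cascade":
        # Traffic may have to traverse the entire row of switches.
        return n
    if topology == "looped_cascade":
        # The farthest switch is halfway around the loop:
        # (n + 1) / 2 for odd n, (n / 2) + 1 for even n.
        return n // 2 + 1
    if topology == "mesh":
        # Every switch is adjacent to every other switch.
        return 1 if n == 1 else 2
    raise ValueError("unknown topology: " + topology)


if __name__ == "__main__":
    for n in (2, 3, 4, 8):
        print(n, [max_hops(t, n) for t in ("cascade", "looped_cascade", "mesh")])
```

For two or three switches the looped cascade and the mesh give the same worst case of two hops, which matches the observation that the two topologies coincide at that size.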
Whether in a cascade or mesh SAN topology, any port on a SANbox switch can be either a user port (in QLogic parlance, a port connected to a user device, i.e., a storage device or a server) or a T_Port, which is used to connect one switch to another. Each port on the SANbox switch is configured to detect whether it is connected to a device or another SANbox port and automatically configures itself as either a user port or T_Port. When ports are configured as T_Ports, the SANbox switch guarantees in-order delivery of packets with any number of T_Port links between switches.
A mesh topology immediately addresses the issue of device latency brought about by hopping from switch to switch in the SAN. There are, however, the twin issues of bandwidth between switches and efficient utilization of the number of physical ports, which we have conveniently ignored up to this point.
Each T_Port link between directly connected SANbox switches provides 100MBps of bandwidth between those switches. In the case of the OpenBench Labs SAN, we had two Linux servers connected to a single 8-port switch. Each server has a single QLogic QLA2200 Fibre Channel HBA capable of providing 100MBps of throughput. In theory (and, as we later demonstrated, in practice), we should be able to push 200MBps of total throughput through the SAN.
For our SAN topology, the worst-case scenario is therefore the situation where two servers are connected to one switch and simultaneously each tries to access a device that is connected to a second switch. To avoid a bottleneck in throughput between the switches, we need to provide for 200MBps of bandwidth between those two switches.
In other words, we must devote two ports on each of the switches as T_Ports to provide as much bandwidth between interconnected chassis as would be available were the devices and servers all connected to a local switch.
In the OpenBench Labs scenario, this severely limits the scalability of the SAN mesh fabric. To provide consistent 200MBps bandwidth for our two servers, two ports on each switch must be devoted to each interconnection. A mesh fabric with four switches requires each switch to reserve six ports for T_Port connections to the other three switches. With our current 8-port SANbox switches, that scheme effectively creates the analog of a single, but geographically distributed, 8-port switch, as each of the four switches contributes just two user ports.
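The port arithmetic behind that conclusion can be sketched as follows, assuming 100MBps per T_Port link and 100MBps per HBA as described above. The function names and the returned dictionary keys are illustrative only.

```python
import math


def links_needed(servers, hba_mbps=100, link_mbps=100):
    """T_Port links needed between two switches so that the inter-switch
    path is not a bottleneck for the given number of servers."""
    return math.ceil(servers * hba_mbps / link_mbps)


def mesh_port_budget(switches, ports_per_switch=8, links_per_pair=2):
    """Split each switch's ports into T_Ports and user ports for a full mesh."""
    t_ports = (switches - 1) * links_per_pair
    user_ports = ports_per_switch - t_ports
    if user_ports < 0:
        raise ValueError("not enough ports to build this mesh")
    return {
        "t_ports_per_switch": t_ports,
        "user_ports_per_switch": user_ports,
        "user_ports_total": user_ports * switches,
    }


if __name__ == "__main__":
    pair_links = links_needed(servers=2)      # 2 links, i.e. 200MBps
    for n in (2, 3, 4, 5):
        print(n, mesh_port_budget(n, links_per_pair=pair_links))
```

Under these assumptions, two or three switches yield 12 user ports in total, four switches drop back to 8, and five switches leave no user ports at all.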
While a looped cascade topology does not have the scalability issue of a geometrically expanding number of T_Ports, there is a unique bandwidth problem for such a topology. A switch in a looped cascade topology divides its interconnection bandwidth, effectively directing half of the bandwidth in each direction around the loop. That's because the routing algorithm strictly looks at the least number of hops to the desired destination. For a small SAN with just two or three switches, the topology isomorphism between mesh and looped cascade makes these bandwidth differences moot.
So OpenBench Labs began its foray into weaving a more complex SAN fabric by linking two SANbox switches over dual 100MBps T_Port paths. Having the Texas Memory Systems Fibre Channel RAM disk gave us the perfect opportunity to examine the issues of latency induced by switch hops. It has four independent 200MBps Fibre Channels over which the system's internal volatile RAM can be configured into logical disk drives. In our previous tests with a single SANbox switch, we had measured throughput on writes to peak on the order of 90MBps to 93MBps. On the other hand, performance on reads was a less-stellar 68MBps.
We began by configuring the Texas Memory Systems RAM-SAN as two logical drives, each on its own internal Fibre Channel connection. We then split the two servers, Tuxilla1 and Tuxilla2, along with the two logical RAM-SAN drives (RAM-SAN3 and RAM-SAN11), across the two SANbox switches, respectively. Next, we ran the OBLdisk benchmark on Tuxilla1 and accessed RAM-SAN11, which meant we would incur a two-hop latency. Considering that we had configured a path with a 200MBps bandwidth between the two switches, we were expecting to measure no added latency, or perhaps a negligible amount, as compared to our previous tests.
The last thing we expected to find was dramatically improved performance, but that's exactly what happened. With each of the active RAM-SAN Fibre Channel interfaces connected to an independent switch, read performance jumped to 90MBps, on close par with write performance. This also proved to be the case when we repeated the test on the adjacent RAM-SAN3 device. In both cases, the performance was virtually identical. In addition, when we used both servers simultaneously, the RAM-SAN's total throughput on reads nicely scaled to 177MBps.
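As a quick check on how close that result is to ideal scaling, the aggregate figure can be compared with the single-server read rate multiplied by the number of servers. The helper below is only an illustration of that arithmetic, using the 90MBps and 177MBps figures quoted above.

```python
def scaling_efficiency(aggregate_mbps, single_mbps, servers):
    """Ratio of measured aggregate throughput to perfect linear scaling."""
    return aggregate_mbps / (single_mbps * servers)


if __name__ == "__main__":
    # 177MBps measured with two servers versus 2 x 90MBps ideal.
    print(round(scaling_efficiency(177, 90, 2), 3))   # 0.983
```

An efficiency of roughly 98% is what is meant here by throughput that scales linearly.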
We also tested the RAM-SAN for I/O loading and immediately, within two to three I/O Daemons, reached the maximum capability of the QLogic 2200 with 15,000 I/Os per second, making the RAM-SAN an intriguing database device. In its current volatile memory configuration, index files, which can be rebuilt if lost, offer the most logical choice. Future iterations of the RAM-SAN will sport nonvolatile memory, which will simplify its utilization in a database scenario.
Naturally, these performance measurements raise an important question: Why did RAM-SAN performance jump in a fabric with multiple switches? One could speculate about the generational differences between 100MBps and 200MBps Fibre Channel interconnects; however, the important thing is that neither QLogic nor Texas Memory Systems came up with a conclusive answer, which left OpenBench Labs to ponder the mysteries of SAN management.
The next step in our evolution of the OpenBench Labs SAN was to add a simple tape drive. For our first Fibre Channel tape device, we chose to install an Exabyte Mammoth-2 drive with a native Fibre Channel interface. Having already tested the Mammoth 2 over an LVD Ultra160 SCSI interface, we had sufficient data to compare any differences in performance.
We began our tests with a new version of our OBLtape benchmark, which extends the size of data blocks up to 256KB. The results with 128KB data blocks over Fibre Channel were dead-on with our benchmark results over an Ultra160 SCSI interface. With uncompressed data we measured throughput at 11.9MBps. When we turned compression on and sent simulated file data to the drive, throughput jumped to 22.1MBps. Finally, in our worst-case test scenario in which we sent purely random data that cannot be compressed, the Mammoth 2's throughput fell to 10.9MBps as the device wasted cycles attempting to compress the data.
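The three throughput figures can be reduced to two ratios: the effective gain from compression on file-like data, and the penalty paid when the drive tries to compress incompressible data. The sketch below is plain arithmetic on the numbers quoted above; the function names are illustrative.

```python
def compression_gain(compressed_mbps, native_mbps):
    """Throughput with compression on, relative to the uncompressed rate."""
    return compressed_mbps / native_mbps


def random_data_penalty(random_mbps, native_mbps):
    """Fractional throughput lost when incompressible data is sent
    with compression enabled."""
    return (native_mbps - random_mbps) / native_mbps


if __name__ == "__main__":
    print(round(compression_gain(22.1, 11.9), 2))      # 1.86
    print(round(random_data_penalty(10.9, 11.9), 3))   # 0.084
```

So compression bought a bit under a 2:1 improvement on the simulated file data, while random data cost roughly 8% of native throughput.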
With these theoretical parameters set with our OBLtape benchmark, we next turned to a more real-world exercise. We configured the drive as a shared device on both servers within a beta version of BakBone Software's NetVault 6.0.3 backup package. As expected, both servers could see the drive and access the drive whenever it was free. In our 4GB backup saveset tests, backup throughput proved to be a statistically insignificant 1% better than over SCSI. Nonetheless, the results are in some cases dramatically improved over the previous version of NetVault, 6.0.1.
We expected to see similar results to what we had measured with NetVault 6.0.1 and the Exabyte X80, with maybe a jump of 9.1% in performance. What we measured were performance increases that in some cases exceeded 25%. On a single backup process, we were able to average 19.1MBps, which represented a 7% boost in throughput. What was really eye-popping was the rate at which the new version of NetVault restored our saveset. The restore blazed away 32% faster than before and clocked in at 18.5MBps.
Our configuration of two servers sharing a tape drive (or, more importantly, a large and expensive library) over a SAN leaves one important hole. The NetVault package maintains a very rich database on each server on which it runs. This data includes information about the jobs that were run on that server and, more importantly, which savesets are stored on a particular media cartridge. For NetVault on Tuxilla1 to know what is on a backup tape created on Tuxilla2, it must read the tape as a "foreign" backup.
In the OpenBench Labs scenario, Tuxilla1 and Tuxilla2 are already configured as a Convolo Cluster over a shared FlashDisk array. A natural extension would be to add the Fibre Channel TP9100 array from SGI into the pool of shared storage available to the Convolo Cluster. This was a trivial task since all of the Fibre Channel devices appear to each system as shared SCSI devices. As a result, we were easily able to configure a shared database service for NetVault residing on a logical drive presented by the TP9100.
Everything was fine except that not only is the NetVault software not "cluster aware," it is downright cluster-hostile. Ideally, we were hoping that we might be able to make NetVault itself a cluster service, which could be built from a single physical installation of NetVault that would be tied to the cluster alias. Unfortunately, NetVault is too complex a program to make that a practical alternative.
NetVault is designed to provide both client and server services in a complex heterogeneous network over a large array of operating functions. As a result, NetVault buries countless configuration files that are tied to the physical machine throughout /etc as well as its own /NetVault6 directory. This makes simply sharing the /NetVault6 directory structure, and expecting that both Tuxilla1 and Tuxilla2 would be able to run NetVault, a fantasy. As an alternative, we examined the possibility of just sharing NetVault's /nvdb directory and creating a Convolo Cluster service that would determine which server would have access to the database, in the way that we could configure MySQL or Oracle.
Unfortunately, NetVault encodes each server with a unique system ID. When we failed over the NetVault database service from Tuxilla1, which created the database, to Tuxilla2, it was unable to access any of the records. Even the existing tape drive device was hidden from NetVault running on Tuxilla2. In short, if we wanted to configure a standard NFS or MySQL service that would reside on the TP9100, we would not have a problem. Trying to provide a clustered NetVault service, however, presented us with a significant software problem.
Finally, if you are wondering about that third complex SAN topology, dubbed Multistage by QLogic, we've saved the most complex SAN issue for last. A Multistage topology consists of switches configured in one of two distinct stage types: standard Input-Output/Transfer (IO/T) stage types, which have the user and T_Ports found in a mesh or cascade topology switches, and a Cross-Connect (CC) stage type. In this topology, the T_Ports on the IO/T stage switches only connect to ports on one or more CC stage switches, whose sole function is to interconnect T_Ports. In a Multistage topology, there are only three hops to any IO/T switch. Latency is therefore theoretically worse than any mesh, in which every switch is an adjacent switch, and better than any cascade topology with seven or more switches, where the maximum number of hops may be greater than four.
In theory, a Multistage topology should also provide the best bandwidth. All T_Ports from each IO/T switch connect via a CC switch to all other IO/T switches in the same number of hops-three-no matter how large the fabric is. It is possible in Multistage topology to devote half of the 8-port SANbox bandwidth to T_Ports in order to provide 400MBps of bandwidth between any two IO/T switches. The question is how well do various switches plug-and-play in such a complex fabric. We'll be examining that question in future labs reviews.
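The 400MBps figure follows from devoting half of an 8-port switch to T_Ports at 100MBps each. A minimal sketch of that split is shown below; the function name, parameters, and dictionary keys are assumptions for illustration, and the fixed hop count of three is taken from the description above.

```python
def multistage_io_switch(ports_per_switch=8, t_port_fraction=0.5, link_mbps=100):
    """Port split and uplink bandwidth for one IO/T stage switch."""
    t_ports = int(ports_per_switch * t_port_fraction)
    return {
        "t_ports": t_ports,
        "user_ports": ports_per_switch - t_ports,
        "uplink_mbps": t_ports * link_mbps,
        "max_hops": 3,   # constant regardless of fabric size, per the text
    }


if __name__ == "__main__":
    print(multistage_io_switch())
```

With the defaults, each IO/T switch keeps four user ports and gains 400MBps of bandwidth toward the CC stage, independent of how many IO/T switches are in the fabric.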
OpenBench Labs scenario
- Linux SAN performance and functionality
- QLogic QLA2200 Fibre Channel HBAs
- QLogic SANbox-8 Fibre Channel switch (www.qlogic.com)
- SGI Total Performance 9100 RAID array (www.sgi.com)
- Texas Memory Systems RAM-SAN solid state disk (www.texmemsys.com)
- Exabyte Fibre Channel Mammoth-2 tape drive (www.exabyte.com)
- (Two) Dell PowerEdge 2400 Servers (www.dell.com)
- Winchester Systems FlashDisk RAID controller (www.winsys.com)
- Mission Critical Linux Convolo Cluster 1.2 (www.missioncriticallinux.com)
- Red Hat Linux v7.0 (www.redhat.com)
- OpenBench Labs OBLload v1.0 benchmark
- OpenBench Labs OBLdisk v1.0 benchmark
- OpenBench Labs OBLtape v1.1 benchmark
- Red Hat 7.0 performance in the SAN was on a par with Windows 2000 Server, with the exception of I/O loading.
- SAN software for Linux is often limited to version 6.2 of Red Hat.
- Performance of the RAM disk, when split across a mesh/looped cascade SAN topology, increased over its performance with a single switch.
- With dual paths connecting switches, throughput using two servers scaled linearly.
- Sharing storage devices in the SAN was significantly easier than sharing software, even in a fail-over cluster scenario.