IT managers can achieve reliability, availability, and serviceability with fabric switches or director-class switches.
BY DINO BALAFAS
Enterprise storage requires data that is virtually 100% reliable and quickly retrievable. This article discusses two ways to build storage area networks (SANs) that have reliability, availability, and serviceability (RAS) features.
As stored data grows explosively, organizations are increasingly looking to SANs to offload storage traffic onto a dedicated network that can readily handle storage applications. Early SANs were adopted primarily by industries whose need for speed, mostly for video-editing applications, outweighed the risk of early adoption. Fibre Channel, the dominant technology for SANs, is now mature, however, and more mainstream industries have implemented SANs. Today, conservative industries such as banking, retail, and telecommunications, which cannot afford operational downtime, are looking to Fibre Channel SANs as their storage solution. For these companies, building a reliable SAN that meets their needs cannot be a risky proposition.
Fibre Channel and SANs have worked through the early phases of technology adoption over the past several years, reducing, and in some cases eliminating, problems with loop initialization primitive (LIP) processing and Classes of Service. Even protracted debates on switch interoperability are near conclusion with the adoption of the FC-SW-2 E-port standard, a spec designed to promote interoperability between different vendors' Fibre Channel switches. Of course, the most salient proof that SANs are past the "bleeding edge" stage is that companies such as Sprint, Ericsson, and Disney have adopted SANs. Moreover, all the major suppliers of computing and storage equipment have SAN programs and product suites.
Fibre Channel SANs offer a number of advantages over previous-generation technologies such as SCSI. In short, SANs allow IT administrators to have multiple connections between servers and storage for greater bandwidth between devices as well as redundancy for higher availability, as shown in Figure 1. These connections are not measured in feet, as in the case of SCSI, but in kilometers, and the number of devices that can potentially be connected is in the millions.
Figure 1: Multiple connections between devices provide greater bandwidth and redundancy.
High availability is a key factor in the adoption of SANs in the enterprise. A SAN must deliver RAS features required to achieve 99.999% uptime. It is clear that multiple connections between servers and storage can be created, but this alone may not be enough to guarantee virtually uninterrupted data availability.
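To put the 99.999% figure in concrete terms, the short Python sketch below converts an availability target into an annual downtime budget. The function name and the 365-day year are illustrative assumptions, not part of any SAN product or standard.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime (in minutes) allowed per year for an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At "five nines," the budget works out to a little more than five minutes of downtime per year, which is why redundant connections alone may not be enough.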
Fabrics vs. directors
There are two primary ways to build a high-availability SAN. The first is to build RAS fabric networks that have enough redundancy to statistically achieve 99.999% uptime. The second method is to implement RAS features at the device level in a Fibre Channel switch that has features such as redundant and hot-replaceable components, auto-fail-over, non-disruptive software upgrades, and "phone-home" capability. This type of switch is referred to as a director and typically has 32 or more ports in one device. Criteria such as cost, manageability, performance, and scalability dictate which strategy is better.
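The statistical argument behind a RAS fabric can be sketched with a simple parallel-availability model: if each of n redundant paths is up with probability a, and the paths fail independently, at least one path is up with probability 1 - (1 - a)^n. The independence assumption, the function names, and the sample 99.9% and 99% path figures are illustrative only; real components often share failure modes.

```python
def combined_availability(path_availability: float, redundant_paths: int) -> float:
    """Availability of n independent redundant paths, each available with the same probability."""
    if not 0.0 < path_availability <= 1.0:
        raise ValueError("path_availability must be in (0, 1]")
    if redundant_paths < 1:
        raise ValueError("redundant_paths must be at least 1")
    return 1.0 - (1.0 - path_availability) ** redundant_paths


def paths_needed(path_availability: float, target: float) -> int:
    """Smallest number of independent paths whose combined availability meets the target."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    paths = 1
    while combined_availability(path_availability, paths) < target:
        paths += 1
    return paths


if __name__ == "__main__":
    five_nines = 0.99999
    for single_path in (0.99, 0.999):  # hypothetical per-path availabilities
        print(single_path, "->", paths_needed(single_path, five_nines), "paths")
```

Under these assumptions, two independent 99.9% paths or three independent 99% paths would clear the 99.999% target. A director pursues the same goal inside one chassis by duplicating components instead of paths.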
Figure 2: End users can achieve reliability with a RAS fabric (a) or a director-class switch (b) that has redundant components.
Figure 2a shows a RAS fabric built with eight 16-port switches using a mesh architecture. Virtually all Fibre Channel switch vendors support this architecture. Figure 2b shows a RAS switch, or director. Internally, it is essentially twelve 16-port switches.
Although fabric switches are less expensive to implement than a director with built-in redundancy, their total cost of ownership may be higher.
Hardware and maintenance costs of a RAS fabric are very similar to those of a director switch. However, management costs may be higher with the RAS fabric because of the larger number of discrete components that must be managed. Directors shield administrators from the complexity of inter-switch links via the use of a passive backplane, which interconnects all the discrete hardware components. An administrator only needs to connect a management GUI to one device to check on status or to perform firmware upgrades on any device in the SAN. In the case of the mesh fabric, eight separate Ethernet connections, configuration files, and upgrade tasks must be maintained and monitored to achieve the same result.
Obviously, management costs and calculations for total cost of ownership will vary significantly between organizations; however, there is greater complexity in configuring and maintaining an eight-switch fabric compared to a single director. Additionally, because of the number of connections in the switch fabric, there are more potential connection failures.
Performance and scalability
Several factors determine overall SAN performance, including switch latency, the number of connections to a server or storage device, and the number of available routes through a SAN.
Available routes through a SAN determine the RAS fabric's reliability. Each time a server connects to a storage device, the switch must establish a route through the SAN to exchange data. Theoretically, these routes can go through any number of switches in the SAN. Realistically, it is beneficial to establish the most direct route between switches, reducing the number of hops (the number of times data must go through a switch) through the network and reducing the amount of latency (delay) incurred in a data transfer.
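The effect of topology on hop count can be illustrated with a breadth-first search over a graph of switches. The sketch below follows this article's definition of a hop (each switch that data passes through), so two devices on the same switch are one hop apart. The `full_mesh` and `ring` helpers and the eight-switch ring are hypothetical topologies chosen for comparison, not configurations taken from any vendor.

```python
from collections import deque


def switches_traversed(links, src_switch, dst_switch):
    """Return how many switches the shortest route passes through, or None if unreachable."""
    if src_switch == dst_switch:
        return 1
    visited = {src_switch}
    queue = deque([(src_switch, 1)])  # (switch, switches traversed so far)
    while queue:
        switch, count = queue.popleft()
        for neighbor in links.get(switch, ()):
            if neighbor == dst_switch:
                return count + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, count + 1))
    return None


def full_mesh(switch_count):
    """Every switch has a direct inter-switch link to every other switch."""
    return {s: [t for t in range(switch_count) if t != s] for s in range(switch_count)}


def ring(switch_count):
    """Each switch links only to its two neighbors (hypothetical, for comparison)."""
    return {s: [(s - 1) % switch_count, (s + 1) % switch_count] for s in range(switch_count)}


if __name__ == "__main__":
    print("full mesh:", switches_traversed(full_mesh(8), 0, 5))
    print("ring:     ", switches_traversed(ring(8), 0, 4))
```

In the fully meshed case, any two switches are directly linked, so the worst-case route touches two switches. In the ring, the route between opposite switches touches five, and each extra switch adds latency.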
Figure 3 shows a mesh topology of eight 16-port switches. Data from any one switch can be routed through all the other switches in the network. In this architecture, any server can access any storage device at full bandwidth, up to 100MBps. This is a non-blocking architecture because each 16-port switch uses eight ports to connect to storage and servers, and the other eight ports to connect to the mesh fabric.
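The port arithmetic behind this design can be worked through in a few lines. The sketch below assumes that "mesh" means a full mesh, in which every switch has at least one direct link to every other switch; the function and dictionary key names are illustrative, and how any leftover inter-switch port is used is not specified here.

```python
def mesh_port_budget(switch_count, ports_per_switch, device_ports_per_switch):
    """Summarize device ports and inter-switch links (ISLs) for a full-mesh fabric."""
    isl_ports_per_switch = ports_per_switch - device_ports_per_switch
    min_isl_ports_needed = switch_count - 1  # one link to every other switch
    if isl_ports_per_switch < min_isl_ports_needed:
        raise ValueError("not enough ports left over to build a full mesh")
    return {
        "device_ports_total": switch_count * device_ports_per_switch,
        "isl_ports_per_switch": isl_ports_per_switch,
        # Each link joins two switches, so n switches need n*(n-1)/2 links.
        "minimum_links": switch_count * (switch_count - 1) // 2,
        "spare_isl_ports_per_switch": isl_ports_per_switch - min_isl_ports_needed,
    }


if __name__ == "__main__":
    for name, value in mesh_port_budget(8, 16, 8).items():
        print(f"{name}: {value}")
```

Under these assumptions, the 128 physical ports yield 64 ports for servers and storage, and the mesh needs at least 28 inter-switch links, each of which is another cable to manage.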
This point leads to one of the biggest performance and scalability disadvantages of a RAS fabric design. Adding more ports to the fabric requires additional switches, which cannot be added without reducing the number of interconnects within the mesh fabric. Since there is no longer a one-to-one ratio between device connections and inter-switch connections, data that is being moved through the SAN may have to wait for a route to become available. Because the architecture cannot guarantee an available path at all times, it is called a blocking architecture and can reduce the performance of a SAN. Moreover, as the SAN continues to grow, performance is directly affected in two key ways. First, the connections between switches in the fabric are reduced, lowering the effective bandwidth between switches. Second, the number of hops between storage devices and servers increases, thus exacerbating the effects of latency, as shown in Figure 4.
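A rough way to quantify blocking is the ratio of device ports to inter-switch ports on each switch. The sketch below is a simplified worst-case model: it assumes every device on a switch sends traffic across the fabric at once and that the inter-switch bandwidth is shared evenly. The function names and the 12-to-4 port split are hypothetical examples.

```python
def oversubscription_ratio(device_ports: int, isl_ports: int) -> float:
    """Ratio of device-facing ports to inter-switch ports on one switch."""
    if device_ports < 1 or isl_ports < 1:
        raise ValueError("both port counts must be at least 1")
    return device_ports / isl_ports


def is_non_blocking(device_ports: int, isl_ports: int) -> bool:
    """True when there is at least one inter-switch port per device port."""
    return oversubscription_ratio(device_ports, isl_ports) <= 1.0


def worst_case_share_mbps(port_rate_mbps: float, device_ports: int, isl_ports: int) -> float:
    """Per-device bandwidth across the fabric if every device transmits at once."""
    return port_rate_mbps * min(1.0, isl_ports / device_ports)


if __name__ == "__main__":
    for device_ports, isl_ports in ((8, 8), (12, 4)):
        print(device_ports, isl_ports,
              is_non_blocking(device_ports, isl_ports),
              round(worst_case_share_mbps(100, device_ports, isl_ports), 1))
```

With the original eight-and-eight split, each device keeps the full 100MBps. If four inter-switch ports on a 16-port switch are reassigned to devices, the ratio becomes 3:1, and the worst-case share in this model falls to about a third of the port rate.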
Of course, these effects also occur in a director-based topology but they are less significant. Some director switches support 128 ports in a non-blocking architecture, which means the effective bandwidth through such a switch is 128Gbps, with a maximum hop count of two.
Before building a SAN, you should decide how large the enterprise is likely to grow. Fabrics of 16-port switches can be built incrementally if maximum bandwidth through the SAN is not a primary requirement.
Directors have potential RAS advantages. Their total cost of ownership may be lower, and their performance and scalability benefits may be attractive in some enterprise environments.
Dino Balafas is director of product management for the Switch Products Group at QLogic Corp. (www.qlogic.com) in Eden Prairie, MN.