By John Steczkowski and Jeff Russell
For the past several years, the industry has been working on a successor to the PCI bus. In October 1999, two competing industry efforts, Next Generation I/O and Future I/O, merged, and seven of the industry's leaders (Compaq, Dell, Hewlett-Packard, IBM, Intel, Microsoft, and Sun Microsystems) joined forces to create the InfiniBand Trade Association, chartered to define and support a new channel-based, switched-fabric I/O architecture.
This architecture not only creates many new possibilities for designing future system clusters, but it will also solve a variety of other I/O problems.
Figure 1: The typical server architecture includes a memory controller that connects the CPU, memory, and the PCI bus bridge. The PCI bus provides the connection to one or more adapter cards.
In typical entry-to-midrange servers, the standard I/O architecture uses a PCI bus to communicate with adapter cards, which in turn communicate with storage devices and networks. An adapter card that interfaces to a storage protocol is called a host bus adapter (HBA); a card that interfaces to a LAN (e.g., Ethernet) is called a network interface card (NIC). The majority of storage applications implement parallel SCSI for server-storage interconnects and Fibre Channel for storage area network (SAN) configurations. The basic goal of PCI is to provide a standard, modular interface for adapter cards that allows servers to be connected to external storage and LANs.
As illustrated in Figure 1, the typical server architecture includes a memory controller that connects the CPU, memory, and PCI bus bridge. The PCI bus connects to one or more adapter cards, which consist of a PCI interface and an I/O controller. The dashed line indicates the boundary of the typical server system.
The PCI adapter cards give the system access to various media outside the system chassis. Traditionally, PCI adapter cards are viewed as the "access point" for getting data into or out of the system.
Although the PCI bus has served the industry well for many years, several characteristics limit its performance and make implementation and administration difficult.
Figure 2: With the InfiniBand Architecture, a switched fabric replaces the PCI shared bus. The server interface to the InfiniBand fabric is called a host channel adapter (HCA).
For example, PCI uses a shared-memory architecture that prevents separation of the various I/O controllers' address spaces. Therefore, if an adapter card fails, it can potentially interfere with the memory spaces of the other adapters. Additionally, the shared memory architecture requires the CPU to interact directly with the I/O controller on the adapter card. This requires the CPU to slow down to bus speeds while manipulating data on the I/O controller, affecting performance and exposing the CPU to any failures on the PCI bus.
Adapter cards use direct memory access (DMA) to move blocks of information between system memory and the I/O medium. Each PCI adapter has its own DMA engine. This means each PCI adapter has its own register-based programming interface that complicates device driver development.
Because PCI uses a shared, parallel bus, there are electrical limitations on the speed at which the bus can run, as well as limitations on the number of PCI adapters that can be used at any given bus speed. In the case of PCI-X, for example, there is a limit of one adapter card on the bus at the highest speed. This also limits system design flexibility because the parallel bus can be only several inches long; therefore, the PCI cards must be enclosed in the server.
Because the I/O bridge is an arbiter, system integrity relies on the cooperation of all adapters on the bus. If one card misbehaves or malfunctions, the system cannot diagnose and isolate the failing adapter card, which could cause system failure. Besides fault isolation problems, the fair distribution of bus bandwidth depends on the cooperation of adapter cards instead of a central authority.
PCI's shared-bus architecture has some distinct shortcomings in terms of reliability, availability, and serviceability (RAS).
- Reliability describes the ability of a system or component to perform its required functions under stated conditions for a specified period of time. Reliability is expressed as the likelihood that a system will not fail; thus, comparisons in reliability examine the probability of a failure.
Reliability can be examined at detailed and system levels. The PCI bus consists of more than 100 signals; if any one of those signals fails, the bus will not operate properly. PCI does not have any provisions for isolating individual cards so the system can continue to operate without a particular card. And there is no way to create a redundant connection between the PCI bus bridge and the adapter card because of the shared nature of the bus.
Although the likelihood that a single PCI bus signal will fail is rather low, the probability increases if an adapter card is attached to a bus using a circuit board edge connector without a positive seating mechanism. The likelihood of a PCI bus failure is proportional to the number of bus signals times the probability of a single signal failing.
Examining reliability at a system level, the shared nature of the PCI bus means that the likelihood of a PCI bus failure is also proportional to the number of installed PCI adapter cards. Thus, reliability of a system is decreased due to the shared bus and to the large number of signals on the bus.
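The proportionality argument above can be sketched numerically. The per-signal failure probability and the signal counts below are assumed placeholders for illustration, not figures from either specification:

```python
# Illustrative reliability model (an assumption for this sketch, not from
# any spec): treat each bus signal on each installed adapter as an
# independent potential failure point.
def p_bus_failure(n_signals, n_adapters, p_signal=1e-6):
    """Probability that at least one signal in the shared bus fails."""
    failure_points = n_signals * n_adapters
    return 1 - (1 - p_signal) ** failure_points

pci = p_bus_failure(n_signals=100, n_adapters=4)  # shared parallel bus
iba = p_bus_failure(n_signals=4, n_adapters=1)    # one point-to-point link
print(f"PCI shared bus: {pci:.2e}")
print(f"IBA link:       {iba:.2e}")
```

For small per-signal probabilities the result is roughly proportional to signals times adapters, which is the multiplier the article describes: adding adapters to a shared bus multiplies the bus's exposure, while each IBA link's exposure stays fixed at a few signals.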
- Availability describes the degree to which a system or component is operational or accessible when required for use. Availability is generally improved through redundancy; if a system element fails, a second system element takes over. In the case of a parallel bus architecture, if one adapter fails, the whole bus fails. The shared bus topology makes it more difficult to implement redundant connections from the server to outside media.
- Serviceability describes the ability of a component to be installed, exchanged, or removed from a system.
PCI is not optimized for serviceability. Some systems have provisions for hot-pluggable PCI cards, but these extensions to the standards are weak from an enterprise-availability standpoint. For these reasons, most IT professionals do not use hot-pluggable PCI cards. Because the adapter cards are internal to the server, the bus must be powered down in order to swap adapters, which often requires the server to be shut down.
In summary, the shared nature of the PCI bus limits reliability, availability, and serviceability. As mission-critical applications are implemented in entry and midrange servers, a higher level of RAS will be required.
The purpose of the InfiniBand Architecture (IBA) is the same as PCI: to connect servers (CPU and memory) to external storage and LANs. In this respect, IBA simply replaces the PCI architecture. However, IBA does have a number of system enhancements. As with any new technology, there will be a migration to IBA as the system I/O interconnect, with PCI-based I/O and IBA-based I/O coexisting for a period of time.
As shown in Figure 2, the key difference between IBA and PCI is the architecture: a switched fabric versus a shared, parallel bus. The server interface to the InfiniBand fabric is called a host channel adapter (HCA). The HCA manages interactions between the server and external resources, including other HCAs, network devices, and peripherals.
InfiniBand adapters are called I/O units, and their basic functionality is similar to that of a PCI adapter: connecting servers to storage, network resources, and other peripherals. However, unlike PCI adapters, IBA I/O units can be shared by more than one server.
In its simplest form, each I/O unit consists of an I/O controller and a target channel adapter (TCA). The end-node access points are called "channel adapters" because the IBA hardware supports high-performance communication channels between the CPU and the I/O controllers. Common IBA I/O units will be IBA-Fibre Channel, IBA-SCSI, and IBA-Ethernet.
The switched fabric uses point-to-point links and switches to connect multiple servers and multiple I/O units in a fabric configuration. The IBA fabric provides the media for server clustering and the capability to connect multiple servers to storage devices or LANs through a single I/O controller.
The InfiniBand fabric is controlled by a fabric manager that discovers the physical topology of the fabric, assigns local identifiers, and establishes routing between end nodes. Furthermore, the fabric manager controls changes, such as adding or removing nodes.
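The discovery-and-routing role of the fabric manager can be sketched as a sweep over the link topology. The node names and this simplified breadth-first sweep are hypothetical; the real subnet manager works through management datagrams defined by the IBA specification:

```python
from collections import deque

# Hypothetical fabric topology: node name -> directly linked neighbors.
links = {
    "hca1":     ["switch1"],
    "hca2":     ["switch1"],
    "switch1":  ["hca1", "hca2", "io_unit1"],
    "io_unit1": ["switch1"],
}

def discover(root):
    """Sweep the fabric breadth-first from the manager's attachment
    point, assigning a local identifier (LID) to each node found and
    recording a next-hop entry used to route toward it."""
    lids, next_hop = {root: 1}, {}
    pending = deque([root])
    while pending:
        node = pending.popleft()
        for neighbor in links[node]:
            if neighbor not in lids:
                lids[neighbor] = len(lids) + 1
                next_hop[neighbor] = node  # neighbor is reached via node
                pending.append(neighbor)
    return lids, next_hop

lids, next_hop = discover("switch1")
print(lids)  # every node ends up with a unique local identifier
```

Re-running the sweep after a node is added or removed would rebuild the tables, which is the "controls changes" responsibility described above.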
The switched-fabric architecture allows for the modularization of servers. Instead of housing the I/O units inside the server, the server "box" can consist of one or more CPUs, memory, and an HCA that connects to external IBA switches and I/O units. In this configuration, servers can be replaced or upgraded as necessary, without disturbing the peripheral configurations. As more connections are required, additional switches can be added to build larger fabrics.
InfiniBand implements a channel communication model between the server CPU and I/O controllers. Instead of a shared-memory architecture, the CPU and I/O controller are de-coupled. IBA implements a reliable delivery mechanism so the CPU and I/O controller can exchange messages. This communication channel simplifies device-driver design.
The channel-based communication allows direct, protected access to host memory. This means that a misbehaving I/O unit cannot corrupt system memory and cause a system crash. Efficient HCA implementation ensures that the CPU is de-coupled from the details of I/O communication, resulting in improved performance and fault isolation in the event of an I/O unit failure.
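The de-coupling described above can be sketched with two message queues standing in for a send channel and a completion channel. This loosely mirrors IBA's work-request/completion pattern; the names and the threading model are illustrative assumptions, not the IBA verbs interface:

```python
import queue
import threading

# Sketch of channel-style messaging between a host CPU and an I/O
# controller. The CPU posts messages and continues; it never touches
# controller registers or waits at bus speed.
send_q, completion_q = queue.Queue(), queue.Queue()

def io_controller():
    """Consumes work requests at its own pace and posts completions."""
    while True:
        request = send_q.get()
        if request is None:  # sentinel to end the sketch
            break
        completion_q.put(("done", request))

worker = threading.Thread(target=io_controller)
worker.start()
send_q.put("read block 7")  # CPU posts a message and moves on
send_q.put(None)
worker.join()
result = completion_q.get()
print(result)
```

Because the only shared state is the pair of queues, a misbehaving controller can stall its own channel but cannot scribble over host memory, which is the fault-isolation property the article attributes to the channel model.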
The IBA specifies a means to construct redundant physical paths between end nodes and switches. Servers and I/O units can have two or more ports for fabric connection that allow redundant configurations to be constructed. The fabric routing tables are aware of the redundant connections and can manage fail-over should a fault occur.
Clustering can be implemented as an evolution of current technology. With IBA links available, vendors will not need to implement proprietary physical clustering mechanisms.
The IBA fabric infrastructure will provide scalable performance. In PCI, the bus bandwidth of approximately 500MBps is shared by all of the devices on the PCI bus. With the IBA fabric, each link to a device has a minimum connection bandwidth of 500MBps, scaling to 6GBps for a single link.
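The scaling difference is simple arithmetic: shared bandwidth is divided among devices, while per-link bandwidth is not. The figures below are the approximate numbers quoted in the article, used only for illustration:

```python
# Back-of-the-envelope comparison using the article's round figures.
PCI_SHARED_MBPS = 500  # one bus, divided among all attached adapters
IBA_LINK_MBPS = 500    # minimum per link (article cites up to 6000)

def per_device_bandwidth(n_devices):
    """Approximate bandwidth available to each device."""
    pci = PCI_SHARED_MBPS / n_devices  # shrinks as devices are added
    iba = IBA_LINK_MBPS                # constant: each device owns a link
    return pci, iba

for n in (1, 4, 8):
    pci, iba = per_device_bandwidth(n)
    print(f"{n} devices: PCI {pci:.0f} MBps each, IBA {iba} MBps each")
```

With eight devices, each PCI adapter sees roughly a sixteenth of what an IBA device sees on its own minimum-rate link, and the fabric's aggregate bandwidth grows with every link added.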
The IBA packet format defines a global routing header. This will enable IBA fabrics to be connected across wide-area networks through the use of edge routers on each fabric. Basic IBA links can span many meters, compared to PCI's several inches, providing flexibility in the placement of servers and I/O controllers in rack-mounted configurations.
One compelling reason to use IBA for I/O connections is the RAS improvement, compared to that of the PCI bus. At a detailed level, the numerous PCI signals and the printed circuit board edge connection along with the shared bus contribute to the probability of failure.
- To address reliability issues, IBA uses point-to-point links with fewer signals and a positive seating mechanism in the connector. The basic link connection is implemented with four signal wires, resulting in a smaller link failure multiplier, compared to more than 100 signals on a PCI bus.
Individual signal failure is reduced by eliminating the card edge connector. The InfiniBand connectors provide a positive seating mechanism and prevent de-rating issues due to wear on the edge connectors. The IBA standard provides for multiple ports for each I/O unit, which enhances the reliability of the system by providing multiple physical routes to a single I/O unit.
- Availability is enhanced through redundancy and a standard fabric management technique. Point-to-point links allow problems to be diagnosed and new routes to be established should a device or link fail. All IBA devices will have a common way of being managed, reducing both the learning curve and the amount of time necessary to remedy problems on the fabric.
- The IBA I/O units are designed from the ground up to be hot-pluggable, enhancing serviceability. Some I/O units may be standalone devices. In these cases, replacement may be as simple as unplugging the IBA connector from one device and plugging in a new device.
In other cases, the I/O units may be in an adapter-card format that can be easily installed or removed while the server is powered on. The IBA standard describes the adapter format, the physical signaling, and the management interface so that I/O units can be hot swapped.
The IBA fabric scales with the data center. Servers can be added as increases in computing power are needed, and I/O units can be added as increases in network and storage bandwidth are needed. The IBA fabric can be used for clustering servers and as a means to share I/O units.
The IBA specification describes two basic form factors for I/O units: a standalone unit or an adapter card. Standalone units will have one or more external IBA ports and one or more media-specific ports. This implementation option will be more typical of advanced devices that provide functionality beyond a typical HBA or NIC.
I/O unit adapter cards, on the other hand, can be inserted into IBA-compliant chassis. The chassis implements a backplane that functions as a switch to interconnect the I/O unit adapter cards to the server.
The InfiniBand Architecture opens up many possibilities for advanced applications. Today, a server connects to many LAN segments via a network router and one or more NICs in the server. IBA enables a high-speed, low-latency connection between the server and LANs by embedding a TCA inside the network router, eliminating the need for a NIC in the server. This reduces the number of devices required to connect a server to its network media and consolidates this connection within the network router device.
Figure 3: An InfiniBand environment includes host channel adapters (HCA), modifications to the operating system (OS) and hardware abstraction layer (HAL), routers, switches, and I/O units.
A similar advanced application that builds on the open SAN philosophy is connecting multiple servers to multiple SANs through a single storage router. By embedding the function of multiple HBAs in the storage router, fewer devices are required, while providing the functionality of routing between multiple SANs. The storage router serves as a central point from which to manage the SAN and to run applications like server-less backup and data mirroring.
Just as connecting a LAN and SAN to a server via a router provides a benefit, other applications may benefit from directly attaching RAID subsystems using IBA. With this approach, however, a proprietary device driver will be required for each vendor's RAID subsystem.

Figure 3 shows a variety of IBA products classified into families. Host-connectivity products include the HCA and accompanying modifications to the server operating system and hardware abstraction layer (HAL).
The basic infrastructure products include switches to build a fabric, an InfiniBand router to provide inter-fabric connectivity, and an I/O chassis that provides connectivity for I/O unit adapters. The I/O unit adapters interface storage, network, and other media to multiple servers through a server-resident device driver (DD in Figure 3). The LAN router, storage router, and RAID subsystem products are also shown in Figure 3.
John Steczkowski is director of server I/O routing, and Jeff Russell is server I/O architect, at Crossroads Systems (www.crossroads.com) in Austin, TX.