RapidIO is a packet-switched intra-system interconnect that is a potential successor to the PCI and PCI-X buses.
BY DAN BOUVIER
The capabilities and throughput requirements of storage devices have been fueled by the growth of content and the need to move that content over the Internet. As a result, all aspects of storage systems are moving toward increased performance and functionality. These improvements include higher disk spindle speeds and faster storage interconnects. System-level interface performance also is increasing with new connection technologies such as Gigabit Ethernet and InfiniBand.
However, there is a bottleneck looming that will severely limit even those storage systems based on the latest technology. As reliability and speed requirements escalate, standard device-level interconnects such as PCI and PCI-X are not necessarily the best choice. Future device-level interconnects must scale in performance to meet the extensive throughput demands that will be made on system-level connections. To protect long-term hardware and software investments, the industry needs to standardize on a new technology that offers the flexibility to bridge existing systems while taking the industry into the future.
RapidIO is a packet-switched, point-to-point intra-system interconnect architecture that was developed to surpass the performance of the PCI bus and provide new features without sacrificing the capabilities that made PCI successful. The RapidIO 1.1 specification defines a switch fabric-based control-plane interconnect that offers significantly greater bandwidth, more scalability, and higher reliability than existing bus-based interconnects. It is also compatible with existing architectures and is an open standard.
Initially developed by Motorola and Mercury Computer Systems, RapidIO was turned over to the industry early in the engineering process. Today, the RapidIO Trade Association has more than 45 member companies, most of which spent more than a year contributing to the first version of the specification. The organization's steering committee includes Alcatel, Cisco, EMC, Ericsson, IBM, Lucent, Mercury Computer Systems, Motorola, and Nortel.
A typical storage system includes multiple disk interface controllers, XOR controllers, memory, processors, interface devices, and SCSI or Fibre Channel disk controllers. Data can be striped not only across multiple disks on a single link, but also across multiple links or initiators. The benefits of this approach are increased data throughput and the creation of data redundancy to protect against failed disks. A storage system also typically contains a large memory pool to act as a data buffer or cache between the host system and the disk array. One or more processors manage the storage system, set up direct memory access (DMA) activity, and communicate with the host system. The RapidIO interconnect can be used with any processor architecture, including the PowerPC processor. Devices in these systems have traditionally been connected using memory-mapped, software-transparent interconnects such as proprietary buses or the PCI bus, as shown in Figure 1.
Figure 1: A typical storage system connects devices using memory-mapped interconnects such as the PCI bus.
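The striping and XOR-based redundancy mentioned above can be illustrated with a minimal sketch. It assumes a single stripe with one parity block; the function names, block size, and disk count are hypothetical and chosen only for illustration, not taken from any particular controller.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data: bytes, data_disks: int, block_size: int) -> List[bytes]:
    """Split one stripe across data_disks blocks and append an XOR parity block."""
    capacity = data_disks * block_size
    if len(data) > capacity:
        raise ValueError("data does not fit in a single stripe")
    stripe = data.ljust(capacity, b"\x00")  # pad a short stripe with zeros
    blocks = [stripe[i * block_size:(i + 1) * block_size] for i in range(data_disks)]
    return blocks + [xor_blocks(blocks)]


def rebuild(blocks: List[Optional[bytes]]) -> bytes:
    """Recover the single missing block (marked None) from the survivors."""
    if sum(block is None for block in blocks) != 1:
        raise ValueError("exactly one block must be missing")
    return xor_blocks([block for block in blocks if block is not None])
```

Because the parity block is the XOR of all data blocks, XOR-ing the surviving blocks with the parity reproduces whichever single block was lost. This is the work an XOR controller offloads from the processor.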
With increasing demands on the storage system to process more transactions faster and with greater reliability, the intra-system interconnect must evolve in throughput and capabilities. The PCI-X extensions to the PCI bus addressed some of the short-term needs of storage systems but generally do not scale well to the next generation.
InfiniBand has been promoted as a PCI replacement, but primarily for extending the system beyond the box. The complexities of the protocol and associated overhead may make it a less satisfactory choice for communications within the storage system. The RapidIO interconnect has the attributes and scalability to make it an alternative for next-generation storage systems.
A memory-mapped interconnect typically is used within the storage system because it is the simplest and most expedient way to deal with the protocol interworking required to move data between one interface protocol and another (e.g., between Fibre Channel and InfiniBand).
Measuring a storage system
A storage system is often measured by how many host system requests it can service in a period of time. A simple storage system may handle requests from only one host computer. More-complex storage systems such as those found in network-attached storage (NAS) or system area networks must handle simultaneous requests from multiple hosts. In this clustered system approach, the host systems and storage systems communicate with each other using messages and remote DMA. Messages convey information requested from the operating system or application software running on the host. A processor in the storage system takes the requests, associates them with the system under its control, and issues the appropriate commands to the devices in the storage system to carry out the transactions using DMA descriptors. This process is most efficiently carried out using a memory-mapped, load-store programming model.
The variety and availability of peripheral devices, simple software model, and interoperability made PCI a natural choice for storage systems. However, as external interface throughput demands have increased, so has the demand on PCI performance. For example, Fibre Channel is now transitioning from 1Gbps to 2Gbps and in the future will move to 10Gbps. Aggravating the problem is the desire to increase port density in a system. The addition of the PCI-X specification addressed some of the performance issues by not only increasing the data rate but also making protocol modifications to get additional transaction throughput efficiency. Unfortunately, this comes at the cost of reducing the number of devices connected per link. PCI-X pushes PCI to the practical limits in frequency, bus width, and scalability.
The InfiniBand interconnect was designed to enhance intersystem throughput between smaller systems clustered to form larger enterprise systems. In contrast to the traditional LAN interface, InfiniBand allows connections between the host system and the peripheral system with a simplified software protocol stack, reducing software overhead and burden on the host computer.
Each InfiniBand system is responsible for local management and data interchange within the system and communicates through channels established in host channel adapters (HCAs) and target channel adapters (TCAs). While InfiniBand increases throughput between systems, the hardware and software complexity may be overkill for device-to-device communication within the system.
RapidIO exceeds PCI performance while maintaining important features, including the load-store/read-write, memory-mapped programming model, low latency, and the ability to do peer-to-peer transactions. New features include robust error-detection-and-recovery capabilities and multiple prioritized transaction flows. Message passing and globally shared distributed memory programming models are also directly supported in the RapidIO architecture, allowing microprocessors and intelligent subsystems to be directly connected.
Figure 2: This RapidIO architecture example includes Fibre Channel controllers, each with an associated processor and memory, connected through a non-blocking RapidIO switch.
The RapidIO interconnect architecture can be applied to a storage system as shown in Figure 2. The example includes Fibre Channel controllers, each with an associated processor and memory, connected through a non-blocking RapidIO switch device. The processors set up and maintain descriptors for the Fibre Channel controllers with the appropriate instructions of which data must be moved to or from the associated disks. Since RapidIO allows direct peer-to-peer communications, each Fibre Channel controller can access descriptors in different memory without the need for arbitration or the possibility of blocking. These concurrent operations lead to increased system throughput.
Most enterprise storage applications require dependable throughput with "five-nines" (99.999%) reliability. Errors cannot go undetected and should be recoverable. PCI error-detection is weak: some of the critical control signals are not protected, and error-recovery is not addressed. The PCI transaction-ordering rules are also relatively restrictive, a limitation that the PCI-X extensions only partially address. There are no provisions for independent identifiable transaction flows on the bus. The simple arbitration scheme also makes achieving deterministic throughput very difficult.
In a RapidIO interconnect architecture, the system topology is not limited to a daisy chain or a strict hierarchy but can be anything appropriate for a particular application. Switches can be cascaded to allow tightly coupled systems with topologies of up to 256 devices and, using the extended transport addresses, more than 64,000 devices. Individual devices can be accessed through address offsets, allowing a RapidIO system to be completely memory mapped.
RapidIO uses unique device identifiers to route packets through the switches, which provides a number of useful features. For example, directly routing packets to the final destination means that a RapidIO system can easily support a distributed system such as a storage system, without involving software drivers in generating transactions between devices. Since unrelated packets can also be simultaneously transmitted on each RapidIO link, overall system throughput is far greater than that of any individual device, peaking at the sum of all devices in the system.
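A minimal sketch can make the device-identifier routing idea concrete. It assumes each switch holds a table that maps a destination device ID to an output port. The `Switch` class, the port numbers, and the table layout are hypothetical and not taken from the RapidIO specification. The 8-bit and 16-bit ID widths are likewise assumptions, chosen because they match the 256-device and 64,000-plus-device figures above.

```python
from typing import Dict, List, Optional


class Switch:
    """Toy switch that forwards packets by destination device ID (hypothetical model)."""

    def __init__(self, name: str, id_bits: int = 8) -> None:
        self.name = name
        # 8-bit IDs address 256 devices; 16-bit IDs address 65,536.
        self.max_devices = 1 << id_bits
        self.routes: Dict[int, int] = {}      # destination ID -> output port
        self.links: Dict[int, "Switch"] = {}  # output port -> cascaded switch, if any

    def add_route(self, dest_id: int, port: int,
                  next_switch: Optional["Switch"] = None) -> None:
        if not 0 <= dest_id < self.max_devices:
            raise ValueError(f"device ID {dest_id} does not fit the ID width")
        self.routes[dest_id] = port
        if next_switch is not None:
            self.links[port] = next_switch

    def forward(self, dest_id: int) -> List[str]:
        """Return the hops taken to reach dest_id as 'switch:port' strings."""
        port = self.routes[dest_id]
        hop = f"{self.name}:{port}"
        next_switch = self.links.get(port)
        if next_switch is None:
            return [hop]  # this port connects directly to the endpoint
        return [hop] + next_switch.forward(dest_id)


# Two cascaded switches: device 0x10 sits behind switch B.
sw_a, sw_b = Switch("A"), Switch("B")
sw_a.add_route(0x01, port=0)
sw_a.add_route(0x10, port=3, next_switch=sw_b)
sw_b.add_route(0x10, port=1)
print(sw_a.forward(0x10))  # ['A:3', 'B:1']
```

Each switch needs only the destination ID carried in the packet to choose an output port, so no software driver is involved in steering traffic between devices.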
The storage system interconnect must handle multiple data types, each of a varying delivery priority. Typical data types might include DMA descriptor information, DMA data, and interprocessor messaging or interrupts. RapidIO packets contain a priority field that RapidIO switches use to prioritize packet routing and transmission. This priority information is used to guarantee end-to-end packet delivery ordering and avoid transaction-dependent deadlocks, while reserving bandwidth for critical data traffic.
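The effect of a priority field at a switch output port can be sketched as follows. This is an illustrative model only: the `OutputPort` class, the numeric priority values, and the packet labels are assumptions, and the sketch does not model deadlock avoidance or bandwidth reservation. It shows one property: higher-priority packets leave first, while packets of equal priority keep their arrival order.

```python
import heapq
from itertools import count
from typing import List, Tuple


class OutputPort:
    """Toy output queue: highest priority first, arrival order within a priority."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._arrival = count()  # monotonically increasing arrival stamp

    def enqueue(self, priority: int, packet: str) -> None:
        # heapq pops the smallest tuple, so negate priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), packet))

    def transmit(self) -> str:
        _, _, packet = heapq.heappop(self._heap)
        return packet

    def __len__(self) -> int:
        return len(self._heap)


port = OutputPort()
port.enqueue(0, "dma-data-1")
port.enqueue(2, "descriptor")
port.enqueue(0, "dma-data-2")
port.enqueue(3, "interrupt")
order = [port.transmit() for _ in range(len(port))]
print(order)  # ['interrupt', 'descriptor', 'dma-data-1', 'dma-data-2']
```

The arrival stamp is what keeps two DMA data packets of the same priority from overtaking each other.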
Packet header information and link management overhead are balanced against the data payload sizes to provide greater than 90% efficiency with the modestly sized maximum 256-byte data payloads. The mix of small and large payloads is useful in the storage system for moving DMA streams or writing to registers in a device controller. Packet overhead is roughly half that of InfiniBand. This balance reduces the necessary packet buffering in a port implementation, while minimizing read-access latency due to transmission conflicts at an output port.
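The efficiency figure can be checked with simple arithmetic: efficiency is the payload divided by the payload plus the per-packet overhead. The sketch below uses hypothetical function names, and the 20-byte overhead is an illustrative assumption rather than a figure from the RapidIO specification. It also works backward from the 90% claim to the overhead budget that the claim implies.

```python
def efficiency(payload_bytes: int, overhead_bytes: int) -> float:
    """Fraction of transmitted bytes that carry payload."""
    return payload_bytes / (payload_bytes + overhead_bytes)


def max_overhead(payload_bytes: int, target: float) -> float:
    """Largest per-packet overhead (bytes) that still meets the target efficiency."""
    return payload_bytes * (1.0 - target) / target


# For a 256-byte payload to exceed 90% efficiency, the combined header and
# link-management overhead must stay below roughly 28.4 bytes per packet.
print(round(max_overhead(256, 0.90), 1))

# With an assumed (hypothetical) 20 bytes of overhead per packet:
print(round(efficiency(256, 20), 3))  # large DMA transfer
print(round(efficiency(8, 20), 3))    # small register write
```

The same overhead that is nearly invisible on a full 256-byte DMA packet dominates a small register write, which is why fixed per-packet overhead matters for a traffic mix of large and small payloads.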
Reliability and robustness against data errors are requirements for storage systems. System interconnects, such as Ethernet and InfiniBand, typically use parity or CRC mechanisms to detect failed bits, usually resorting to software mechanisms for error-recovery. Software-recovery places a complexity burden on system designers and adds latency. This is usually not a problem for system-to-system interfaces, but within the system, where real-time response is critical, software error-recovery is unacceptable.
RapidIO takes a different approach to error-detection and recovery, using hardware link-based error-recovery. Packets transmitted between adjacent devices are protected with a 16-bit CRC and packet sequence identifiers. If a transmission error is detected by an adjacent receiving device, the sending device is informed of the error and the packet is automatically re-transmitted by the sender, without the intervention of software or notification of the end-point device that originally generated the packet. This mechanism allows every link connection in a RapidIO system to independently survive all single-bit, and many multiple-bit, transient transmission errors.
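The logic of link-level detection and retransmission can be modeled in software, although in RapidIO it is carried out in hardware. The sketch below is an illustration under stated assumptions. The frame layout (a one-byte sequence identifier, the payload, then a 16-bit CRC), the choice of the CCITT polynomial 0x1021 with an initial value of 0xFFFF, the retry limit, and all names are hypothetical and are not taken from the RapidIO specification.

```python
from typing import Callable


def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16 with polynomial 0x1021 (assumed for illustration)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_frame(seq: int, payload: bytes) -> bytes:
    """Hypothetical frame: sequence byte + payload + big-endian CRC-16."""
    body = bytes([seq & 0xFF]) + payload
    return body + crc16_ccitt(body).to_bytes(2, "big")


def accept(frame: bytes, expected_seq: int) -> bool:
    """Receiver check: CRC must match and the sequence ID must be the expected one."""
    body, received_crc = frame[:-2], int.from_bytes(frame[-2:], "big")
    return crc16_ccitt(body) == received_crc and body[0] == (expected_seq & 0xFF)


def send_with_retry(seq: int, payload: bytes,
                    channel: Callable[[bytes], bytes], max_attempts: int = 4) -> int:
    """Resend the same frame until the neighbor accepts it; return attempts used."""
    frame = make_frame(seq, payload)
    for attempt in range(1, max_attempts + 1):
        if accept(channel(frame), seq):
            return attempt
    raise RuntimeError("link could not deliver the packet")


class FlakyChannel:
    """Test channel that flips one bit in the first n frames it carries."""

    def __init__(self, corrupt_first_n: int) -> None:
        self.remaining = corrupt_first_n

    def __call__(self, frame: bytes) -> bytes:
        if self.remaining > 0:
            self.remaining -= 1
            damaged = bytearray(frame)
            damaged[1] ^= 0x01  # single-bit transient error
            return bytes(damaged)
        return frame


print(send_with_retry(5, b"descriptor", FlakyChannel(1)))  # 2
```

The retry happens entirely between the two ends of one link, so the endpoint that originated the packet never learns that a transient error occurred.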
RapidIO offers a storage system device interconnect with the same features that made PCI an attractive solution, while providing performance and capabilities surpassing those provided by PCI. The message passing and globally shared memory programming models merge the I/O portion of the system with the traditional microprocessor bus, while the error-detection-and-recovery mechanisms provide a foundation for building reliable systems.
More information, including a RapidIO architecture overview white paper and the RapidIO version 1.1 specification, can be obtained from the RapidIO Trade Association Website at www.rapidio.org.
Dan Bouvier is the RapidIO technical working group chair and PowerPC architecture manager in Motorola's Semiconductor Product Sector, in Austin, TX.