Q: In a previous column, you discussed the challenges of providing storage to high-performance computing (HPC) clusters. How does InfiniBand enable storage for HPC clusters?
InfiniBand has tremendous promise for enabling SAN, NAS, and hybrid storage solutions in HPC clusters.
Let me start with a crash course. InfiniBand began as an initiative to replace PCI, which is based on a parallel arbitrated bus, kind of like parallel SCSI. InfiniBand was to do to PCI what Fibre Channel did for SCSI. By ditching the arbitrated bus and moving to a serial interface that communicates through a switched fabric, InfiniBand allows multiple devices to communicate with each other through a multi-purpose I/O system.
The secret sauce in InfiniBand is a specification called RDMA (remote direct memory access). DMA enables I/O cards (like SCSI host bus adapters or Ethernet network interface cards) to put data directly into the computer's memory without bothering the host CPU. This streamlines the I/O process and saves CPU resources. RDMA is a twist on this concept: It allows one host to put data directly into another host's memory. The effect is the same as with DMA, but across hosts: lower latency and less CPU time spent on I/O.
In addition to low latency, InfiniBand offers extremely high bandwidth. Today, most InfiniBand products are described as 4x, which translates to 10Gbps. That compares to Ethernet at 1Gbps and Fibre Channel at 2Gbps. Moreover, 12x InfiniBand (30Gbps) is shipping today for connections between switches. When you combine the two (low latency and high bandwidth), you have a very formidable interface.
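The 4x and 12x figures are simple lane arithmetic, sketched below in Python. The per-lane rate of 2.5Gbps follows from the numbers above (10Gbps over four lanes). The 8b/10b line encoding, which leaves 80% of the signaling rate for data, is not mentioned in this column; it is added here as background on this generation of InfiniBand links. The function names are made up for illustration.

```python
# Assumption: single-data-rate InfiniBand, 2.5 Gbps of signaling per lane
# (implied by the column's figures: 4x = 10 Gbps, 12x = 30 Gbps).
LANE_SIGNALING_GBPS = 2.5

# Assumption (not from the column): 8b/10b encoding carries 8 data bits
# in every 10 bits on the wire.
ENCODING_EFFICIENCY = 8 / 10


def signaling_gbps(lanes: int) -> float:
    """Raw signaling rate of a link with the given number of lanes."""
    return lanes * LANE_SIGNALING_GBPS


def data_gbps(lanes: int) -> float:
    """Rate left for data once 8b/10b encoding overhead is removed."""
    return signaling_gbps(lanes) * ENCODING_EFFICIENCY


if __name__ == "__main__":
    for lanes in (1, 4, 12):
        print(f"{lanes}x: {signaling_gbps(lanes)} Gbps signaling, "
              f"{data_gbps(lanes)} Gbps data")
```

Under these assumptions a 4x link moves about 8Gbps of actual data, or roughly 1GBps, before any protocol overhead.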
Today, InfiniBand systems interface with PCI-X or PCI Express through I/O cards called host channel adapters (HCAs). You can then run a variety of protocols, including TCP and SCSI as well as MPI, which is a common protocol in HPC clusters. With bridging hardware, you can connect InfiniBand switches to Ethernet and Fibre Channel devices and switches.
In effect, InfiniBand gives you a big fat pipe (600MBps in the case of PCI Express) that you can carve into a variety of different virtual interfaces.
It is very common in HPC clusters to use proprietary low-latency, high-bandwidth interfaces to run the MPI protocol. With InfiniBand, you get this essential functionality in a much more versatile interface, one that can cover your Ethernet and Fibre Channel needs at the same time.
Depending on configurations, InfiniBand costs about the same as the proprietary interfaces that dominate the HPC market. Purchase some relatively inexpensive bridging modules and you get Ethernet and Fibre Channel effectively thrown in for free.
The SAN approach: Most big clusters don't use Fibre Channel on each node in the cluster because of cost and complexity issues or because they cannot give up the I/O slot for the Fibre Channel HBA. InfiniBand addresses these limitations. If you buy InfiniBand for your high-bandwidth, low-latency interface, you effectively get Fibre Channel for free. All you have to do is load the right drivers, plug in your storage devices, and, voilà, you have a SAN!
The NAS approach: One of the protocol layers of InfiniBand is SDP (sockets direct protocol), which allows any sockets application to take advantage of the speed and low latency of InfiniBand. Thus, you could use InfiniBand to connect to a file server or NAS device over NFS, with much less overhead than you would experience with NFS over TCP/IP. An even better approach, if supported by your operating system and application, would be to use DAFS (direct access file system), which is a high-performance file system that maps directly to the InfiniBand API.
Enabling parallel file servers: In my May 2004 column (see "What are the options for cluster-based file sharing?", p. 14), I talked about having multiple file servers share data over a SAN, thus allowing them to export the same set of data. The result is a parallel or multi-head file system that allows for scalable NAS performance. Most of the architectures of parallel file systems would benefit from the low latency of InfiniBand, and the most powerful of these file systems could take advantage of the additional bandwidth provided by InfiniBand.
SAN file systems: The dream in file sharing would be for every node on the cluster to be running a SAN file system, such that each node accessed data directly over the SAN while handling metadata transactions over Ethernet. This technology is available in software, but the connectivity costs have been prohibitive. As I described above, InfiniBand allows a single purchase to cover Fibre Channel, Ethernet, and the high-speed interconnect required for communication between nodes on the cluster. Suddenly, the infrastructure for a full-blown SAN file system is affordable.
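To make the SDP point from the NAS approach concrete, here is a minimal sketch in Python of the kind of ordinary sockets code SDP is meant to carry unchanged. It is an illustration only: it runs over plain TCP on the loopback interface, it contains nothing InfiniBand-specific, and how SDP is actually switched on is left to your operating system and InfiniBand software stack. The function names are hypothetical.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one connection and echo back everything it sends."""
    conn, _addr = listener.accept()
    with conn:
        while True:
            chunk = conn.recv(4096)
            if not chunk:          # client finished sending
                break
            conn.sendall(chunk)


def echo_roundtrip(payload: bytes) -> bytes:
    """Send payload to a local echo server and return what comes back."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    server = threading.Thread(target=serve_once, args=(listener,), daemon=True)
    server.start()

    received = bytearray()
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(payload)
        client.shutdown(socket.SHUT_WR)   # signal end of data to the server
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            received.extend(chunk)

    server.join(timeout=5)
    listener.close()
    return bytes(received)


if __name__ == "__main__":
    print(echo_roundtrip(b"block of file data"))
```

The application only ever sees connect, send, and receive calls. That is why a sockets-based service such as NFS can, in principle, move to a faster transport without being rewritten.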
Perhaps the last thing to mention is that an even lower-cost solution is coming to market that may offer many of the benefits of InfiniBand while leveraging your existing investment in and familiarity with Ethernet. The RDMA concept that enables InfiniBand has been spun off as a separate specification. Now you can buy RNICs, which are Ethernet cards enabled with RDMA. While RNICs have more latency than InfiniBand and are limited by the bandwidth of Ethernet, they show great promise for the future.
I expect InfiniBand and perhaps RNICs to become very popular technologies in clusters, largely because they enable high-performance storage applications in addition to providing the low-latency connectivity needed for the cluster. The technology is available today, and the price is right; it's just a matter of end users putting these solutions into production and getting comfortable with them.
Jacob Farmer is chief technology officer at Cambridge Computer (www.cambridgecomputer.com) in Waltham, MA. He can be contacted at firstname.lastname@example.org.