What is the best way to provide file sharing for a Linux-based cluster that will start with 1,000 nodes and grow from there?
You have hit upon one of the most daunting storage networking challenges: file sharing for high-performance clusters. We are seeing clusters pop up everywhere: research and academia, oil and gas exploration, video rendering, financial analysis, etc. The compute power of these clusters grows linearly as you add inexpensive CPUs. The catch is that conventional file-sharing technology is not scalable. In fact, it is common to see a diminishing return as you add nodes due to bottlenecks related to storing and retrieving data.
Let's take your 1,000-node cluster as an example. Figure that a high-performance NFS file server or network-attached storage (NAS) appliance might be able to sustain throughput on the order of, say, 50MBps, which is 400Mbps. If you divide that by the 1,000 nodes on your cluster, you get an average of 400Kbps per node. Note that was a K as in kilo and a b as in bits, not bytes. In other words, any given node in the cluster gets performance somewhere between dial-up and DSL! This kind of performance will cripple your cluster. In fact, it would cripple a cluster one-tenth your size.
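To make the arithmetic easy to repeat with your own numbers, here is a minimal sketch in Python. The function name `per_node_kbps` is made up for illustration, and the 50MBps figure is the same rough guess used above, not a measurement of any particular server. It also assumes decimal units (1MB = 1,000KB) and that every node pulls data at the same time.

```python
def per_node_kbps(server_mbytes_per_s: float, nodes: int) -> float:
    """Average share of one server's throughput per node, in kilobits per second."""
    # Megabytes/s -> megabits/s (x8) -> kilobits/s (x1000, decimal units assumed).
    server_kbits_per_s = server_mbytes_per_s * 8 * 1000
    return server_kbits_per_s / nodes


if __name__ == "__main__":
    # 50 MBps is the illustrative server throughput from the text.
    for nodes in (100, 1000):
        print(f"{nodes} nodes: {per_node_kbps(50, nodes):,.0f} Kbps per node")
```

Plug in the sustained throughput you actually measure on your file server and the node count you expect to grow into.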
So, how do you do it? There are a number of different approaches for high-performance file sharing and different ways to package each approach. At this stage of the game it is hard to say which way is best because the field is new and volatile. I prefer systems based on NFS, which is the least expensive and most straightforward way to share data. Odds are that NFS would provide suitable performance for any given node in your cluster. The problem is that when you aggregate the performance needs of the entire cluster, a conventional NFS server will be brought to its knees. If we could make a faster or more scalable NFS server, the problem would be solved.
Silicon—One approach is to move the entire network stack and the network file access protocol processing into hardware. This is the brute-force approach. It takes a lot of engineering, and it may eventually prove difficult to scale, but for the time being it gets the job done. The only real disadvantage to users is that this approach requires proprietary hardware, but the vast majority of storage area network (SAN) and NAS systems in the enterprise marketplace are proprietary, so perhaps this will not be a major impediment. Silicon-based systems are available as turnkey NAS systems and as NAS heads to be used with existing storage. In the near future you will see them packaged as switches or blades on switches.
SAN file system—In a conventional SAN, each computer is assigned its own LUNs, and administrators go to great lengths to ensure that no two computers are exposed to the same LUNs. In a SAN file system you deliberately expose the same LUNs to all of the computers that need to share files. The computers then communicate with each other or with a dedicated metadata server over the LAN to ensure that files are properly secured and locked when in use. Thus, the data travels directly from host to storage device over the SAN while the metadata flows over the LAN. Other names for SAN file systems are indirect, asymmetrical, or out-of-band file systems.
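The split between the metadata path and the data path can be sketched in a few lines of Python. This is a toy model, not any vendor's design: `MetadataServer`, `SharedLun`, and `read_file` are hypothetical names, the "LUN" is just a dictionary of blocks, and locking is simplified to one exclusive lock per file, whereas real SAN file systems typically offer finer-grained and shared locks.

```python
import threading


class MetadataServer:
    """Hypothetical metadata server: grants file locks and block maps (the LAN path)."""

    def __init__(self, block_map):
        self._block_map = block_map   # file name -> list of block numbers on the shared LUN
        self._owners = {}             # file name -> node currently holding the lock
        self._mutex = threading.Lock()

    def open_file(self, node, name):
        """Lock the file for one node and return where its blocks live."""
        with self._mutex:
            if name in self._owners:
                raise RuntimeError(f"{name} is locked by {self._owners[name]}")
            self._owners[name] = node
            return list(self._block_map[name])

    def close_file(self, node, name):
        """Release the lock if this node holds it."""
        with self._mutex:
            if self._owners.get(name) == node:
                del self._owners[name]


class SharedLun:
    """Stand-in for a LUN that every node can see (the SAN path)."""

    def __init__(self, blocks):
        self._blocks = blocks         # block number -> bytes

    def read_block(self, number):
        return self._blocks[number]


def read_file(node, name, mds, lun):
    """Ask the metadata server for permission, then read blocks straight from storage."""
    blocks = mds.open_file(node, name)                       # metadata over the LAN
    try:
        return b"".join(lun.read_block(b) for b in blocks)   # data over the SAN
    finally:
        mds.close_file(node, name)
```

The point of the design is that the metadata server only answers small questions (who holds the lock, where the blocks are), so the bulk data never funnels through a single file server.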
SAN file systems require SANs, and if your cluster is big enough, then the cost of a SAN might be prohibitive. I haven't priced out a 1,000-node SAN, but I'm guessing the bill will come to a few million dollars. One promising alternative is to use iSCSI. There are now iSCSI host cards (accelerators) on the market that handle both TCP/IP and iSCSI processing in hardware. These cards cost a few hundred dollars but are expected to get cheaper as iSCSI adoption picks up steam. It's also very possible that you could get away with doing iSCSI entirely in software via free initiators.
Another possible problem with SAN file systems is that any given one might not be flexible enough to perform well under the various data types and access patterns used by your cluster. Do the nodes on your cluster have to share a limited set of files and directories or is the file access truly random? Are your files large or small or a mix?
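One way to start answering the file-size question is to profile an existing data set. The sketch below walks a directory tree and counts files by size bucket. The bucket boundaries are arbitrary assumptions chosen for illustration, and the names `bucket_for` and `size_profile` are made up. It says nothing about access patterns, which you would have to capture separately from your applications.

```python
import os
from collections import Counter

# Illustrative bucket boundaries (upper limit in bytes, label); adjust to taste.
BUCKETS = [
    (64 * 1024, "<64KB"),
    (1024 ** 2, "64KB-1MB"),
    (100 * 1024 ** 2, "1MB-100MB"),
]


def bucket_for(size):
    """Return the label of the first bucket whose upper limit exceeds size."""
    for limit, label in BUCKETS:
        if size < limit:
            return label
    return ">=100MB"


def size_profile(root):
    """Count the files under root by size bucket, skipping files that cannot be read."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # broken symlink, permission problem, file removed mid-walk
            counts[bucket_for(size)] += 1
    return counts
```

Calling `size_profile()` on a representative project directory gives you a rough mix of small and large files to bring to vendors when you ask how their file system behaves.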
NAS clusters—A really interesting way to leverage the power of a SAN file system without breaking the bank is to build a NAS cluster on top of a SAN file system. In a NAS cluster each NAS head is connected to a common storage pool over a SAN. Each of the NAS heads exports the same file system. In an ideal world, your performance would scale linearly, meaning that as you add nodes to your NAS cluster, your NFS performance would rise in proportion. However, linear scalability has been an elusive goal. Be sure to find out how far the NAS cluster scales before additional nodes start to yield diminishing returns. Also, because NAS clusters sit on top of SAN file systems, you have to take into consideration how the SAN file system deals with different file sizes and access patterns.
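A simple way to put a number on "how scalable" is to compare measured aggregate throughput against perfect linear scaling. The sketch below does that. The function name `scaling_efficiency` is made up, and the throughput figures in the example are invented purely to show the calculation; they are not benchmark results for any product.

```python
def scaling_efficiency(throughput_by_heads):
    """Map each NAS head count to measured throughput / ideal linear throughput.

    The ideal is extrapolated from the smallest configuration measured,
    so that configuration always scores 1.0.
    """
    base_heads = min(throughput_by_heads)
    per_head_baseline = throughput_by_heads[base_heads] / base_heads
    return {
        heads: throughput / (heads * per_head_baseline)
        for heads, throughput in sorted(throughput_by_heads.items())
    }


if __name__ == "__main__":
    # Invented aggregate throughput in MBps, keyed by number of NAS heads.
    measurements = {1: 50.0, 2: 95.0, 4: 170.0, 8: 280.0}
    for heads, efficiency in scaling_efficiency(measurements).items():
        print(f"{heads} heads: {efficiency:.0%} of linear")
```

If the efficiency falls off sharply at the head count you expect to need, that is the point of diminishing returns to ask the vendor about.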
NAS clusters are often described as parallel file systems and are packaged in four distinct ways:
- As turnkey systems with proprietary hardware;
- As storage "bricks" that combine the file server and storage in one of many modules that make up the storage pool;
- As NAS heads to work with any back-end storage; and
- As software that runs on commodity hardware.
Alternatives to NFS—Another approach is to take NFS out of the picture. There are a number of different file systems on the market that offer superior performance to conventional NFS. Many of these file systems originated in the Department of Energy (DOE) national labs or in academia. If you are going to use them commercially, be sure that you have a reliable source of support. Also note that each of these systems excels with particular file sizes and access patterns and that they, too, have their scalability limits.
Hybrids—Several of the next-generation SAN file systems are able to move data between hosts that are not on the same SAN. That is, they can send both data and metadata over Ethernet. These file systems are, in effect, LAN-based alternatives to NFS. Some vendors allow you to run their file system directly on your cluster nodes. Thus, you can use NFS where it's appropriate and the new file system where you need better performance.
Next-generation interconnects—Another possibility is to use InfiniBand or one of several proprietary, low-latency interconnects. InfiniBand has dropped into the price range of Fibre Channel, which makes it a bargain for some applications but too costly for other applications. In the future, watch for host cards that use a protocol called RDMA (remote direct memory access) to minimize I/O latency.
As cluster computing continues to gain momentum there will be more and more file-sharing solutions coming to market. In the meantime, you need to do your homework before buying into any one approach. The whole idea of a cluster is that it's fast and scalable. Don't let the wrong storage solution take the wind out of your cluster's sails.
Jacob Farmer is chief technology officer at Cambridge Computer (www.cambridgecomputer.com) in Waltham, MA. He can be contacted at firstname.lastname@example.org.