Clustered File Systems Go Beyond Failover Software
Clustered file systems are on the horizon, and they may boost performance, cut costs, and provide better file sharing.
By Gordon Harris
Highly available clustered solutions have been in the spotlight lately. The attention has centered on failover software, such as Microsoft's Wolfpack, Sun's Full Moon, and NCR's LifeKeeper. Failover software allows users to configure two or more computers (nodes) in a "cluster" so that if one node fails, another steps in, recovers the data, and continues the workload.
All nodes within a cluster have access to the same data through a shared storage subsystem, such as a SCSI or Fibre Channel disk array. Depending on the hardware configuration, application, and failover software, service can be interrupted for 15 seconds to 2 minutes.
Clustered systems can be configured in an active/passive or active/active role. In an active/passive configuration, the passive system acts as a backup and monitors the active system for failure. If the active system fails, the passive system becomes active and takes on the workload.
In an active/active configuration, both systems process workloads while monitoring each other. If a node fails, another node takes over the failed node's workload in addition to its own. Some failover software offers a "failback" option, which allows administrators to redistribute the workload once the failed system has been repaired. These failover configurations are all designed to reduce server downtime, a key concern for IS departments.
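The active/passive monitoring just described boils down to a heartbeat check: the standby watches for the active node's periodic "I'm alive" signal and promotes itself when the signal goes stale. The sketch below is a minimal illustration of that idea; the node names, timeout value, and promotion logic are illustrative assumptions, not taken from any particular failover product.

```python
import time

HEARTBEAT_TIMEOUT = 2.0  # seconds without a heartbeat before failover (illustrative)

class Node:
    """Minimal model of a cluster node in an active/passive pair."""
    def __init__(self, name):
        self.name = name
        self.active = False
        self.last_heartbeat = time.monotonic()

    def heartbeat(self):
        """The active node calls this periodically to signal liveness."""
        self.last_heartbeat = time.monotonic()

def monitor(passive, active, now=None):
    """Passive node checks the active node's heartbeat; promotes itself on failure."""
    now = now if now is not None else time.monotonic()
    if now - active.last_heartbeat > HEARTBEAT_TIMEOUT:
        active.active = False
        passive.active = True   # failover: standby takes over the workload
    return passive.active

primary = Node("node-a"); primary.active = True
standby = Node("node-b")

# Simulate a missed heartbeat by evaluating well past the timeout window.
monitor(standby, primary, now=time.monotonic() + 5.0)
print(standby.active)  # the standby has been promoted
```

In a real product the heartbeat would travel over a private interconnect, and the promotion step would also recover the shared storage before accepting work.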
A new class of "cluster-aware" applications is being developed to take advantage of the ability to share data among nodes in a cluster. These applications will provide higher-performing, scalable systems with high availability. Today, only a handful of cluster-aware applications exist. For example, applications like Oracle Parallel Server (OPS) allow up to four nodes to share data in parallel while offering failover.
OPS is a highly specialized application that treats shared storage as "raw" data and manages all aspects of metadata, distributed locking, logging, and failover. Most applications today do not treat storage as "raw" data, depending instead on a file system for storage management.
There are a number of advantages to using a file system to manage the data stored on a raw disk. For example, file systems help reduce application complexity, increase data reliability, and make file management easier. Raw disk applications are typically used in high-demand environments where cost is a secondary factor to performance, and a single machine and storage subsystem can be dedicated to the problem set. This is not a typical environment. None of the failover packages offered to date provide data sharing through a file system.
Clustered File Systems
Sharing data through a file system in a cluster configuration requires a new breed of file system--a "clustered file system." This type of system allows multiple nodes within a cluster to read and write the same files concurrently. On the surface, the file-sharing capabilities of a clustered file system do not seem much different from sharing data through NFS or SMB/CIFS, where multiple clients share file data over a network. But there are critical differences. For example, a clustered file system has no single point of failure, provides data sharing at much faster disk speeds (compared to network speeds), and is inherently coherent. In fact, later in the article we discuss how a clustered file system can be used to provide a fault-resilient NFS or SMB/CIFS server.
A file system is a database, and like parallel databases, a clustered file system provides distributed lock management, fast recovery, and failover. The problem set for building any data-sharing application is quite complex. A clustered file system must arbitrate concurrent access to files such that the semantics of the operating system are the same as if a single system were attached to the storage.
At the same time, enough file system state must be kept so that the system can recover from a node failure and a cluster-aware application can continue without restarting. By solving these problems, a programmer could develop a cluster-aware application without worrying about the details of recovering the physical storage. The cluster-aware application would still have to contend with a fair amount of distributed complexity, but one major hurdle would have been cleared. On node failure, a good clustered file system will halt the cluster, recover inconsistent transactions, and resume processing. The entire recovery process should take from 5 to 30 seconds.
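To make the lock-and-recovery idea above concrete, here is a toy sketch in which a single in-memory lock table stands in for the clustered file system's distributed lock manager. All names and behaviors are hypothetical; a real implementation would keep this state on shared storage and coordinate it across nodes.

```python
class LockManager:
    """Toy stand-in for a clustered file system's distributed lock manager."""
    def __init__(self):
        self.locks = {}       # file path -> owning node
        self.journal = {}     # node -> set of files with uncommitted changes

    def acquire(self, node, path):
        if self.locks.get(path, node) != node:
            return False      # another node holds the lock
        self.locks[path] = node
        self.journal.setdefault(node, set()).add(path)
        return True

    def commit(self, node, path):
        self.journal.get(node, set()).discard(path)
        self.locks.pop(path, None)

    def recover(self, failed_node):
        """On node failure: identify uncommitted files and release the node's locks."""
        rolled_back = self.journal.pop(failed_node, set())
        for path, owner in list(self.locks.items()):
            if owner == failed_node:
                del self.locks[path]
        return rolled_back    # these files' transactions must be undone

lm = LockManager()
lm.acquire("node-a", "/data/f1")
lm.acquire("node-b", "/data/f2")
denied = lm.acquire("node-b", "/data/f1")      # False: node-a holds the lock
rolled = sorted(lm.recover("node-a"))          # node-a fails; its work is rolled back
regained = lm.acquire("node-b", "/data/f1")    # True: lock released after recovery
print(denied, rolled, regained)
```

The 5-to-30-second recovery window mentioned above corresponds to the `recover` step: replaying or discarding inconsistent transactions and releasing the failed node's locks so the surviving nodes can resume.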
A big selling point of a clustered file system is that users do not have to wait for cluster-aware applications to be developed before they can take advantage of clustered hardware. A clustered file system can address many scalability and fault-resilient problems without failover software or cluster-aware applications. This is an important point because it may take years for software vendors to develop cluster-aware applications. Administrators can start making plans today to address the scalability and fault-resiliency needs of their users, so that when clustered file systems appear, they will be well prepared.
The following section discusses how clustered file systems can solve two common problems, without using failover software or a cluster-aware application, and how they can be used with cluster-aware NFS or SMB/CIFS servers to provide fault-resilient network file services.
Data Sharing for Web Server Scalability
Today, the only way to scale a Web server is to buy a bigger machine, to add another CPU to a multiprocessor system (if you are not already at the physical limits of the hardware), or to buy expensive, complicated, and proprietary distributed Web software. Every administrator wishes that he or she could just add another machine as required to help with the load of Internet commerce, while keeping the Web site in one common place for easy access. In addition, an administrator has to wrestle with the conflicting issues of keeping a Web site isolated from the corporate network for security considerations, while still allowing convenient updates to the site from the corporate network.
A clustered file system can solve this problem. Figure 1 shows a possible configuration using existing off-the-shelf hardware. A round-robin address substitution switch redirects incoming requests to an available server. Since the HTML files for the Web server are visible to all nodes in the cluster, any node can handle the request.
Note also that the entire Web site is physically disconnected from the rest of the corporate network for security. When the Web site needs to be updated, a single node leaves the cluster, disconnects from the Web, and connects to the corporate network where the local disk can be safely updated with the new Web site. Once the new site is loaded, the node disconnects from the corporate network, rejoins the cluster, and updates the Web site on the shared disk. The connection and reconnection of the networks can be done with network switches that physically isolate and secure the network. If a node in the cluster fails, the address switch redirects the failed request to the remaining nodes. As the load increases, administrators can add additional nodes to the cluster.
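The switch behavior described above (rotate across nodes, skip failed ones) can be sketched as follows. This is a minimal model, not the logic of any real address-substitution switch; the node names and failure marking are illustrative.

```python
from itertools import cycle

class RoundRobinSwitch:
    """Toy model of a round-robin address substitution switch."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.failed = set()
        self._ring = cycle(self.nodes)

    def route(self, request):
        # Any live node can serve the request: the HTML files live on the
        # shared clustered file system, so no node is "special."
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node not in self.failed:
                return node
        raise RuntimeError("no nodes available")

switch = RoundRobinSwitch(["web1", "web2", "web3"])
first = [switch.route(r) for r in range(4)]
print(first)                  # requests rotate across all three nodes

switch.failed.add("web2")     # node failure: the switch redirects around it
after_failure = [switch.route(r) for r in range(3)]
print(after_failure)
```

Adding capacity is symmetric: appending a node to the ring is all the "scaling" step requires, because the shared file system already makes the site visible to the new node.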
Using a CFS for Pipeline Processing
In corporate data centers, it is not uncommon to run several applications in a pipeline configuration, where each application does some pre-processing for the next application. This can be a delicate balance, as changes in the pipeline could render the existing hardware obsolete. Rather than having to buy all new, more powerful hardware, administrators would probably prefer the flexibility of being able to quickly add a new piece of hardware to assist with the problem.
A clustered file system can significantly speed up this type of processing while using much less expensive hardware. For example, suppose you have eight applications in the pipeline, which may require an expensive eight-way symmetric multiprocessor system to run efficiently.
With a clustered file system, eight less-expensive uniprocessor systems could be configured as shown in Figure 2, where each system processes input for the next. This configuration not only outperforms the eight-way multiprocessor system (because it doesn't have the multiprocessing overhead of locking, bus contention, etc.), but it makes the pipeline scalable.
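The hand-off pattern in Figure 2 can be sketched abstractly as below. In the real configuration each stage would be a separate node reading its input from, and writing its output to, files on the clustered file system; here a plain Python list stands in for that shared-file hand-off, and the stage names are invented for illustration.

```python
# Each "node" is one pipeline stage; the list passed between stages stands in
# for files on the clustered file system that one node writes and the next reads.
def make_stage(label):
    def stage(records):
        return [f"{rec}|{label}" for rec in records]   # tag each record it processes
    return stage

pipeline = [make_stage(f"stage{i}") for i in range(1, 9)]  # eight uniprocessor nodes

def run_pipeline(records, stages):
    for stage in stages:
        records = stage(records)     # output of one node is input to the next
    return records

result = run_pipeline(["rec-a"], pipeline)
print(result)
```

Scaling the pipeline amounts to inserting another stage (another node) into the list, which is the flexibility the article argues for over replacing an eight-way multiprocessor outright.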
CFS for NFS or SMB Servers
The success of the client/server computing paradigm has created a critical dependence on network file servers. Because of the importance of these servers, they need to be made as fault-resilient as possible. A clustered file system makes it possible to build a fault-resilient NFS or SMB server--if one node fails, the network request is switched over to an existing active node.
The hardware setup would be similar to the scalable Web server example provided above. Service may be briefly interrupted while the cluster state is restored. Note that a fault-resilient NFS or SMB/CIFS server would require a cluster-aware version of the server software.
Because NFS is a stateless protocol, it can be made cluster-aware much more easily than SMB/CIFS, which is a stateful protocol. A cluster-aware NFS server must keep a copy of its reply cache on the shared disk so that the new NFS server can handle non-idempotent operations (that is, operations that when repeated produce different results) correctly when a node fails.
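The reply-cache idea can be illustrated with a toy example. The sketch below is not NFS server code; it only shows why caching replies matters for non-idempotent operations such as a file REMOVE, where re-executing a retransmitted request would wrongly return an error. The class and names are hypothetical, and in the cluster-aware case the cache dictionary would live on the shared disk.

```python
# Toy reply cache keyed by (client, transaction id). A retransmitted
# non-idempotent request gets the cached reply instead of re-executing.
class ReplyCache:
    def __init__(self):
        self.cache = {}   # (client, xid) -> reply; on shared disk in practice

    def handle(self, client, xid, operation):
        key = (client, xid)
        if key in self.cache:
            return self.cache[key]        # duplicate request: replay the reply
        reply = operation()
        self.cache[key] = reply
        return reply

files = {"a.txt"}
def remove_a():
    """Non-idempotent: succeeds once, then the file no longer exists."""
    if "a.txt" in files:
        files.discard("a.txt")
        return "OK"
    return "ENOENT"

rc = ReplyCache()
reply1 = rc.handle("client1", 7, remove_a)   # first attempt removes the file
reply2 = rc.handle("client1", 7, remove_a)   # retransmission replays the cached "OK"
print(reply1, reply2)
```

After a failover, the surviving node reads the same cache from shared disk, so a client retransmitting to the new server still gets the original reply rather than a spurious error.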
A full discussion of the SMB/CIFS state that would need to be recovered on node failure is beyond the scope of this article.
Build Your Own Distributed Apps
A clustered file system allows you to build your own cluster-aware applications that share data through the file system. These applications stand to benefit from being fault-resilient and from performance improvements due to parallel processing. To develop practical cluster-aware applications, several key application tools are required. For example, a distributed debugging environment that includes distributed breakpoints, message tracing, etc., is necessary. In a multi-threaded application, a good debugger can control the threaded environment so that the state of all threads is frozen on an event. A good distributed debugger gives the same control but across the nodes in a cluster.
Another key component is a library to support distributed, fault-resilient transactions. Such a library would simplify the recovery of failed nodes, which is the most difficult type of problem for cluster-aware applications to solve.
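One common building block for such a library is a write-ahead log: a transaction's effects are applied during recovery only if its COMMIT record reached the (shared) log, so a surviving node can cleanly redo completed work and discard the rest. The sketch below is a deliberately simplified, single-transaction-at-a-time illustration of that idea, with an invented log-record format.

```python
# Toy write-ahead log recovery: apply a transaction's writes only if its
# COMMIT record made it to the log; otherwise the writes are discarded.
def recover(log):
    committed, staged = {}, {}
    txn = None
    for record in log:
        if record[0] == "BEGIN":
            txn, staged = record[1], {}
        elif record[0] == "WRITE":
            staged[record[1]] = record[2]   # buffered until COMMIT is seen
        elif record[0] == "COMMIT" and record[1] == txn:
            committed.update(staged)        # redo: the transaction completed
    return committed                        # uncommitted writes are rolled back

log = [
    ("BEGIN", 1), ("WRITE", "x", 10), ("COMMIT", 1),
    ("BEGIN", 2), ("WRITE", "y", 20),   # node failed before COMMIT
]
state = recover(log)
print(state)   # only transaction 1's write survives
```

A real library would add distributed commit coordination and concurrent transactions, but this redo-or-discard decision is the core of why node-failure recovery becomes tractable for the application programmer.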
Application programmers must also be trained for this new programming paradigm. Distributed computing requires new algorithms and data structures that allow parallel processing while supporting fault-resiliency. As with any new programming paradigm, a significant up-front investment is required, but the potential gains should justify the investment.
Several major systems, software, and storage vendors (such as Sun and Veritas Software) are promising clustered file systems by year-end. Programmed Logic's clustered file system, Clustered HTFS, should be available by the end of the first quarter.
Clustered file systems provide an attractive alternative to conventional solutions to scalability and fault-resilience problems. They can be used immediately to solve problems--no failover software or cluster-aware applications are required. In fact, they enhance today's failover packages. In addition, clustered file systems provide shared access to files at higher speeds than network file systems; they don't encounter the same hardware bottlenecks found in symmetric multiprocessing systems; and they are more cost-effective than solutions requiring special-purpose hardware.
Fig. 1. In a clustered file system architecture, a round-robin address substitution switch can redirect incoming requests to available servers. Since the HTML files for the Web server are visible to all nodes in the cluster, any node can handle the request.
Fig. 2. In a clustered file system configuration, uniprocessors can be configured so that each system processes input for the next processor, as opposed to a more expensive multiprocessor configuration.
Gordon Harris is vice president of research and development at Programmed Logic Corp. (PLC) in Somerset, NJ.