NT Drives Cluster Adoption
Clusters are beginning to take hold at Windows NT sites. Here's how to take advantage of Microsoft Cluster Server (MSCS).
By Paul Massiglia
Clustering technology has been commercially available for more than a decade. With the introduction of Windows NT Enterprise Edition, however, Microsoft has paved the way for its evolution--from a predominantly high-end vendor-unique technology to one that serves virtually any enterprise. As a result, systems integrators and end-users may soon be able to create their own cluster configurations using "best-in-class" components from a variety of vendors that have been pre-validated by Microsoft.
Physically, a cluster is a collection of two or more servers connected to a common body of data storage and to a common set of clients. Logically, a cluster is a single management domain, in which any server is able to provide any service to any client. This requires both common data access and a single security model. With today's technology, this generally means that the servers that make up a cluster need to have a common architecture and must run the same operating system version.
While implementations differ, clusters provide three fundamental benefits:
- Better application and data availability
- Scalability, enabling applications to grow beyond the capacity of a single server
- Simpler management of large or rapidly growing systems
Loss of application availability is often caused by hardware, software, or operational failures. A cluster can protect against many of these failures because it has redundant computing, I/O, and storage components as well as the functionality for a secondary component to "take over" for a failed one. Clusters are failure tolerant, or more accurately, they provide rapid failure recovery.
A simple cluster consists of two servers, two I/O buses, and two disk subsystems with mirrored data (see figure on next page). If a server hardware or software component fails, causing a crash, the second server takes over. If an I/O bus or adapter fails, data is read and written using the alternate I/O bus or adapter. If a server's network connection fails, clients can obtain services from the other server. If a disk fails, its data can be read and written using the mirrored copy.
In all of these scenarios, there may be a perceptible gap in service while corrective action is taken. However, with clusters, the gap is typically measured in seconds rather than minutes or hours.
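The difference between a gap of seconds and a gap of hours can be put in numbers. The sketch below uses the standard availability ratio (time in service divided by total time); the failure interval and recovery times are hypothetical values chosen only for illustration, not measurements of any cluster.

```python
HOURS_PER_YEAR = 8760

def availability(mtbf_hours, recovery_seconds):
    """Fraction of time in service for a given mean time between
    failures and a given recovery time per failure."""
    recovery_hours = recovery_seconds / 3600
    return mtbf_hours / (mtbf_hours + recovery_hours)

def annual_downtime_hours(mtbf_hours, recovery_seconds):
    """Expected hours out of service per year."""
    return (1 - availability(mtbf_hours, recovery_seconds)) * HOURS_PER_YEAR

# Hypothetical figures: one failure every 1,000 hours on average.
clustered = annual_downtime_hours(1000, 30)        # 30-second failover
standalone = annual_downtime_hours(1000, 2 * 3600)  # 2-hour manual repair
print(f"clustered:  {clustered:.2f} hours/year")
print(f"standalone: {standalone:.2f} hours/year")
```

With the same failure rate, shortening recovery is what reduces the downtime; the cluster does not make failures less frequent.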
An important benefit of some cluster architectures is application growth, or "scaling," beyond the capacity of a single server. Many applications consist of multiple autonomous threads of activity that interact infrequently.
These threads run either pseudo-simultaneously in single-processor servers or in parallel in symmetric multiprocessor (SMP) systems. In some clusters, threads can execute in different servers. Thus, if an application outgrows its server, a second server can be installed, forming a cluster and increasing the capacity of the application.
Simultaneous data access by multiple servers requires coordination, either by applications or by a distributed file system. In clusters that lack one of these mechanisms, access to any disk or file is restricted to one server at a time. However, even these clusters provide some scaling for applications that can be partitioned.
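Partitioned scaling under one-server-at-a-time access can be sketched as follows. This is a minimal illustration, not any product's mechanism: the function names are hypothetical, and round-robin placement is an assumption made for the example.

```python
# Hypothetical illustration: every disk has exactly one owning server,
# and each request is routed to the server that owns the disk it needs.

def assign_owners(disks, servers):
    """Spread disks across servers round-robin; each disk gets one owner."""
    return {disk: servers[i % len(servers)] for i, disk in enumerate(disks)}

def route(request_disk, owners):
    """Return the only server allowed to access the requested disk."""
    return owners[request_disk]

owners = assign_owners(["disk0", "disk1", "disk2", "disk3"], ["LEFT", "RIGHT"])
for disk in sorted(owners):
    print(disk, "->", route(disk, owners))
```

An application whose data divides cleanly across the disks gains capacity from the second server, even though no disk is ever accessed by both servers at once.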
Simplified System Management
The third major benefit of clusters is simpler system management. Because clusters support more applications, data, and users within a single system, system management (e.g., operating system and application maintenance, user management, configuration management, and backup) is easier. Complexity and expense increase with system size, but more importantly, with the number of systems that are managed.
Clusters also reduce system management costs since there are fewer separately managed servers. Typically, a clustered system has one user account file, one file access policy, one backup policy, etc. Different cluster architectures present this single system image to a greater or lesser degree, but managing a cluster is generally less costly than managing an equivalent number of unrelated servers.
Clusters for Windows NT
In September 1997, Microsoft announced the Enterprise Edition of its Windows NT Server operating system. Aimed at complex distributed applications, Enterprise Edition contains four major new components:
- 4GB tuning for memory (4GT)
- Microsoft Transaction Server (MTS)
- Microsoft Message Queuing Services (MSMQ)
- Microsoft Cluster Server (MSCS)
4GT increases application memory address space, particularly for database caching. MTS makes it easy to link application components ("objects") into transactions with all-or-none semantics. MSMQ provides a robust mechanism for transmitting messages among application components in environments in which all components are not necessarily available at all times. Finally, MSCS provides a robust server environment in which the above-mentioned services and applications can execute.
MSCS: Clusters for the Masses
MSCS is likely to significantly change client-server computing. MSCS could eventually make clustering accessible to an installed base of well over a million servers, at a cost of less than $20,000 per system. And the near universality of the Windows user interface significantly simplifies training for new users. By 2000, these factors may transform clustering from the exclusive province of large data centers to one that serves average server system configurations.
MSCS is a service layered on NT. Initially, MSCS supports clusters of two servers, although its architecture allows for larger ones. The servers in an MSCS cluster are connected by one or more common storage buses and by at least one common network connection, although two are recommended (see figure on next page). MSCS uses Windows NT network and I/O services; therefore, the disk subsystems and network interface cards it supports are a subset of those supported by Windows NT.
MSCS makes groups of resources available to clients. Software to manage the most common types of resources, such as disks, file shares, network names and addresses, and applications, is shipped with MSCS. An optional software development kit enables application developers to create additional resource types.
MSCS exports resource groups as virtual servers. To clients, virtual servers are identical to physical ones. A file share named SHARE1 might be associated with virtual server VIRTA (see figure on bottom of next page). Clients would refer to \\VIRTA\SHARE1, with access provided by server RIGHT. If server RIGHT fails, clients would continue to address \\VIRTA\SHARE1, but server LEFT would provide the access.
A system administrator would typically organize related MSCS resources (e.g., an application, the disks containing its data, and the network addresses for accessing it) as a group. The primary significance of MSCS is its ability to transfer (or "fail over") all the resources in a group from one server to the other as a unit. For example, MSCS prevents an application resource from running on one server while its disk resources are being controlled by the other server. This exclusive control by one server at any instant is sometimes referred to as the shared-nothing model.
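The idea of a resource group that moves as a unit behind a stable virtual server name can be sketched in a few lines. This is a conceptual model only and does not use or represent the MSCS programming interface; the class names, the resource labels other than SHARE1, and the choice of the first survivor as the new owner are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceGroup:
    virtual_server: str   # name that clients address, e.g. VIRTA
    resources: list       # shares, disks, addresses that move together
    owner: str            # physical server currently controlling the group

@dataclass
class Cluster:
    servers: list
    groups: list = field(default_factory=list)

    def fail(self, failed):
        """Transfer every group owned by the failed server to a survivor.
        Ownership is one field per group, so resources cannot be split."""
        survivors = [s for s in self.servers if s != failed]
        if not survivors:
            raise RuntimeError("no surviving server")
        for group in self.groups:
            if group.owner == failed:
                group.owner = survivors[0]
        self.servers = survivors

    def resolve(self, unc_path):
        r"""Map a client path such as \\VIRTA\SHARE1 to the physical server."""
        name, share = unc_path.lstrip("\\").split("\\", 1)
        for group in self.groups:
            if group.virtual_server == name and share in group.resources:
                return group.owner
        raise LookupError(unc_path)

cluster = Cluster(
    servers=["LEFT", "RIGHT"],
    groups=[ResourceGroup("VIRTA", ["SHARE1", "data_disk"], owner="RIGHT")],
)
print(cluster.resolve(r"\\VIRTA\SHARE1"))   # served by RIGHT
cluster.fail("RIGHT")
print(cluster.resolve(r"\\VIRTA\SHARE1"))   # same path, now served by LEFT
```

The client-visible path never changes; only the owner recorded for the group does, which is the essence of the virtual server abstraction.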
The primary benefit of MSCS today is application and data availability. When an application belongs to an MSCS resource group, a server failure results in a momentary outage while the cluster is being analyzed. MSCS then restarts the application on the other server. For most applications, this degree of failure tolerance is adequate. Since MSCS runs on relatively inexpensive servers, clustering makes failure tolerance affordable to a wider range of applications than ever before.
To a lesser degree, MSCS also allows for application scaling and minimizes management costs. As currently implemented in MSCS, the shared-nothing model does not easily scale applications beyond a single server. However, a suite of related programs that operate on different data may benefit from MSCS, as may applications based on an MSCS-aware database manager. MSCS-aware versions of Oracle Parallel Server and SQL Server are planned. Microsoft also intends to support more general application scaling in the future. MSCS helps control management costs because it manages cluster resources from either server. Additional management efficiency stems from the MSCS servers' membership in a Windows NT network domain, which offers such network-wide services as user and name service management.
I/O for MSCS Clusters
The primary benefit of first-generation MSCS clusters is enhanced application availability. If a server fails, MSCS software detects the failure and restarts applications on the remaining server. But servers aren't the only parts of a computer system that can fail. Disks and I/O buses can also experience either component failure or operational error. To deliver optimum value, the I/O subsystem and cluster interconnect must be as robust as the cluster itself. RAID and mirrored disk subsystems are ideal companions to clustered servers. With its data protected by RAID or mirroring technology, a cluster can survive any single system component failure.
There are three basic ways that RAID can be implemented in an MSCS environment, each of which has advantages in particular market segments:
- Host software
- Bus-based embedded RAID controllers
- External controllers
Host Software RAID
In the host software approach, a host driver program provides the RAID function. Application read and write requests to a virtual disk are intercepted by this driver and are routed to physical disk drivers as required by RAID or mirroring algorithms. The Windows NT Server FTDISK program performs this function for local disks, but not for the shared disks in an MSCS cluster.
Host software RAID is the least-expensive way to implement disk failure tolerance, but the CPU overhead of RAID XOR computations restricts its use in busy servers. Host software mirroring imposes a smaller load on the server, but requires more disks to provide a given amount of usable storage. In either case, the servers in a cluster environment must be coordinated so that both servers have an up-to-date picture of the mirrored or RAID array state at all times.
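The XOR computation behind parity RAID, and the disk-count trade-off against mirroring, can be illustrated with a short sketch. It assumes a single parity block per stripe (as in RAID levels 3 through 5) and uses hypothetical function names; it is not how FTDISK or any particular driver is written.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity(data_blocks):
    """Parity block for one stripe: the XOR of all its data blocks."""
    return xor_blocks(data_blocks)

def rebuild(surviving_blocks, parity_block):
    """Recover the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])

def disks_needed(data_disks, scheme):
    """Disks required to hold data_disks worth of usable storage."""
    if scheme == "mirror":
        return 2 * data_disks      # every disk has a full copy
    if scheme == "parity":
        return data_disks + 1      # one extra disk's worth of parity
    raise ValueError(scheme)

stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = parity(stripe)
# Pretend the middle disk failed and regenerate its block.
recovered = rebuild([stripe[0], stripe[2]], p)
print(recovered == stripe[1])
print(disks_needed(4, "mirror"), disks_needed(4, "parity"))
```

Every write to a parity array touches this XOR path, which is the CPU load the host must absorb; mirroring avoids the computation but, for four disks of usable storage in this example, needs eight disks instead of five.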
While neither host-software-based mirroring nor RAID is available for the MSCS environment, Microsoft or another software vendor may offer it in the future. In either case, the host software approach is likely to represent the lowest-cost and lowest-performing approach to failure-tolerant on-line data in the MSCS environment.
Bus-Based RAID Controllers
Several I/O component vendors have taken the approach of off-loading computationally intensive RAID functions onto modules that bridge servers' PCI buses and Ultra-SCSI disk connections (see above figure). In addition to microprocessors, these bus-based RAID controller modules often include special-purpose ASICs for XOR computation and substantial amounts of write-back cache. Some vendors have even taken the hybrid approach of using ASICs for XOR computations and host driver software for less time-critical functions. This reduces the cost of bus-based RAID and improves the performance of a given controller when it is used with a faster host.
Because they provide a short path between disks and host memory and offload RAID and basic I/O functions from the host, bus-based RAID controllers offer very high performance. Because they share their hosts' power, cooling, and packaging systems, these controllers are very cost-effective RAID solutions. On the downside, the controllers offer limited scaling and their failure tolerance is constrained by the host environment.
The cluster environment presents special challenges for bus-based RAID controllers. It is easy to connect two bus-based controllers to common Ultra-SCSI disk buses. With the MSCS shared-nothing model, each array is controlled by one server at a time. However, both controllers must have up-to-date knowledge of the state of all disk arrays so that corrective action can be taken in the event of a failover. Either the RAID controllers' drivers must communicate with each other (perhaps using the cluster's network interconnect) or the RAID controllers themselves must use their device buses to intercommunicate. Both solutions require development effort to implement. To date, bus-based controller vendors have adopted the device bus intercommunication approach, perhaps because it provides a host-independent solution that can be used in any cluster environment.
Because of their low cost and high performance, bus-based RAID controllers are expected to dominate the low-end MSCS cluster I/O subsystem market. While they are not the most failure-tolerant cluster I/O solution, bus-based RAID controllers do protect against common I/O subsystem failures. With the intercommunication feature described above, bus-based controllers are the most cost-effective failure-tolerant I/O subsystem alternatives for MSCS clusters.
External RAID for MSCS Clusters
The third RAID implementation architecture for MSCS clusters is the external RAID subsystem. Typically, the external RAID subsystem's controller is co-located with the disks and uses a disk I/O bus such as Ultra SCSI to communicate with its hosts. When communicating with hosts, external RAID controllers typically emulate one or more SCSI disks, making them very attractive to users because special driver software is generally not required.
In a typical external RAID configuration, servers have Ultra-SCSI host-bus adapters (HBAs), just as they would if disks were attached directly. The two external RAID controllers' host ports are connected to these buses. The RAID controllers' disk buses are connected to a common set of disk drives, which are organized as RAID or mirrored arrays. Each array appears to the host servers as one SCSI disk. External RAID controllers use host SCSI addresses sparingly--in some implementations, hundreds of gigabytes can be exported to hosts as a single disk. External RAID subsystems, each with its own enclosure, can easily be added to a system as storage requirements increase. This makes external RAID the clear choice for systems with rapidly growing or unpredictable storage requirements. In a completely failure-tolerant configuration (see figure below), any single component, including server, HBA, host I/O bus, external RAID controller, disk I/O bus, or disk, can fail without reducing the accessibility of the data or the application.
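Whether a configuration really tolerates any single component failure can be checked mechanically: list every path from a server to a copy of the data, and look for a component that appears on all of them. The sketch below is a simplified model with hypothetical component names; the two layouts are illustrative and are not taken from the article's figure.

```python
def single_points_of_failure(paths):
    """Components present on every access path. Losing any one of them
    cuts off all access to the data."""
    return set.intersection(*(set(path) for path in paths))

# Each tuple is one route from a server to a copy of the mirrored data.
fully_redundant = [
    ("server_LEFT", "hba_1", "host_bus_1", "ctrl_A", "disk_bus_1", "disk_1"),
    ("server_RIGHT", "hba_2", "host_bus_2", "ctrl_B", "disk_bus_2", "disk_2"),
]

# Same layout, but both host buses lead to a single RAID controller.
shared_controller = [
    ("server_LEFT", "hba_1", "host_bus_1", "ctrl_A", "disk_bus_1", "disk_1"),
    ("server_RIGHT", "hba_2", "host_bus_2", "ctrl_A", "disk_bus_2", "disk_2"),
]

print(single_points_of_failure(fully_redundant))    # empty set
print(single_points_of_failure(shared_controller))  # the shared controller
```

An empty result means no single failure removes every path, which is the property the completely failure-tolerant configuration is designed to have.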
Packaging requirements tend to make external RAID systems more expensive than embedded bus-based RAID solutions. On the other hand, because they are packaged separately from the host systems, external RAID subsystems are not susceptible to functional or environmental failures in the hosts. Moreover, when two external RAID controllers share a common package, it may be possible for each to access the other's write-back cache, shortening failover time considerably. Some vendors have even implemented dual-port cache memories and have used them to mirror cache contents between controllers. In either case, with external RAID controllers, it is possible to use controller write-back cache to improve performance without fear of losing committed data if a failure occurs with unwritten data in cache. This is an important advantage of external RAID controllers over host software and bus-based solutions.
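The reason mirrored write-back cache protects committed data can be shown with a toy model. This is a conceptual sketch only: the class and method names are hypothetical, and real controllers mirror cache in hardware rather than with dictionaries.

```python
class Controller:
    """Toy external RAID controller with a mirrored write-back cache."""

    def __init__(self, name):
        self.name = name
        self.cache = {}       # block number -> data not yet written to disk
        self.partner = None
        self.alive = True

    def write(self, block, data):
        """Accept a host write. The data is copied into the partner's
        cache before the write is acknowledged."""
        self.cache[block] = data
        if self.partner is not None and self.partner.alive:
            self.partner.cache[block] = data
        return "ack"

    def flush(self, disk):
        """Write all cached blocks to disk, then drop them from both caches."""
        disk.update(self.cache)
        if self.partner is not None:
            for block in self.cache:
                self.partner.cache.pop(block, None)
        self.cache.clear()

def pair(a, b):
    a.partner, b.partner = b, a

ctrl_a, ctrl_b = Controller("A"), Controller("B")
pair(ctrl_a, ctrl_b)
disk = {}

ctrl_a.write(7, b"committed")   # acknowledged, but not yet on disk
ctrl_a.alive = False            # controller A fails with dirty cache
ctrl_b.flush(disk)              # survivor writes the mirrored copy
print(disk[7])
```

Without the mirrored copy, the acknowledged write would exist only in the failed controller's memory, which is the exposure that write-back caching otherwise creates.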
The expandability, failure tolerance, and high-end performance advantages of external RAID controllers do not come without a cost, however. Many applications do not justify external RAID, opting instead for less costly bus-based solutions. For the growing number of applications that require these benefits, high-end external RAID subsystems are the answer.
With the introduction of MSCS clusters, Microsoft has created a new point in the server market. Positioned above typical (nonfailure-tolerant) Intel-based servers and significantly below clustered Unix and proprietary systems, MSCS clusters make failure-tolerant computing available for more applications. It is easy to visualize MSCS clusters expanding in both directions: downward as increased integration of server and I/O components reduces cluster hardware cost, and upward as Microsoft increases MSCS functionality and higher-performance interconnects become available.
One exciting development for current MSCS clusters is the substitution of Fibre Channel for Ultra SCSI as a shared storage interconnect. While Fibre Channel offers higher data transfer rates than currently available versions of SCSI (100 MBps vs. 80 MBps), the primary advantages of Fibre Channel for clusters are connectivity (126 devices vs. 16) and bus length (10 km vs. 25 m). When Fibre Channel is used to connect MSCS shared storage, it is possible to separate the cluster's two servers so that a degree of disaster tolerance is achieved. Moreover, the cluster's shared storage (which can be RAID-protected using Fibre Channel-to-SCSI external RAID controllers) can be remote as well, located at a physically secure site, for example.
Microsoft plans to increase the number of servers supported in a single cluster and to improve MSCS's cluster scaling and load-balancing capabilities. To some degree, the company's ability to do so depends on the emergence of a generation of high-bandwidth, low-latency system-area networks.
A simple cluster consists of two servers, two I/O buses, and two disk subsystems with mirrored data.
Servers in an MSCS cluster are connected by one or more common storage buses and by at least one common network connection.
MSCS exports resource groups as "virtual servers." To clients, virtual servers are identical to physical servers.
Bus-based RAID offloads RAID functions onto modules that bridge servers' PCI buses and Ultra-SCSI disk connections.
An external RAID configuration can provide full failure tolerance.
Paul Massiglia is technical strategist for cluster system products at Adaptec, Inc.'s Technology Center in Longmont, CO.