Backup and Recovery for NT Clusters
Server clustering presents data protection challenges. Solutions are on the horizon.
With the increasing deployment of Windows NT Server Enterprise Edition, clustering is becoming a popular way of achieving high availability without the expense of deploying fault-tolerant redundant systems. Unlike redundant systems, which simply mirror the functions of the primary server and perform no additional tasks, clusters play a dual role. They run their own applications, such as e-mail or web access services, and they are capable of temporarily taking over the functions of any failed server in the cluster.
Clusters are classified as "highly available," with an average up-time of about 99%. Redundant systems are classified as "fault tolerant," with 100% up-time under all but the most cataclysmic circumstances. With greater backup requirements, shrinking backup windows, and the high costs of deploying fully redundant systems, many network managers are opting for Microsoft's clustering system and its promise of near-continuous availability.
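To put the up-time figures in perspective, the short Python sketch below converts an availability percentage into expected hours of downtime per year. The function name and the 99.9% comparison point are illustrative assumptions, not figures from any vendor.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the expected hours of downtime per year for a given up-time percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


# 99% and 100% come from the text above; 99.9% is added only for comparison.
for pct in (99.0, 99.9, 100.0):
    print(f"{pct}% up-time: {annual_downtime_hours(pct):.1f} hours down per year")
```

At 99% up-time, the arithmetic works out to roughly 87.6 hours of downtime per year, which is the gap that separates "highly available" from "fault tolerant."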
Clustered servers share a common client group and common data storage media, including local disk drives on the servers themselves and a shared disk subsystem--typically a RAID array for additional fault resilience. Clustering software gives two or more servers the capability to manage applications and files for the clustered group. Each node in the cluster hosts a particular set of applications or services and broadcasts its operational state by sending out packets, called "heartbeats," to all other nodes. If the other nodes in the cluster fail to detect the heartbeat for a specified interval, responsibility for running an application or an operation passes to another server, which runs its own operations in addition to those of the failed server.
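The heartbeat mechanism can be sketched in a few lines of Python. This is a simplified illustration only, not the protocol Clustering Service actually uses: the class name, the five-second timeout, and the use of explicit timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node (hypothetical helper)."""
    timeout_seconds: float
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        """Return nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)  # assumed interval
monitor.record("A", now=0.0)
monitor.record("B", now=0.0)
monitor.record("B", now=4.0)   # B keeps sending heartbeats; A goes silent
print(monitor.failed_nodes(now=6.0))
```

In this run, node A has been silent for longer than the timeout and is reported as failed, while node B is still considered healthy.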
Microsoft's Clustering Service provides the framework for determining how a cluster will handle the failure of any component. For example, assume that Server A is running file and print functions and Server B is handling database services. If Server A fails, Clustering Service directs Server B to assume control over file and print functions for the cluster, while maintaining its primary function as a database server.
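The Server A and Server B example can be modeled as a simple reassignment of services. The sketch below is illustrative only; the function name and the dictionary layout are assumptions and do not reflect how Clustering Service stores its configuration.

```python
def fail_over(assignments: dict, failed: str, survivor: str) -> dict:
    """Return a new node-to-services map with the failed node's services
    moved to the survivor. The input map is left unchanged."""
    if failed not in assignments or survivor not in assignments:
        raise KeyError("both nodes must be members of the cluster")
    updated = {node: list(services) for node, services in assignments.items()}
    updated[survivor].extend(updated[failed])  # survivor keeps its own work too
    updated[failed] = []
    return updated


cluster = {"Server A": ["file", "print"], "Server B": ["database"]}
after = fail_over(cluster, failed="Server A", survivor="Server B")
print(after["Server B"])
```

After the failover, Server B lists its original database service followed by the file and print services it inherited from Server A.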
To clients, the server cluster appears as a single group of available services. Users interact with the cluster as a unified entity. If a clustered server fails, users experience a brief interruption--usually 30 seconds or less--in network response while another server in the cluster restarts the application. Clustering automatically maintains the IT infrastructure for 24x7 environments, restarting applications and services without human intervention.
As Clustering Service technology matures, additional benefits accrue to network administrators. In addition to high availability at reduced cost, clustering provides scalability, enabling applications to grow beyond a single server. Because applications often comprise many processes that can be set up to execute on multiple servers, clustering will enable network administrators to simply add a new server when an application outgrows its original host. Clustering Service will also handle process coordination.
Clustering Service will also simplify network management because, logically, a cluster will comprise a single management domain whose components share a common architecture and operating environment. Clustering will let network managers treat the server group as a single system for operations such as configuration, operating system and application maintenance, and user account management. This will reduce administrative overhead and lower overall system management costs.
Data Protection Issues
While Clustering Service greatly reduces the cost of building highly available systems, clustering magnifies the complexity of backup and recovery. In a non-clustered environment, network architects can either deploy a dedicated tape backup system for each server or they can back up over the network to a backup server, depending on the backup window and the amount of data.
In a clustered environment, however, data protection issues become considerably more complicated.
The two standard backup methods--backing up over the network or to a directly attached device--are either impractical or inadequate to safeguard all data in a cluster. While smaller amounts of data can be backed up over the network, standard networks are inadequate for large backups. To achieve acceptable performance on large data sets, network administrators need to back up the cluster's physical nodes to a directly attached tape system. But this approach presents its own set of problems.
Clustered applications are packaged as "groups," which include all the resources required to run the applications. Included in a group is the virtual server name for the application, which lets a given application run on different systems at any given moment. To get a complete system image containing all the information needed to rebuild the cluster, each physical node must be backed up, not just the virtual nodes that represent the application groups.
Because clusters are inherently dynamic, responsibility for managing data and applications moves from server to server as needed to keep the system running. Such shifting of responsibility is called "failover." And failovers cause uncertainty in backup. A physical node's configuration can change depending on where the virtual node resides. Therefore, an application could execute on one system at the time of backup and on another at the time of restoration. Unless the configuration of the physical nodes during backup is identical to the configuration during restoration, inconsistencies may arise. In the event of a failover that results in changes to the physical configuration of the nodes, there's no guarantee that all critical data is backed up and recoverable.
A further complication arises from the fact that a successful recovery operation requires a consistent set of backup images for each physical node. Shared and local volumes must be restored in a prescribed sequence. Volumes must be set up first, then restored in the proper order.
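The two-phase nature of a cluster restore (set up every volume, then restore in sequence) can be sketched as a simple planning function. The volume names and the particular restore order below are hypothetical; the actual prescribed sequence depends on the cluster and the backup product.

```python
def plan_restore(volumes: list, restore_order: list) -> list:
    """Return restore steps: set up every volume first, then restore
    each one in the order supplied by the administrator."""
    if set(volumes) != set(restore_order) or len(restore_order) != len(set(restore_order)):
        raise ValueError("restore_order must list every volume exactly once")
    steps = [("setup", volume) for volume in volumes]
    steps += [("restore", volume) for volume in restore_order]
    return steps


# Hypothetical volumes and ordering, for illustration only.
volumes = ["node-a-local", "node-b-local", "shared-array"]
order = ["node-a-local", "node-b-local", "shared-array"]
for action, volume in plan_restore(volumes, order):
    print(action, volume)
```

The check at the top rejects a plan that omits or repeats a volume, which is the kind of mistake that leads to an inconsistent recovery.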
Data Protection Strategies
Because clustering is a relatively new technology, solutions for protecting clustered data are still in the evolutionary stage. The remainder of this article deals with available options and offers a projected road map for technologies that can ease the burden of backup in clustered environments in the short term, as well as technologies expected to be available over the next year.
For networks with smaller storage requirements, over-the-network backup is usually the best option. Enterprise environments with large storage requirements, however, should consider directly attached tape backup devices for each node, since over-the-network backup in these sites consumes an unacceptable percentage of bandwidth.
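A rough way to decide between the two approaches is to estimate how long a backup will take at a given sustained throughput and compare that with the backup window. The sketch below does this arithmetic; the data sizes, the 5 MB/sec throughput, and the eight-hour window are made-up numbers for illustration, not measurements.

```python
def backup_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate hours needed to move data_gb at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (data_gb * 1024.0) / throughput_mb_per_s
    return seconds / 3600.0


def fits_window(data_gb: float, throughput_mb_per_s: float, window_hours: float) -> bool:
    """True if the estimated backup time fits inside the backup window."""
    return backup_hours(data_gb, throughput_mb_per_s) <= window_hours


# Illustrative figures only: 5 MB/sec effective throughput, 8-hour window.
for size_gb in (20, 200):
    hours = backup_hours(size_gb, 5.0)
    print(f"{size_gb} GB: {hours:.1f} hours, fits window: {fits_window(size_gb, 5.0, 8.0)}")
```

Under these assumed numbers, the smaller data set finishes comfortably within the window, while the larger one does not, which is the situation that pushes enterprise sites toward directly attached devices.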
In this scenario, each server in the cluster has its own tape backup system. Administrators can manually back up each node in the cluster, but the backup applications don't know that the node is part of a cluster. Directly attached tape backup devices only take a snapshot of the cluster at the time of backup; they don't automatically adapt to changes in cluster configuration caused by events such as failovers.
Clustering Service employs a "Quorum Disk," a virtual volume residing on the shared storage device. This volume contains the mechanism servers use to communicate cluster information with each other and must be backed up along with the contents of local and shared drives. The Quorum Disk is actively owned by one of the clustered nodes at any given time. Backing up physical nodes in the cluster doesn't necessarily guarantee the Quorum Disk is also backed up. Backup of the Quorum Disk must be consistent with the backup images from the other physical nodes. Clustering Service makes this difficult because backup of the Quorum Disk takes place at a different time than backup of the nodes and shared storage system.
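One simple way to reason about consistency is to treat a backup set as suspect if the Quorum Disk image is missing or if a failover occurred between the earliest and latest images in the set. The sketch below encodes that rule. It is an assumption made for illustration, not a documented Clustering Service check, and the image names and timestamps are hypothetical.

```python
def backup_set_is_consistent(image_times: dict, failover_times: list, required: set) -> bool:
    """Return True only if every required image is present and no failover
    happened while the set of images was being taken."""
    if not required <= set(image_times):
        return False  # e.g. the Quorum Disk image is missing
    start = min(image_times.values())
    end = max(image_times.values())
    return not any(start <= t <= end for t in failover_times)


required = {"node-a", "node-b", "quorum"}
images = {"node-a": 100, "node-b": 130, "quorum": 160}  # times in seconds

print(backup_set_is_consistent(images, failover_times=[140], required=required))
print(backup_set_is_consistent(images, failover_times=[200], required=required))
```

The first call reports an inconsistent set because a failover at time 140 fell between the node and Quorum Disk backups; the second reports a consistent set because the failover happened after all three images were complete.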
Solutions for the Short Term
Within the next six months, tape backup software will become "cluster-aware." In other words, the software will understand the dynamics of the clustered environment and will make API calls into the cluster to determine where data resides and how to back it up.
Inconsistent images of a cluster's Quorum Disk and physical nodes can lead to trouble at restoration time. Cluster-aware backup applications will be able to interrogate the cluster to determine its storage configuration and obtain a consistently clean backup. They will also know the proper order in which to restore the various elements to maintain consistency, so the system can be restored with little or no manual intervention.
Choosing among the pool of tape devices attached to the physical nodes in a cluster, the application specifies the proper tape device dynamically at the time of backup. The backup application, because it has kept track of clustered activity, knows exactly where the relevant data is located. This scenario provides optimized backup, doesn't require backup over the network, and frees IS staff from manual intervention in cluster backup operations.
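The device selection described here amounts to looking up which node currently owns an application group and then using the tape device attached to that node. The sketch below shows the idea; the group names, node names, and device names are hypothetical, and a real cluster-aware product would obtain ownership information from the cluster itself rather than from a hand-built table.

```python
def pick_tape_device(group: str, group_owner: dict, node_tape: dict) -> str:
    """Return the tape device attached to the node that currently owns the group."""
    node = group_owner[group]   # which physical node runs the group right now
    return node_tape[node]      # the tape device directly attached to that node


# Hypothetical cluster state before and after a failover of the file-print group.
node_tape = {"Node A": "tape-a", "Node B": "tape-b"}
owners_before = {"file-print": "Node A", "database": "Node B"}
owners_after = {"file-print": "Node B", "database": "Node B"}

print(pick_tape_device("file-print", owners_before, node_tape))
print(pick_tape_device("file-print", owners_after, node_tape))
```

Because the lookup is repeated at backup time, the same job follows the application group to whichever node owns it, so the data never has to cross the production network.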
The most cost-effective and efficient cluster backup solution will be achieved by building on prior developments--cluster-aware backup software and intelligent directly attached backup devices--to allow the cluster to share a single backup device, such as a tape library. While such a connection could be achieved over existing technology such as SCSI, future clustered servers are likely to be attached to each other and to shared storage and backup devices via a separate network--called a storage area network (SAN)--using Fibre Channel technology.
The primary advantages of a SAN include shared access to storage, a dedicated high-bandwidth network separate from the production network, and the basis for building additional intelligence into storage management. With the advent of efficient, affordable SAN technologies and cluster-optimized backup control software, network planners will finally be able to take advantage of the best of both worlds: clustered servers backing up to a single SAN-attached tape library system.
[Figure: In most backup configurations, each server requires its own backup device.]
[Figure: In an NT cluster configuration, if Node A fails, Microsoft's Clustering Service directs Node B to assume control over Node A's applications. Meanwhile, Node B continues to run its own applications.]
[Figure: SAN-based backup offers a number of advantages: lower costs, shared access to storage, and a dedicated high-bandwidth network separate from the production network.]
Jeff DiCorpo is R&D project manager at Hewlett-Packard's storage systems division facilities in Santa Clara, CA.