Mapping a Disaster Recovery Plan

A bulletproof disaster recovery plan requires backup and restore, disk disaster recovery, off-site media management, virus protection, data replication, and more.

T. M. Ravi

Computer Associates International, Inc.

People don't plan to fail, they just fail to plan. Disaster planning is key to avoiding the loss of data, loss of access to data, and contamination of data in the event of a natural or man-made disaster.

What constitutes a disaster? In this context, a disaster is any event that prevents corporate employees from accessing business-critical IT functions or data. Examples include not only such events as hurricanes, earthquakes, and floods, but also user error, employee sabotage, and computer virus attacks.

A company's information system is the engine that drives its business, and data is its most valuable asset. A disaster of any type can result in lost employee productivity and lost revenue. In fact, the statistics are grim. The chances of a disaster striking your company are 1 in 100, and half of all companies that experience outages of 10 days or more are out of business within five years of the outage, according to FIND/SVP. In the case of the World Trade Center bombing in 1993, only 150 of the original 350 companies were in business a year later.

Equally alarming are the estimated costs of replacing data once it's been destroyed. The National Computer Security Association in Carlisle, PA, estimates that a mere 20 MB of engineering data costs $93,000 to replicate. Downtime is even more expensive. The Yankee Group consulting firm in Boston, MA, estimates that one hour of downtime costs businesses anywhere from $1,000 to $50,000.

The solution? A solid disaster recovery plan--one that includes backup and restore, disk disaster recovery, off-site media management, virus protection, and data replication.

Backup and Restore: The Basics

A common misconception is that LAN disaster prevention and recovery simply means performing proper backups and recovering from file-server crashes. In reality, however, LAN disaster prevention and recovery must deal with anything that can affect the continuing operation of a network and its components. Reliable backup and recovery software is essential to any complete disaster recovery program. Without this first step, a disaster recovery/prevention plan is worthless.

Because backup software plays an integral role in your disaster recovery plan, it should be strong in the areas of data integrity and media management and it should support various media types. Open file backup--that is, the ability to back up or restore a file with integrity while the file is being modified by an application--should also be considered, especially in organizations where around-the-clock access to data is critical.

In addition to allowing you to modify the type of daily backup (full, differential, or incremental), your software should allow you to configure a custom rotation cycle, to view a calendar to set days for backup, to handle exceptions, and to save your configuration as a job script for repeated use.
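A rotation cycle of this kind can be sketched in a few lines of Python. This is an illustration only, not any vendor's scheduler: the function name `build_rotation`, the weekly-full policy, and the way exception days are handled are all assumptions made for the example.

```python
from datetime import date, timedelta

def build_rotation(start, days, full_weekday=4, daily_type="incremental", skip=()):
    """Return a list of (date, backup_type) pairs for a simple rotation.

    Hypothetical policy: a full backup on one weekday (default Friday,
    weekday 4), the chosen daily type on every other day, and no job
    on the dates listed in skip (the "exceptions").
    """
    if daily_type not in ("incremental", "differential", "full"):
        raise ValueError("unknown backup type: " + daily_type)
    schedule = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        if day in skip:
            continue  # exception day: no backup scheduled
        kind = "full" if day.weekday() == full_weekday else daily_type
        schedule.append((day, kind))
    return schedule

if __name__ == "__main__":
    # One week starting Monday, October 6, 1997, with Wednesday skipped.
    for day, kind in build_rotation(date(1997, 10, 6), 7, skip={date(1997, 10, 8)}):
        print(day.isoformat(), kind)
```

Saving the arguments passed to such a function is, in miniature, what the article means by keeping a configuration as a job script for repeated use.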

Your backup software should also provide tools for device management, maintenance, and media retirement. Media support includes support for multiple tape devices, autochangers, and tape groups or arrays. In addition, the software should maintain a history of activity for devices and media.

Application and database servers present special challenges to administrators because files on these servers are often locked and inaccessible to standard backup systems. Further complicating matters, these servers usually house mission-critical data and must be accessible to users on a 7 x 24 basis. Backup software thus needs to be able to back up and restore database and messaging servers, as well as Web servers and line-of-business application servers, while they are on-line.

Several other backup technologies should also be considered when evaluating storage management tools. Tape RAID utilities, for instance, allow multiple servers to be backed up simultaneously to one tape device. When used with two tape drives, RAID software provides tape mirroring and an added layer of fault tolerance. In addition, image backup products take a "snapshot" of server data, bypassing the file system and reading data directly for very high-speed backup and restore.

Backing up data is only half of the equation: Tape software also needs to have flexible restore options. A backup software package should allow you to retrieve your data by directory, tape session, query (search), media, and image.

Disk Disaster Recovery

Keep in mind that a disaster typically results in one of two scenarios: the loss of data or the loss of access to that data.

The backup process is a solution to lost data. If data is lost, you can perform a restore operation from tape to recover it. While traditional backup strategies will help you recover data from a head crash, they are not adequate for performing a total system recovery. Why? Because even to load the backup/recovery software, you need the same operating system, operating system configuration, tape drive, and tape backup software, and the same number and type of disk drives.
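The dependency list above lends itself to a simple pre-restore check. The sketch below is hypothetical: the field names in `REQUIRED_KEYS` and the function `restore_blockers` are invented for illustration, and a real inventory would record far more detail.

```python
# Items that must match before a traditional tape restore can even start.
# These field names are assumptions for this example, not a product schema.
REQUIRED_KEYS = ("os", "os_config", "tape_drive", "backup_software", "disk_layout")

def restore_blockers(original, replacement):
    """Return the items that differ between the original and replacement system."""
    return [key for key in REQUIRED_KEYS
            if original.get(key) != replacement.get(key)]

if __name__ == "__main__":
    # Made-up inventory values.
    original = {"os": "ExampleOS 1.0", "os_config": "cfg-a", "tape_drive": "drive-x",
                "backup_software": "backup 2.0", "disk_layout": "2 x 4GB"}
    replacement = dict(original, tape_drive="drive-y", disk_layout="1 x 9GB")
    print(restore_blockers(original, replacement))
```

An empty list means the replacement system matches on every recorded item; anything else has to be resolved before the tape can be read back.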

Loss of access encompasses everything from a crashed server to not being able to get into your office or building. In the event of a disaster that prevents you from accessing data, you will be faced with the task of reconstructing your data center and attempting to restore your backup tape onto a new system. Loss of access to data needs a different type of solution--one that will bring your server back to an operational state.

Bootable-disk disaster recovery can be a valuable tool in your disaster recovery arsenal. It's inexpensive and it's one of the simplest disaster recovery methods to implement. With disk disaster recovery, you create a set of diskettes--typically just four or five--that allow you to boot your server and backup software into a recoverable state.

Boot disk recovery software can also be used to easily generate data set and configuration information for spare "cold-site" servers.

In the event of a disaster, a system administrator would follow these basic steps:

1) Replace the necessary hardware.

2) Boot up into the boot-disk recovery program.

3) Follow the wizard through the steps for rebuilding the hard drive, creating and formatting partitions, and restoring the network operating system and vital configuration information. When this is done, the backup application restores the last full backup.

Off-Site Media Management

Partial restores of data should be conducted at least once a month, as a matter of policy, and backup tapes should be taken off-site--ideally, outside potential "disaster boundaries." If possible, tapes should be delivered to off-site vaulting locations once a week. In-house storage procedures are highly effective in covering short-term disruptions, but they leave LANs vulnerable in the event of a natural disaster or theft.
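A policy like this is easy to audit automatically. The following Python sketch flags overdue tasks using the two intervals stated above (monthly restore tests, weekly off-site delivery). The constant and function names are hypothetical, and treating "once a month" as 31 days is an assumption.

```python
from datetime import date, timedelta

# Intervals taken from the policy described above; the names are hypothetical.
RESTORE_TEST_INTERVAL = timedelta(days=31)  # partial restore at least monthly
OFFSITE_INTERVAL = timedelta(days=7)        # tapes vaulted off-site weekly

def overdue_tasks(today, last_restore_test, last_offsite_delivery):
    """Return the list of policy tasks that are overdue as of today."""
    overdue = []
    if today - last_restore_test > RESTORE_TEST_INTERVAL:
        overdue.append("partial restore test")
    if today - last_offsite_delivery > OFFSITE_INTERVAL:
        overdue.append("off-site tape delivery")
    return overdue

if __name__ == "__main__":
    print(overdue_tasks(date(1997, 10, 1), date(1997, 8, 15), date(1997, 9, 29)))
```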

Virus Protection

Until recently, virus infections only threatened data residing on storage media, such as hard drives and floppy disks. Today, viruses can be spread through e-mail, the Internet, and the sharing of document libraries. It is estimated that macro viruses reside in the computing systems of over 90% of all companies.

Effective enterprise-level anti-virus security is now possible, however, through the deployment of a wide range of software. Since viruses are detected by anti-virus software in two ways--through a full scan of hard drives or in real-time as each file is accessed--your software must offer both options. The software should also offer automated virus signature file updates and should support multiple platforms.
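The two detection modes can share one check, as the simplified Python sketch below shows. It is a toy under loud assumptions: the signature table is invented, matching a raw byte pattern is far cruder than what commercial scanners do, and `check_file` only stands in for the call a real-time hook would make when a file is accessed.

```python
import os

# Made-up signature table for illustration; real products ship large,
# regularly updated signature files.
SIGNATURES = {"Example.Test": b"EXAMPLE-VIRUS-PATTERN"}

def check_bytes(data, signatures=SIGNATURES):
    """Return names of signatures whose byte pattern occurs in data."""
    return [name for name, pattern in signatures.items() if pattern in data]

def check_file(path, signatures=SIGNATURES):
    """Check one file; the kind of call a real-time hook would make on access."""
    with open(path, "rb") as handle:
        return check_bytes(handle.read(), signatures)

def full_scan(root, signatures=SIGNATURES):
    """Walk a directory tree and return {path: [signature names]} for hits."""
    hits = {}
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            try:
                found = check_file(path, signatures)
            except OSError:
                continue  # unreadable file: skipped in this sketch
            if found:
                hits[path] = found
    return hits
```

Because both modes consult the same table, an automated signature update amounts to replacing `SIGNATURES` in one place.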

The most effective strategy for virus security across the network combines an enterprise management approach with anti-virus software that can be managed from a central location. More robust anti-virus solutions not only clean up after an infection but also prevent the initial infection from ever reaching users.

Effective deployment of enterprise-wide anti-virus software also requires ongoing vigilance and management's attention. Although an anti-virus strategy must be supported by technical experts and powerful software, the ultimate responsibility for maintaining a virus-free enterprise rests on the shoulders of IT management.

Data Replication

Although disaster recovery products simplify and speed up the recovery of a failed server, the process typically takes several hours to complete during which time users cannot access their data. This potential downtime is unacceptable to many users, particularly when access to mission-critical systems is involved.

Data replication products are designed to fill this gap. A secondary server mirrors the data of the primary server and immediately "stands in" for the primary server if the primary should fail. When the primary server is fixed, the secondary server re-syncs files to their current state and the primary server resumes its original tasks.

With this type of utility, the changes made to the primary server are replicated to the standby server in real-time to provide server fault-tolerance. This replication should occur at the transaction level so that only actual changes--not whole files--are replicated.
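The difference between shipping changes and shipping whole files can be shown with a toy model in Python. This is a sketch under stated assumptions, not how any replication product is built: the `Primary` and `Standby` classes are hypothetical, data is a simple in-memory dictionary, and synchronization is triggered by an explicit call rather than happening continuously over a network.

```python
class Primary:
    """Toy primary server: applies writes and records each change in a log."""

    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Standby:
    """Toy standby: replays only the log entries it has not yet seen."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries already applied

    def sync_from(self, primary):
        pending = primary.log[self.applied:]
        for key, value in pending:
            self.data[key] = value
        self.applied += len(pending)
        return len(pending)  # how many changes were shipped


if __name__ == "__main__":
    primary, standby = Primary(), Standby()
    primary.write("orders", 1)
    primary.write("invoices", 2)
    standby.sync_from(primary)
    primary.write("orders", 3)
    # Only the single new change is shipped on the second sync.
    print(standby.sync_from(primary), standby.data == primary.data)
```

Tracking a position in the change log is what lets the standby catch up with only the new changes.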

Some final statistics: On average, a company's systems will go down about nine times each year, with each outage lasting an average of four hours and costing approximately $330,000, according to FIND/SVP. Given these stats, a good backup and disaster recovery plan is more than a security blanket for network administrators; it provides the protection needed to recover critical computing resources, often ensuring a company's very survival.
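Taken at face value, the FIND/SVP averages imply a yearly exposure that is simple to work out. The Python snippet below only multiplies the three figures quoted above; the function name is hypothetical, and real outages will of course vary around these averages.

```python
OUTAGES_PER_YEAR = 9         # "about nine times each year"
HOURS_PER_OUTAGE = 4         # "an average of four hours"
COST_PER_INCIDENT = 330_000  # "approximately $330,000" per incident

def annual_exposure(outages=OUTAGES_PER_YEAR,
                    hours=HOURS_PER_OUTAGE,
                    cost=COST_PER_INCIDENT):
    """Return (downtime hours per year, downtime cost per year in dollars)."""
    return outages * hours, outages * cost

if __name__ == "__main__":
    hours, dollars = annual_exposure()
    print(hours, "hours of downtime,", dollars, "dollars per year")
    # 9 * 4 = 36 hours; 9 * 330,000 = 2,970,000 dollars
```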

Disk drives, computer hardware, and software are the most common causes of system outages, according to a survey of Fortune 1000 companies. (Due to some companies reporting multiple causes, percentages add to over 100%.)

More than 40% of Fortune 1000 companies surveyed have experienced at least one system outage lasting between 4 and 8 hours.

T.M. Ravi is a vice president of marketing at Computer Associates International Inc. in Islandia, NY.

This article was originally published on October 01, 1997