Disaster recovery begins at the disk drive level

By Steve Schwartz

The task of enterprise-wide data protection is daunting. IT organizations have been forced to view their existing disaster-recovery and business continuity plans in the larger context of information and network security.

Data protection is at the core of a good information security and disaster-recovery plan. However, solutions must be comprehensively designed and implemented. To fully protect your corporate data, you have to look closely at current procedures, identify potential points of failure, and determine an acceptable cost/benefit ratio in addressing your challenges.

This article may serve as a refresher course for some readers, but at a time when it is more important than ever to keep your data safe from a wider range of potential threats, it may be beneficial to review these issues.

Murphy's Matrix

All organizations are subject to points of failure in their enterprise data systems. Think of it as "Murphy's Law meets the data center." Failures can range from simple human error and hardware issues to deliberate attempts to destroy data. "Murphy's Matrix" outlines common protection strategies and potential pitfalls (see table).

Data-protection planning must be performed from an enterprise-wide perspective. Points of attack are present throughout your network; viewing potential liabilities from a department-only standpoint is insufficient. By taking an enterprise perspective, you can more easily recognize your organization's potential vulnerabilities and do a better job of protecting your data.

Hardware fail-over strategies

It's not a matter of if, but when, your hardware will fail. Even the best-designed hardware comes with an MTBF (mean time between failures) rating. The most common failure is a head or disk crash.

While there have been improvements in this area by drive manufacturers, with rapid positioning of heads to zero position on normal power down, not all disk drives have this technology. Heat is also a major cause of hard disk failure. Blocked fans and inadequate cooling can shorten the life span of disks.

Another common situation is the "power-on syndrome," which refers to restarting a computer before its disks have stopped spinning from a previous shutdown. This can cause components to burn out and fail because of a voltage spike caused by residual voltage in the circuits.

To minimize hard disk problems, first recognize the most common reasons disk drives crash and implement policies to prevent them. It can be something as simple as a 10-second rule: No computer or hard disk system may be turned on unless at least 10 seconds have elapsed from the time the unit was turned off. This gives the components and circuitry adequate time to fully discharge, lessening the potential power stress of turning the system back on.

Second, keep spare drives on-site. This may seem obvious, yet many shops using RAID or mirroring still do not have adequate replacement drives to cover common failures.
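
As a rough way to size that spare pool, an MTBF rating can be turned into an expected number of failures per year. The sketch below assumes a constant failure rate and immediate replacement of failed drives. The drive count, MTBF figure, and restocking interval are hypothetical inputs, not vendor data, and field failure rates can differ from rated figures.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance that one drive fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(drive_count: int, mtbf_hours: float) -> float:
    """Expected failures per year across a fleet whose failed drives are replaced."""
    return drive_count * HOURS_PER_YEAR / mtbf_hours


def suggested_spares(drive_count: int, mtbf_hours: float, restock_days: int) -> int:
    """Spares needed to cover the expected failures in one restocking interval.

    Rounds up and never returns fewer than one spare.
    """
    expected = expected_failures_per_year(drive_count, mtbf_hours) * restock_days / 365
    return max(1, math.ceil(expected))


if __name__ == "__main__":
    # Hypothetical shop: 1,000 drives rated at 500,000 hours, restocked every 90 days.
    print(expected_failures_per_year(1000, 500_000))
    print(suggested_spares(1000, 500_000, 90))
```

The result is an average, not a guarantee. A shop that cannot tolerate waiting for a shipment may want to hold more spares than the estimate suggests.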

The problem with mirrors

Other than following basic preventive measures, what is the best way to protect your data against hard disk failures? One approach is to mirror the system. This is simple to do through either software or hardware.

If the primary hard disk fails, you can run off the mirror without missing a beat. However, there are several potential problems. Mirrors replicate bad data as well as good data. Since the most common cause of a boot-up failure is software failure or human error, the mirror only replicates the problem.

The second drawback to mirrors is that they are not cost-effective in terms of disk space, even with the rapidly dropping price of disks. Mirrors typically cost as much as the primary disk space. In some companies, as much as 50% of disk space is not used. As such, protecting data with mirroring can be expensive because you're purchasing much more disk space than is actually used.
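
The arithmetic behind that cost can be sketched in a few lines. The function name and the sample figures are illustrative assumptions. The point is only that mirroring doubles whatever capacity has been provisioned, used or not.

```python
def raw_capacity_needed(data_gb: float, utilization: float, mirrored: bool) -> float:
    """Raw disk capacity to purchase in order to hold data_gb of actual data.

    utilization is the fraction of provisioned space that really holds data
    (0.5 means half of the provisioned space sits empty).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be greater than 0 and at most 1")
    provisioned = data_gb / utilization
    # A mirror needs a second copy of every provisioned gigabyte.
    return provisioned * (2 if mirrored else 1)


if __name__ == "__main__":
    # Hypothetical case: 100 GB of real data on disks that are only half used.
    print(raw_capacity_needed(100, 0.5, mirrored=False))
    print(raw_capacity_needed(100, 0.5, mirrored=True))
```

With half of the provisioned space unused, a mirrored system buys four gigabytes of raw disk for every gigabyte of data it actually protects.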

RAID to the rescue

Mirroring is itself a RAID level (RAID 1). In terms of disk space purchased, parity-based RAID levels such as RAID 5 are less expensive than mirroring, typically requiring only a 20% to 33% premium in disk costs, while still protecting against the failure of a single drive. If a drive fails in such a configuration, there is enough redundancy to deliver the information from the remaining disks.
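
A 20% to 33% premium is consistent with one parity disk protecting five or three data disks, respectively. The redundancy itself can be illustrated with XOR parity, the technique used by parity-based levels such as RAID 5. What follows is a minimal sketch with made-up block contents and illustrative function names. It is not a model of any particular controller.

```python
from functools import reduce


def parity_premium(data_disks: int, parity_disks: int = 1) -> float:
    """Extra disk purchased for redundancy, as a fraction of the data disks."""
    return parity_disks / data_disks


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    # Three data disks, each holding one 4-byte block (made-up contents).
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)  # stored on a fourth disk

    # Disk 1 fails: XOR the surviving data blocks with the parity block.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
    print(parity_premium(3), parity_premium(5), parity_premium(1))
```

Because XOR is its own inverse, any single missing block can be recomputed from the others. A second failure before the rebuild completes cannot be recovered this way.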

Although RAID is cost-effective from a strictly disk storage point of view, it does entail additional costs such as enclosures and a controller. Most high-end users will pay a premium for the full range of hardware and software required to deliver a robust RAID solution. On the low end, ATA/IDE RAID is becoming very affordable, and many motherboard manufacturers even include RAID control as a built-in feature. Generally, RAID is a very cost-effective solution for hardware fail-over.

Problems with RAID

However, RAID is not a panacea. RAID crashes occur for various reasons, including the following:

  • Central cable failure, resulting in data getting scrambled across all RAID systems;
  • Problems in RAID card BIOS, causing scattered bits of corruption across multiple drives;
  • Replacement drives installed with the wrong SCSI parity setting (does not match that of other drives);
  • RAID card does not rebuild the data correctly;
  • Disk drive fails, and a user pulls out a good drive rather than the faulty one, causing the RAID system to become irreversibly corrupted; and
  • Replacement drive is a different size than the other drives in the RAID set, causing problems with rebuilds.
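
Two of these failures, pulling the wrong drive and installing a mismatched replacement, are procedural and can be caught with a simple check before anyone touches the hardware. The sketch below is hypothetical throughout: the Drive record, its fields, and the precheck_swap function are illustrative names, and a real check would read drive status from the RAID controller's own management tools.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    bay: int          # physical slot number (hypothetical inventory field)
    capacity_gb: int  # rated capacity
    failed: bool      # status as reported by the controller


def precheck_swap(raid_set: list[Drive], bay_to_pull: int,
                  replacement_capacity_gb: int) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    by_bay = {drive.bay: drive for drive in raid_set}

    target = by_bay.get(bay_to_pull)
    if target is None:
        problems.append(f"bay {bay_to_pull} is not part of this RAID set")
    elif not target.failed:
        problems.append(f"bay {bay_to_pull} holds a healthy drive; do not pull it")

    # Flag any replacement whose size differs from every drive in the set.
    sizes = {drive.capacity_gb for drive in raid_set}
    if replacement_capacity_gb not in sizes:
        problems.append(
            f"replacement is {replacement_capacity_gb} GB; set uses {sorted(sizes)} GB"
        )
    return problems


if __name__ == "__main__":
    raid_set = [Drive(0, 73, False), Drive(1, 73, True), Drive(2, 73, False)]
    print(precheck_swap(raid_set, bay_to_pull=0, replacement_capacity_gb=36))
```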

In short, RAID is not the magic bullet to solve hard disk crashes. On the high end, though, a well-designed RAID system should be an essential part of any company's data-protection strategy.

Snapshot mirrors

One recent backup solution blends mirroring and snapshots: The core operating system is copied to a removable disk drive, and the copy is refreshed automatically every day.

Snapshot mirrors solve the problem of a mirror becoming rapidly corrupted due to a memory problem, human error, or virus attack. Since the snapshot mirror is taken when the system is operating, if a fail-over operation is needed, it performs like the primary system did when the snapshot was taken.

Most operating systems are self-maintaining, and it does not hurt to move them back 24 hours. You may lose a few log files, but generally this is not critical.
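
The daily-duplication idea can be approximated at the file level. The sketch below copies a directory tree into a dated folder and keeps only the most recent copies. It is an illustration under stated assumptions, not the product described above: the paths and the retention count are hypothetical, and a plain file copy is not a bootable image of a running system.

```python
import shutil
from datetime import date
from pathlib import Path


def take_snapshot(source: Path, snapshot_root: Path, day: date, keep: int = 7) -> Path:
    """Copy source into snapshot_root/<YYYY-MM-DD> and prune older copies.

    keep is the number of dated snapshots to retain (at least 1).
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")

    target = snapshot_root / day.isoformat()
    if target.exists():
        shutil.rmtree(target)         # replace a snapshot already taken that day
    shutil.copytree(source, target)   # also creates snapshot_root if it is missing

    # ISO dates sort chronologically, so the oldest snapshots come first.
    snapshots = sorted(p for p in snapshot_root.iterdir() if p.is_dir())
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return target
```

Keeping several dated copies rather than one is a design choice: if a problem goes unnoticed for a day, the most recent snapshot may already contain it, and an older copy is still available.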

Clusters and backup servers

Other types of hardware failure may affect application availability, such as a bad memory chip, memory controller chip, or power supply. You can protect against these failures either by keeping a standby backup server or by clustering. Both use redundancy to provide continuous application services to users.

How do you decide if you need a method of application service protection? Consider these factors:

  • Cost of lost data: How important is the most recent data?
  • Critical data-protection window: This is the interval of time elapsing between your last backup and the time the system crashes. The longer this interval, the more costly the data loss.
  • Cost of lost application services: How much new and current business is lost due to unavailability of application services?

Look at the data-protection window in terms of both application services and the data itself. Your business is always inputting customer data into its information systems. Determine the amount of data your company can truly afford to lose.
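
These factors can be combined into a rough estimate. The sketch below measures the data-protection window as the time between the last backup and the crash, then prices both the lost data and the lost application services. The hourly cost figures are hypothetical inputs that each business would have to supply, and the linear cost model is a simplifying assumption.

```python
from datetime import datetime


def protection_window_hours(last_backup: datetime, crash: datetime) -> float:
    """Hours of data at risk: time elapsed between the last backup and the crash."""
    if crash < last_backup:
        raise ValueError("crash time precedes the last backup")
    return (crash - last_backup).total_seconds() / 3600


def estimated_loss(window_hours: float, data_cost_per_hour: float,
                   downtime_hours: float, service_cost_per_hour: float) -> float:
    """Cost of re-creating lost data plus business lost while services are down."""
    return window_hours * data_cost_per_hour + downtime_hours * service_cost_per_hour


if __name__ == "__main__":
    # Hypothetical incident: backup at 2:00 a.m., crash at 2:30 p.m., 4 hours of downtime.
    window = protection_window_hours(datetime(2002, 8, 1, 2, 0),
                                     datetime(2002, 8, 1, 14, 30))
    print(window)
    print(estimated_loss(window, 200, 4, 1000))
```

Comparing this estimate with the price of a standby server or a cluster gives a first approximation of whether the extra protection pays for itself.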

Now that you have addressed issues such as points of failure, the next step is to focus on a backup/restore and bare-metal recovery strategy. We'll examine those issues in a future article, along with a look at off-site backup versus off-site replication and security issues.

Steve Schwartz is president and CEO of Unitrends (www.unitrends.com) in Myrtle Beach, SC.

This article was originally published on August 01, 2002