By Phil Goodwin
Cataclysmic events are often required to permanently change human behavior and cultures. Even corporate cultures can be slow to adapt without a sufficient catalyst. The terrorist attacks in 2001 caused an immediate, serious assessment by nearly every IT organization of their disaster preparedness. Yet, two years after these events, most organizations have only made modest incremental improvements to their disaster preparedness. In fact, disaster-recovery (DR) projects are often the first line items to be crossed off during budgetary tradeoffs, leaving many IT organizations (ITOs) to employ the Clint Eastwood DR strategy: "Do you feel lucky?"
Those organizations that have performed disaster-preparedness assessments have discovered significant disaster vulnerabilities, such as single-site systems, third-party supplier inadequacies, and incomplete off-site data storage. Such self-assessments are often the most cost-effective method to begin improving DR preparedness.
In fact, inadequate preparedness is often more related to process than product, and a process improvement project may yield significant benefits without a single budgetary expenditure (other than staff time). Fortunately, DR preparedness can be improved incrementally and with relatively modest expenditures.
Unfortunately, there is no single step or product that can render a company disaster-ready. Organizations need to develop a DR plan based on a continuum of products and processes, much like an investment portfolio. That continuum of solutions should address the full range of potential losses, from small to large. In other words, the goal is a flexible, resilient system. After all, a DR plan is really an insurance policy: a small investment now can avoid a catastrophic loss later.
Establishing a portfolio
When creating a personal investment portfolio, a needs analysis is the first step. What expenses must be met? How many kids are going to college? What resources will be needed for retirement? What timeframe must be met? Once the needs are determined, an appropriate investment portfolio can be created. Generally, the plan uses a building-block approach, which starts out small and grows and diversifies over time. Similarly, a DR portfolio reflects the needs and resources of the organization using the same building-block approach.
The foundation for this transition to a highly resilient system is the move to networked storage (e.g., storage area networks and network-attached storage), which will account for 70% of all data-center storage by 2007-08. Networked storage facilitates broad connectivity, which can be used for real-time remote replication. Networked storage capabilities combined with bandwidth costs declining through 2005 will make remote operations widely affordable. In addition, storage virtualization maturation over the next 24 to 36 months will enable real-time policy-driven replication and transparent recovery at remote sites. This transparency will include automated failure detection and re-routing of data traffic to the most available device.
Networked storage also facilitates a range of recovery mechanisms. The base level of DR for any storage architecture is keeping duplicate tapes at an off-site location. The second level of recovery, which is usually but not necessarily associated with networked storage, is synchronous or asynchronous replication between sites.
The third level of recovery, most often associated with networked storage, is local replication (e.g., metadata snapshots and physical replication). Although the third level of recovery does not protect an organization from site disaster scenarios, it does provide the fastest possible recovery from the most common causes of data loss (e.g., user error, data corruption, or device failure).
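The three recovery levels described above can be sketched as a simple data structure. The tier names and attributes below are illustrative, not terminology from any particular product:

```python
from dataclasses import dataclass

@dataclass
class RecoveryTier:
    """One tier in a DR portfolio (names and fields are illustrative)."""
    name: str
    mechanism: str
    protects_against_site_loss: bool

# Hypothetical encoding of the three recovery levels.
TIERS = [
    RecoveryTier("off-site tape", "duplicate tapes stored off-site", True),
    RecoveryTier("remote replication",
                 "synchronous or asynchronous copy to a second site", True),
    RecoveryTier("local replication",
                 "snapshots/clones on the local array", False),
]

# Local replication recovers fastest from common failures (user error,
# corruption, device failure) but does not survive a site disaster.
site_safe = [t.name for t in TIERS if t.protects_against_site_loss]
```

A portfolio mixes all three: the lower tiers guard against site loss, while local replication covers the most frequent, smaller-scale losses.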
Matching risks to infrastructure
As mentioned above, the basis for a recovery portfolio should be an assessment of an organization's specific risks. This should include both generalized risks (e.g., user error, system failure, and fire) and regionalized risks (e.g., earthquakes, tornadoes, and hurricanes). While a 30-mile (50km) separation may be sufficient to protect data centers from loss due to tornadoes, it may be insufficient in hurricane-prone locales.
After categorizing the risks, ITOs should evaluate them relative to the application environment. Identifying critical applications naturally leads to the critical data, and storage resource management tools can help identify the data dependencies. As a best practice, we recommend dividing applications/data into four categories:
- Mission- and time-critical data;
- Organization-critical data (e.g., important data but not time-sensitive);
- Operational data (e.g., non-critical); and
- Remote/mobile data.
Thereafter, recovery policies and technologies can be matched to the relative importance of the data and associated cost metrics (see table). These policies establish the methodology and service-level requirements for each category of data.
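The matching of categories to policies can be sketched as a lookup table. The replication methods follow the levels discussed earlier, but the recovery-time targets below are hypothetical placeholders, not figures from this article's research:

```python
# Illustrative policy matrix; the rto_hours values are hypothetical
# and would be set per-organization against cost metrics.
POLICIES = {
    "mission/time-critical": {"replication": "synchronous remote",  "rto_hours": 4},
    "organization-critical": {"replication": "asynchronous remote", "rto_hours": 24},
    "operational":           {"replication": "off-site tape",       "rto_hours": 72},
    "remote/mobile":         {"replication": "replicate to remote site", "rto_hours": 96},
}

def policy_for(category: str) -> dict:
    """Return the recovery policy (method and service level) for a data category."""
    return POLICIES[category]
```

The point of such a matrix is that each category carries its own service-level requirement, so spending concentrates where recovery time matters most.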
Creating business continuance
As business requirements become more stringent and data-center operations become more sophisticated, DR will become an extension of data recovery rather than a separate practice as it is now. By 2005, four-hour recovery will be considered best practice for critical applications; by 2008, transparent recovery will be the standard among Global 2000 organizations. For mission-critical applications, synchronous remote replication will be required to achieve this level of transparency.
Transparent operation will be relatively straightforward in file-oriented systems, where communications latency is unimportant. Commercial relational database management systems (RDBMSs), however, will require modification. RDBMSs currently suffer significant performance degradation from communications latency when synchronous replication spans more than 100 miles. The latency arises because the primary site must wait for a commit acknowledgment from the secondary site before continuing. Therefore, RDBMS vendors must develop capabilities that support long-distance remote operations while maintaining transactional integrity. These databases must also support instant fail-over between sites in active/passive clustered configurations.
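A back-of-the-envelope calculation shows why distance matters for synchronous commits. Assuming signals travel through fiber at roughly 200,000 km/s (about two-thirds the speed of light) and ignoring switching and queuing delay, the minimum added delay per commit is one round trip to the remote site:

```python
# Rough estimate of the synchronous-replication commit penalty.
# Assumes ~200,000 km/s propagation in fiber and ignores switching,
# queuing, and protocol overhead, so real-world delay is higher.
FIBER_KM_PER_MS = 200.0  # ~200 km of fiber per millisecond

def commit_delay_ms(distance_km: float) -> float:
    """Minimum added delay per commit: one round trip to the secondary site."""
    return 2 * distance_km / FIBER_KM_PER_MS

# 100 miles is about 160 km, so each commit waits at least ~1.6 ms
# for the remote acknowledgment before the primary can continue.
delay = commit_delay_ms(160)
```

A couple of milliseconds per commit is negligible for batch work but compounds badly in transaction-heavy workloads that commit thousands of times per second, which is why the 100-mile figure is a practical rather than absolute limit.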
Organization-critical data shares many characteristics with mission-critical data but does not demand the same time-based recovery service level: lower-cost asynchronous replication is sufficient where mission-critical data requires synchronous replication. Nevertheless, given the likely decline in bandwidth costs and the probable improvement in RDBMS capabilities, the distinction will be immaterial by 2008.
While most organizations consider end-user systems and data to be non-critical in nature, the destruction of the 2001 terrorist attacks highlighted the fact that, in many cases, data on laptops can be organization-critical. ITOs have been reluctant to include end-user systems in their data-protection strategies, largely because they have had little control over how those systems are backed up and because of the incremental personnel cost of covering another 10,000 systems in larger organizations. However, by 2004-05, we expect most ITOs to assume control of protecting the data on end-user systems and to provide a recovery mechanism for those systems.
Because end-user systems usually reside outside the data center, disaster-recovery plans typically omit them. Yet the company-wide loss of individual users' data could halt business operations for some organizations just as surely as the loss of key applications. The typical methods used to protect desktop and laptop systems (e.g., removable floppy/CD-ROM drives or replication to a LAN server) are therefore insufficient to protect the organization from the loss of an entire site. This data should be replicated or moved to a remote site (via disk or tape) using a commercially available management product. Generally, time to recovery is not the critical issue for user data: although recovery of this data can be measured in days, it must be ensured to avoid the loss of significant intellectual property.
Like user data, security must not be overlooked in the context of disaster recovery. DR operations can open a critical back door that most organizations miss. DR tapes usually include everything necessary to re-establish a system (e.g., operating system, application, and data), yet not all backup-and-recovery (B/R) software contains sufficient security mechanisms to prevent those tapes from being loaded on unauthorized systems. There may be no security at all, or the "security" may amount to requiring that the target system be licensed for the B/R software. Because B/R vendors routinely make fully functional demonstration versions of their software available on the Web, that protection is effectively non-existent. ITOs should verify for themselves that their B/R software does not contain this breach.
IT organizations should begin with a thorough, honest self-assessment of disaster preparedness, focusing especially on processes. The second step is to characterize the data-loss risks faced by the organization, their likelihood, and the consequences. After these steps, creating a recovery portfolio becomes a relatively mechanical process and can be balanced against financial realities. As with investors, the goal is not immediate results but rather permanent, continual improvement.
Phil Goodwin is senior program director, specializing in storage infrastructure issues, at the META Group (www.metagroup.com) in Stamford, CT.