Disaster-recovery management comes of age

By Christine Taylor

Corporate disaster recovery (DR) for critical applications is a big, serious business, or it should be. But the poor state of many critically important DR environments says otherwise. It’s not that the elements of a DR architecture are missing. Large enterprises commonly invest in replicating data between geographically remote primary and secondary hosts so that if the primary site goes down, the secondary site will fail over flawlessly. Unfortunately, that flawless fail-over degenerates over time to a flawed, or even non-existent, fail-over. Why? Because changes to the primary site are not made to the secondary site. Over time, minor changes to the primary environment (adding a volume here, updating an application there) cause the two sites to diverge to the point that recovery objectives can no longer be met.

DR testing helps to uncover the gaps. But such testing is costly and disruptive, and the dread of DR testing tempts IT to avoid it altogether. Yet without consistent DR testing, recovery configuration failure rates between primary and secondary sites can easily reach 75% or more during the course of a year. If the company has delayed its testing longer than that, or even skipped it entirely, then the failure rate and the risk will only grow. And even when companies do test, the complex dependencies between primary and secondary sites make mitigation difficult and uncertain. But without comprehensive testing, vulnerable or even failed replication continues without anyone knowing it, until IT goes to restore data from the secondary system and cannot.

Yet the challenge of managing complex DR environments is real, and environmental complexity can cripple manual change management. Critical production data is produced by different applications, housed on different storage devices, protected by different technologies, and replicated using different software. This same complex environment must be reproduced at the secondary site for every set of critical applications and data that must be quickly available from the secondary site.

For example, a critical Oracle database is housed on a massive EMC Symmetrix array and replicated via SRDF, while an equally critical SQL Server application is stored on a Veritas Cluster and replicated accordingly. Three more applications are also hosted at the primary data center and use snapshot technologies to copy to the secondary site. Making the matter even more complex, the secondary site must not only store current data and applications, but must also use the same RAID types in the same configurations as the production environment. RAID is a data-protection necessity, but mixing RAID types between primary and secondary environments can lead to sub-optimal storage utilization and performance issues. For example, if a production database uses RAID 1 for logs and RAID 5 for table spaces, then the secondary site must use exactly the same mix of RAID types. In reality, however, it often does not, leading to delays and difficulties when attempting to recover replicated production data within an urgent timeframe.
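The RAID example lends itself to a simple automated check. The following is a minimal Python sketch, not any vendor's implementation: the volume roles ("logs", "tablespaces") and the dictionary layout are assumptions made purely for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layouts: volume role -> RAID level. The role names and the
# secondary site's values are invented for illustration.
primary: Dict[str, str] = {"logs": "RAID 1", "tablespaces": "RAID 5"}
secondary: Dict[str, str] = {"logs": "RAID 5", "tablespaces": "RAID 5"}


def raid_mismatches(
    primary: Dict[str, str], secondary: Dict[str, str]
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (role, primary_level, secondary_level) for each role whose
    RAID level differs, or is missing, at the secondary site."""
    mismatches = []
    for role, level in sorted(primary.items()):
        other = secondary.get(role)  # None when the role has no counterpart
        if other != level:
            mismatches.append((role, level, other))
    return mismatches


print(raid_mismatches(primary, secondary))
# prints [('logs', 'RAID 1', 'RAID 5')]
```

Here the log volume is flagged because the secondary site holds it on RAID 5 rather than RAID 1. A real tool would have to discover these layouts from the arrays themselves rather than from a hand-written table.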

DRM for testing

Anything that IT can do to avoid manual involvement in this process is an advantage to DR testing, documentation, compliance and, of course, disaster recovery. This is where disaster-recovery management (DRM) comes in: By using tools that automate testing and change management, DRM can reliably and cost-effectively mitigate mismatches between primary and secondary replicated environments.

DRM continually monitors characteristics such as failed dependencies, inconsistent data, incomplete data sets, and breaches of service level objectives. These abilities also increase DRM’s value to highly regulated industries, which can use it not only to protect DR settings, but also to test and prove compliance.

The DRM application works by scanning the primary and secondary-site configurations and dependencies. Working from a knowledge base of product-specific interactions and best practices, it runs a dependency analysis and mitigates gaps by repairing and reporting them. The resulting topology becomes the baseline for continual testing of the DR environment for deviations. DRM tests critical dependencies between hosts—including OS, hardware, and network resource parameters—taking a holistic view of the multi-vendor combination of products that realistically comprise a “replicated solution.” Ideally, DRM should support not a single replication product path, but a comprehensive set of operating systems, databases, cluster configurations, storage technologies, and communication protocols. The DRM package should also be capable of supporting virtual environments as well as physical environments.
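The baseline-and-deviation idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not how any particular DRM product works: the `Gap` record, the `find_gaps` function, the host name, and the parameter names are all hypothetical, and a real product would add the knowledge base of product-specific rules that this sketch omits.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Assumed snapshot shape: host name -> {parameter name: value}
Config = Dict[str, Dict[str, Any]]


@dataclass(frozen=True)
class Gap:
    host: str
    parameter: str
    expected: Any  # value recorded in the baseline (None if absent)
    found: Any     # value seen in the latest scan (None if absent)


def find_gaps(baseline: Config, current: Config) -> List[Gap]:
    """Compare the latest scan with the baseline and list every deviation.

    Note: a parameter whose real value is None cannot be told apart from a
    missing parameter in this simplified sketch.
    """
    gaps = []
    for host in sorted(set(baseline) | set(current)):
        base_params = baseline.get(host, {})
        cur_params = current.get(host, {})
        for param in sorted(set(base_params) | set(cur_params)):
            expected = base_params.get(param)
            found = cur_params.get(param)
            if expected != found:
                gaps.append(Gap(host, param, expected, found))
    return gaps


# Hypothetical scan results: one replicated volume has gone missing.
baseline = {"dr-db01": {"os_patch": "u4", "replicated_volumes": 12}}
latest = {"dr-db01": {"os_patch": "u4", "replicated_volumes": 11}}

for gap in find_gaps(baseline, latest):
    print(f"{gap.host}: {gap.parameter} expected {gap.expected}, found {gap.found}")
```

Running the comparison on a schedule and reporting each `Gap` is the simplest form of the continual testing described above.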

DRM automatically collects information from the IT infrastructure and scans for issues that will impact recoverability. Without this level of infrastructure discovery, replication between sites will be inconsistent. This can lead to long fail-over times and even unrecoverable data, forcing IT to recover from backup and lose hours to days of changes to the production data. This is, of course, an unacceptable level of risk for mission-critical production environments.

Topology map of replication environment from Continuity Software.

The level of DRM support for multi-vendor environments also comes into play. Corporations rarely have just a single replication software solution, and it is common to have as many as seven replication products running in a large data center. DRM ideally protects multiple replication operations from various vendors by sensing gaps throughout the end-to-end protection process. This ability enables IT to keep the replication tree utterly consistent, thus protecting an exact replica of data in case of data loss.

By identifying and mitigating critical gaps between hosts and deep layers of host interaction, DRM solves the problem of complex change management in the DR infrastructure. Business can be confident that it will meet the recovery point objective (RPO) and recovery time objective (RTO) for which a given replication solution was originally designed and deployed.
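As a rough illustration of what meeting an RPO or RTO means in practice, the Python sketch below checks replication lag against an RPO and an estimated fail-over time against an RTO. The function names and all of the figures are assumptions chosen for the example; they do not come from any product.

```python
from datetime import datetime, timedelta


def rpo_breached(last_replicated: datetime, now: datetime, rpo: timedelta) -> bool:
    """True when the newest replicated data is older than the RPO allows."""
    return (now - last_replicated) > rpo


def rto_breached(estimated_failover: timedelta, rto: timedelta) -> bool:
    """True when the estimated fail-over time exceeds the RTO."""
    return estimated_failover > rto


# Hypothetical figures: data last replicated 20 minutes ago, 15-minute RPO.
now = datetime(2008, 9, 1, 10, 0)
last_replicated = datetime(2008, 9, 1, 9, 40)
print(rpo_breached(last_replicated, now, timedelta(minutes=15)))  # 20 min lag > 15 min RPO

# Hypothetical figures: 3-hour estimated fail-over against a 4-hour RTO.
print(rto_breached(timedelta(hours=3), timedelta(hours=4)))
```

In this example the RPO is breached, because 20 minutes of changes would be lost against a 15-minute objective, while the RTO is still met.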

Another important aspect of DRM is that it should not add to environmental complexity, but should simplify it by centralizing information and management across the entire replicated environment. This makes the complex DR infrastructure far more transparent and manageable, especially if the DRM application leverages existing configuration management databases such as BMC Remedy or HP OpenView.

Let’s look at a typical scenario for DRM. A large financial institution installs a DRM package that tests for gaps across multi-vendor DR software. The application runs for 48 hours on the infrastructure between a production center and a “warm” secondary site. Even though the company previously spent money and resources on DR configuration and testing, the DRM software uncovers nearly two dozen dependency gaps that would have crippled its RTO and RPO. This is a shock to the company, both because of the level of DR risk and also because of the level of non-compliance. By using the detailed DRM analysis, the company not only identifies numerous serious gaps, but is also able to quickly mitigate those gaps. The DRM application now runs automatically at scheduled intervals to identify and close any subsequent gaps created by changes to either environment.

The competitive landscape

DRM is related to technologies such as data-protection management (DPM) and storage resource management (SRM), as well as professional DR testing and change management methodologies. DRM differentiates itself by automating a high level of risk mitigation in replicated DR environments, a complex setting that is under-served by DPM, SRM, and manual change management and test operations. DRM and DPM may potentially develop in parallel as both are concerned with monitoring and managing the recovery management space, but at present their distinction is clear.

Existing vendors in this emerging segment primarily provide gap testing only on their own replication products. EMC offers DRM functionality for SRDF, and Veritas Cluster Server does the same for its own replication operations. IBM weighs in with TotalStorage Productivity Center (TPC), which manages replication for the ESS 800 (Shark), DS8000 and DS6000 arrays, and SAN Volume Controller. These replication management functions are quite useful for these specific replication paths, but leave complex multi-vendor replication environments subject to DR failure. For these environments, Continuity Software, for example, offers multi-vendor capabilities across a wide variety of replication paths, components, and software.

Protecting the critical DR environment requires complex change management and comprehensive DR testing, but all too many corporations have failed to invest in these operations to protect their critical replicated data. The new DRM technology class is stepping up to automate these manual operations, providing the promise of predictable recovery performance across multi-vendor DR solutions.

DRM can dramatically reduce the costs and time of manual DR testing by locating recoverability gaps, analyzing root causes, and mitigating the problems. This results in consistent, comprehensive, and cost-effective change management and testing in the DR infrastructure. And DRM does not make an already-complex environment even more complicated, but rather centralizes information on multiple replication paths and renders them transparent. This up-and-coming technology will prove fundamental for protecting and optimizing operations between primary and hot/warm secondary sites.


Christine Taylor is an analyst at The Taneja Group (www.tanejagroup.com). She can be reached at christine@tanejagroup.com.

This article was originally published on September 01, 2008