In my previous life I was in charge of disaster recovery (DR) planning for a division of Avery Dennison. We were ready for a disaster, or so we hoped. The plan looked good, the secondary systems were in place, and we had corporate support. But in the end there was just no good way to adequately test the DR plan without downing critical systems. We prepared for the worst and hoped for the best, but we could not be sure.
Fast-forward some years. As an analyst I sat down with two IT administrators from a large financial services company. They did quarterly disaster testing because they had to. The testing was expensive and resource-hungry, and it accounted for a large part of the IT budget. And in spite of all that spending, you guessed it: they still could not be sure.
Why is DR testing still the same time and money sink that it always was?
One reason is that although there have been advances in disaster recovery testing, they have not kept pace with the growth in DR technologies. Companies invest not only in backup and recovery protection but also in replication, mirroring, snapshots, failover technologies, and more. Supporting and testing this plethora of data protection technologies is a complex issue. Add the problem of duplicating primary system changes to the secondary system, and you have a DR environment that is prone to failure at the worst possible time.
Consider recent findings from Symantec, which surveyed IT professionals on estimated DR costs in the enterprise. Global respondents estimated a DR cost of more than $285,000 per downtime incident. U.S. respondents reported significantly higher costs. The more regulated the industry (especially finance and healthcare), the higher the estimated costs of recovering from downtime.
Why are the costs so high? First, Symantec asked respondents to measure every process that DR impacts. Kitchen sink aside, the costs include IT staff time spent battling the outage, the cost of users unable to access applications, lost sales, compliance failures, and any software or hardware needed to recover. These add up fast. Second, respondents reported a lack of confidence in their ability to handle a major incident. Most had DR plans in place and tested the DR environment, but on average one in four DR tests failed. And these were tests run in controlled environments with orderly system shutdowns. In an actual disaster, recovery would be much more difficult.
One major reason for the test failures is configuration drift, where changes made to the primary environment are not made to the secondary. Upgrades, patches, RAID configurations, new applications, service level changes, disk type replacements: these are all minor changes that occur rapidly in storage environments. These same changes may or may not be made to the secondary server even in a cluster configuration, let alone at a geographically remote hot site. Over time the primary and secondary environments cease to mirror each other, and the gap keeps widening until the primary can no longer fail over properly to the secondary.
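To make the drift problem concrete, here is a minimal sketch of the kind of comparison a drift check performs between a primary and a secondary server. The configuration records and field names are invented for the example; a real environment would pull them from inventory or discovery tools.

```python
# Minimal sketch of configuration-drift detection between a primary and a
# secondary server. The records and field names below are hypothetical.

primary = {
    "os_patch_level": "SP4",
    "raid_level": "RAID 10",
    "apps": {"billing": "5.2", "crm": "3.1"},
    "lun_count": 24,
}

secondary = {
    "os_patch_level": "SP3",     # patch applied to the primary only
    "raid_level": "RAID 5",      # disks replaced, RAID layout never matched
    "apps": {"billing": "5.2"},  # crm never installed on the DR side
    "lun_count": 20,
}

def find_drift(primary: dict, secondary: dict, path: str = "") -> list[str]:
    """Return a list of human-readable differences between the two configs."""
    drift = []
    for key in sorted(set(primary) | set(secondary)):
        label = f"{path}{key}"
        if key not in secondary:
            drift.append(f"{label}: present on primary, missing on secondary")
        elif key not in primary:
            drift.append(f"{label}: present on secondary only")
        elif isinstance(primary[key], dict) and isinstance(secondary[key], dict):
            drift.extend(find_drift(primary[key], secondary[key], f"{label}."))
        elif primary[key] != secondary[key]:
            drift.append(f"{label}: {primary[key]!r} vs {secondary[key]!r}")
    return drift

for gap in find_drift(primary, secondary):
    print(gap)
```

Each unpropagated change shows up as one line of output; multiply that by dozens or hundreds of changes and the scale of the cleanup becomes clear.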
Even when IT realizes what happened, it is a never-ending headache to mitigate the problem. The configuration drift happens as a result of dozens or hundreds of unduplicated changes, and fixing them is time-consuming, costly and complex.
So is DR testing. Many companies will test DR when they deploy new software and equipment, but far fewer keep testing on an ongoing basis. Even the companies that do run periodic tests spend vast amounts of money and resources on them. Full DR testing requires downing systems, which rules it out entirely for 24×7 operations. And even these tests can only approximate a disaster: the IT department does not go around yanking the plugs on critical systems; it shuts them down in an orderly fashion. Even when IT clones systems and applications to a test environment, it is usually limited to testing one system at a time.
The more complex the environment, the harder this process gets. In complex environments, critical production data is produced by different applications, housed on different storage devices, protected by different technologies, and replicated using different software. This same complexity exists at the secondary systems, both local and remote. IT is expected to have a high degree of confidence that it can recover these systems, but IT often lacks the resources to do so.
The Solution: DRM
For all of the above reasons, we recommend disaster recovery management (DRM). We define DRM as the ability to automate testing and change management in complex DR environments. DRM is not the same thing as data protection management (DPM) or storage resource management (SRM).
DRM automates testing, change management and risk mitigation in replicated environments. This includes monitoring systems for failed dependencies, incomplete data sets, workload imbalances, and service level breaches. When it locates inconsistencies, the DRM product alerts administrators with clear and actionable reports, making it possible to close gaps quickly.
Let’s take a look at how it works in a multi-vendor configuration. The DRM product scans the DR environment and collects configuration data. It builds out a topology map by analyzing dependencies between servers, storage and databases, and by adding its own knowledge base of known dependency and configuration issues in multi-vendor environments. The completed map becomes a baseline for tracking subsequent changes and fixes between the primary and secondary environments. By checking each scheduled scan against the topology map, the DRM product can surface inconsistencies and gaps between the systems for immediate remediation.
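As a rough sketch of the baseline-and-rescan idea: build a topology map of dependencies, keep it as the baseline, and diff later scans against it. The assets and dependency edges below are hypothetical, and a real DRM product discovers them automatically rather than having them hand-coded.

```python
# Sketch of the baseline/rescan approach: record dependencies
# (application -> database server -> storage volume -> replica) as a
# topology map, then compare a later scan against the baseline.
# All asset names and edges are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class TopologyMap:
    # edges maps each asset to the set of assets it depends on
    edges: dict[str, set[str]] = field(default_factory=dict)

    def add_dependency(self, asset: str, depends_on: str) -> None:
        self.edges.setdefault(asset, set()).add(depends_on)

    def diff(self, other: "TopologyMap") -> list[str]:
        """Report dependencies that appeared or disappeared since the baseline."""
        findings = []
        for asset in sorted(set(self.edges) | set(other.edges)):
            before = self.edges.get(asset, set())
            after = other.edges.get(asset, set())
            for dep in sorted(before - after):
                findings.append(f"{asset}: dependency on {dep} disappeared since baseline")
            for dep in sorted(after - before):
                findings.append(f"{asset}: new dependency on {dep} not in baseline")
        return findings

# Baseline scan of the primary/secondary pair
baseline = TopologyMap()
baseline.add_dependency("oracle_erp", "db_server_1")
baseline.add_dependency("db_server_1", "lun_0042")
baseline.add_dependency("lun_0042", "replica_lun_0042")   # replication link

# A later scheduled scan: the replication link has quietly gone missing
rescan = TopologyMap()
rescan.add_dependency("oracle_erp", "db_server_1")
rescan.add_dependency("db_server_1", "lun_0042")

for finding in baseline.diff(rescan):
    print(finding)
```

The point of the baseline is that the comparison is automatic and repeatable, so the gap is flagged when it opens rather than during a failed failover.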
When we talked to IT about DRM, they were positive about its advantages but concerned that it would add to environmental complexity. It does not. DRM should run transparently to test and mitigate between clustered and replicated systems. It should have no performance impact, and it may be able to report into existing framework management products such as HP OpenView. Far from adding complexity, DRM replaces manual effort by analyzing and mitigating configuration gaps, which lifts a huge load off IT’s back in the DR environment.
DRM Vendors
The majority of vendors providing replication also provide some means of testing it. That is good as far as it goes, but there is a great deal more to disaster recovery and change management than just checking that last night’s replication occurred. The vendors below all offer features that go beyond mere replication reporting. Reporting is a basic feature and all of the vendors have it; in addition, EMC and Symantec offer testing via production clones, SteelEye verifies replication paths before starting the replication process, and Continuity Software mitigates configuration gaps in multi-vendor environments.
Continuity Software. Continuity is still the only DRM vendor that offers DR testing and gap mitigation across multiple replication applications and storage system vendors. The company’s RecoverGuard runs in the background to scan storage, databases, servers and replication configurations for vulnerabilities. Once the baseline scan is complete, IT can schedule periodic rescans to provide change management in local and remote DR environments. RecoverGuard’s data collection engine is agentless and can be set to continuous or scheduled scanning. The engine works by scanning the IT infrastructure and collecting information from key assets using standard protocols and APIs. Using this data plus its continuously updated knowledge base, RecoverGuard calculates dependencies and relationships between applications, databases, file systems, servers, storage volumes and replicas. The engine tests the baseline against gap signatures in its knowledge base. Upon locating a gap, the system issues a ticket containing details for remediation and proactive improvement.
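RecoverGuard’s internals are proprietary, so the fragment below is only a generic illustration of the gap-signature idea the paragraph describes: scan results are matched against a knowledge base of known vulnerability patterns, and each hit produces a ticket. The signature names, scan fields, and ticket format are all invented.

```python
# Generic illustration of matching scan results against a knowledge base of
# "gap signatures" and raising a ticket for each hit. Not RecoverGuard code;
# every signature, field, and ticket key here is hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class GapSignature:
    name: str
    severity: str
    matches: Callable[[dict], bool]   # predicate applied to one scanned asset

KNOWLEDGE_BASE = [
    GapSignature(
        name="replica-out-of-date",
        severity="critical",
        matches=lambda a: a.get("replica_lag_minutes", 0) > 60,
    ),
    GapSignature(
        name="standby-missing-patch",
        severity="major",
        matches=lambda a: a.get("patch_level") != a.get("peer_patch_level"),
    ),
]

def scan_assets(assets: list[dict]) -> list[dict]:
    """Return a ticket for every asset that matches a known gap signature."""
    tickets = []
    for asset in assets:
        for sig in KNOWLEDGE_BASE:
            if sig.matches(asset):
                tickets.append({
                    "asset": asset["name"],
                    "signature": sig.name,
                    "severity": sig.severity,
                    "action": f"Remediate {sig.name} on {asset['name']}",
                })
    return tickets

assets = [
    {"name": "db_server_1", "replica_lag_minutes": 240,
     "patch_level": "SP4", "peer_patch_level": "SP4"},
    {"name": "app_server_2", "replica_lag_minutes": 5,
     "patch_level": "SP4", "peer_patch_level": "SP3"},
]

for ticket in scan_assets(assets):
    print(ticket)
```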
EMC. EMC tests and reports on its own replication procedures, but it also has robust DRM procedures in place for specific application environments. For example, EMC’s Disaster Recovery for Oracle E-Business Suite adds Replication Manager and TimeFinder to Oracle’s RapidClone to test and verify the application’s disaster recovery readiness. Full Oracle protection is an involved process that includes replicating the entire stack: application, middleware and database. The joint EMC/Oracle process enables discovery and configuration, clones full copies of the production environment, and manages the disk-based replicas. It works with local replication but can use EMC SRDF to provide synchronous remote replication services. In the SRDF configuration, Replication Manager runs at the remote site to create full copies or snaps of the Oracle stack. The testing process draws on knowledge bases of reference architectures and best practices.
SteelEye. SteelEye’s LifeKeeper sends redundant signals between clustered nodes to gauge application and system status before replicating between systems. For example, SAP NetWeaver may run on IBM System x in a clustered configuration. LifeKeeper monitors and automatically responds to SAP NetWeaver failures, including failures affecting database records, application instances, data storage, and operating system I/O components. The new SteelEye Protection Suite (SPS) for SAP expands in-place testing by allowing administrators to switch SAP to and from the recovery site testing environment. Administrators can still test SAP in place, as in the IBM System x example; this in-place process redirects a single test user or a subset of users to verify data recovery at the local cluster site.
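LifeKeeper’s actual heartbeat protocol is not shown here; the snippet below only illustrates the general idea of confirming that a peer node answers over redundant paths before a replication cycle starts. The hosts, ports, and timeout are placeholders.

```python
# Toy illustration of checking peer reachability over redundant heartbeat
# paths before starting a replication cycle. Hosts and ports are placeholders;
# this is not LifeKeeper's protocol.

import socket

HEARTBEAT_PATHS = [("10.0.0.2", 7000), ("192.168.10.2", 7000)]  # two NICs, hypothetical

def node_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the peer's heartbeat port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def safe_to_replicate() -> bool:
    # Require at least one healthy heartbeat path before replicating,
    # so a dead or partitioned peer is flagged instead of silently skipped.
    return any(node_reachable(host, port) for host, port in HEARTBEAT_PATHS)

if safe_to_replicate():
    print("peer healthy on at least one path: start replication cycle")
else:
    print("no heartbeat response: raise an alert instead of replicating")
```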
Symantec. Symantec weighs in with Disaster Recovery Fire Drill on Veritas Cluster Server (VCS). Fire Drill allows VCS users to proactively discover and remediate issues within their DR plan using 1) host-based replication with Veritas Volume Replicator, or 2) array-based replication with EMC’s SRDF or Hitachi Data Systems’ TrueCopy. Both architectures use a point-in-time snapshot on the remote site. (Host-based replication uses a small point-in-time snapshot of changes only, while array-based replication uses a full data snapshot.) Once the point-in-time snapshots are stored at a recovery site, Fire Drill uses the data’s actual applications to automatically test the recovery site’s DR quality and usability. The feature works by initiating application startup and shutdown, and testing the snapshot’s disaster recovery success. Users can set up a script to automate point-in-time copy creation and to test specific applications at the recovery site.
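Since users can script the drill themselves, the outline below sketches what such an orchestration script might look like: take a point-in-time copy, bring the application up on it, check that it answers, then tear everything down. None of the command names are real Veritas, EMC, or Hitachi CLI calls; they are placeholders for whatever snapshot and application-control tooling a site actually uses.

```python
# Outline of a user-written "fire drill" script. Every command string is a
# placeholder, not an actual VCS or storage-array CLI invocation.

import subprocess
import sys

STEPS = [
    # (description, placeholder command)
    ("create point-in-time copy on recovery site", ["./create_snapshot.sh", "oracle_dg"]),
    ("start application group against the snapshot", ["./start_app_group.sh", "oracle_fd"]),
    ("run application-level health check",          ["./check_app.sh", "oracle_fd"]),
    ("shut application group back down",            ["./stop_app_group.sh", "oracle_fd"]),
    ("discard the test snapshot",                   ["./destroy_snapshot.sh", "oracle_dg"]),
]

def run_step(description: str, command: list[str]) -> bool:
    print(f"[fire drill] {description}")
    try:
        return subprocess.run(command).returncode == 0
    except OSError as exc:   # e.g. placeholder script not present on this host
        print(f"[fire drill] could not run {command[0]}: {exc}", file=sys.stderr)
        return False

def run_fire_drill() -> bool:
    for description, command in STEPS:
        if not run_step(description, command):
            # A failed step is exactly the finding the drill exists to surface.
            print(f"[fire drill] FAILED at step: {description}", file=sys.stderr)
            return False
    return True

if __name__ == "__main__":
    sys.exit(0 if run_fire_drill() else 1)
```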
Conclusion
Disaster recovery testing and change management are complex processes. The reality is that IT will never have the resources to manually test every business process, every service, and every change. This is why disaster recovery management is such an important technology. DRM dramatically reduces cost, risk and resource requirements by keeping the replication tree consistent. This critical capability turns DR and DR testing from a high-risk maneuver into a useful, compliant, and highly manageable process.