Replication: The next step in BC planning

With new replication technologies, business continuity doesn’t have to be a costly and complex endeavor.

By Brad O’Neill

The need for business continuity has driven demand for quick and robust disaster-recovery solutions, with data replication rising to the fore as a key enabling technology.

Unfortunately, many traditional replication approaches have forced customers to make difficult trade-offs in acquisition cost, management complexity, data consistency, distance, and performance. Enterprises can no longer afford to make the sort of compromises they have made historically when implementing replication.

The data replication market is at a major inflection point: advanced technologies are now available that cut through the traditional cost and complexity and enable full replication without trade-offs. This makes it a logical time to consider implementing remote replication, but users should understand both the continuing challenges of replication and the newer technologies available before they jump on the replication bandwagon.

Replication drivers

Over the past five years, replication technologies have enjoyed heightened adoption. Based on the Taneja Group’s research, there are several factors behind this trend. At the highest level, corporations are using replication technologies across a much broader swath of infrastructure than ever before. We see the four main drivers as:

  • Heightened awareness of outages: The tragedies of September 11 and Hurricane Katrina have driven home the need for adequate site-to-site disaster recovery;
  • Regulatory compliance: State and federal governments have enacted legislation requiring businesses, particularly in financial services and healthcare, to maintain remote disaster-recovery sites;
  • New global workflows: Insourcing and outsourcing have given rise to the need to synchronize data across multiple data centers worldwide; and
  • Consolidation: The trend toward consolidation of IT assets increases the need to safeguard those systems in the event of an outage.

Although the topic of disaster recovery is on everyone’s mind and replication adoption has increased, it can still be difficult and costly to implement a remote replication solution.

Overcoming the challenges

Disaster recovery in today’s environments is a complicated equation that must weigh the risk of an outage and the importance of the application against the cost to administer and procure the technologies. Based on the results of our end-user surveys, three major challenges emerge when one is considering a replication solution:

  • Total cost of ownership (TCO);
  • Management complexity; and
  • Network utilization/bandwidth requirements.

Challenge: TCO

Traditionally, disaster-recovery solutions, including replication, have been costly. Replication and mirroring technology built into storage systems from the large vendors typically comes with a high price tag, not to mention the cost of the overall infrastructure needed to support site-to-site disaster recovery.

We recommend that all IT purchasers conduct a thorough return on investment (ROI) calculation before purchasing replication solutions. Specifically, users must understand the cost tradeoffs between using Fibre Channel or IP connectivity, including understanding the costs per network port and the costs of individual replication and mirroring solutions.
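As a rough illustration of the kind of cost comparison recommended above, the following Python sketch totals acquisition and recurring costs for two candidate designs over a planning horizon. The class name, the cost categories, and every figure in the example are hypothetical placeholders rather than market prices; a real calculation would use vendor quotes and would also weigh the cost of downtime avoided.

```python
from dataclasses import dataclass


@dataclass
class ReplicationOption:
    """One candidate replication design. All fields are illustrative."""
    name: str
    ports: int                # network ports needed across both sites
    cost_per_port: float      # acquisition cost per port
    software_license: float   # one-time replication/mirroring license
    annual_link_cost: float   # recurring cost of the inter-site link
    annual_admin_cost: float  # recurring cost of administration time

    def total_cost(self, years: int) -> float:
        """Up-front cost plus recurring costs over `years`."""
        upfront = self.ports * self.cost_per_port + self.software_license
        recurring = (self.annual_link_cost + self.annual_admin_cost) * years
        return upfront + recurring


def cheapest(options, years):
    """Return the option with the lowest total cost over `years`."""
    return min(options, key=lambda option: option.total_cost(years))


if __name__ == "__main__":
    # Placeholder numbers only, chosen to show the arithmetic.
    fibre_channel = ReplicationOption(
        "Fibre Channel", 8, 1500.0, 40000.0, 60000.0, 20000.0)
    ip = ReplicationOption(
        "IP over existing lines", 8, 300.0, 25000.0, 24000.0, 20000.0)
    for option in (fibre_channel, ip):
        print(option.name, option.total_cost(years=3))
    print("Lowest three-year cost:", cheapest([fibre_channel, ip], 3).name)
```

Separating per-port, license, link, and administration costs makes it easier to see which line item dominates as the planning horizon grows.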

Questions to ask:

  • Do I have to invest in a complete duplicate hardware configuration, or can I implement a SAN that uses lower-cost storage at the remote location to help cut costs?
  • Can I replicate over IP, using my existing lines, and save money on Fibre Channel-to-IP converters?
  • Is asynchronous replication a viable solution for my environment?

Challenge: Management complexity

Deploying replication technology has often been fraught with pitfalls. In site-to-site recovery scenarios, two operations must work flawlessly for a recovery to succeed: synchronizing two copies of data, and ensuring the backup site always has a consistent recovery point. Neither is a trivial exercise.

There is also significant complexity in ensuring the environment is preserved perfectly between locations. Applications on hosts at the secondary site must have exactly the same operating environment (including operating system and application images) as those at the primary site. Moreover, on a recovery at the secondary site, the storage volumes must be quickly mounted and accessible to the proper hosts to hit low recovery time objectives (RTOs). If this is not done properly, a lengthy recovery and consistency-checking process must occur before the system can come up. In short, replication and site-to-site disaster recovery involve a complex set of manual tasks.

For these reasons, we recommend that end users do a full evaluation of any replication product and establish metrics on how long it takes to accomplish specific administrative tasks, such as mounting a volume on the secondary site. Although ease of use can be a soft-value criterion, it is absolutely essential that products be benchmarked against management complexity since replication must work flawlessly when it is needed most.
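One way to put numbers on administrative tasks during an evaluation is to time them repeatedly and compare the worst case against the RTO. The Python sketch below assumes the task can be wrapped in a callable; `mount_secondary_volume` is a hypothetical stand-in for whatever command or vendor interface actually performs the mount, and the 60-second RTO is an arbitrary example value.

```python
import statistics
import time


def time_task(task, runs=5):
    """Run `task` `runs` times; return (median_seconds, worst_seconds)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), max(durations)


def meets_rto(worst_seconds, rto_seconds):
    """True if the slowest observed run still fits inside the RTO."""
    return worst_seconds <= rto_seconds


def mount_secondary_volume():
    """Hypothetical placeholder: replace with the real mount procedure."""
    time.sleep(0.01)


if __name__ == "__main__":
    median, worst = time_task(mount_secondary_volume, runs=3)
    print(f"median {median:.3f}s, worst {worst:.3f}s")
    print("within RTO:", meets_rto(worst, rto_seconds=60))
```

Judging against the worst run rather than the average reflects the point that replication has to work when it is needed most, not only in the typical case.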

Questions to ask:

  • Are snapshots and replication technologies integrated to eliminate the consistency problem in site-to-site recovery scenarios and reduce the complexity of recovery?
  • Is there a single, unified control point for replication management, scheduling, and recovery verification?
  • Are there scheduling wizards, or is manual scripting required to establish a replication schedule?

Challenge: Network utilization

Overall replication performance and the achievable recovery point objective (RPO) are gated by the bandwidth and latency of the link between the two sites. Provisioning the link and determining the bandwidth requirements of any replication deployment are critical planning items.

In general, there is a direct relationship between how often data changes on the primary system and the amount of bandwidth that is consumed. Some vendors replicate using thin-provisioned volumes, which means only actual changes to the data (and not allocated but unwritten capacity) are replicated. The algorithm that a replication vendor uses will also dramatically influence how much bandwidth is consumed and whether a lower-cost link will be sufficient.
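The relationship between change rate and bandwidth can be estimated with simple arithmetic before any product is installed. The Python sketch below is a back-of-the-envelope calculation, not a vendor sizing tool: the function names and the 10 percent protocol overhead are assumptions, it uses decimal units (1 GB = 8,000 megabits), and it ignores latency, compression, and deduplication.

```python
def required_mbps(changed_gb, window_hours, protocol_overhead=0.10):
    """Sustained link speed (Mbit/s) needed to ship `changed_gb` of
    changed data within `window_hours`."""
    changed_megabits = changed_gb * 8 * 1000  # decimal GB -> megabits
    seconds = window_hours * 3600
    return changed_megabits / seconds * (1 + protocol_overhead)


def transfer_hours(changed_gb, link_mbps, protocol_overhead=0.10):
    """Hours needed to ship `changed_gb` over a link of `link_mbps`."""
    changed_megabits = changed_gb * 8 * 1000 * (1 + protocol_overhead)
    return changed_megabits / link_mbps / 3600


if __name__ == "__main__":
    # Example: 45 GB of changed data per replication cycle (made-up figure).
    print(f"{required_mbps(45, window_hours=1):.1f} Mbit/s for a 1-hour cycle")
    print(f"{transfer_hours(45, link_mbps=45):.2f} hours on a 45 Mbit/s link")
```

Because only changed data enters the calculation, the same functions show why replicating written blocks rather than full allocated capacity can lower the link speed required.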

We recommend that users look for a replication product with advanced bandwidth shaping algorithms. In some replication products, administrators can set quality of service (QoS) thresholds to either throttle or speed up the replication. We also recommend looking at solutions that replicate thin-provisioned volumes, as this can also have a significant impact on bandwidth requirements.
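To make the idea of throttling concrete, the following Python sketch models a simple time-of-day policy in which replication is capped at a fraction of the link during business hours and allowed the full link otherwise. This is a hypothetical policy for illustration, not any vendor's QoS algorithm; the 8:00 to 18:00 window and the 25 percent share are assumed values.

```python
def allowed_mbps(hour, link_mbps, business_hours=range(8, 18),
                 business_share=0.25):
    """Replication bandwidth cap (Mbit/s) for a given hour of the day."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    share = business_share if hour in business_hours else 1.0
    return link_mbps * share


def daily_capacity_gb(link_mbps, **policy):
    """Decimal GB that can be replicated in 24 hours under the policy."""
    megabits = sum(allowed_mbps(hour, link_mbps, **policy) * 3600
                   for hour in range(24))
    return megabits / 8000  # megabits -> decimal GB


if __name__ == "__main__":
    print(allowed_mbps(9, 100))    # cap during business hours
    print(allowed_mbps(22, 100))   # cap overnight
    print(daily_capacity_gb(100))  # total daily replication capacity
```

Comparing the daily capacity under a given policy with the expected daily change volume indicates whether throttling during production hours would still let replication keep up.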

Questions to ask:

  • Is there a way to determine how much bandwidth a given replication will require beforehand?
  • Are there tools available to prioritize specific volumes or cut bandwidth requirements for replication during the busiest production hours?
  • Is it possible to replicate without full volume clones?

Disaster-recovery planning shouldn’t have to involve a harsh set of tradeoffs. If users follow some straightforward recommendations and examine replication solutions in-depth before implementing them, then the cost and complexity of the overall solution can be held in check, allowing much more widespread adoption of replication technologies across a broader spectrum of applications and companies.

Brad O’Neill is a senior analyst and consultant at the Taneja Group research and consulting firm (www.tanejagroup.com).

This article was originally published on January 01, 2007