By Daniel J. Budiansky
The ability to provide cost-effective disaster recovery (DR) is a challenge facing most organizations today. IT managers and administrators must deal with the problem of storing multiple copies of data to protect against all likely failure scenarios. These copies must be kept on storage systems independent of the primary copy, and geographically dispersed enough to ensure that disaster events do not affect all copies of the data.
While local snapshots are valuable as point-in-time recovery mechanisms, they offer limited space efficiency and remain exposed to any corruption or failure that affects the shared underlying data. Replication, both synchronous and asynchronous, between remote storage systems can meet the requirement for geographic separation, but the complexity and costs associated with such mechanisms often limit their use to only a subset of critical data (for which the investment in such business continuity solutions is not optional).
When the WAN bandwidth required to replicate a data set in its native form is too costly to justify, IT organizations often fall back on existing tape backup processes instead of replication, shipping tapes offsite in an attempt to provide disaster recovery. This is where data deduplication can be applied to help solve these DR issues.
The Storage Networking Industry Association (SNIA) defines data deduplication as "the replacement of multiple copies of data—at variable levels of granularity—with references to a shared copy in order to save storage space and/or bandwidth."
Organizations of all sizes have validated the space-saving effect of data deduplication, as it can significantly reduce physical storage requirements, along with the corresponding reductions in power, space, cooling, and management costs. To understand the full impact of this approach to data reduction, one should also examine the ability of deduplication to significantly reduce the amount of bandwidth required to replicate data for disaster recovery purposes. Using deduplication-enabled replication, backup data can be replicated between sites more efficiently. With the data available on disk at alternate sites, organizations can improve their ability to develop, test, and document viable, cost-efficient DR processes.
Understanding deduplication ratios
In general, the effect of any data reduction technique, including deduplication, is typically expressed as a ratio of the original size of the logical data divided by the resultant size stored or transmitted. For example, if a system is storing 10TB of logical backup data (representing eight weeks retention of a 1TB data set backed up using a weekly full/daily incremental policy), and as a result of deduplication it consumes 1TB of physical storage, the reduction ratio is 10x (or 10:1).
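The arithmetic above can be sketched in a few lines of Python. This is an illustrative calculation using the article's example figures; the helper name `dedup_ratio` is mine, not an industry-standard API.

```python
# Hypothetical sketch: compute a deduplication ratio from the logical
# data protected and the physical storage it consumes after deduplication.

def dedup_ratio(logical_bytes: float, physical_bytes: float) -> float:
    """Ratio of original (logical) size to stored (physical) size."""
    return logical_bytes / physical_bytes

TB = 1024 ** 4
# Eight weeks of retention for a 1TB data set backed up with a weekly
# full / daily incremental policy yields roughly 10TB of logical data.
logical = 10 * TB
physical = 1 * TB   # physical disk consumed after deduplication

print(f"{dedup_ratio(logical, physical):.0f}x")  # prints "10x"
```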
In order to understand what a deduplication ratio of 10x really means, it is important to consider the temporal context as well. In the typical case of backup, data is periodically written and then deleted after a retention period has been met. For a given dataset undergoing some average rate of change, the aggregate deduplication ratio will increase linearly in relation to the retention period, as shown in Figure 1.
The plot shows the aggregate data reduction achieved by a deduplication system as a function of time (retention period of data). It can be calculated for a given backup dataset as the ratio of the blue line (logical data stored) to the red line (physical disk used) at a given retention period.
Once the retention period of the data is reached, the amount of logical data added by new backups and the amount deleted through expiration of the oldest backups are approximately equivalent. Thereafter, the aggregate deduplication ratio for that system will stay approximately the same.
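A toy model can show this growth and plateau. All figures here are assumptions chosen to match the article's running example (a steady state of roughly 10x after eight weeks), not measurements: the first 1TB full is assumed to store 625GB of unique data, and each subsequent full adds 25GB.

```python
# Illustrative model of aggregate deduplication ratio over time.
# Assumptions (not measurements): weekly 1TB fulls, 8-week retention,
# first full stores 625GB unique, each later full adds 25GB unique.

def aggregate_ratio(weeks: int, retention: int = 8,
                    full_gb: float = 1000.0,
                    first_unique_gb: float = 625.0,
                    weekly_unique_gb: float = 25.0) -> float:
    kept = min(weeks, retention)        # backups still inside retention
    logical = kept * full_gb            # logical data retained
    physical = first_unique_gb + (kept - 1) * weekly_unique_gb
    return logical / physical

# Ratio climbs while retention fills, then plateaus once old backups
# expire at the same rate new ones arrive.
for w in (1, 4, 8, 16):
    print(f"week {w}: {aggregate_ratio(w):.1f}x")
```

With these assumptions the ratio rises from under 2x in week one to 10x at week eight, and stays flat thereafter.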
When analyzing scenarios where only the new deduplicated data is replicated, it is incorrect to assume that the aggregate space-saving effect equals the reduction in bandwidth. Instead, the significant metric is the deduplication ratio achieved on the new data written.
Using the same example, while the aggregate effect after eight weeks is 10x (which, naively applied, would suggest transmitting 100GB to replicate a 1TB backup), each weekly full backup may be reduced 40x or more. Thus, only 25GB of the original 1TB would be transmitted across the WAN in order to replicate each logical full backup. Furthermore, this benefit is achieved as soon as deduplication occurs, and does not need to accumulate over time.
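The distinction between the two ratios can be made concrete as follows. The 10x and 40x figures are the article's example values; the variable names are illustrative.

```python
# Sketch of the bandwidth arithmetic: aggregate (space) ratio versus
# the new-data ratio, which is what actually governs replication traffic.

full_backup_gb = 1000   # 1TB logical full backup
aggregate_ratio = 10    # 10x aggregate space reduction after 8 weeks
new_data_ratio = 40     # 40x reduction on each new full backup (example)

# Incorrect assumption: aggregate ratio applied to WAN transfer.
naive_transfer_gb = full_backup_gb / aggregate_ratio    # 100GB

# Actual transfer: only the deduplicated new data crosses the WAN.
actual_transfer_gb = full_backup_gb / new_data_ratio    # 25GB

print(naive_transfer_gb, actual_transfer_gb)  # 100.0 25.0
```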
With deduplication-enabled replication, many businesses find that their existing WAN links have sufficient bandwidth to meet their requirements. Some vendors offer tools to model these effects, which allows IT organizations to understand both the space saving and bandwidth reduction effects of deduplication.
Simplifying low-cost DR
For organizations that have built their disaster recovery plans around their backup data, the use of replication with deduplication requires very little change to the procedures they currently use to recover the data. The primary advantage is improved availability: data is now accessible from disk at a DR site, which allows restores to begin as soon as the data has finished replicating. This can lead to significant operational efficiencies, particularly by reducing the handling and movement of tape. The frequency of recovery tests and audits can be increased without requiring tapes to be recalled from storage and loaded into tape libraries. Additionally, the replicated copy of data can be used for tape consolidation and operational restores; e.g., to refresh a development/reporting environment.
The combination of deduplication and replication for disaster recovery maximizes the operational and cost-saving benefits. The space reduction effect enables the storage of more data for longer periods of time, using less physical disk. This reduces the overall cost and footprint of using disk for backup. The bandwidth reduction effect extends the savings by enabling efficient replication, eliminating dependency on the manual and labor-intensive processes otherwise required to move data offsite. By properly identifying the positive effects on storage and bandwidth, the full potential of deduplication to improve operations and reduce costs becomes clear.
Additional information can be found in the work of the SNIA Data Management Forum Data Deduplication and Space Reduction Special Interest Group (DDSR-SIG). The DDSR-SIG is dedicated to advancing space reduction in all networked storage technologies. The group produces educational white papers, webcasts, and tutorials addressing the benefits and uses of data deduplication. Related resources can be found by visiting: www.snia.org/forums/dmf/programs/data_protect_init/ddsrsig.
DANIEL J. BUDIANSKY is a co-chair of the SNIA DMF Data Deduplication & Space Reduction SIG, and an enterprise applications technologist with Data Domain.