Evaluating TTR in data deduplication environments

Posted on May 27, 2009

By Eric Burgener

The whole point of data protection operations is to be able to reliably and cost-effectively recover your data in a timely manner should that become necessary, whether the recovery is local or remote.

If you're like most IT professionals, you've probably at least thought about how data deduplication technology could improve your data protection operations.  You've heard vendors trying to differentiate their products based on a number of metrics – source- vs. target-based, inline vs. post processing, file-level vs. sub-file-level fingerprinting, fixed- vs. variable-length windows, single vs. global deduplication repositories, forward vs. reverse referencing, etc. – making it a difficult decision.

When it comes to disaster recovery (DR) operations, a key metric to focus on in order to make that decision easier is time to recovery (TTR).

A lot has been written about data deduplication technology, but the topic of how different approaches affect DR time to recovery has not been addressed adequately. As data protection operations become increasingly disk-based and enterprises move away from tape-based infrastructure, technologies such as deduplication and replication should be leveraged heavily.

In this article, we'll take a look at a common set of processes for moving backup data from distributed sites to centralized locations and then to remote sites for DR purposes, evaluating the impact of the two main target-based deduplication architectures – inline and post-processing – in terms of recovery point objectives (RPO) and recovery time objectives (RTO).

In pursuing this comparison, it is important to keep in mind that it is the end result impact on recovery time and cost of the entire data protection process that is important; comparisons that focus only on one or more of the intermediate steps in the process provide insufficient information to understand your ability to achieve the bottom-line objective:  meeting your RPO/RTO requirements in a cost-effective manner.

Inline vs. post processing

The best way to illustrate the differences between inline and post-processing approaches is to describe how they work during the backup process. In both cases, you have a backup source (often called the backup "client") and a backup target. With inline processing, the deduplication appliance is defined as the backup target, and data is deduplicated on the fly before it is written to the target; data is only stored after it has been deduplicated. In the post-processing approach, the deduplication appliance is also defined as the backup target, but data is first written to disk in its native form. Then a secondary process picks that data up, deduplicates it, and writes it back to the appliance in its deduplicated form.

With inline processing, the deduplication step potentially adds some amount of latency, and there has historically been a concern that inline appliances may impact backup performance. With post-processing deduplication, more storage capacity is required up front (to hold the backup data in its un-deduplicated form), and more time may be required to process the backup into its deduplicated form, since backup ingest and deduplication are two separate, sequential processes.

Vendors have been working on these issues, and there are now inline deduplication appliances on the market that can ingest backup data (and turn it into its deduplicated form) at single-stream speeds of 500MBps or more, which makes backup ingest performance less of a concern (given the limitations of network backup) in all but the highest-performance environments. Post-processing approaches have also evolved so that deduplication may run concurrently with backup ingestion, significantly shortening the time required to both ingest the data and process it into its deduplicated form. Nightly backups are normally divided into a number of backup jobs that run sequentially, and some post-processing vendors can now begin deduplicating a completed backup job while other jobs are still being ingested.

Moving backup data from the backup source to a DR site is typically a three-step process (back up, deduplicate, replicate), assuming the data stays on disk the whole time.  Note, however, that inline approaches perform backup and deduplication concurrently, effectively resulting in a two-step process.
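As a rough illustration, the back-of-the-envelope Python sketch below models the hours until a night's backup set has been backed up, deduplicated, and replicated under each approach. Every figure in it (backup size, ingest and deduplication rates, deduplication ratio, link speed) is a hypothetical assumption rather than a vendor specification, and the post-processing case shown is the simplest, fully sequential variant.

    # Hypothetical end-to-end sketch: hours until the night's backup set has been
    # backed up, deduplicated, and replicated to the DR site. All sizes, rates,
    # and the deduplication ratio are illustrative assumptions.

    backup_tb = 10.0           # nightly backup set (TB)
    dedupe_ratio = 15.0        # assumed 15:1 reduction after deduplication
    wan_mbits = 622.0          # replication link (megabits per second)

    inline_mbps = 500.0        # inline ingest+dedupe rate (MB/s)
    post_ingest_mbps = 1000.0  # post-processing ingest rate (MB/s)
    post_dedupe_mbps = 400.0   # post-processing dedupe rate (MB/s)

    def hours(tb, mb_per_sec):
        """Hours to process tb terabytes at mb_per_sec megabytes per second."""
        return tb * 1_000_000 / mb_per_sec / 3600

    # Replication moves only the deduplicated data.
    replicate_h = (backup_tb / dedupe_ratio) * 8_000_000 / wan_mbits / 3600

    # Inline: backup and deduplication happen in one pass, then replication.
    inline_total = hours(backup_tb, inline_mbps) + replicate_h

    # Post-processing (sequential): backup, then deduplicate, then replicate.
    post_total = (hours(backup_tb, post_ingest_mbps)
                  + hours(backup_tb, post_dedupe_mbps) + replicate_h)

    print(f"Inline (two steps):            ~{inline_total:.1f} h until data is at the DR site")
    print(f"Post-processing (three steps): ~{post_total:.1f} h until data is at the DR site")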

Figure 1 shows the process we will use to compare the impact of inline and post-processing deduplication on time to recovery.  The operative measure of TTR is defined as the earliest point in time at which you could fully recover a file (or a full system) from the DR site, and this measure will have a certain RPO and RTO associated with it. 

Figure 2 shows key factors that must be evaluated to understand TTR.  Let's look more closely at a few of these factors:

Network bandwidth
When/where deduplication is performed can significantly impact backup times. 

In backup environments, deduplication will often reduce the size of a backup job by 10x to 20x or more. If deduplication is performed close to the source, such as with a smaller, less expensive deduplication appliance acting as a local backup target at a remote office/branch office (ROBO), then far less data has to be transferred from the backup source across the LAN/WAN to the primary site, which can reduce bandwidth and time requirements significantly. The trade-off is any additional time required to perform deduplication during the backup process vs. the time saved by transferring significantly less data across the network. Network bandwidth is a key variable in this comparison.
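As a simple illustration of that trade-off, the sketch below compares WAN transfer times from a ROBO with and without deduplication performed locally first. The job size, 15:1 deduplication ratio, and link speed are all hypothetical assumptions.

    # Hypothetical sketch: WAN transfer time from a remote office to the primary
    # site, with and without deduplication performed at the remote office first.
    # Job size, deduplication ratio, and link speed are illustrative assumptions.

    job_gb = 500.0        # nightly backup at the remote office (GB)
    dedupe_ratio = 15.0   # assumed 15:1 reduction after deduplication
    wan_mbits = 100.0     # WAN link speed (megabits per second)

    def transfer_hours(gb, link_mbits):
        """Hours to push gb gigabytes over a link_mbits megabit/second link."""
        return gb * 8000 / link_mbits / 3600

    print(f"Native transfer:       ~{transfer_hours(job_gb, wan_mbits):.1f} h")
    print(f"Deduplicated transfer: ~{transfer_hours(job_gb / dedupe_ratio, wan_mbits):.1f} h")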

Backup ingest performance
Backup ingest performance is an important factor, but it is not the only factor.  Granted, you'll want to minimize the time period during which the performance of applications being backed up may be degraded due to backup operations, but you'll need to understand both the backup ingest performance of the deduplication appliance and the maximum data rate at which your network can deliver data to the appliance.  Only then can you understand if it is relevant to compare backup ingest performance between inline and post-processing deduplication appliances.
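In other words, the effective ingest rate is bounded by the slower of the network and the appliance, as the trivial sketch below (with made-up rates) shows.

    # Hypothetical sketch: effective ingest is limited by the slower of the
    # network's delivery rate and the appliance's quoted ingest rate.
    network_delivery_mbps = 300.0   # what the backup network can actually feed (MB/s)
    appliance_ingest_mbps = 750.0   # appliance's quoted ingest rate (MB/s)

    effective = min(network_delivery_mbps, appliance_ingest_mbps)
    bottleneck = "network" if effective == network_delivery_mbps else "appliance"
    print(f"Effective ingest rate: {effective:.0f} MB/s (limited by the {bottleneck})")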

A broad set of metrics should be evaluated for inline and post-processing deduplication approaches to accurately predict how quickly data can be recovered from a remote DR site.

Concurrent dedupe processing
For inline deduplication appliances, it has been implied that ingest rates are slowed by the fact that deduplication is performed as the backup is being ingested.  Given the laws of physics, this makes sense, but you would also expect that post-processing deduplication appliances that are concurrently ingesting one backup job while deduplicating another would suffer from a similar performance degradation. 

Although it's not the only point of comparison, to compare inline and post-processing "performance" (defined as backup ingest plus deduplication rate) on a level playing field, you'll need to understand the deduplication performance of the post-processing appliance, not just its backup ingest performance, and how that performance varies when backup ingestion is running concurrently. Inline vendors openly quote this combined figure, while post-processing vendors generally quote only backup ingest performance, not deduplication performance. You don't need both data points to know when a backup will complete locally, but you do need both to understand your TTR from a remote DR site, since the backup cannot be transferred to that site until it has been both ingested and deduplicated.
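To see why both rates matter, the sketch below models a post-processing appliance that can deduplicate a completed job while ingesting the next one. The job sizes and both rates are hypothetical assumptions.

    # Hypothetical sketch: with job-level concurrency, a completed job can be
    # deduplicated while the next job is still being ingested, so the time until
    # the whole night's data is deduplicated (and therefore replicable) depends
    # on BOTH the ingest rate and the deduplication rate. All figures are
    # illustrative assumptions.

    jobs_tb = [4.0, 3.0, 2.0, 1.0]   # sequential nightly backup jobs (TB)
    ingest_mbps = 800.0              # post-processing ingest rate (MB/s)
    dedupe_mbps = 400.0              # post-processing deduplication rate (MB/s)

    def seconds(tb, rate_mbps):
        return tb * 1_000_000 / rate_mbps

    ingest_done = 0.0   # when the current job finishes ingesting
    dedupe_done = 0.0   # when the deduplication engine becomes free
    for tb in jobs_tb:
        ingest_done += seconds(tb, ingest_mbps)
        # A job can be deduplicated only after it has been fully ingested and
        # after the previous job's deduplication has finished.
        dedupe_done = max(ingest_done, dedupe_done) + seconds(tb, dedupe_mbps)

    print(f"All jobs ingested after:     ~{ingest_done / 3600:.1f} h")
    print(f"All jobs deduplicated after: ~{dedupe_done / 3600:.1f} h")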

More network bandwidth issues
If you are keeping your data on disk at the DR site, deduplication performance may have another, less obvious impact on TTR. Certain recovery operations, such as the need to fully recover multiple systems after a comprehensive disaster, may require the transfer of so much data that using a WAN is impractical. Most vendors provide an option to dump data to "shippable" devices (an appliance, tapes, etc.) to get the data to the recovery location faster (if that location is different from the DR site). Often, the data at the primary site has been destroyed but the site itself has not, and the primary site is the preferred recovery location. If the data is stored in deduplicated form on disk at the DR site, you'll want to know how fast it can be dumped to these devices in its native form.

What we've found in investigating read performance differences between inline and post-processing appliances is that inline appliances are generally much faster (on the order of 3x to 4x) when the data is not sitting in a disk cache in its native form. The difference between reading data at, say, 400MBps from an inline deduplication appliance and 100MBps from a post-processing appliance can translate into many hours when dumping data, depending on the amount involved, and that difference can have a large impact on your TTR.
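A bit of arithmetic shows how quickly those read rates compound; the recovery-set size below is a hypothetical assumption.

    # Hypothetical sketch: time to dump a full recovery set from the DR-site
    # appliance to shippable media at two different native-format read rates.
    recovery_set_tb = 20.0   # assumed size of the data to be dumped (TB)

    for label, read_mbps in (("read at 400MBps", 400.0), ("read at 100MBps", 100.0)):
        hrs = recovery_set_tb * 1_000_000 / read_mbps / 3600
        print(f"{label}: ~{hrs:.1f} h to dump {recovery_set_tb:.0f} TB")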

Cost and management issues
Different approaches impose different costs and management issues, so make sure you understand them. A prime example: when comparing single-stream performance, it's important to understand the size and cost of the configuration each vendor requires to achieve a given level of performance. If a certain approach requires more disk spindles to reach that level, that has implications not only for cost but also potentially for management and other issues (floor space, energy consumption, etc.). For large single backup jobs, such as databases, single-stream (not aggregate) throughput is the operative metric for understanding TTR.

Replication time
You also need to know exactly how the replication works.  Is replication performed when a file is backed up, when a backup job is completed, or only when all backup jobs are completed? 

Clearly, you want to replicate data in its deduplicated, not native, form to minimize the bandwidth required between the primary and DR site, so you'll have to wait until deduplication has been performed.  During the initial deduplication process, most vendors deduplicate and compress the data prior to storing it, but not all vendors retain the data in this form as it is replicated.  If vendors have to uncompress data before they replicate it, this will have an impact on the bandwidth required to complete that replication, a factor which potentially has an impact on TTR. 
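The sketch below illustrates the bandwidth implication: replicating the data in its compressed form versus re-expanding it before sending it over the same link. The deduplicated size, 2:1 compression ratio, and link speed are hypothetical assumptions.

    # Hypothetical sketch: replication time over the same WAN link when the
    # deduplicated data stays compressed versus being uncompressed before it is
    # sent. Sizes, compression ratio, and link speed are illustrative assumptions.

    deduped_gb = 600.0    # deduplicated (pre-compression) size of the night's backups (GB)
    compression = 2.0     # assumed additional 2:1 compression
    wan_mbits = 200.0     # replication link (megabits per second)

    def hrs(gb):
        return gb * 8000 / wan_mbits / 3600

    print(f"Replicate compressed data:   ~{hrs(deduped_gb / compression):.1f} h")
    print(f"Replicate uncompressed data: ~{hrs(deduped_gb):.1f} h")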

It's also important to understand when the metadata necessary to reconstruct a deduplicated file is replicated. A file cannot be recovered from the DR site until this metadata (which basically provides the formula for finding the file and converting it back into its native form) is also at the DR site.

Some approaches replicate the metadata along with the file, whereas others do not replicate the metadata for all files until the end of the replication process for a particular backup job or, in some cases, even later. For large data transfers, replication-complete times, defined as the point at which a file is recoverable from the remote site, may differ by hours between approaches. This has an obvious impact on TTR.

There are benefits to having the replication functionality integrated with the deduplication appliance, as opposed to using a separate replication capability, such as array-based replication, to perform it. Integrated options can make the file immediately available on a read-only basis at the DR site after its metadata has been transferred, whereas more static options such as array-based replication will require that the file be manually mounted on a server before it can be accessed. Integrated approaches may make it easier to manage the integrity of deduplicated data, but even so you'll want to understand how vendors address data integrity issues. In deduplicated environments, the loss of a single chunk of data may affect hundreds or thousands of different files, so you'll want to make sure you're comfortable with the reliability of vendors' approaches in this area.

Recovery time
When comparing recovery times, vendors clearly want to present very rapid recovery capabilities.  Inline approaches do not cache data in native form, whereas most post-processing vendors recommend maintaining at least the most recent backup in native form to facilitate faster restores. (Statistically, most restores come from the most recent backup if it is available). 

When comparing recovery times, make sure that you understand the speed with which data can be restored from the remote site, both when it is cached at the primary site and when it is not.  Maintaining a large cache at the primary site, which includes data that has already been transferred to the DR site, may result in better recovery times, but it may not be a realistic representation of the types of recoveries you have to perform in your environment. 

Disk is getting less expensive, but a larger disk cache at the primary site costs more than a smaller or non-existent one.  Will you be caching multiple days' worth of backups at your primary site?  What is the chance that, if you have to perform a recovery from your DR site, the data you need will be available from a disk cache at your primary site?  The data not only has to be there, but also has to be retrievable, and in the event of a disaster that may not always be the case.
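One way to think about those questions is in terms of expected restore time, as in the sketch below; the cache-hit probability, restore rates, and data size are all hypothetical assumptions.

    # Hypothetical sketch: expected restore time when the needed data may or may
    # not still be sitting in a native-format disk cache at the primary site.
    # Probability, size, and rates are illustrative assumptions.

    restore_tb = 2.0
    cache_hit_probability = 0.6   # chance the data is still cached locally in native form
    local_restore_mbps = 800.0    # restore rate from the local native-format cache (MB/s)
    remote_restore_mbps = 150.0   # effective rate restoring from the DR site (MB/s)

    def hrs(tb, rate_mbps):
        return tb * 1_000_000 / rate_mbps / 3600

    expected_h = (cache_hit_probability * hrs(restore_tb, local_restore_mbps)
                  + (1 - cache_hit_probability) * hrs(restore_tb, remote_restore_mbps))
    print(f"Expected restore time: ~{expected_h:.1f} h")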

When thinking through these issues, some other realizations become apparent. If you have stringent RPO/RTO requirements, the latest inline deduplication technology that operates at very high speeds may well produce TTRs that are hours shorter than those achievable with post-processing technologies, even when the backup ingest (not the deduplication) performance of a post-processing approach outperforms the inline appliance by 2x to 3x. This becomes more of an issue with large backup sets. And when comparing against post-processing technologies that can ingest and deduplicate at the same time, realize that your single largest backup job will likely determine the TTR difference between inline and post-processing approaches, because most post-processing implementations cannot start deduplicating a backup job until that job is complete. This means that large enterprises with stringent recovery requirements may actually be better served by inline deduplication technologies, provided the deduplication performance of those solutions can at least keep up with the network's ability to deliver data.
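To make that concrete, the sketch below applies the same arithmetic to a single large backup job, giving the post-processing appliance a faster ingest rate but requiring it to finish ingesting the job before deduplication can start. All figures are hypothetical assumptions.

    # Hypothetical sketch: for the single largest backup job, post-processing
    # cannot begin deduplicating until the job has finished ingesting, so even a
    # 2x-plus ingest advantage can leave it ready to replicate later than an
    # inline appliance. All figures are illustrative assumptions.

    largest_job_tb = 8.0
    inline_mbps = 500.0         # inline ingest+dedupe rate (MB/s)
    post_ingest_mbps = 1200.0   # post-processing ingest rate (MB/s), ~2.4x faster
    post_dedupe_mbps = 400.0    # post-processing deduplication rate (MB/s)

    def hrs(tb, rate_mbps):
        return tb * 1_000_000 / rate_mbps / 3600

    inline_ready = hrs(largest_job_tb, inline_mbps)
    post_ready = hrs(largest_job_tb, post_ingest_mbps) + hrs(largest_job_tb, post_dedupe_mbps)

    print(f"Inline: job deduplicated, ready to replicate after          ~{inline_ready:.1f} h")
    print(f"Post-processing: job deduplicated, ready to replicate after ~{post_ready:.1f} h")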

When crafting a DR strategy that leverages deduplication technology, integrated replication is a key piece of functionality for meeting enterprise recovery requirements: it not only simplifies data movement but can also contribute to more automated recoveries and to TTRs that are hours shorter.

And finally, while not every architectural difference between vendors' offerings has a big impact on TTR, a few key ones do. By understanding your needs, and then understanding the implications of different vendor implementations, you'll be able to focus on the metrics that should make your technology decision straightforward.

At the time this article was written, Eric Burgener was a senior analyst and consultant with the Taneja Group research and consulting firm.
