By Eric Burgener, Taneja Group
The whole point of data protection operations is to be able to reliably and cost-effectively recover your data in a timely manner should that become necessary, regardless of whether that need is local or remote. If you're like most IT professionals, you've probably at least thought about how data de-duplication technology could improve your data protection operations. You've heard vendors trying to differentiate their products based on a number of metrics – source- vs. target-based, inline vs. post processing, file-level vs. sub-file-level fingerprinting, fixed- vs. variable-length windows, single vs. global de-duplication repositories, forward vs. reverse referencing, etc. – that may make it difficult to make a decision.
When it comes to disaster recovery (DR) operations, one key metric makes that decision much easier: time to recovery (TTR) from the DR site.
A lot has been written about data de-duplication technology, but the topic of how different approaches impact DR time to recovery in mid-size to large enterprise environments is one that has not been addressed adequately. As data protection operations become more and more disk-based, enterprises should heavily leverage technologies such as de-duplication and replication as they move away from tape-based infrastructure.
In this article, we'll take a look at a common set of processes for moving backup data from distributed sites to centralized locations and then to remote sites for DR purposes, evaluating the impact of the two main target-based de-duplication architectures – inline and post processing – in terms of recovery point objectives (RPOs) and recovery time objectives (RTOs).
In pursuing this comparison, it is important to keep in mind that it is the end result impact on recovery time and cost of the entire data protection process that is important; comparisons that focus only on one or more of the intermediate steps in the process provide insufficient information to understand your ability to achieve your bottom-line objective: meeting your RPO/RTO requirements in a cost-effective manner.
Inline vs. post processing
The best way to illustrate the differences between the two approaches is to describe how they work during the backup process. You have a backup source (often called the backup "client") and a backup target. With inline processing, the de-duplication appliance is defined as the backup target, and data is de-duplicated on the fly before it is written to the target; data is only stored after it has been de-duplicated. In post processing, the de-duplication appliance is also defined as the backup target, but data is first written to disk in its native form. Then a secondary process picks that data up, de-duplicates it, and writes it back to the appliance in its de-duplicated form.
With inline processing, the de-duplication process potentially adds some amount of latency, and there has historically been a concern that inline appliances may impact backup performance. In post processing, more storage capacity is required up front (to write the backup in its un-deduplicated form) and more time may be required (since backup ingest and de-duplication are two separate, sequential processes) to process the backup into its de-duplicated form.
Vendors have been hard at work on these issues, and we are now at a point where there are inline de-duplication appliances on the market that can ingest backup data (and turn it into its de-duplicated form) at single-stream speeds of 500MBps or more, which makes backup ingest performance less of a concern (given the limitations of network backup) in all but the highest performance environments. Post-processing approaches have been modified so that post processing may be able to run concurrently with backup ingestion, significantly shortening the time required to both ingest and process the data into its de-duplicated form. Nightly backups are normally divided into a number of backup jobs that run sequentially, so some post-processing vendors can now begin de-duplicating a completed backup job while they are ingesting the next ones.
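To make the timing difference concrete, here is a minimal Python sketch of when each nightly backup job reaches its de-duplicated form under the two approaches. It is an illustration only, not any vendor's actual behavior: the function names, job sizes, and data rates are hypothetical, and the model assumes that a post-processing appliance de-duplicates one job at a time and that concurrent ingest does not slow de-duplication (which may be optimistic).

```python
def inline_ready_times(job_sizes_mb, inline_rate_mbps):
    """Seconds at which each job is ingested AND de-duplicated (inline).

    Jobs run sequentially; de-duplication happens during ingest.
    """
    elapsed = 0.0
    ready = []
    for size in job_sizes_mb:
        elapsed += size / inline_rate_mbps
        ready.append(elapsed)
    return ready


def post_process_ready_times(job_sizes_mb, ingest_rate_mbps,
                             dedupe_rate_mbps, concurrent=True):
    """Seconds at which each job is de-duplicated (post processing).

    A job cannot be de-duplicated until it is fully ingested.
    concurrent=True:  de-duplication of finished jobs overlaps later ingest.
    concurrent=False: de-duplication starts only after all ingest is done.
    """
    # When does each job finish landing on disk in native form?
    elapsed = 0.0
    ingest_done = []
    for size in job_sizes_mb:
        elapsed += size / ingest_rate_mbps
        ingest_done.append(elapsed)

    # The (single, assumed) de-duplication engine works through jobs in order.
    engine_free = 0.0 if concurrent else ingest_done[-1]
    ready = []
    for size, done in zip(job_sizes_mb, ingest_done):
        start = max(done, engine_free)
        engine_free = start + size / dedupe_rate_mbps
        ready.append(engine_free)
    return ready


# Hypothetical nightly backup: three jobs of 200GB, 200GB and 600GB (in MB).
jobs = [200_000, 200_000, 600_000]

inline = inline_ready_times(jobs, inline_rate_mbps=500)
post_overlap = post_process_ready_times(jobs, 1000, 500, concurrent=True)
post_serial = post_process_ready_times(jobs, 1000, 500, concurrent=False)

print(inline)        # [400.0, 800.0, 2000.0]
print(post_overlap)  # [600.0, 1000.0, 2200.0]
print(post_serial)   # [1400.0, 1800.0, 3000.0]
```

With these assumed numbers, the post-processing appliance ingests twice as fast as the inline appliance, yet every job reaches de-duplicated form later, because the large final job still has to be de-duplicated after it lands.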
Figure 1. Moving backup data from the backup source through to a DR site is generally a three-step process (backup, de-duplicate, replicate), assuming the data stays on disk the whole time. Note, however, that inline approaches perform backup and de-duplication concurrently, effectively resulting in a two-step process.
Figure 1 shows the process we will use to compare inline and post-processing de-duplication impact on TTR. The operative measure of TTR is defined as the earliest point in time at which you could fully recover a file (or a full system) from the DR site, and this measure will have a certain RPO and RTO associated with it.
Figure 2 shows key factors that must be evaluated to understand TTR. Let's look more closely at a few of these:
• When/where de-duplication is performed can significantly impact backup times. In backup environments, de-duplication will often reduce the size of a backup job by 10x to 20x or more. If de-duplication is performed close to the source, such as with a smaller, less expensive de-duplication appliance acting as a local backup target at a remote office/branch office (ROBO), then the benefits of de-duplication can be leveraged to reduce the amount of data that has to be transferred from the backup source across a LAN/WAN to the primary site, which can reduce bandwidth and time requirements significantly. The trade-off here is any additional time required to perform de-duplication during the backup process vs. the time saved by transferring significantly less data across the network. Network bandwidth is a key variable in this comparison.
• Backup ingest performance is an important factor, but it is not the only factor. Granted, you'll want to minimize the time period during which the performance of applications being backed up may be degraded due to backup operations, but you'll need to understand both the backup ingest performance of the de-duplication appliance and the maximum data rate at which your network can deliver data to it. Only then can you understand if it is relevant to compare backup ingest performance between inline and post-processing de-duplication appliances.
Figure 2. A broad set of key metrics should be evaluated for inline and post-processing approaches to accurately predict how quickly data can be recovered from a remote DR site.
• For inline de-duplication appliances, it has been implied that ingest rates are slowed by the fact that de-duplication is performed as the backup is being ingested. Given the laws of physics, this makes sense, but you would also expect that post-processing de-duplication appliances that are concurrently ingesting one backup job while de-duplicating another would suffer from a similar performance degradation. Although it's not the only point of comparison, to compare inline and post processing "performance" (defined as backup ingest + de-duplication) on a level playing field you'll need to understand what the de-duplication performance of the post-processing appliance is, not just the backup ingest performance, and you'll need to understand how this varies when backup ingestion is being performed concurrently. Inline vendors openly quote this performance, while post-processing vendors generally quote only the backup ingest performance, not the de-duplication performance. You don't need to know both of those data points to know when a backup will complete locally, but you do need to know them both to understand what your TTR from a remote DR site is since the backup cannot be transferred to that remote site until it has been both ingested and de-duplicated.
• If you are keeping your data on disk at the DR site, de-duplication performance may have another impact on TTR that may not be readily apparent. Certain recovery operations, such as the need to fully recover multiple systems due to a comprehensive disaster, may require the transfer of so much data that you will not use a WAN to do this. Most vendors provide an option to dump data to "shippable" devices (an appliance, tapes, etc.) to get the data to the recovery location faster (if it's different from the DR site location). Often the data at the primary site may have been destroyed, but the site has not, and the primary site is the preferred recovery location. If the data is being stored in de-duplicated form on disk at the DR site, you'll want to know how fast that data can be dumped to these devices in its native form. What we've found in investigating read performance differences between inline and post-processing appliances is that inline appliances are generally much faster (on the order of 3x to 4x) than post-processing appliances when data is not sitting in a disk cache in its native form. A difference of being able to read data at, say, 400MBps from an inline de-duplication appliance vs. 100MBps from a post-processing appliance could potentially generate a difference of many hours when downloading data, depending on the amount of data. That difference can have a potentially large impact on your TTR.
• Different approaches impose different costs and management issues, so make sure you understand these. A prime example is single-stream performance: when comparing it, you need to understand the size and cost of the different vendors' configurations required to achieve a given level of performance. If a certain approach requires more disk spindles to achieve a certain level of performance, that will have implications not only on cost, but also potentially on management and other issues (floor space, energy consumption, etc.). When addressing large single backup jobs, such as databases, single-stream (not aggregate) throughput is the operative metric to understand TTR. Make sure you compare apples to apples and that you're comparing the right apples!
• You also need to know exactly how the replication works. Is replication performed when a file is backed up, when a backup job is completed, or only when all backup jobs are completed? Clearly, you want to replicate data in its de-duplicated, not native, form to minimize the bandwidth required between the primary and DR site, so you'll have to wait until de-duplication has been performed. During the initial de-duplication process, most vendors de-duplicate and compress the data prior to storing it, but not all vendors retain the data in this form as it is replicated. If vendors have to uncompress data before they replicate it, this will have an impact on the bandwidth required to complete that replication, a factor which potentially has an impact on TTR. It's also important to understand when the necessary metadata to reconstruct a de-duplicated file is replicated. A file cannot be recovered from the DR site until this metadata (which basically provides the formula for finding the file and converting it back into its native form) is also at the DR site. Some approaches replicate the metadata along with the file, whereas others do not replicate the metadata for all the files until the very end of the replication process for a particular backup job or, in some cases, even later. For large data transfers, the difference between replication complete times, defined as when the file is recoverable from the remote site, may vary by hours. This will have an obvious impact on TTR. There are key benefits to having the replication functionality integrated with the de-duplication appliance, as opposed to using a separate replication capability, such as array-based replication, to perform it. Integrated options can make the file immediately available on a read-only basis at the DR site after its metadata has been transferred, whereas more static options such as array-based replication will require that the file be manually mounted on a server before it can be accessed. 
Integrated approaches may make it easier to manage the integrity of de-duplicated data, but even so you'll want to understand how vendors address the data integrity issues. In de-duplicated environments, the loss of a single chunk of data may affect hundreds or thousands of different files, so you'll want to make sure you're comfortable with the reliability of vendors' approaches in this area.
• When comparing recovery times, vendors clearly want to present very rapid recovery capabilities. Inline approaches do not cache data in native form, whereas most post-processing vendors will recommend maintaining at least the most recent backup in native form to facilitate faster restores (statistically, most restores come from the most recent backup if it is available). When comparing recovery times, make sure that you understand the speed with which data can be restored from the remote site both when it is cached at the primary site and when it is not. Maintaining a large cache at the primary site, which includes data that has already been transferred to the DR site, may result in better recovery times, but it may not be a realistic representation of the types of recoveries you have to perform in your environment. Disk is getting less expensive, but a larger disk cache at the primary site costs more than a smaller or non-existent one. Will you be caching multiple days' worth of backups at your primary site? What is the chance that, if you have to perform a recovery from your DR site, the data you need will be available from a disk cache at your primary site? The data not only has to be there, but also has to be retrievable, and in the event of a disaster that may not always be the case.
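Two of the factors above come down to simple transfer-time arithmetic: how much a 10x to 20x reduction shortens a network transfer, and how much a 400MBps vs. 100MBps read rate matters when dumping data at the DR site. The Python sketch below works through both. The data volumes and the 12.5MBps link rate (roughly a 100Mbit/s WAN) are assumptions chosen for illustration, as are the function names; only the reduction ratios and read rates come from the discussion above.

```python
def transfer_hours(size_gb, rate_mbps):
    """Hours needed to move size_gb gigabytes at rate_mbps megabytes/second.

    Uses decimal units (1GB = 1,000MB).
    """
    return size_gb * 1000 / rate_mbps / 3600


def deduped_size_gb(native_gb, reduction_ratio):
    """Size after de-duplication, e.g. a ratio of 10 means a 10x reduction."""
    return native_gb / reduction_ratio


# 1) Sending a hypothetical 1,000GB backup over an assumed 12.5MBps WAN link.
native_gb = 1000
wan_rate = 12.5
native_hours = transfer_hours(native_gb, wan_rate)
deduped_hours = transfer_hours(deduped_size_gb(native_gb, 10), wan_rate)
print(f"native: {native_hours:.1f} h, 10x de-duplicated: {deduped_hours:.1f} h")

# 2) Dumping a hypothetical 10,000GB (10TB) of data in native form at the
#    DR site, using the two read rates mentioned in the text.
restore_gb = 10_000
fast_hours = transfer_hours(restore_gb, 400)   # about 6.9 hours
slow_hours = transfer_hours(restore_gb, 100)   # about 27.8 hours
print(f"400MBps: {fast_hours:.1f} h, 100MBps: {slow_hours:.1f} h, "
      f"difference: {slow_hours - fast_hours:.1f} h")
```

Plugging in your own data volumes and measured rates is the point of the exercise; the absolute figures here mean little outside the assumed inputs.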
When thinking about these issues, some other realizations become apparent. If you have stringent RPO/RTO requirements, the latest versions of inline de-duplication technology that can operate at very high speeds may very well produce TTRs that are hours shorter than those achievable with post-processing de-duplication technologies, even when the backup ingest (not the de-duplication) performance of a post-processing approach outperforms the inline appliance by a factor of 2x to 3x. This becomes more of an issue as you are dealing with large backup sets, and when comparing against post-processing technologies that can ingest and de-duplicate at the same time, realize that your single largest backup job will likely determine the TTR difference between inline and post-processing approaches. This is because most post-processing de-duplication approaches cannot start de-duplicating a backup job until that backup job is complete. What this means is that large enterprises with stringent recovery requirements may actually be better served with inline de-duplication technologies, provided the de-duplication performance of those solutions can at least keep up with your network's ability to deliver data.
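A quick way to test this claim against a vendor's numbers is to ask how fast a post-processing appliance would have to de-duplicate a single large job just to match an inline appliance. The following Python sketch computes that break-even rate. It is a simplified model under stated assumptions: one large job, de-duplication starting only after ingest completes, and identical replication time for both approaches (so replication cancels out of the comparison). The function names, the 10TB job size, and the 400MBps post-processing de-duplication rate are hypothetical.

```python
def inline_ready_seconds(size_mb, inline_rate_mbps):
    """Time until one job is ingested and de-duplicated inline."""
    return size_mb / inline_rate_mbps


def post_process_ready_seconds(size_mb, ingest_rate_mbps, dedupe_rate_mbps):
    """Time until one job is ingested, then de-duplicated (post processing)."""
    return size_mb / ingest_rate_mbps + size_mb / dedupe_rate_mbps


def break_even_dedupe_rate(inline_rate_mbps, post_ingest_rate_mbps):
    """De-duplication rate at which post processing ties inline on one job.

    Solves size/ingest + size/dedupe = size/inline for the dedupe rate.
    """
    if post_ingest_rate_mbps <= inline_rate_mbps:
        raise ValueError("post-processing ingest must be faster than inline "
                         "for a break-even rate to exist")
    return 1.0 / (1.0 / inline_rate_mbps - 1.0 / post_ingest_rate_mbps)


# Inline at 500MBps vs. a post-processing appliance that ingests 3x faster.
needed = break_even_dedupe_rate(500, 1500)
print(f"break-even de-duplication rate: about {needed:.0f} MBps")

# Hypothetical 10TB database job (10,000,000MB) with an assumed
# post-processing de-duplication rate of 400MBps.
size_mb = 10_000_000
inline_s = inline_ready_seconds(size_mb, 500)
post_s = post_process_ready_seconds(size_mb, 1500, 400)
print(f"inline ready after {inline_s / 3600:.1f} h, "
      f"post processing after {post_s / 3600:.1f} h")
```

Under these assumptions, a post-processing appliance with 3x the ingest rate still needs to de-duplicate at roughly 750MBps to tie; anything slower means the job is ready for replication later than with the inline appliance, and the gap grows with the size of the job.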
When crafting a DR strategy that leverages de-duplication technology, integrated replication is a key piece of functionality to meet enterprise recovery requirements that not only simplifies data movement but can also contribute to more automated recoveries and offer TTRs that may be hours shorter.
And finally, while all the architectural differences between various vendors' offerings do not necessarily have a big impact on TTRs, there are a few key ones that do. By understanding your needs, and then understanding the implications of different vendor implementations, you'll be able to focus on the metrics that should make your technology decision straightforward.
Eric Burgener is a senior analyst and consultant with the Taneja Group research and consulting firm.