Data de-duplication: Questions and answers

Posted on February 01, 2008


Eight questions that every IT organization should ask about data de-duplication before they deploy or upgrade.

By Heidi Biggar

Data de-duplication is arguably one of the most important new technologies to hit the storage market in years, and it's a game-changing technology that can have an immediate impact on end-user environments.

By reducing the amount of physical disk capacity needed to store information, data de-duplication allows organizations to keep more information on disk-based systems, making it more accessible to the people and applications that need it. For organizations that are already de-duplicating their secondary data, de-duplication paves the way for wider use of the technology inside the data center and, equally important, at remote sites (see figure).

While there is no doubt that data de-duplication will play a large role across all classes of storage moving forward, some important technological and business considerations remain when you're evaluating potential products. Addressing these factors will go a long way toward ensuring a "best fit." This article identifies eight technology-related questions every organization needs to ask.

1. What data de-duplication ratios can I expect?

While ratios of 50:1, 100:1, 200:1, and higher are possible, we've found through conversations with end users, recent ESG Research, and hands-on ESG Lab testing that ratios of 10:1 to 20:1 are more typical (see figure).

Data reduction ratios depend on a number of variables, including the type of data being backed up or stored, retention periods, the frequency of full backups, and the specific data de-duplication technology being used. To get an idea of the ratios you can expect in your environment, we encourage you to provide potential vendors with detailed information about your environment, backup processes, applications, retention SLAs, and data types.
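To see why retention periods and the frequency of full backups matter so much, consider the following back-of-the-envelope sketch. It is a toy model, not a vendor sizing tool: it assumes weekly full backups plus daily incrementals, treats every day's changed data as unique, and ignores compression. The function name, parameters, and sample values are all hypothetical.

```python
def estimate_dedup_ratio(primary_tb, weekly_fulls_retained, daily_change_rate,
                         incrementals_per_week=6):
    """Rough de-duplication ratio for a weekly-full, daily-incremental scheme.

    Assumptions (illustrative only): each day's changed data is entirely
    new to the system, and everything else duplicates data already stored.
    """
    # Logical data: what the backup application writes over the retention period.
    fulls = weekly_fulls_retained * primary_tb
    incrementals = (weekly_fulls_retained * incrementals_per_week
                    * daily_change_rate * primary_tb)
    logical_tb = fulls + incrementals

    # Physical data: the first full is stored once; every later backup day
    # adds only that day's changed data.
    backup_days = weekly_fulls_retained * (incrementals_per_week + 1)
    physical_tb = primary_tb + (backup_days - 1) * daily_change_rate * primary_tb

    return logical_tb / physical_tb


# Hypothetical environment: 10 TB of primary data, 1% daily change rate.
for weeks in (4, 12, 52):
    ratio = estimate_dedup_ratio(10, weeks, 0.01)
    print(f"{weeks:>2} weekly fulls retained: about {ratio:.1f}:1")
```

In this model the ratio climbs as more full backups are retained, which is one reason the same product can report very different ratios at different sites.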

2. Will data de-duplication affect my existing backup-and-restore performance? And if so, how?

This is an important question to ask, especially when you consider that one of the primary objectives of implementing disk-based backup solutions is to improve overall backup-and-restore/recovery performance. In many cases, performance will depend on factors such as the backup software that is used, as well as the systems and networks that support it, so it's important to ask a couple of follow-up questions:

  • What is the single-stream backup-and-restore speed? This refers to how fast a given file or database can be backed up, restored, or copied to tape for archiving. The backup and restore numbers can differ because read and write speeds are influenced by different variables. Backup throughput is what most users ask about, although restore time is often more significant for SLAs.
  • What is the aggregate backup/restore throughput per system? In other words, how fast can a given controller perform with many streams? The answer will determine how many controllers/systems you need.
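The aggregate-throughput question translates directly into a sizing calculation. The sketch below is a rough illustration that assumes throughput scales linearly across controllers and that the vendor-quoted rate is sustained for the whole backup window; the function name and the sample figures are hypothetical.

```python
import math


def controllers_needed(backup_tb, window_hours, per_controller_mb_s):
    """Number of controllers needed to land a backup within the window.

    Uses decimal units (1 TB = 1,000,000 MB) and assumes linear scaling.
    """
    required_mb_s = backup_tb * 1_000_000 / (window_hours * 3600)
    return math.ceil(required_mb_s / per_controller_mb_s)


# Hypothetical example: 50 TB nightly, 8-hour window, 400 MB/sec per controller.
print(controllers_needed(50, 8, 400))
```

The same arithmetic applies to restores: substitute the restore throughput and the recovery-time target from your SLA.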

Data de-duplication can provide significant efficiencies within the backup environment, but it is not a panacea. If the backup environment is "broken," it is unlikely that data de-duplication alone will fix it. Existing system and network capabilities, as well as bottlenecks, must be factored in. Data de-duplication may allow more backup data to go to disk instead of tape, yielding significant improvements in performance, but it won't fix a poor overall design or implementation.

As for any performance impact from the de-duplication solution itself, some performance degradation can be expected with inline approaches (since data is being de-duplicated in the data path as it is being ingested). The actual impact depends on a number of variables, including the de-duplication technology itself, the size of the backup volume, the granularity of the de-duplication process, the aggregate throughput of the architecture, and the scalability of the solution.

Of course, there are also trade-offs to performing de-duplication post-process (that is, out of the data path, after the data has been ingested), notably the capacity reserve that is initially needed to store the full backup job before it is de-duped. There are other performance-related issues as well, including the disaster-recovery (DR) window (see "De-duplication trade-offs: Recap," below). However, performance issues of any kind are only relevant if they actually occur. Users won't get any benefit from paying more for a faster solution when a less costly one handles the job just fine.

3. How will data de-duplication impact my DR window?

The effect that de-duplication can have on an IT organization's DR window is an important consideration, one that can have significant implications depending on the specific environment. De-duplication benefits, such as increased retention and lower tape costs, are important, but their value can quickly erode if the de-duplication process is difficult to use or if it impacts DR readiness. This is where "time to protection" (T2P) comes into play. T2P refers to the time it takes to get application data backed up and moved off-site for DR. The length of this process, from start to finish, depends on the data de-duplication approach (inline versus post-process) and the speed of the de-duplication architecture, as well as the DR method (e.g., is the data written or exported to tape, or is it de-duplicated and then replicated over a WAN to an off-site location?).

It's important to sketch out this process, assigning time values to each leg of the process. Doing so will help ensure organizations aren't exposed in the event of a disaster. Reclaiming disk space is great, but it shouldn't come at the expense of T2P.
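One simple way to sketch out the process is to list each leg with an estimated duration and add them up. The example below is a minimal illustration, assuming the legs run strictly one after another (some products overlap them). The leg names and hour values are placeholders, not measurements, and should not be read as a comparison of the two approaches.

```python
from dataclasses import dataclass


@dataclass
class Leg:
    name: str
    hours: float


def time_to_protection(legs):
    """Total T2P in hours, assuming the legs run strictly in sequence."""
    return sum(leg.hours for leg in legs)


# Hypothetical post-process workflow: land the backup, de-dupe, then replicate.
post_process = [
    Leg("backup to disk", 6.0),
    Leg("post-process de-duplication", 4.0),
    Leg("replicate de-duplicated data over WAN", 3.0),
]

# Hypothetical inline workflow: no separate de-dup leg, but slower ingest.
inline = [
    Leg("backup to disk with inline de-duplication", 8.0),
    Leg("replicate de-duplicated data over WAN", 3.0),
]

for label, legs in (("post-process", post_process), ("inline", inline)):
    print(f"{label}: T2P = {time_to_protection(legs):.1f} hours")
```

Replacing the placeholder values with timings from a proof-of-concept run makes it easier to see which leg dominates T2P in your environment.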

4. Is de-duplicated remote replication supported?

Support for de-duplicated remote replication will become more and more important over time (see "De-duplicated replication: A hidden jewel," below). Minimizing the amount of redundant data moved over the WAN reduces overall network traffic, allowing users to enable, improve, or even expand disaster recovery and remote backup efforts.

Today, remote replication means different things to different people. As a rule of thumb, any product that supports "multi-site de-duplicated remote replication" should be able to de-duplicate data across the entire storage environment (i.e., at each remote site and again at the central site). This type of functionality is not widely supported by disk-based backup vendors today. If it is a requirement for your organization and your vendor does not currently support it, make sure it is at least on the vendor's road map.

5. Is it easy to implement and use data de-duplication?

One of the compelling things about de-duplication is that it is easy (or at least it should be), and this should hold true for both small- and large-scale installations. It should be invisible to the backup-and-recovery process, and it should be combined with disk backup solutions (e.g., purpose-built disk backup appliances or virtual tape libraries [VTLs]) that are also easy to use and implement. IT organizations should also have the flexibility to turn de-duplication "on" or "off" depending on network demands, user environments, data types, etc. Make sure to ask vendors for references!

6. How am I protected from data loss or corruption?

This is a very important question on a couple of different levels. It applies to both the disk backup system itself and the de-duplication technology. The first thing you need to understand is how "bullet-proof" the disk-based backup system itself is. Find out what technologies it has to ensure data integrity and to protect against system failures. Second, if the system de-duplicates data, then you need to find out what the system does if the source data becomes corrupt or inaccessible for some reason. After all, there may be 1,000 backup images that rely on a single copy of source data.

7. How scalable is the solution?

Again, this question applies to the disk backup solution itself, as well as the de-duplication technology. Make sure that you size your environment to meet current capacity and performance requirements, but also consider future demands. Choose a vendor that will make it easy for you to scale in terms of technology and cost. Also, make sure to ask vendors about any performance considerations for both their systems and their de-duplication technology as your environment scales.

8. What types of applications are supported?

Flexible application support may not seem like a big deal initially, but as environments scale and more data types and sites are added, it becomes increasingly beneficial, if not critical, for de-duplication solutions to support multiple applications. In particular, it's important that these solutions support multiple backup applications and preferably have the capability to de-duplicate and store other types of persistent data in the same system. The greater the flexibility of these systems, the more consolidation is possible using less physical infrastructure. This in turn reduces cost in terms of management, purchasing, and energy consumption.

The above questions are important, but they only cover technology considerations. There is also the business side to consider. One of the greatest attributes of data de-duplication is that its value is easy to quantify. It is relatively easy to put a dollar amount on the cost savings of reducing the amount of capacity needed to store backup data by 10:1, 20:1, or greater.
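The hard-cost arithmetic is simple enough to sketch. The example below assumes a flat cost per terabyte and ignores the price of the de-duplication product itself; the function name and all figures are hypothetical.

```python
def capacity_savings(logical_tb, ratio, cost_per_tb):
    """Dollar value of the disk capacity avoided at a given de-dup ratio."""
    physical_tb = logical_tb / ratio       # capacity actually consumed
    saved_tb = logical_tb - physical_tb    # capacity that no longer has to be bought
    return saved_tb * cost_per_tb


# Hypothetical example: 200 TB of backup data at $1,000 per TB of disk.
for ratio in (10, 20):
    print(f"{ratio}:1 saves ${capacity_savings(200, ratio, 1000):,.0f}")
```

Note that a 10:1 ratio already removes 90% of the capacity, while 20:1 removes 95%, so doubling the ratio adds relatively little in hard-cost savings.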

While these numbers can be significant and may be enough for some organizations to move forward, they only tell part of the data de-duplication cost-savings story. A complete return-on-investment (ROI) analysis should include both the hard and soft cost savings of deploying de-duplication. In fact, the soft cost savings alone (the value of increased retention, operational efficiencies, and time to protection) can be very compelling. Finally, you should remember that while you may gain a 50:1 advantage on Day 1, new data will be added over time, and sooner or later you'll be right back to where you started in terms of capacity under management.

Data growth is the primary cause of many of the issues IT professionals face, and it causes downstream issues at every layer. Data protection is the easiest area in which to justify deploying de-duplication since it affects only "copied" information. However, de-duplication will eventually play a role at every point of the data lifecycle, as the benefits of "less" are clear at each level. The sooner you start implementing de-duplication, at any level, the better off you will be.

Heidi Biggar is an analyst with the Enterprise Strategy Group research and consulting firm.

De-duplication trade-offs: Recap

Currently, there are two distinct types of data de-duplication available: inline and post-process. Which is which can be determined by the answer to the following simple question: When is backup data de-duped? If it's done before it is written to the target, then it is inline de-duplication. If it's done after, then it is post-process.

There can be some performance degradation with inline de-duplication approaches as data is being ingested, and there is an up-front capacity consideration with post-process approaches. The performance impact of the inline approach depends on a number of variables, including the de-duplication technology itself, the size of the backup volume, the granularity of the de-duplication process, the aggregate throughput of the architecture, and the scalability of the solution. Some inline functions occur at the server, some as a "bump in the wire," but most take place at the target itself.

With the post-process approach, more disk capacity is needed up-front to store the backup volume. But the size of this capacity reserve also depends on a number of variables, including the amount of data being backed up and how long the data de-duplication technology needs to hold onto the capacity before releasing it. Solutions that wait for the entire backup process to complete before releasing capacity have a greater "capacity overhead" than solutions that start the de-duplication process earlier as backup data is being stored.
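The difference in capacity overhead can be approximated with a simple model. The sketch below assumes constant ingest and de-duplication rates and, for the concurrent case, that de-duplication starts as soon as data begins to land. Real products behave in more complicated ways, and the function name and sample rates are hypothetical.

```python
def landing_reserve_tb(backup_tb, ingest_tb_per_hour, dedup_tb_per_hour, concurrent):
    """Peak un-de-duplicated capacity a post-process system must hold.

    concurrent=False: de-duplication waits for the whole backup to finish,
    so the full job must be staged.
    concurrent=True: de-duplication runs while data is still arriving, so
    only the backlog (data ingested but not yet processed) must be staged.
    """
    if not concurrent:
        return backup_tb
    backlog_fraction = max(0.0, 1.0 - dedup_tb_per_hour / ingest_tb_per_hour)
    return backup_tb * backlog_fraction


# Hypothetical example: 20 TB job, 4 TB/hour ingest, 2 TB/hour de-dup engine.
print(landing_reserve_tb(20, 4, 2, concurrent=False))
print(landing_reserve_tb(20, 4, 2, concurrent=True))
```

In this example, waiting for the backup to complete requires staging the full 20 TB, while starting early cuts the reserve to the 10 TB backlog that builds up because ingest outpaces de-duplication.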

De-duplicated replication: A hidden jewel

This year, users are going to hear a lot more about de-duplicated replication. In fact, ESG research shows that users are already warming up to the concept, with 31% of respondents to a recent ESG survey reporting they have already implemented de-duplication technologies at both remote sites and within the corporate data center.

Ten percent said they had implemented the technology at remote sites only, and 52% said they implemented it within the data center (see figure).

So, while data de-duplication's initial foothold is within the walls of the data center, it's a natural (and easy) progression to roll it out remotely over time, as comfort levels with the technology increase. This has a number of potentially significant benefits for end users, including the following:

  • For some, it may mean the difference between replicating data over the WAN or not. De-duplicated replication can reduce WAN traffic by a factor of 10 or more, which can have important cost, performance, and DR implications. The cost savings come from reducing the amount of WAN bandwidth needed for the same backup volume. Similarly, backup-and-recovery performance improves because less data crosses the WAN; and
  • For other users, it may allow them to increase the number of applications or sites that they can protect in this fashion.

From a technology standpoint, it's the same premise: A data de-duplication engine analyzes and removes redundant data blocks before they are moved over the network. Again, the level of de-duplication users can expect to see depends on the implementation as well as the flexibility of the replication process (i.e., whether or not it does true "multi-site" de-duplication).
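The premise can be illustrated in a few lines of code. The sketch below is a simplified model, not any vendor's implementation: it assumes fixed-size 4 KB blocks and SHA-256 fingerprints, and it uses a plain dictionary as a stand-in for the remote site's block store. Commercial products may use variable-size segments and different fingerprinting schemes.

```python
import hashlib


def split_blocks(data, block_size=4096):
    """Cut a byte stream into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def replicate(data, remote_store, block_size=4096):
    """Return (recipe, blocks_to_send) for one backup image.

    recipe: ordered list of fingerprints needed to rebuild the image.
    blocks_to_send: only the blocks the remote site does not already hold.
    """
    recipe = []
    blocks_to_send = {}
    for block in split_blocks(data, block_size):
        fingerprint = hashlib.sha256(block).hexdigest()
        recipe.append(fingerprint)
        if fingerprint not in remote_store and fingerprint not in blocks_to_send:
            blocks_to_send[fingerprint] = block
    return recipe, blocks_to_send


def restore(recipe, store):
    """Rebuild an image at the remote site from its recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)


remote_store = {}  # stand-in for the remote site's block store

# First backup: four blocks, but only two distinct ones cross the WAN.
monday = b"A" * 4096 * 3 + b"B" * 4096
recipe_mon, sent_mon = replicate(monday, remote_store)
remote_store.update(sent_mon)

# Second backup adds one new block; only that block is sent.
tuesday = monday + b"C" * 4096
recipe_tue, sent_tue = replicate(tuesday, remote_store)
remote_store.update(sent_tue)

print(len(sent_mon), len(sent_tue))
print(restore(recipe_tue, remote_store) == tuesday)
```

In a multi-site deployment, the same lookup would run against a store shared by all sites, so a block already sent from one remote office would not be sent again from another.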



