Data deduplication: When, why, where and how?

The advantages of data deduplication are obvious, but there are a variety of deployment options.

By Russ Fellows

October 26, 2009 -- Data deduplication has received a lot of attention from vendors and IT administrators alike. Both are looking to reduce the pain associated with exponential data growth that most data centers are experiencing.

Data deduplication began to emerge roughly ten years ago, but has only recently become a mainstream technology. In the next several years, data deduplication will likely become as common as point-in-time copy and RAID technology are today.

Data deduplication is an exciting concept, due to its ability to dramatically lower costs for storing and moving data. Many vendors explain the benefits of their products, while highlighting the weaknesses of alternatives. Potential customers are left wondering, "Should I deploy data deduplication? Where should I use it, and what products are best suited to my environment?"

Without an objective analysis of the technology, and the advantages and disadvantages of each approach, IT administrators, managers and CIOs must rely on the marketing and sales claims of the vendors. This article examines the options and the leading product offerings.

When to consider deduplication
The cost savings provided through elimination of redundant data can have ripple effects throughout a data center. Savings may occur by delaying the purchase of new storage, or by expanding the effective capacity of existing storage. As a result of having less physical storage with spinning drives, power, cooling and space requirements are also reduced, further adding to the cost savings. Another potential benefit is a reduction in the amount of data transmitted over local, wide-area and storage networks, thereby reducing the need for networking equipment and decreasing bandwidth requirements.
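To make the capacity math concrete, consider a worked example. The 10:1 reduction ratio used here is an assumption for illustration only; real ratios vary widely with data type, change rate and backup policy:

```python
# Illustrative capacity math for deduplication; the 10:1 ratio is an
# assumed figure, not a vendor claim.
def physical_needed(logical_tb, dedup_ratio):
    """Physical disk required to hold a given amount of logical data."""
    return logical_tb / dedup_ratio

def effective_capacity(physical_tb, dedup_ratio):
    """Logical data that fits on a given amount of physical disk."""
    return physical_tb * dedup_ratio

# 100 TB of backup data at an assumed 10:1 ratio needs only 10 TB of disk,
# and 10 TB of physical disk holds an effective 100 TB.
print(physical_needed(100, 10))
print(effective_capacity(10, 10))
```

The same arithmetic drives the downstream savings the article describes: fewer spinning drives means less power, cooling and floor space, and fewer bits on the wire means lower bandwidth requirements for replication.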

Once data deduplication is understood, it is easier to know when, where and how to deploy solutions. Not all data is well suited for deduplication, nor are the benefits uniform. Thus, potential buyers should ask a few questions:

When should I deduplicate data?

Where should I deduplicate data?

What options are available?

One of the major architectural questions to answer is, "Should I deduplicate all of my data, or only part of it?" Another question is, "Should I deduplicate my data as I am storing it, or later?" All of these issues should be explored and understood prior to choosing an overall architecture for deduplicating data, and then selecting a vendor.

How deduplication works
Data deduplication builds on the ideas and methods used to compress data, including duplicate data set elimination and other techniques. Data deduplication takes the concept of looking for redundant information used by compression and expands it to a much broader scale, operating at the terabyte or petabyte level rather than the kilobyte level of compression technology.

All data deduplication solutions look for redundant information in data, whether at the file, object or sub-object block level. Early versions of data deduplication were file-based and eliminated duplicate files. These methods still exist and are known as single-instance storage. More recent implementations look for duplicate data across multiple data types, finding duplicates in variable-length segments.

Data deduplication typically works by analyzing data and computing a shorthand, or unique identifier, for each piece of information. Sub-file or block-level deduplication technologies break data into chunks; each chunk is fingerprinted with a cryptographic hash to determine whether that piece has been previously stored.

Whenever common strings of data exist, they are replaced with a reference to the original data, thus saving space. The mathematical algorithm that computes the shorthand fingerprint is known as a cryptographic hash function. There are many hash algorithms, including MD5 and SHA-256, along with proprietary alternatives.
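The chunk-and-fingerprint process can be sketched in a few lines of Python. This is a minimal, illustrative model only -- fixed-size 4 KB chunks and an in-memory dictionary standing in for the block store -- not how any particular vendor implements it; commercial products typically use variable-length chunking and persistent indexes:

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking for simplicity; products often use variable-length

class DedupStore:
    """Minimal sketch of block-level deduplication (illustrative, not a product)."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> unique chunk data

    def write(self, data: bytes):
        """Store data, keeping only one physical copy of each unique chunk."""
        refs = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()  # cryptographic fingerprint
            if fp not in self.blocks:               # store only previously unseen chunks
                self.blocks[fp] = chunk
            refs.append(fp)                         # duplicates become references
        return refs

    def read(self, refs):
        """Reassemble the original data from its chunk references."""
        return b"".join(self.blocks[fp] for fp in refs)
```

Writing the same data twice consumes physical space only once: the second write produces references to chunks already in the store.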

The concept of eliminating redundant data may sound risky. Typically, new technologies introduce some additional risk, but as products are improved the issues are addressed. Many of the ideas behind data deduplication have been used for decades with data compression.

Technologies such as logical block addressing of disk drives, RAID, point-in-time copy and replication all remap data and change its physical layout. Initially, many of these technologies were viewed as risky, but as they matured and vendors delivered reliable products, adoption and acceptance grew. Data deduplication has been maturing for nearly a decade, and there is very little risk with the current generation of products.

Another issue facing users looking to deploy data deduplication with archive and compliance storage is official regulatory acceptance of the technology in compliant archive products. Governmental regulations typically lag technology by many years. Thus, just as WORM tape drives and WORM disk storage devices have gradually gained acceptance from regulators, so too should data deduplication technology.

Product options
Architecturally, there are several approaches to data deduplication. Deduplication may be included with backup applications, or delivered in a storage appliance. Understanding the differences between these architectures should be the first consideration when selecting products.

Where to deploy deduplication
After understanding the deployment options, the next set of questions typically revolves around whether to utilize a virtual tape library (VTL) or a disk-to-disk appliance. Organizations that choose to utilize deduplication in backup software may still want to use a VTL or D2D appliance to speed up their backup and restore operations.

Data deduplication is offered as an add-on feature for many D2D and VTL products. A few vendors provide data deduplication for primary storage, others provide software that turns generic hardware into D2D appliances, while others include it in backup applications.

Ultimately, data deduplication will become a service that may be used in many different locations throughout a data center. Until then, IT administrators and architects will have to create solutions that utilize data deduplication where it is most beneficial. Typically, data backup processes contain the highest levels of duplication.

For this reason, the backup process is the area where most vendors have focused their deduplication efforts. Even when backup applications use incremental backups after an initial full backup, a significant amount of duplicate data remains, and deduplication can reduce the storage requirements for these data sets as well.

Due to the processing overhead associated with deduplication, it is also common to deploy deduplication for backup or archive data, rather than primary storage. As a result, most products that provide data deduplication are associated with backup and archiving, including backup applications and disk-based backup and archiving platforms such as NAS appliances or VTLs.

The choice between using a D2D appliance or a VTL depends on the IT environment, including other storage systems in use, the amount of physical tape utilized, and other factors. Environments that have a significant investment in tape, and primarily utilize block storage systems, are often better served by deploying a VTL. In contrast, environments that don't have a significant amount of tape drive or media investment, and utilize a large amount of file or NAS storage, may find D2D appliances are a better fit.

Deployment choices
After deciding on how and where to deploy data deduplication, there is still an important decision to make regarding when deduplication occurs. One option is to deduplicate data as it is being sent to a backup device, and another option is to deduplicate the data at a later time. Real-time or streaming data deduplication is known as "in-line" while data deduplication that occurs later is commonly referred to as "post-process" deduplication.

For administrators looking to minimize data backup time, the best option is typically a post-process method. This has the advantage of backing up data faster, reducing the backup window. The disadvantage is that additional storage space is consumed: backup data is sent to a temporary holding area in order to speed the backup process. Once the backup completes, the data is re-examined for duplicates, which are removed at a later, "post-process" time.

An alternative to deduplicating after a backup is to perform deduplication "in-line" as data is being sent to the backup device. The advantage with this method is that no extra space is required. Another advantage is that once the data is deduplicated and stored, the process is done, and data may be replicated to off-site storage. As a result, the time to complete the entire backup process, including replicating to off-site systems, can be reduced by using an in-line deduplication approach.
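The trade-off between the two timing options can be sketched as follows. This is an illustrative Python model under assumed simplifications (pre-chunked data, in-memory dictionaries as stores), not any vendor's implementation. Both paths arrive at the same deduplicated result; the difference is that the post-process path lands everything in a staging area first, consuming extra space until the second pass runs:

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    """Cryptographic fingerprint used to detect duplicate chunks."""
    return hashlib.sha256(chunk).hexdigest()

def inline_backup(chunks, store):
    """In-line: deduplicate as data streams in; duplicates never consume disk."""
    refs = []
    for chunk in chunks:
        fp = fingerprint(chunk)
        store.setdefault(fp, chunk)  # only unseen chunks are stored
        refs.append(fp)
    return refs

def post_process_backup(chunks, staging, store):
    """Post-process: land all data in a staging area first (shorter backup
    window), then deduplicate the staged data in a second pass."""
    staging.extend(chunks)           # extra space consumed temporarily
    refs = []
    while staging:
        fp = fingerprint(staging.pop(0))
        store.setdefault(fp, chunks[len(refs)])
        refs.append(fp)
    return refs
```

In the in-line path the deduplicated copy exists as soon as the write completes, which is why replication to off-site storage can begin immediately; the post-process path must wait for the second pass to finish.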

Product comparisons
The majority of deduplication features delivered to customers are some combination of software and hardware. Data deduplication uses significant amounts of CPU and memory when calculating hash values for data. With appliances, the vendor has already made the decisions regarding capacity, I/O performance and available compute power. Thus, software-based systems typically provide more flexibility in choosing the appropriate amount of CPU, memory and storage capacity than appliance-based solutions.

Many IT users prefer to purchase integrated hardware and software, leveraging the appliance model for the ease of deployment and support. Others prefer software using generic hardware for a more flexible approach. Neither model is better, with both having advantages and disadvantages.

Provided below is a comparison of data deduplication vendors, products and features.

Data deduplication is able to dramatically decrease the amount of disk space required for backup data, while retaining the significant performance improvements that disk-based backup devices have over tape. Thus, data deduplication should be considered for any IT environment looking to contain storage costs associated with backup and archive, while delivering high service levels for data protection.

There are many options available for deduplicating data. Some products allow the use of their systems as a backup target via NAS protocols as D2D appliances, delivering flexibility for deduplicating data outside of the traditional backup scenarios. Other products are geared specifically for use in conjunction with a VTL, in order to help the VTLs compete on a cost basis with traditional backup-to-tape deployments.

Over time, data deduplication will become a feature offered in conjunction with multiple product types and deployment scenarios. Ultimately, deduplication will find its way into multiple storage products. Within several years, data deduplication will likely be deployed in most products that store backup or archival data. The next step is to apply deduplication techniques for storing primary data, without impacting performance. 

In the future, nearly all data will be deduplicated when it is stored or transmitted. Until then, IT departments should carefully evaluate their company's cost, performance and data retention goals prior to choosing how and where to deploy data deduplication. By choosing carefully, IT organizations of all sizes will be able to deliver improved performance at a lower cost.

Russ Fellows is a managing partner with the Evaluator Group research and consulting firm.


This article was originally published on October 23, 2009