Before jumping into data de-duplication . . .

By Ann Silverthorn

A group of storage professionals recently gathered by phone conference for a Wikibon Storage Research Meeting to discuss de-duplication. Wikibon is a worldwide community of practitioners, consultants, and researchers dedicated to improving technology through the sharing of knowledge. David Floyer, a Wikibon co-founder, was the speaker.

By way of background, Floyer says there are three main approaches to de-duplication.

1. Inline, variable block: All the work is done in a single path and reversing the process is easy. The disadvantages are that this type of de-duplication requires high processing power to run its algorithms and that it creates a single point of failure. The primary vendors in this space, according to Floyer, are Data Domain and Diligent.

2. Block hashing: This gives the user faster initial processing, but with a data-integrity drawback: the hashing keys are not guaranteed to be unique, so two different blocks can in principle produce the same key. This method also presents a single point of failure. Vendors in this space include Data Domain, FalconStor and Quantum, along with NetApp and IBM.

3. Logical construct: Data can be looked at from an internal point of view. The metadata can be used by applications, for example for e-mail tracing and discovery, where its links could be used to find e-mail files that are located in different places. Floyer says representative vendors in this space are Hewlett-Packard and Sepaton.
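
The block-hashing approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any vendor's implementation: the fixed 4KB block size, the SHA-256 hash, the in-memory dictionary, and the function names dedupe and restore are all hypothetical choices made for the example. Because hash keys are not guaranteed to be unique, the sketch compares the bytes whenever a key already exists instead of trusting the hash alone.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def dedupe(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks and keep each distinct block once.

    Returns the ordered list of hash keys (the "recipe") needed to
    rebuild the original data from the block store.
    """
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        key = hashlib.sha256(block).hexdigest()
        existing = store.get(key)
        if existing is None:
            # First time this block has been seen: store it.
            store[key] = block
        elif existing != block:
            # Same key, different bytes: a hash collision. Refusing to
            # continue avoids silently corrupting the data.
            raise ValueError("hash collision detected for key " + key)
        recipe.append(key)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Reverse the de-duplication by looking up every key in order."""
    return b"".join(store[key] for key in recipe)
```

In this sketch, a 16KB input made of three identical 4KB blocks and one different block would leave only two blocks in the store. Note that restore needs both the recipe and the block store, which mirrors the point that de-duplicated data cannot be read back without whatever produced it.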

"The nice thing about A-SIS is that the de-duplication is integrated with the filer head," says Floyer of NetApp's de-duplication technology. "This means that error recovery, file corruption, power failure, or file inconsistencies are dealt with by the same operating system."

Moving de-duplication from the servers to the storage, as NetApp has done, addresses the issue of performance, because the servers' processors are not burdened with the task. But generally, the applications that can benefit most from de-duplication will also have the most processing issues.

When considering de-duplication, it's important to have realistic expectations about how much space you'll really save. For example, a medical PACS application has many unique files, so the rate of savings might be less than 10%. Database backups see the greatest data reduction, but they require a large amount of processing. See the chart from IBM below:


One claim often heard in discussions of de-duplication is that tape is headed for extinction. Floyer doesn't agree, especially when one considers the issue of data recovery. He created a scenario to determine which technology had the better recovery point objective (RPO): a VTL with de-duplication or traditional tape backup. To determine which method would lose the least amount of data, his scenario assumed a 12TB system with synchronous copy to a remote site. Backups would consist of 12 hours of online financial-system data and eight hours of batch data. Traditional tape backup took a total of eight hours to recover and the VTL took nine hours.

"In other words, if you need to be able to recover critical data, using transmission techniques, even with very high levels of de-duplication, is not that fast," says Floyer. "So getting rid of tapes seems to be an overstatement, particularly in terms of data recovery and the best type of system."

One of the professionals on the Storage Research Meeting call pointed out that with A-SIS de-duplication, administrators can de-dupe the volume, then do a snapshot of that volume, and finally mirror the snapshot to a second location.

Still, Floyer maintains that when you add up the total time needed to get that data to the other site, perform the de-duplication, snapshot it, and then transmit it, the total elapsed time is usually less for tape.

Floyer added that a different approach would have a major impact on the RPO: doing continuous backup with remote-replication techniques, reducing the data with de-duplication methods inside the storage controller, and transmitting it immediately from the controller as part of remote replication.

Discussion also revolved around the need to reverse the de-duplication before the data can be recovered. In most cases, the de-duplication software is proprietary, so the process can't be reversed without the software that manipulated the data in the first place. Floyer believes that vendors should open the code for their de-duplication software, or risk seeing the technology stall in the marketplace. He maintains that other applications should be able to bypass the software to access the de-duplicated data for quicker restores.

Wikibon will hold another Storage Research Meeting on Tuesday, June 5, at 12:00 Eastern (9:00 Pacific). This meeting will take a second look at de-duplication and will include officials from Storage Markets, the predictive market Website, who will share their findings with the group.

For more details on Tuesday's Storage Research Meeting, visit: www.wikibon.org.

For more information on Storage Markets, visit: www.storagemarkets.com.

For more information on A-SIS technology see: NetApp extends de-dupe beyond backups.

This article was originally published on May 31, 2007