Data de-dupe duplicity

InfoStor has sponsored a number of Webcasts (presented by ESG and the Taneja Group, and archived at www.infostor.com) that delve into data de-duplication. In every case, the most frequent question from attendees is: “Realistically, what kind of de-duplication ratios can I expect?”

The question reflects confusion over vendor claims, which routinely cite ratios of 20:1 and sometimes exceed 100:1. According to an end-user survey conducted by ESG, one-third of the respondents experienced less than a 10× reduction in data, while almost half (48%) were getting 10× to 20× reduction (see figure). The remaining 18% realized greater than 20× reduction, with only 2% reporting reduction rates greater than 100×.

Granted, the survey sample was small (only 48 respondents), but it is clear that you can expect data-reduction ratios of less than 20:1 in most cases.
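
It helps to translate those ratios into capacity actually saved. As a rough sketch (the function names and the 100TB starting figure are illustrative assumptions, not part of the ESG survey), an N:1 ratio eliminates 1 - 1/N of the original data:

```python
def space_saved(ratio: float) -> float:
    """Fraction of the original data eliminated at an N:1 reduction ratio."""
    if ratio < 1:
        raise ValueError("a reduction ratio must be at least 1")
    return 1.0 - 1.0 / ratio


def stored_capacity(original_tb: float, ratio: float) -> float:
    """Capacity still consumed after reduction, in the same unit as the input."""
    if ratio < 1:
        raise ValueError("a reduction ratio must be at least 1")
    return original_tb / ratio


# Hypothetical 100TB of backup data at the ratios discussed above.
for ratio in (10, 20, 100):
    saved = space_saved(ratio)
    left = stored_capacity(100, ratio)
    print(f"{ratio}:1 saves {saved:.0%}, leaving {left:.0f}TB on disk")
```

Moving from 10:1 to 20:1 lifts the savings only from 90% to 95%, and 100:1 adds just four more points, so a result below 20:1 still removes the bulk of the data.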

But data-reduction ratios aren’t the only issue with de-duplication. On the surface, the technology seems simple (and isn’t new at all): Eliminate data redundancy with sophisticated algorithms and, bingo, you get capacity and cost reduction, faster backup/restore and data-transfer rates, and reduced bandwidth requirements.

The concept is simple, but vendors have implemented data de-duplication in many different ways, making product evaluation a complex task. Users have to grapple with questions of where, when, and how, each involving trade-offs, such as:

  • Hardware- versus software-based de-duplication;
  • File- versus block- versus byte-level de-duplication;
  • De-duplication at the application or file servers versus at the backup server; and
  • Inline versus post-processing de-duplication (a question debated in the storage industry more than McCain vs. Obama is on CNN).

In almost every instance, the trade-offs revolve around cost, performance, flexibility, scalability, and the actual capacity reduction to be gained or sacrificed.
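
To make the file- versus block-level trade-off concrete, here is a minimal sketch, assuming fixed-size chunks and SHA-256 fingerprints. It does not represent any vendor's implementation; the function names, the 4KB block size, and the sample backup data are all hypothetical.

```python
import hashlib


def dedupe(files: dict[str, bytes], chunk_size: int):
    """Split each file into fixed-size chunks and keep one copy of each unique chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    store: dict[str, bytes] = {}        # fingerprint -> chunk, stored once
    recipes: dict[str, list[str]] = {}  # file name -> ordered fingerprints
    for name, data in files.items():
        fingerprints = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # only the first copy is kept
            fingerprints.append(digest)
        recipes[name] = fingerprints
    return store, recipes


def restore(name: str, store: dict[str, bytes], recipes: dict[str, list[str]]) -> bytes:
    """Rebuild a file by concatenating its chunks in recipe order."""
    return b"".join(store[digest] for digest in recipes[name])


def reduction_ratio(files: dict[str, bytes], store: dict[str, bytes]) -> float:
    """Logical bytes written divided by physical bytes stored."""
    logical = sum(len(data) for data in files.values())
    physical = sum(len(chunk) for chunk in store.values())
    return logical / physical if physical else 1.0


# Hypothetical data: two nightly backups that differ only in their last block.
BLOCK = 4096
backups = {
    "monday.bak": b"A" * BLOCK + b"B" * BLOCK + b"C" * BLOCK,
    "tuesday.bak": b"A" * BLOCK + b"B" * BLOCK + b"D" * BLOCK,
}

# Block level: 4KB chunks. File level: one chunk spanning the whole file.
block_store, block_recipes = dedupe(backups, BLOCK)
whole_file = max(len(data) for data in backups.values())
file_store, file_recipes = dedupe(backups, whole_file)

print(f"block-level: {reduction_ratio(backups, block_store):.1f}:1")
print(f"file-level:  {reduction_ratio(backups, file_store):.1f}:1")
```

In this toy case the block-level pass stores four unique blocks for six written (1.5:1), while the file-level pass finds nothing to share (1.0:1) because the two files are not identical. The finer granularity buys capacity at the cost of more fingerprints to compute and track, which is the kind of trade-off described above.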

Although you can get to a short list of vendors via complex calculations, making the final decision will almost inevitably require in-house testing—an arduous task, but well worth the effort given the potential advantages of data de-duplication.

Dave Simpson

This article was originally published on July 01, 2008