What’s hot? D2D, data reduction, data classification

Disk-to-disk (D2D) backup and recovery, data reduction, and ILM-based data classification are relatively new technologies that will eventually become mainstream.

By Dan Tanner

Data protection, business continuance, compliance, and disaster-recovery requirements will all intensify this year, and the mountains of data involved will continue to grow. The good news is that a variety of products already on the market, along with newer technologies coming over the next year or so, will help you get a grip on the situation. Broadly speaking, the products fall into the areas of disk-based backup/recovery, data reduction, and data classification. And they will eventually overlap to enhance one another and add up to more than simply the sum of their parts.

Disk-based backup

Disk-based backup has come down in cost while the spectrum of benefits it offers has broadened. This is just in the nick of time, because tape-based backup has hit the wall. Disk-to-disk (D2D) backup now offers economy, reliability, and high performance.

Disk-based backup is available in numerous flavors. The “simplest” flavor is a virtual tape library (VTL), which emulates a physical tape library. The appeal of VTLs is that they can be used with existing backup software with no procedural changes. And, because VTLs are disk-based, they can do things that tape can’t do, such as merge incremental backups into full backups in the background.
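The background merge of incrementals into a "synthetic full" backup can be sketched as follows. This is a minimal illustration of the idea, not a real VTL interface; the dictionary-based snapshots and file names are assumptions for the example.

```python
# Sketch of a synthetic full backup: merging incremental backups into a
# full backup in the background, as a disk-based VTL can. The dict-based
# snapshots and file names here are illustrative, not a real VTL API.

def synthesize_full(full, incrementals):
    """Overlay each incremental (oldest first) onto the last full backup."""
    merged = dict(full)        # start from the last full backup
    for inc in incrementals:   # apply incrementals in chronological order
        merged.update(inc)     # newer versions of files replace older ones
    return merged

full = {"a.doc": "v1", "b.xls": "v1"}
inc1 = {"a.doc": "v2"}   # Monday: a.doc changed
inc2 = {"c.ppt": "v1"}   # Tuesday: c.ppt created
print(synthesize_full(full, [inc1, inc2]))
```

Because the merge reads and writes disk rather than spooling tapes, it can run continuously in the background without disturbing the backup window.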

Although recovery operations appear the same as if tape were being used, they typically proceed much faster and more reliably from disk. And the economics are interesting, too. A case can be made that VTLs are less expensive than tape libraries, especially when you factor in media savings.

Eventually, D2D will become the norm for backup. Then, backups can be made directly mountable, simplifying and speeding recovery.

D2D backup/recovery is the foundation for continuous data protection, or CDP, which delivers continuous (or near-continuous) backup and high-speed recovery from virtually any point in time. (For more information on CDP, see “Making sense of CDP,” this month’s Special Report, p. 24.)

D2D combined with data-reduction technology will mean that all backups and archives can be online forever (with tape serving as an optional “life preserver”). And D2D combined with data-classification technology can assist organizations trying to implement information lifecycle management (ILM) by providing active migration of content to appropriate storage tiers.

Data reduction

Data reduction is the ultimate extension of single-instancing in storage, leading to content-optimized storage (COS), and is also being applied in WAN acceleration products, including those from vendors such as Cisco, Juniper, Orion, Riverbed, Swan, and Tacit Networks.

Data reduction springs from a file-level single-instancing technique in which each file is stored only once. But with single-instancing, even a tiny change requires saving the entire new file, and many files (e.g., Word files) carry a large amount of common formatting overhead. So why not save only the storage blocks within a file that actually change? Why not automatically store common blocks once, save only the few blocks that are unique, and reconstruct any file from mostly common parts? Or, why not scan within storage blocks and assign a code key that represents each unique bit string?

Those things are already being done. They require only computing horsepower, and there's plenty of that to spare. Implementations of data reduction vary widely (e.g., block-based comparison or hash-key generation), as do the implementation points (e.g., at the source, in the storage network, or at the storage end-point).
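The hash-key approach described above can be sketched in a few lines: split data into blocks, key each block by a cryptographic digest, and store a block only if that key has not been seen before. The fixed block size and in-memory store are simplifying assumptions; real products vary on both points.

```python
import hashlib

# Minimal sketch of hash-based, block-level data reduction (deduplication).
# The 4 KB fixed block size and the dict-as-store are assumptions for
# illustration; commercial implementations differ widely.

BLOCK_SIZE = 4096

def store_blocks(data, store):
    """Split data into fixed-size blocks and store each unique block once,
    keyed by its SHA-256 digest. Returns the 'recipe' (list of keys)."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        key = hashlib.sha256(block).hexdigest()
        store.setdefault(key, block)   # common blocks are stored only once
        recipe.append(key)
    return recipe

def rebuild(recipe, store):
    """Reconstruct the original data from mostly common parts."""
    return b"".join(store[k] for k in recipe)

store = {}
original = b"A" * 8192 + b"B" * 4096   # two identical blocks plus one unique
recipe = store_blocks(original, store)
assert rebuild(recipe, store) == original
print(len(store))   # 2 unique blocks stored for 3 logical blocks
```

Note that the reconstruction is exact, which is why the next section can describe data reduction as "lossless."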

However, data reduction is the Wild West as far as standards are concerned, although that may change as the technology gains more widespread acceptance.

Data reduction will enable you to have all backups and archives available online forever. The technology can also allow relatively slow networks to efficiently carry large changes to numerous files (e.g., cost-effective replication over the Internet).

Importantly, data reduction is "lossless," unlike compression, which is effective only at small ratios (say, 2:1 or 3:1) and, in its lossy forms, sacrifices data at higher ratios. Data reduction usually yields better than 20:1 reduction and can range to more than 100:1, making it possible to store multiple backup generations in scarcely more than a few percentage points of storage beyond the original. Storage allocation and multiple-instancing problems could become a thing of the past, and content-optimized storage can also provide the privacy and "digital notarization" immutability demanded by many compliance specifications.
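The arithmetic behind storing many generations for a few percent more capacity is straightforward: with block-level reduction, each extra generation costs roughly only its changed blocks. The figures below (a 1 TB full backup, a 2% daily change rate, 30 retained generations) are assumptions chosen for illustration.

```python
# Back-of-the-envelope arithmetic for multi-generation retention under
# block-level data reduction. All figures are illustrative assumptions.

full_backup_tb = 1.0   # size of one full backup
daily_change = 0.02    # fraction of blocks that change per day
generations = 30       # retained backup generations

naive = full_backup_tb * generations                              # no reduction
reduced = full_backup_tb * (1 + daily_change * (generations - 1)) # changed blocks only
ratio = naive / reduced

print(f"naive: {naive:.1f} TB, reduced: {reduced:.2f} TB, ratio {ratio:.1f}:1")
```

Under these assumptions, 30 generations fit in well under twice the space of a single full backup, which is where effective ratios near 20:1 come from.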

Examples of vendors using data reduction (or similar) technologies include Avamar, Data Domain, Diligent Technologies, and EMC (in its Centera platform).

Data classification

ILM is a worthy goal, because it aims to permanently and dynamically manage the placement of files onto the most appropriate storage device to meet service level agreements, compliance requirements, etc., under policy control. ILM itself rests on two key pillars: a tiered storage infrastructure and data classification.

Data classification is the hard part. It must completely abstract the notion of physical storage and present users with an interface that specifies the required degrees of protection, performance, and even security for data grouped according to business requirements.

Data classification must be able to do so in a manner that enables management and operational departments to use it easily and without error. Then, storage managers can concentrate on the health of storage without having to worry much about either storage allocation or data placement.
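The abstraction described above can be sketched as a policy table: users tag data with a business class, and policy, not the user, decides the tier. The class names, tiers, and policy fields below are hypothetical, meant only to show the shape of the interface.

```python
# Sketch of policy-driven data classification: business-level classes map
# to storage tiers under policy control, so users never name a physical
# device. Class names, tiers, and policy fields are hypothetical.

POLICIES = {
    "financial-records": {"protection": "replicated",   "tier": "compliance-archive"},
    "engineering-docs":  {"protection": "daily-backup", "tier": "midrange-disk"},
    "scratch":           {"protection": "none",         "tier": "capacity-disk"},
}

def place(file_name, business_class):
    """Return the storage tier for a file based on its business class.
    Storage managers change POLICIES; users only pick a class."""
    policy = POLICIES[business_class]
    return policy["tier"]

print(place("q4-ledger.xls", "financial-records"))   # compliance-archive
```

The point of the indirection is operational: storage managers tune the policy table as the tiered infrastructure evolves, while operational departments keep using the same business-level classes without error-prone placement decisions.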

A number of vendors are currently shipping data classification products, including Arkivio, CommVault, EMC, Index Engines, Kazeon, Njini, Scentric, and StoredIQ, and more are expected in the coming months. These products take different approaches and have different user interfaces.


I believe that, just as with storage resource management (SRM) software, there will be an outcry for standardization in the data classification market.

Dan Tanner is an independent industry analyst and consultant, and founder and principal of ProgresSmart (www.progressmart.com).

This article was originally published on April 01, 2006