Data reduction improves backup, reduces capacity

Posted on November 01, 2006

Data reduction technologies help end users deal with the “capacity bloat” of disk-based backup, and more.

By Heidi Biggar

Data reduction technologies, including data de-duplication, delta differencing, and compression, have quickly become some of this year’s most-talked-about new technologies. But before you put data reduction into the category of “just another over-hyped technology” or dismiss it as vendor “vapor-ware” or “marketecture,” take a closer look.

If you’ve already been backing up to disk, then data reduction technologies can help you effectively deal with the “capacity bloat” that can result from backing up the same data multiple times. And if you’re currently weighing backup-and-recovery options (tape versus disk or one disk-based backup product versus another), data reduction may be the tipping factor.

Many IT organizations are already using data reduction technologies, and the Enterprise Strategy Group (ESG) expects those numbers to grow steadily over the coming months, leaving little doubt about the effect the technology will have in secondary and, eventually, primary data environments.

In ESG’s opinion, data reduction is different from previous years’ buzzwords in three key ways:

It’s real. Data reduction products are available from a variety of vendors (about 15 at last count), and many more are on the way. In fact, recent ESG research already shows strong interest in, and, importantly, use of, data reduction technologies. For example, 65% of the survey respondents said they were either already using, or believed there was a need for, data reduction technologies.

While the number of respondents actually using data de-duplication is likely to be significantly lower than the 37% reported in the survey (we believe the difference is due to end-user interpretation of the provided definition), we believe the survey results are still a good barometer of end-user awareness of data reduction technologies, in general. Data reduction technologies include data compression, delta differencing, and data de-duplication. The nuances of these technologies are described in greater detail below.

It has immediate and significant benefits for secondary storage. Data reduction immediately reduces the amount of disk capacity needed for local backup. For organizations with remote locations, it also reduces the amount of data that must be moved over the WAN during remote backup and/or replication, which can translate into significant cost savings and performance benefits. In fact, for some organizations data reduction will be the means to a remote data-protection end. The less data that goes over the WAN, the lower the WAN-associated costs and the better the WAN performance.

It’s easy. Unlike many other technologies that promise big benefits but are complicated to deploy, implementing data reduction for backup is relatively easy. This is especially true with appliances that plug in and are ready to perform their first backups in less than an hour. No huge planning is required, nor is there a need for extensive changes to existing environments: You install the appliance and set up a backup policy, and the appliance performs data reduction.

ESG Lab has validated a variety of data reduction technologies from several vendors, including Asigra, Avamar, Data Domain, and Diligent. The amount of data reduction varies from vendor to vendor and depends on three factors: the type of data being de-duplicated (compressed files contain less duplicate data than, say, e-mails); the change rate of the data (the slower the change rate, the more duplicate data is created); and end-user backup and retention policies (the more frequent the backup and the longer the data is retained, the greater the duplicate data). Even so, we expect organizations to see an average 10x to 20x reduction in capacity using existing data reduction technologies, and reductions in excess of 100x are not out of the question.

What is data reduction?

As with most new technologies, data reduction suffers from definition abuse. Largely because of vendor positioning, data reduction has come to mean different things to different people. Data reduction technologies include data de-duplication, delta differencing, incremental backups, and data compression (see the “Snapshot” definitions of the four technologies at the end of this article).

Data de-duplication, or single-instance storage for file-level de-duplication, is not the same as delta differencing, or “differential backups” as some vendors refer to it. Both technologies fall under the general “data reduction” category, but their functions are very different. However, they can be used together, and with data compression, to further improve disk efficiency as well as backup and network performance.

Avamar and Asigra are examples of two disk-based backup providers with products that do all three: data de-duplication, compression, and delta differencing.

Data de-duplication eliminates redundant files, blocks, or chunks of data (depending on the vendor), which ensures only unique data is stored on disk. Delta differencing, meanwhile, monitors data (again, files, blocks, or chunks, depending on the vendor) to ensure only changed data is written to disk after the initial full backup. This compares to “incremental backups,” which back up files that have changed since the previous backup, not the initial full backup.
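The de-duplication idea described above can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the fixed 4KB chunk size, the use of SHA-256 digests as chunk identifiers, and the names dedupe_store and restore are all assumptions made for the example.

```python
import hashlib

def dedupe_store(blobs, chunk_size=4096):
    """Split each blob into fixed-size chunks and keep each unique chunk once."""
    store = {}      # digest -> chunk bytes (the unique data kept on disk)
    recipes = []    # per-blob list of digests needed to rebuild that blob
    for blob in blobs:
        recipe = []
        for i in range(0, len(blob), chunk_size):
            chunk = blob[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # store only if not seen before
            recipe.append(digest)
        recipes.append(recipe)
    return store, recipes

def restore(store, recipe):
    """Rebuild a blob from its list of chunk digests."""
    return b"".join(store[d] for d in recipe)

# Two hypothetical "backups" that share most of their content.
monday = b"A" * 8192 + b"B" * 4096
tuesday = b"A" * 8192 + b"C" * 4096

store, recipes = dedupe_store([monday, tuesday])
logical = len(monday) + len(tuesday)             # bytes the backups represent
physical = sum(len(c) for c in store.values())   # bytes actually stored
print(logical, physical, logical / physical)     # 24576 12288 2.0
```

Real products differ in how they choose chunk boundaries (file, block, or byte level) and in where the comparison runs, so the ratio here says nothing about what a given product would achieve.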

All three technologies (data de-duplication, delta differencing, and incremental backups) differ from traditional backup processes, which back up everything at every scheduled backup, whether or not the data has changed or is redundant. The differences among these technologies are illustrated in the four tables, below.
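To make the distinction concrete, here is a small Python sketch that follows the definitions used in this article: a traditional backup copies everything, an incremental backup copies what changed since the previous backup, and delta differencing copies what changed since the initial full backup. The file names, the use of version numbers as a stand-in for change detection, and the function name plan_backups are assumptions for illustration only.

```python
def plan_backups(days):
    """days: list of {filename: version} snapshots; days[0] is the initial full backup."""
    baseline = days[0]   # state captured by the initial full backup
    previous = days[0]   # state captured by the most recent backup
    report = []
    for state in days[1:]:
        report.append({
            # Traditional: every file, every time.
            "traditional": sorted(state),
            # Incremental: files changed since the previous backup.
            "incremental": sorted(f for f, v in state.items() if previous.get(f) != v),
            # Delta differencing: files changed since the initial full backup.
            "delta": sorted(f for f, v in state.items() if baseline.get(f) != v),
        })
        previous = state
    return report

days = [
    {"a.doc": 1, "b.xls": 1, "c.ppt": 1},   # day 0: initial full backup
    {"a.doc": 2, "b.xls": 1, "c.ppt": 1},   # day 1: a.doc changes
    {"a.doc": 2, "b.xls": 2, "c.ppt": 1},   # day 2: b.xls changes
]

for day, entry in enumerate(plan_backups(days), start=1):
    print(day, entry)
```

On day 2 the incremental backup copies only b.xls, while delta differencing copies both a.doc and b.xls, because both differ from the initial full backup.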

Data reduction has the immediate effect of reducing back-end disk requirements. This means it can lower disk-related backup costs, which ESG Research shows is a leading obstacle to disk-based backup adoption. However, there are two other important benefits of reducing data, which are often overlooked:

Extended retention policies: Data reduction frees up disk space, which means existing backup data can be kept online longer to meet more-demanding SLAs for data recovery.

Cost-effective and efficient remote backup/replication: Data reduction, again because it reduces the backup load, allows for more-efficient backup and replication between local and remote sites. Less data to push over the network means lower WAN costs and better performance. This compares to traditional remote backup or replication processes, which do not filter out redundant data.

Data reduction can be done at the file, block, or byte level (e.g., Diligent, ExaGrid, and Sepaton) either on-the-fly during the backup process (e.g., Asempra, Asigra, Avamar, Data Domain, ExaGrid, and Symantec) or post-process after data is written to disk (e.g., ExaGrid, FalconStor, and Sepaton). Byte-level data reduction is the most granular, and file-level is the least granular. There can also be performance vs. capacity tradeoffs depending on where the data reduction is done.

For example, Symantec does the data reduction at the host on-the-fly, which consumes some CPU cycles. This compares to ExaGrid and Sepaton (when it makes its data reduction technology available early next year), both of which do the data reduction post-process after the backup data has been written to disk.

Doing the data reduction post-process is more efficient from a performance (i.e., CPU and network bandwidth) standpoint, but it does require users to “reserve” disk capacity for the full backup stream; this capacity is “released” once the data has been reduced. Capacity is used and released in an accordion-like fashion.
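The accordion-like pattern can be sketched with a simple model. The assumptions here are deliberately crude and are not taken from any product: each nightly backup stream is the same size, the reduction ratio is constant, and the reserved space is released in full once reduction completes. The function name accordion is hypothetical.

```python
def accordion(nightly_gb, ratio, nights):
    """Return (peak_gb, settled_gb) per night for post-process data reduction."""
    stored = 0.0
    timeline = []
    for _ in range(nights):
        # The full backup stream lands in reserved capacity first.
        peak = stored + nightly_gb
        # After reduction, only the reduced data remains; the rest is released.
        stored += nightly_gb / ratio
        timeline.append((peak, stored))
    return timeline

for night, (peak, settled) in enumerate(accordion(1000, 20, 3), start=1):
    print(night, peak, settled)
```

Under these assumptions, capacity in use swells by the size of the full stream during each backup and then settles back to a much smaller figure, which is why post-process approaches need the reserved headroom even though long-term consumption stays low.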

The following two examples illustrate the real-life power of data de-duplication:

Case 1

An end user creates a PowerPoint presentation, which gets sent out internally to 20 people as an e-mail attachment.

With traditional backup methods, this PowerPoint attachment is backed up 20 times at the end of the day, even though no changes were made to the attachment by any of the 20 recipients. Depending on the length and complexity of the presentation, the capacity drain could be in the range of 20MB to 60MB daily. Assuming a retention period of 30 days, this translates into upwards of 1GB of capacity as a result of this single e-mail attachment. In a medium or large organization where PowerPoint presentations abound, it is easy to see how traditional backup methods could quickly drain online disk capacity and send associated disk costs skyward.

A data reduction appliance or software would analyze the data, and only one copy of the PowerPoint presentation would be backed up. More-sophisticated approaches that reduce data at the byte level would provide even greater data reduction.
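The arithmetic behind Case 1 can be checked with a short Python sketch. The 1MB and 3MB file sizes are inferred from the 20MB-to-60MB daily figure divided across 20 recipients; the function name attachment_backup_mb is hypothetical, and the de-duplicated case assumes the unchanged attachment is stored exactly once for the whole retention period.

```python
def attachment_backup_mb(size_mb, recipients, retention_days, deduplicated=False):
    """Capacity consumed by one unchanged e-mail attachment over the retention period."""
    if deduplicated:
        # Only one unique copy is kept, no matter how many mailboxes or days.
        return size_mb
    # Traditional backup: every recipient's copy, every day of retention.
    return size_mb * recipients * retention_days

for size_mb in (1, 3):
    traditional = attachment_backup_mb(size_mb, recipients=20, retention_days=30)
    reduced = attachment_backup_mb(size_mb, recipients=20, retention_days=30,
                                   deduplicated=True)
    print(size_mb, traditional, reduced)
```

For a 3MB presentation this gives 1,800MB with traditional backup versus 3MB when only one copy is kept, which is consistent with the “upwards of 1GB” figure above.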

Case 2

A user starts out backing up 1TB of data. On average, the user creates about 2GB of new data each week and also has 200GB of mission-critical data that is growing at a rate of 2GB per day. The user does a full backup of the mission-critical data each day to disk, and keeps it online on disk for 30 days, and then performs a weekly full backup of the non-critical data.

The accompanying table illustrates how capacity demands can quickly get out of hand with traditional disk-based backup solutions, where data is not de-duplicated and full backups are done of non-mission-critical data weekly and of mission-critical data daily. At the end of 12 weeks, disk capacity tips the scale at about 21TB.

In this example, the same backup schedule is used (i.e., weekly full backups of non-mission-critical data and daily full backups of mission-critical data); however, the data is reduced. The first full backup would result in a little more than 1TB of data being backed up (the same as the “traditional” approach); however, as more full backups are performed, the rate of data growth is orders of magnitude lower than in the traditional approach. In fact, assuming data reduction of 20x, disk requirements drop from 21TB to about 1TB.
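A rough model of Case 2 can be written in Python. It does not reproduce the original table; it rests on assumptions the article does not spell out: 1TB is treated as 1,000GB, the 200GB of mission-critical data is counted separately from the 1TB, all 12 weekly full backups are retained, and only the most recent 30 daily full backups are kept. The function name traditional_capacity_gb is hypothetical.

```python
def traditional_capacity_gb(weeks=12,
                            noncritical_gb=1000, noncritical_growth_per_week=2,
                            critical_gb=200, critical_growth_per_day=2,
                            critical_retention_days=30):
    """Disk consumed by un-reduced full backups at the end of the period."""
    total_days = weeks * 7
    # Weekly fulls of non-critical data; every weekly copy is retained (assumption).
    weekly = sum(noncritical_gb + noncritical_growth_per_week * w
                 for w in range(weeks))
    # Daily fulls of mission-critical data; only the newest 30 are retained.
    first_kept = total_days - critical_retention_days
    daily = sum(critical_gb + critical_growth_per_day * d
                for d in range(first_kept, total_days))
    return weekly + daily

traditional = traditional_capacity_gb()
reduced = traditional / 20   # the article's assumed 20x reduction
print(traditional, reduced)
```

Under these assumptions the traditional total comes to 22,242GB, or roughly 22TB, which is in the same range as the 21TB cited above; different retention or sizing assumptions would shift the figure. Dividing by 20 gives a little over 1.1TB.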

Bottom line

Data reduction can have a huge impact on capacity requirements: because only unique data is stored, the amount of storage capacity required drops significantly.

In addition, combining de-duplication with data compression and delta differencing can make the capacity savings potential even more compelling.

If you can reduce the amount of capacity required through de-duplication technologies by a factor of five, for example, and then compress that data by another factor of two, you have effectively reduced your capacity requirements by a factor of 10. Imagine 100TB becoming 10TB with all of your data still available.
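The reason the factors multiply is that each stage operates on the output of the previous one. A minimal Python sketch, with hypothetical function names and assuming the stages are independent so their ratios compound cleanly:

```python
from math import prod

def combined_reduction(*factors):
    """Overall reduction factor when each stage works on the previous stage's output."""
    return prod(factors)

def capacity_after(raw_tb, *factors):
    """Capacity needed after applying every reduction stage in turn."""
    return raw_tb / prod(factors)

print(combined_reduction(5, 2))     # 10
print(capacity_after(100, 5, 2))    # 10.0
```

In practice the stages are not fully independent (for example, data that is already compressed de-duplicates poorly), so the product is best read as an upper-bound estimate.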

The Enterprise Strategy Group, which has spoken with many end users and has completed its own hands-on testing, has found that it is not uncommon for data reduction solutions to provide 10x, 20x, or greater reduction in backup data. At a 20x reduction, end users can back up 20TB of data on just 1TB of disk capacity. Those are powerful economics.

There are many ways to reduce data, but all approaches are a significant improvement over traditional disk-based backup approaches.

Data reduction will eventually become a requisite check-off item, or feature, of all “1DR” (disk-based recovery) data-protection products, including backup software, replication technologies, disk targets (VTL, NAS, and CAS), and CDP products.

Heidi Biggar is an analyst at the Enterprise Strategy Group (www.enterprisestrategygroup.com).

Snapshot: Data reduction technologies defined

Data de-duplication - Eliminates redundant files, blocks, or chunks. Stores only “unique” data.

Delta differencing - Copies only new, changed, or modified data blocks since the last full backup.

Incremental backups - Copies only changed or modified data since the previous backup.

Data compression - Reduces the number of bits required to represent the actual data.

Source: ESG

