Turning the tables on data deduplication

By Jack Fegreus, openBench Labs

With many new business strategies based on the notion of storage being available at a fraction of the traditional cost per gigabyte, system and storage administrators must be able to manage multiple terabytes of data and plan operations that scale efficiently into petabytes. Further complicating that burden, IT must also help satisfy a growing number of government mandates that require corporate officers to safeguard business data. Compliance with these complex and often confusing regulations puts IT governance and risk management in the spotlight and is forcing CIOs to rethink how data is stored, accessed, secured, and managed.

To counter the risks of storing data online, IT developed a deceptively simple retention plan for storing daily incremental and weekly full backups to tape. Dubbed grandfather-father-son (GFS), the plan mitigates risk quite well; however, the scheme consumes about 25TB of off-line tape storage for every TB of online disk storage. In terms of a traditional tape library, that's more than 16 LTO-4 tape cartridges for every TB of online data. Worse yet, when IT looks to remedy backup pain points via the adoption of a disk-to-disk (D2D) backup scheme, it must do so in a way that scales in both capacity and performance, while remaining transparent to current operations.
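The cartridge count can be checked with a quick calculation. The sketch below takes the 25TB-per-TB figure from the text and assumes LTO-4 cartridges at their rated capacities of 800GB native and 1.6TB at the nominal 2:1 compression; decimal units (1TB = 1,000GB) and the function name are illustrative assumptions.

```python
import math

LTO4_NATIVE_GB = 800        # rated native capacity of an LTO-4 cartridge
LTO4_COMPRESSED_GB = 1600   # rated capacity at the nominal 2:1 compression


def cartridges_needed(online_tb, tape_tb_per_online_tb=25,
                      cartridge_gb=LTO4_COMPRESSED_GB):
    """Whole cartridges required to hold the GFS tape footprint."""
    tape_gb = online_tb * tape_tb_per_online_tb * 1000  # decimal TB -> GB
    return math.ceil(tape_gb / cartridge_gb)


if __name__ == "__main__":
    print(cartridges_needed(1))                               # 2:1 compression
    print(cartridges_needed(1, cartridge_gb=LTO4_NATIVE_GB))  # native capacity
```

At the compressed rating the result is 16 cartridges per TB of online data, roughly in line with the figure quoted above; data that compresses worse than 2:1 pushes the count higher, up to 32 cartridges at native capacity.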

The Sepaton S2100-ES2 virtual tape library (VTL) provides IT with the means to garner the benefits of D2D backup through the introduction of a virtualized tape environment with minimal change to existing procedures. What's more, Sepaton's S2100-ES2 provides the option to minimize the amount of storage needed to maintain incremental and full backups via data deduplication. In this way, IT can control the risks associated with exponential data growth, maintain existing backup operations without disruption, and significantly lengthen online data retention times to address the need for faster and easier data recovery.

The key to the S2100-ES2 VTL is the incorporation of powerful Scalable Replication Engine (SRE) nodes that run the tape library virtualization and the DeltaStor data deduplication software. With SRE nodes physically independent from storage, IT is able to tune the scalability of the VTL for either data deduplication throughput or storage capacity independently of each other: A multi-node SRE configuration increases the resources available for data deduplication, while additional disk arrays increase storage capacity. With minimal configuration, our VTL, which used a single SRE node, provided immediate backup throughput on the order of 650MBps and initially reduced storage space by a factor of 25 to 1, which increased to an average of 80 to 1 with subsequent deduplication as the retention time of backup jobs increased.

The VM backup dilemma
To deal with issues of IT operating efficiency, CIOs peg a virtual operating environment (VOE), such as that created by VMware ESX Server, as the magic bullet for simplifying the complexity of IT infrastructure and reducing administrative costs. Nonetheless, there are also costs that arise from the adoption of a virtual infrastructure: While virtualization helps lower the cost of managing expanding IT resources, the risk of a single VOE server failing cascades down to multiple virtual machines (VMs) running multiple applications. As a result, virtual machines need very real backup.

When it comes to business continuity, a VM is no different from a physical machine. A VM is just another instance of an OS along with a set of applications and data that must be included in a regular schedule for backup rotation and retention—and that's the rub. Multiple levels of logical abstraction of storage resources obscure and complicate a small, but important, number of IT operations. Among those operations is the once-simple notion of backup.

In a VOE, IT administrators are immediately confronted by a daunting question: What should be backed up? Should administrators concentrate their efforts on the logically exposed VMs running important business applications, or should they focus on the VOE applications and files that create those logical VMs?

Each VM has two distinct sides to its persona. First, there is the IT-centric perception of a VM as a VOE server application, represented as a collection of virtual machine file system (VMFS) files. Second, there is the business-centric perception of a VM as a standard computer system, represented as a collection of native VM files (Windows NTFS). This dichotomy has the potential to generate significant operational overhead for IT.

To avoid the consequences that a critical resource with distinctly different personas can bring, IT needs to implement data protection processes that address both sides of a VM's persona. To resolve that issue quickly, the integration of NetBackup with VMware Consolidated Backup (VCB) allows a VM to be restored from a single backup job as either a VM application or a logical system in the VM's native file system.

For our tests, we employed Symantec NetBackup version 6.5.3, which adds patented technology to enhance integration with VCB. The new technology manifests itself in the form of a new backup job option, dubbed FlashBackup-Windows. When IT administrators use NetBackup 6.5.3 with VCB to perform a backup of a VM running a Windows OS, via the FlashBackup-Windows option, they are able to restore a backup job to conform to either VM persona: as a collection of VMFS files, which represents the VM as an ESX application, or as a collection of NTFS files, which represents the VM as a Windows system.

Problems have surfaced, however, when we have attempted to integrate third-party applications with the NetBackup FlashBackup-Windows scheme. We easily sidestepped this issue with the Sepaton VTL by using the standard MS-Windows-NT option with NetBackup to back up a disk repository containing the VM backup jobs created with the FlashBackup-Windows option.

We easily backed up and restored sets of VM backup jobs with NetBackup using the S2100-ES2. In this way, we staged the VTL as a vault for long-term retention of VM backup jobs and benefited from DeltaStor's data deduplication capabilities.

Virtual tape stagecraft
In essence, we created a very efficient backup staging scheme that was similar to the internal staging scheme offered by NetBackup. We began with a D2D backup of all of the VMs in our VOE. These remained in a designated directory on the NetBackup server for a short period. We then backed up that directory to the Sepaton VTL for more cost-effective long-term storage.

Real-world IT backup loads are dependent on a number of factors, including data retention requirements, the timing of backup events, and the nature of the data in terms of compressibility and redundancy. That's why we chose a test scenario with eight VMs running Windows Server 2003 along with SQL Server and IIS, to gain a better perspective on the S2100-ES2 VTL's ability to deal with all of the factors impacting data redundancy.

No additional software was required to integrate Sepaton's VTL into our test scenario. To begin working with the VTL, we only needed to use the embedded Web Management Console to set up two virtual libraries for testing, a StorageTek STK L180 and an ATL P3000, which were both configured with eight logical LTO-4 drives and 100 virtual tape cartridges. NetBackup immediately recognized the two virtual libraries, inventoried and tracked the cartridges, and assumed control of all management tasks related to backup processes.

Eight VMs comprised the backup load for our VTL. We provisioned each VM with a 12GB logical system disk and a 25GB logical work disk. From a data-content perspective, each VM system disk contained 4GB to 5GB of highly redundant data in the form of common OS and application files, while each work disk contained 5GB to 6GB of relatively unique structured and unstructured data. From an IT application perspective, however, each system disk was represented by a physical 12GB vmdk file, which contained about 7GB of "empty" free space. That made DeltaStor's ability to compare data at the object level as well as at the byte level all the more important for data deduplication.

We kept all files for our eight VMs in a single ESX datastore. Within that datastore, the ESX server created a folder for each of the eight VMs. The folder for oblVM2 contained a 12GB vmdk file for the complete system disk and a 51MB vmdk file for the map of the RDM work disk. These and all of the files related to the state and configuration of oblVM2 needed to be backed up in order to restore oblVM2 as a running VM. To satisfy normal business continuity practices, we also needed to be able to restore all of the files associated with oblVM2 and its two disks, WinSys and oblVM2_Work, in native NTFS format.

Our scheme of backing up backup job files may seem strange, but it proved to be extraordinarily effective. First, NetBackup is extremely well tuned for handling very large files, such as those created in a backup job. What's more, DeltaStor does not apply deduplication during backup, which means all backup jobs stream at full throttle. We consistently ran large backup jobs that could be split into eight parallel streams for the Sepaton VTL at 650MBps. That's several times faster than the typical throughput of an inline data deduplication system.

With each virtual library having eight logical drives, NetBackup tries to divide any backup job into as many I/O streams as possible to utilize all eight drives. With backup jobs logically subdivided into eight parallel tasks, backup throughput averaged 650MBps, which pegged total SAN I/O at 1,300MBps as data was simultaneously read from the NetBackup disk pool and written to the Sepaton VTL.
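The throughput figures lend themselves to a back-of-the-envelope check. The sketch below uses the 650MBps aggregate and eight streams quoted above, plus the 96MBps single-stream rate reported in our test summary; the 1TB job size, the decimal units, and the function names are assumptions chosen for illustration.

```python
def per_stream_mbps(aggregate_mbps, streams):
    """Average throughput carried by each parallel stream."""
    return aggregate_mbps / streams


def scaling_efficiency(aggregate_mbps, single_stream_mbps, streams):
    """Aggregate throughput as a fraction of perfect linear scaling."""
    return aggregate_mbps / (single_stream_mbps * streams)


def backup_minutes(data_gb, throughput_mbps):
    """Time to move data_gb at a sustained rate (1GB = 1,000MB assumed)."""
    return data_gb * 1000 / throughput_mbps / 60


if __name__ == "__main__":
    print(per_stream_mbps(650, 8))                   # load on each drive
    print(2 * 650)                                   # SAN I/O: read + write
    print(round(scaling_efficiency(650, 96, 8), 2))  # vs. 8 x 96MBps
    print(round(backup_minutes(1000, 650), 1))       # hypothetical 1TB job
```

Each virtual drive carries a little over 81MBps, the doubled figure matches the 1,300MBps of total SAN traffic, and eight streams deliver roughly 85% of what perfect linear scaling from a single stream would give.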

Pointing out data redundancy
The Web-based S2100 Console provides the means to configure the virtualization software running on the SRE node with libraries, tape drives, cartridges, and barcode schemes. In all cases there are numerous device options from which to choose. That range of options lets sites with device-specific needs transparently swap a physical device for a virtual one.

Sepaton applies classic thin provisioning for the virtual tape cartridges that IT administrators assign to a library. All cartridges are immediately visible in the storage pool; however, the file system does not reserve space for a particular cartridge until actual data is written to that cartridge by a backup application. More importantly, an IT administrator's choice of capacity for virtual cartridges will impact data deduplication.
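The behavior described above can be modeled in a few lines. The following is a toy sketch of thin provisioning in general, not Sepaton's implementation; the class name, the barcode format, and the capacities are all hypothetical.

```python
class ThinCartridgePool:
    """Toy model of thinly provisioned virtual cartridges."""

    def __init__(self, physical_gb):
        self.physical_gb = physical_gb
        self.nominal_gb = {}   # barcode -> advertised capacity
        self.used_gb = {}      # barcode -> space actually consumed

    def add_cartridge(self, barcode, nominal_gb):
        # Creating a cartridge makes it visible but reserves no disk space.
        self.nominal_gb[barcode] = nominal_gb
        self.used_gb[barcode] = 0

    def write(self, barcode, gb):
        # Space is drawn from the shared pool only when data is written.
        if self.used_gb[barcode] + gb > self.nominal_gb[barcode]:
            raise ValueError("cartridge full")
        if self.allocated_gb() + gb > self.physical_gb:
            raise RuntimeError("storage pool exhausted")
        self.used_gb[barcode] += gb

    def allocated_gb(self):
        return sum(self.used_gb.values())

    def advertised_gb(self):
        return sum(self.nominal_gb.values())


if __name__ == "__main__":
    pool = ThinCartridgePool(physical_gb=10_000)
    for n in range(100):
        pool.add_cartridge(f"VT{n:04d}", nominal_gb=800)
    print(pool.advertised_gb(), pool.allocated_gb())  # before any backup
    pool.write("VT0000", 37)
    print(pool.allocated_gb())                        # after one small job
```

In this model a library of 100 cartridges advertises far more capacity than the pool physically holds, yet consumes nothing until a backup application writes to a cartridge.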

DeltaStor is backup-job-centric with respect to data deduplication, and cartridge-centric with respect to the reclamation of storage space. That makes it a good practice to match cartridge size with the average size of a backup job to minimize cartridge spanning and ensure efficient space reclamation. With this in mind, we configured and provisioned two virtual libraries on the Sepaton VTL: one for standard backups of files on physical systems and one for storing backups of repositories containing large images of virtual systems.
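A rough way to reason about cartridge sizing is to count how many cartridges a job spans and how many jobs end up sharing one cartridge. The sketch below is only an illustration of that sizing arithmetic; the 300GB job size, the candidate cartridge sizes, and the function names are hypothetical, and the code makes no claim about how DeltaStor itself reclaims space.

```python
import math


def cartridges_spanned(job_gb, cartridge_gb):
    """Number of cartridges a single backup job is spread across."""
    return math.ceil(job_gb / cartridge_gb)


def jobs_sharing_cartridge(job_gb, cartridge_gb):
    """Whole jobs that fit on one cartridge (0 means every job spans)."""
    return int(cartridge_gb // job_gb)


if __name__ == "__main__":
    job_gb = 300  # hypothetical average backup job
    for cartridge_gb in (100, 400, 800):
        print(cartridge_gb,
              cartridges_spanned(job_gb, cartridge_gb),
              jobs_sharing_cartridge(job_gb, cartridge_gb))
```

A cartridge much smaller than the job forces spanning, while one much larger mixes several jobs on the same cartridge; a size close to the average job avoids both.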

All data deduplication schemes implement a set of algorithms to replace redundant data with pointers. In addition to the algorithms used to identify redundancy, a fundamental construct of any deduplication scheme is how the pointers work. One technique, dubbed backwards referencing or differencing, replaces redundant data in the current backup with a pointer to a previous instance of that data. The alternative, which is used by DeltaStor and dubbed forward referencing or differencing, replaces redundant data in previously saved backup jobs with pointers to the current backup.
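The contrast between the two pointer schemes is easier to see in a toy model. The sketch below treats each backup job as a list of chunks and matches chunks by simple equality, which is a deliberate simplification and not a description of DeltaStor's comparison algorithms; every name in it, including the job labels, is hypothetical.

```python
JOBS = {
    "mon": [b"os", b"app", b"data-mon"],
    "tue": [b"os", b"app", b"data-tue"],
    "wed": [b"os", b"app", b"data-wed"],
}


def resolve(store, entry):
    """Follow pointers until literal chunk data is reached."""
    while not isinstance(entry, bytes):
        _, job_id, index = entry
        entry = store[job_id][index]
    return entry


def restore(store, job_id):
    """Reassemble a job; every pointer followed costs an extra lookup."""
    return [resolve(store, e) for e in store[job_id]]


def literal_chunks(store, job_id):
    """Count chunks a job still holds as real data rather than pointers."""
    return sum(isinstance(e, bytes) for e in store[job_id])


def store_backward(store, job_id, chunks):
    """Backward referencing: the NEW job points back at older data."""
    seen = {}
    for jid, entries in store.items():
        for i, e in enumerate(entries):
            if isinstance(e, bytes):
                seen.setdefault(e, (jid, i))
    store[job_id] = [("ref", *seen[c]) if c in seen else c for c in chunks]


def store_forward(store, job_id, chunks):
    """Forward referencing: OLDER jobs are rewritten to point at the new job."""
    store[job_id] = list(chunks)  # the newest job is stored untouched
    newest = {}
    for i, c in enumerate(chunks):
        newest.setdefault(c, i)
    for jid, entries in store.items():
        if jid == job_id:
            continue
        for k, e in enumerate(entries):
            data = resolve(store, e)
            if data in newest:
                entries[k] = ("ref", job_id, newest[data])


def run(store_fn):
    """Store the three sample jobs in order using the given scheme."""
    store = {}
    for job_id, chunks in JOBS.items():
        store_fn(store, job_id, chunks)
    return store


if __name__ == "__main__":
    for name, fn in (("backward", store_backward), ("forward", store_forward)):
        store = run(fn)
        print(name, {job: literal_chunks(store, job) for job in store})
```

Both schemes keep the same amount of unique data in this model. The difference is where the pointers land: with backward referencing the most recent job is mostly pointers, while with forward referencing the most recent job remains whole and the older jobs carry the pointers.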

For IT, the differences in these two pointer schemes are profound. Storage systems used to maintain backup jobs utilize low-cost disks tuned for sequential rather than random access, as backup software traditionally relies on streaming large blocks of data to optimize throughput. Substituting data blocks with pointers adds overhead to the process and changes the characteristics of I/O operations: a restore must follow pointers to scattered blocks instead of streaming sequentially. As a result, data deduplication can lengthen restores and negatively impact IT's recovery time objective (RTO).

Rather than deduplicate data while a backup is in progress, which slows performance, DeltaStor leverages the power of Sepaton's ContentAware technology to aggressively recapture storage space after a backup job has completed. What's more, by substituting data with pointers in older backup jobs, which are typically less likely to be restored, DeltaStor can maximize data deduplication while minimizing the impact on the restore process.

Following a successful backup job, DeltaStor compares previous backup jobs—not a hash code representation—with new reference backup jobs in order to substitute duplicate data in older backup jobs with pointers to data in the most recent reference jobs. In that way space from multiple backup jobs can be reclaimed concurrently each time a new backup job is stored.

Without needing to minimize overhead processing during a backup, DeltaStor can exploit a deduplication methodology based on the content of backup jobs free from compromises. DeltaStor probes deeply into backup jobs to discover duplicate data using two levels of sophisticated algorithms that are first object-level and then byte-level specific. DeltaStor automatically assigns discovery algorithms based on the backup application and job type. 
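The two-level idea, matching at the object level first and then comparing at the byte level, can be illustrated with a small sketch. The article does not describe DeltaStor's internal algorithms, so everything below is an assumption made for illustration: objects are paired by name, and the byte-level pass uses Python's standard difflib.SequenceMatcher, which is practical only for small inputs.

```python
from difflib import SequenceMatcher


def duplicate_bytes(old, new):
    """Byte-level pass: bytes of `old` that also appear, in order, in `new`."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def compare_jobs(old_job, new_job):
    """Object-level pass pairs objects by name; byte-level pass measures overlap."""
    report = {}
    for name, old_data in old_job.items():
        if name in new_job:  # object-level match (assumed: same name)
            report[name] = duplicate_bytes(old_data, new_job[name])
    return report


if __name__ == "__main__":
    older = {"notes.txt": b"hello world", "old.log": b"0123456789"}
    newer = {"notes.txt": b"hello brave world", "new.log": b"abc"}
    print(compare_jobs(older, newer))
```

In the sample data only notes.txt is matched at the object level, and the byte-level pass then finds that all of the older file's content survives in the newer version even though bytes were inserted in the middle.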

As backup jobs increased over time, DeltaStor continued to improve storage space utilization of all retained jobs. Deduplication ratios increased as DeltaStor continued to compare existing backup jobs to the most recent reference jobs. Deduplication ratios averaged 25 to 1 as DeltaStor initially reclaimed space from backup jobs; however, as backup jobs were retained over longer periods of time, DeltaStor reported an average deduplication ratio of 80 to 1.
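It helps to translate these ratios into disk space. The sketch below uses only the 25 to 1 and 80 to 1 ratios from our tests; the 25TB of retained backup data is a hypothetical volume chosen for illustration, and the function names are ours.

```python
def space_saved(ratio):
    """Fraction of logical backup data that no longer occupies disk."""
    return 1 - 1 / ratio


def physical_tb(logical_tb, ratio):
    """Disk space left in use after deduplication at the given ratio."""
    return logical_tb / ratio


if __name__ == "__main__":
    for ratio in (25, 80):
        print(ratio,
              round(space_saved(ratio), 4),
              physical_tb(25, ratio))  # hypothetical 25TB of retained backups
```

A ratio of 25 to 1 already removes 96% of the stored data; moving to 80 to 1 raises that to 98.75%, which for 25TB of retained backups means shrinking the footprint from 1TB to about 0.31TB.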

Even when duplicate data is initially discovered at the object level, DeltaStor applies byte-level comparisons on the matches to maximize data deduplication results. This makes DeltaStor highly effective with all backup jobs, including backups of email stores, databases, and images of VMs.

DeltaStor further reduces storage requirements over time by successively comparing backup jobs to new reference jobs as they are created. As the number of comparisons made for a particular job grows, so too does the likelihood that duplicate data will be discovered. As a result, the DeltaStor metric for the average deduplication ratio for backup jobs that have been retained and compared over time—80 to 1 in our tests—needs to be evaluated as a snapshot in time. By allowing backups to stream untouched and using the most recent backup job as an untouched reference point, DeltaStor ensures optimal utilization of resources and provides outstanding performance with near linear scalability in backup and restore operations.

Jack Fegreus is the CTO of openBench Labs.


UNDER EXAMINATION: 8Gbps VTL with data deduplication



Dell PowerEdge 1900 servers
-- Quad-core Xeon CPU
-- 4GB RAM
-- Windows Server 2003
-- VMware Consolidated Backup (VCB)
-- Veritas NetBackup v6.5.3

(2) HP ProLiant Servers
-- HP DL580
-- Quad-processor Xeon CPU
-- 8GB RAM
-- VMware ESX Server

(8) VM application servers
-- Windows Server 2003
-- SQL Server
-- IIS

HP ProLiant server
-- HP DL360
-- Windows Server 2003
-- VMware vCenter Server

Xiotech Emprise 5000 System
-- (2) 4Gbps Fibre Channel ports
-- (2) DataPacs


-- DeltaStor data deduplication software identifies and replaces duplicate backup data with pointers to a single baseline copy to improve storage utilization.
-- DeltaStor replaces duplicate data in previously stored backups with pointers forward to the most recent backup, which allows active backups to stream with no additional overhead and recent backups to be restored with little or no reassembly.
-- A single-stream backup job that ran at 96MBps increased to 650MBps when processed as eight simultaneous streams.
-- Throughput for a single restore job averaged 110MBps.
-- DeltaStor works in the background to provide continuous improvement in storage optimization.
-- As storage space was reclaimed, backup jobs showed initial data deduplication ratios of around 25 to 1, which rose to 80 to 1 as deduplication continued over longer retention periods.
-- The Sepaton VTL can add processor nodes and disk capacity independently while being managed as a single unit.


This article was originally published on August 20, 2009