Lab Review: Using data de-dupe for VM backup

By comparing streaming data during backup, Overland Storage's REO 9500D VTL delivers a 26-to-1 reduction in the storage needed to back up eight virtual servers.

By Jack Fegreus

For today's IT decision-makers, storage provisioning is at the core of a perfect storm hitting IT operations. More applications requiring more primary data are coming online, just as IT is changing its backup media of choice from tape to disk. This considerably elevates the importance of any technology, such as data de-duplication, that can help IT contain the consumption of disk resources.

For IT, greater efficiency in operations begins with optimal resource utilization. More processors, greater storage volume, and an expanding portfolio of applications equate to greater complexity for a department already burdened with the highest-rising corporate labor costs. That's why the issues of consolidation and virtualization are now just as important to IT as the traditional concerns over reliability, availability, and serviceability (RAS).

To deal with issues of operating efficiency, CIOs now peg a virtual operating environment (VOE), such as that created by VMware ESX Server, Microsoft Virtual Server, or Xen, as the magic bullet for simplifying the complexity of IT infrastructure and reducing administrative costs. Working with virtual resources, system administrators can focus their attention on a limited number of abstract device pools that can be centrally managed, rather than on a plethora of complex proprietary devices that must be individually managed. The mobility of virtual machines (VMs) within a VOE also enhances IT's ability to balance workloads and maximize the utilization of resources. What's more, new VMs can be rapidly configured and deployed by simply cloning stored templates.

For a small to medium-sized enterprise (SME), the ability to simplify resource management makes a VOE the ideal platform on which to scale out applications and garner optimal RAS levels. By leveraging technology advances in multi-core CPUs and high-speed networking, SME sites can easily support a large number of virtual machines on a small number of physical servers. In this way, IT can realize all of the RAS capabilities of a large data center, while avoiding all of the costs associated with racks of 1U servers.

Nonetheless, there are costs associated with the benefits derived from the adoption of a virtual infrastructure: Virtual machines need very real backup. From the perspective of business continuity, a VM is no different from a physical machine: It is just another instance of an operating system along with a set of applications and data that must be included in a regular schedule for backup rotation and retention -- and that's the rub. Regular backup rotations, which typically copy files on daily and weekly schedules, consume 25x the amount of storage that is being protected via the retention of multiple time-ordered copies of data.

Provisioning for that 25:1 expansion in backup media has the potential to be a serious drain on the savings promised by a VOE. As a result, the REO 9500D's core data de-duplication technology, which can reduce backup storage requirements by a factor of 30 or more, can be an essential element in delivering VTL scalability and optimal ROI. For any site embarking on a D2D backup initiative, provisioning archival storage to house backup sets will be a pivotal issue in order for IT to continue to run backup operations smoothly and realize a fast ROI. At the heart of this issue is the question of how to scale a backup repository based on a site-specific backup load. From 1TB of primary disk storage, a traditional grandfather-father-son (GFS) retention plan for backup sets will typically consume 25TB of secondary storage: In terms of a traditional tape library, that's the equivalent of about 150 LTO-2 cartridges.
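To make the 25:1 expansion concrete, the following minimal sketch adds up the copies that one possible GFS plan would retain for 1TB of primary data. The retention counts, the daily change rate, and the inclusion of yearly copies are all illustrative assumptions and are not taken from the test configuration; they were chosen only to show how a plan of this shape lands near 25x.

```python
# Hypothetical GFS retention plan; every count and rate below is an assumption.
PRIMARY_TB = 1.0            # primary data being protected
DAILY_CHANGE_RATE = 0.10    # assumed fraction of the data that changes per day

# Size of each retained tier, expressed in multiples of the primary data.
retained_copies = {
    "daily incrementals (sons)": 6 * DAILY_CHANGE_RATE,  # six partial copies
    "weekly fulls (fathers)": 5,
    "monthly fulls (grandfathers)": 12,
    "yearly fulls (assumed extension)": 7,
}


def backup_footprint_tb(primary_tb, copies):
    """Secondary storage consumed when every retained copy is stored whole."""
    return primary_tb * sum(copies.values())


total_tb = backup_footprint_tb(PRIMARY_TB, retained_copies)
print(f"{total_tb:.1f}TB of backup media per {PRIMARY_TB:.0f}TB of primary data")
```

Because each full copy is stored whole, the footprint grows linearly with the number of retained copies, which is exactly the redundancy that de-duplication targets.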

Real-world IT backup loads are dependent on a number of factors, such as data retention requirements, the timing of backup events, and the nature of the data in terms of compressibility and redundancy. A VOE is the perfect microcosm to examine all of those factors impacting data redundancy. That's why we chose to gain a better perspective on how a backup load can impact the scalability of a VTL and its D2D repository by setting up a test scenario based on a VOE with eight VMs running Windows Server 2003 along with SQL Server and IIS. Each VM was also provided with a unique 10GB collection of data files. As a result, each VM represented a backup target of about 18 to 20GB of uncompressed data.

On a second Dell PowerEdge 1900, we ran Diligent Technologies' ProtecTIER Manager software to configure the REO 9500D VTL; VMware VirtualCenter to manage the virtual infrastructure; VMware Consolidated Backup (VCB) to create and share snapshots of VMs during a backup; and Symantec Backup Exec v12 to manage the end-to-end backup process. All of these applications were hosted on a 64-bit version of Windows Server 2003. Using the ProtecTIER Manager GUI, we configured the REO 9500D VTL as two virtual ATL P3000 tape libraries -- oblVTL-1 and oblVTL-2 -- and provisioned each library with four virtual DLT7000 drives and 20 tape cartridges.

Within that test environment, our primary goal was to assess the impact of the ProtecTIER data de-duplication software, dubbed HyperFactor. In particular, we measured both write throughput and storage utilization when using the REO 9500D repository to back up and store savesets of VM images using Symantec Backup Exec and VMware Consolidated Backup.

To examine the effect of data de-duplication on throughput, we first ran the oblTape benchmark on the oblVTL-1 virtual library with ProtecTIER HyperFactor disabled. For these tests, we varied the block size used to write data to tape while streaming random data calibrated to be compressible at either a 2x or 3x rate. As the tape block size increased beyond 16KB, so did the effects of compressibility. At a block size of 32KB for writes -- the default size used by Backup Exec -- the average VM backup throughput was 50MBps, which was in line with the performance bounds that our oblTape benchmark had projected.

To back up the VMs in the test VOE, openBench Labs used VCB and Symantec Backup Exec. The primary VCB files are part of the ESX Server package so no additional installation was required on the ESX Server host. Nonetheless, a small VCB package, which includes a VLUN driver to mount VM snapshots and an integration module for the backup software being used, must be installed on a "proxy server." The proxy server is a Windows Server host that has SAN fabric access and networking connectivity with VirtualCenter, which is used to manage and report on the state of all the VMs and the backup server. We configured our Backup Exec server as the proxy server.

The proxy server initiates a VMFS snapshot using the VMsnap command on the ESX Server host to create a point-in-time copy of a VM's disk files. In this process, all file-system buffers in the VM's OS are flushed to commit writes, and new writes to the VM's file system are suspended. Agents can also be invoked to quiesce specific applications, such as Microsoft Exchange Server or SQL Server, running on the VM. The major advantage to using this VMFS snapshot technique is that the VM remains online and continues to work for the few seconds that it takes to complete the snapshot.

Once the snapshot is created, the VM resumes writes, but the data now goes to a special file dubbed a delta disk file. The VM's .vmdk file now represents the state the VM was in at the time the snapshot was created. ESX Server then creates a snap ID and a block list of the VM's .vmdk file and sends them to the VCB proxy server. The proxy server uses the snap ID to identify the VM snapshot uniquely for backup processing, and the VLUN driver uses the block list to mount a read-only drive within the Windows OS of the proxy server. This further minimizes the disruptiveness of the backup process, as data is accessed via the storage network rather than via the production network.

For image-based backups, full VM images are presented as files to the proxy server. The backup software agent then moves the data from this read-only drive or image file to secondary storage. In our VOE test scenario, that secondary storage was the Overland REO 9500D VTL repository. Finally, once the backup process has completed moving and checking the data, the VCB integration module unmounts the drive and ESX Server removes the snapshot and consolidates the delta disk data back into the .vmdk file. More importantly for IT operations, the backup window, as seen from the viewpoint of the VM, only took the precious few seconds needed by ESX Server to take and then remove the VMFS snapshot.
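The snapshot sequence described above (freeze the .vmdk, redirect writes to a delta disk, back up the frozen image, then consolidate) can be illustrated with a toy model. This is only a sketch of the idea; the class and method names are hypothetical and do not correspond to any VMware or VCB interface.

```python
class ToyVirtualDisk:
    """Toy model of a snapshot with a delta disk; not VMware's API."""

    def __init__(self, blocks):
        self.base = dict(blocks)   # stands in for the .vmdk file
        self.delta = None          # stands in for the delta disk file

    def take_snapshot(self):
        # Freeze the base; from here on, writes are redirected to the delta.
        self.delta = {}

    def write(self, block_no, data):
        target = self.base if self.delta is None else self.delta
        target[block_no] = data

    def read(self, block_no):
        # The running VM sees delta blocks first, then the frozen base.
        if self.delta is not None and block_no in self.delta:
            return self.delta[block_no]
        return self.base[block_no]

    def snapshot_view(self):
        # What the backup proxy reads: the base as it stood at snapshot time.
        return dict(self.base)

    def remove_snapshot(self):
        # Consolidate: fold the delta writes back into the base.
        if self.delta is None:
            return
        self.base.update(self.delta)
        self.delta = None


disk = ToyVirtualDisk({0: "boot", 1: "data-v1"})
disk.take_snapshot()
disk.write(1, "data-v2")              # the VM keeps working during the backup
backup_image = disk.snapshot_view()   # the proxy copies the point-in-time state
disk.remove_snapshot()
print(backup_image[1], disk.read(1))
```

The backup image holds the pre-snapshot contents while the VM ends up with its newer write, which is why the VM sees a backup window of only the time needed to take and remove the snapshot.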

While REO 9500D backup throughput performance is consistent with that of tape libraries based on LTO-2 or LTO-3 drives -- depending on I/O block size -- the REO's D2D backup scheme derives decisive advantages from the practice of keeping all backup savesets in a centralized repository. Immediate benefits center on the simplicity and speed of data restoration. In addition, there are the obvious labor savings associated with relieving IT of the onerous task of having to track hundreds of tape cartridges. Cutting labor overhead costs through the elimination of tape handling, however, is only the tip of the savings iceberg.

For the REO 9500D VTL appliance, the online saveset repository is a key factor in substantially reducing the volume of physical storage needed to hold all of the savesets. As the REO 9500D VTL writes a new saveset during a backup, the ProtecTIER HyperFactor software operates on the incoming data stream of that saveset using cryptography-based algorithms to perform a byte-level differential comparison with patterns within all existing savesets without regard for data file structure. This structure-agnostic approach makes it possible for the REO 9500D to work with any backup package to de-duplicate data and optimize the use of storage resources.
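The structure-agnostic idea can be illustrated with a much simpler technique than the one HyperFactor is described as using. The sketch below is not HyperFactor: it substitutes fixed-size chunking with SHA-256 digests for the byte-level differential comparison, and the chunk size and class name are assumptions made for illustration. What it shares with the approach above is that it treats each saveset as an opaque byte stream and stores data it has already seen only once.

```python
import hashlib
import os

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


class ToyDedupStore:
    """Stores each distinct chunk once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.chunks = {}        # digest -> chunk bytes (the "repository")
        self.nominal_bytes = 0  # bytes the backup application sent

    def write_stream(self, data):
        """Ingest one saveset and return its recipe (a list of digests)."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).digest()
            self.chunks.setdefault(digest, chunk)  # store only if unseen
            recipe.append(digest)
        self.nominal_bytes += len(data)
        return recipe

    def read_stream(self, recipe):
        """Rebuild a saveset from its recipe."""
        return b"".join(self.chunks[d] for d in recipe)

    def physical_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = ToyDedupStore()
first = os.urandom(64 * CHUNK_SIZE)                             # first saveset
second = first[:60 * CHUNK_SIZE] + os.urandom(4 * CHUNK_SIZE)   # mostly unchanged
recipes = [store.write_stream(first), store.write_stream(second)]
ratio = store.nominal_bytes / store.physical_bytes()
print(f"nominal={store.nominal_bytes} physical={store.physical_bytes()} ratio={ratio:.2f}:1")
```

A fixed-chunk scheme like this one loses its matches as soon as data shifts by a few bytes, which is one reason a byte-level differential comparison is the more robust design for backup streams.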

What's more, HyperFactor finds data matches with no I/O to the disk, which helps explain why we measured no impact on the throughput -- 50MBps -- when backing up VMs with HyperFactor enabled. To avoid introducing I/O overhead, HyperFactor uses a highly efficient RAM-based index, which can map a petabyte of physical disk storage into 4GB of RAM, to rapidly identify data matches. As a result, HyperFactor radically changes the economics and usage profile of disk-based data protection.
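A quick calculation shows how sparse an index must be to map a petabyte of disk into 4GB of RAM. The sketch below assumes binary units for both figures; with decimal units the result is about 250KB rather than 256KB.

```python
INDEX_BYTES = 4 * 2**30      # 4GB of RAM (binary units assumed)
REPOSITORY_BYTES = 2**50     # 1PB of physical disk (binary units assumed)

# How many bytes of repository each byte of index has to represent.
bytes_per_index_byte = REPOSITORY_BYTES // INDEX_BYTES
print(f"1 byte of index covers {bytes_per_index_byte // 1024}KB of repository")
```

At that density the index cannot hold a full digest for every small block of data, so it has to summarize large regions compactly.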

To gauge the effectiveness of the HyperFactor data de-duplication feature, openBench Labs first performed a backup of the eight independent VMs. In backing up images of our eight VMs, Backup Exec transferred 75.6GB of compressed data to secondary storage, which represents a slightly better than 2:1 compression ratio. Nonetheless, the REO 9500D did not put 75.6GB of data into its repository. In storing the first Backup Exec savesets of our eight VMs, the REO 9500D used just 17.3GB of secondary storage, which represents a 4.4:1 HyperFactor data de-duplication savings.

After the first backup, openBench Labs continued to run the eight VMs for another week. Over this period, we also scheduled each VM to check Microsoft Update for the OS and applications nightly. Over the period of a week, there were several updates for Windows Server 2003 and SQL Server. After a week, we backed up a new set of VM images. From the perspective of Backup Exec, a total of 144.9GB of compressed data had now been saved on secondary storage. From the perspective of the REO 9500D, however, the 16 savesets used just 19.9GB of storage in the repository. That brought the overall HyperFactor ratio up to 7.3:1. More importantly, for the second set of VM image savesets, the HyperFactor ratio was actually 26.8:1. As a result, over time, the HyperFactor ratio would converge to a better than 25:1 ratio.
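The ratios above follow directly from the reported totals, as the sketch below shows. Because the published figures are rounded to one decimal place, the ratio computed here for the second set of savesets comes out near 26.7:1 rather than the 26.8:1 quoted. The projection function rests on an assumption that is not part of the test: that every later weekly backup sends and stores the same amounts as the second one did.

```python
# Figures reported in the article, in GB.
NOMINAL_FIRST, PHYSICAL_FIRST = 75.6, 17.3   # after the first backup
NOMINAL_BOTH, PHYSICAL_BOTH = 144.9, 19.9    # after the second backup

first_ratio = NOMINAL_FIRST / PHYSICAL_FIRST
overall_ratio = NOMINAL_BOTH / PHYSICAL_BOTH
nominal_delta = NOMINAL_BOTH - NOMINAL_FIRST      # sent in the second backup
physical_delta = PHYSICAL_BOTH - PHYSICAL_FIRST   # new repository space used
second_ratio = nominal_delta / physical_delta


def projected_ratio(extra_backups):
    """Cumulative ratio if each later backup resembled the second (assumption)."""
    nominal = NOMINAL_FIRST + extra_backups * nominal_delta
    physical = PHYSICAL_FIRST + extra_backups * physical_delta
    return nominal / physical


print(f"first {first_ratio:.1f}:1, overall {overall_ratio:.1f}:1, "
      f"second set {second_ratio:.1f}:1")
for n in (1, 4, 12, 52):
    print(f"after {n} weekly backups beyond the first: {projected_ratio(n):.1f}:1")
```

Under that assumption the cumulative ratio climbs toward the per-backup ratio, but only gradually, because the 17.3GB stored for the first backup keeps weighing on the average.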

By comparing data as it streams during a backup, HyperFactor has no need to be aware of saveset structure. As a result, it works with any backup package. Furthermore, HyperFactor maps and indexes existing savesets in RAM, which eliminates the need to incur I/O on the REO VTL to check for duplicate data while streaming a saveset. In tests that generated image backups of VMware virtual machines, HyperFactor delivered a 26:1 reduction in the amount of storage needed to back up eight VMs that had undergone OS and application upgrades along with changes in working data.

Data De-duplication in a VOE

Overland Storage REO 9500D VTL
--REO 9500D array features dual controllers and hot-swap drives in a RAID-5 storage pool that can be virtualized as up to 64 DLT7000 tape drives and assigned to as many as 12 virtual ATL P3000 libraries.
--Using ProtecTIER software from Diligent Technologies, the REO 9500D finds common data using a highly efficient inline data de-duplication process that does not affect backup throughput or integrity.

Two Dell PowerEdge 1900 servers

   --Windows 2003 Server SP2
        --PT Manager GUI
        --Symantec Backup Exec 12
        --VMware VirtualCenter 2.2
        --VMware Consolidated Backup
        --Benchmark: oblTape
   --VMware ESX Server 3.5
       --Eight VMs
             --Windows Server 2003
             --SQL Server

QLogic SANbox 9200 Fibre Channel switch

IBM DS4100 disk array

--Using Symantec Backup Exec and VMware Consolidated Backup (VCB), backup typically streamed at about 50MBps.
--For independent virtual machines (VMs), data de-duplication reduced the volume of storage needed for multiple independent images by a factor of 4:1 and by a factor of 26:1 for multiple images of a VM.

This article was originally published on August 01, 2008