Lab Review: Data deduplication for real tape

Posted on January 22, 2010


To provide the infrastructure and storage resource virtualization needed to integrate D2T and D2D technologies, Spectra Logic's nTier Deduplication appliances extend the data protection operations of VTLs to physical tape libraries, enabling backup and recovery that meets data retention or off-site storage requirements.

By Jack Fegreus, openBench Labs

January 22, 2010 -- The growing complexity of IT environments and the new focus on defining and meeting service level agreements (SLAs) for the support of critical business processing have accelerated interest in disk-to-disk (D2D) backup. In particular, attention has been focused on the ease and simplification of recovery processes. That interest has been tempered, however, by the equally pressing need to optimize the utilization of storage resources.

Spectra Logic's nTier appliances provide scalable data protection processes for physical and virtual clients that address many IT needs for optimized backup processes. Using an nTier appliance, storage administrators are able to leverage post-process data deduplication in cost containment strategies that can be applied in complex virtual server environments.

Through storage virtualization software, an nTier Deduplication appliance has the ability to be both a target and an initiator on a SAN, which allows a VTL on the appliance to assume the cartridge inventory of a physical library by synchronizing both the barcodes and the tape headers of cartridges in the physical library. In this way, the nTier appliance is able to access a physical tape library and expose a surrogate VTL to all data protection applications running at a site. And IT administrators are able to upgrade an existing physical tape library virtually just by adjusting the configuration of the surrogate VTL.

In so doing, IT garners all the advantages of a sophisticated D2D backup scheme, while at the same time VTL and physical library synchronization maintains all of the traditional data security that off-site tape storage and hardware-based tape encryption provide. This is particularly important as IT is increasingly under pressure to meet regulatory compliance mandates that define rigorous levels of data security. With its preconfigured RAID-6 Fibre Channel disk arrays, simplified thin provisioning of VTL cartridge storage, and automated configuration of data deduplication requirements, the Spectra Logic nTier Deduplication appliance simplifies IT management tasks and leaves only the task of resolving site-specific policy issues. By providing a consolidated hierarchy of resources, an nTier Deduplication appliance simplifies the use of automated policy-based management systems, helps lower labor costs for backup and disaster recovery operations, and eases the burden on IT to meet SLAs for increased backup reliability and enhanced data security.

Virtual tapes and machines

To gain perspective on the ability of the Spectra nTier Deduplication appliance to simplify data protection processes, openBench Labs set up a data protection test scenario for a VMware vSphere 4 environment using two Dell PowerEdge 1900 servers. On one server we installed VMware vSphere, VMware Consolidated Backup (VCB), and Symantec's NetBackup (NBU) 6.5 software. On the second server, we used VMware ESX 4.0 to host eight VMs running Windows Server 2003. Each VM was configured as an application server running SQL Server and IIS.

To provide our virtual operating environment with a hierarchical storage infrastructure for backup, we set up a Spectra nTier v80 appliance and a T50e physical tape library provisioned with two LTO-4 drives with 4Gbps Fibre Channel connections. On the nTier v80, we configured three VTLs to support three distinct test scenarios: a virtual Spectra T50e to test automatic data caching, a virtual StorageTek L180 to test deduplication of backup images of typical Windows-based files, and a virtual Spectra T200 to test deduplication of VM backup images. The nTier appliance supports the robotic emulation of a wide range of physical tape libraries and tape drive options.

Using the Spectra BlueScale Web Interface, we monitored the operational status, configured the IP address, and prepared disks for maintenance on the nTier v80. In particular, 3.4TB of RAID-6 storage were allocated to the Single Instance Repository and 1.9TB to the VTL pool.

To support the storage needs of VTLs created on an nTier v80, Spectra Logic provisions the appliance with 1TB drives in groups of eight, which are configured as RAID-6 arrays. This scheme gives each array a usable capacity of 6TB. Total array capacity must then be allotted to a Single Instance Repository (SIR), which stores unique instances of deduplicated data, and a VTL cartridge pool, which stores raw data that has not been deduplicated and pointers to SIR data following a deduplication process. By default, each nTier appliance is preconfigured with an SIR pool and a VTL storage pool. For our Spectra nTier v80, storage capacity was allocated roughly in a 2-to-1 ratio for the SIR pool compared to the VTL pool.
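
The capacity figures above follow from simple arithmetic. The sketch below is only an illustration: the function name is ours, and it assumes the standard RAID-6 rule that two drives' worth of space in each group is consumed by parity. The pool sizes are the ones reported for our test system.

```python
def raid6_usable_tb(drive_count: int, drive_tb: float) -> float:
    """Usable capacity of one RAID-6 group: two drives' worth of space holds parity."""
    if drive_count < 4:
        raise ValueError("RAID-6 needs at least four drives")
    return (drive_count - 2) * drive_tb


usable_tb = raid6_usable_tb(8, 1.0)   # eight 1TB drives per group
sir_tb, vtl_tb = 3.4, 1.9             # SIR and VTL pool sizes on the test system

print(usable_tb)                      # 6.0
print(round(sir_tb / vtl_tb, 2))      # 1.79, i.e. roughly 2-to-1
```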

The need to provide for a pool of raw data as well as an SIR pool is characteristic of data deduplication schemes that employ post processing for deduplication. In particular, there are a number of factors that affect the amount of storage that the VTL pool will require. These factors include the number and capacity of virtual cartridges, whether thin provisioning has been applied to the cartridges, and even whether some VTLs have been configured without data deduplication, which is an option with post processing.
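
To make the relationship between the two pools concrete, here is a toy model of post-process deduplication. It is not the appliance's implementation: fixed 4KB chunks, SHA-256 digests, and all names are assumptions chosen for illustration. Raw cartridge data is processed after the backup, unique chunks land in a dictionary standing in for the SIR, and the cartridge keeps only a list of pointers.

```python
import hashlib

CHUNK = 4096  # illustrative fixed chunk size (an assumption, not the product's)


def deduplicate(cartridge: bytes, sir: dict) -> list:
    """Replace raw cartridge data with pointers (digests) into the SIR."""
    pointers = []
    for offset in range(0, len(cartridge), CHUNK):
        chunk = cartridge[offset:offset + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        sir.setdefault(digest, chunk)   # keep only the first instance of a chunk
        pointers.append(digest)
    return pointers


def restore(pointers: list, sir: dict) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(sir[p] for p in pointers)


sir = {}
raw = b"A" * CHUNK * 9 + b"B" * CHUNK   # ten chunks, only two of them unique
pointers = deduplicate(raw, sir)
print(len(pointers), len(sir))           # 10 2
print(restore(pointers, sir) == raw)     # True
```

Because the SIR is shared, a second cartridge holding the same chunks would add pointers but no new unique data.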

Moreover, the FalconStor deduplication engine used in the Spectra nTier appliance provides numerous policies that IT can utilize to trigger data deduplication of the cartridge data based on either the elapsed time after a backup or the remaining capacity of the cartridge. This means deduplication can occur over a very wide time interval that ranges from immediately to the time it takes to consume the full capacity of a VTL cartridge.
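
The two kinds of trigger can be pictured with a small sketch. The class, field names, and the 800GB cartridge size are hypothetical and do not reflect FalconStor's actual policy settings; the code only shows how an elapsed-time rule and a remaining-capacity rule span the range from "immediately" to "when the cartridge is full."

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DedupPolicy:
    """Hypothetical policy record; either trigger may be left unset."""
    hours_after_backup: Optional[float] = None    # elapsed-time trigger
    remaining_gb_at_most: Optional[float] = None  # remaining-capacity trigger


def should_deduplicate(policy: DedupPolicy, hours_since_backup: float,
                       capacity_gb: float, used_gb: float) -> bool:
    """Return True when either configured trigger has fired."""
    if (policy.hours_after_backup is not None
            and hours_since_backup >= policy.hours_after_backup):
        return True
    remaining_gb = capacity_gb - used_gb
    if (policy.remaining_gb_at_most is not None
            and remaining_gb <= policy.remaining_gb_at_most):
        return True
    return False


immediate = DedupPolicy(hours_after_backup=0)
when_full = DedupPolicy(remaining_gb_at_most=0)

print(should_deduplicate(immediate, 0, 800, 46))     # True
print(should_deduplicate(when_full, 72, 800, 46))    # False
print(should_deduplicate(when_full, 72, 800, 800))   # True
```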

We created and managed three VTLs using FalconStor's software. To back up ESX 4.0 VMs, we set up a virtual T200 library with eight HP Ultrium 4 drives. With data deduplication enabled and a policy assigned to the library's cartridges, the FalconStor software automatically created a bank of virtual tape drives, which were used in deduplication jobs for any VTL. Following deduplication, the cartridge compression reflects the space used in the VTL pool to store pointers to the unique data now stored in the SIR pool.

We utilized thin provisioning in each of our test VTLs. In particular, we configured the cartridges to have just enough initial capacity to safely store a typical backup image for that VTL. We set the incremental growth in cartridge capacity for each VTL to be between 10GB and 20GB. We also made data deduplication an on-demand process rather than an automatic one for each VTL in order to maintain maximum control of our test results.
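
A rough model of thin provisioning shows why this saves space in the VTL pool. The function below is an illustrative sketch rather than the appliance's allocation logic: the 15GB initial size is a made-up example, the 10GB increment sits in the range we used, and 800GB is the native capacity of an LTO-4 cartridge.

```python
import math


def allocated_gb(written_gb: float, initial_gb: float,
                 increment_gb: float, max_gb: float) -> float:
    """Disk space allocated to a thin-provisioned virtual cartridge."""
    if written_gb > max_gb:
        raise ValueError("data exceeds the cartridge's nominal capacity")
    if written_gb <= initial_gb:
        return initial_gb
    steps = math.ceil((written_gb - initial_gb) / increment_gb)
    return min(initial_gb + steps * increment_gb, max_gb)


# Hypothetical cartridge: 15GB to start, growing in 10GB steps up to 800GB.
print(allocated_gb(12, 15, 10, 800))   # 15
print(allocated_gb(46, 15, 10, 800))   # 55
```

In this example a 46GB backup ties up 55GB of pool space rather than the full 800GB a fully provisioned cartridge would reserve.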

When data deduplication is enabled on a VTL, the nTier Deduplication appliance creates a separate bank of virtual tape drives that are only used when data is deduplicated on a cartridge. While the SIR and all other constructs supporting data deduplication are independent of the VTLs created, the rules governing data deduplication are cartridge- and VTL-specific.

Backup, transfer, dedupe

For our two independent VTLs, these were the only steps needed to begin utilizing the libraries. To test automated data caching with the T50e physical tape library, the VTL configuration process required several more steps that began with the setup of a communications link with a target physical library.

In our test case, both libraries were of the same type, a Spectra T50e. This is not a requirement: The critical commonality is not the robotics, but the distinguishing characteristics of the tape cartridges. This leaves IT free to make virtual upgrades to any VTL that is exposed to data protection applications in place of the physical library. In our tests, we added six more LTO-4 tape drives to the virtual Spectra T50e exposed to NBU. With this configuration, we were able to back up eight VMs simultaneously and minimize the backup window for VMs hosted on our ESX 4.0 server.

We ran an initial backup of a Windows file server with NBU. Immediately following the backup, we ran a data deduplication process on the virtual backup cartridge. Deduplication took less than 7 minutes and resulted in a 10-to-1 reduction in the data footprint of the original backup set. Total storage used for the backup, including unique data in the SIR and pointers to that data in the VTL pool, was just under 5GB after data deduplication.

We began by using the VTL interface to make two of our four Fibre Channel adapter ports initiators rather than targets. This allowed us to discover the Spectra T50e on the nTier v80 and assign the physical T50e to the virtual T50e that we had created.

We were now able to synchronize our virtual T50e with its physical partner. During this process, virtual cartridges were created on the VTL with the same barcodes as the cartridges in the physical library. In addition, we cached the header of each physical cartridge on its virtual partner. This latter step was required by NBU, which checks that the barcode and the tape header of a cartridge match before it writes data to the cartridge. This feature makes it impossible to use both the physical and virtual libraries simultaneously with NBU: The two libraries create duplicate barcode entries, which conflict in the NBU media database.
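
The synchronization step can be sketched in a few lines. Everything here is hypothetical (the record layout, function names, and sample barcodes); it models only the idea described above: each virtual cartridge inherits the barcode and cached header of its physical partner, so a check that compares both before writing succeeds against the surrogate VTL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cartridge:
    barcode: str
    header: bytes   # tape header as recorded on the medium


def synchronize(physical_inventory: list) -> dict:
    """Create one virtual cartridge per physical one, copying barcode and header."""
    return {c.barcode: Cartridge(c.barcode, c.header) for c in physical_inventory}


def write_allowed(library: dict, barcode: str, expected_header: bytes) -> bool:
    """Stand-in for the pre-write check: barcode present and header matches."""
    cartridge = library.get(barcode)
    return cartridge is not None and cartridge.header == expected_header


physical = [Cartridge("A00001L4", b"hdr-1"), Cartridge("A00002L4", b"hdr-2")]
virtual = synchronize(physical)

print(sorted(virtual))                                # ['A00001L4', 'A00002L4']
print(write_allowed(virtual, "A00001L4", b"hdr-1"))   # True
print(write_allowed(virtual, "A00001L4", b"stale"))   # False
```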

We took the nTier automated data caching process one step further by checking the health of the media in each cartridge located in the physical T50e using the Spectra BlueScale Media Lifecycle Management (MLM) on the library. This feature allowed us to remove virtual tapes from the VTL that were partnered with suspect physical cartridges. As a result, we did not have to worry about media failure when we copied new data from the VTL to the physical Spectra T50e.

In this way, the NBU media catalog remained valid for both virtual and physical cartridges. In particular, we would flush the contents of the cached virtual cartridges to the physical T50e and then apply data deduplication to the virtual cartridge. With nTier automated data caching, we used the data on the nTier disk to create a backup tape cartridge that could be encrypted using the T50e's BlueScale Encryption Key Management and sent off-site for secure storage, while keeping a local copy of the backup with a storage footprint minimized by data deduplication.

Saving space with global deduplication

We began our assessment of the nTier Deduplication appliance's backup and restore capabilities by using NBU to back up the contents of a volume on a Windows server containing 46GB of data to a VTL on the nTier v80. Following the backup, we ran a data deduplication process. After deduplication, the nTier v80 was able to represent all of the data by storing just 4.6GB of unique data in the SIR storage pool and 359MB of pointer data in the VTL pool.

We were able to restore the deduplicated data back to the server at a rate of 75MBps. This would serve as a baseline for performance with our ESX 4.0 host server. Even more impressive was the combination of backup throughput and the storage utilization gains that the nTier appliance was able to provide for our vSphere and ESX 4.0 virtual environment. By allocating eight tape drives to both our Spectra T200 and T50e VTLs, we were able to run eight VM backup processes targeting either VTL in parallel. As a result, NBU was able to load balance across the eight processes and keep a steady stream of data flowing to the nTier v80.

 
When we ran an NBU backup policy that launched a backup for all of the VMs on our ESX 4.0 host and used either the T200 or T50e VTL, NBU recognized the eight virtual tape drives and ran all of the backup processes in parallel. At the same time, we measured average data throughput at the FC switch port for our appliance at about 225MBps, with peak throughput reaching 280MBps. Following the backup, we ran a deduplication process on all eight cartridges. Each 12GB backup image could be represented by adding a unique set of data to the SIR ranging in size from 245MB to 600MB, which represents an average data deduplication ratio of 35-to-1.
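
The deduplication ratios quoted in this section can be checked with a short calculation. The helper below is our own illustration and assumes decimal units (1GB = 1,000MB). For the VM images only the unique SIR data is counted, since pointer overhead was reported only for the file-server backup.

```python
def dedup_ratio(original_mb: float, stored_mb: float) -> float:
    """Original size divided by the space kept after deduplication."""
    return original_mb / stored_mb


# File-server backup: 46GB reduced to 4.6GB of unique data plus 359MB of pointers.
file_ratio = dedup_ratio(46_000, 4_600 + 359)

# VM images: 12GB each, with 245MB to 600MB of new unique data per image.
vm_best = dedup_ratio(12_000, 245)
vm_worst = dedup_ratio(12_000, 600)

# Unique data per image implied by the reported 35-to-1 average.
implied_unique_mb = 12_000 / 35

print(f"file server {file_ratio:.1f}:1")
print(f"VM images {vm_worst:.0f}:1 to {vm_best:.0f}:1, "
      f"about {implied_unique_mb:.0f}MB per image at 35:1")
```

The file-server result comes out at about 9.3-to-1 once pointer data is included, which is consistent with "on the order of 10-to-1," and the 35-to-1 VM average falls between the 20-to-1 and 49-to-1 extremes.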

We were able to minimize the backup window for the VMs hosted on our ESX 4.0 server by matching the number of tape drives in the VTL storage target with the number of VMs to be backed up. In that scenario, NBU ran each drive in parallel and load balanced the backup streams across the available bandwidth of our nTier v80 with its single RAID-6 storage array.
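
As a back-of-the-envelope estimate (derived from the reported figures, not a separate measurement), sustained throughput translates into elapsed time as follows. The function name is ours, units are decimal, and the 128MBps figure is the typical cartridge deduplication rate listed in the key findings.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to move size_gb at a sustained rate (1GB = 1000MB)."""
    return size_gb * 1000 / rate_mb_per_s / 60


vm_backup = transfer_minutes(8 * 12, 225)   # eight 12GB VM images at the average rate
file_dedup = transfer_minutes(46, 128)      # 46GB file-server cartridge at the typical dedup rate

print(f"{vm_backup:.1f} min, {file_dedup:.1f} min")   # 7.1 min, 6.0 min
```

The second figure lines up with the sub-7-minute deduplication run observed for the Windows file-server backup.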

This configuration strongly validated the use of the nTier v80's automated data caching feature. Using either the virtual T200 or the virtual T50e, both of which we had configured with eight LTO-4 drives, we were able to handle a steady stream of VM backup data on the order of 225MBps and subsequently deduplicate cartridge data for a net storage savings of 35-to-1.

Jack Fegreus is the CTO of openBench Labs.

OPENBENCH LABS SCENARIO

UNDER EXAMINATION: VTL with data deduplication

WHAT WE TESTED: Spectra Logic nTier Deduplication appliance

Spectra Logic nTier v80 appliance

-- (2) Intel Xeon dual-core CPUs
-- (2) 1Gbps NICs
-- (2) 4Gbps dual-port FC HBAs
-- 8TB RAID-6 storage
-- Spectra BlueScale Web Interface

HOW WE TESTED:

Dell PowerEdge 1900 server

-- Quad-core Xeon CPU
-- 4GB RAM
-- Windows Server 2003
-- VMware Consolidated Backup (VCB)
-- Veritas NetBackup v6.5.3

Dell PowerEdge 1900 server

-- (2) Quad-core Xeon CPUs
-- 8GB RAM
-- VMware ESX Server
-- (8) VM application servers
-- Windows Server 2003
-- SQL Server
-- IIS

Spectra T50e Library

-- (2) Dual-port 4Gbps FC controllers
-- (1) IBM LTO-4 tape drive per controller
-- Spectra BlueScale Web Interface

Xiotech Emprise 5000 disk system

-- (2) 4Gbps ports
-- (2) DataPacs

KEY FINDINGS

-- The nTier appliances maximize utilization of storage resources via the thin provisioning of virtual tape cartridges and data deduplication.
-- Post-processing deduplication can be scheduled based on the length of time after a backup or on the utilization of virtual tape cartridges. 
-- Backup benchmark: A backup of eight VMs on a VMware ESX 4.0 host ran at 225MBps with eight simultaneous streams.
-- Restore benchmark: After running data deduplication and reducing the storage footprint on the order of 10-to-1, a restore of Windows file data ran at 50MBps.
-- The nTier v80 provides for assigning surrogate VTLs to physical tape libraries by caching the barcodes and tape headers of tape cartridges in the physical library.
-- Automated caching of tape cartridges provides for copying backup data to physical cartridges for security, while maintaining local deduplicated cartridges on the VTL for quick recovery.
-- Hashing algorithms enable the nTier Deduplication appliances to provide rapid deduplication of tape cartridges, which are typically processed at 128MBps.
-- Storage utilization benchmark: Following the backup of eight VMs on a VMware ESX 4.0 host, we ran a deduplication process on all eight cartridges. Each 12GB backup image was represented by unique sets of data in the SIR ranging in size from 245MB to 600MB, representing an average data deduplication ratio of 35-to-1.

