By Kevin Komiega
In part one of our two-part series on disk-to-disk backup and data de-duplication (see “De-dupe inches into VTL backup” at www.infostor.com) we examined where and how de-duplication is being used in the data center. De-duplication is steadily inching its way into the backup process, but many users are taking a slow, deliberate approach toward putting the technology into production environments.
In part two of our series on de-duplicating backup systems we take a look at how some vendors are attempting to lure users into the world of de-dupe by offering guaranteed data compression ratios, discounts, and even free software in the hopes that hesitant users will get off the fence and start using the technology throughout the data center.
NEC guarantees a de-duplication ratio of 20:1. Sepaton offers a FastStart Deduplication Package for Symantec NetBackup environments with an assurance of at least a 40:1 de-duplication ratio on Microsoft Exchange data in 30 days. If the package misses that mark, Sepaton provides extra disk capacity for free.
NetApp has taken it a step further by offering its Advanced Single Instance Storage (A-SIS) de-duplication technology as a free feature with its Data ONTAP 7G operating system.
For Jonathan Davis, a system administrator at the Duke Institute of Genome Sciences and Policy (IGSP) and a NetApp customer, activating the free A-SIS feature in his existing storage environment was a no-brainer.
“I’ve seen a lot of vendors quoting de-duplication ratios that are unrealistic. In the real world, even a ratio of 2:1 is good because you free up half of your disk,” says Davis. “NetApp is offering de-dupe for free across all of its filers. I’ve seen some other vendors charging 40% of the overall system cost for their de-dupe, but if you’re a NetApp user, whatever disk you get back is free.”
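Davis’s point about a 2:1 ratio is easy to check with simple arithmetic. The short Python sketch below converts a de-duplication ratio into the fraction of disk it frees. The function name `space_freed_fraction` is ours, chosen for illustration, and does not come from any vendor tool.

```python
def space_freed_fraction(ratio: float) -> float:
    """Return the fraction of original capacity freed by an N:1 ratio."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1 (1:1 means no reduction)")
    # An N:1 ratio means the data occupies 1/N of its original size.
    return 1.0 - 1.0 / ratio


# Ratios mentioned in this article: Davis's 2:1, NEC's 20:1, Sepaton's 40:1.
for ratio in (2, 20, 40):
    print(f"{ratio}:1 frees {space_freed_fraction(ratio):.1%} of the disk")
```

The returns diminish quickly: going from 2:1 to 20:1 raises the freed share from 50% to 95%, while doubling again to 40:1 adds only another 2.5 percentage points.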
The IGSP includes seven centers of genomic research spanning areas from medicine and bioinformatics to genome ethics, law, and policy. The Institute’s current infrastructure includes more than 100 physical servers, 30 virtual server machines, and almost 300TB of disk capacity. Davis runs a pair of NetApp FAS3070 systems clustered in an active-active configuration at the primary data center, with additional FAS3070 and FAS3020 systems in another building on campus.
The IGSP also recently replaced its tape-based backups with NetApp snapshot copies to disk. Davis uses SnapMirror for synchronous mirroring between the primary FAS3070C and his remote FAS systems.
Davis started looking at NetApp’s A-SIS offering as soon as he heard rumblings that the company was about to put it into beta testing. He thought applying the technology to the IGSP’s four NetApp filers with 250TB of capacity would help alleviate capacity bloat and speed data analysis, ultimately helping the Institute’s researchers store and process more data. He was right.
“We were facing density issues and wanted to pack our systems as tightly as possible. We began deploying de-duplication on a few non-critical volumes in beta and had wonderful luck,” says Davis. “When it moved onto our production systems, I put it on everything.”
Davis estimates that data de-duplication has saved the IGSP upwards of 20% on disk acquisition costs over a one-year period, especially in relation to virtual servers.
“When we originally set up VMware, I allocated about 2.4TB for it. With de-duplication, I’ve been able to decrease that to less than 700GB,” says Davis. “We originally planned to buy more disk space out of our central budget, but because of de-duplication, we’ve managed not to have to use the actual IT budget.”
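The VMware figures Davis cites imply a de-duplication ratio that can be worked out directly. The sketch below is a rough back-of-the-envelope calculation and rests on two assumptions: decimal units (1TB = 1,000GB) and 700GB as the upper bound, since Davis says "less than 700GB." The function name `implied_ratio` is hypothetical.

```python
def implied_ratio(before_gb: float, after_gb: float) -> float:
    """Return the N in an N:1 reduction from before/after capacity."""
    if before_gb <= 0 or after_gb <= 0:
        raise ValueError("capacities must be positive")
    return before_gb / after_gb


before_gb = 2.4 * 1000  # original VMware allocation (assumes 1TB = 1,000GB)
after_gb = 700          # upper bound on the de-duplicated footprint

ratio = implied_ratio(before_gb, after_gb)
saved = 1 - after_gb / before_gb
print(f"about {ratio:.1f}:1, or {saved:.0%} less disk")
```

Under those assumptions the reduction works out to roughly 3.4:1 or better, far below the 20:1 and 40:1 figures vendors advertise but still enough to reclaim about 71% of the original allocation.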
Living up to the hype
In the middle of 2006, disk-based backup systems with data de-duplication were rare. Most virtual tape library (VTL) vendors were just beginning to ship VTLs, many of which promised de-duplication in their product road maps. That is when Prem Ratnam, manager of systems and security administration for the Canadian Automobile Association (CAA), began shopping for a new backup system.
The CAA is a federation of nine automotive clubs serving more than five million members through more than 140 offices across Canada, offering a variety of automotive, travel, insurance, and related services. As a 24×7 service organization, CAA’s systems are constantly taxed, leaving little time for lengthy backup windows.
The CAA’s planned downtime windows have to be small to accommodate constant demand for services and data availability, and backups to its StorageTek library system and four LTO-2 tape drives typically ran past the eight-hour backup window.
Additionally, the tape system only allowed for the retention of seven days’ worth of data. “Most of our restore requests were for data that was outside the seven-day period,” Ratnam explains, “which meant we had to restore from tape that was stored off-site. When a restore request came in, it would normally take at least a day for the tapes to be delivered, and five to ten hours to restore data from them.”
Ratnam and his team were spending an inordinate amount of time managing backups. “We needed a longer retention period, and the options available were either to buy a bigger library, faster tape drives, or go to a disk-based solution,” he says.
CAA surveyed the vendor landscape and brought in a Data Domain DD560 disk-based backup system. The twist was that Ratnam put the system through its paces using production data rather than in a test environment. The gamble paid off, and he was surprised by the results.
“Data does not change much in lab environments and you can get amazing de-dupe ratios, but they can be misleading. Production data changes by the minute and really helped us to see what ratios and implementation issues we would be facing,” says Ratnam.
Data Domain’s DD560 system has enabled CAA to increase its data-retention period from seven days to thirty days. CAA retains about 100TB of backup data and each night backs up another 3TB. The IT team also put a DD580 and a DD410 system in their primary data center to handle additional backup jobs.
Ratnam backs up Unix, Windows, and NetWare systems and is seeing real-world compression ratios of up to 35:1 on Windows and Unix data. Data in the NetWare environment is already pre-compressed, but Ratnam is seeing an added 10:1 compression with Data Domain’s DD560.
“When we went into this we did not have a set ratio that we needed to achieve. We looked at the literature and saw all sorts of numbers, but considering we came from an entirely tape-based environment, I would not want to go back there again. It would not have worked,” Ratnam says.
Mulling the options
One end user still on the de-dupe fence is Prairie Cardiovascular, a nationally recognized heart care center. “We’re not using de-duplication today, but we’re talking about it and how we may use it,” says John Collins, Prairie Cardiovascular’s chief information officer.
Collins is not blown away by guarantees, discounts, or free de-duplication software. He is more concerned with finding a solution that meets his needs. Prairie Cardiovascular’s requirements are unique in that not all redundant data is just taking up space. Some of it is necessary.
“In a healthcare imaging environment you end up with huge data sets that become embedded as part of a patient record, but they also need to be manipulated by physicians,” says Collins. “As a result, we end up with at least three or four copies of images in some cases.”
Prairie has a mix of applications, including a large SQL database, an electronic medical records system, and systems for vascular and diagnostic data. Collins recently completed an overhaul of the center’s entire SAN infrastructure, replacing aging gear with a Hewlett-Packard StorageWorks 8100 Enterprise Virtual Array (EVA).
The 8100 EVA is also part of Prairie Cardiovascular’s backup plan. The company backs up data from disk-to-disk-to-tape, using an HP StorageWorks MSL2024 tape library for off-site backups.
Collins is not looking at de-duplication to save money on storage. That is actually low on his list of concerns. “The cost of storage is not what attracts me to de-duplication. Disk is cheap. It’s more about reducing backup windows and the time to recovery, as well as the amount of time it takes to search for data,” says Collins. “I’m more concerned with the time it takes my guys to manage this stuff.”
Collins turned to CommVault for his backup and e-mail archiving needs and is now mulling the pros and cons of using the Single Instance Store (SIS) de-duplication feature in CommVault’s Simpana software suite.
“We have to come up with some ways to cut our backup windows and start controlling the growth,” says Collins, “but we’re struggling with how to deploy de-duplication from a procedural standpoint. It’s mainly a convenience issue for our physicians and determining how we store copies for compliance reasons.”