In an earlier report on the progress of information life-cycle management (ILM) initiatives in IT organizations, a few users and analysts were quick to note one obstacle to more widespread ILM adoption: a purported lack of tools that can automate the process of migrating data to other tiers of storage (see “It’s a long and winding road to ILM,” InfoStor, June 2005, p. 24).
To test that hypothesis, we looked for users who had already established multiple tiers of storage in their IT infrastructures and were benefiting from automated, policy-based data movement across those tiers.
Most users we spoke with began developing different tiers of storage as a means of consolidating file-based storage. Often, their consolidation projects had begun with high-performance, high-cost NAS systems (or NAS-SAN gateways). These high-end systems comprised the first tier of storage in a multi-tiered architecture. However, time and explosive data growth had caused these systems to fill up, forcing the IT organization to choose between adding more high-end storage capacity at a fairly high cost per gigabyte and migrating less-critical data from the primary systems onto lower-cost platforms, such as those based on Serial ATA (SATA) disks.
One such user is Matt Miller, senior systems administrator at Duke University, in Durham, NC. The university originally acquired a pair of clustered Network Appliance F820 NAS systems a few years ago to help consolidate files from a variety of Windows NT servers. Once the servers were consolidated, Miller and his team turned their attention to consolidating the large quantities of data remaining on user workstations. “We had lots of research in data sets that was still not being backed up on a routine basis,” Miller explains, “but bringing all that data into the central facility was just not possible because of the price we were paying for the tier 1 storage.”
Miller and his team began looking at ILM software in early 2003. There wasn’t much to choose from back then, he recalls, but they were still able to find a product that satisfied the team’s two key requirements: that the solution would not have to “front-end” any of the Network Appliance filers with another server or device, and that migrating files to other storage tiers would have no impact on end users.
They ultimately chose to implement a Network Appliance NearStore R200 array as their tier 2 storage, with 6TB of usable ATA-based capacity. To automatically migrate files from tier 1 to tier 2, they chose NuView’s File Lifecycle Manager (FLM) software.
After running NuView’s free FileAges utility for a few weeks to identify the access frequency of many of the files located at the university’s various CIFS sharepoints, Miller says that creating an initial policy was relatively straightforward. “We realized a lot of our assumptions were true; for example, much of our data was not accessed very often,” says Miller. They discovered that about 60% of their data (about 840GB) had not been written to in six months and could be migrated to the NearStore R200 with the help of the FLM software.
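The kind of age-based analysis Miller describes can be approximated with a short script. The sketch below is a simplification, not NuView’s FileAges utility, and the mount point is hypothetical; it simply walks a share and totals the capacity held by files that have not been written to in a given number of days.

import os
import time

def stale_file_report(root, days=180):
    # Return (stale_bytes, total_bytes) for files under root.
    cutoff = time.time() - days * 86400
    stale_bytes = total_bytes = 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files that vanish or are unreadable
            total_bytes += st.st_size
            if st.st_mtime < cutoff:  # not written to since the cutoff
                stale_bytes += st.st_size
    return stale_bytes, total_bytes

if __name__ == "__main__":
    # "/mnt/cifs_share" is a hypothetical mount point for one of the CIFS sharepoints.
    stale, total = stale_file_report("/mnt/cifs_share", days=180)
    if total:
        print(f"{stale / total:.0%} of {total / 1e9:.0f}GB has not been written to in 180 days")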
Miller’s team first instituted a policy in FLM to migrate this infrequently accessed data from tier 1 to tier 2, with a goal of keeping capacity utilization on the tier 1 filers no higher than 80%. Any files migrated via FLM appear as normal files to end users. After the actual file is moved to tier 2, FLM just leaves metadata for the file on tier 1. When a user attempts to access a file stored on tier 2, the system quickly restores the file to tier 1 for higher-performance access.
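The stub-and-recall behavior described here is common to most HSM-style tools. The following minimal sketch illustrates the general pattern only; the stub format, names, and paths are hypothetical and do not reflect FLM’s actual implementation. The file body moves to tier 2, a small stub recording its new location stays on tier 1, and the file is moved back when recalled.

import json
import shutil
from pathlib import Path

STUB_SUFFIX = ".hsm-stub"  # hypothetical marker; a real product's stubs differ

def migrate(tier1_path: Path, tier2_dir: Path) -> Path:
    # Move the file body to tier 2 and leave a small stub behind on tier 1.
    target = tier2_dir / tier1_path.name
    shutil.move(str(tier1_path), str(target))
    stub = Path(str(tier1_path) + STUB_SUFFIX)
    stub.write_text(json.dumps({"tier2_location": str(target)}))
    return stub

def recall(stub_path: Path) -> Path:
    # Restore the file to tier 1 when a user tries to open it.
    info = json.loads(stub_path.read_text())
    original = Path(str(stub_path)[: -len(STUB_SUFFIX)])
    shutil.move(info["tier2_location"], str(original))
    stub_path.unlink()
    return original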
ILM and tiered storage have provided a number of benefits in this example. Extending the life of the university’s primary storage is just one of the advantages, according to Miller. “We’ve been able to keep our filers small because of the ILM,” he says. He estimates that the university could theoretically continue to use its nearly four-year-old filers for another three or four years before upgrading them. The only reason for a quicker upgrade would be to gain better performance, says Miller.
Other benefits include reduced backup times. Miller’s team does not back up tier 2 data, since it was already backed up when it resided on tier 1. The university also replicates data from tier 1 to tier 3, which consists of Linux systems running the open-source rsync utility; rsync performs the replication and copies only changed data to disk. The university plans to retain this data on tier 3 storage for 10 years.
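As an illustration of that tier 3 arrangement, a changed-data replication pass with rsync might look something like the sketch below; the host name and paths are hypothetical, not Duke’s actual configuration.

import subprocess

def replicate_to_tier3(source="/mnt/tier1_export/", dest="tier3host:/archive/tier1/"):
    # -a preserves permissions and timestamps; rsync skips files that have not changed.
    subprocess.run(["rsync", "-a", source, dest], check=True)

if __name__ == "__main__":
    replicate_to_tier3()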
Feeding the beast
Matt Decker, a lead systems engineer at Honeywell Federal Manufacturing & Technology (FM&T), in Kansas City, also knows all too well what you can uncover when you start doing some of your own detective work regarding the types of data routinely stored and backed up in an IT organization.
As a prime contractor for the National Nuclear Security Administration (NNSA), Honeywell FM&T maintains an IT infrastructure that holds several terabytes of both structured and unstructured data. Wanting to get out of the trap of continuously “feeding the storage beast” by simply buying more storage, Honeywell created a six-person team two years ago responsible for developing a hierarchical storage management (HSM) system that could make more economical use of its storage resources. As one of the team members, Decker soon realized that HSM was really just a by-product of a proper ILM strategy. He decided his first order of business on the road to ILM would be figuring out how to manage all of the company’s unstructured data, the area where he expected to gain the biggest reward.
At first, they floundered a bit with a cruder HSM model that simply moved a portion of files to DVD-RAM jukeboxes when the primary storage systems (from Hitachi Data Systems) reached capacity thresholds. With no robust policy-based management capability, and with many hurdles for users trying to access their archived files, this solution soon fell short. That’s when Honeywell turned to Arkivio’s auto-stor software, which provides a number of ILM-related modules. From the first day of use, Decker was able to start identifying usage profiles that allowed more-granular classification of data based on its value to the organization.
Decker took advantage of auto-stor’s basic reports and utilities to learn how his organization’s file data was being utilized. “I found that 80% of our unstructured data had not been looked at or modified in well over a year, yet we were keeping it on our primary storage and backing it up on a daily basis when we didn’t really need to,” says Decker.
Another Arkivio utility identified 4.5 million duplicate files. By “dredging the data” further, Decker also found a large number of Office *.tmp files from the early 1990s currently stored on the SAN. “We were still backing up and restoring these files that had been abandoned and had now become part of our permanent data set.”
Decker was even able to reclaim hundreds of gigabytes of space by setting an auto-stor policy to limit the size of users’ recycle bins and automatically discard recycle bin contents when the limit was reached.
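A quota policy of that sort boils down to simple logic. The sketch below is an illustration only, not Arkivio’s implementation, and the size limit is a hypothetical value; it totals a recycle bin’s contents and discards the oldest items once the limit is exceeded.

from pathlib import Path

def enforce_recycle_limit(bin_dir, limit_bytes=500 * 1024**2):
    # Total the bin's contents, then discard the oldest files until it is under the limit.
    files = [p for p in Path(bin_dir).rglob("*") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    for p in sorted(files, key=lambda f: f.stat().st_mtime):
        if total <= limit_bytes:
            break
        total -= p.stat().st_size
        p.unlink()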
The impetus for this latest data cleanup effort came as Decker’s team was preparing to implement a new EMC Centera system, where much of the static, historic content will be stored. Decker also still uses Arkivio’s auto-stor with the DVD-RAM jukeboxes: a policy automatically migrates terminated employees’ data onto DVD-RAM media, where it remains available if needed.
Once the Centera system goes in, Decker plans to expand the migration policies to encompass compliance rules as well, indicating how long specific files must be retained in unalterable form on the Centera platform. Migrated files appear to end users with a slightly modified icon that denotes a stub, or link, file; clicking on it takes them to the archived version.
Once much of the automated migration to the Centera system is under way, Decker expects a number of advantages, including a smaller primary storage footprint and significant reductions in backup-and-restore times. Although the compliance component may increase initial costs, Decker looks forward to the cost efficiencies he expects in years two and three. “It will be extremely efficient and cheaper in the long run, because now we’re managing the data. Before it was a free-for-all, throwing money in, feeding the storage beast. Now we’ll be intelligently managing costs. Since it’s policy-based, it will free up a lot of time,” he says.
Decker’s expectations for cost savings are not unrealistic. Strategic Research Corp. (SRC) studied 40 companies that had already implemented tiered storage. On average, these companies experienced a 40% reduction in storage operating costs. “The ones who benefited the most implemented at least three tiers: primary, secondary, and disk-based backup,” says Michael Peterson, president of SRC and program director of the Storage Networking Industry Association’s Data Management Forum. “For example, those with three tiers saw average annual savings of $5,000 to $10,000 per terabyte, depending on their total disk capacity.”
Enabling tiered storage
Some users are being a bit more fluid in how they set up and migrate data among the different storage tiers. According to Mack Kigada, systems engineer at Providence Health Systems, in Portland, OR, multiple tiers of storage don’t always have to reside on different physical disk drives. Kigada is implementing a few tiered storage models, including one disk-based backup model that takes snapshots of live data on an EMC Clariion system, stores the snapshots in another disk location on the same system, and then backs up the snapshots via Veritas’ NetBackup. The backups are stored in a separate archive location, possibly on the same Clariion disks.
“I have the live tier and am going to have the snaps and the archive tier all on the Clariion, and they may even be on the same disk,” says Kigada, noting that while this type of logical tiering may differ from what many tiered storage proponents talk about, it can still offer tremendous management savings, such as reduced backup times. Kigada is also instituting a more-traditional tiered storage model using EMC’s Celerra NAS servers with Celerra FileMover and EMC’s DiskXtender policy engine, along with the Clariion device equipped with 150 ATA drives. Providence Health is also bringing in an EMC Centera content-addressed storage (CAS) platform to replace its existing optical jukeboxes.
This new architecture is part of a move to centralize data centers and consolidate direct-attached storage (DAS) from four different regions onto one centrally located NAS system. To start with, Kigada and his team will focus on porting more than 2TB of data distributed across several DAS servers. He plans to start deploying the system in a pure HSM model, automatically migrating unused files to ATA storage after 30 days. He believes this time frame may shrink to as short as a week as the company begins to learn more about user-access patterns for files.
HSM by any other name…
These case studies underscore the early-HSM flavor of many current ILM implementations for unstructured data. For organizations starting from an unmanaged environment that stores, backs up, and archives virtually everything, even straightforward HSM applications can yield cost savings. According to Arun Taneja, founder of the Taneja Group consulting firm, that’s a good place to start while users wait for more-mature, comprehensive ILM solutions to come to market. “For unstructured data, most companies find they can save 30% to 60% of their primary storage. That’s a good start, even if it has to be done on a rudimentary basis,” he says. Taneja also recommends taking advantage of more-mature ILM solutions for semi-structured and structured data, such as those currently available for e-mail systems.
In this emerging space, IT organizations are already starting to see a variety of entrants that claim to perform automated data migration across storage tiers. The list of vendors is likely to grow as storage virtualization and grid technologies arrive with automated tiering as an integral component.
Both Taneja and Peterson point to products from vendors such as Princeton Softech and OuterBay, as well as e-mail archiving vendors such as KVS, as examples of solutions that currently offer application-layer support for policy-based data migration. Taneja also gives a nod to younger vendors to watch, including StoredIQ, Kazeon, Scentric, and UK-based nJini, in an emerging ILM-related area he calls information classification management (ICM).
In the meantime, Taneja recommends moving forward with the solutions available today. “In the absence of all elements of the ILM stack being available, even a generic HSM approach is a good start and will provide significant value.”