ILM boils down to classifying data based on its relative business value and then putting the data on the most appropriate storage resource.
By Claudia Chandra
New government regulations on data retention, such as SEC Rule 17a-4 for financial services, HIPAA for healthcare, and 21 CFR Part 11 for the life sciences industries, are driving the need to manage data over its life cycle and to dispose of it at the proper time to avoid the possibility of civil or criminal liability.
Changes in the storage infrastructure at most companies over the last five years are also driving interest in information life-cycle management (ILM). The transition from host-based to network-based storage (both fabric-attached and network-attached) and the development of new storage technologies (e.g., Serial ATA) have led IT managers to tier their storage environments to deliver the proper balance of performance, data accessibility, cost, and data reliability for different classes of data according to their value to the business.
The growing importance of ILM comes from the realization that the value of data to a company changes over time and can vary between different users or departments within an organization. If information about the changing value of data can be harnessed, then that information can be used to better manage data and storage resources. While placing data on the most appropriate storage resource over its life cycle makes a lot of sense, many challenges need to be addressed to implement an ILM solution.
A universal challenge that IT organizations face is understanding what data and storage resources they have in their environment. Although storage administrators often know how much total capacity has been allocated to applications and departments, they are less familiar with the different types of data created by each department and end user, the growth of these different data types, how much data is stale or most active, or how much capacity is available and used at each storage location. A tougher challenge is determining the value of different sets of data at any point in time. The same data set will change its value over time and different departments will value the same set of data differently. Therefore, the value of data needs to be reassessed periodically and allowed to vary across an organization.
Once data has been identified and classified based on its relative value, the next challenge is matching data to the most appropriate storage resource. Placing data on the right storage device involves moving files from the original location where they were stored to a new storage location. End users and applications, however, need ready access to their data as it is moved from location to location over its life cycle. File movement, therefore, has to be done transparently to users and applications. Once files are moved, managing backup and recovery can also become an issue. If administrators do not have an accurate record of where files have been moved, then they will not know what servers, volumes, and directories need to be included in the organization's backup process.
One method for implementing transparent data movement (migration) is to use hierarchical storage management (HSM) software. However, HSM can be difficult to implement in large enterprise environments because it requires administrators to configure individual HSM volumes as pre-defined sources or destinations, and migration rules can't be globally shared among multiple HSM servers. A better approach to data migration is therefore required.
Solving the various ILM challenges requires the appropriate processes, methodologies, and tools for dynamic data and storage resource classification and data placement.
The first step toward data and storage resource classification is assessing what you currently have. IT administrators must discover and collect detailed information about their existing data and storage resources. Once the assessment is complete, data classification involves the organization of resources into logical groups and assigning a value to each group. Storage resource management (SRM) tools that automate the discovery and organization of data and storage resources greatly simplify the administrator's task.
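As a rough illustration of the discovery step, the following Python sketch walks a directory tree and summarizes files by extension, reporting how much capacity each type uses and how much of it is stale. The function name `inventory`, the 180-day staleness threshold, and the use of modification time as the activity signal are all assumptions made for this example; commercial SRM tools collect far richer information.

```python
import os
import time
from collections import defaultdict


def inventory(root, stale_days=180, now=None):
    """Summarize files under root by extension: count, bytes, stale bytes.

    A file counts as stale when its modification time is older than
    stale_days (an assumed threshold, not an industry standard).
    """
    now = time.time() if now is None else now
    cutoff = now - stale_days * 86400
    summary = defaultdict(lambda: {"files": 0, "bytes": 0, "stale_bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            ext = os.path.splitext(name)[1].lower() or "(none)"
            entry = summary[ext]
            entry["files"] += 1
            entry["bytes"] += st.st_size
            if st.st_mtime < cutoff:
                entry["stale_bytes"] += st.st_size
    return dict(summary)
```

A report built from this summary, grouped by department or file server, is the kind of input that can then be taken into classification discussions.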
Although organizations have different sets of data that they regard as high value versus low value, in general, business requirements determine data value. Because of this, business users need to be involved in determining data value. This requires interviews across departments in order to align data valuation across the business. Reports generated by resource discovery tools can be leveraged in the interview process to define resource classifications.
In certain industries, however, departmental or federal regulations on data retention may dictate data classification. For these types of industries, only a limited number of interviews with the organization's CFO, CIO, and the legal department may need to be conducted. There are wide variations in how organizations manage the data life cycle. The appropriate process for classifying data is therefore dependent on the organization's type of business.
More-systematic methodologies may also be used to classify data. A common approach is to classify data by its attributes. Another method is project-based: after a project is completed, any data related to the project becomes low-value data that can be moved to less-expensive storage resources.
Overall, data types, applications, and age are the most important attributes for classifying data. A data-classification tool that can assign data value automatically based on business requirements and attributes can be valuable to IT administrators.
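Attribute-based classification of this kind can be sketched as an ordered list of rules over data type, application, and age. Everything specific below is hypothetical: the `FileInfo` fields, the application name "trading", the extensions, and the age thresholds are placeholders for whatever the business interviews actually produce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    extension: str     # data type, e.g. ".db"
    application: str   # owning application, as reported by discovery
    age_days: int      # days since last modification


# Hypothetical rules, most specific first; the first match wins.
RULES = [
    (lambda f: f.application == "trading" and f.age_days <= 30, "high"),
    (lambda f: f.extension in {".db", ".mdb"} and f.age_days <= 90, "high"),
    (lambda f: f.age_days <= 365, "medium"),
]


def classify(f, rules=RULES, default="low"):
    """Return the value class of the first matching rule, else the default."""
    for predicate, value in rules:
        if predicate(f):
            return value
    return default
```

Because the rules are data rather than code paths, they can be reviewed with business users and rerun periodically as the value of data changes.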
Whereas data classifications vary widely among organizations, there are common methodologies for classifying storage. Most organizations organize their storage resources into three general classes: primary (high value, very fast storage), secondary (medium value, fast storage), and tertiary (low value, offline storage). SANs and direct-attached storage (DAS) are often used for high-value data storage. SANs tend to be used for enterprise, high-availability, high-performance, clustered applications, whereas DAS houses legacy data that is still highly valued. NAS and ATA disk arrays are typically categorized as secondary storage. NAS devices are used for file sharing, user data, and first-level backup, although more and more NAS devices are being used as primary storage devices to replace and consolidate direct-attached file storage.
Serial ATA disk arrays are typically used to store less-critical, older data or fixed-content reference data that can be easily recovered or rebuilt. Serial ATA disks are also used as first-level backup. Tape, optical disks, and other write-once devices are usually considered tertiary storage. They are most often used for backup, archiving, and vaulting. Replicating data between primary storage and other levels of storage is commonly done to improve availability, reliability, and recovery.
Common attributes used to define classes of storage are reliability, availability, performance, accessibility, security, price, and capacity. With advances in storage performance today, availability tends to be the primary factor that determines value among different storage devices. As with data classification, tools that automatically discover storage resources and classify them based on these attributes can significantly simplify storage administration.
Matching data to storage
Once data and storage resources are classified, data can be moved to the appropriate level of storage based on its classification. Depending on how end users and applications access the data, different levels of data movement transparency should be provided.
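The matching step itself can be very small once both classifications exist. The sketch below assumes a one-to-one mapping from data value to storage class (high to primary, medium to secondary, low to tertiary); that mapping and the function name `plan_moves` are illustrative assumptions, and a real policy might also weigh capacity or cost.

```python
# Assumed mapping from data value to storage class.
TIER_FOR_VALUE = {"high": "primary", "medium": "secondary", "low": "tertiary"}


def plan_moves(files):
    """Plan the moves needed to put each file on its target tier.

    files: iterable of (path, value, current_tier) tuples.
    Returns a list of (path, current_tier, target_tier) for files that
    are not already on the tier matching their value.
    """
    moves = []
    for path, value, current in files:
        target = TIER_FOR_VALUE[value]
        if target != current:
            moves.append((path, current, target))
    return moves
```

Producing a plan first, rather than moving files immediately, leaves room to review the proposed movement before anything changes.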
For the highest level of transparency, users or applications access the data at the original location and the data is recalled upon access. This type of data movement is typically provided by HSM applications that migrate data and leave a tag or stub file at its original location.
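The stub idea can be illustrated with a deliberately simplified Python sketch. Real HSM products intercept file access inside the file system and recall data automatically; here the recall is an explicit function call, and the stub format, the `STUB_MARKER` string, and the size limit are hypothetical choices made only for this example. The sketch also assumes file names are unique within the archive directory.

```python
import json
import os
import shutil

STUB_MARKER = "ILM-STUB-V1"  # hypothetical marker identifying a stub file
MAX_STUB_BYTES = 4096        # anything larger is treated as real data


def migrate_with_stub(src, archive_dir):
    """Move src into archive_dir and leave a small stub at the old path."""
    os.makedirs(archive_dir, exist_ok=True)
    dst = os.path.join(archive_dir, os.path.basename(src))
    shutil.move(src, dst)
    with open(src, "w", encoding="utf-8") as stub:
        json.dump({"marker": STUB_MARKER, "location": dst}, stub)
    return dst


def recall(path):
    """If path is a stub, bring the data back. Return True if recalled."""
    if os.path.getsize(path) > MAX_STUB_BYTES:
        return False
    try:
        with open(path, "r", encoding="utf-8") as fh:
            info = json.load(fh)
    except ValueError:
        return False  # not JSON (or not text), so an ordinary file
    if not isinstance(info, dict) or info.get("marker") != STUB_MARKER:
        return False
    os.remove(path)                     # drop the stub
    shutil.move(info["location"], path)  # restore the original data
    return True
```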
To overcome the issue of migration rules that can't be shared among multiple HSM servers, a method of deploying global policies is needed for migrating data across multiple systems and data sets.
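One way to picture a global policy is as a rule defined once and evaluated against the inventory of every server, rather than configured volume by volume. The sketch below is an assumption-laden illustration: the `Policy` fields, the age-only criterion, and the server names in the usage example are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    min_age_days: int   # policy applies to files at least this old
    target_tier: str    # storage class the file should move to


def evaluate(policies, inventories):
    """Apply one shared policy set to every server's file list.

    inventories: {server_name: [(path, age_days), ...]}
    Returns (server, path, target_tier, policy_name) tuples. When several
    policies match, the one with the highest age threshold wins.
    """
    actions = []
    for server, files in inventories.items():
        for path, age_days in files:
            matching = [p for p in policies if age_days >= p.min_age_days]
            if matching:
                best = max(matching, key=lambda p: p.min_age_days)
                actions.append((server, path, best.target_tier, best.name))
    return actions
```

For example, two policies such as `Policy("cool-down", 90, "secondary")` and `Policy("age-out", 365, "tertiary")` could be evaluated against inventories from hypothetical servers "srv1" and "srv2" in a single call, so a change to either rule takes effect everywhere at once.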
Semi-transparent data movement allows users and applications to access the data from its original location, but the access is redirected to the current location of the migrated data.
This type of data movement can be achieved by migrating a file and leaving behind a link or shortcut at its original location.
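On systems that support symbolic links, the link-based approach can be sketched in a few lines of Python. The function name `migrate_with_link` is made up for this example, and the sketch assumes file names are unique in the destination directory. On Windows, creating symbolic links may require elevated privileges, and a shortcut file would behave differently from the link shown here.

```python
import os
import shutil


def migrate_with_link(src, dest_dir):
    """Move src into dest_dir and leave a symbolic link at the old path."""
    os.makedirs(dest_dir, exist_ok=True)
    dst = os.path.abspath(os.path.join(dest_dir, os.path.basename(src)))
    shutil.move(src, dst)
    os.symlink(dst, src)  # opens of the old path are redirected to dst
    return dst
```

After the call, an application that opens the original path reads the migrated file through the link, while a directory listing reveals that the file now lives elsewhere.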
In the case of non-transparent data movement, users and applications access the data directly at its new storage location. This method of migration is used when the IT department needs to permanently move files to a particular storage class or location.
ILM tools should provide these different types of automated data movement. Data movement can be achieved by using vendor-supplied software features such as move, copy, and replicate that are invoked from a command line or script, or a data management system with an automated policy engine.
Regardless of which tool is selected, it should monitor and report on the effectiveness of the data movement and simulate the policy or script actions. These reports can provide feedback to the IT department and users on the effectiveness of the resource classifications and data movements. The ILM tool should also provide an audit trail of its actions and a search mechanism for administrators to find the current location of files that have been moved. This enables IT administrators to determine what servers and volumes to back up and from what volumes to recover migrated data.
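The simulation, audit-trail, and search requirements can be sketched together. In the hypothetical Python example below, `apply_moves` either performs or merely simulates a set of moves, and `MoveLog` records each action so that administrators can look up a file's current location and list the directories that now need backup coverage. The class and function names are assumptions, and the log is kept in memory only; a real tool would persist it.

```python
import os
import shutil
import time


class MoveLog:
    """In-memory audit trail; a real tool would persist this durably."""

    def __init__(self):
        self.entries = []

    def record(self, original, current, simulated):
        self.entries.append({
            "time": time.time(),
            "original": original,
            "current": current,
            "simulated": simulated,
        })

    def locate(self, original):
        """Return the latest real (non-simulated) location, or None."""
        for entry in reversed(self.entries):
            if entry["original"] == original and not entry["simulated"]:
                return entry["current"]
        return None

    def backup_directories(self):
        """Directories that now hold migrated files and need backup."""
        return sorted({os.path.dirname(e["current"])
                       for e in self.entries if not e["simulated"]})


def apply_moves(moves, log, dry_run=True):
    """moves: iterable of (source_path, destination_path) pairs.

    With dry_run=True nothing is moved; the actions are only logged as
    simulated so the policy's effect can be reviewed first.
    """
    for src, dst in moves:
        if not dry_run:
            parent = os.path.dirname(dst)
            if parent:
                os.makedirs(parent, exist_ok=True)
            shutil.move(src, dst)
        log.record(src, dst, simulated=dry_run)
```

Defaulting to a dry run is a deliberate choice in this sketch: a policy has to be explicitly switched to live mode before it touches any files.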
The hardest challenge in deploying ILM is identifying the value of data in a way that aligns with the business. Tools are required to automate the process of classifying and prioritizing data, whether it is at the file system, data management, or application level.
Administrators also need automated tools to move data between storage tiers and place data on the most appropriate storage resources according to its value and life cycle, while maintaining transparent end-user data access. In effect, what IT managers want is an integrated data management system that is data value-aware, works seamlessly with existing backup-and-recovery processes, provides policy automation, and provides continuous feedback to administrators for assessing status and maintaining an audit trail.
Dr. Claudia Chandra is director of product management at Arkivio (www.arkivio.com) in Mountain View, CA.