Automating information lifecycle management (ILM)

Areas to focus on include data classification, auditing and tracking, decision-making, and moving and copying.

By Tom Petrocelli

It is possible to implement ILM without any software or hardware at all. ILM is, first and foremost, a process. It is a very difficult process to implement without tools, however.

Tracking information and complying with ILM policies sound good until an organization realizes how tough it is to monitor state changes. The classification process can be nearly impossible without software tools to categorize existing information. So what if the policies say that certain actions are supposed to happen when a state change occurs? If the change can’t be detected and tracked, it might as well have not happened. If there is no way to audit changes in information, how can the organization know whether it is complying with its ILM policies? ILM is too complex without tools to assist in classifying, auditing, and moving information assets.

No particular storage architecture is required for ILM. Instead, the ILM automation tools need to fit the overall storage architecture. If the predominant storage devices are file servers, automation needs to be designed for a file-based environment.

The same is true for networked storage in a SAN or NAS environment. ILM automation is done with software. The hardware matters only in that it supports the needed software.

The areas in which automation can help are:

Classification: Determining class from content and metadata, especially for existing, unstructured information;

Auditing and tracking: Ensuring compliance with policies by tracking state changes and saving them as history;

Decision-making: Automating decisions such as whether a state change has occurred and what actions to take based on that change;

Access control: Ensuring compliance with policies by limiting the ability to change or view information; and

Moving and copying: Shifting or duplicating information to different information paths based on policy.

ILM automation is still in its infancy, as is ILM. Some areas are well-addressed by various software products, while others are just emerging. Purely storage-oriented technologies, such as information movers, content-addressed storage (CAS), and access control systems, are more developed, often because they were adapted from existing products. Other technologies, such as classification tools and ILM auditing, are still very early in their product lifecycles (see figure).

Policy engines

What makes a software system an ILM tool is a policy engine. As is the case with data lifecycle management (DLM), a policy engine is needed to drive the ILM automation process. Considering the fragmented nature of ILM technology today, a complete automated ILM solution will need several policy engines. Unfortunately for the IT manager, this means having to manage duplicate ILM policies in different software systems.

The policy engine makes the software behave in accordance with the ILM processes that the ILM policies call for. Several data movers are available, for example, but to be considered an ILM mover, a data mover must have an ILM policy engine directing its moves and updating the state history.
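
The relationship between policies, state changes, actions, and state history can be illustrated with a minimal Python sketch. It is not taken from any product; the class names, the state names ("active", "inactive"), and the tier labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple


@dataclass
class InfoObject:
    """An information asset tracked by the (hypothetical) policy engine."""
    name: str
    state: str = "active"
    location: str = "tier1"
    # Each history entry is (timestamp, old_state, new_state).
    history: List[Tuple[str, str, str]] = field(default_factory=list)


# An action receives the object and may change it (for example, move it).
Action = Callable[[InfoObject], None]


class PolicyEngine:
    """Maps (old_state, new_state) transitions to actions and records history."""

    def __init__(self) -> None:
        self._rules: Dict[Tuple[str, str], List[Action]] = {}

    def add_rule(self, old_state: str, new_state: str, action: Action) -> None:
        self._rules.setdefault((old_state, new_state), []).append(action)

    def change_state(self, obj: InfoObject, new_state: str) -> None:
        old_state = obj.state
        if old_state == new_state:
            return  # no state change, so nothing to do or record
        obj.state = new_state
        # Record the transition so it can be audited later.
        stamp = datetime.now(timezone.utc).isoformat()
        obj.history.append((stamp, old_state, new_state))
        # Run every action the policies attach to this transition.
        for action in self._rules.get((old_state, new_state), []):
            action(obj)


def move_to_archive(obj: InfoObject) -> None:
    """Illustrative action: a real mover would copy data between tiers."""
    obj.location = "archive"
```

With a rule registered through add_rule("active", "inactive", move_to_archive), calling change_state on an active object both triggers the move and appends an entry to the object's history, which is the distinction drawn above between a plain data mover and an ILM mover.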

Many vendors have created ILM automation software by grafting policy engines onto their existing software. E-mail archiving products, document management, and records management tools have been converted to ILM automation tools by adding policy engines.

Search and classification engines

Search tools are now being adapted for desktop and enterprise storage systems. The engine scans a storage unit and catalogs the files on it, as well as gathering metadata. This database is then used to find information based on content and metadata. Although not an ILM tool per se, this type of software can be adapted to develop a classification engine. A classification engine scans information in a system, applies rules, and assigns a class to it. When ILM rules are applied to the metadata, classes of data can be derived from the existing base of metadata. Many commercial search engine products have APIs that expose the gathered metadata and could be developed into a classification engine.

Rudimentary classification engines are also part of other ILM software. Policy engines often include some form of classification scanning to accommodate existing information.
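
As a rough illustration of applying rules to gathered metadata, the following Python sketch assigns a class to an item from a metadata dictionary. The metadata field names (keywords, extension, days_since_access), the class names, and the rules themselves are hypothetical; a real engine would read metadata from whatever catalog or API the search product exposes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Metadata as a catalog might expose it; the field names are assumptions.
Metadata = Dict[str, Any]


@dataclass
class Rule:
    info_class: str
    matches: Callable[[Metadata], bool]


def classify(metadata: Metadata, rules: List[Rule],
             default: str = "unclassified") -> str:
    """Return the class of the first matching rule; rule order sets priority."""
    for rule in rules:
        if rule.matches(metadata):
            return rule.info_class
    return default


# Hypothetical rules, listed from highest to lowest priority.
RULES = [
    Rule("financial-record",
         lambda m: "invoice" in m.get("keywords", [])),
    Rule("email-archive",
         lambda m: str(m.get("extension", "")).lower() in {".pst", ".eml"}),
    Rule("stale",
         lambda m: int(m.get("days_since_access", 0)) > 365),
]
```

Keeping the rules as data rather than as branches in the scanning code means a policy change only touches the rule list, not the scanner.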

A large part of the ILM process is designed to ensure the information in an organization is what it is supposed to be and where it is expected to be. In structured data systems such as databases, auditing the information can be accomplished through the use of transaction logs. File systems, on the other hand, often do not track changes in information. Even in cases in which metadata tracks changes, such as whether an object has been accessed, file systems do not monitor changes in content. Opening and closing a word processor file will change metadata fields, but if nothing was done to the file’s content, no real change has occurred in the information. Software that tracks changes in information, including content, and produces audit reports based on ILM policies is an emerging technology.
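
One way to separate a real content change from a metadata-only change is to compare content hashes between audit runs. The Python sketch below assumes a simple in-memory baseline of SHA-256 digests keyed by path; the function names and status labels are illustrative, not drawn from any auditing product.

```python
import hashlib
import os
from typing import Dict, List, Tuple


def content_digest(path: str) -> str:
    """SHA-256 of the file's bytes; unaffected by timestamps or other metadata."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(paths: List[str], baseline: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare each file with its recorded digest and update the baseline.

    Returns (path, status) pairs, where status is "new", "changed",
    "unchanged", or "missing".
    """
    report = []
    for path in paths:
        if not os.path.exists(path):
            report.append((path, "missing"))
            continue
        digest = content_digest(path)
        previous = baseline.get(path)
        if previous is None:
            status = "new"
        elif previous == digest:
            status = "unchanged"
        else:
            status = "changed"
        baseline[path] = digest
        report.append((path, status))
    return report
```

A file whose access or modification time changed but whose bytes did not is reported as "unchanged", which matches the word processor example above.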

Content-addressed storage

CAS, also called content-aware storage, is a specialized storage platform that locates and manages information based on its content.

As information is stored on a CAS array, a hash of the content is created and stored along with the information object. The array then prevents changes to the information’s content. Some systems will automatically version information if the content changes.
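
The idea can be sketched with a toy in-memory store in Python. This is only an illustration under simplifying assumptions: real CAS arrays differ in hash choice, metadata, and enforcement, and the class and method names here are hypothetical.

```python
import hashlib
from typing import Dict


class ContentAddressedStore:
    """Toy in-memory CAS: an object's address is the hash of its content."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        address = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same address, so it is stored once.
        self._objects.setdefault(address, content)
        return address

    def get(self, address: str) -> bytes:
        content = self._objects[address]
        # Re-hash on read so a silent change to the stored bytes is caught.
        if hashlib.sha256(content).hexdigest() != address:
            raise ValueError("stored content no longer matches its address")
        return content
```

Because the address is derived from the content, storing an edited copy yields a new address while the original remains retrievable at its old one. That behavior is a simple form of the versioning mentioned above.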

The advantage of CAS systems is that they prevent undetected changes to content. CAS is mostly used for fixed content, which is content that is not supposed to change. ILM policies stating that information can never be changed until it is destroyed, or insisting on versioning changes to information, benefit from storage on CAS arrays. Digitized images such as checks and X-ray images are also popular targets for CAS usage. Rapidly changing information (e.g., databases) is not a good candidate for CAS systems. CAS assumes that the content will not change and wards off any changes.

CAS also fits in well with ILM because the information is accessed based on content, not on filenames or other artificial constructs.

Because ILM information paths are not dependent on any file-naming scheme, the CAS namespace ensures a unique path for all information.

Information movers

One of the most common actions of an ILM policy is to move information. As the value of information declines, less-expensive resources are used to house and protect the information. Something needs to move it there. Many system administrators accomplish this through simple scripts.

Unfortunately, scripts tend to be static and must be rewritten if policies change. Software that does this automatically, when ILM state changes dictate an action, ensures that moves happen when they should.
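
The difference between a static script and a policy-driven mover can be shown with a short Python sketch in which the policy is data (a mapping from ILM state to destination directory) rather than logic baked into the script. The state names and directory layout are assumptions for illustration only.

```python
import os
import shutil
from typing import Dict, List, Tuple


def plan_moves(states: Dict[str, str],
               policy: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source, destination) pairs for files whose state has a target tier.

    states maps file path -> current ILM state.
    policy maps ILM state -> destination directory.
    """
    moves = []
    for path, state in sorted(states.items()):
        target_dir = policy.get(state)
        if target_dir is None:
            continue  # the policy prescribes no move for this state
        if os.path.dirname(os.path.abspath(path)) == os.path.abspath(target_dir):
            continue  # already on the right tier
        moves.append((path, os.path.join(target_dir, os.path.basename(path))))
    return moves


def apply_moves(moves: List[Tuple[str, str]]) -> None:
    """Carry out the planned moves, creating destination directories as needed."""
    for source, destination in moves:
        os.makedirs(os.path.dirname(destination), exist_ok=True)
        shutil.move(source, destination)
```

When a policy changes, only the policy mapping needs to be edited; the mover code stays the same. A fuller mover would also record each move in the state history.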

Some information movers are DLM systems with an ILM policy engine embedded in them.

This article is excerpted with permission from Data Protection and Information Lifecycle Management, by Tom Petrocelli (Prentice Hall PTR, ISBN: 0131927574; Copyright 2006). Previous excerpts from this book appeared in the January 2006 (p. 38) and February 2006 (p. 33) issues of InfoStor.

Tom Petrocelli is president of Technology Alignment Partners (www.techalignment.com) in Williamsville, NY.

Controlling costs with ILM

One goal of ILM is to control the cost of protecting information while providing maximum protection. Some factors associated with the cost of protecting information are:

Performance: Faster access costs more;

Availability: It costs more money to ensure high levels of availability;

Scope: Scope denotes how much information must be protected; and

Duplication: Money is wasted when duplicate information is protected.

ILM policies can be used to limit all these costs. By knowing the value of the information, an organization can adjust the performance and availability of the storage systems used to provide protection. Moving information from a Fibre Channel SAN with Fibre Channel arrays to an iSCSI SAN with SATA arrays may cost less. The performance and availability of the iSCSI SAN and SATA systems are lower, but they are sufficient for less-valuable data. Instead of buying more-expensive infrastructure, the organization uses less-costly infrastructure. With ILM, the IT organization makes decisions on what is important information and what is not. Certain classes of information will be deemed unworthy of any protection at all or of only the most rudimentary protection. Solid ILM policies will allow organizations to narrow the scope of what is and isn’t to be protected. With less to be protected, not as much money is allocated to new protection and storage resources. This helps control infrastructure costs.

Duplicate data is not the same as duplicate information. The same information may exist in many forms throughout an enterprise. Different data can then hold the same information. ILM policies help decide when information is a duplicate of existing information and which form should be protected. Again, this helps limit the scope of what is and isn’t protected.

Most data-protection strategies have a one-size-fits-all philosophy. Data lifecycle management (DLM) begins to break out of that paradigm by imposing an age-based model for data protection. ILM takes this a step further and looks at the information for guidance. ILM policies act like medical triage. They determine which information needs what resources and how soon. The organization can then focus its resources on the most valuable information.

This article was originally published on July 01, 2006