Part two of a three-part series examines ILM issues such as data location, information paths and perimeters, data lifecycles, and the changing value of information.
By Tom Petrocelli
In Part 1 of this series we introduced information lifecycle management (ILM), unstructured and structured information, and the importance of context (see InfoStor, January 2006, p. 38).
Many ILM decisions will be based on the location of the underlying data. Location helps determine the integrity of information and is essential for managing multiple copies of the same information across an enterprise. It also tells where the data that makes up the information resides, which helps determine whether information is where it is expected to be and whether there is more than one copy of the data.
In one sense, location is a concrete element of information. Classification and state can be subjective; location is physical and tangible. The problem is that file systems and structured data stores have different ways of expressing location. The way Unix describes where data is differs from the way Windows describes it. Data stored on a network introduces still more ways to depict where data is, which can make location statements in ILM policies very difficult to write.
This article is excerpted with permission from Data Protection and Information Lifecycle Management, by Tom Petrocelli (Prentice Hall PTR, ISBN: 0131927574; Copyright 2006).
Instead, it is more useful to use a virtual location that can be mapped to a real location. Called the information path, it is a way of describing where information is without subscribing to specific operating system nomenclature. The information path should include at least the following:
- Network path
- Host name
- File system or application name
- Local object name
- Component names (if needed)
The network path should be a virtual path, not a physical address. When combined with the host name, a general data storage location can then be given as a virtual address. The file system name or application name is needed to accommodate structured and unstructured information. An application in this case is likely to be a structured data storage application, such as the name of a database. The local object name provides the unique identifier for the information, and the component names provide an additional level of identification if the ILM policy calls for it.
Adding a version identifier allows information paths to point to different versions of the same data. Apart from the version identifier, the information path can then be the same for multiple versions of the same information. Paths that differ only in version point to different data, but the information is the same.
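As a rough illustration, the elements of an information path can be held in a small record. The following is a minimal Python sketch, not a standard format: the class name, the field names, the "/" and "@" notation, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class InformationPath:
    """Virtual location of a piece of information (names are illustrative)."""
    network_path: str                 # virtual network path, not a physical address
    host_name: str
    store_name: str                   # file system or application (e.g., database) name
    object_name: str                  # local object name, unique within the store
    components: Tuple[str, ...] = ()  # optional component names
    version: Optional[str] = None     # optional version identifier

    def render(self) -> str:
        """Render the path in a made-up, OS-neutral notation."""
        parts = [self.network_path, self.host_name, self.store_name, self.object_name]
        parts.extend(self.components)
        text = "/".join(parts)
        return f"{text}@{self.version}" if self.version else text


def same_location(a: InformationPath, b: InformationPath) -> bool:
    """True when two paths differ at most in their version identifier."""
    key_a = (a.network_path, a.host_name, a.store_name, a.object_name, a.components)
    key_b = (b.network_path, b.host_name, b.store_name, b.object_name, b.components)
    return key_a == key_b


# Two versions of the same (hypothetical) order record.
v1 = InformationPath("corp-net", "mailhost", "orders_db", "order-1001", version="1")
v2 = InformationPath("corp-net", "mailhost", "orders_db", "order-1001", version="2")
print(v1.render())            # corp-net/mailhost/orders_db/order-1001@1
print(same_location(v1, v2))  # True
```

Mapping such a virtual path to a real Unix, Windows, or network location would be the job of a separate lookup layer, which is left out here.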
The same information could have multiple information paths. If that sounds like being in more than one place at the same time, that’s because it is. Multiple copies of the information are still the same information. This is critical to maintaining the integrity of information. For ILM policies to be carried out correctly, all copies of information must have the same rules applied to them at the same time. A copy is the same information, only in an additional location.
When information is in certain locations (on a laptop or home computer, for example), there is no way to verify whether its state has changed. It is beyond the control of systems and monitoring. Consequently, state changes to the information cannot be tracked. The boundary between where an ILM policy can expect to have control and where it cannot is called the information perimeter (see figure). Information stored beyond the information perimeter cannot be verified as to state, context, or even existence.
The information perimeter defines
- Where data is and whether that’s where it is expected to be; and
- Where it has gone to and where copies might be.
ILM policies must address what happens when information crosses the information perimeter. Specifically, there need to be procedures for deciding whether information that is outside the information perimeter is considered to be valid.
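One simple way to picture the information perimeter is as a list of locations that systems can actually monitor. The sketch below assumes the perimeter is a set of (network path, host name) pairs; the set contents and function names are hypothetical, and a real policy would need a richer definition.

```python
from typing import Iterable, List, Set, Tuple

Location = Tuple[str, str]  # (virtual network path, host name)

# Hypothetical perimeter: locations that monitoring systems can reach.
MANAGED_LOCATIONS: Set[Location] = {
    ("corp-net", "mailhost"),
    ("corp-net", "fileserver"),
}


def inside_perimeter(location: Location,
                     managed: Set[Location] = MANAGED_LOCATIONS) -> bool:
    """True when state changes at this location can be verified."""
    return location in managed


def split_by_perimeter(locations: Iterable[Location]) -> Tuple[List[Location], List[Location]]:
    """Separate copies that can be verified from those that cannot."""
    inside: List[Location] = []
    outside: List[Location] = []
    for location in locations:
        (inside if inside_perimeter(location) else outside).append(location)
    return inside, outside


copies = [("corp-net", "mailhost"), ("home-net", "sales-laptop")]
verified, unverified = split_by_perimeter(copies)
print(verified)    # [('corp-net', 'mailhost')]
print(unverified)  # [('home-net', 'sales-laptop')]
```

What to do with the unverified copies (treat them as valid, invalid, or suspect) is the policy decision described above; the code only identifies them.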
The information lifecycle
Information has a lifecycle. It is created, then it changes, and finally is destroyed. ILM manages this lifecycle to optimize the use of resources, meet regulatory requirements, and ensure the integrity of the information. When a lifecycle has been developed for a class of information, it can be expressed as a series of policies.
There is no set information lifecycle. Some products will impose a particular lifecycle on an organization, but ILM does not dictate this. An information lifecycle is dependent on the needs of an organization and the nature of the information.
All information lifecycles can be derived from a general model (see figure). The model states that information is created, its state changes in some way, an action may occur due to that state change, and eventually it is destroyed.
Creation is the initial action and destruction the final action. ILM policy must define which state changes trigger actions and what those actions will be. Some changes that may trigger an action are
- Aging: The difference between the current state’s timeframe and a previous one has exceeded a threshold;
- Copied: There is a new, additional information path associated with the information;
- Moved: An information path has changed;
- Transformed: The information has been changed from one class to another;
- Relationship: A relationship with another piece of information has been changed, added, or removed; and
- Content: Any alteration in the content of the information should trigger an event, even a null event. Comparing the current hash with the previous one shows whether content has changed.
Changes in metadata or content represent a change in state. This, in turn, may trigger actions under ILM policies. This continues with changes in state and new actions until the last action possible is taken: destruction.
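The state changes listed above can be detected by comparing two snapshots of the same information. The Python sketch below is one possible reading under stated assumptions: the Snapshot fields, the rule that a move removes one path and adds another, the choice of SHA-256 as the hash, and the 30-day threshold are all illustrative rather than prescribed by ILM.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet, Set


@dataclass(frozen=True)
class Snapshot:
    """State of one piece of information at a point in time (illustrative)."""
    taken_at: datetime
    info_class: str
    paths: FrozenSet[str]
    relationships: FrozenSet[str]
    content_hash: str


def hash_content(content: bytes) -> str:
    """Hash the content so two states can be compared cheaply."""
    return hashlib.sha256(content).hexdigest()


def detect_changes(prev: Snapshot, curr: Snapshot,
                   aging_threshold: timedelta) -> Set[str]:
    """Return the names of the state changes between two snapshots."""
    changes: Set[str] = set()
    if curr.taken_at - prev.taken_at > aging_threshold:
        changes.add("aging")
    added = curr.paths - prev.paths
    removed = prev.paths - curr.paths
    if added and not removed:
        changes.add("copied")   # a new, additional information path
    if added and removed:
        changes.add("moved")    # assumption: a path was replaced by another
    if curr.info_class != prev.info_class:
        changes.add("transformed")
    if curr.relationships != prev.relationships:
        changes.add("relationship")
    if curr.content_hash != prev.content_hash:
        changes.add("content")
    return changes


body = hash_content(b"Order 1001: 12 widgets")
before = Snapshot(datetime(2006, 1, 1), "order", frozenset({"mail/inbox/1001"}),
                  frozenset(), body)
after = Snapshot(datetime(2006, 3, 1), "order",
                 frozenset({"mail/inbox/1001", "docs/orders/1001"}),
                 frozenset(), body)
print(sorted(detect_changes(before, after, timedelta(days=30))))  # ['aging', 'copied']
```

An ILM policy would then map each detected change to an action, or to no action at all.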
Life and death of information
What if Widget Corp., a maker of high-quality widgets, is no longer happy with the results of its data lifecycle management (DLM) e-mail policy? Too often, e-mails that should be retained aren’t, and others that were supposed to be destroyed haven’t been. Now the company has angry customers and upset lawyers. The costs of storage and e-mail management continue to rise, though at a slower rate.
The problem is not that Widget can’t manage e-mails in general. What it cannot control, with the DLM policies in place, is information that doesn’t fit the rules the company has set up. Widget has discovered, for example, that many employees in sales copy e-mails into documents not covered by the e-mail policy.
On the other hand, many e-mails are destroyed, but not the original documents attached to the e-mails. The company also realizes that many customer e-mails really aren’t important and shouldn’t be protected. To lower costs and better protect the company, attention must turn to what the e-mails mean.
Widget turned to ILM to solve some of these problems. The object “e-mail” is too coarse for Widget’s purposes. Instead, e-mails and other documents must be classified, a lifecycle determined, and policies written. IT and customer service have decided that only three categories will be needed initially: orders, proposals, and other. Classification is based on content, especially specific clues inside the e-mail text. Orders can be identified by the order number in the e-mail, for example. Other metadata items that IT and customer service feel are important to the ILM process are
- Location: Information paths will help identify copies;
- Type: Object types will be tracked to look for transformations from e-mail to documents;
- Relationships: This is especially important for tracking the source of attachments; and
- State: Being able to compare changes in content and metadata at different points in time will allow for more directed actions. The company can also guard against changes in order e-mails after they have arrived.
With this in hand, Widget will be able to apply different levels of protection to different types of e-mails. Rules can be applied to attachments and their source documents (and vice versa). A history of changes in state will show when content and other metadata have changed. Finally, when it is time to make decisions about destroying e-mail, all copies and references to the e-mail can be considered in the decision-making process.
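Content-based classification of the kind Widget has in mind can start out very simply. The sketch below assumes that order numbers appear in the text as "order #12345" or "order number: 12345" and that proposals mention the word "proposal" or "quote"; both patterns are hypothetical stand-ins for whatever clues Widget's real e-mails contain.

```python
import re

# Assumed formats; Widget's actual order numbers and wording may differ.
ORDER_NUMBER = re.compile(r"\border\s*(?:no\.?|number|#)\s*:?\s*(\d{4,})", re.IGNORECASE)
PROPOSAL_HINT = re.compile(r"\b(?:proposal|quote|quotation)\b", re.IGNORECASE)


def classify(text: str) -> str:
    """Assign an e-mail body to one of three classes: order, proposal, other."""
    if ORDER_NUMBER.search(text):
        return "order"
    if PROPOSAL_HINT.search(text):
        return "proposal"
    return "other"


print(classify("Please confirm order #48213 by Friday."))   # order
print(classify("Attached is our proposal for next year."))  # proposal
print(classify("Thanks for the great service!"))            # other
```

Because the class is derived from content rather than from the object type, the same function could be run on a document into which an e-mail has been copied.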
ILM is more involved than DLM. With different metadata, actions, and different types of classes, it helps to develop a set schema to use in policy making. Schemas and data dictionaries are popular in vertical applications for managing the semantics and operations of information. There are, for example, a multitude of XML schemas designed for protocol communications and data stores, many designed for specific industries. However, as is true of all elements in ILM, there is no set schema for all circumstances.
Placing a value on information is difficult. Organizations value information in their own ways. Some organizations may place a high value on certain types of information, such as customer contact information or orders. Others may find practically no value in the same type of information.
A report by the US Department of Transportation notes that decision-makers value information based on the ability of information to reduce costs, save time, improve decision-making, and improve customer satisfaction. These dimensions make sense, but can still be hard to quantify.
A general way to look at the value of information is to consider
- Replacement value of the information;
- Cost to create the information;
- Opportunity cost; and
- Regulatory failure costs.
Certain information is necessary if the organization is to operate properly. The costs associated with disruptions caused by loss of the information can be calculated directly. How much more would it cost to process returns, for example, if the customer history information were missing?
Even if the value of information cannot be determined directly, the cost to replace it can be. If the order database were destroyed, what would it cost to have all the orders entered by hand from paper records? By the same token, the cost to create information in the first place also places a value on it. A certain amount of the scientific grant money was absorbed by the cost of gathering cases. How much was that? What were the budget dollars associated with inside sales that can be attributed to order taking?
There are also measurable effects of lost opportunities. The value of the information associated with an order, for example, can be considered the value of the order. Finally, costs associated with failing to comply with regulations are straightforward. The amount of money that might be spent on lawyers, fines, and judgments can be determined by laws and case history.
ILM uses the value of information to trigger decisions regarding the disposition of information. As value changes, actions may be taken on the data, such as moving the information to less-expensive storage. Gross measures of value are useful to ILM in this regard. Even coarse value levels (e.g., important, useful, garbage) will work in some ILM policies; in others, a dollar amount will be necessary.
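The four cost dimensions and the coarse value levels can be combined in many ways. The sketch below makes two loud assumptions: it takes the largest of the four figures as the gross value (to avoid double counting, for instance, replacement and creation costs), and it uses arbitrary dollar thresholds to separate important, useful, and garbage. Neither choice comes from ILM itself, and the sample figures are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueEstimate:
    """Dollar estimates for one piece of information (all figures hypothetical)."""
    replacement_value: float = 0.0
    creation_cost: float = 0.0
    opportunity_cost: float = 0.0
    regulatory_failure_cost: float = 0.0

    def gross_value(self) -> float:
        """Assumption: the largest single figure stands in for overall value."""
        return max(self.replacement_value, self.creation_cost,
                   self.opportunity_cost, self.regulatory_failure_cost)


def value_level(amount: float, important_at: float = 10_000.0,
                useful_at: float = 100.0) -> str:
    """Map a dollar amount to a coarse level; thresholds are illustrative."""
    if amount >= important_at:
        return "important"
    if amount >= useful_at:
        return "useful"
    return "garbage"


order_email = ValueEstimate(replacement_value=50.0, opportunity_cost=25_000.0)
thank_you_note = ValueEstimate(creation_cost=5.0)
print(value_level(order_email.gross_value()))     # important
print(value_level(thank_you_note.gross_value()))  # garbage
```

An organization that needs dollar amounts in its policies would use the gross value directly and skip the coarse levels.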
The changing value of information
Information becomes more valuable or less valuable over time. Events can alter the value of information as well. Generally, as information gets older, it becomes less applicable to the current situation and less valuable unless it is updated. Passage of time is a change of state, because time is a factor in state. Data protection and storage resources need to change as the value of the information changes over time (see figure).
Take the case of the customer service e-mail. When an e-mail that contains an order is first received, it is very valuable. It represents potential revenue that must be protected. Loss of that e-mail means a loss of revenue. A second e-mail, which contains a note of thanks, may have some value from a quality assurance or marketing perspective. It is not, however, nearly as valuable as the order e-mail. As time goes on and the order is processed, it is still quite important and valuable. Loss of this information may mean that it will be impossible to complete the order in a timely fashion. When the order is fulfilled, however, the value of the e-mail begins to drop. It may be needed to answer questions from the customer or provide historical reports, but the revenue is already realized.
After a while, if the customer hasn’t called to complain, it is unlikely that he or she will do so. The source e-mail has already had the important information extracted from it. The original e-mail has practically no value. As information, it has degraded to the point of being valueless.
Each change in the value of the information, from high to moderate to none, represents a change of state that could trigger actions. It would be reasonable to move the e-mail to less-expensive storage at each change in value. Eventually, when the e-mail has no value, it can be destroyed.
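The progression from high to moderate to no value can be expressed as a small rule table. In the Python sketch below, the storage tier names and the function names are assumptions; the only behavior taken from the text is that a change in value triggers a move to different storage and that information with no value is destroyed.

```python
from typing import List, Optional

# Hypothetical storage tiers for each value level that is still worth keeping.
STORAGE_FOR_VALUE = {"high": "primary", "moderate": "low-cost"}


def action_for_change(previous: str, current: str) -> Optional[str]:
    """Return the action triggered by a change in value, if any."""
    if previous == current:
        return None                      # no change of state, no action
    if current == "none":
        return "destroy"
    return f"move to {STORAGE_FOR_VALUE[current]} storage"


def plan_actions(value_history: List[str]) -> List[str]:
    """Walk a history of value levels and collect the triggered actions."""
    actions: List[str] = []
    for previous, current in zip(value_history, value_history[1:]):
        action = action_for_change(previous, current)
        if action is not None:
            actions.append(action)
    return actions


# Order e-mail: received, processed, fulfilled, then no longer needed.
history = ["high", "high", "moderate", "none"]
print(plan_actions(history))  # ['move to low-cost storage', 'destroy']
```

A real policy would also check retention rules and all known copies before destroying anything.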
Tom Petrocelli is president of Technology Alignment Partners (www.techalignment.com) in Williamsville, NY.