Part one in our three-part series on ILM focuses on information assurance, data protection, and the importance of context.
By Tom Petrocelli
Data protection is part of a larger strategy: information assurance. Information assurance is the process by which an organization ensures, protects, and verifies the integrity of vital information. Without information assurance, it is impossible to know that critical information is what it is assumed to be or where it is supposed to be. Organizations have begun to understand that the information they store as data needs to be managed for information assurance to be realized.
Traditional data-protection strategies, such as backup and restore and replication, deal with data, not information. To provide better information assurance, a new type of process is necessary: information lifecycle management (ILM). Without ILM, information is questionable and cannot be used without considerable risk.
What is ILM?
ILM is first and foremost a strategic process for dealing with information assets. Typically, ILM is expressed as a strategy, which is then used to generate policies. Finally, a set of rules is created and used by the organization or software to comply with the policies. ILM processes take into account what the information is, where it is located, what relationships it has to other information, and the lifecycle of the information.
Initially, ILM appears to be a lot like data lifecycle management (DLM), which is also a policy-based process: it has rules and takes a lifecycle into account. The difference is that ILM operates on information, not data.
Data is raw and lacks structure that is externally visible. Information, on the other hand, is capable of external validation, even if it requires a human being to do it. Whereas data is completely dependent on applications for meaning, information is independent of applications.
Information is a collection of data within a certain context. When someone receives an e-mail, prints it out and reads it, or imports it into another program, it is still an e-mail. The blocks of data that comprise the e-mail are data. The data becomes an e-mail when the reader (human or computer) recognizes that there are “from” and “to” lines and a message body.
Information has value
How much is a block of data worth? That’s hard to say unless you know what the data is meant to represent. The value of information is easier to understand because its meaning is known. An order from a customer has a value that can be determined from real costs and loss of revenue. A CFO’s presentation to the financial community has a value that can be determined by changes in the stock price of the company. Assigning value to information is based on what is valuable to the organization.
As is the case with DLM, ILM is a strategic process. It is not about technology or products, although these can be used as tools for automating ILM rules. Unfortunately, there is some confusion about how products and existing processes fit into an ILM strategy.
Technologies and processes that are often confused with ILM include
- Data storage or storage management: Although storage is often part of the ILM picture, it is not a complete ILM solution. Storage may be considered part of the ILM policies, but it is secondary to the process;
- Content-addressed storage (CAS): CAS is a very useful tool for ensuring ILM policy compliance. However, it is not the ILM process or policy in and of itself; and
- Document management and records management: Document management and records management are considered by some people to be subsets of ILM and can be useful parts of the ILM strategy. However, not all information is in document or record form.
Why bother with ILM?
There are some clear reasons why organizations are moving toward ILM. Many are first attracted to ILM because of regulatory compliance.
There are, however, many other benefits, including the following:
- Enables information assurance: ILM helps organizations verify that data is what, and where, they think it is;
- More efficient use of resources: ILM allows a finer level of resource allocation than DLM does. Less-important information can be given less-expensive system resources;
- Data protection in line with information’s value: As with DLM, data protection can be applied to the data that comprises the information. ILM allows decisions about data protection to be made based on the value of the information over its lifetime;
- Better security: By using ILM, organizations can better track where information is located, which eliminates duplicate, lost, or misplaced information. Good ILM policies should also help organizations determine when unintentional modification to information occurs, and let the organization know when people are looking at, copying, or changing data, with or without authorization;
- Allows organizations to handle large amounts of ever-changing information: ILM policies help organizations avoid drowning in useless information by helping the organization focus on the most important information first; and
- Enhances privacy: By tracking the copying, destruction, and accessibility of information in an organization, ILM diminishes the likelihood of a privacy breach.
Unstructured and structured information
Operating systems and file systems know only about blocks and files. They cannot tell what is in those files. They may surmise that a file is a word processor document by its extension or MIME type but cannot be certain; extensions and MIME types can be changed. Even within applications, information can be recognized only in a generic fashion, as a document or spreadsheet. Applications do not know whether a file is an important document, a financial report, or a letter to a friend. Files and similar constructs are considered unstructured: operating systems, file systems, and applications have no external means of understanding the meaning of the data.
The most difficult type of unstructured data to deal with is images. Photographs, X-rays, scanned documents, and other images do not have any internal clues to help determine what the information is. All validation of the object is external and provided by an outside source. This makes managing images through traditional means, such as keyword searching, largely ineffective.
Databases, XML files, and other structured systems are different. They arrange data into information by using a schema. A schema is a description of the data that provides context. By applying the schema, order is imposed on the data, and it becomes information. Anyone looking at the schema and applying it to the data will understand what the data represents.
The advantage of a structured system, as far as ILM is concerned, is that context is already provided. Description of the information is not needed, because the schema provides the necessary context. Unfortunately, ILM policies rarely have the luxury of dealing only with structured data.
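The role of a schema can be sketched in a few lines of code. The example below is illustrative only: the column names, types, and values are hypothetical, but it shows how the same raw values are opaque data until a schema imposes names and types on them.

```python
# A raw row of values is data: meaningless without external help.
raw_row = ("1042", "2005-11-14", "318.50")

# A schema describes the data and provides context (hypothetical example).
schema = [
    ("order_id", int),
    ("order_date", str),
    ("amount_usd", float),
]

# Applying the schema imposes order on the data; it becomes information
# that any reader of the schema can understand.
record = {name: cast(value) for (name, cast), value in zip(schema, raw_row)}
print(record)  # {'order_id': 1042, 'order_date': '2005-11-14', 'amount_usd': 318.5}
```

Anyone (or any program) holding the schema can now interpret the values the same way, which is exactly the property that makes structured data easy for ILM.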
The importance of context
Data, by itself, is not very useful. Look at a single number on a page and it tells you very little. Add other numbers and symbols to form an equation, and now there is some meaning, if you know how to read the formula. Combine the formula with text that explains the formula and now there is useful information. The text and the symbols provide meaning to the numbers.
Information differs from data in that it has context. Context is other data that imparts meaning and structure to the data. For data to be useful, for it to be information, there needs to be context around it. Context acts as a catalyst, converting raw data to useful information (see figure).
Different types of context
There are several forms of context that can be applied to data. Explicit context is context that is stated directly. It exists when data has a pre-determined and externally readable structure. Databases have explicit context. Their schemas are an inherent part of the overall data set, and any application that can read the database tables can also understand the meaning of the data.
Implicit context is the context that is implied by attributes of the data. A file with a .doc extension implies a document. If that file is in a directory or folder titled “Marketing Plans,” it is implied that the file is part of the organization’s marketing planning. Clues in the document also hint at the context of the data. Specific formatting, such as letter format, titles, and formulas, provide evidence as to the meaning of the content.
Finally, there is rules-based context. Rules are a way of making implicit context explicit. Rules-based context imposes context on data based on a set of external rules. The following rule illustrates how it is possible to express rules-based context:
If the file has an extension of .doc, is in the “Financials” folder, and is dated after the first of the year, it is a year-to-date financial report.
No matter what is actually in the files, no matter what the internal structure of the file, it is now considered to be financial information in report form. Information with explicit context carries its context with it; implicit context is derived from attributes of the data itself. Information with rules-based context has structure imposed externally, without regard to the content of the data.
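The rule above translates directly into executable form. This is a minimal sketch, not a real ILM product: the folder name, extension, and cutoff date are taken from the example rule, and the function name is an assumption.

```python
from datetime import date
from pathlib import PurePosixPath

def classify(path: str, modified: date, year_start: date) -> str:
    """Impose rules-based context on a file, regardless of its content."""
    p = PurePosixPath(path)
    if (p.suffix == ".doc"                 # has a .doc extension
            and "Financials" in p.parts    # is in the "Financials" folder
            and modified >= year_start):   # dated after the first of the year
        return "year-to-date financial report"
    return "unclassified"

print(classify("/shares/Financials/q3.doc", date(2005, 11, 14), date(2005, 1, 1)))
# year-to-date financial report
```

Note that the function never opens the file: the class is imposed entirely from external attributes, which is both the strength and the weakness of rules-based context.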
There are pros and cons to each type of context. Explicit context is easier for software to deal with: because the context is embedded in the data structures, a computer can read it and know how to process the information. It does not, however, work well for all types of information. An order can easily be depicted in a database or as a structured object because it has pre-determined components, but a letter cannot, because it is free-form in nature. Parts of a letter can be given explicit context, such as the address block and signature line, but the body, the most important part of the letter, cannot. There is enough context to know that it is a letter but not enough to know what the letter is about.
Implicit context is difficult to impossible for software to understand. Although research in natural language processing continues, humans are by far the best tool for determining context from content. We can look at unformatted text and tell whether we are looking at a marketing plan or a letter to a friend. Computers cannot do that well.
Rules-based context strikes a middle ground between explicit and implicit context. Almost any type of information can be described by a series of rules. There will be mistakes, however: if the rules are too broad, some information will be categorized incorrectly.
Context is what ILM leverages to make better decisions than DLM. By providing a deeper understanding of what the data represents, context allows policies to be developed that better describe what to do with the data. Context converts the raw data to information, which enables ILM.
Characteristics of information
Information has several characteristics that are important to ILM. The most important characteristics are:
- Context: Context is additional data that provides meaning to the data;
- Relationships: Information often includes relationships with other information. Sometimes it is only a casual reference; at other times it is a strong, formal link, such as a hyperlink;
- Application independence: Data relies on applications for interpretation; information stands by itself. Different applications using the same data can interpret it in different ways, but information is interpreted the same way no matter what application is using it. A printed book and an e-book are still the same information; and
- Determinable value: The value of information can be determined, because it has meaning.
The lifecycle of information is based on context, is affected by the lifecycles of other information, is independent of the applications that use the information, and changes along with the value of the data.
Whereas DLM is a function of age, ILM is determined by context and value, of which age is a component.
Determining and managing information context
Although any number of attributes can provide context to data, the most important from an ILM perspective are
- Classification: the type of information an object represents;
- State: the content and metadata of the information at a specific point in time; and
- Content: the data itself, from which much of the context can be derived.
Other attributes may also be important, depending on the organization and its information management needs. In many cases, they will be components of the attributes listed here.
Anatomy of an e-mail
A good example to consider is an e-mail object. An e-mail has a number of constituent components that make it an e-mail. First, it can be classified as an e-mail. That may be because a person can recognize it as one or because it has a MIME type of message/rfc822. It has content that can be examined for e-mail formatting and relationships related to an e-mail, such as an attachment. The object may also reside in a directory that is only for e-mails and may have a file format specific to e-mail systems. Finally, there might be state information, such as the time the object was created, headers, or similar descriptors (see figure). The object is recognized as an e-mail because it has the context of an e-mail.
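Python's standard `email` module makes the e-mail example concrete. The message below is an invented sample; the sketch shows how the "from" and "to" lines and the body, once recognized, turn a block of text into an e-mail.

```python
from email import message_from_string

# A hypothetical RFC 822-style message stored as raw text (data).
raw = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 numbers
Date: Mon, 14 Nov 2005 09:00:00 -0500

Please see the attached report.
"""

msg = message_from_string(raw)

# The data becomes an e-mail when the reader recognizes the "from" and
# "to" lines and a message body.
is_email = msg["From"] is not None and msg["To"] is not None
print(is_email, msg["Subject"])  # True Q3 numbers
```

The headers and date here are the state information mentioned above; the MIME type, folder location, and file format would supply the rest of the e-mail's context.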
Classification is a quick form of identifying what information is. This is something that humans do quite well but machines do not.
For information lifecycle management, classification is the most important attribute and will drive most actions within an ILM policy. (For more information, see “ILM requires data classification,” InfoStor, November 2005, p. 36.)
Classes may be broad, such as financial, marketing, and personnel. They may also be very specific, such as First Quarter Financial Reports. If classes are too broad, actions will be limited to only those that can take place among many different types of objects. If classes are too specific, the organization will drown in policy documents.
Classifying structured data is easy: the classes are determined by the schema. Unstructured data, on the other hand, can be very difficult to classify. Humans can do this by looking at the data ("Yep, that’s our third-quarter financial report"), but computers are terrible at it.
To classify unstructured data, rules-based context is overlaid on the data and stored as metadata. Various attributes of the data are examined to provide a class for the information. The existence of an object in a particular directory or folder, along with keywords found in the content of the object, may be used by a rules-based system to determine its class. Another way to classify unstructured data is through human intervention. When information is created, the person creating the information, or a designated person, can choose a class for it. Even in this case, a set of rules on how to determine a piece of information’s class will be needed. Otherwise, classification will be inconsistent and useless.
State describes content and metadata (context) at a specific point in time. Changes in some component of the context indicate a change in state. ILM policies may demand that these changes in state trigger actions. The specific metadata that defines state in an ILM system is described by policies. Within ILM policies, state is the catalyst for actions: if a state change occurs, an action, prescribed by policies, must also occur.
Tracking state and history
Time is a necessary element of state, even if the timeframe is only now. It is possible to define only a current state, although it is more useful to define state in other timeframes as well. By tracking state over time, it is possible to accumulate a history of the information: the timeframe “now” defines a current snapshot, while other timeframes define history.
This is a powerful tool for managing information. By tracking state, it is possible to compare the current state against an expected state. Changes in state will help determine whether the information
- Has been copied, deleted, or moved;
- Has had a constituent component modified or whether a new version has been created;
- Has had related information changed;
- Has been transformed into another type of information; and
- Has aged past a defined point.
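A state comparison of this kind can be sketched in a few lines. This is an illustrative simplification, not an ILM product: the snapshot fields (a content digest, a location, and related items) are assumptions chosen to mirror the changes listed above.

```python
import hashlib

def snapshot(content: bytes, location: str, related: tuple) -> dict:
    """Record the state of a piece of information at a point in time."""
    return {
        "digest": hashlib.sha256(content).hexdigest(),  # detects modified content
        "location": location,                           # detects copies and moves
        "related": related,                             # detects changed relationships
    }

# The expected state, recorded earlier, versus the current state.
expected = snapshot(b"Q3 revenue: $4.2M", "/Financials/q3.doc", ("q2.doc",))
current  = snapshot(b"Q3 revenue: $4.7M", "/Financials/q3.doc", ("q2.doc",))

changed = [key for key in expected if expected[key] != current[key]]
print(changed)  # ['digest'] -> a constituent component has been modified
```

In a real system each detected change would be matched against policy to decide which action, if any, must follow.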
The important part of any information is its content: the words in a document, the numbers in a spreadsheet, the images in a file. In a computer system, content is stored as data. Much of the context of information can be derived from the content. By examining a document, clues can be found that help discern whether it is a letter to a friend or a technical manual. Humans are very efficient at performing this task, whereas computers are not.

Knowledge management systems have developed very sophisticated inference engines to do what we do naturally. Inference engines examine the content of a document to determine its meaning, usually for purposes of classification. Through the use of statistical analysis and rules-based systems, context can be derived from the document. These systems are rudimentary compared with human inference; they often mis-categorize information and need human editors to make corrections.
Search engines are similar to inference engines in that they scan content for clues as to its meaning. Unlike inference engines, search engines are more of a tool to help humans make content decisions. Often based on keywords, a search engine can provide a list of possible targets. The human then decides whether it meets the criteria for classification.
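The division of labor described here, where the machine proposes and the human decides, can be sketched simply. The documents and keywords below are invented for illustration; the point is that keyword matching yields candidates, not classifications.

```python
# A tiny corpus of unstructured content (hypothetical examples).
documents = {
    "budget.doc": "year-to-date revenue and expense figures",
    "letter.doc": "thanks for the wonderful visit last weekend",
}

# Keywords that suggest, but do not prove, financial content.
keywords = {"revenue", "expense", "profit"}

# The search engine's job ends here: a list of possible targets.
candidates = [name for name, text in documents.items()
              if keywords & set(text.split())]
print(candidates)  # ['budget.doc'] -- a human decides whether it qualifies
```

An inference engine would go one step further and assign a class itself, which is precisely where mis-categorization creeps in.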
For ILM purposes, humans can do the job of deriving context from content: a person can decide what the content means. Unfortunately, this is inefficient. It is not too difficult to ask end users to decide what newly created content means; it is a daunting task to have people go through existing information and determine context from content.
Tom Petrocelli is president of Technology Alignment Partners (www.techalignment.com) in Williamsville, NY.