Information classification enables better management

Intelligent information management (IIM) tools allow companies to index, search, and locate information more quickly and accurately.

By Brian Babineau

Although it never makes its way onto a balance sheet, information can be an asset to any organization that leverages it appropriately.

Currently, many organizations attempt to manage information without really understanding anything about it. Rarely do IT departments care about the details of the data, such as the application that created it or the last time certain files were accessed. In addition, IT rarely worries about the content or context of the information that is created. IT departments barely have enough time and resources to focus on maintaining the availability of primary systems as well as backing up data to protect it.

The combination of newly created data and the vast amounts already being stored is making it difficult for organizations to gain control of all the information for which they are responsible. IT departments cannot locate data quickly, information may be inaccessible to the people who need it and, worse, data may be available to the wrong people. The only alternative organizations have is to manage data more intelligently.

Intelligent information management (IIM), as defined by the Enterprise Strategy Group (ESG), encompasses the methodology, processes, and underlying technology that enable organizations to understand, organize, and take appropriate action against all data types.

To understand more about its information, an organization needs technology that can analyze and index new and historical data. The index abstracts data attributes and contents, helping organizations classify information into related groups and categories, against which policies can be applied. Information management is the process of enforcing these policies by taking action against the data, such as establishing a retention period and archiving e-mails or encrypting private files.

Organizations continuously buy hardware devices as applications create the need for more capacity. In addition to the primary data, IT departments create multiple copies to meet backup and disaster-recovery requirements. Multiple copies of data make it difficult to track what data resides where. Further, the format of the data may be altered by data-protection software as the data is copied. And, making matters worse, the data could be spread across multiple storage systems, including devices configured with disk, tape, and optical media, both on-site and off-site. As such, IT departments are forced to manage several copies of data in different formats on various systems.

Apart from storing and protecting data, organizations also store data to meet record-retention regulations, such as SEC Rule 17a-4 and HIPAA. In addition to the record-retention regulations, organizations are saving more data to improve corporate governance policies and processes. Revenue data, correspondence with customers and suppliers, and budget data may be kept to support financial and other business process audits. Accounting firms that audit publicly traded corporations need to store audit records for seven years in compliance with Sarbanes-Oxley.

Recently, organizations have begun to deploy digital archives to separate primary and backup data from unchanging business records. ESG defines digital archiving as “the long-term retention and management of historical digital assets that are no longer actively needed for current business operations, but have been purposefully retained to satisfy regulatory compliance, corporate governance, litigation support, records management, or data management requirements.”

Archiving capacity continues to increase as organizations continue to store information for longer periods of time.

As the figure indicates, organizations are just starting to archive databases, e-mails, and unstructured files.

Losing control of your data

As the capacity of primary storage, backup systems, and digital archives increases, IT organizations cannot control or manage the information efficiently. Organizations have little or no understanding of what data is being created, the contents of the data, or where the data is being stored and archived. This lack of understanding results in mismanagement of information, thus creating unforeseen risks.

As organizations transact business electronically, certain information (e.g., customer credit card numbers and Social Security numbers) may be captured and stored. If this data is lost or stolen, some states require that an organization notify all customers that a breach of privacy has occurred. This disclosure tarnishes an organization’s brand and threatens the trust established with customers. As more copies of personal data are made, the risk of data loss or theft increases. In fact, many of the security breaches that have been publicly disclosed were the result of backup tapes being lost or stolen.

Data supervision can become an extremely resource-intensive process if organizations cannot quickly search and locate relevant messages that should be reviewed further. There are other reasons why organizations need to find data quickly. Litigators and regulators have realized that data, whether in a primary storage system or an archive, can be a viable source of evidence for legal matters ranging from executive malfeasance to insider trading. While many high-profile legal cases have focused on e-mail, all data is discoverable if the courts believe that it’s relevant to the matter. During discovery events, organizations are being asked to produce a myriad of data types other than e-mail (e.g., office documents, invoices and other customer records, and Web pages).

Recent ESG research revealed that 42% of organizations have been involved in legal or regulatory actions that necessitated the search and retrieval of an electronic record. The respondents, as highlighted in the figure, above, also noted that their greatest e-discovery challenge is retrieving data from offline media such as backup tapes.

Electronic discovery events uncover many challenges that organizations face when managing data.

Restoring data from offline media can be extremely expensive because organizations that create and store the information must bear a large portion of the costs to retrieve the data.

Half of the organizations surveyed by ESG also point to the lack of effective software tools to search and retrieve relevant information as a major hurdle in electronic-discovery processes.

Legal discoveries are not the only events that could send IT scrambling to find relevant data. When primary data is corrupted or lost, IT must locate a backup copy of the data and restore the application, database, or file system. Depending on the size of the data set, the restore operation could take hours or days, and that clock starts only after IT finds which storage system or systems hold the secondary copy.

Nearly half of IT users surveyed cite long recovery times as their primary data-protection problem. Other reasons, such as resources dedicated to data protection, are also cited, as shown in the figure, right. These resources are often looking for the data that needs to be recovered. In some cases, entire databases, servers, and file systems are restored to recover a specific transaction or file.

Companies that index primary data can quickly retrieve it for business use. The concept can be applied across digital archives as well as backup data. By understanding where all backup copies are, IT can quickly search and retrieve the file or transaction that needs to be restored. The restore process is also shortened because only a small subset of the data is restored.

Finding and retrieving data when you have little or no control over it can be a nightmare. Securing information is just as difficult when an organization fails to understand what it has created and stored. As previously mentioned, an organization needs to secure data containing sensitive information, such as credit card numbers, but other types of internal proprietary data must also be protected from internal and external threats. Controlling data such as patents, trade secrets, and digital trademarks is difficult because this information could reside anywhere within an organization.

IT systems also store vast amounts of confidential data, including employee reviews, internal legal complaints, and merger and acquisition financial models. Because of the obvious legal risks, these data types cannot be made accessible to the masses. Without proper access controls to sensitive information, employees could easily steal, misuse, or delete data or engage in other inappropriate conduct such as insider trading.

With significant data capacities within primary, secondary, and archival storage systems, organizations struggle to find, manage, and use information. In some cases, organizations view data management as simply the movement of data to faster devices to improve availability. However, several other management functions also need to be performed, including data classification, search, retrieval, supervision, and security. To perform them, organizations need to understand more about their data, prepare and organize data for action, and then manage it.

Information classification

Information classification solutions help organizations understand their information better and then organize it for more-precise and intelligent management. Before organizations buy any tools to help them manage data, they first need to broadly understand and begin to manually categorize their data. This means identifying the types of data they’ve already created and what they are likely to produce in the future.

Data types should be separated into information groups according to various criteria and policies established for taking appropriate management action against them. Information groups should take into account external influences such as record retention and privacy regulations, information security risks, and data access requirements.

There is a significant amount of information that can help classify an Excel file.

For example, an information category can be “Confidential Financial Information.” The criteria for this category can be any Excel spreadsheet created by an employee in the finance department. A policy that could apply to “Confidential Financial Information” is establishing a retention period of three years, and the associated action with the policy is archiving the file on an immutable storage system.
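
The category, criteria, policy, and action in this example can be written down as a small data structure. The following Python sketch is only an illustration: the class names, field names, and the action label are hypothetical and do not come from any particular classification product.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class FileRecord:
    """Attributes an index might hold for one file (hypothetical fields)."""
    path: str
    file_type: str
    creator: str
    department: str


@dataclass(frozen=True)
class Policy:
    retention_years: int
    action: str  # hypothetical label for the enforcement step


@dataclass(frozen=True)
class Category:
    name: str
    criteria: Callable[[FileRecord], bool]
    policy: Policy


# The article's example: any Excel spreadsheet created by a finance employee,
# retained for three years on immutable storage.
CONFIDENTIAL_FINANCIAL = Category(
    name="Confidential Financial Information",
    criteria=lambda r: r.file_type == "excel" and r.department == "finance",
    policy=Policy(retention_years=3, action="archive_to_immutable_storage"),
)


def classify(record: FileRecord, categories: Iterable[Category]) -> Optional[Category]:
    """Return the first category whose criteria match the record, or None."""
    for category in categories:
        if category.criteria(record):
            return category
    return None


budget = FileRecord("q3_budget.xls", "excel", "jsmith", "finance")
memo = FileRecord("memo.doc", "word", "jsmith", "finance")
```

Here `classify(budget, [CONFIDENTIAL_FINANCIAL])` returns the category, while `memo` matches nothing because it is not an Excel file. Keeping criteria and policy separate means a retention period can change without touching the matching rule.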

With basic information categories and policies in place, organizations can now use information classification solutions to analyze and index the information. These solutions scan and analyze data sources such as storage systems and file servers, indexing attributes such as the creator, creation date, and actual contents of the data. The context of the data improves as more details about the attributes are identified. For example, a basic attribute about a file is the creator. Additional understanding that a file was created by a finance employee improves context of the information, making it easier to categorize and manage. When scanning files for the “Confidential Financial Information” example, the process should identify Excel files, the creator, and the department of the creator. The figure, above, depicts the various attributes and associated detail available in an Excel file.
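
As a rough illustration of the scanning step, the sketch below walks a directory tree and records a few basic attributes per file. It assumes a plain file system as the data source and uses the numeric owner ID as a stand-in for the creator; resolving that ID to a person and a department (for example through a directory service) and indexing the contents are left out. The function name `build_index` and the record fields are hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone


def build_index(root):
    """Walk a directory tree and collect basic attributes for every file."""
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            index.append({
                "path": path,
                "extension": os.path.splitext(name)[1].lower(),
                "size_bytes": info.st_size,
                "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
                "owner_id": info.st_uid,  # stand-in for "creator"
            })
    return index


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "forecast.XLS"), "wb") as handle:
        handle.write(b"demo")
    demo_index = build_index(demo_root)
```

Each record can then be enriched with more context, such as the department of the owner, before categories are applied.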

Data analysis and indexing are imperative to creating a deep level of intelligence about an organization’s data. In our example, data is analyzed for file type, seeking Excel files. If all files are properly named with the right file extensions, this is an easy process. However, many files stored on file systems or within backup systems may not have an .XLS file extension. The extension may have been deleted by the creator, modified by data-protection software, or never present at all. If the file extension is not ascertainable, the information classification solution must rely on the contents of the file to determine the file type. For Excel spreadsheets, the software must be able to recognize columns and rows, formulas, and index number patterns and other contents. Even if the file type can be determined, the software should ensure the contents of the file align with what is expected of that file type.

For example, if a file has an .XLS extension but contains a significant amount of text and no formulas, there is a good chance the file extension, either intentionally or unintentionally, was misnamed or changed. Information classification solutions should be able to identify and index all data sources accurately.
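
One simple way to cross-check an extension against contents is to look at the first bytes of the file. The sketch below is a minimal example of that idea, not a description of how any classification product works. It relies on two well-known signatures: legacy Office files, including .xls, begin with the OLE2 compound-file signature, and .xlsx files are ZIP containers. Because Word and PowerPoint files share these containers, a signature match only narrows the type; recognizing rows, columns, and formulas would require parsing the contents. The function names are hypothetical.

```python
import os

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy Office container (.xls, .doc, .ppt)
ZIP_MAGIC = b"PK\x03\x04"                       # ZIP container (.xlsx and many others)

# Container each extension is expected to use.
EXPECTED_CONTAINER = {".xls": "ole2", ".xlsx": "zip"}


def sniff_container(header: bytes) -> str:
    """Guess the container format from the first bytes of a file."""
    if header.startswith(OLE2_MAGIC):
        return "ole2"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


def extension_matches_content(filename: str, header: bytes):
    """Return True or False for known extensions, None when there is no rule."""
    extension = os.path.splitext(filename)[1].lower()
    expected = EXPECTED_CONTAINER.get(extension)
    if expected is None:
        return None
    return sniff_container(header) == expected
```

A file named `report.xls` whose header is plain text would return `False` here and could be flagged for content-based classification.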

Determining basic information categories, criteria, rules, and actions is largely a manual process that should involve representatives from organizations that establish corporate policies regarding information access and privacy, compliance with regulations, and IT. More-granular categories, criteria, and policies can be developed during the analysis and indexing step.

Information classification solutions can identify patterns of data that may warrant the addition of a category or modification to criteria. Organizations may deploy technology that automatically creates categories and criteria based on recurring or common patterns within the contents of the data. For example, the information classification solution may highlight that many Excel documents with “Confidential Financial Information” in the contents also contain the words “Expected Sales Revenue” or a phrase with similar meaning. Organizations can create an additional policy to encrypt these files so that only finance managers can access the data.
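
A very simple version of this pattern discovery counts which phrases recur across the documents already placed in a category. The sketch below assumes the document text has been extracted to plain strings and matches exact word sequences only; finding phrases "with similar meaning" would need something more sophisticated. The function name and thresholds are hypothetical.

```python
import re
from collections import Counter


def frequent_phrases(documents, phrase_len=3, min_share=0.5):
    """Return phrases of phrase_len words found in at least min_share of documents."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z0-9']+", text.lower())
        # Use a set so a phrase counts once per document.
        phrases = {
            " ".join(words[i:i + phrase_len])
            for i in range(len(words) - phrase_len + 1)
        }
        counts.update(phrases)
    threshold = min_share * len(documents)
    return sorted(phrase for phrase, n in counts.items() if n >= threshold)


sample_docs = [
    "Expected sales revenue for Q3",
    "Q4 expected sales revenue",
    "Headcount plan",
]
suggested = frequent_phrases(sample_docs)
```

With the sample above, `suggested` contains only "expected sales revenue", which an administrator could review as a candidate criterion for a new encryption policy.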

Information classification solutions build an information index and classify data according to associated policies, enabling IT departments to take more-intelligent actions with the data. In some cases, information classification solutions can automatically pass along index information with the data to another product that can enforce the action.

Continuing with the “Confidential Financial Information” example, all Excel files created by finance employees would be tagged with a retention period and retained on immutable storage systems for that time period. In addition, the information classification software can ensure those files are sent through a security appliance that can encrypt the data before being stored. The retention period can be enforced via integration with the information classification solution, the archiving software, and a storage system configured with immutable storage. Authorized users can access the data by decrypting the files with the right password.
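
The retention tag itself comes down to a date calculation. The sketch below assumes retention is counted in whole years from the creation date and that a February 29 start rolls forward to March 1 in a non-leap year; both are illustrative choices, and real retention rules should come from the applicable regulation or corporate policy. The function names are hypothetical.

```python
from datetime import date


def retention_expiry(created: date, years: int) -> date:
    """Date on which a record created on `created` leaves its retention period."""
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # February 29 does not exist in the target year; roll to March 1.
        return created.replace(year=created.year + years, month=3, day=1)


def may_delete(created: date, years: int, today: date) -> bool:
    """True once the retention period has fully elapsed."""
    return today >= retention_expiry(created, years)
```

An archive that enforces immutability would refuse deletion while `may_delete` is false, regardless of who makes the request.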


With an intelligent, rich index, information classification solutions are contextually aware, and in a few cases, can extract both data attributes (e.g., creator) and contents (e.g., credit card number) from a file, e-mail, or database. This extraction must be accurate and precise because organizations are creating and storing significant amounts of data that must be managed more efficiently.
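
To show what content extraction can look like, here is a small sketch that finds likely credit card numbers in text. It combines a loose digit pattern with the Luhn checksum that payment card numbers satisfy, which cuts down on false matches from order numbers and similar digit runs. The pattern and function names are illustrative assumptions, and a production scanner would need to handle more formats.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(digits: str) -> bool:
    """Apply the Luhn checksum to a string of digits."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str):
    """Return digit strings in text that look like card numbers and pass Luhn."""
    found = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if 13 <= len(digits) <= 19 and luhn_valid(digits):
            found.append(digits)
    return found
```

For instance, the widely used test number 4111 1111 1111 1111 passes the checksum, while changing its last digit makes it fail.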

Intelligent classification solutions make information management, including archiving, more effective.

Once all information is indexed, organizations can conduct more-precise searches that return relevant information. Finding data more quickly can help IT departments restore files, reduce the time it takes to respond to a discovery request, and improve supervision capabilities. In addition, organizations can locate, centralize, and secure data that contains sensitive information such as a customer’s address.
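
The reason an index speeds up search is that the lookup happens against a prebuilt map of terms rather than against the files themselves. The following sketch builds a basic inverted index over document text and answers queries that require all terms to be present. It assumes the text has already been extracted, and the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


sample = {
    "msg-1": "Customer address update for account 1001",
    "msg-2": "Quarterly revenue forecast",
    "msg-3": "Customer revenue by region",
}
sample_index = build_inverted_index(sample)
```

A query such as `search(sample_index, "customer revenue")` touches only two index entries and returns `{"msg-3"}` without rereading any document.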

In addition to providing context about large amounts of storage capacity, information classification solutions can reduce the amount of data that an organization saves.

By indexing the data, an organization can identify and then delete duplicate copies of the same file or message. This data-cleansing process is also referred to as single-instance storage, de-duplication, or coalescence. Depending on the granularity required, duplicate files or parts of files are eliminated. When data is copied during data-protection operations, the copies are smaller because duplicate data has been removed. De-duplication also reduces the risk of multiple copies being accessible to the wrong people or, worse, being lost. Organizations can also reduce storage capacity by retaining only the appropriate business records. An intelligent index can distinguish between files created for personal use and those generated for business purposes.
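
File-level de-duplication is commonly done by comparing content hashes. The sketch below groups files whose bytes are identical using SHA-256 from the Python standard library. It works on in-memory contents for brevity, and the function name is hypothetical; eliminating duplicate parts of files would require chunk-level hashing, which is not shown.

```python
import hashlib
from collections import defaultdict


def group_duplicates(files):
    """Given a mapping of name to bytes, return lists of names with identical content."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


sample_files = {
    "finance/budget.xls": b"same bytes",
    "backup/budget_copy.xls": b"same bytes",
    "finance/forecast.xls": b"different bytes",
}
duplicates = group_duplicates(sample_files)
```

Each group in `duplicates` is a candidate for keeping a single instance and replacing the other copies with references.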

Finally, as organizations manage data, information classification solutions continue to track the data, adding more information to the index. The index can also serve as an audit trail in case litigators or regulators request proof that data was not altered or inappropriately accessed during a retention period. An audit trail can also be used to help prove that chain-of-custody processes were followed when electronic files are admitted into evidence at trial.
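
One common technique for making an audit trail tamper-evident is to chain entries together with hashes, so that changing any earlier entry invalidates everything after it. The article does not say how any product implements its audit trail; the sketch below is an assumed, minimal illustration of the hash-chain idea, with hypothetical function names and event fields.

```python
import hashlib
import json


def _digest(previous: str, event: dict) -> str:
    """Hash the previous digest together with a canonical form of the event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((previous + payload).encode("utf-8")).hexdigest()


def append_event(trail: list, event: dict) -> None:
    """Add an event whose digest depends on the previous entry."""
    previous = trail[-1]["digest"] if trail else ""
    trail.append({"event": event, "digest": _digest(previous, event)})


def verify(trail: list) -> bool:
    """Recompute every digest and confirm the chain is intact."""
    previous = ""
    for entry in trail:
        if entry["digest"] != _digest(previous, entry["event"]):
            return False
        previous = entry["digest"]
    return True


trail = []
append_event(trail, {"file": "budget.xls", "action": "archived", "user": "jsmith"})
append_event(trail, {"file": "budget.xls", "action": "read", "user": "auditor"})
```

After these two events `verify(trail)` is true; editing a stored event afterward makes verification fail, which is the property an audit trail needs when it is used as evidence.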

Organizations understand that they will store large capacities of data and more of this information will be kept online within a digital archive. As the digital archives are built, additional intelligence must be captured before data is finally stored.

In the ESG survey, organizations stated that the capture and indexing of all data attributes and contents are essential when deploying a comprehensive archiving solution, as shown in the figure above. To that end, many archiving applications rely on embedded information classification solutions to ensure data is archived intelligently.

Without embedded classification technology, archiving solutions do not offer IT organizations the full potential to search for, locate, and retrieve the appropriate information. In addition, administrators cannot use the archives to take additional action, such as establishing a legal hold or encrypting sensitive files and messages. Complete information indexing provides companies with the capability to manage data more intelligently. With more capacity to manage, organizations are deploying digital archiving applications that must quickly capture and fully index information.

Information classification solutions are the only means that organizations can use to deploy successful archiving solutions as well as improve other aspects of information management.

Archiving is just one example of an information management action that is improved with information classification technology. Organizations can no longer store data blindly because of the risk associated with losing control and mismanagement of the data. At the same time, the potential for creating new revenue streams by keeping the right data forever is enormous.

The risks, ranging from failure to comply with regulations to storing multiple copies of the same data, can be costly. Organizations cannot afford to lose control of their data or continue to manage it without understanding what it is. Information classification provides the necessary context and detail about data that enables organizations to effectively archive, secure, and protect it.

Brian Babineau is an analyst with the Enterprise Strategy Group (www.enterprisestrategygroup.com) in Milford, MA.

This article was originally published on June 01, 2006