Classification is complex, but the good news is that a variety of tools exist to help you get started.
By David G. Hill
Information lifecycle management (ILM) is the policy-driven process of managing information as it changes value throughout the full range of its lifecycle, from conception to disposition. For ILM to become a reality, however, businesses must classify the data they need to manage. Identifying and ordering data according to business and regulatory requirements require tools that use a policy-management engine based on business rules, as well as metadata and content knowledge of files and/or databases. As a result, organizations can classify data on the basis of value or requirements such as compliance or availability.
The benefits of data classification reflect and bolster the benefits of ILM:
■ More-efficient use of the storage infrastructure through the use of tiered storage solutions;
■ Greater productivity for storage management;
■ Enabling or simplifying compliance management and eDiscovery processes; and
■ Enabling information that was lost to be found and used effectively.
Developing greater knowledge of information enables querying across broader sets of data, identification of new relationships between data for competitive advantage, easier programming at a higher (business metadata) level, better compliance, enhanced data quality, and better data administration across data stores.
Commitment is critical
A number of software tools are available to help automate the process of data classification. These tools are typically policy-management-driven, which means that there is a policy “engine” that applies business rules. This sounds great, but the software machinery can only mesh the gears of automation after business users have engaged business rules. And therein lies the rub.
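To make the idea of a policy “engine” concrete, the following is a minimal sketch in Python, not a description of any vendor's product. The record fields, rule names, and labels are all hypothetical and chosen only for illustration. The point is that the engine itself is simple; the rules it applies must come from the business.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FileRecord:
    """Hypothetical metadata about one file; the field names are assumptions."""
    path: str
    owner_department: str
    days_since_access: int


@dataclass
class Rule:
    """A business rule: a condition over a record and the label it assigns."""
    name: str
    condition: Callable[[FileRecord], bool]
    label: str


def classify(record: FileRecord, rules: List[Rule], default: str = "unclassified") -> str:
    """Return the label of the first rule whose condition matches the record."""
    for rule in rules:
        if rule.condition(record):
            return rule.label
    return default


# Illustrative rules only; in practice business users would have to define these.
RULES = [
    Rule("legal-owned", lambda r: r.owner_department == "legal", "compliance"),
    Rule("stale", lambda r: r.days_since_access > 365, "archive"),
]

record = FileRecord("/share/contracts/msa.doc", "legal", 400)
print(classify(record, RULES))
```

Because rules are evaluated in order, the record above is labeled by the first matching rule even though it would also satisfy the second. Rule ordering is itself a business decision in this sketch.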
Business rules are set by the organization, specifically IT and executive management with strategic responsibility for business information. However, getting these groups and individuals involved is not always easy. First, they need a compelling reason to participate, which means that they have to benefit directly from the process. The fact that data classification organizes data so that IT can manage it better should warm the hearts of IT management, but does little for business managers. For example, although “tiering” storage allows IT (and therefore the enterprise as a whole) to save money, business management is unlikely to see any compelling direct benefit (even if there is a sophisticated chargeback scheme where the cost savings appear).
Compliance might be one “killer application” for data classification, since it is a business task that can attract the buy-in of executives responsible for legal discovery and compliance audits. However, compliance tends to focus on one or just a few applications, so it is not a compelling motivation for universal data classification. Still, it is a good start.
A second “killer application” might be enterprise search, since a good deal of information that has value to the enterprise is often misplaced or forgotten. Search is also an important component of eDiscovery.
Even if IT can get management to participate, setting business process rules is not easy. For example, trying to determine the meaning of simple terms such as “customer” or “product” across cross-functional boundaries can be a major challenge. However, there are good semi-automated tools that can help to create data classification metadata.
Data classification tools
A couple of the terms that have popped up to cover the data classification space are “information classification and management” and “intelligent information management.” Both are good attempts at product categorization, but some vendors decline to be pigeonholed. Categorization is an attempt to view products through a single lens, not only for comparison purposes, but also to help IT organizations understand what they need to get their hands around.
The products listed in the table on p. 34 are from companies that focus on the data classification space in some general sense. Storage resource management (SRM) tools or classification tools that are only a part of a larger package are not listed.
The data classification market is still very fluid, and products and partnerships are being announced at a fairly rapid pace. The vendors listed in the table are mostly smaller vendors, but most have partnerships with some of the larger players in the industry. For example, Abrevity lists among its technology partnerships one with the EMC Velocity Partner program and a Gold Business Partner relationship with Hewlett-Packard. Among its partners, Arkivio includes EMC, Hitachi Data Systems, Network Appliance, and Symantec. Index Engines states that EMC and NetApp are among its technology partners. Kazeon partners with vendors such as Hitachi and NetApp. Scentric lists Hitachi as a technical partner. And StoredIQ touts EMC and NetApp as technology partners. Note that the partnerships are by no means exclusive. Note also the absence of some key IT vendors, such as IBM and Sun, in these lists.
Looking through different lenses
Enterprises can look through a couple of perceptual lenses to help them determine which data classification solution may serve their needs. Vendors’ products tend to be a composite of functions and the function sets are not the same for each product.
The management lens
The first filter is to determine which types of management functions are performed: storage, data, or information. The three types (derived from the Storage Networking Industry Association definitions) are the following:
■ Storage management: Discovers, monitors, and controls physical storage assets.
■ Data management: The non-data-path control and use of the data itself from creation to deletion, such as migration, replication, and backup/restore processes.
■ Information management: Manages the content and decision-making relationships of information as it moves through the lifecycle of a business process, such as records management and content management.
What is the difference? Storage management covers “tiering”; data management focuses on data protection (such as employing different types of data protection for different classes of data) and migration; and information management is about content-awareness (what is inside is what is important), such as applying eDiscovery.
In a broader sense, information management is the enterprise-wide administration at the metadata/business level across all vendors/data types. There can also be a mix of types working in harmony and integration. File metadata (a data management function) may be mixed with an index of information (a content-aware information management function) to classify data. That classified data can then be migrated (a data management function) to the appropriate tier of storage (a storage management function).
Storage management is at the block level and uses primitive metadata, such as when the block was last accessed. Data management can use file and database metadata (it is at the level of the file or the database “record,” but does not understand the content of the file or record). Information management is content-aware in that the contents of a file or database can be examined and that information (either directly or in the form of an index) can be used for classification purposes.
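The mix of functions described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not any product's behavior: the class names, tier names, the 180-day threshold, and the list of sensitive terms are all hypothetical. It combines file metadata (a data management input) with terms from a content index (an information management input) to classify each file, then maps the class to a storage tier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class FileInfo:
    path: str
    days_since_access: int                        # file metadata
    terms: Set[str] = field(default_factory=set)  # terms from a content index


# Hypothetical terms that mark a file as a business record.
SENSITIVE_TERMS = {"contract", "invoice"}

# Hypothetical mapping from data class to storage tier.
TIER_FOR_CLASS = {
    "business-record": "tier1-protected",
    "active": "tier1",
    "inactive": "tier3-archive",
}


def classify(info: FileInfo) -> str:
    """Content takes precedence over age: what is inside is what is important."""
    if info.terms & SENSITIVE_TERMS:
        return "business-record"
    if info.days_since_access > 180:
        return "inactive"
    return "active"


def plan_migration(files: List[FileInfo]) -> Dict[str, str]:
    """Return the target tier for each file path; no data is actually moved."""
    return {f.path: TIER_FOR_CLASS[classify(f)] for f in files}


files = [
    FileInfo("a.doc", 400, {"contract", "renewal"}),
    FileInfo("b.doc", 400),
    FileInfo("c.doc", 3),
]
print(plan_migration(files))
```

Note that a.doc and b.doc are equally old, yet only b.doc is planned for the archive tier. Metadata alone would treat them identically; the content index is what separates them.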
Products that fall within the “information classification and management” or “intelligent information management” categorization scheme may have functions that fall within the different categories. For example, one product may focus on migration of data to tiers of storage as well as classification. Another product may focus on classification using both file metadata and content-based search capability, but not do migration.
The data lens
The second way of looking at data classification is through the types of data that the data classification process manages. Data classification does not have to be universal for an enterprise. One application at a time, or a series of interrelated applications, can be selected. In either case, data classification may involve only one data type or a mix of data types (see table, above).
The most common differentiation is between structured and unstructured data. What users typically consider to be structured data (data in databases) is essentially correct. What is frequently considered unstructured data (for example, where word processing documents are commingled with video files) is not a correct categorization. There is an essential differentiation between semi-structured and unstructured data in that semi-structured data can be effectively searched. For example, one can search for all e-mails or word processing documents (i.e., those supported by content-aware applications) that contain a certain word. That is why there is a need for the semi-structured category.
The same search capability cannot be applied to native unstructured data. For example, questioning when a certain word was spoken in a movie is unanswerable in native mode since video cannot be searched but only sensed (viewed or heard). Speech recognition can be used to determine whether and when a word was spoken, and this information can then be put into a searchable format. The goal for a lot of unstructured data is to increase its structure by pairing it with complementary structured or semi-structured information.
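As a rough illustration of pairing unstructured data with searchable information, the Python sketch below indexes a transcript of a video. It assumes the transcript already exists as a list of (seconds, word) pairs produced by some speech recognition step; that step, the sample data, and the function names are all hypothetical and not part of any specific tool.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_word_index(transcript: List[Tuple[float, str]]) -> Dict[str, List[float]]:
    """Map each spoken word (lowercased) to the times at which it was spoken."""
    index: Dict[str, List[float]] = defaultdict(list)
    for seconds, word in transcript:
        index[word.lower()].append(seconds)
    return index


def when_spoken(index: Dict[str, List[float]], word: str) -> List[float]:
    """Return the timestamps for a word, or an empty list if it was never spoken."""
    return index.get(word.lower(), [])


# Made-up transcript data standing in for speech recognition output.
transcript = [
    (12.5, "Welcome"),
    (13.1, "to"),
    (13.4, "storage"),
    (47.0, "Storage"),
    (47.6, "tiers"),
]

index = build_word_index(transcript)
print(when_spoken(index, "storage"))  # [13.4, 47.0]
print(when_spoken(index, "video"))    # []
```

The video itself remains unsearchable in native mode; only the derived index can answer the question of whether and when a word was spoken.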
The term “semi-structured” is most often used to refer to e-mail, while other semi-structured data is relegated to unstructured status. E-mail is semi-structured, but so are word processing documents. To achieve optimum data classification success, the focus should be on the data itself. There is nothing intrinsic in an e-mail that gives it more “structure” than a word processing document.
Since the word “unstructured” tends to be used gratuitously (and inaccurately), a determination has to be made between unstructured and semi-structured data. The distinction is important. True unstructured data cannot be used natively by content-aware applications. Unstructured data is typically stored in Binary Large Objects (BLOBs), which tend to change less often than other data. Thus administration and classification have to be different.
Note that no one product has to encompass all types of data, only that all types of data within the organization have to be covered if universal data classification is an appropriate goal for the enterprise. This is a classic caveat emptor example for IT administrators, who must consider carefully the types of data supported by the software tools under consideration.
Many useful products are available to help businesses with the data classification process. At the upcoming Storage Networking World conference, SNIA will be holding a Hands-on Lab where end users can test drive a variety of data classification products. Helping potential customers develop a feel for available tools should help them get their arms (and heads) around a difficult and complex process.
Organizations have to be able to determine the types of management functions they need to perform (such as storage management for “tiering,” data management for migration, and content-based search for information management). Enterprises also need to decide what data (structured, semi-structured, and unstructured) needs to be classified as the tools may be proficient at one data type or multiple data types.
Still, the challenge remains how to get commitment from business users and management to engage in the data classification process. One way of achieving that goal may be to leverage and integrate the concepts of ILM with enterprise content management, master data management, and business process management. That way the business will be able to understand the benefits of data classification in ways that are important for the efficient and effective operation of business functions and business units. Then, data classification will lead to true intelligent information management (i.e., blending the content and decision-making relationships of information throughout the lifecycle of the business process).
David Hill is the founder and principal at the Mesabi Group (www.mesabigroup.com). A version of this article originally appeared in the Pund-IT newsletter (www.pund-it.com).