Data classification: Use brains, not brawn

Data-classification initiatives increasingly rely on business value criteria, necessitating a move toward auto-classification using system-generated metadata.

By Dave Vellante and Fred Moore

The current state of data classification is largely a by-product of historical hierarchical storage management (HSM) implementations, in which data age is the primary classification criterion. Early visions of classifying data based on business value never fully came to fruition, in large part because the proposed schema effectively required brute-force manual classification of data sets upon creation or use. While all classification efforts require varying degrees of manual intervention (e.g., initial discussions with business lines to define requirements), age-based classification makes it easier to introduce automation into data-classification initiatives and has become the de facto standard.

A new emphasis on compliance, discovery, archiving, and provenance substantially challenges existing data-classification taxonomies. Today’s business-value drivers include “never delete” retention policies as well as performance, availability, and recovery attributes that underpin resurgent data-classification efforts. Although age-based schema still predominate, they must evolve to more aggressively incorporate richer classification attributes.

Importantly, this extension should be accomplished with an eye toward automation by dynamically assigning metadata to data sets upon creation or use. Future data-classification efforts will involve much broader perspectives and serve as the mainspring of multiple enterprise initiatives, including information lifecycle management (ILM), tiered storage, e-mail archiving, decision support, data mining, electronic content management, and compliance. In short, data classification will serve as the foundation for information value management, and without auto-classification there is little chance to succeed in supporting these often complex efforts.

This article presents the following premise: IT organizations must break with the past and make business process, and not age of data sets, the defining criterion for classification schema. Furthermore, in designing value-based classification schema, auto-classification capabilities that assign metadata to data sets at the point of creation or use must be included to accommodate scale and manageability.

New value drivers

The traditional catalyst of data classification from a storage point of view has been improved efficiencies, often by freeing up space and/or migrating data to less expensive tiers. In the early 1990s, the common belief was that archival status was the last phase for data before deletion or end-of-life. Then, one- to two-year data-retention periods were viewed as a reasonable amount of time to keep digital data in any accessible format. Age-based retention policies predominated.

It is becoming increasingly important to understand that the value of data changes throughout its lifetime and quite often retains or even gains value over time. Where data should optimally reside, how it should be accessed, how it should be managed, and its corresponding metadata attributes all change during the lifespan of information.

Today, there are countless government regulations that dictate the way data is managed and stored throughout its life. As a result, a new value proposition is emerging where proper classification enables the reconstruction of a continuum of organizational activities performed, and decisions made, over a period of time. What this means is that the long-time reliance on “corporate memory” to piece together a series of events, or conduct a cumbersome discovery, has the potential to be supplanted by a much more reliable and auditable system of infrastructure, metadata, applications, and business processes. To general counsels, boards of directors, and risk managers, this can mean billions in loss mitigation and improved business productivity.

Technology integration

To implement a data-classification and ultimately a lifecycle management strategy from an infrastructure perspective, the three-tiered storage hierarchy model has emerged as the de facto standard. Primary storage (T1) is always disk-based and serves highly active, mission-critical, or customer-facing revenue-generating applications. Secondary storage (T2) includes virtual tape for enterprise systems or low-cost SATA disk systems and sometimes MAID (massive arrays of idle disks) for data that has a lower activity level but hasn’t yet reached archival status. The third tier (T3), long-term archival storage, remains the realm of tape devices. Moving large amounts of data from one level of the hierarchy to another, with the data passing in and out of a server each time, is a growing performance concern, and it demands that a device-to-device data-transfer capability emerge between the tiers. Not all data is created equal, and the value of data can change throughout its lifetime. For lifetime data management, it doesn’t matter if the data is ever used; it does matter if the data is there and can be accessed.
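As a rough illustration of the three-tier model, the sketch below places a data set on a tier using a business-value flag first and activity second. The DataSet fields, the place function, and both day thresholds are hypothetical and chosen only for illustration; they are not part of any product or of the Horison model.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    T1 = "primary disk"
    T2 = "secondary (virtual tape, SATA disk, or MAID)"
    T3 = "archival tape"


@dataclass
class DataSet:
    name: str
    mission_critical: bool        # hypothetical business-value flag
    days_since_last_access: int   # hypothetical activity measure


def place(ds: DataSet, active_days: int = 30, archive_days: int = 365) -> Tier:
    """Pick a tier. Both thresholds are illustrative assumptions."""
    # Business value is checked first, so critical data never leaves T1
    # merely because it is old.
    if ds.mission_critical or ds.days_since_last_access <= active_days:
        return Tier.T1
    # Lower activity, but not yet archival.
    if ds.days_since_last_access <= archive_days:
        return Tier.T2
    return Tier.T3
```

Checking the business-value flag before the activity measure is the point of the sketch: an old but mission-critical data set stays on T1, which a purely age-based rule would not allow.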

The model in the accompanying figure, developed by Horison Information Strategies (www.horison.com), can be used as a visual tool to describe critical data attributes and their logical placement within the infrastructure hierarchy.

Metadata adds intelligence

At a recent Wikibon (www.wikibon.org) research meeting about data classification, the participants concluded that the most important consideration for data classification emerging in organizations today is auto-classification using metadata. Metadata is “data about data” and includes the creation date, who and/or what created the data, where the data was used, and when it was destroyed. Managers (and increasingly, courts) need confidence that data was not changed without a record, and using metadata to provide this assurance is a clear way of addressing the problem. In addition, automating metadata creation can support data-classification efforts by accommodating the never-ending changes in government regulations, corporate policy, and competitive pressures.

Because applications and users create data, these are the logical places where metadata should be assigned. Metadata is additive in nature and does not need a single point of control. As such, file systems, applications, system management software, databases, storage management software, and intelligent storage hardware are all potential candidates to create metadata and should be exploited for this purpose. The creation of metadata must be automated and/or made as simple as possible for end users to add; otherwise, the management of classified data will become practically impossible due to the enormous complexities of dealing with change. Classification metadata should reside on T1 or T2 storage devices and be readily and speedily accessible.
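The additive nature of metadata can be sketched as an append-only list of entries, each stamped with the component that wrote it. This is a minimal sketch under stated assumptions: the record layout, the add_metadata and current_view helpers, and the sample attribute names are all hypothetical.

```python
from datetime import datetime, timezone


def add_metadata(record: dict, source: str, **attributes) -> dict:
    """Return a new record with one more metadata entry appended.

    Earlier entries are never modified, so no single point of control
    is needed (hypothetical layout, for illustration only).
    """
    entry = {
        "source": source,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "attributes": dict(attributes),
    }
    return {**record, "metadata": [*record.get("metadata", []), entry]}


def current_view(record: dict) -> dict:
    """Merge entries in order; later entries override earlier ones."""
    merged = {}
    for entry in record.get("metadata", []):
        merged.update(entry["attributes"])
    return merged


# Three independent components each contribute what they know.
record = {"name": "q2-invoices.csv"}
record = add_metadata(record, "file_system", owner="finance")
record = add_metadata(record, "application", business_process="billing")
record = add_metadata(record, "storage_manager", tier="T1")
```

Because entries are only appended, the history of who asserted what, and when, is preserved alongside the merged view.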

The action item here for IT professionals is that value-based data classification necessitates automation of the creation of metadata at the time of data-set creation or use. The first and most important step is to agree on metadata types and the layout and structure of each type of metadata. Then agree where, when, and how classification metadata will be created and maintained.

Data-classification methodology

Most businesses have not started on a formal classification program. Getting started sooner rather than later is of utmost importance. Data-classification efforts typically lack a formal structure and rely on informal meetings between the storage staff and business units, or other correspondence to obtain user requirements for storage services.

Four distinct levels of classifying data exist, and these represent best practices for beginning the data-classification process: mission-critical data, vital data, sensitive data, and non-critical data. Determining these categories enables the most cost-effective storage tier and appropriate data-protection strategies to be selected. These levels also identify which backup-and-recovery technology is best suited for each data-classification level to meet or exceed the recovery point objective (RPO) and recovery time objective (RTO) requirements. Once data has been classified, selecting the optimal storage infrastructure and data-management options, as well as developing automation techniques, becomes much clearer.
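One way to picture how the four levels drive tier and recovery choices is a simple policy table. In the sketch below, the tier assignments and every RPO and RTO value are placeholder assumptions, not recommendations; each organization would substitute figures agreed with its business lines. The Policy class and meets function are likewise hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Level(Enum):
    MISSION_CRITICAL = 1
    VITAL = 2
    SENSITIVE = 3
    NON_CRITICAL = 4


@dataclass(frozen=True)
class Policy:
    tier: str
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable time to recover


# Placeholder values only: every tier and duration here is an assumption.
POLICIES = {
    Level.MISSION_CRITICAL: Policy("T1", timedelta(0), timedelta(minutes=15)),
    Level.VITAL: Policy("T1", timedelta(hours=1), timedelta(hours=4)),
    Level.SENSITIVE: Policy("T2", timedelta(hours=8), timedelta(hours=24)),
    Level.NON_CRITICAL: Policy("T3", timedelta(days=1), timedelta(days=3)),
}


def meets(level: Level, achieved_rpo: timedelta, achieved_rto: timedelta) -> bool:
    """True if a backup-and-recovery option meets or exceeds the objectives."""
    policy = POLICIES[level]
    return achieved_rpo <= policy.rpo and achieved_rto <= policy.rto
```

A candidate backup-and-recovery technology can then be screened per level by passing in the RPO and RTO it is able to achieve.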

The figure below depicts the four data categories and how key attributes differ for structured versus unstructured data (including metadata).

Organizational considerations

Keys to a successful data-classification project are detailed planning and meaningful dialogue with executives, line-of-business owners, and users about data requirements and metadata automation strategies. The objective is to optimally match different levels of storage with users’ needs.

However, organizations are realizing that data classification can provide benefits well beyond storage efficiency. This presents a challenge for storage professionals who have traditionally been responsible for data-classification implementations. Data classification is a fundamental building block for effective ILM, tiered storage, and archiving initiatives and the potential benefits touch many parts of an organization. However, out-of-scope requirements can disrupt the initial objectives of data-classification projects, and managers must be extra careful to avoid scope creep.

For IT to implement a full data-classification architecture, detailed assessments will be needed with legal, audit, risk management, and business lines. Architects must be consulted about metadata architecture and application developers/application owners must provide advice to determine metadata automation requirements and policies. Finally, operations professionals must be involved to ensure day-to-day processes and procedures are in place. Understanding these dynamics is important, but taking this on in one data-classification “uber-project” is not advisable. Storage executives need to limit the scope of any data-classification effort to that which can be achieved in the immediate term.

The action item here is that executives responsible for storage must keep data-classification schema simple and limited to data that is system-generated (e.g., date of creation and last use) or willingly tagged by users at the time of creation. While necessary, expanding the scope of classification efforts should not proceed until data-classification schema are defined and automated methods of metadata generation are in place. Relying on manual capture of classification information will doom data-classification projects to failure.
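System-generated attributes of this kind can often be read directly from the file system with no user involvement. The sketch below uses Python's standard os.stat call; the system_metadata helper and the temporary demo file are hypothetical. Note one assumption: a true creation date is not portably available (on many Unix systems st_ctime records the last metadata change, not creation), so the sketch reports last-modified and last-accessed times instead.

```python
import os
import tempfile
from datetime import datetime, timezone


def _to_iso(timestamp: float) -> str:
    """Convert a POSIX timestamp to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


def system_metadata(path: str) -> dict:
    """Collect attributes the file system already tracks for a file."""
    st = os.stat(path)
    return {
        "size_bytes": st.st_size,
        "last_modified": _to_iso(st.st_mtime),
        # Access times may be coarse or disabled on some mounts.
        "last_accessed": _to_iso(st.st_atime),
    }


# Demo on a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    demo_path = handle.name

meta = system_metadata(demo_path)
os.remove(demo_path)
```

Attributes gathered this way cost users nothing, which is what keeps a limited schema manageable at scale.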

To be sure, the justification, internal arm-twisting, and evolution of this project will not be trivial; however, the technologies, regulatory imperatives, and competitive pressures are coming together in a sort of perfect-storm scenario that will dictate investment in this area for the next several years. At the heart of this opportunity are the automatic creation of classification metadata and the enticement of users to provide meaningful input into the process via simplified auto-classification tools.

The amount of data at the back end of the data lifecycle is growing, not shrinking. Retention policies are now based on data value, which means a standard metadata classification scheme must emerge. Several software suppliers are beginning to address this issue, but the market remains highly fragmented, and turnkey solutions are limited in scope.

The final action item is that IT must sell the vision of how enabling automation of metadata will create business value by reducing risk, driving huge improvements in productivity, and facilitating the exploitation of untapped corporate knowledge. Application owners must be persuaded to develop metadata creation and auto-classification functions and supporting architectures. Finally, metadata creation must be simplified for end users to participate in the process and add incremental value.

Dave Vellante and Fred Moore are both members of the Wikibon community (www.wikibon.org). Vellante started International Data Corp.’s (IDC) storage service in 1984 and is now an entrepreneur in the Web 2.0 software industry. Moore has been a fixture in the storage industry for decades. In 1998, he founded Horison Information Strategies (www.horison.com), an information strategies consulting firm in Boulder, CO.

What is Wikibon?

Wikibon (www.wikibon.org) is a worldwide community of practitioners, consultants, and researchers dedicated to improving the adoption of technology and business systems through an open-source sharing of free advisory knowledge. Consultants, writers, and technology peers come together on Wikibon to collaborate on projects and share ideas about how to analyze, design, implement, and adopt critical business functions. Wikibon.org was founded by former Meta Group and International Data Corp. (IDC) executives. The site went live in January.

Each week, Wikibon hosts a community research gathering on important industry issues. The collective insight of the attendees is recorded, transcribed, and published in the Peer Incite Newsletter. Representative topics include virtualization, data classification, e-mail archiving, and other emerging technologies.

This article was originally published on June 01, 2007