A successful data purging project requires cooperation between many non-IT and IT groups, including the storage team.
By John Merryman and James Brissenden
March 23, 2007—With the ever-decreasing cost of primary storage, why bother purging data? The status quo is to do nothing. Most legacy applications have never purged data, and new applications are rarely designed to accommodate purging. At the same time, corporate file servers storing unstructured data are forever bloating, and at best, IT is only purging unneeded user data. Only e-mail applications have truly entered the purge arena, but solely for those willing to procure an e-mail archiving product or service.
However, the status quo is about to change. The tides of perpetual data retention are turning due to an evolving culture of corporate governance, federal regulatory changes, and increased pressures to stem the costs associated with overall data bloat.
To understand the problem, let's first explore how we got here: In IT, many shops got lazy and never bothered to purge data, and on the business side, many companies have yet to fully associate ballooning budgets with a lack of data purging.
We have a lot of good excuses, however. Over time, more and more business functions have become automated through technology (e.g., e-mail, document management, databases, ERP, CRM, etc.). Most organizations have experienced sustained organic growth, M&A activities, or both. The explosive growth of user and application data has demanded additional storage capacity, increased performance, and sufficient data-protection measures. Data retention and purging always take a back seat to operational stability, and by the end of the budget cycle are the first to take the hit.
Meanwhile, the "paper people" in records-retention departments have diligently applied policy and process to the way information is managed on paper. The physical nature of hard-copy information presents a more immediate challenge to the business, while digital information is obscured by technology and generally won't fill up hallways and office buildings. Most organizations can direct you to hard-copy information either on-site or off-site and let you know the retention schedule associated with it.
However, if you ask IT where to obtain a specific type of information and its associated retention schedule, good luck. Detailed knowledge about digital information management is difficult to find. Data owners and creators, database administrators, and storage administrators all may have unique perspectives on where and how information lives in the land of ones and zeros. The line of business may have a good idea of what business processes are supported by technology, but usually lacks intimate knowledge of the data-management practices in place.
The value of historical data is also increasing significantly with data warehousing and mining techniques, once confined to the traditional warehouse but increasingly applied to production applications. Historical use of data has evolved to the point where legitimate business processes depend on access to historical data. Consequently, many businesses benefit from the existence of historical data, so the operational risks of data purging must be weighed against the benefits (infrastructure cost avoidance, decreased litigation risk, increased performance, etc.).
Deleting data from production systems has always been a low priority for IT. Storage and data-management teams are routinely handed the responsibility for data retention and purging. Culturally, most IT managers shy away from data purging due to the unknown operational risks, unless a legitimate operational issue forces the topic.
For structured data (applications and databases), the "application genetics" between systems are completely unique. Even with a standardized database platform, the underlying data model implementations across an enterprise are unique to each system. As a result, each application tends to have unique data-retention and purge issues.
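To make the structured-data case concrete, here is a minimal sketch of what a per-application purge routine might look like, using SQLite for portability. The table name, column, seven-year retention period, and sample rows are all hypothetical; in practice they would come from the application's own data model and from a retention policy agreed with the business.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical example: an "orders" table subject to a 7-year retention rule.
RETENTION_DAYS = 7 * 365
cutoff = (date.today() - timedelta(days=RETENTION_DAYS)).isoformat()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders (order_date) VALUES (?)",
    [("1995-01-15",), ("1998-06-30",), (date.today().isoformat(),)],
)

# Count first, so the purge can be reviewed and logged before any data is lost.
(expired,) = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_date < ?", (cutoff,)
).fetchone()
print(f"{expired} rows past retention")

# Delete inside a transaction; a production routine would archive first.
with conn:
    conn.execute("DELETE FROM orders WHERE order_date < ?", (cutoff,))

(remaining,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
print(f"{remaining} rows remain")
```

The point of the count-before-delete step is organizational, not technical: it gives the application owner a reviewable number before anything irreversible happens.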
A lack of data naming and placement conventions, coupled with a lack of management discipline, defines the unstructured data landscape today. Unstructured data populates file servers, including user- and application-generated files in hundreds of unique formats. Unstructured data lacks contextual information (what's inside, is it of any value, can I delete it, etc.), so storage teams look for technical ways to leverage file-level metadata (age, owner, type, etc.) to achieve data purging. But this "bottom-up" approach to purging unstructured data often falls short of sustainable long-term results. Users often will not tolerate limited access to historical data, and in many environments "stealth processes" unveil themselves as operational disasters when unstructured data is purged.
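As a sketch of that metadata-driven approach (and of its limits), the following walks a file tree and reports purge candidates by age alone. The age threshold is illustrative, and the routine deliberately reports rather than deletes, because metadata alone says nothing about business value.

```python
import time
from pathlib import Path

# Illustrative threshold; a real threshold comes from retention policy.
MAX_AGE_DAYS = 5 * 365
CUTOFF = time.time() - MAX_AGE_DAYS * 86400

def purge_candidates(root):
    """Yield (path, age_days, size_bytes) for files not modified since CUTOFF.

    This only identifies candidates. Actual deletion should happen only
    after owner review, since file-level metadata (age, owner, type) cannot
    reveal whether a "stealth process" still depends on the file.
    """
    for path in Path(root).rglob("*"):
        if path.is_file():
            st = path.stat()
            if st.st_mtime < CUTOFF:
                age_days = int((time.time() - st.st_mtime) / 86400)
                yield str(path), age_days, st.st_size

# Usage (a review report for data owners, not a deletion script):
#   for path, age, size in purge_candidates("/shares/dept"):
#       print(f"{path}\t{age} days old\t{size} bytes")
```

Keeping identification and deletion as separate steps is what turns a risky "bottom-up" purge into something the business can sign off on.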
In terms of archiving and purging, e-mail has matured the most rapidly. Exchange and Notes environments have scaled to a point where storage infrastructure costs, performance, and daily backup processes are strained by the volume of historical e-mail. The data volume issues coupled with legal issues have led the industry to develop a wide variety of solutions for e-mail archiving and purging. Nevertheless, many IT shops are only archiving e-mail to a less expensive tier of storage, but are still unwilling to permanently purge e-mail due to legal or operational reasons.
Legal teams traditionally work at a distance from IT operations, but are increasingly driving ad-hoc policy decisions for data retention and purging. The average enterprise is faced with dozens to hundreds of pending litigation events. The demands of e-discovery and associated legal hold orders are changing the profile of the legal department from a standard business group to a power-user of enterprise data. From a legal perspective, all forms of data are discoverable (primary data, secondary data, copies, etc.), so overall data retention directly impacts how much data is available for discovery and the amount of effort required to actually find what data is relevant to proceedings.
The December 2006 changes to the Federal Rules of Civil Procedure (FRCP) clarify how companies must perform e-discovery for all forms of electronic data. Essentially, the rule changes dictate how data is identified, discovered, and made accessible to the court through pre-trial proceedings. Data purged as part of routine operations is not subject to FRCP sanctions. The problem is that most companies lack a data-retention policy or any operational data-purging practices.
Legal teams face the double-edged sword of litigation risk (keeping too much data) versus regulatory risk (not keeping enough data) and often appear "non-committal" to any clear legal decision (and for good reason!). IT, on the other hand, is busy managing applications, data, and infrastructure and often has little insight into the legal risks associated with the data they manage. This lack of connection between IT and legal is common across most industries and often leads to IT managers taking responsibility for determining data-retention policy based on technology capabilities alone.
Deleting production data is a complicated business. Technical considerations must be balanced against business needs. The issues associated with legal, compliance, and operational risks are often ambiguous, and few organizations have a process to accommodate a web of requirements for data retention.
Scope and risk
To begin with, define scope in terms of data types (structured, unstructured, and messaging). Next, assess organizational priorities to determine which types of data require the initial focus. Some organizations will execute a formal risk assessment, and others already know where to focus, based on operational pains (litigation costs, e-discovery events, cost issues, etc.).
Data purging for the enterprise must start with the business. You should begin by understanding requirements and defining data-retention policies. This must involve cross-organizational teams (legal, compliance, records, IT, LOB). An enterprise data-retention program should consist of two key elements: a data-retention policy and technical standards. The policy outlines the "rules" of the organization and may include specific requirements for data retention by information type. Technical standards outline the standard methods, tools, and approaches used by IT to retain and purge data.
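One way to make the "technical standards" half concrete is to encode the retention schedule as data that purge tooling consults, rather than hard-coding periods into each script. Everything here, the type names and the periods, is illustrative; the real schedule is set by legal, compliance, records management, and the lines of business, not by IT alone.

```python
from datetime import timedelta

# Hypothetical retention schedule by information type.
RETENTION_POLICY = {
    "email":             timedelta(days=3 * 365),
    "financial_records": timedelta(days=7 * 365),
    "hr_records":        timedelta(days=6 * 365),
    "scratch_files":     timedelta(days=90),
}

# Conservative fallback: unclassified data is kept, not purged.
DEFAULT_RETENTION = timedelta(days=7 * 365)

def retention_for(data_type: str) -> timedelta:
    """Map a classified data type to its retention period.

    Unclassified data falls back to a long default rather than being
    purged, mirroring the "when in doubt, keep it" posture legal teams
    tend to prefer until a policy decision is made.
    """
    return RETENTION_POLICY.get(data_type, DEFAULT_RETENTION)

print(retention_for("scratch_files").days)  # → 90
```

Centralizing the schedule this way also gives legal one place to review and amend when regulations or litigation holds change.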
Next, build processes that fit your business and operational model. Accommodation of both new and existing data requirements must be considered, along with existing organization processes and management capabilities. In some environments, e-discovery and litigation hold processes are built as part of a data-retention program.
Finally, develop a pilot project, and improve upon the processes and policies that have been built. Committed people in well-defined roles make a pilot successful, so be sure to involve the organization and focus on key roles for data ownership. In general, structured data is owned by the business application owner (often from the line of business). Unstructured data is typically governed by department or group. Messaging data, while created by individual users, is often governed by usage type, department, or overall organizational needs.
Regardless of whether you use third-party tools to help identify data and apply retention policy or tackle the project manually, we advise an iterative pilot-based approach, starting small.
For example, an application-by-application approach allows you to build out and refine your method. By starting with a small application where the data is easily identified and relatively simple to purge, you can concentrate on building the relationships and a common understanding of the data-classification and purge goals, scope, roles, and methods with application owners, IT, management, and legal departments. Be sure to identify the data within your environment that needs to be purged. Good initial candidates include legacy applications, data warehousing, and reporting systems.
Continual process improvement never really ends, so now you should be ready to plan and deploy in the enterprise. Along the way, look for tools to make the processes more efficient, but don't look for tools to create the processes or clarify organizational needs, because they won't.
The vendor landscape for tools providing data-purging capabilities varies widely. The following table outlines the key characteristics of the general tools available for data retention and purging.
No single product exists to accomplish data retention and purging across messaging, unstructured, and structured data types. The major vendors (typically via acquisitions and partnerships) claim to have a solution for all data types, yet when we look beyond the marketing message, it is only accomplished via multiple, disparate technologies, many of which don't even have an integration road map. Smaller vendors tend to focus on specific data types and are capable of providing deeper value, but for a narrower spectrum of data.
We recommend that once policies and procedures are established, users take a methodical look at vendors' solutions by data type. The tools available on the market today vary widely, and evaluation and selection should be driven by requirements. A data-retention policy and associated technical standards are a great starting point to refine these requirements before you implement new technologies.
Data purging is a departure from existing culture in most IT organizations. While data purging may be one of the most important phases of the data lifecycle, it is usually ignored due to the perceived risks and associated complexities. Organizations can continue to "keep everything" with unmanaged cost and risk, or develop a strategy and implementation plan to manage enterprise data to the end of its useful life.
In doing so, we recommend starting by formalizing data-retention requirements and processes, and only then looking for enabling technologies to make those processes more efficient. This approach minimizes the risk of unnecessary technology expenditures or extensive process-redesign efforts, which usually fail when organizations try to wrap the business process around a tool. Enabling a process that links the organization's needs to data retention will, at a minimum, provide IT with clear requirements on what data is ready to purge.
James Brissenden is a senior strategy consultant with GlassHouse Technologies, an independent storage services firm. He has worked as the technical lead on a variety of disaster-recovery, data-protection, and business continuance projects and can be reached at email@example.com. John Merryman is principal consultant with GlassHouse and can be reached at firstname.lastname@example.org. His focus areas include data classification, data protection and retention, and emerging technologies.