Data classification lays the foundation

Posted on May 01, 2007


Classification is the foundation for larger projects such as information lifecycle management (ILM), tiered storage, and intelligent archiving.

By Michele Hope

When we profiled data-classification practices one year ago, we found many storage professionals using data-classification products to discover what the heck was taking up so much space on their G: drives. More often than not, they were surprised by what they found: terabytes of data that hadn’t been accessed in more than a year or sensitive, highly regulated customer information (such as credit card numbers and Social Security numbers) that had somehow made its way into various Office documents or flat files housed throughout their network.

We then asked these users what they planned to do with the data they had found. This is where many admitted they weren’t quite sure how best to move forward. Although they universally acknowledged the importance of data classification in such larger initiatives as information lifecycle management (ILM), tiered storage, and archiving, many users seemed to still be feeling their way through their organization’s political hierarchy when it came to progressing past the discovery phase.

Instead, we heard murmurs of upcoming “policy meetings” planned with compliance, security, legal, or key business application managers. According to IT managers, their discovery efforts had invariably shifted focus away from just the handling of data to the handling of information within the organization. Once that shift occurred, cross-functional groups needed to be consulted regarding the development of an appropriate handling policy for the different classes of data (or information) IT had subsequently discovered. This necessary “meeting of the minds” became a subsequent source of frustration for a few storage administrators wanting to move forward with archiving data and tiering their storage architectures.

When asked to identify his most significant storage-related pain points in a survey conducted by TheInfoPro research firm, one Fortune 1000 respondent expressed the following frustration with the process: “My biggest storage pain point is devising some way to archive or tier my storage in such a way that makes everybody happy.”

Members from disparate groups in the organization were often called in to hammer out specific data classification and handling rules that would dictate any subsequent manual or automated data movement or quality of service (QoS) levels to be associated with each class of data.
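As a rough illustration of what such handling rules might look like once agreed upon, the sketch below maps each class of data to a storage tier, an encryption requirement, and an archiving age. The class names, tiers, and thresholds are hypothetical placeholders, not rules reported by any of the users or vendors in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandlingRule:
    storage_tier: int                  # 1 = fastest, most expensive tier
    encrypt_at_rest: bool              # whether the class must be encrypted on disk
    archive_after_days: Optional[int]  # None = never archive automatically


# Hypothetical policy table a cross-functional team might agree on.
POLICY = {
    "regulated": HandlingRule(storage_tier=1, encrypt_at_rest=True, archive_after_days=None),
    "business": HandlingRule(storage_tier=2, encrypt_at_rest=False, archive_after_days=365),
    "stale": HandlingRule(storage_tier=3, encrypt_at_rest=False, archive_after_days=0),
}
DEFAULT_CLASS = "business"


def rule_for(data_class: str) -> HandlingRule:
    """Look up the handling rule for a class, falling back to the default class."""
    return POLICY.get(data_class, POLICY[DEFAULT_CLASS])
```

Keeping the rules in one table separates the policy decision, which belongs to the cross-functional group, from the data movement that IT later automates.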


While policy development efforts were underway on an enterprise level, a few other intrepid IT users we interviewed still chose to move ahead with their own somewhat “covert” operation: a basic hierarchical storage management (HSM) level of categorization, data movement, and archiving typically based on a file’s (or an e-mail’s) last access date.
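A last-access sweep of this kind can be sketched in a few lines. The example below is a minimal illustration, not any vendor's implementation: the function name and the 365-day default are assumptions, and it only reports candidate files rather than moving them.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_stale_files(root, max_idle_days=365, now=None):
    """Return (path, idle_days) pairs for files not accessed within max_idle_days."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                last_access = path.stat().st_atime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            idle_days = (now - last_access) / SECONDS_PER_DAY
            if idle_days > max_idle_days:
                stale.append((path, idle_days))
    return stale
```

One caveat: many file systems are mounted with access-time updates reduced or disabled, so the last-access timestamp should be checked for reliability before it drives any archiving decision.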

Among these users, many believed the short-term gain in freed storage capacity, better management, and faster application performance was worth the initial effort. They also reasoned that any more-sophisticated classification policy rules could then be applied to the data once their organization reached a consensus.

For the IT environments that forged ahead, one key to the data-classification solution’s early success was often how well the application disguised the fact that an end user’s data had been classified and subsequently moved or archived to a new physical location.

Classification: Then and now

Since last year’s report on data-classification usage, progress has been made by both users and vendors. The end users we spoke to this year tended to be more targeted in the specific outcomes and objectives they expected to achieve than those in our prior report. As opposed to using just the “pure-play” data-classification solutions, these storage professionals chose other “hybrid” or “active-archiving” products that have integrated data-classification functionality as part of the overall product or suite.

These users often have up-front agreement about the power users to be involved in data classification and policy-setting and enlist such users to supply key inputs through the specific solution’s software interface. Interestingly enough, the users we spoke to also tended to work in some area of IT outside of storage.

These users didn’t tell us they wanted to classify their data, or even that they wanted to embark on an ILM initiative. Rather, they backed their way into the task of data classification in their effort to solve a very specific problem. In one case, the problem was how best to address data-at-rest security issues with personal financial information (PFI). In another case, the user was trying to contain database growth and improve on reporting and query performance by archiving older data.

Such targeted objectives geared toward security and archiving tend to correlate with the practices of other data-classification users. Brian Babineau, an analyst with the Enterprise Strategy Group, contends that there are three primary reasons for organizations to deploy classification solutions today. “They are either trying to control confidential information in the face of information privacy regulations, identify a subset of files and messages to support electronic discovery requests, or locate aged files and messages and move them from primary storage devices to lower-cost storage resources,” says Babineau.

Arun Taneja, founder and consulting analyst with the Taneja Group, sees a similar focus on the part of end users in the area he calls information classification and management (ICM). According to Taneja, “The biggest push from the user side is coming from either e-discovery or from some other compliance-oriented initiative in the company, or it’s coming from security.”

Taneja notes that the majority of solutions sold by data-classification vendors often seem targeted at the e-discovery market—for good reason. A company’s single e-discovery effort made without the aid of a data-classification solution can easily run into hundreds of thousands of dollars in paralegals, time, and resources. In contrast, an ICM product applied to the same task may be able to give you a return on investment (ROI) measured in just a few days. “We’re not even talking about weeks or months, but a few days! That’s how dramatic the ROI is,” says Taneja.

While e-discovery shows up as a strong motive for users performing e-mail archiving, the results of a few recent user surveys on data classification by research firms such as TheInfoPro and Peripheral Concepts tend to focus more on security, protection, archiving, storage tiering, and compliance as important factors to incorporate when you are classifying data (see figure, above, and figure on p. 28).

Progress on the vendor front

For vendors, the data-classification market is still relatively young, with a few looming giants. It’s also rife with several hungry start-ups that have been busy lining up strategic partners to help them secure accounts in both targeted vertical markets and larger enterprises.


Some solutions have moved from classifying and handling just one type of data (unstructured, semi-structured, or structured) to all types of data. In addition, many solutions go beyond basic discovery and classification based on meta-data alone. Instead, many now offer what Arun Taneja calls “deep dives” into the unique content of key files or e-mails.

Data-classification vendors cited most often by analysts include Abrevity, Arkivio, EMC (with its InfoScape product), Kazeon (Network Appliance resells Kazeon’s data-classification software), Index Engines, Mathon Systems, Njini, Scentric, and Symantec. On the e-mail classification front, ESG’s Babineau adds vendors such as MessageGate and Orchestria. (See vendor listing for a more complete lineup of data-classification vendors.)

Analysts also gave a nod to Google and FAST Search & Transfer (FAST), which have already made a name for themselves on the search side of the market and now seek to expand further into enterprise data classification. FAST, for example, partners with a wide variety of vendors for data classification.

ESG’s Babineau advises keeping an eye on vendors such as Microsoft and Oracle. “The application-centric vendors, especially those with applications that create a majority of enterprise content, including Microsoft and Oracle, want to participate in this market and should be watched,” says Babineau.

Over the next few years, analysts predict a maturing in the classification market that may involve further acquisitions or consolidation that will change the current mix of players. They also expect a shift to occur in users’ motivation and intended use of data-classification solutions.

While today’s users turn to data classification as a more reactive, externally motivated response to comply with what Babineau calls current governance, discovery, and privacy rules, analysts see tomorrow’s user of data-classification solutions shifting to more of an internally motivated focus on how their organization can effectively reuse the data they classify.

Despite such a lofty future, Taneja is the first to note that progress and maturity are still in the very early stages when it comes to most users’ ability to move past data classification into usage of such solutions’ policy and data movement engines. With the exception of those using vertical applications (such as ECM products like Documentum), “I’d be hard-pressed to find more than a few hundred installations where policy engines have actually been fed, and really strong extraction of information is being done on anything more than a prototype basis,” says Taneja.

A user’s perspective

One user we spoke to who was knee-deep in data classification was Terrence Griffin, with the Atlanta Postal Credit Union (APCU). After hearing industry experts talk about the importance of protecting both data-in-flight and data-at-rest, Griffin, vice president of information services for the credit union, started thinking about how best to protect sensitive data residing on company laptops in the event a laptop was stolen.

“I started to think about laptops giving out and things going missing and began to be more concerned about data-at-rest,” says Griffin. “I was most concerned about our member database and our members’ personal information.”

Griffin was especially concerned that such personal information might end up in the wrong hands after it had somehow made its way onto an employee’s laptop. The APCU keeps the majority of its more than 100,000 members’ account data in a 30GB database housed on the credit union’s IBM mainframe.

All data associated with the mainframe database application is automatically classified by Griffin as critical personal financial information (PFI) that must be adequately protected. Although he was comfortable with how well member account transactions were protected while still within the database application, Griffin knew he wanted to do more to protect this type of data so that it couldn’t leave the network or be viewed internally by the wrong people.

To help him identify how much PFI data was out there on laptops and file shares, Griffin began to look at two vendors that offered data-loss prevention and information security solutions for protecting data-in-flight and data-at-rest: Vontu and FiLink.

FiLink, one of the APCU’s security partners, had asked the credit union to beta-test a new solution, Compliance Protector, it had developed in conjunction with Scentric’s data-classification engine.

As part of the beta-test process on a random subset of computers, the solution took just 20 minutes to identify several security flaws in applications that had caused PFI member data to inadvertently remain on disk. “We found things that made us go, ‘Wow, we didn’t know that,’ ” says Griffin. “Some applications were caching things we weren’t aware of, then not destroying the cache when the application was closed.”

Griffin says they also discovered a lot of member database data in flat files or html records that he wanted moved to a secure server where laptop or desktop users could then link back to it. In this way, even users working from home would have to go back and retrieve that information from the secure server.

According to Griffin, Compliance Protector offers a database on what it calls a D3 server with extracts of what’s classified as PFI data. When the Scentric engine scans for secure data, it first uses the PFI criteria defined on the D3 server. “This stuff then needs to be moved to a secure server as soon as we find it,” says Griffin.
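The article does not describe how the PFI criteria are expressed, so the following is only a generic sketch of content-based classification: it flags text that contains something shaped like a Social Security number or a payment card number. The patterns, labels, and function names are assumptions for illustration and are not the Scentric or FiLink method. Card candidates are checked with the Luhn checksum to cut down on false positives from arbitrary digit strings.

```python
import re

# Illustrative patterns only; real criteria would be defined by the organization.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13 to 19 digits, optional separators


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify_text(text):
    """Return the set of PFI labels whose patterns appear in the text."""
    labels = set()
    if SSN_PATTERN.search(text):
        labels.add("ssn")
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            labels.add("card_number")
            break
    return labels
```

A file that returns a non-empty set would be a candidate for the kind of move to a secure server that Griffin describes. Pattern matching of this sort will miss some cases and flag others wrongly, so results need review.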

Griffin views Compliance Protector—and the Scentric engine—as a necessary addition to his arsenal of compliance tools. “We have secure e-mail through ZixCorp, Compliance Commander from Intrusion for data-in-flight, and Scentric for data-at-rest.”

ESG’s Babineau views Scentric’s approach to partnering with other solution providers (such as FiLink) as a means for customers to reap additional value when such providers go beyond data classification and help users perform other functions such as securing sensitive information, archiving certain records, or taking other actions with the data. This is especially true if the classification and information management solutions are integrated and tested, according to Babineau.

“Classification is the necessary first step in managing information more intelligently, but grouping the data is only the beginning,” Babineau explains.

“Users must be able to take discrete, specific actions against these subsets of information. Simplifying classification and information management into one solution is a step in the right direction.”

Classification and archiving

Another user who backed into data classification was Deborah Wosika, an application administrator at Helen of Troy Ltd., which markets and distributes personal care and household consumer products.

Of critical importance to the daily operations of the 700+-employee firm was the company’s main Oracle database application with modules including general ledger, inventory, order management, accounts payable, accounts receivable, and purchasing.


With the size of the database growing exponentially, and all data housed on the same production server, Wosika and her team had begun to notice efficiency lags and some slowdowns in performance when users attempted to run queries or reports against the database.

“Everything was in our main Oracle database, and all we were doing was increasing the disk space, which was not very efficient,” says Wosika. “Data was just going to keep growing unless we archived it off and reclaimed that space so that queries didn’t have to go through so much data and could run more efficiently.”

That’s the point at which Wosika and her team decided to go with Solix Technologies’ ILM solution, ARCHIVEjinni (which has since been renamed Enterprise Archiving), after researching a number of alternatives. Solix rose to the top of their list based on its ability to allow seamless access to the archived data. Also in Solix’s favor was the fact that ARCHIVEjinni was integrated with the 10 different Oracle modules Helen of Troy had in use, so that it was an easier effort to archive data on a module-by-module basis into another archive database.

Wosika, who also became the project manager for the Solix implementation, assigned key employees special responsibilities for accessing the archive data for specific modules. She also defined two sets of responsibilities: one that allowed the user to access the archive data separately, and a second that enabled the same user to access a merged view of both the current production data and the archive data.

Data classification entered into the picture once Helen of Troy made the decision to go with Solix. “That’s when we went to each user based on the Solix setup parameters and asked them what types of data should get archived,” says Wosika. “For general ledger, you have balances and you have journals you can archive. We decided to archive both after the current year plus two fiscal years.” For other modules, such as order management, the company chose to keep nine rolling months in the production database for all order types except consumer orders, for which it chose to keep only three months’ worth of data in production.
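The retention rules described here can be written down as cutoff dates. The sketch below is one possible reading, with two assumptions that the article does not confirm: fiscal years are treated as calendar years, and "rolling months" are counted back to the first day of a month. The module names and function names are hypothetical and are not ARCHIVEjinni parameters.

```python
from datetime import date


def months_back(day, months):
    """Return the first day of the month that is `months` months before day's month."""
    index = day.year * 12 + (day.month - 1) - months
    return date(index // 12, index % 12 + 1, 1)


def archive_cutoff(module, today, order_type=None):
    """Records dated before the returned date are eligible for archiving."""
    if module == "general_ledger":
        # Keep the current year plus two prior fiscal years
        # (assumed here to be calendar years).
        return date(today.year - 2, 1, 1)
    if module == "order_management":
        # Three rolling months for consumer orders, nine for all other order types.
        keep_months = 3 if order_type == "consumer" else 9
        return months_back(today, keep_months)
    raise ValueError(f"no retention rule defined for module {module!r}")
```

For example, on May 1, 2007, these rules would make general ledger records dated before January 1, 2005 eligible for archiving, along with consumer orders dated before February 1, 2007.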

Some data is archived monthly, other data annually. While an automated scheduler and automated policies are functions Helen of Troy could use with ARCHIVEjinni, the company chose to perform the discovery sweeps and subsequent archiving manually for now as a means to track changes made to the system.

One thing Wosika knows is that the initial archiving effort the company undertook was an eye-opener in terms of the sheer volume of rows it allowed them to process and archive. “I kept track of the time it took and the number of rows we archived off. It was almost 200 million rows, and it took just 105 hours to do it. It would have taken much longer if we had to do it manually,” she says.
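To put those figures in perspective, a quick back-of-the-envelope calculation gives the implied average throughput. It treats "almost 200 million" as a round 200 million, so the result is an upper estimate rather than a measured rate.

```python
rows_archived = 200_000_000  # "almost 200 million" rows, rounded up
elapsed_hours = 105

rows_per_hour = rows_archived / elapsed_hours
rows_per_second = rows_per_hour / 3600

print(f"{rows_per_hour:,.0f} rows/hour, about {rows_per_second:.0f} rows/second")
```

That works out to roughly 1.9 million rows per hour, or a little over 500 rows per second on average.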

Wosika also likes ARCHIVEjinni’s ability to “de-archive” if you make a mistake. “You can easily put the archived data right back into the production database if you want to. All the parameters that were archived off are right there on the screen,” she explains.

When asked about the data-classification role of a solution like ARCHIVEjinni, ESG’s Babineau offers some guidance. “Solix helps organizations classify structured data and keep the relationships between this information. It is fairly unique because they are one of a few vendors that can classify and then archive structured information,” he says. “One of the keys is maintaining the integrity of database information as it relates to the enterprise application feeding the database. In this case, classification and management action [archiving] are conducted with the same solution.”

Michele Hope is a freelance writer who covers enterprise storage and networking issues. She can be reached at


    Representative data-classification vendors

    • Abrevity
    • Arkivio
    • CA
    • EMC (InfoScape)
    • Index Engines
    • FAST Search & Transfer
    • Google
    • Kazeon
    • Mathon Systems
    • MessageGate (e-mail)
    • NetApp (resells Kazeon)
    • Njini
    • Orchestria (e-mail)
    • Scentric
    • Solix
    • StoredIQ
    • Symantec
    • Zantaz (e-mail)

    Five best practices for data classification

    1. Create a cross-functional team (including IT, risk management, compliance, information security, and legal) to determine how a classification solution can be used.
    2. Identify a subset of corporate data that could present legal or security risks as the initial information to be classified.
    3. Evaluate at least three classification solutions, including one enterprise search vendor. Each product and associated indexing methodologies are different and may have varying benefits to your organization.
    4. Establish a budget for information classification; use the cross-functional team to fund it as many departments should benefit.
    5. At a minimum, implement tiered storage and rationalize an investment in information classification as a means to determine where to place your data.

    Source: Enterprise Strategy Group

    1. Identify the most time-critical and highest ROI application, and then focus on implementing that solution. This application is likely to be e-discovery, compliance regulations, or security.
    2. Look at products that deliver a solution to one application, but are “horizontal” in architecture.
    3. Design at the enterprise level, but implement in stages.
    4. Validate scalability of potential solutions, because many ICM solutions do not scale adequately.
    5. Remember that the industry is in the very early stages of ICM design and implementation, so it’s important to vigorously test potential solutions.

    Source: Taneja Group
