Data classification is the icing on the ILM cake

Call it what you want (ICM, IIM, etc.), but data classification and policy-based management tools will help you get a grip on burgeoning data stores.

By Michele Hope

Life used to be a little more black and white for enterprise data managers. They backed up data regularly, ensured it could be recovered, made sure enough storage capacity would be available for key applications, diagnosed and corrected slowdowns in system performance, and handled various requests for new storage.

While this set of tasks didn’t always translate into the easiest of jobs, the parameters around the role of managing data were somewhat well-defined. Did the backups complete? Were you successful at staging a mock recovery?

Although these aspects of the job remain, today’s enterprise data managers are increasingly being asked to also take on the role of custodian of their organization’s valuable information records. They are now being asked to view the data under their care as useful and sensitive information that may also need to be discovered and mined from time to time, classified based on key attributes, and managed or stored differently based on the class of data it represents.

This new focus on data classification and policy-based data management stems from any number of company initiatives, such as an information life-cycle management (ILM) project that seeks to move less-critical resources to lower-cost storage tiers and free high-performance storage systems. Legal requests for e-discovery and compliance regulations are also prompting the need to automatically search and discover local files containing certain types of distinguishing information. The emergence of disk-based backup and archival is also prompting some organizations to focus on making such processes more efficient by pruning existing data stores and removing certain classes of data (e.g., personal MP3s, duplicate PowerPoint presentations) that can take extra time and space to back up.

Emerging data classification and policy-based management software, mostly from a variety of start-ups, has since come to the aid of data managers charged with making sense of the data currently under their care.

Classifying the classification vendors

Storage industry analysts call this emerging classification and policy-based management market by a number of different names. For example, the Taneja Group’s senior analyst Brad O’Neill calls it “information classification and management” (ICM). In a report regarding Kazeon, one of the early entrants in this market, the Taneja Group defined ICM as “a class of application-independent software that utilizes advanced indexing, classification, policy, and data access capabilities to automate data management activities above the storage layer.” O’Neill says the following vendors are examples of ICM solutions: Abrevity, Arkivio, Enigma Data Solutions, Index Engines, Kazeon, Njini, Scentric, StoredIQ, and Trusted Edge.

Brian Babineau, research analyst at the Enterprise Strategy Group, calls this emerging space “intelligent information management” (IIM), which he defines as “a process and technology that allows users to understand the context of their data and utilize that context to take more discrete management actions.” Babineau includes many of the same vendors as O’Neill in this product category, along with CommVault, EMC, FAST Search & Transfer, and on the e-mail classification side, Orchestria and MessageGate.

Many of these solutions focus on helping organizations discover and classify their growing volumes of unstructured data (files of various types that might even include flat EML files). Some of the vendors, such as MessageGate, Orchestria, and Scentric, may assist with classification of structured (e.g., database) data as well.

Key differences among most of these offerings include whether the software lets you search and index the contents of files or just the metadata: the file-system attributes associated with the files (e.g., subject, author, date last saved, date last accessed). This is referred to as content search or content indexing versus metadata search or indexing. The difference can be likened to reading the information written on the outside of an envelope versus opening the letter to read what’s inside.
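
The envelope-versus-letter distinction can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the StoredFile record and the functions metadata_search and build_content_index are hypothetical names introduced here, and the files are modeled in memory rather than crawled from a file system.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredFile:
    # Hypothetical record type; real products track many more attributes.
    name: str
    author: str
    days_since_access: int
    body: str


def metadata_search(files, author=None, min_idle_days=0):
    """Envelope only: filter on attributes without reading file bodies."""
    return [
        f.name
        for f in files
        if (author is None or f.author == author)
        and f.days_since_access >= min_idle_days
    ]


def build_content_index(files):
    """Open the letter: map each word in a body to the files containing it."""
    index = defaultdict(set)
    for f in files:
        for word in re.findall(r"[a-z0-9]+", f.body.lower()):
            index[word].add(f.name)
    return dict(index)
```

A metadata search can answer "what has this author left untouched for six months?" without reading a single file body, while the content index answers "which files mention this keyword?" at the cost of reading and indexing every body.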

According to Taneja Group’s O’Neill, Arkivio and Enigma support metadata searching, although it’s possible to add content search and indexing functionality by integrating the software with some other file inspection software platform.

Many products, such as Kazeon’s IS1200 and StoredIQ’s Information Classification and Management Platform, operate as out-of-band appliances that crawl the network at specified times and can perform content or metadata searches based on keywords or specific pattern matching functions that allow users to locate such things as government-protected non-public information (e.g., social security numbers) embedded in files on the network.

Other solutions, such as those from Index Engines and Njini, operate from an in-band perspective, directly on the data path. In the case of Index Engines, the product performs fast indexing of the backup stream of data as it is automatically being sent to some type of backup storage media. Njini can search and index content before a file is even written to disk.

Choosing options

Other differences in these products include whether they require agents or can operate agent-less, the speed at which they are able to search and discover files (Kazeon claims to be able to process millions of files per day on a single node), the size of the indexes they maintain, what level of standard or industry-specific templates they have (StoredIQ has a strong healthcare focus), policy or rules customization capabilities, and how well they can scale to accommodate a broader enterprise need to index and maintain information surrounding multiple terabytes of stored data.

The entry point for most of these solutions is about $10,000 for a small deployment, although all are designed to scale and grow, using predominantly active/active clustering between the various system “nodes.”

Choosing the right solution encompasses a lot of factors, says Taneja Group’s O’Neill. “The mission of the device or solution is the first way to determine which direction to take,” he says. “If you’re just interested in doing ILM and migrating between tiers, nine times out of 10 you’ll be completely content just using metadata about the files that shows access patterns.”

ESG’s Babineau agrees: “You need to ask yourself why you are building this detailed inventory of your information assets. If you are just planning to tier your storage and put most recently accessed items on faster storage, then there’s no need to look at these more-sophisticated classification tools. Basic SRM [storage resource management] may do it for you.”

In contrast, O’Neill notes that organizations interested in making data-related decisions by delving into the inherent content value of the data itself will need to invest in technologies that perform a full index of both the metadata and content, so that the data can be classified based on its content value. These include environments that want to search content based on keywords or patterns in the document itself.

ESG’s Babineau offers the following example to explain this different level of need: “If you will be governed by privacy regulations and store credit card numbers across Excel files, you need to look at solutions that can scan, index, and identify natural language patterns and number patterns that look like a credit card number. This is where the more-sophisticated tools come into play.”
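
As a rough illustration of what such number-pattern matching involves, the Python sketch below finds digit runs that look like card numbers and then applies the Luhn checksum to weed out most false positives. It is an assumption-laden sketch, not how any product named here works: the CANDIDATE pattern, the 13-to-19-digit length range, and the function names are choices made for this example, and it scans plain text rather than parsing Excel files.

```python
import re

# Assumed pattern: 13 to 19 digits, each optionally followed by one space or hyphen.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return digit strings in text that look like valid card numbers."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The checksum step matters because a bare digit pattern would also flag order numbers, phone extensions, and part codes; a production tool would likely add context rules (nearby words such as "card" or "exp") on top of this.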

Taneja Group’s O’Neill also recommends investigating the state of the vendor under consideration, its relative maturity and number of deployments, as well as its road map for future product development that may also include the indexing and classification of structured content. “[Most vendors] are now focused primarily on unstructured and file content, but over the next year this focus will expand to a unified context of file and structured content. It’s important to know how the vendor’s product line will expand,” he says.

Making the right connections

Another key area of evaluation is the relationships and partnerships the vendor has been able to forge with more-established players in the e-mail, archival, database, and storage space, according to analysts.

Kazeon forged ahead of the pack here after securing an OEM agreement with Network Appliance last year that gives NetApp the right to resell Kazeon’s IS1200 appliance. Subsequent integration efforts have since allowed Kazeon to roll out more NetApp-specific offerings that include the company’s SnapSearch and Recovery functionality for easier search, management, and recovery of disk-based NetApp snapshots.

“You can safely assume that a year from now one of the hot topics will also be integration with e-mail and database environments in a seamless fashion. This will be the point at which relationships with large vendors become critical,” says the Taneja Group’s O’Neill.

He also notes Hewlett-Packard’s active interest in these tools and EMC’s ongoing intelligent information management (IIM) initiative to integrate core search, classification, and document management technologies from acquired companies Documentum, askOnce, and Acartus.

In a recent report on users’ needs and expectations regarding intelligent information management, ESG found a large percentage of users actually expected metadata and content indexing and search functionality to be embedded and available from within their e-mail and general archiving applications (see figure, above).

Ultimately, the question of where the classification function should reside will likely need to be answered. An ever-widening range of applications in the areas of enterprise search, knowledge management, ECM, business intelligence, and analytics also now offer their own extensive query and classification functionality.

Going tactical

Some users might be understandably wary of making a wrong move in such an emerging and overlapping market. Steve Carn, director of distributed storage operations at the UnumProvident insurance company, grappled with this issue as well before deciding to go with Arkivio’s auto-stor to help him classify and migrate data to various storage tiers.

“When we first looked at Arkivio,” says Carn, “there were a lot of vendors out there doing this type of work, and they were very new at it. I thought to myself, ‘Why spend half a million dollars with a top-tier company when the space is just going to improve?’ We decided to buy something now for this particular part of ILM, spend less money and make a tactical move, not a strategic move, that would give us the most bang for the buck and would last me at least three to five years until the market changes.”

Carn was able to use Arkivio to earmark and move about a terabyte of data off his company’s high-performance disk systems (based on the fact that it hadn’t been accessed in the last two years), delete duplicate files, and identify and delete data that fell outside of the company’s acceptable use policy.

For now, Carn and his team are content to focus on auto-stor reports that demonstrate the types of data the company currently has in place. They have decided to tread carefully when implementing policies that automatically decide what to do with the data once they find it. “We’ve just been trying to wade through how much data we had that nobody’s even touched for the past two years. We’re also working with the policy groups not to move any data without their approval,” says Carn. “The point we’re at now is really cost containment and taking snapshots of our data that will be useful for records management and security. I just wanted to find out information about our data, such as duplicate files. But I know the storage group isn’t going to determine what we do with that data.”

Dissecting the data

Another Arkivio user who is focused on figuring out what he’s got in storage first is David Sadowsky, SAN manager of Actel, a provider of field-programmable gate arrays (FPGAs). Running an all-EMC shop, Sadowsky and his team faced skyrocketing data requirements that were filling up their Clariion, Celerra, and Centera tiered storage environment at an alarming rate.

Sadowsky was in a conundrum. He didn’t want to keep buying new EMC Fibre Channel disk arrays at $18,000 to $20,000 for an added 2.2TB of Tier 1 storage, yet he knew he couldn’t expect busy end users to clean up their old data, either. After evaluating a few solutions, he opted for Arkivio’s software. “That night, the guys from Arkivio came in, installed the software, and dredged about 8TB overnight and the next day gave me reports about what was out there,” says Sadowsky.

What Sadowsky found the next day was surprising. “All of a sudden, I could see that about 70% of my front-line Fibre Channel disks had data on them that hadn’t been touched or accessed in 180 days,” he says. Sadowsky soon developed a policy that moved all of this data to the company’s Centera system, which he backs up only after performing a full migration.

With a data classification process based solely on file access, age, and size, he defined a policy that allowed Arkivio’s software to consistently look for any files larger than 1MB that had not been accessed for 180 days, then make those files eligible for migration. With that one exercise alone, Sadowsky was able to move 40% of his data off primary storage and reduce his tape usage and backup windows by 40% as well.
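
A policy of this kind reduces to a simple filter over file size and last-access time. The Python sketch below is illustrative only and is not Arkivio’s implementation: the names MigrationPolicy, is_eligible, and find_candidates are hypothetical, 1MB is assumed to mean 1,048,576 bytes, and the file system is assumed to record access times reliably (many are mounted with access-time updates reduced or disabled).

```python
import os
import time
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class MigrationPolicy:
    # Defaults mirror the thresholds described in the article.
    min_size_bytes: int = 1024 * 1024  # assumed meaning of "1MB"
    max_idle_days: int = 180


def is_eligible(size_bytes, last_access_epoch, policy, now=None):
    """True if a file is larger than the size floor and idle past the limit."""
    now = time.time() if now is None else now
    idle_days = (now - last_access_epoch) / SECONDS_PER_DAY
    return size_bytes > policy.min_size_bytes and idle_days > policy.max_idle_days


def find_candidates(root, policy, now=None):
    """Walk a directory tree and yield paths that the policy marks eligible."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if is_eligible(info.st_size, info.st_atime, policy, now):
                yield path
```

Note that the sketch only marks files as eligible, as the policy in the text does; deciding when and where to move them is a separate step.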

Sadowsky is also using Arkivio’s software to discover personal files stored on the system, such as MP3 and JPEG files. “Over time, I came to realize I had 53 million files on these data movers,” he explains. “Then, I had all these nice charts I could spin this way and that way to learn more about what was in there. It was like having a microscope that allowed me to look inside the file system.”

Proving compliance

Sometimes, the art of discovery is more than half the battle. Just ask Karen Johnson, HIPAA security official for St. Vincent Health. No stranger to preserving records for regulations, the hospital had been involved in basic classification, retention, and flagging processes to meet local and state regulations well before the emergence of HIPAA. When HIPAA came along, it just made the act of classifying data and protecting personal health information (PHI) more formalized.

“HIPAA privacy compliance says you must have a data classification policy, and within that policy you have to classify the data according to the mandates of HIPAA,” says Johnson.

Writing a data classification policy and sharing it with employees isn’t necessarily the hard part of this regulation, according to Johnson. “The other side is knowing how to implement the policy in an environment where the data grows exponentially, and where our common file space grows on an hourly basis,” she says.

While new systems and applications tend to have compliance issues ironed out ahead of time, all of the terabytes of legacy data that no one seems to have touched created a potential compliance issue for the hospital. Knowing that employees couldn’t be relied upon to perform much in the way of manual classification, Johnson and her team were on the lookout for ways to automate the process of tagging, classifying, and managing personal health data.

After seeing StoredIQ’s solution at a HIPAA conference, Johnson’s team asked the company to come in and demo the product. “They brought a box into our environment, pointed it at our common file server with close to a terabyte of files on it, and showed us how it worked. It wasn’t long before my small team witnessing the demo came running into my office and said, ‘You’ve got to see this!’ The next thing I knew, we were negotiating with them about how fast they could bring the system back,” she says.

The main thing that sold them on the product was the fact that StoredIQ’s solution was able to go in and look at data on the network without ever resetting file attributes such as last ownership and last access dates. “Not resetting attributes is a big deal in compliance, in terms of who last had access to a file and when they accessed it. It’s an important part of your security environment if you’re doing forensics and trying to put an investigative report back together because someone lodges a privacy complaint,” claims Johnson.

Johnson and her team also liked the fact that StoredIQ’s product was able to use natural language patterns to make connections between words like “registration,” which can be used for a variety of communications, and patient IDs and names. “Putting those types of things together makes up a restricted, confidential record under HIPAA security requirements, which is considered a protected health record,” Johnson explains.

Johnson and her team are still working out the exact process to follow with any protected health data the StoredIQ software discovers, but she looks forward to taking more-specific policy-based actions on the data as the company’s process evolves.

Michele Hope is a freelance writer specializing in storage, and a regular contributor to InfoStor. She can be contacted at mhope@thestoragewriter.com.

This article was originally published on May 01, 2006