Manage Unstructured Data: File Analysis at Scale

Posted on April 22, 2015 By Christine Taylor


Corporate data is growing 35-50% year-over-year. Unsurprisingly, many businesses are finding their storage spend doubling, and storage is steadily rising as a percentage of computing infrastructure. In 2009, storage accounted for about 20% of the fully populated infrastructure. In 2015, that percentage has doubled, and it keeps on rising.

Managing all this storage is a particular challenge with unstructured data, which makes up 75% or more of most stored data. Why is the percentage so high? Consider Office files, email, SharePoint, videos, graphics, audio, cloud data, sensor data: a fast-growing universe of hard-to-manage information.

So What’s the Problem?

Active data is relatively visible and manageable on production systems, but the longer data ages, the less visible it becomes to users and processes. We call this aging, hidden data “dark data”: unstructured data whose characteristics or existence are essentially invisible to IT.

This data is spread across multiple storage repositories: networks, Exchange, SharePoint, Box, the cloud. It is difficult to know that it even exists without deliberately searching metadata across different storage systems. Even when searchers find it, it’s difficult to know its priority and access settings. At best dark data simply takes up valuable storage space; at worst it impacts security, retention policies, and business value – exactly what IT is responsible for protecting.

Increased expenditure is one unpleasant consequence. CapEx and OpEx scale up sharply with swelling storage. Today’s storage costs include buying and managing storage for file shares, Exchange, and SharePoint, all of which may be on-site or distributed across remote locations. Companies also buy services like third-party file sharing, applications delivered remotely, and cloud-based storage – or they let their employees do it with corporate data. All of this adds to the cost and complexity of storing data.

Fast-growing data also adds levels of complexity to unstructured data management. IT has data flowing in from sensors, customer-generated content, communications, exploration, and media. The sheer amount of data spread across many storage locations makes it hard to even find the data you’re looking for, let alone manage it or aggregate it to serve business processes like analysis and eDiscovery.  IT is not able to visualize and act upon widely distributed data across many different storage systems and applications.

Poor management also impacts data-driven business processes like eDiscovery and compliance. IT spends more time trying to locate information and more money storing it, and the cycle of rising expense and inadequate data management grows ever more vicious.

The Advent of Hyperscale File Analysis

There are ways to bring dark data into the light. This is where new, highly scalable file analysis platforms enter the picture. They can and do have eDiscovery, security, compliance, and analytics toolsets but are primarily focused on the fundamental challenge of managing files for efficiency, cost-effectiveness and business value.

The fundamental capability of these platforms is to discover and identify data by its characteristics. The deeper the ability to locate data by metadata and user access, the better. The next step is to classify the discovered data according to policy, then to act according to that policy. Finally, all of this must be capable of operating in a hyperscale storage environment. Three concerns drive these platforms:
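The discover, classify, and act sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product works: the `FileRecord` type, the three labels, the age thresholds (365 and 1,095 days), and the action names are all hypothetical, and the sketch reads a single local directory tree rather than Exchange, SharePoint, or cloud repositories.

```python
import os
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    owner_uid: int
    modified: float  # POSIX timestamp of last modification


def discover(root):
    """Walk a directory tree and collect basic metadata for every file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            records.append(
                FileRecord(path, info.st_size, info.st_uid, info.st_mtime)
            )
    return records


def classify(record, now, stale_after_days=365, archive_after_days=1095):
    """Assign a hypothetical policy label from days since last modification."""
    age_days = (now - record.modified) / SECONDS_PER_DAY
    if age_days >= archive_after_days:
        return "archive"
    if age_days >= stale_after_days:
        return "stale"
    return "active"


def plan_actions(records, now):
    """Map each label to an illustrative action; nothing is moved or deleted."""
    actions = {"active": "keep", "stale": "tier", "archive": "review-for-deletion"}
    return [(r.path, actions[classify(r, now)]) for r in records]
```

Calling `plan_actions(discover("/some/share"), time.time())` (with `import time`) would return a list of path and action pairs. The sketch deliberately stops at a plan: a real platform would add content inspection, ownership and access data, and the approval and audit steps that make an action defensible.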

1. Data management. File analysis software automates storage management: tiers aging data for big cost savings, safely manages lifecycles including defensible deletion, locates and fixes orphaned data, and feeds data to business processes for analysis and governance. For example, tiering older data on less costly media frees up expensive production storage, which increases performance and saves money on scaling expensive production storage arrays. 

2. Defensibility. A lot of companies have a “delete nothing” policy for data because no one wants to be responsible for accidentally deleting a critical data set for the case of the century. But IT is responsible for high storage costs in a “delete-nothing” environment. Defensible deletion protects data decisions and saves on the high cost of storage. The information management platform will be able to automate deletion with or without approval layers. Defensible deletion also benefits the Legal and Compliance departments, who need not fear smoking gun data that should have been properly deleted.

3. Data security. The third concern is data security, specifically user access control. InfoSec teams watch the network perimeter while IT is responsible for data security. Data protection is of course part of this domain, and so is encryption. However, user access control is sloppy at many organizations because it is harder to integrate with large volumes of stored data. File analysis platforms can integrate file classification with Active Directory, and subsequently report and remediate security holes.

Best Practices

When evaluating a file analysis platform, look for these characteristics.




eDiscovery support

Ability to search and classify file-based data is integral for eDiscovery collections. Look for additional eDiscovery tools such as legal hold and modification tracking, and cost-effective processing into the review software.

Save money and time and lower risk in the collections phase of eDiscovery.

Defensible deletion

Defensible deletion rules use policies and metadata to uncover deletion candidates. The platform should offer options ranging from immediate deletion to multiple approval rounds. Scheduling and bulk deletion add to efficiency. Defensible processes maintain audit trails and detailed metrics.

Save CapEx and OpEx by deleting and migrating data off of storage. Easily defend deletion decisions against opposing counsel or investigators.
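As a rough illustration of approval rounds plus an audit trail, here is a minimal Python sketch. Every name in it (`DeletionRequest`, `AuditLog`, `approve`, `execute`) is hypothetical, the log lives in memory only, and the actual delete is passed in as `delete_fn` so the sketch never touches real files. It is not modeled on any vendor's implementation.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DeletionRequest:
    path: str
    reason: str               # policy rule that flagged the file
    approvals_required: int   # 0 means immediate deletion is allowed
    approvers: list = field(default_factory=list)


class AuditLog:
    """Append-only record of every decision, kept in memory for this sketch."""

    def __init__(self):
        self.entries = []

    def record(self, event, path, actor, detail=""):
        self.entries.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "event": event,
            "path": path,
            "actor": actor,
            "detail": detail,
        })

    def export(self):
        """Serialize the trail, e.g. to hand to Legal or an auditor."""
        return json.dumps(self.entries, indent=2)


def approve(request, approver, log):
    """Register one approval; the same person cannot approve twice."""
    if approver in request.approvers:
        return
    request.approvers.append(approver)
    log.record("approved", request.path, approver)


def execute(request, log, delete_fn):
    """Delete only when enough approvals exist; log the outcome either way."""
    have, need = len(request.approvers), request.approvals_required
    if have < need:
        log.record("blocked", request.path, "system", f"{have}/{need} approvals")
        return False
    delete_fn(request.path)
    log.record("deleted", request.path, "system", request.reason)
    return True
```

Setting `approvals_required=0` models immediate deletion, while higher values model multiple approval rounds. Because blocked attempts are logged as well as completed deletions, the exported trail shows not just what was deleted but that the policy was followed.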

Orphaned/Stale data


Stale data is aging data that no longer serves a business purpose; orphaned data is separated from its application and takes up storage room. Both data types are subject to automatic deletion and/or tiering depending on compliance requirements.

Save storage space for better performance and fewer purchases.



Identify sensitive documents containing protected health information (PHI) or personally identifiable information (PII) such as Social Security numbers, home addresses, or payment information. Launch policy-based actions accordingly.

Automatically comply with security regulations.
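To make content-based identification concrete, the following Python sketch flags text that contains something shaped like a U.S. Social Security number or a payment card number. The patterns and the `find_sensitive` function are assumptions for illustration only. The SSN check matches format alone, and the card check adds the standard Luhn checksum to cut down on false positives; a production classifier would need far more context than this.

```python
import re

# Format-only match for NNN-NN-NNNN; it does not verify the number was issued.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits):
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        n = int(ch)
        if i % 2 == 1:      # double every second digit from the right
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return total % 10 == 0


def find_sensitive(text):
    """Return the set of hypothetical sensitivity labels found in the text."""
    labels = set()
    if SSN_PATTERN.search(text):
        labels.add("ssn")
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            labels.add("payment-card")
    return labels
```

A policy engine could then map a non-empty result to an action such as restricting access or moving the file to a protected repository.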

File migration


Automatically migrate data matching specific classifications. Tier aging data for storage savings, shorten data migration projects, or move collected data into a protected repository.

Migrate aging data onto less expensive storage tiers for savings and improved production performance.

Typical User Scenarios

Scenario #1: An enterprise data center needs a defensible deletion tool across multiple repositories.

The storage administrator team in a complex data center was having trouble managing data lifecycles. The Legal department did not want to store potentially smoking guns for an indefinite period of time, and storage was consuming a large part of the budget.  However, IT needed to keep deletions defensible both for internal governance and for future litigation and compliance.

File Analysis Outcome: They invested in a file analysis application that allowed them to search and classify files across repositories by creation age, modification date, owner, and content. Some files could be marked for automatic deletion; others required approvals before deletion. Results were excellent: IT savings from SharePoint and Exchange alone were in the tens of thousands over a 3-year period.

Scenario #2: A government agency wants to store data to the cloud.

A second example is a government agency that made a case for storing data in the cloud. They also needed to cost-effectively manage on-premises file shares, SharePoint, and Exchange, and preferred to manage all their unstructured data with the same platform. They liked the scalability of the cloud, but security was a big issue, especially since their agency overseers were not convinced about the integrity of data in the cloud.

File Analysis Outcome: The agency bought a file management platform with strong governance capabilities to prove data security. The platform enabled the agency to defensibly audit access rights and data ownership across all repositories including the cloud, and let them apply rich classifications for defensible migration and deletion. The same platform also reported user access rights and remediated problems according to IT policies. 

Delivery Systems

There are two major product approaches to this scale of file analysis: storage-based, data-aware systems, and massively scalable software-based intelligence operating across different systems.

Storage-Based Data Awareness

This option bases data intelligence in the storage management layer. Storage vendors have added management features to their arrays for decades, and some of them are pretty sophisticated. Most of the big storage stalwarts work along these lines, including EMC’s SourceOne division, IBM InfoSphere and StoredIQ, and HP’s Intelligent Retention and Content Management platform.

Newer hyperscaled storage products combine CPU advances and flash with data awareness for analytics. Classification and analytics return information on the files stored on the device, including file attributes, patterns and searches. Analytics may be used for business value, to troubleshoot problems, or to optimize storage processing. Some of these products are discrete storage arrays; others aggregate different storage repositories under central storage management.

Newer products from Tarmin, DataGravity and Qumulo are highly scalable and data-aware. Tarmin offers content-based metadata indexing and applies a broad set of IT management, eDiscovery and analysis tools on distributed commodity storage. Qumulo offers general purpose NAS software with extreme scalability and real-time analytics support. DataGravity discovers expanded metadata including user access and has visualization tools to help interpret analytics. These vendors can be excellent choices for managing stored data for scale, analytics, IT management and business processes. They do however lock you into a specific storage vendor and do not act on third-party file sharing platforms or in the cloud.

Software-Based Information Governance

When IT has a highly distributed storage infrastructure that includes local storage, remote storage, third-party file-sharing applications, and cloud-based storage, it benefits from a software-driven product that discovers, classifies and acts on widely distributed unstructured data. A software product with native APIs for common business applications deploys quickly with a minimum of change to the existing storage infrastructure.

There are many ways to go with software-based information governance technology, and vendors must choose where to concentrate their development and marketing. These choices inform the distinctions between vendors.

One of the paths is Master Data Management (MDM), which constructs master data files for use in different processes. Ostia Portus works with both unstructured and structured files to transform them into master data for multiple uses. Reltio offers a cloud-based MDM service to classify and integrate data from on-premises and social media sources.

On the unstructured file management side, software products that discover, classify and act on dark data are becoming a popular choice for IT. One of the market leaders is Acaveo, whose Smart Information Server (SIS) centralizes operational intelligence for files located across on-premise, distributed and cloud data sources. We find that Acaveo’s deployment and management simplicity, cost savings, and tight integration with popular applications – including Exchange, SharePoint, Google 365 and Box – make it a leading choice in this market segment.


File analysis technology at scale is not easy to add to existing storage or information management products, given aging code bases and architectural limitations. This is why newer vendors like Acaveo and data-aware storage vendors are filling a data management vacuum. They are leading the charge to control big storage spends and to protect valuable data against compliance and security risks.

These highly scalable tools to manage unstructured data are available today. Don’t hesitate to take advantage of them.

Photo courtesy of Shutterstock.
