The amount of data that must be retained for long periods continues to grow, driven not only by regulatory and compliance requirements but also by business best practices. The overall growth rate of unstructured data is expected to reach 60% through 2014. This growth is fueled by the digitization of content in industries such as healthcare, media/entertainment and government, and by requirements to store this data for extended periods, often in multiple copies. IT organizations are looking for cost-effective ways to store information over the long term while retaining the ability to search for, locate, and retrieve it in a timely manner.
The economics of storage are such that when an archive reaches 100TB and data has to be kept indefinitely, the infrastructure requirements become more complex, and relying on spinning media alone becomes inefficient and costly. Across all industries, the need to store data efficiently while preserving its integrity is both a regulatory and a business requirement. The needs of organizations with large archives are:
Persistence, or the ability to store data in a system continuously without disruption due to system upgrades or data migration.
Cost efficiency. Although content may become static, the value of content does not decrease, resulting in a need to ensure data integrity and reliability without incurring costs associated with over-provisioning of performance.
Simplicity in managing the environment, automating tasks and delivering agility and flexibility in system design.
Scalability of the archive into petabytes and beyond.
An open environment that can easily adopt next-generation technology and deliver its benefits as soon as it reaches the market, whether that technology is denser storage, new storage interfaces, or new media.
The technology to address these requirements continues to evolve. Today, organizations looking to build large, unstructured file repositories or archives may select one of the already available approaches/solutions, each with its own benefits and challenges:
Scale-out, file-based storage systems are built with standard components and high-density enclosures and drives under a single namespace. These systems often use tiering, where data is moved from faster storage media to denser and slower media as access patterns change. These environments can scale into the petabyte range and have the advantage that data always resides on spinning media and is therefore always accessible to applications or users. The underlying file system, supported by an intelligent storage array, ensures redundancy of data and system availability. The challenge in using such a system is its cost and its potential impact on the physical environment (data center space and power consumption).
Scalable archiving systems use disk and removable media such as tape to create a large content repository. The interface to these systems is typically a standard network protocol such as CIFS or NFS. Data is initially placed on disk and is migrated to tape over time. The system tracks data on tape both in the library and on the shelf, making the location of the data transparent to users and applications. The use of tape enables the creation of a lower-cost archive with minimal impact on data center space or power consumption. The challenge in using tape, especially tape outside of a library, is that if the environment is not tuned regularly, it is prone to failure over time. When dealing with a large number of tapes, the manual tuning process can strain resources and become costly in personnel terms.
The recently announced open-source LTFS (Linear Tape File System), used with LTO-5 tape, allows an LTO-5 drive to be attached directly to a computer so that data can be copied or moved to the cartridge for storage. The cartridge is divided into two partitions: one stores the index and location of all data on the cartridge, and the other stores the data itself. Data is copied or moved to the tape drive using an LTFS plug-in. For smaller organizations, this offers a cost-effective way to create data archives using a native file format. A native format reduces dependence on third-party data-moving software during access or recovery. The challenges of such a solution today include scalability and manageability. Only one tape drive can be attached to a computer at a time, which is not operationally efficient for large environments. There is no centralized management framework that tracks the location of all data and can initiate tape retrieval. Nor does LTFS currently address the previously stated challenges of storing data on tape long term.
It has been common for organizations to create a backup tape and send it offsite as an archive. This practice continues today with slight modification. Some organizations are leveraging cloud storage services to create content archives outside of their primary data centers. Both approaches continue to be viable, but have challenges associated with cost, operational efficiency, and viability of the storage media.
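The tiering and disk-to-tape migration described in the approaches above come down to a policy that maps a file's access pattern to a storage tier. The following is a minimal sketch of such a policy, not the logic of any particular product: the tier names, the idle-time thresholds, and the `FileRecord`, `target_tier`, and `plan_migrations` names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical tiers and idle-time thresholds, ordered from fastest to densest.
# Real systems would take these from business policy, not from constants.
TIER_RULES = [
    (timedelta(days=30), "fast-disk"),
    (timedelta(days=365), "dense-disk"),
]
COLDEST_TIER = "tape"


@dataclass
class FileRecord:
    path: str
    last_access: datetime
    tier: str  # tier the file currently resides on


def target_tier(record, now):
    """Return the tier a file should live on, based on time since last access."""
    idle = now - record.last_access
    for max_idle, tier in TIER_RULES:
        if idle <= max_idle:
            return tier
    return COLDEST_TIER


def plan_migrations(records, now):
    """List (path, current_tier, target_tier) for every file on the wrong tier."""
    moves = []
    for record in records:
        wanted = target_tier(record, now)
        if wanted != record.tier:
            moves.append((record.path, record.tier, wanted))
    return moves
```

A scheduler could call `plan_migrations` periodically and hand the resulting list to whatever mechanism actually moves the data. Keeping the decision (which tier) separate from the action (the move) is one way to let the same policy drive disk, tape, or cloud targets.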
Across all solutions, additional technologies that automate management and reduce the data footprint will help resolve the challenge of storing increasing amounts of data for longer periods of time. Standardization of platforms and intelligent management of raw devices will further assist in creating and maintaining scalable, persistent content archives.
Intelligent, automated management, together with data classification and indexing, will work in concert with business policies to move, store, protect, and discard data.
Technologies such as data compression and deduplication reduce the data footprint in large file repositories by eliminating redundancies and white space, which increases data density and uses storage more efficiently without affecting data integrity.
Elastic architectures that can leverage a variety of target media, including cloud services, to enable data to be stored most efficiently without sacrificing simplicity, manageability or control will gain greater acceptance and traction.
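To make the footprint-reduction point concrete, the sketch below combines deduplication and compression in a toy in-memory store. It is an illustration under simplifying assumptions rather than how any archive product works: the `DedupStore` class is hypothetical, chunks are fixed-size (production systems often use variable-size chunking), SHA-256 digests identify duplicate chunks, and `zlib` stands in for the compression engine.

```python
import hashlib
import zlib


class DedupStore:
    """Toy content-addressed store: fixed-size chunks, SHA-256 keys, zlib compression."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> compressed chunk bytes (each unique chunk stored once)
        self.files = {}   # file name -> ordered list of chunk digests

    def put(self, name, data):
        """Split data into chunks and store only the chunks not seen before."""
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                self.chunks[digest] = zlib.compress(chunk)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        """Rebuild the original bytes from the file's chunk list."""
        return b"".join(zlib.decompress(self.chunks[d]) for d in self.files[name])

    def stored_bytes(self):
        """Total bytes actually held after deduplication and compression."""
        return sum(len(c) for c in self.chunks.values())
```

Storing two versions of a file that share most of their content adds only the chunks that differ, and `get` returns each version byte for byte, which is the sense in which the footprint shrinks without affecting data integrity.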
Noemi Greyzdorf is a research manager with IDC.