Scale-out secondary storage gains traction

By Eric Burgener

With the explosion in primary storage capacity and growth rates, secondary storage is growing at an even faster rate. For many large enterprises, overall storage capacities will grow to hundreds of terabytes and beyond over the next several years.

Monolithic storage architectures have been predominant in enterprise environments, but the new scale requirements for secondary storage are not well-served by these architectures. When a monolithic array is outgrown, moving to next-generation technology requires a disruptive, forklift upgrade. The ratio of processing performance to storage capacity cannot be flexibly configured, and routine storage management tasks such as provisioning do not scale cost-effectively into the hundreds of terabytes. The data tsunami calls for a new "dense storage" architecture with very different performance, capacity, configuration flexibility, and cost-per-gigabyte characteristics than those offered by monolithic storage architectures.

Secondary storage platforms based on scale-out architectures have appeared over the last several years and offer significant promise in managing extremely dense storage environments much more cost-effectively than monolithic storage architectures. These platforms can support a number of different interface types, including VTL, NFS, and CIFS (NAS), as well as proprietary interfaces. Targeted primarily at secondary storage applications such as backup, disaster recovery (DR), and archive, scale-out secondary storage offers a compelling successor architecture that can scale performance and capacity independently to achieve very high small-block I/O performance and/or throughput and petabyte-size storage capacities at aggressive price points.

In this article, we'll look at a set of criteria you can use to evaluate these platforms, along with a brief overview of the competitive landscape.

What problems do they address?
There are several problems that scale-out secondary storage platforms address. First and foremost, end users are looking for a platform that can support capacities of hundreds of terabytes while requiring significantly less management overhead than monolithic platforms.

Second, they need a more flexible architecture that allows them to scale performance and capacity independently, so that they can configure platforms to provide the right mix of capabilities without significant over-buying or over-provisioning. And third, users are looking for much more aggressive price points for usable storage capacity than those offered by today's monolithic architectures.

Operationally, these needs can be translated into a short list of functional requirements that a platform must support to fit our definition of scale-out secondary storage. They must allow overall performance and capacity to be increased within the same single system image, allowing these resources to be transparently and independently added to existing configurations. In virtual tape library (VTL) environments, this translates to a single VTL image where single-stream throughput and/or overall capacity is scaled as additional nodes are added. In file-serving environments (e.g., NAS), this translates to either very large, clustered file system sizes or the presentation of a global unified namespace where overall performance and/or capacity is increased as nodes are added. The platforms must also meet stringent requirements for availability, data integrity, manageability, and cost.

Evaluation criteria
Massive scalability and high performance. Given the scale of data at most enterprises, scale-out secondary storage platforms need to offer linear performance scalability as configurations grow from terabytes to multiple petabytes. This implies the ability to scale capacity and performance independently without requiring disruptive, forklift upgrades so that users can configure extremely low latency and/or high-throughput configurations, combined with extremely high capacity, depending upon their application requirements. If data de-duplication is supported, look for systems that support global repositories that allow throughput to be scaled against a single de-duplication index as nodes are added to the system.

High availability. To provide the service necessary for 24×7 IT operations, the storage platform should have no single point of failure, take advantage of the inherent redundancy of grid or cluster architectures using active-active resource pairing, and support online maintenance, upgrades, and data migration. This should include a transparent ability to re-balance workloads as resources are added to or subtracted from a system, as well as an ability to easily establish replicated configurations for DR purposes. Look for platforms that provide high availability at both the system and the data level.

Data management tools. The secondary storage solution should simplify management tasks going forward by providing centralized management of a large number of distributed resources as a single system image. Also look for storage management capabilities such as snapshots, replication, and de-duplication. And because the issue of scale introduces some new complexities, consider whether you require logical data redundancy features that go beyond the protection offered by RAID 6. Look for innovative referencing algorithms designed to deliver high performance even when the back-end storage is operating in degraded mode due to a component failure, and self-managing capabilities that obviate the need for manual storage provisioning and load balancing when resources are added, deleted, or moved.

Data integrity. Expect at a minimum the types of data integrity safeguards you get from your primary storage subsystems, but look for features that go beyond this. If de-duplication is supported, look for strong protection against false positives (known as "hash collisions" in systems that use hashing when chunking data for de-duplication), which ensures that dissimilar chunks are never erroneously treated as duplicates, and for independent data verification algorithms that confirm data is reliably retrieved from the system (i.e., that when data is converted back into its original form it exactly matches the data originally written to the system).
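The collision safeguard described above can be sketched in a few lines. This is a hypothetical illustration, not any vendor's implementation: the ChunkStore class, its use of SHA-256 fingerprints, and the byte-for-byte comparison are all assumptions made for the example.

```python
import hashlib

class ChunkStore:
    """Toy de-duplicating chunk store with two integrity safeguards."""

    def __init__(self):
        self.index = {}  # fingerprint -> chunk bytes

    def store(self, chunk: bytes) -> str:
        fp = hashlib.sha256(chunk).hexdigest()
        existing = self.index.get(fp)
        if existing is not None:
            # Guard against a false positive (hash collision): only treat
            # the chunk as a duplicate if the bytes really are identical.
            if existing != chunk:
                raise RuntimeError("hash collision detected for " + fp)
            return fp  # genuine duplicate; nothing new is written
        self.index[fp] = chunk
        return fp

    def retrieve(self, fp: str) -> bytes:
        chunk = self.index[fp]
        # Independent verification on read: recompute the fingerprint and
        # confirm the retrieved bytes match what was originally written.
        if hashlib.sha256(chunk).hexdigest() != fp:
            raise RuntimeError("data corruption detected for " + fp)
        return chunk
```

Production systems spread this logic across nodes and a global index, but the two checks shown (byte comparison before discarding a "duplicate," and re-verification on retrieval) are the essence of the safeguards the text describes.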

Affordability. Evaluating the overall affordability of a platform goes beyond just comparing the $/GB acquisition cost; it also includes evaluating ongoing management costs, the platform's ability to integrate with existing secondary storage processes, and whether it supports heterogeneous server and storage hardware as well as standard interfaces (or requires proprietary components or interfaces available only from a single vendor). Understand the cost differences among raw capacity, base capacity, and usable capacity: raw capacity is the total physical capacity of the drives; base capacity is what remains after RAID overhead is taken into account; and usable capacity is the effective capacity after data de-duplication has been applied against base capacity. Support for de-duplication helps make scale-out secondary storage particularly enticing from an economic point of view: a storage platform with an acquisition cost of $10/GB for base capacity that supports a 10:1 data reduction ratio offers usable capacity at $1/GB, a price point close to that of tape but with all the performance and reliability advantages of disk.
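The raw/base/usable arithmetic above can be made concrete with a back-of-the-envelope sketch. The raw capacity, RAID group size, and pricing below are illustrative assumptions chosen to match the article's $10/GB-base, 10:1 example, not vendor figures.

```python
# Raw -> base: RAID 6 dedicates 2 parity drives per group
# (a 12-drive group is assumed here for illustration).
raw_gb = 120_000                # 120TB raw capacity purchased
raid_overhead = 2 / 12          # fraction of raw capacity lost to parity
base_gb = raw_gb * (1 - raid_overhead)        # 100,000 GB after RAID

# Base -> usable: apply the de-duplication reduction ratio.
dedup_ratio = 10                # 10:1 reduction, per the article's example
usable_gb = base_gb * dedup_ratio             # 1,000,000 GB effective

# Cost per GB at each level.
cost_per_base_gb = 10.0         # $10/GB of base capacity, per the article
system_cost = base_gb * cost_per_base_gb      # $1,000,000
cost_per_usable_gb = system_cost / usable_gb  # $1/GB usable
```

The same system thus looks ten times cheaper per gigabyte when priced against usable rather than base capacity, which is why vendors and buyers must agree on which capacity figure a quoted $/GB refers to.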

Vendor/product overview
Vendors in this space often focus on a particular secondary application, positioning their products around either backup/DR or archiving. Five of the products -- EMC's Disk Library, FalconStor's Single Instance Repository (SIR), Hewlett-Packard's Virtual Library System, IBM's TS7650 ProtecTIER De-duplication Gateway, and Sepaton's ContentAware DeltaStor -- support VTL interfaces and are focused on backup. ExaGrid's EX Series also focuses on backup with a platform that supports NFS and CIFS (but no VTL interface).

Another group of offerings -- including Isilon's IQ Series and NEC's HYDRAstor -- position their products for use across multiple secondary storage applications simultaneously. These offerings leverage the inherent redundancy of scale-out architectures to provide high availability, and support SATA disks at a minimum, with some supporting additional disk options. Many of the vendors that target the archive space also employ their own RAID-6 implementations that can sustain three or more concurrent disk failures and perform single-disk rebuilds without impacting application performance.

Yet a third group -- Active Circle's Active Circle, Caringo's CAStor, EMC's Centera, Hitachi Data Systems' (HDS) Content Archive Platform, HP's Integrated Archive Platform, Permabit's Data Center Series Enterprise Archive, Sun's StorageTek 5800, and Tarmin Technologies' GridBank -- all focus on archiving.

VTL-based backup solutions
There are five scale-out secondary storage platforms that support VTL interfaces, all based on VTL appliance models that support a broad set of tape management features such as export, consolidation, and shredding. Four of these -- EMC's Disk Library, FalconStor's SIR, HP's Virtual Library System, and Sepaton's ContentAware DeltaStor -- use post-processing de-duplication approaches and support multiple nodes, global repositories, and heterogeneous storage. The fifth offering, IBM's TS7650 ProtecTIER De-duplication Gateway, uses an inline de-duplication approach, but also supports multiple nodes, a global repository, and heterogeneous storage.

HP and Sepaton offer their own RAID-6 implementation, whereas EMC, FalconStor, and IBM all leverage any RAID capabilities in the underlying storage arrays. All five products can be used in DR configurations as well, with EMC, FalconStor, HP, and Sepaton offering their own replication capabilities, while IBM leverages the replication capabilities of the underlying disk array. These products all support a wide range of enterprise backup software packages, although the HP and Sepaton products must be specifically qualified with supported backup applications. The EMC, FalconStor, and IBM products can be used with any enterprise backup software. All products scale to support raw capacities in at least the hundreds of terabytes range, allowing them to support usable capacities in the multi-petabyte range.

File-based backup solutions
ExaGrid is focused exclusively on backup with its EX Series, a grid-based platform that uses a post-processing de-duplication approach with a global repository. The platforms support RAID 6 and the NFS and CIFS NAS protocols. Like many products that use post-processing, the EX Series retains the latest backup in its original form to support extremely fast restores, but "capacity optimizes" all older backups. ExaGrid is unusual among the vendors in this article in that the company focuses on small and medium-sized enterprises, with its scale-out secondary storage platform supporting configurations of <100TB, while the other vendors focus primarily on large enterprises.

FalconStor's VTL-SIR global de-duplication performs de-duplication at remote offices using VTL appliances and provides enterprise-wide de-duplication at the data center.

NEC's HYDRAstor is positioned for both backup and archive use, leveraging an inline de-duplication approach with a global repository, and support for RAID 6+ and NFS/CIFS. Isilon's IQ Series is even more broadly positioned, addressing both primary (massively scalable primary file services) and secondary storage environments (backup and archiving). 

Isilon's OneFS file system offers a single global namespace and an enhanced RAID-6 implementation that can handle up to four concurrent failures, and supports a variety of standard protocols, including NFS, CIFS, HTTP, FTP, and NDMP, among others.

ExaGrid, Isilon, and NEC all support independent data verification checking, snapshots, replication, and intelligent self-management that automatically re-balances workloads across new performance and/or capacity resources as they are added.

Active archival storage
Disk-based archives are referred to as "active archives" and can provide much faster response with lower total cost of ownership (TCO) than tape-based archives. Scale-out architectures combined with storage capacity optimization (SCO) technologies such as single instancing or de-duplication offer the most aggressive $/GB available on active archival storage platforms.

In archives, a given file may be the last copy retained by an enterprise, so data reliability is particularly important. That's why many of the vendors in this space support logical data redundancy approaches that can sustain three or more concurrent disk failures without impacting data availability, and can perform disk rebuilds without impacting application performance. Generally, these products support NFS, CIFS, WebDAV, and HTTP, although EMC's Centera (the industry's original content-addressable storage, or CAS, product) and HP's Integrated Archive Platform support their own proprietary APIs for loading data into the archive. Many vendors promise future support for SNIA's eXtensible Access Method (XAM), an emerging standard for archiving.


Active archival products come in two flavors: integrated hardware/software platforms that are designed to be used in conjunction with popular archiving software, and software-only solutions that leverage commodity hardware to create an archiving platform.

Active Circle and HDS present NFS and CIFS interfaces like the other vendors, but they are unique in presenting a global unified namespace, whereas the other vendors present single, very large file systems (on the order of 1PB or greater).

Active Circle, Caringo, and Tarmin sell software intended for use with heterogeneous server and storage hardware, while the EMC, HDS, HP, Isilon, NEC, Permabit, and Sun offerings are integrated hardware/software platforms. Sun's StorageTek 5800 is the only archiving system built around open-source software. All vendors, except Isilon and Sun, support some form of native SCO technology, with Active Circle, Caringo, HDS, and HP supporting single-instancing at the file level, and NEC and Permabit supporting de-duplication technology at the sub-file level. Some of these products offer tape exporting capabilities that may be of interest to companies leveraging tape within the storage hierarchy.


In terms of archive functionality, several approaches exist. Products from vendors such as HDS, Permabit, and Sun are specifically targeted at archive environments, while products from vendors such as Isilon and NEC are more general-purpose storage platforms that allow users to stay with their existing archive software to migrate and manage data. Isilon's systems can be configured to support both primary and secondary storage applications at the same time, and to support automatic file migration between tiers. CAS products from vendors such as Active Circle, Caringo, EMC, and Tarmin, on the other hand, implement an object-oriented approach that treats files as objects, assigns unique identifiers to each object, and performs archive management tasks such as auditing, retention, encryption, and shredding, based on policies associated with each object.

The tipping point
Scale-out secondary storage architectures address many of the performance, capacity, and management limitations inherent in monolithic storage architectures, scaling to hundreds of terabytes and eliminating manual provisioning.

And you may be in for a pleasant surprise on costs: While entry-level configurations often start at more than $100,000, they can provide $/GB numbers an order of magnitude lower than those of conventional storage architectures. Assuming SATA disk and data reduction ratios in the 10:1 to 20:1 range -- very achievable in the backup world -- these platforms can often support three-year TCO numbers under $1/GB for configurations supporting 100TB+ of usable capacity.
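A quick sanity check shows how a six-figure system can still come in under $1/GB over three years. The acquisition price, annual operating cost, and usable capacity below are illustrative assumptions consistent with the article's claims, not actual vendor pricing.

```python
# Three-year TCO per usable GB (all dollar figures are assumptions).
acquisition = 110_000           # entry-level system, > $100,000
annual_opex = 10_000            # assumed support + administration per year
years = 3
usable_gb = 150_000             # 150TB usable (e.g., 15TB base at 10:1)

tco = acquisition + annual_opex * years       # $140,000 over three years
tco_per_usable_gb = tco / usable_gb           # ~ $0.93/GB
```

The de-duplication ratio does the heavy lifting: the same $140,000 spread over the 15TB of base capacity alone would be over $9/GB.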

There are thousands of these platforms running in production environments in large enterprises today, although solutions that support heterogeneous servers and storage tend to be newer than the integrated hardware/software CAS offerings. If the functional advantages of scaling performance and capacity independently and the management advantages of automated provisioning and load balancing don't provide sufficient appeal, then the TCO advantages may be enough to put you over the top. If you're dealing with hundreds of terabytes, or soon will be, you need to take a look at scale-out secondary storage platforms.
Eric Burgener is a senior analyst and consultant with the Taneja Group research and consulting firm.

TLC taps HYDRAstor for backup

TLC Engineering for Architecture, one of the largest structural engineering firms in the country, uses complex 3D workstations to design everything from tower blocks to airports. These huge digital drawings are the lifeblood of the company, and they were getting larger and larger over time.

TLC's existing tape-based backup system was having difficulty dealing with the rapidly expanding capacity required for data-protection operations. Completing backups within existing backup windows was a problem, and many restore requests for older drawings stored on tape off-site were taking too long to fulfill.

TLC began looking for a scalable system that would improve backup operations and decrease restore times while maintaining high reliability. The firm eventually chose NEC's HYDRAstor HS8 grid storage system.

"Previously, we had to restore data from tape backups, which was a lengthy process that could sometimes take up to two weeks," says Scott Ashton, a network engineer at TLC. "Deploying the NEC solution has significantly improved productivity for our company." 

Caringo gets accepted at Johns Hopkins

The Center for Inherited Disease Research (CIDR) at Johns Hopkins University provides genotyping and statistical genetics services for investigators seeking to identify genes that contribute to human disease. Building cost-effective, high-capacity systems that can accommodate the high rate of data growth generated by the detailed genetic analyses that form the backbone of their research is a constant challenge. When operating at full capacity, CIDR can generate as much as 2.5TB of new archive data per day.

"We're well funded, but we can't go out and buy a conventional, monolithic storage platform to do this," says Lee Watkins Jr., the center's director of bioinformatics.

Using commodity x86 servers and Caringo's CAStor software, CIDR built a scalable archiving platform that currently supports 31 nodes and 104TB of storage and provides ample room for modular, cost-effective growth over time.

Key purchase criteria for CIDR included a decreased cost of storage relative to conventional approaches, easy scalability with a very high-capacity growth path, and support for online expansion that would not impact production application performance.

This article was originally published on November 05, 2008