Part 2: Managing fixed content with CAS

Posted on September 01, 2005


In the second part of a two-part series, we look at content-addressed storage (CAS) products from some of the founding members of the CAS community.

By VS Joshi

In the first article in this series (see August 2005, p. 36), we covered the differences between archiving and backup, highlighting the differences between “active archival” storage systems based on disk, and “passive archival” systems based on tape and optical media. We also examined why certain active archival systems, using content-addressed storage (CAS) implementations of object-based storage, are rapidly gaining traction. (CAS is merely one implementation of object-based storage. For more information, see “Object-based storage: Making disks smarter,” InfoStor, April 2005, p. 40.)

In CAS systems, the content itself becomes the fundamental organizing principle for how data is stored and addressed. To achieve this goal, CAS vendors use hashing algorithms (MD5, SHA-1, SHA-256, etc.) to create cryptographic hashes of the object data, which are then used as part of unique identifiers for the data objects. The hash value contained in this identifier constitutes a unique mathematical expression of the content. With even the slightest change in the content, the hash value changes, generating yet another unique identifier. Although not every CAS system uses the hash value directly as the handle for data access, the hash values of data objects are used as part of the addressing scheme. This is in part why some vendors use the terms “object-based storage” and “content-addressed storage” interchangeably, depending on their marketing position.
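The core idea can be sketched in a few lines of Python. This is an illustrative sketch only; a real CAS platform wraps the hash in a larger identifier along with metadata, and the function and variable names here are assumptions, not any vendor's API:

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive a content address from an object's bytes (SHA-256 here)."""
    return hashlib.sha256(data).hexdigest()

a = content_address(b"quarterly report, final")
b = content_address(b"quarterly report, final!")  # one character changed

print(a == b)  # False: any change to the content yields a new identifier
```

Because identical content always hashes to the same address, the same scheme also gives a CAS system single instancing for free: a second write of the same object maps to an address that is already stored.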

Another salient feature of most active archival storage systems is their use of a redundant array of inexpensive nodes (RAIN) architecture, which combines standard off-the-shelf storage servers, inexpensive ATA or SATA disks, and networking hardware with management software. RAIN architectures offer data storage and protection systems that are distributed, shareable, scalable, and less expensive than traditional architectures. This allows applications to be cost-effectively deployed across a grid of devices that are highly available, self-managing, and self-healing.

RAIN nodes are physically interconnected using standard IP-based LANs. The key component of RAIN architectures is the software, which allows nodes to continuously communicate their identity, status, capacity, performance, and health status information among themselves. As more storage nodes are added, RAIN management software automatically detects the presence of new nodes on the network, enables self-configuration of new nodes, and automatically manages load balancing across all the nodes (see figure).


In a RAIN architecture, when a new storage node (SN) is added, the system automatically detects the new node, performs load balancing across all the nodes, and transfers the authority for certain data to the new node. Similarly, when a node fails or is removed, the system automatically reconfigures, and the data that resided on that node is regenerated on the remaining surviving nodes.

RAIN nodes do not require immediate replacement when disks or nodes fail because data is automatically replicated among the surviving nodes in the grid. If a particular node becomes non-functional or is removed from the system/network, RAIN software detects that, and all data on the non-functional node is regenerated (from the mirrored/parity copy) on the remaining functional nodes. Because all these things happen automatically without the intervention of administrators, RAIN architectures significantly reduce management costs and are described as “self-healing” and “self-managing” systems.
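The self-healing behavior described above can be modeled roughly as follows. This is a toy sketch under assumed simplifications (two replicas per object, random placement, invented class and method names); real RAIN software is far more sophisticated about placement and rebalancing:

```python
import random

class RainGrid:
    """Toy model of a RAIN grid: every object is kept on two nodes,
    and losing a node triggers regeneration of the lost copies."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.replicas = {}  # object id -> set of node names holding a copy

    def store(self, obj_id):
        # Place two replicas on two distinct nodes.
        self.replicas[obj_id] = set(random.sample(sorted(self.nodes), 2))

    def fail_node(self, node):
        # The grid detects the failed node and drops it from the pool...
        self.nodes.discard(node)
        for holders in self.replicas.values():
            holders.discard(node)
            # ...then regenerates any lost copy on a surviving node.
            if len(holders) < 2:
                holders.add(random.choice(sorted(self.nodes - holders)))
```

A node added to the pool would analogously become a placement candidate on the next store; real systems also proactively rebalance existing data onto new nodes, as the figure above shows.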

Because archive data can potentially outlive at least three to four hardware generation changes, ease of data migration to new hardware becomes a key requirement for archival platforms. Archival platforms leverage grid/RAIN architectures and intelligent software to address those issues. The software enables migration to next-generation platforms non-disruptively, with little to no administrative intervention or forklift upgrades.

This article takes a look at CAS platforms and strategies from some of the founding members of the CAS community organization (www.cascommunity.org), including EMC, Hewlett-Packard, Sun, and start-up Permabit. Archivas, another start-up, is not a member of the CAS community, but it has a similar solution.

EMC

In 2002, EMC became the first large vendor to popularize the CAS product category via its Centera product line, which resulted from EMC’s acquisition of Filepool, a Belgian CAS software vendor, in 2001.

EMC has established itself as a leader in the CAS market with more than 1,000 customers and more than 200 ISVs writing to the Centera APIs. Although Centera started out initially as a CAS product with proprietary APIs, EMC has added another module that enables industry-standard NFS and CIFS access to the platform through gateway technology that EMC gained via its acquisition of Storigen.

Centera uses an enhanced version of the MD5 hashing algorithm and comes in two models: one for large enterprises and one for small to medium-sized businesses (SMBs). Both platforms have essentially the same features and functionality, differing only in the capacities supported. The high-end model can scale to more than 300TB in a clustered configuration; capacity can be scaled in eight-node increments. The smaller Centera model starts at 2.2TB and can be increased in four-node increments. Each storage node includes four drives, or slightly more than 1TB of capacity per node.

Each of the Centera models comes in a compliance edition (volumes cannot be deleted until the retention period is over) or a governance edition (volumes can be deleted by administrators). Centera also provides two methods of data protection: Content Protection Mirroring (CPM) and Content Protection Parity (CPP).

With the latest release of the CentraStar operating system, users can create storage pools in Centera platforms. With Centera Seek software, users with proper authority can easily search and retrieve the archived digital content in each of the virtual pools or in the entire Centera platform. EMC’s Centera Chargeback Reporter software allows internal utilization-based billing and reporting for each pool.

Hewlett-Packard

Until early this year, Hewlett-Packard offered an integrated CAS product in which e-mail archiving software and an active archival hardware platform together formed the HP Reference Information Storage System (RISS). HP subsequently separated the offerings: RISS is now the active archival storage platform, and Reference Information Manager (RIM) for Messaging is the e-mail archiving software. This enables the RISS platform to be integrated with other vendors’ e-mail archiving software via HP’s RISS APIs.

HP’s RIM for Messaging software integrates with Microsoft Outlook and Lotus Notes and is designed as an application connector to the RISS platform.

The RISS platform includes a front-end portal and back-end storage grid. It is a fault-tolerant system that archives files and e-mail messages; provides single instancing, full-content indexing, and date and time stamping; and is based on the HP StorageWorks Grid for rapid retrieval of records. Portal cells enable single instancing through the SHA-1 hashing algorithm. RISS also provides search capabilities for archived documents.

The RISS system is based on storage “smart cells.” Each cell has its own processor, storage (850GB per cell), and content indexing. The cells are mirrored for data protection. Adding smart cells increases capacity, compute power, and content indexing capabilities. Smart cells can be added non-disruptively, and new cells are automatically discovered and added to the existing storage pool. The RISS system can be broken into domains, with each domain having separate authorization, retention, data shredding, and backup policies.

Permabit

Permabit’s CAS platform provides an archival solution by combining the company’s Permeon software with industry standard hardware. Currently, Permabit has two primary offerings:

  • Permeon software (for Permabit’s OEM partners); and
  • Compliance Store, an integrated hardware/software archival storage system powered by Permeon software.

Permeon software can be accessed via industry-standard interfaces (e.g., NFS and CIFS). Content addressing is done in 64KB chunks: incoming objects are split into 64KB chunks, and each chunk receives a unique hash value computed with the SHA-256 algorithm, thus avoiding duplication at a very granular level.
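The chunk-level deduplication scheme described above can be sketched as follows. This is an illustrative sketch; the `store` function and the `pool` dictionary are hypothetical names, not Permabit's API:

```python
import hashlib

CHUNK = 64 * 1024  # content is addressed in 64KB chunks

def store(data: bytes, pool: dict) -> list:
    """Split data into 64KB chunks, hash each with SHA-256, and keep
    only one physical copy per unique chunk. Returns the list of chunk
    hashes that serves as the object's recipe for later reassembly."""
    recipe = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        h = hashlib.sha256(chunk).hexdigest()
        pool.setdefault(h, chunk)  # a duplicate chunk is stored only once
        recipe.append(h)
    return recipe
```

Two objects that share most of their content thus share most of their chunks in the pool, which is how duplication is avoided at a finer granularity than whole files.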

Permabit enables two kinds of write-once, read-many (WORM) volumes: Enterprise (volumes can be deleted by authorized administrators) and Compliance (volumes are non-deletable and non-changeable). The SHA-256 hashing algorithm is used to create the unique digital fingerprints, thus providing a high level of data integrity. Another differentiating feature is that the system scales in gradual increments of just one storage node. Permabit’s solution includes self-management, self-healing, high availability, and advanced replication features.

Sun

Although Sun is one of the founding members of the CAS community, its “HoneyComb” product, which uses object-based storage technology, is still under wraps. Sun’s pending merger with StorageTek adds another twist to Sun’s CAS strategy. StorageTek recently announced its IntelliStore platform (see “STK gets ‘smart’ about archiving,” InfoStor, July 2005, p. 45), and Sun may have to choose between the HoneyComb and IntelliStore products. A differentiator for IntelliStore is that it combines disk and tape in an integrated platform.

Conclusion

IT organizations are facing new challenges in archiving, including meeting compliance requirements. IT has to provide the necessary infrastructure to ensure the company keeps the right data for the right time periods, and that data is easily accessible and can be produced in a timely manner. However, today’s challenges cannot be solved with yesterday’s technology, so IT administrators should take a fresh look at disk-based active archival storage platforms that provide data integrity, data availability, rapid access, and lower total cost of ownership.

VS Joshi is an independent storage analyst. He can be contacted at vsjoshi@rcn.com.


NetApp takes a different approach

To address archival and compliance issues, Network Appliance has a different philosophy than the other vendors covered in this article. NetApp’s approach is not content-addressed storage (CAS) or object-based storage, and the company does not provide separate boxes for compliance and archival purposes.

Network Appliance’s primary storage systems (filers) and secondary ATA-based NearStore systems can both be leveraged for compliance and archival via two enabling software products: SnapLock and LockVault. These software products are essentially licensable features of NetApp’s Data ONTAP operating system and provide the immutability and permanence features necessary for compliance purposes.

For structured data (e.g., databases) and semi-structured data (e.g., e-mails), SnapLock software archives files using the CIFS and NFS protocols. Retention policies are assigned to files at the archive application level. There are two versions of the software: SnapLock Compliance and SnapLock Enterprise. The primary difference is that, with SnapLock Compliance, volumes cannot be deleted until the retention period is over, whereas in the Enterprise edition an authorized administrator can delete volumes. However, in both cases, files cannot be altered.

LockVault provides permanence features for unstructured data. LockVault backs up unstructured data and all future incremental changes to secondary storage and locks the backups for compliance retention, thus providing one architecture for backup and compliance. SnapMirror software can be used for data replication.

Currently, Network Appliance’s products do not provide single instancing capability, which is one of the salient features of CAS platforms. However, it should be pointed out that content addressing is not the only way of achieving single instancing, and NetApp may have different plans to facilitate this feature. Also, instead of using the hashing algorithms typically associated with object-based storage technology, NetApp ensures data integrity and availability through traditional RAID checksums and its RAID-DP (double parity) feature, which provides the ability to recover from two simultaneous drive failures.

