Managing fixed-content data with CAS

In the first part of a two-part series, we look at object-based storage, or content-addressed storage (CAS), and “active archiving.”

By VS Joshi

In June, Citigroup revealed that a box of tapes holding personal information on nearly four million American customers had been lost while in the care of United Parcel Service. In May, Morgan Stanley was hit with $604 million in damages after the firm failed to produce documents in time.

These are just two of the many incidents that have occurred in recent months. If this is the state of data management at two leading financial institutions, imagine the accidents and fines other companies are exposed to unless they change the way data is stored, archived, and managed.

Backup versus archive

Before we address the subject of “active archival” of fixed-content data, it’s important to clarify the difference between backup and archive, as many people use these words interchangeably.

Backup copies are secondary copies, or “insurance copies.” The purpose of backup is to maintain a copy of a volume or a file in case the original volume or file is deleted or corrupted. The backup copy is used to restore data to a point in time of the last-known good copy.

In contrast, after a certain point in time an archive is the primary copy in the system. Archiving is not a form of data protection, and it is not a replacement for backup. Archiving serves the following purposes:

  • Business re-use of data;
  • Preserving data for legal and compliance reasons;
  • Maintaining records for corporate memory; and
  • Better application performance and backup reduction as data is moved from the production system to an archival system.

As archive copies can be stored on disk, tape, or optical media, archival storage systems can be further classified into active and passive archival systems. Disk provides an active archival system, while tape and optical are passive archival systems (because of relatively slow access times).

The purpose of this article is to show why active archival storage systems using object-based or content-addressed storage (CAS) implementations are quickly becoming an optimal solution for archiving certain types of fixed-content data.

Transactional vs. fixed data

An organization’s data can be broadly categorized into the following two types:

  • Transactional data includes most data essential for the daily functioning of the business. This type of data is mainly found in databases and is generated via transaction-oriented applications such as CRM, ERP, etc.; and
  • Fixed data is data that doesn’t change. It is also called reference data and includes various types of content such as digital newspapers, audio and video files, digital photos, X-rays and MRI scans, check images, e-mail, instant messages, legal documents, etc.

Because transactional data is required for the daily functioning of the business, it has to be stored on high-end disk arrays. The cost of this expensive media can range from $50 to $75 per gigabyte.

On the other hand, after a certain time period of frequent access, fixed-content data is rarely accessed. However, whenever it is required, it is required immediately for business use or compliance, regulatory audits, or litigation purposes.

This usage and access pattern suggests that, after the period of heavy access, fixed-content data can be migrated to and archived on less-expensive media ($5 per gigabyte or less), which substantially reduces the total cost of ownership. However, passive archival systems may not serve the purpose because access to the fixed content is difficult and slow.
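As a rough illustration of the arithmetic, the sketch below compares media cost for a block of fixed-content data using the per-gigabyte figures quoted above. The 8,000GB volume is a hypothetical number chosen for the example, and the calculation covers media cost only, not the full cost of ownership.

```python
# Per-gigabyte media costs quoted in the article (US dollars).
HIGH_END_COST_PER_GB = (50.0, 75.0)  # low and high end of the quoted range
ARCHIVE_COST_PER_GB = 5.0            # "$5 per gigabyte or less"


def media_cost(capacity_gb, cost_per_gb):
    """Media cost of holding capacity_gb at a given price per gigabyte."""
    return capacity_gb * cost_per_gb


def migration_savings(fixed_gb):
    """Return (low, high) media-cost savings from moving fixed_gb of
    fixed-content data off high-end disk and onto archive media."""
    archive = media_cost(fixed_gb, ARCHIVE_COST_PER_GB)
    return tuple(media_cost(fixed_gb, c) - archive for c in HIGH_END_COST_PER_GB)


if __name__ == "__main__":
    fixed_gb = 8_000  # hypothetical volume of fixed-content data
    low, high = migration_savings(fixed_gb)
    print(f"Media-cost savings: ${low:,.0f} to ${high:,.0f}")
```

Under these assumptions the difference in media cost alone falls between $360,000 and $560,000.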

Compliance, audits, e-discovery

In the post-Enron world, regulatory bodies have become more active and demanding, and organizations are far more often required to produce archived data on short notice. Timely reproduction of archived data is critical in cases of litigation and regulatory audits. If this data is on tape or optical media (passive archival media), then the process of retrieving data is manual, slow, and inefficient, possibly exposing organizations to huge fines or at least bad publicity. The sheer cost of electronic discovery in response to litigation is strong motivation for organizations to reconsider their archival practices. Government regulations such as SEC Rule 17a-4, 21 CFR Part 11, Sarbanes-Oxley, and HIPAA dictate records retention and archival requirements for electronic communications, such as e-mail and instant messaging, and for other electronic data.

Business re-use of data

Having fixed-content data available online gives an organization the flexibility to use that data in ways not possible via passive media.

Besides security issues (due to removable media) and slow access speeds associated with tape and optical media, there are ongoing costs associated with maintaining the media, moving the media from on-site to off-site locations, and movement of the data from older hardware to newer hardware. In the case of tape, the serial access of data makes the retrieval process cumbersome and slow. In addition, large amounts of tape space are wasted on data that is not unique. Restoring from tape can take several days and can use hundreds of tapes, even assuming that all tapes are readable.

Access to older data stored on tape can be undermined by changes in the hardware used to read particular tapes. Not all organizations can afford the expense and disruption inherent in manually migrating older data to new tapes when new libraries, drives, or tape cartridges are introduced. This lack of media migration increases the risk that data will not be available if needed.

In short, a lot of manual activity and inefficiencies are inherent in archiving data on passive storage media, which leads to a strong case for active storage media for fixed-content archival.

However, to fully address the current compliance issues, a special implementation of disk-based active archival is essential. According to the International Data Corp. report, Regulatory Compliance: What role will technology play?, most of the current compliance mandates address one or more of the following requirements:

  • Information and process integrity: Requires airtight processes that support control over information processing, thereby ensuring data integrity;
  • Controlled access: The need to manage access to, or use of, specified information; and
  • Information retention: Encompasses management of indexing, retrieval, and storage of information retained for long-term archiving.

Object-based storage or CAS

A relatively new class of disk-based archival storage, built on object-based storage technology (or CAS) and inexpensive drives such as ATA, is rapidly gaining ground for active archiving of compliance-related data, where information and process integrity and information retention policies are critical issues. Some of the key features of these systems are the following:

Data integrity: Object-based storage differs from traditional file-based or block-based storage in that files, images, or data blocks are stored as individual objects, or elements of objects, with “global unique identifiers.” The identifier is computed from the content of the object by a hashing algorithm, so it depends solely on that content and acts as a digital “fingerprint” of the object. Recomputing the hash and comparing it with the identifier verifies that the object has not been altered; any change made to the object changes its hash and therefore yields a new global unique identifier. A retention period can be attached to an object, making it non-erasable until that period elapses. Conversely, an object can carry a destruction date so that it is automatically deleted on that date, which recovers storage capacity.
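The following minimal sketch shows how a content-derived identifier, integrity checking, and a retention period might fit together. It is an illustration only: the choice of SHA-256, the class name ArchiveStore, and its methods are assumptions made for this example and do not describe any vendor's product.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def content_address(data: bytes) -> str:
    """Derive an identifier from content alone (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


class ArchiveStore:
    """Hypothetical in-memory store keyed by content address."""

    def __init__(self):
        self._objects = {}       # address -> object bytes
        self._retain_until = {}  # address -> datetime before which deletion is refused

    def put(self, data, retention_days=0, now=None):
        now = now or datetime.now(timezone.utc)
        address = content_address(data)
        self._objects[address] = data
        expiry = now + timedelta(days=retention_days)
        # Never shorten a retention period that is already in force.
        current = self._retain_until.get(address, expiry)
        self._retain_until[address] = max(current, expiry)
        return address

    def get(self, address):
        data = self._objects[address]
        # Integrity check: recompute the fingerprint and compare it.
        if content_address(data) != address:
            raise ValueError("stored object no longer matches its address")
        return data

    def delete(self, address, now=None):
        now = now or datetime.now(timezone.utc)
        if now < self._retain_until[address]:
            raise PermissionError("object is still under retention")
        del self._objects[address]
        del self._retain_until[address]


if __name__ == "__main__":
    store = ArchiveStore()
    address = store.put(b"check image 0001", retention_days=7 * 365)
    print(address)
    print(store.get(address))
```

Because the address is a function of the bytes alone, storing an edited version of the object produces a different address and leaves the original untouched.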

Single instancing: Because the identifier is derived from content, identical data always produces the same identifier, so redundant copies of data or files are never created in the first place. If an e-mail with an attachment is sent to a mailing list of 100 people, conventional storage technology stores 100 copies of the same attachment. With object-based storage, or CAS, identical content maps to a single ID, and the object itself is stored only once, with 100 pointers to the stored object. This technique, called single instancing or coalescence, saves storage capacity.
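The mailing-list example can be sketched in a few lines. The class and method names here are hypothetical, and SHA-256 is again an assumed hash; the point is only that 100 references resolve to one stored object.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical store that keeps one copy per distinct content."""

    def __init__(self):
        self._objects = {}     # content address -> bytes, stored once
        self._references = {}  # reference name -> content address (the "pointer")

    def put(self, reference, data):
        address = hashlib.sha256(data).hexdigest()
        # setdefault stores the bytes only if this content is new.
        self._objects.setdefault(address, data)
        self._references[reference] = address
        return address

    def get(self, reference):
        return self._objects[self._references[reference]]

    def stored_objects(self):
        return len(self._objects)

    def reference_count(self):
        return len(self._references)


if __name__ == "__main__":
    store = SingleInstanceStore()
    attachment = b"quarterly-report.pdf contents"
    for recipient in range(100):
        store.put(f"mailbox-{recipient}/attachment", attachment)
    print(store.reference_count(), "pointers,", store.stored_objects(), "stored object")
```

Each recipient still retrieves the attachment through its own reference, but the capacity consumed is that of a single copy plus the pointers.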

Location independence: In most cases, object-based storage uses a redundant array of independent nodes (RAIN) architecture. Because objects can move transparently among the storage nodes in the RAIN architecture without any loss of protection or reduction of service levels, the data associated with the objects is location-independent. Location independence reduces the high cost of storage management because there are no shares or volume managers to administer, no LUNs to bind, and no zoning to configure.

Self-management and self-healing: A RAIN architecture provides self-protection and self-healing because each node is either mirrored to another node or parity-protected. Continuous background checking and recalculation enable the system to restore two copies of each object automatically after a node failure. When a new node is added, authority for some objects on the existing nodes is transferred to the new node, and the new capacity joins the storage pool automatically, with no intervention or provisioning by administrators.
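A toy model can make the self-healing behavior concrete. The sketch below assumes simple mirroring with two copies per object and a "least-loaded node" placement rule; both are assumptions for illustration, as are the class and method names. It does not model parity protection or the transfer of objects to newly added nodes.

```python
class Cluster:
    """Toy model of a node cluster that keeps COPIES replicas of each object."""

    COPIES = 2

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}  # node -> {object_id: data}

    def _least_loaded(self, exclude):
        # Pick the node holding the fewest objects; break ties by name.
        candidates = [n for n in self.nodes if n not in exclude]
        return min(candidates, key=lambda n: (len(self.nodes[n]), n))

    def put(self, object_id, data):
        holders = set()
        for _ in range(min(self.COPIES, len(self.nodes))):
            node = self._least_loaded(holders)
            self.nodes[node][object_id] = data
            holders.add(node)

    def holders(self, object_id):
        return sorted(n for n, objs in self.nodes.items() if object_id in objs)

    def get(self, object_id):
        # Callers ask for an object by ID, never by node or path.
        for objs in self.nodes.values():
            if object_id in objs:
                return objs[object_id]
        raise KeyError(object_id)

    def fail_node(self, name):
        del self.nodes[name]
        self.heal()

    def heal(self):
        """Re-create missing replicas from the surviving copy."""
        object_ids = sorted({oid for objs in self.nodes.values() for oid in objs})
        for oid in object_ids:
            current = set(self.holders(oid))
            while len(current) < self.COPIES and len(current) < len(self.nodes):
                target = self._least_loaded(current)
                self.nodes[target][oid] = self.get(oid)
                current.add(target)


if __name__ == "__main__":
    cluster = Cluster(["n1", "n2", "n3"])
    for oid in ("a", "b", "c"):
        cluster.put(oid, oid.encode())
    cluster.fail_node("n1")
    for oid in ("a", "b", "c"):
        print(oid, cluster.holders(oid))
```

After the simulated failure, every object is again held by two of the surviving nodes, and lookups by object ID continue to work without the caller knowing where the copies live.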

Object-based storage, or CAS, is gaining traction because many organizations see the value in creating this tier within their IT infrastructure. Only a few vendors have products in this space, and they recently formed a CAS portal (www.cascommunity.org) to educate end users about this technology.

In the next article in this series, we will cover products from some of the founding members of the CAS community.

VS Joshi is an independent storage analyst. He can be contacted at vsjoshi@rcn.com.

This article was originally published on August 01, 2005