It has been several years since content-addressed storage (CAS) systems emerged to assuage end users’ fears that the pressures of regulatory compliance would ultimately result in a long-term stay in a sanitarium.
CAS platforms are disk-based, object-oriented storage systems designed for the long-term retention of data that is not intended to be changed. CAS systems stamp data with unique identifiers that ensure records have not been altered and that they can be accessed and retrieved whenever they are needed.
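At the heart of that guarantee is the content address itself: an identifier computed from the object’s bits, so identical data always yields the same address and any alteration yields a different one. The short Python sketch below is a vendor-neutral illustration of the idea; the in-memory dictionary and the put and get helpers are stand-ins invented for the example, not any product’s actual API.

```python
# Minimal, vendor-neutral sketch of content addressing: an object's address is
# a cryptographic digest of its contents, so tampering is detectable on read.
import hashlib

store = {}  # in-memory stand-in for a disk-based object store

def put(data: bytes) -> str:
    """Store an immutable object and return its content address."""
    address = hashlib.sha256(data).hexdigest()
    store[address] = data
    return address

def get(address: str) -> bytes:
    """Retrieve an object, verifying it still matches its address."""
    data = store[address]
    if hashlib.sha256(data).hexdigest() != address:
        raise ValueError("object has been altered or corrupted")
    return data

record_id = put(b"patient chart, 2007-03-14")
assert get(record_id) == b"patient chart, 2007-03-14"
```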
As the Storage Networking Industry Association (SNIA) works to provide a framework for information independence via the eXtensible Access Method (XAM) specification, the vendor community continues to offer hardware- and software-based archiving products that, by all accounts, are meeting the needs of customers.
E-discovery saves the day
The Roswell Park Cancer Institute (RPCI), a cancer research center and patient care facility located in Buffalo, NY, was introduced to the concept of CAS during an overhaul of its core IT infrastructure.
RPCI was out to build an entirely new storage infrastructure to improve its disaster-recovery processes, increase storage scalability, improve application performance, and archive medical images and e-mail for regulatory compliance and legal purposes.
RPCI is an all-HP shop with a variety of storage and server technologies from HP, including a pair of StorageWorks EVA 8000 arrays, two 10TB StorageWorks Medical Archive Solution (MAS) systems, an Enterprise File Services (EFS) Cluster Gateway, StorageWorks Cluster Extension (CLX) technology, and clustered ProLiant servers. The MAS platform handles the medical images, but RPCI was looking for e-discovery capabilities to meet its e-mail archiving needs.
“Lawsuits and e-discovery were first and foremost on a lot of our minds. The rationale is that if you archive all of your e-mail, e-discovery becomes possible and can thwart frivolous lawsuits,” says Tom Vaughan, director of IT infrastructure at RPCI. “The system pays for itself with the first lawsuit you shoot down.”
Vaughan knew that establishing effective archiving and e-discovery policies meant saving everything. That approach is especially necessary in New York, where the state has imposed stringent data-retention rules on top of those outlined in Sarbanes-Oxley. Vaughan and many of his peers affectionately refer to the added retention requirements as “Baby” SOX. “We have to keep all e-mails, archive them, and be able to produce them,” he says.
Vaughan had two simple questions when HP first made him aware of its Integrated Archive Platform system (formerly known as the Reference Information Storage System, or RISS) more than a year ago: “Does it store everything, and can we search it?”
The Integrated Archive Platform can distribute hundreds of terabytes of content and billions of objects across a grid of disk-based SmartCells. The architecture scales linearly, so storage, search, and retrieval performance hold steady regardless of the size of the archive.
The system works with HP’s Email Archiving and File Archiving software clients to facilitate the long-term retention of Microsoft Exchange, IBM Lotus Domino, and file system information.
The Integrated Archive Platform combines HP’s server and grid storage technology with native content indexing, search, and policy management software in a single rack system for long-term data retention.
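HP does not publish the internals of the SmartCell grid, but the general pattern behind that kind of linear scaling is straightforward: each cell stores and indexes only its own slice of the archive, and an object’s content address determines which cell owns it. The Python sketch below is a simplified, hypothetical illustration of that pattern; the Cell and Grid classes and their methods are invented for the example and are not HP’s API.

```python
# Hypothetical sketch of a grid of self-contained storage cells, invented for
# illustration; it is not HP's SmartCell implementation.
import hashlib

class Cell:
    """One independent storage-and-index node in the grid."""
    def __init__(self, name: str):
        self.name = name
        self.objects = {}  # content address -> stored text

    def put(self, address: str, data: str) -> None:
        self.objects[address] = data

    def search(self, term: str) -> list:
        # Each cell searches only its own slice of the archive.
        return [addr for addr, data in self.objects.items() if term in data]

class Grid:
    """Routes objects to cells by content address; adding cells adds capacity."""
    def __init__(self, cells: list):
        self.cells = cells

    def put(self, data: str) -> str:
        address = hashlib.sha256(data.encode()).hexdigest()
        owner = self.cells[int(address, 16) % len(self.cells)]
        owner.put(address, data)
        return address

    def search(self, term: str) -> list:
        # Cells can work in parallel, so per-cell query cost stays roughly
        # flat as more cells (and more content) are added.
        matches = []
        for cell in self.cells:
            matches.extend(cell.search(term))
        return matches

grid = Grid([Cell(f"cell{i}") for i in range(4)])
grid.put("From: counsel@example.org  Subject: litigation hold")
print(grid.search("litigation"))  # -> list of matching content addresses
```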
RPCI implemented a 2.7TB Integrated Archive Platform to archive its Exchange 2003 data, and the RPCI IT staff can now find and retrieve e-mail quickly for legal discovery and for internal needs. Previously, finding a particular piece of e-mail could take days or even weeks; Vaughan says the same task now takes less than an hour.
The need for speed
Policy-driven archiving with fast data access topped the list of concerns for Norton Healthcare, the largest healthcare system in Kentucky, when it began its search for a centralized backup system and a new archiving platform.
The heart of Norton’s business is its multi-faceted hospital information system. Sean O’Mahoney, Norton’s manager of client/server information systems, and his team have deployed a 400TB EMC storage infrastructure, including Symmetrix DMX, CLARiiON networked storage systems, and Celerra NAS platforms, as well as EMC’s SRDF/Synchronous, Navisphere, ControlCenter, TimeFinder, and Performance Manager software.
Norton’s backup infrastructure uses NetWorker software as the common interface for backups to an EMC Disk Library (EDL), which then exports data to Norton’s tape libraries. For its archiving strategy, Norton implemented a pair of Centera CAS systems to archive medical records and radiology and cardiology images.
O’Mahoney says every level of care and administration in Norton’s healthcare delivery system depends on immediate access to patient information. “The speed of information access is critical for us,” he says. “We considered tape-based systems, but found that they presented the same problems as our old optical jukebox, such as slow seek times.”
O’Mahoney has two pairs of replicated Centeras. The two local machines at his main data center total about 75TB in capacity and store radiology and cardiology images as well as scanned documents and patient records.
Archiving to Centera enables Norton to adhere to hospital policies and Protected Health Information (PHI) regulatory requirements, including the Health Insurance Portability and Accountability Act (HIPAA) and State of Kentucky regulations, which require that patient medical records and images be retained for a minimum of seven years.
“Centera did several things for us. We no longer have to change out optical platters when they fill up and don’t have to move anything off-site,” says O’Mahoney. “It’s much faster than optical in seek times and writes, and the replication features give us remote copies without having to do anything manually.”
“We gained a lot of value from the speed of data access and the built-in replication technologies we got with the Centera,” says O’Mahoney. “And because there is no more off-site media handling, we eliminated potential legal exposures from patient data being exposed to the world.”
Past meets present
The San Diego Supercomputer Center (SDSC) serves as the central nervous system for innumerable scientific projects and houses advanced research data. Researchers rely on the SDSC’s IT infrastructure to store and process data pertaining to large-scale research projects, including earthquake simulations, sky surveys, and biomedical research.
As part of his role as director of the SDSC’s Sustainable Archives and Library Technologies (SALT) lab, Richard Marciano is in charge of data archiving. He and his group manage more than a petabyte of storage for several hundred projects in support of about 5,000 end users. That translates into hundreds of millions of files.
Marciano’s own research is anchored in a new joint effort with the Society of American Archivists and the National Association of Government Archives and Records Administrators.
“My group’s particular area of focus is slightly off the beaten track, as it has more to do with the long-term preservation of cultural data from institutions like museums, the National Archives, and the Getty Research Institute,” he says. “My job is to figure out how to keep all of these assets forever.”
Marciano is currently archiving a collection of economic surveys conducted just after the 1929 stock market crash. The documents, he says, are integral to studying economic segregation and the practices of racial zoning and planning that were rampant in post-Depression America.
“Our work is all about storage and we have a huge need for the use of commodity storage hardware. We had wanted to use CAS technology for a long time, but the cost was prohibitive,” says Marciano. “We required a vendor-neutral CAS system with high levels of scalability that could accommodate diverse underlying operating systems and hardware.”
The search for a cost-effective archiving system caused Marciano to take notice when Caringo came calling last fall.
Caringo’s claim to fame is its hardware-agnostic architecture. Its CAStor product ships on a USB flash drive and runs on any x86 hardware platform. End users simply plug the CAStor USB memory key into a server node and repeat the process as needed, scaling the archive from a few server nodes and 1.5TB of content-addressed storage up to hundreds of nodes and more than 1PB.
New nodes are detected automatically by the CAStor cluster, and all files are replicated across the cluster by default. CAStor ensures that if one disk goes down, there is always another copy somewhere in the system.
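The self-healing behavior described above boils down to a simple invariant: every object should exist on more than one node, and anything that drops below that target gets copied again. The following Python sketch is a hypothetical illustration of that invariant; the put and heal functions and the three-node dictionary are assumptions made for the example, not the actual CAStor code.

```python
# Hypothetical sketch of default replication and self-healing in a small
# cluster; invented for illustration, not the actual CAStor implementation.
import hashlib
import random

REPLICAS = 2
nodes = {f"node{i}": {} for i in range(3)}  # node name -> {address: data}

def put(data: bytes) -> str:
    """Write an object to REPLICAS randomly chosen nodes and return its address."""
    address = hashlib.sha256(data).hexdigest()
    for name in random.sample(list(nodes), REPLICAS):
        nodes[name][address] = data
    return address

def heal() -> None:
    """Re-replicate any object that has fewer than REPLICAS surviving copies."""
    all_addresses = {addr for objs in nodes.values() for addr in objs}
    for address in all_addresses:
        holders = [n for n, objs in nodes.items() if address in objs]
        if not holders or len(holders) >= REPLICAS:
            continue
        data = nodes[holders[0]][address]
        for name in nodes:
            if name not in holders:
                nodes[name][address] = data
                holders.append(name)
                if len(holders) == REPLICAS:
                    break

addr = put(b"scanned survey page, 1930")
holder = next(n for n, objs in nodes.items() if addr in objs)
del nodes[holder][addr]  # simulate a failed disk taking one copy with it
heal()                   # the cluster restores a second copy elsewhere
assert sum(addr in objs for objs in nodes.values()) >= REPLICAS
```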
The ability to swap out obsolete hardware in favor of newer systems is a plus for Marciano.
“I could have used anything to get this project started. We might have more stringent requirements moving forward, but right now we have three standard HP PCs with 500GB of capacity each,” says Marciano. “The attractive feature in terms of buying additional hardware is that I can repurpose any low-end machine for storage.”
The Caringo software is currently running in a test environment within the SDSC, but Marciano is considering deploying CAStor on a wider scale. “The other groups on campus who are interested in these kinds of solutions are discussing building their own clusters based on our experience,” he says.