Object-based storage devices (OSD) and shared file systems are the next evolution of storage technology.
By David Freund
We recently reviewed the Lustre shared file system, which is based on an object-based storage device (OSD) architecture. A major benefit of this architecture is its removal of performance bottlenecks when sharing access to large numbers of files among large numbers of systems. Combining capabilities of both file- and block-oriented access methods, OSDs can be used as building blocks to construct modular storage systems that scale as well in performance as they do in capacity. But Lustre is not the only example of object-based storage.
In the late 1990s, Carnegie Mellon University’s Parallel Data Lab developed the OSD concept and built software prototypes, modifying the NFS and AFS file systems to use a more-intelligent OSD device interface, in its Network Attached Secure Disks (NASD) project. This intelligent-interface concept, combined with a scalable, shared file system, is the foundation of the Lustre architecture. It’s also the core of object-based storage products from start-up Panasas.
A number of other vendors also see the potential, and OSD standards, file systems, and other shared-access products are beginning to emerge.
Magnetic disk drives have evolved considerably from their origins in the late 1950s as dumb electromechanical devices. At one time, an operating system had to understand the physical geometry of each drive and issue detailed commands to position read/write heads to a specific position (or “cylinder”), select a specific head, and read or write a specific block (or “sector”) of data.
The operating system was also responsible for optimizing the execution of requests coming simultaneously from multiple applications to obtain the best throughput from each disk device. It also had to handle errors caused by problems with the magnetic media in certain blocks on the disk, storing additional information such as a CRC value to verify that a block is undamaged. The operating system even had to retry operations that encountered such errors.
Over time, disks got smarter. Drives gained the ability to store and verify CRC-based metadata, retry failed operations, correct errors, and “re-vector” bad blocks. They also gained enough intelligence to queue multiple commands, optimize their execution, and even cache some of the data. In essence, the disk drive became a virtual, error-free array of blocks; servers access them simply by logical block number (LBN).
A file system adds another layer on top of the LBN concept. When an application opens a file, it has access to its own virtual array of blocks. A file may be “fragmented,” with parts of the file scattered among several different LBN ranges, but the application is kept blissfully ignorant: the file system does all of the work of mapping virtual file blocks to LBNs. This mapping, along with other information such as file name, creation date, access controls, and so on, forms the “metadata” that a file system stores on disk. File systems are also responsible for coordinating allocation, de-allocation, and access requests from multiple applications running simultaneously in a system in order to guarantee file integrity and security.
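The block-mapping idea can be sketched in a few lines of Python. This is an illustration only, not how any particular file system stores its metadata: the `Extent` record, the `file_block_to_lbn` function, and the sample LBN values are all hypothetical names and numbers chosen for the example.

```python
from bisect import bisect_right
from typing import NamedTuple


class Extent(NamedTuple):
    """One contiguous run of a file's blocks on disk (hypothetical layout)."""
    file_block: int  # first virtual file block covered by this extent
    lbn: int         # logical block number where the run starts on disk
    length: int      # number of contiguous blocks in the run


def file_block_to_lbn(extents, block):
    """Translate a virtual file block into a disk LBN.

    `extents` must be sorted by `file_block`.
    """
    starts = [e.file_block for e in extents]
    i = bisect_right(starts, block) - 1  # last extent starting at or before block
    if i < 0:
        raise ValueError(f"block {block} is not mapped")
    extent = extents[i]
    offset = block - extent.file_block
    if offset >= extent.length:
        raise ValueError(f"block {block} is not mapped")
    return extent.lbn + offset


# A fragmented 10-block file scattered over three separate LBN ranges.
extents = [Extent(0, 5000, 4), Extent(4, 120, 3), Extent(7, 9000, 3)]
```

The application only ever asks for "block 5 of the file"; the lookup that turns this into an LBN stays inside the file system.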
Object-based storage takes this evolution of disk drives another step forward, embedding the handling of metadata into storage systems. Metadata handling is not physically built into drive electronics, at least not yet. Instead, a specially designated server is connected to a small number of disk drives to create a logical OSD.
Instead of an array of addressable blocks, an OSD presents variable-sized data-storage containers called “objects,” each with its own unique address, settable metadata attributes, and a defined method of reading and writing the object’s contents. The details of how an object is physically laid out on the media are the responsibility of the OSD, effectively transplanting that responsibility from a server’s file system to the storage device. This is similar to today’s file interface; objects can, in fact, be used to hold files, but they can also be used to store parts of files or other types of information.
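To make the object interface concrete, here is a toy in-memory model in Python. It is a sketch under stated assumptions, not the ANSI T10 command set or any vendor's API: the `ToyOSD` class and its method names are invented for illustration, and a real OSD would manage on-disk layout rather than a dictionary of byte arrays.

```python
import itertools


class ToyOSD:
    """Hypothetical object store: callers see object IDs, never block numbers."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._data = {}   # object id -> bytearray (the layout is private to the device)
        self._attrs = {}  # object id -> dict of metadata attributes

    def create(self, **attrs):
        """Create an empty object and return its unique address."""
        oid = next(self._ids)
        self._data[oid] = bytearray()
        self._attrs[oid] = dict(attrs)
        return oid

    def write(self, oid, offset, payload):
        """Write bytes at an offset, growing the object (zero-filled) if needed."""
        buf = self._data[oid]
        end = offset + len(payload)
        if end > len(buf):
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = payload

    def read(self, oid, offset, length):
        """Read up to `length` bytes starting at `offset`."""
        return bytes(self._data[oid][offset:offset + length])

    def set_attr(self, oid, name, value):
        self._attrs[oid][name] = value

    def get_attr(self, oid, name):
        return self._attrs[oid][name]


osd = ToyOSD()
oid = osd.create(owner="alice")
osd.write(oid, 0, b"hello")
osd.write(oid, 7, b"world")
```

Nothing in the caller's code mentions sectors, LBNs, or allocation; space management happens behind the interface.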
The case for OSD
A major reason users have been shifting from direct-attached storage (DAS) to networked storage is to gain the ability to share storage capacity among multiple servers, consolidating their storage and simplifying management. As they gain experience with these systems, users’ interest in sharing information, not just capacity, has been growing.
A shared file system enables sharing of file systems, directories, and even individual files among applications on multiple servers as though they were running on a single system. A server’s shared file system manipulates on-disk file metadata the same way as a single-server file system, but coordinates its activities with other servers, typically through the use of some form of “lock manager.” Because this combination has historically been integrated with clustering products, such file system components are commonly referred to as cluster file systems.
Unfortunately, cluster file systems can create a performance bottleneck as I/O workload, number of cluster members, or shared-storage capacity grows. The bottleneck usually forms in the file system’s metadata processing, specifically in coordinating the allocation of every physical disk block. NAS filers can’t avoid this bottleneck, either; they relieve other servers of metadata-processing work, but the filers still perform similar coordination in much the same way, whether configured as single servers or clustered.
OSD-based shared file systems combine two concepts, scale-out clustering and data striping, to remove the performance bottleneck while retaining the advantages of a shared file service. The central idea is to make disks “smart” enough to manage their own space allocation, and to spread files, directories, and entire file systems across these smart devices. This has additional advantages:
Modular performance and capacity scaling: Because each OSD handles its own space allocation, this specific type of metadata processing (the vast majority of typical cluster file system overhead) is spread out over multiple devices without needing to coordinate those devices’ activities. Increasing storage capacity is done by adding OSDs, which correspondingly increases the allocation-management capacity of the combined storage system. Striping files across OSDs provides additional file I/O bandwidth the same way striping traditional disks does, and adds another performance advantage: distributing file-access coordination.
Scalable security: OSDs can enforce access security for their own objects without significant overhead. In the context of a shared file system, a server’s file-system “client” could, for example, open a file by first contacting some form of object manager or metadata server, obtaining an access key, and presenting that key to an OSD when requesting access to an OSD object. OSDs check the validity of the key against local metadata and accept or reject the access accordingly. Revoking access rights for thousands of servers could involve little more than having a metadata server send a new key to the OSD(s), immediately preventing access by any server using an old key.
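The key-based access check described above can be sketched as follows. The article does not say how keys are constructed, so this example assumes one plausible scheme: the metadata server and the OSD share a secret, and an access key is an HMAC over the client name, object ID, and requested rights. The names `issue_capability`, `SecureOSD`, `check`, and `rekey` are hypothetical.

```python
import hashlib
import hmac


def issue_capability(secret: bytes, client: str, object_id: int, rights: str) -> bytes:
    """Metadata-server side: sign (client, object, rights) with the shared secret."""
    message = f"{client}:{object_id}:{rights}".encode()
    return hmac.new(secret, message, hashlib.sha256).digest()


class SecureOSD:
    """Hypothetical OSD that validates access keys against its local secret."""

    def __init__(self, secret: bytes):
        self.secret = secret

    def rekey(self, new_secret: bytes) -> None:
        """Install a new secret; every key signed with the old one stops working."""
        self.secret = new_secret

    def check(self, client: str, object_id: int, rights: str, capability: bytes) -> bool:
        """Recompute the expected key locally and compare in constant time."""
        expected = issue_capability(self.secret, client, object_id, rights)
        return hmac.compare_digest(expected, capability)


secret = b"shared-secret-v1"
secure_osd = SecureOSD(secret)
cap = issue_capability(secret, "node7", 42, "read")
```

In this sketch the OSD needs no per-client state: validation is a local computation, and revocation is a single `rekey` call on the device rather than a message to every server.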
Lustre, developed by Cluster File Systems, combines OSDs with an installable Linux file system and a “metadata server” (MDS) to create a file system that can be used by multiple servers to access files simultaneously. Traditional cluster file systems spend the vast majority of their processing overhead coordinating disk-block allocation to files as they are created, extended, modified, and deleted. An OSD-based file system pushes that processing into multiple OSD storage devices, effectively “scale-out” clustering the block-allocation metadata processing similar to the way applications execute on large high-performance computing (HPC) server clusters. Lustre is currently intended to be used with such applications, creating a file system designed to keep pace with the growing file-I/O needs of HPC clusters.
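As a rough illustration of how a file can be spread across several OSDs, the following Python sketch maps a byte offset in a file to an OSD and an offset within that OSD's object, using simple round-robin (RAID-0 style) striping. This is not Lustre's actual layout logic; the function name `locate` and the 64 KiB stripe size are assumptions made for the example.

```python
def locate(offset: int, stripe_size: int, num_osds: int):
    """Map a file byte offset to (OSD index, byte offset inside that OSD's object).

    Stripes are dealt out round-robin: stripe 0 to OSD 0, stripe 1 to OSD 1, ...
    """
    stripe = offset // stripe_size            # which stripe of the file
    osd_index = stripe % num_osds             # which device holds that stripe
    stripes_before = stripe // num_osds       # earlier stripes on the same device
    object_offset = stripes_before * stripe_size + offset % stripe_size
    return osd_index, object_offset


STRIPE = 64 * 1024  # hypothetical 64 KiB stripe unit
```

Each OSD decides for itself where its share of the file lives on disk, so a client writing a large file touches several devices in parallel without any of them coordinating block allocation with the others.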
Panasas uses a similar approach and targets a similar group of HPC customers. That’s no surprise, since it was inspired by the same research project at Carnegie Mellon that inspired Lustre. Whereas Lustre is an open-source project, Panasas treats its implementation as proprietary intellectual property and bundles it with its own commodity hardware. So far, Panasas’ ActiveScale File System has proven to be a more mature OSD implementation, with better built-in manageability. The current release, announced in November 2004, includes features such as snapshots, parallel-backup capability (compatible with traditional backup software), “active spares” to improve data-reconstruction speed when recovering from a failure, continuous storage-media monitoring, and quotas.
Panasas has also pushed for the creation of industry standards for object-based storage, engaging the Storage Networking Industry Association (SNIA). The SNIA OSD Technical Working Group (TWG) submitted standards drafts and worked with the ANSI T10 Technical Committee to add OSD-specific commands to the SCSI protocol, providing a file-like interface to objects stored on OSDs. Adoption of this standard would mean Fibre Channel and iSCSI can continue to be the network protocols between servers and storage devices because both use SCSI commands to control and transfer data to and from those devices.
IBM is also actively involved in the OSD standards efforts and is exploring the addition of OSDs to its current shared file systems, such as GPFS and SAN File System (SFS), to accomplish the same goal: eliminating potential performance bottlenecks. IBM believes SFS provides excellent metadata-processing scalability. However, the architecture lends itself well to the use of OSDs. SFS already uses an installable file system component to run on “client” systems using the shared file system and a cluster of metadata-management servers. IBM co-chaired the SNIA group that drafted OSD extensions to the SCSI protocol. It’s possible that IBM will enhance SFS to support ANSI-standard OSDs. After all, its creators at IBM’s Almaden Research Lab designed it with object storage in mind. IBM is also experimenting with other implementations, creating different OSDs in software, hardware, and in various combinations with varying capabilities.
Intel has also done research into object storage. It co-chaired the SNIA OSD TWG along with IBM and has demonstrated the results of its efforts at a number of Intel Developer Forum events.
EMC has also been active in object-based storage, most notably with its Centera storage system. Centera provides a subset of the capabilities that Intel, IBM, Panasas, Cluster File Systems, and others have collectively defined for OSD, but adds capabilities beyond that definition. It still uses scale-out techniques to scale storage capacity while maintaining consistent performance.
Centera is not a re-implementation of a filer, but a different kind of storage device. It’s a “content-addressable storage” (CAS) server intended mainly for use as an archival store of “fixed content”: data that does not (or must not) change, such as medical images, streaming audio or video content, e-mail messages, billing records, engineering documents, etc.
When an application creates and writes an object, also known as a binary large object (BLOB), Centera generates a 128- or 256-bit key based on the object’s contents, forming a unique address used to reference the object. EMC’s focus with Centera has been on security and reliability. For example, data can only be accessed by servers that have the appropriate key; it can enforce rules governing the retention and disposition of data using various metadata-based criteria (such as preventing applications from modifying or deleting an object regardless of its access rights, creating “write-once” objects); it can also maintain redundant copies of objects to protect against hardware failures. Centera has been one of the more commercially successful examples of object-based storage, having recently crossed the 1,000-customer threshold, with more than 30 petabytes of capacity shipped.
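The content-addressing idea is easy to demonstrate in Python. The sketch below is not Centera's implementation or API: it assumes SHA-256 as the hash (giving a 256-bit address), and the `ToyCAS` class with its `put` and `get` methods is invented for illustration.

```python
import hashlib


class ToyCAS:
    """Hypothetical content-addressable store: the address is a hash of the content."""

    def __init__(self):
        self._blobs = {}  # hex address -> stored bytes

    def put(self, content: bytes) -> str:
        """Store a blob and return its content-derived address."""
        address = hashlib.sha256(content).hexdigest()
        # setdefault never overwrites, so an existing object is effectively write-once.
        self._blobs.setdefault(address, content)
        return address

    def get(self, address: str) -> bytes:
        """Fetch a blob; only a caller that holds the address can ask for it."""
        return self._blobs[address]


cas = ToyCAS()
address = cas.put(b"chest x-ray, 2004-11-02")
```

Because the address is derived from the content, storing identical data twice yields the same address, and any change to the data produces a different one. That property is what makes the scheme a natural fit for fixed content.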
Network Appliance has also embraced a scale-out approach to its storage servers, based on technology from its Spinnaker acquisition, which NetApp says will accelerate delivery of its “storage grid architecture.” The company’s SpinServer clusters, which can combine up to 512 servers into a single NAS service, offer impressive aggregate scalability, but only if there is little contention for files or directories. Instead of an OSD approach, each directory or file is served by a single appliance. A SpinCluster namespace presents a directory tree, where subdirectories can be virtual file systems (VFSs) that are served on the same or on other appliances. Each SpinServer serves a number of VFS directory trees; requests for a non-local file or directory are redirected to the correct appliance within the cluster.
While traditional storage access mechanisms based on disk blocks and files are not going away any time soon, a more sophisticated form of storage is emerging. As users increasingly look to share information as well as capacity among networked systems, object-based storage will be deployed to improve scalability, manageability, and modularity.
David Freund is an analyst with Illuminata Inc. (www.illuminata.com) in Nashua, NH.