In the last of a three-part series, we delve into installable file system approaches to data sharing.
By Marc Farley
Previous installments of this series of articles examined potential technologies and architectures for data sharing in SANs. To clarify, data sharing through NFS and CIFS network-attached storage (NAS) systems is not covered in these articles. Whereas NFS and CIFS work at the file level, these articles explore ways that data sharing can be accomplished at the block level, or at a level in between blocks and files.
This article looks at three different models of installable file system (IFS) technologies that can be used for data sharing:
- Indirect access manager-broker model
- Centralized, direct access
- Distributed, direct access
Indirect access manager-broker
Data sharing through an access manager is a fairly simple concept. Every computer that participates in this data-sharing system requests the right to access data through an access manager system that parcels out access "tickets" for all requests. The access manager, or broker, is the central intelligence of this system, and determines the access characteristics and order of access for the entire system. Several implementations of this type of shared file system are available in the market.
The access broker approach is similar to the notion of a SAN backup connection broker that establishes a connection between a server system needing access to a tape drive and an available tape drive in a library. The difference is that the data-sharing access manager provides connections to a storage subsystem (typically disk), as opposed to a particular tape device.
Figure 1: File access manager
Access to shared data through a broker begins with an installable file system component running on each requesting computer in the data-sharing system. The installable file system component submits requests to the access manager for a specific file, record, or range of blocks.
The access manager can optionally authenticate the request and then determine if another system has already reserved, or is working with, the requested data. If there are no conflicts, the access manager sends a ticket to the requesting system, which it stores locally for use when accessing the data. This ticket can contain several types of access control information, depending on the structure of the data-sharing system and the processing capabilities of the storage subsystems. For example, tickets can contain such things as locking information, ticket expiration times, encryption keys, device logical block lists, and file system metadata. The requesting system then communicates with the storage subsystem using the logical block addresses and other information contained within the ticket. It is also possible for the requesting system to exchange the ticket with the storage subsystem so the subsystem can perform additional authentication and instruction processing. However, this type of additional processing capability is beyond the scope of most storage subsystems used in access manager data-sharing systems today.
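The request/ticket flow described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not any vendor's protocol: the class names (`AccessManager`, `RequestingHost`), the reservation structure, and the ticket dictionary are all hypothetical, and the ticket carries only a few of the fields mentioned above (block list, access mode, expiration time).

```python
import time

class AccessManager:
    """Broker that grants access tickets for block ranges (illustrative sketch)."""

    def __init__(self, ticket_lifetime=30):
        self.ticket_lifetime = ticket_lifetime
        self.reservations = []  # each entry: {"host", "device", "start", "count"}

    def request_ticket(self, host, device, start_block, block_count, mode):
        # Deny the request if another host holds an overlapping reservation
        # on the same device (the locking conflict the broker resolves).
        end = start_block + block_count
        for r in self.reservations:
            if (r["device"] == device and r["host"] != host
                    and start_block < r["start"] + r["count"]
                    and r["start"] < end):
                return None
        self.reservations.append({"host": host, "device": device,
                                  "start": start_block, "count": block_count})
        # The ticket carries a subset of the access-control information
        # described above: logical block list, mode, and an expiration time.
        return {"holder": host, "device": device,
                "blocks": list(range(start_block, start_block + block_count)),
                "mode": mode,
                "expires": time.time() + self.ticket_lifetime}

class RequestingHost:
    """IFS component in a requesting system: asks the broker before doing I/O."""

    def __init__(self, name, manager):
        self.name, self.manager = name, manager

    def open_range(self, device, start_block, block_count, mode="read"):
        ticket = self.manager.request_ticket(self.name, device,
                                             start_block, block_count, mode)
        if ticket is None:
            raise PermissionError("range is reserved by another host")
        # With a valid ticket in hand, I/O would go directly to the storage
        # subsystem over the SAN, addressed by the ticket's block list.
        return ticket
```

Note that the broker sits only in the control path: once the ticket is granted, the requesting system talks to the storage subsystem directly.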
One significant architectural aspect of the access manager approach is the division of the control path between the requesting system and the access manager, and the data path that exists between the requesting system and the storage subsystem. The control path between the requesting system and the access manager is typically done over a LAN, although it could also be done over a WAN or MAN. The protocols used in the control path are likely to be Internet protocols that run on top of the TCP/IP protocol stack. The data path is typically a SAN and uses serial SCSI block-access protocols, which is why access manager file systems have been referred to as SAN file systems. However, it is possible that the exchange of the ticket between the requesting system and the storage subsystem could be accomplished through a separate control channel. There are a number of possible implementation choices for the exchange of ticket information.
The access manager approach is appealing for scenarios where the data sharing requirements are not particularly heavy, and where NAS file sharing is impractical. These scenarios could include SANs with large files where the number of file requests is relatively small, such as in print, video, and multimedia production environments. The file access manager resolves any locking conflicts that arise between systems, while providing fast streaming I/O performance. In fact, lock management is the primary feature of this approach, and locks are relatively simple to understand and resolve. Figure 1 illustrates how the access manager works.
Figure 2: Centralized data sharing
The primary weakness of the access manager approach is that the access manager is a potential single point of failure, making it impractical where data availability requirements are high. The access manager is also a performance bottleneck that realistically prevents its use in heavy transaction processing applications. The single point of failure could be addressed by clustering a pair of access managers to increase availability; however, it is unlikely that clustering would raise performance enough for transaction systems.
One problem for any data-sharing system is cache management. New updates to shared data should be available to all sharing systems involved. A cache notification mechanism should prevent stale data from corrupting fresh data when multiple systems are processing concurrently. The access manager approach does not provide any capabilities for synchronizing cache information as it is completely outside the data path. While the allocation function resides in the access manager, it would be a severe performance constraint to use a cache-registration scheme that communicates the contents of each requesting system's local cache to the access manager over a LAN. In general, LANs are not reliable enough and have relatively long latencies for cache synchronization.
In addition, the access manager approach does not provide cross-platform semantic integration, as each client accesses the data directly without the benefit of a semantic interpreter. That is not to say that semantic integration could not be incorporated into the scheme, but that would require a significant development effort in storage subsystem technology, and may break the cost model of access manager shared data systems.
Centralized, direct access method
The concept of centralized data sharing can use installable file systems in requesting systems, where a many-to-one relationship exists between requesting SAN-attached systems and a single intelligent "back-end" storage subsystem. Figure 2 illustrates this architecture.
This is a straightforward architecture that is simple to understand at a high level. Previous articles in this series explored the concept of an intelligent back-end storage subsystem, which provides services for locking, caching, and semantic integration to requesting host computers. In essence, this is a client/server model where a single, monolithic subsystem serves the storage I/O requests of multiple data sharing systems.
While the overall architecture is simple, the implementation details are far more complex. One of the most difficult aspects of this approach is in understanding how communications are accomplished between host systems and the storage processor. In contrast to the access manager method, which provides distinct control and data paths, the centralized, direct access method uses an intelligent back-end storage subsystem that communicates with requesting systems over a single path for both control and data. Considering that a single subsystem provides all services, separate paths would be unnecessarily complicated for synchronizing operations and maintaining adequate performance levels.
Retrieve Inc., purchased by Sterling Software and subsequently Computer Associates, developed technology for such a shared, centralized, direct access system. Data transfers between host systems and the back-end storage subsystem involve context-rich data transfers that contain control information in addition to storage data. The installable file system module running in the requesting system is responsible for formatting and managing these transfers. The intelligent storage subsystem performs storage space allocation for all requests from all data sharing systems as they are received.
Similar to the file access manager approach, there is potential for the intelligent back-end storage subsystem in the centralized, direct access system to be a single point of failure. As is true with the file manager approach, this weakness can be addressed by using mirroring or cluster technology with the storage subsystem.
However, unlike the access manager approach, the centralized, direct access method avoids the overhead of ticket exchange and ticket processing in both the access manager and the requesting system. The result is far less latency in the I/O channel and the possibility of excellent performance for transaction processing applications.
In addition, it is feasible to implement cache synchronization with a centralized, direct access data sharing system. With high-speed storage I/O channel performance between requesting systems and the intelligent storage subsystem, it is possible to establish a cache registration mechanism between each requesting system and the centralized storage subsystem. While there are several ways to structure this, the basic idea is the same for all of them: All operations would be checked against known accessed data, and if collisions are found, the proper data synchronization process is initiated.
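One way to picture a cache-registration mechanism is sketched below. This is an assumption-laden illustration, not a description of any shipping product: the subsystem tracks which hosts have cached which blocks, and a write to a registered block triggers invalidation notices to every other caching host, which is the collision check and synchronization step described above. All names are hypothetical.

```python
class CacheRegistry:
    """Subsystem-side registry of which hosts cache which blocks (sketch)."""

    def __init__(self):
        self.registrations = {}  # block number -> set of host names caching it
        self.invalidated = []    # (host, block) notifications that were sent

    def register_read(self, host, block):
        # A host registers each block it pulls into its local cache.
        self.registrations.setdefault(block, set()).add(host)

    def write(self, writer, block):
        # A write collides with every other host caching this block:
        # notify them so their now-stale copies are discarded.
        for host in self.registrations.get(block, set()) - {writer}:
            self.invalidated.append((host, block))
        # After the write, only the writer's copy is current.
        self.registrations[block] = {writer}
```

Because this traffic rides the high-speed storage I/O channel rather than a LAN, the latency of each registration and invalidation message is small enough to make the scheme practical.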
Finally, semantic integration could be accomplished for heterogeneous systems. To do this, the intelligent storage subsystem would resolve differences in data access operations using a consistent mapping among the various native file system implementations. While complete semantic integration across different file systems is unlikely, access filters could still give administrators predictable behavior for heterogeneous, shared applications.
Distributed, direct access model
Of the three models discussed in this article, the distributed, direct access approach is the most challenging to understand. This method uses a many-to-many architecture, where any requesting system can communicate with any distributed node that is part of the distributed storage service. In essence, a distributed system of storage subsystems replaces the single storage subsystem of the centralized method discussed above. Although the concept is more challenging, the potential of this approach may provide the greatest benefit to system administrators due to its advanced scalability and availability characteristics.
The integrated networking technology of the distributed system is a key element. Storage I/O in a distributed, direct access system flows between any requesting system and any participating storage subsystem. Each receiving storage subsystem manages its local storage space and performs space allocation on locally attached devices. Because the IFS in the requesting system does not perform any of its own space allocation, it cannot efficiently create block-specific I/O commands, which means it cannot use a block-oriented communications protocol such as SCSI. In other words, requesting systems communicate with distributed storage processors using some other protocol, such as TCP/IP or the Virtual Interface (VI) protocol, and I/O commands become more like peer-to-peer messages than master/slave I/O. This approach off-loads space allocation from requesting systems and provides architectural compatibility with many kinds of existing or installed technologies.
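The contrast with block-oriented SCSI can be made concrete with a sketch of one such peer-to-peer message exchange. The message format and the `StorageNode` class here are invented for illustration: the point is that the client names a file and an offset, while the node performs its own local space allocation, invisible to the requester.

```python
class StorageNode:
    """One node of the distributed storage service (illustrative sketch)."""

    def __init__(self, name):
        self.name = name
        self.extents = {}   # (file_id, offset) -> stored bytes
        self.next_block = 0 # local space allocation, never exposed to clients

    def handle(self, message):
        # Requests arrive as peer-to-peer messages naming files and offsets,
        # not device-specific block addresses as a SCSI command would.
        if message["op"] == "write":
            self.next_block += 1  # the node allocates its own local space
            self.extents[(message["file_id"], message["offset"])] = message["data"]
            return {"status": "ok"}
        if message["op"] == "read":
            data = self.extents[(message["file_id"], message["offset"])]
            return {"status": "ok", "data": data}
        return {"status": "error", "reason": "unknown op"}
```

A client would send `{"op": "write", "file_id": "f1", "offset": 0, "data": ...}` to whichever node holds that portion of the file, and the node decides for itself where the bytes land on its locally attached devices.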
Figure 3: Redundancy and RAID striping
Tricord Systems, formerly a manufacturer of super-server systems, has changed its business to be completely centered around the development of such a distributed, direct access system. In addition to providing a protocol-independent distributed system, the Tricord File System also uses a striping technology based on RAID algorithms for redundant data protection. The notion of "object-oriented RAID," as Tricord calls it, is to stripe data objects across intelligent storage subsystems so that each subsystem performs space allocation for its own particular part of the file system.
Directory and file information is striped across multiple storage processors as it is generated. For instance, the directory entries in the file system and the starting addresses for files are spread throughout the network. Files are striped across subsystems in small-granularity sub-units, which each local subsystem manages as local data entities. Read and write operations from a single requesting system are carried out in this striped fashion, which provides excellent parallel I/O performance.
Another interesting aspect of this approach is that it removes the requirement for data redundancy within an individual subsystem's storage. Because redundancy is provided at the network level, across the other subsystems, it is not necessary to provide it again at the lower, device level. Figure 3 illustrates the RAID striping potential of a distributed, direct access file system.
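The network-level redundancy can be sketched with a generic RAID-5-style XOR parity scheme. To be clear, this is not Tricord's implementation, whose details are proprietary; it is a minimal, hypothetical illustration of striping chunks across subsystems with a rotating parity chunk, so that the contents of any single failed node can be rebuilt from the survivors.

```python
def xor(chunks, size):
    """XOR a list of equal-sized byte chunks together."""
    out = bytearray(size)
    for c in chunks:
        for i, b in enumerate(c):
            out[i] ^= b
    return bytes(out)

def stripe_write(data, node_count, chunk=4):
    """Stripe data across node_count subsystems: each stripe holds
    node_count - 1 data chunks plus one XOR parity chunk, with the
    parity position rotating from stripe to stripe (RAID-5 style)."""
    nodes = [[] for _ in range(node_count)]
    data_per_stripe = node_count - 1
    chunks = [data[i:i + chunk].ljust(chunk, b"\0")
              for i in range(0, len(data), chunk)]
    for s in range(0, len(chunks), data_per_stripe):
        stripe = chunks[s:s + data_per_stripe]
        while len(stripe) < data_per_stripe:      # pad a short final stripe
            stripe.append(b"\0" * chunk)
        parity_node = (s // data_per_stripe) % node_count
        parity = xor(stripe, chunk)
        data_iter = iter(stripe)
        for n in range(node_count):
            nodes[n].append(parity if n == parity_node else next(data_iter))
    return nodes

def recover_chunk(nodes, lost, stripe_index, chunk=4):
    """Rebuild the lost node's chunk in one stripe by XOR of the survivors."""
    surviving = [nodes[n][stripe_index]
                 for n in range(len(nodes)) if n != lost]
    return xor(surviving, chunk)
```

For example, striping 12 bytes across four nodes in 4-byte chunks fills one stripe: three nodes hold data chunks, one holds parity, and `recover_chunk` rebuilds any one of the four from the other three.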
Also shown in Figure 3 is the concept of a processor array. Just as a RAID subsystem has arrays of disk drives, the distributed, direct access data sharing system can utilize arrays of intelligent back-end storage subsystems. The individual nodes that make up this array provide locking, caching, and semantic integration support.
One potential downside of the distributed, direct access approach is the difficulty of running storage management applications, such as backup. Certain operations, such as an image backup of a single node, may not be possible. For instance, this is not a file system that can be copied from a single subsystem using third-party copy transfers for backup. Instead, the distributed file system must be accessed through specified access locations in the network.
This article examined three approaches to sharing data at the block or sub-file level (a level in between blocks and files). Some assumptions were made about the presence of intelligent back-end storage subsystems that could provide device virtualization, volume management, or space allocation for multiple requesting systems.
While it is possible that existing file systems can evolve to incorporate some of the data-sharing technologies discussed here, it seems more likely that the shortest path to successful SAN-based data sharing will be accomplished through installable file systems.
Readers interested in exploring this technology will find little published material on these topics. File systems are mostly a subject for academic study, with little common understanding among IT professionals. This brief introduction is intended to help readers get started in this complex area. As SAN applications receive more exploration, information on SAN data sharing and installable file systems will likely become more readily available in the form of books, white papers, and product marketing collateral.
Marc Farley is vice president of marketing at Solution-Soft Systems Inc., in San Jose, CA. www.solution-soft.com. He is also the author of Building Storage Networks (Osborne/McGraw-Hill, 2000).