In the second of a three-part series, we look at possible approaches to solving data-sharing problems in storage area networks.
By Marc Farley
The first part of this series discussed how high-performance processors can be implemented in SAN storage subsystems (March, pp. 34-42). The possibility of spreading file-system information across multiple storage subsystems in a storage area network (SAN) was introduced as a possible application for this type of intelligent back-end subsystem. This article takes a closer look at data-sharing technologies, particularly the file-system functions needed to support data sharing.
Locking and access semantics
Most file and database systems provide some level of locking: the ability to reserve access or update rights to data for the users or applications using it. The idea is to prevent other users or applications from accessing or updating data already being used by the first user or application.
The process that provides the lock mechanism is the lock manager. The lock manager is responsible for managing access to the locked data, as well as any actions and notifications that are needed when attempts are made to access locked data. For instance, a lock manager can use advisory locks, which warn users or applications that the data they have accessed is locked. Because an advisory lock is not enforced, the data can subsequently be updated by any user or application that has accessed it. In these cases, the lock manager may be able to notify clients that the file has been updated, advising them to reload the data.
The lock manager for a database system can lock data at the following levels:
- Table level (the entire table and all contents within it are locked)
- Record level (individual records are locked; other users can access other records)
- Field level (individual fields within records are locked; other fields can be accessed)
Similarly, a file system lock manager can lock data at various levels:
- Directory level (the entire directory and all files within it are locked)
- File level (individual files are locked; other systems can access other files)
- Byte-range level (contiguous byte ranges within files are locked; other ranges can be accessed by other systems)
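To make these granularity levels concrete, here is a minimal, single-machine sketch of an advisory lock manager that handles directory, file, and byte-range locks. All class and method names are invented for illustration; a real SAN lock manager would also need distributed state, client-failure handling, and the notification mechanisms discussed below.

```python
class LockManager:
    def __init__(self):
        # Each entry: (path, start, end, owner). A start/end of None
        # means the whole file or directory is locked.
        self.locks = []

    def _conflicts(self, path, start, end, requester):
        for p, s, e, owner in self.locks:
            if owner == requester:
                continue
            # A directory lock covers everything beneath it, and a
            # directory request conflicts with locks held beneath it.
            related = (p == path or path.startswith(p + "/")
                       or p.startswith(path + "/"))
            if not related:
                continue
            # Whole-file/directory locks conflict with everything;
            # byte ranges conflict only when they overlap.
            if s is None or start is None or (start <= e and s <= end):
                return True
        return False

    def acquire(self, path, owner, start=None, end=None):
        """Advisory lock on a directory/file (start=end=None) or byte range."""
        if self._conflicts(path, start, end, owner):
            return False  # advisory refusal: data already locked
        self.locks.append((path, start, end, owner))
        return True

    def release(self, path, owner):
        self.locks = [l for l in self.locks
                      if not (l[0] == path and l[3] == owner)]
```

Note that disjoint byte ranges of the same file can be locked by different systems, while a directory-level request conflicts with any lock held on a file beneath that directory.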
File systems and databases have been using locking technology for many years. In general, the technology has worked fairly well and has enabled multi-tasking systems to run multiple applications with access to the same data.
Figure 1: In a Parallel Sysplex environment, the Coupling Facility handles cache notification.
However, the matter of locking shared data in a SAN, versus a multi-user environment, poses different problems with several new wrinkles. For example, there are fundamental differences in data-access operations among systems. These differences are briefly discussed below (see Semantic Integration).
A lock manager for a data-sharing environment can build a lock-resolution hierarchy based on several factors, such as user profiles, prioritization, policies, or order of data access (e.g., in a first-come, first-locked manner). Determining how locks are assigned to clients is not a trivial matter. Users are not necessarily accustomed to finding out that paragraphs or spreadsheet cells of particular files cannot be modified. Furthermore, users are unaccustomed to receiving notices that files have just been changed, advising them to reload the files.
If locking messages are issued, how will users be notified and how will they react? If they are not notified, users might think there is something wrong with their equipment or with the network. If they are notified, they might not know what to do and take alternative actions, such as saving a new version of the file.
The whole notion of notifying users about locks is an interesting puzzle. Should the lock manager tell users when locks are cleared? If so, how does it identify which locks have been cleared? What if a user wants to override a colleague's lock? Should there be a method for doing so? If so, how does an organization keep users from abusing this capability or hacking locked data?
Figure 2: Diagram shows a cache coherency mechanism for open systems using an intelligent back-end storage subsystem as a global cache manager.
There are bound to be many interesting surprises along the way. Questions like these will likely be the greatest deterrent to implementing shared data systems.
Semantic integration
Among the challenges of SAN data sharing are the various ways systems access data. Semantic integration deals with these differences, including the types of storage I/O operations used and the user-interface characteristics that accompany them. For instance, systems differ in the ways they name, open, lock, delete, and update files. If a user on a Unix system creates a file name that is illegal on a Macintosh system, how does the Macintosh user see that file name?
Also, file systems may support different functions, and it's not unusual for two systems to implement a certain function differently. The issue is not so much how an access function works, but how the data appears to different users working on different platforms.
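One narrow slice of this problem can be illustrated with file names. The forbidden characters below are real (":" is the path separator on classic Mac OS, "/" on Unix), but the escaping scheme is invented here purely for illustration; it simply shows how a data-sharing layer might present a stable name on every platform.

```python
# Characters that are illegal in file names on each platform.
FORBIDDEN = {
    "unix": {"/", "\0"},
    "mac":  {":"},
}

def translate_name(name, target):
    """Replace characters illegal on the target platform with a
    percent-escape, so every system sees some well-formed name."""
    out = []
    for ch in name:
        if ch in FORBIDDEN[target]:
            out.append("%%%02X" % ord(ch))  # e.g. ":" becomes "%3A"
        else:
            out.append(ch)
    return "".join(out)
```

A real semantic-integration layer would also have to make the mapping reversible and collision-free, which is where much of the difficulty lies.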
Caching shared data
Caching is one of the primary performance boosters for any computer system. However, if it is not implemented correctly, it can actually degrade performance. And in a shared data environment, a poorly implemented cache can lead to data corruption.
This section examines various aspects of caching in a data-sharing environment. To avoid confusion with memory caching, throughout this article "cache" refers to the caching performed for a disk subsystem.
Cache coherence is the property of multiple caches having the same version of data. Maintaining cache coherence in a shared data environment is a messy problem with several variables, including the caching algorithms used, the number of systems involved, the number of storage subsystems, and the application mix.
Where data sharing is concerned, the overriding question is: How does a system know when cached data is stale, that is, when another system has updated it? This can be thought of as a data-version problem, in which multiple versions of data can be out of sync with each other.
For the most part, it is difficult to implement a more efficient cache than one in system memory. A write-back cache in system memory provides the best performance of all disk caches for repeatedly accessed data, such as in transaction processing. However, with data sharing, one system may want to access the data in another system's write-back cache.
Unfortunately, there are no external methods to determine which blocks are being held in a system's cache. Therefore, it is practically impossible to fetch data out of a system's cache and retrieve it for another system. Specialized caching software could be developed for homogeneous environments, but the performance gains might not be enough to overcome the overhead of the updating process.
One way to get around this problem is to use a single cache in a shared data subsystem that all systems access. However, while this approach might provide a single point for coherency, it might not provide the performance needed for some applications.
If a single cache in a subsystem cannot be used, some way of indicating the status of cached information in other systems is required. One way to do this is to use indicators, or status flags, that represent the state of cached data. These indicators can exist in a system's local cache or in a cache module in the subsystem. A key point: checking the indicators should take far less time than generating a typical I/O request.
As data is about to be accessed from cache, the system can check status flags to determine if it needs to reload the data from the subsystem. If the status flag indicates the data is not stale, the system can access the data from its own cache. When a system updates data, it also sets the appropriate indicator flags so that other systems know their copies of the data are stale.
The IBM Parallel Sysplex Coupling Facility uses such a mechanism. It uses a global cache to keep track of all data in every system's cache. Each system maintains local cache indicators representing the segments of data held in cache. These indicator bits are called local state vectors and indicate whether the data is stale or not. Systems communicate with the Coupling Facility about data they are updating.
When the Coupling Facility receives a message from a system that data is being updated, it checks to see if participating systems have that data in cache. If others do, the Coupling Facility updates the local state vector in each system's cache to set the bit to "stale."
As those systems attempt to access that data, they are notified that their copies are stale, and they read the data again from the global cache or the storage subsystem (see Figure 1).
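The state-vector scheme described above can be modeled in a few lines. In this sketch, a global cache manager (playing the Coupling Facility role) flips a per-system stale bit when any participant updates a block, and each system checks its bit before trusting its local copy. The class and method names are illustrative, not IBM's interfaces.

```python
class GlobalCacheManager:
    def __init__(self):
        self.interest = {}  # block_id -> set of system names caching it

    def register(self, system, block_id):
        self.interest.setdefault(block_id, set()).add(system.name)

    def notify_update(self, updater, block_id, systems):
        # Mark every *other* interested system's copy as stale.
        for s in systems:
            if s.name != updater.name and s.name in self.interest.get(block_id, ()):
                s.state_vector[block_id] = True  # True == stale

class System:
    def __init__(self, name, gcm):
        self.name, self.gcm = name, gcm
        self.cache = {}          # block_id -> data
        self.state_vector = {}   # block_id -> stale flag

    def read(self, block_id, backing_store):
        # Reload only if the block is absent or flagged stale.
        if block_id not in self.cache or self.state_vector.get(block_id):
            self.cache[block_id] = backing_store[block_id]
            self.state_vector[block_id] = False
            self.gcm.register(self, block_id)
        return self.cache[block_id]

    def write(self, block_id, data, backing_store, peers):
        backing_store[block_id] = data
        self.cache[block_id] = data
        self.state_vector[block_id] = False
        self.gcm.register(self, block_id)
        self.gcm.notify_update(self, block_id, peers)
```

The essential property is that a read costs only a local flag check unless the global manager has invalidated the copy.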
Figure 4: Diagram shows a many-to-many relationship with data/file layer implementations running in multiple systems and allocation layer implementations running in multiple intelligent storage subsystems.
The ability to make this communication happen almost instantly can be provided by a high-speed network such as Fibre Channel; in the case of the Coupling Facility, it is achieved via ESCON.
Notification can be handled in other ways, depending on performance requirements and system architectures. In general, this type of cache coherency enforcement requires fast processing and communications.
In addition, it requires a global entity (in this case, the Coupling Facility) to be able to set and change data in a system that is accessing it. This is quite a departure from most storage architectures where the subsystem only responds to I/O requests from the systems it is serving. Figure 2 shows a general cache coherency mechanism for open systems using an intelligent back-end storage subsystem as a global cache manager.
Installable file systems
The words "installable file system" tend to strike fear into the hearts of many. Although it sounds a bit like rocket science, the installable file system (IFS) concept is actually easy to grasp. IFSs provide the same types of services as a system's native file system, but also incorporate specialized services that extend the native file system's capabilities.
There are two important points:
- Installable file systems do not necessarily take over the operations of the native file system; they augment it.
- Installable file systems are supported through programming interfaces published by system vendors.
Locking, semantics, caching
One of the major potential advantages of an installable file system is the ability to integrate the functions of data sharing in a unified approach. For example, some of the problems of shared data access, including locking and semantic integration, could be eased by users using a consistent installable file system across multiple platforms. Users would likely have to make adjustments, but they would make similar adjustments. Also, a global caching mechanism could be implemented in conjunction with an installable file system. Such technology development, however, would not be trivial.
Distributing file system functions
Installable file systems in SANs could incorporate capabilities not yet available in native file systems. We'll explore some of these possibilities below.
Separating the I/O stack
The model for an I/O function stack is presented in the table. The items in bold represent functions associated with file systems.
The concept of separating the data/file layer from the allocation layer could be key to data sharing in the SAN. One way to facilitate this separation is by using an installable file system. We'll now examine this idea in detail through multiple scenarios.
Role of the allocation layer
Space allocation refers to the manner in which a storage device or subsystem is filled with data. It can be filled from the bottom up, by scattering data throughout, or in any number of other ways. Sometimes the allocation scheme results in less-than-optimal placement of data in the subsystem, and utilities such as disk-defragmentation software are needed. The allocation scheme can be designed for special types of data and application access methods, or it can be general purpose in nature.
As far as SAN data sharing is concerned, the allocation of space in a data-sharing device/subsystem does not have to be controlled by the systems that are reading and writing data to it. There are potentially other ways for systems to signal how much space is needed for a particular data object or update, letting the device or subsystem handle the details. This type of technology is referred to as object-based storage.
By moving space allocation out of the host, an installable file system can be quite "thin" on the host side. Instead of using system cycles to calculate space allocation, the installable file system transfers the data to an appropriate processor in the storage network, which performs the space allocation and stores the data. In this way, the IFS acts as a logical conduit for I/O work between a host system and a storage processor.
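The thin-IFS idea can be sketched as follows: the host side holds no allocation logic at all and simply forwards whole objects, while the subsystem decides which blocks to use (a simple first-fit pool here). Every interface name here is hypothetical.

```python
BLOCK = 512  # illustrative block size in bytes

class StorageSubsystem:
    """Back-end processor: owns all allocation decisions."""
    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))  # free-block pool
        self.extents = {}                  # object_id -> allocated blocks

    def store(self, object_id, data):
        needed = -(-len(data) // BLOCK)    # ceiling division
        blocks, self.free = self.free[:needed], self.free[needed:]
        self.extents[object_id] = blocks   # placement decided here,
        return blocks                      # not by the host

class ThinIFS:
    """Host side: no allocation logic, just a conduit to the subsystem."""
    def __init__(self, subsystem):
        self.subsystem = subsystem

    def write_file(self, name, data):
        return self.subsystem.store(name, data)
```

The host asks for "enough space for this object" rather than naming blocks, which is the essence of the object-based storage approach described above.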
Locking the subsystem
If allocation is performed in a storage subsystem, it may also make sense to implement locking functionality there. Locking at the allocation layer provides the finest possible granularity: locking at the record, field, or byte level. The allocation process manages all the details of actual storage I/O on component devices, so locking conflicts could likewise be resolved at a detailed level by the processors in the subsystem or device.
Caching and cache management
Using a traditional file-system approach, caching in a storage subsystem has some severe limitations. The root of the problem is that the subsystem has no idea what data is being asked for; it only knows that it is being asked for blocks. In contrast, with the allocation component of the installable file system running in the storage subsystem, the subsystem controller can immediately locate all the data that an operation will need. The allocation component of the distributed file system recognizes what is being requested and can perform highly accurate pre-fetching into cache.
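The difference can be shown with a small contrast sketch: a block-only controller can do no better than guess that nearby blocks come next, while a controller running the allocation layer knows the file's full extent list, even when the file is fragmented. Both functions and the extent-map layout are invented for illustration.

```python
def prefetch_block_only(requested_block, window=2):
    # Without file-system knowledge: guess that adjacent blocks
    # will be requested next (readahead).
    return [requested_block + i for i in range(1, window + 1)]

def prefetch_with_allocation(extent_map, file_id, requested_block):
    # With the allocation layer in the subsystem: fetch the remainder
    # of the file's actual (possibly non-contiguous) blocks.
    blocks = extent_map[file_id]
    idx = blocks.index(requested_block)
    return blocks[idx + 1:]
```

For a fragmented file, simple readahead fetches the wrong blocks entirely, while the allocation-aware version pre-fetches exactly the blocks the operation will touch.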
The separation of the data/file layer running in a host system and the allocation layer running in an intelligent back-end storage subsystem across a SAN is shown in Figure 3.
The system on the left is running its native file system and an installable file system. This installable file system performs the data/file layer function, which allows users and applications to locate data and submit I/O requests. On the right, a storage area network with an intelligent storage subsystem runs the allocation layer component of the IFS. All decisions about the placement of data in this file system are made in this storage subsystem.
This concept can be expanded to include multiple systems, each running its own data/file process. They can all be tied into the same allocation-layer process in the intelligent back-end storage subsystem. Figure 3 leaves out the networking details and shows the many-to-one relationship of installable file-system modules to the allocation component in the storage subsystem.
Locating organizational data
Installable file systems are no different from other file systems in that they need organizational data to maintain integrity and locate files. However, the method for doing this can be considerably different, given the model we've been using for separating the data/file layer from the allocation layer.
For the most part, the allocation function finds the precise block storage location of data. The data/file function represents the contents of the file system to applications and users and locates directories and files. The most important directory to find in a file system is the root directory. From the root, all other directories and files can be located fairly quickly.
There are several ways the file system's organizational data can be located and structured to ensure all systems in the SAN have quick access to the same root directory. For example, the storage subsystem with the allocation layer could present a read-only copy of the root file system in a high-speed memory cache that could be accessed directly by data/file layer implementations in SAN systems.
Multiple back-end processors
Not only could systems on the SAN have multiple data/file layer implementations, but they could also have multiple back-end allocation layer implementations. Each back-end subsystem could manage the allocation for some defined subdivision of the file system. All file system entities that belong to that particular allocation subset would be managed by their respective subsystem.
Distributing the data this way means that the organizational structure of the file system would necessarily incorporate network addresses. The subdivision of file-system components across intelligent back-end subsystems could follow a predefined scheme or be managed dynamically by one of the back-end subsystems.
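One possible "predefined scheme" for such a subdivision can be sketched in a few lines: hash the top-level directory of a path to pick the back-end subsystem responsible for its allocation, with directory entries carrying that subsystem's network address. The address strings and hashing choice are invented for illustration.

```python
import zlib

# Hypothetical network addresses of the intelligent back-end subsystems.
SUBSYSTEMS = ["fcp://subsys-a", "fcp://subsys-b", "fcp://subsys-c"]

def responsible_subsystem(path):
    """Map a path to the back-end subsystem managing its allocation,
    keeping each top-level subtree on a single subsystem."""
    top = path.strip("/").split("/")[0]
    h = zlib.crc32(top.encode())
    return SUBSYSTEMS[h % len(SUBSYSTEMS)]
```

Because the mapping is deterministic, every data/file layer in the SAN resolves a given subtree to the same subsystem without consulting a central directory; a dynamically managed scheme would instead keep this table in one of the back-end subsystems.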
Figure 4 shows a many-to-many relationship with data/file layer implementations running in multiple systems and allocation layer implementations running in multiple intelligent storage subsystems. Each data/file layer can store and retrieve data on any allocation subsystem processor.
The concepts presented in this article should give readers an appreciation for the inherent challenges and opportunities of sharing data in a SAN. In general, data sharing will be difficult to achieve at a sub-file level, especially in heterogeneous environments. However, the possibility of using installable file systems to "normalize" data access may be the key to overcoming some of the shortcomings in today's systems. Whether or not the market will ever become comfortable with IFSs remains to be seen.
Marc Farley is vice president of marketing at SanCastle Technologies (www.sancastle.com), and is the author of the book (McGraw Hill).