A new class of intelligent storage subsystems is emerging, putting focus on data sharing and file systems.
By Marc Farley
NAS products enable data sharing on a file level over the data network. NAS is an entrenched technology with a great future. NAS does not provide block-level or small-granularity I/O operations over a storage network. This chapter looks at distributed storage and I/O technologies based on smaller-granularity accesses and data transfers that enable data to be shared at a sub-file or record level. The term small granularity here refers to transfers that are not necessarily made to specific block addresses, but instead convey commands and data that can be processed by intelligent storage devices or subsystems that manage their own internal storage addresses and resources.
Data sharing is used in this chapter to identify environments where multiple applications can read and write small-granularity elements of data concurrently. The focus is on sharing data as opposed to sharing storage. Specifically, disk partitions and storage pooling are not the topic of this chapter. The discussions in this chapter center on block-level or other small-granularity I/O, as opposed to the file I/O used by NAS products.
Packing power in storage subsystems
One of the most interesting developments in network storage is the integration of intelligent processors into storage devices and subsystems. For many years, intelligent processors have been used in storage subsystems to manage the functions of the subsystem. Storage subsystems have the physical space and capacity/price ratios to accommodate considerable processing power, equal to that of multiprocessor Unix systems.
Some of the same companies that have been building computer systems for years are now finding themselves using their expertise to develop large, intelligent open-systems storage subsystems. Intelligent storage subsystems are not necessarily new, as they have been around for many years in the mid-range and mainframe world. What is new is the ability to link multiple systems and storage subsystems together in a storage network and create a different type of open-systems computing environment.
As the trend appears to be toward integrating higher levels of processing power in storage subsystems, one can't help thinking about the potential applications. Current I/O applications such as mirroring and RAID might not be able to exercise all the available processing power. This new class of intelligent storage subsystem will deliver higher levels of data availability and storage management performance.
Storage subsystems today are already capable of performing many of the same virtualization functions as volume managers, device drivers, and I/O controllers; what's left to work through is the distribution of these functions among higher-level, host-resident processes and the storage subsystem. An example might be a storage subsystem that integrates the storage details of a database system to achieve higher cache hit ratios for faster performance.
This notion of intelligent back-end storage processors is illustrated above. The diagram shows a hypothetical machine with two network ports, which allow access to the subsystem and whose communications are controlled by a communications director. The subsystem also contains a storage management computer that manages the internal device resources and provides several varieties of storage and data management functions, including a virtual storage manager that manages aggregated or subdivided device and caching resources.
Storage pooling and volume management
One of the functions belonging to the intelligent back-end storage subsystem, as shown in the figure above, is a virtual storage manager. In essence, the virtual storage manager enables storage pooling. Storage pooling is an application of device virtualization that aggregates and subdivides storage resources. The basic idea is that storage devices and subsystems can be combined through striping, RAID, or concatenation and then parceled out into sub-units of virtual storage.
The virtual storage manager in an intelligent back-end storage subsystem can allocate its processing and storage resources in many ways. For example, the internal storage structure could be a single large RAID subsystem exported as four logical drives. This configuration is fairly simple and does not need a lot of processing power, but it does require all virtual drives to use the same RAID level. Alternatively, the subsystem could run four separate RAID implementations across different partitions or drives, or combinations of both. While this is the sort of thing that can be done in host-based volume managers, it can also be done in an intelligent back-end storage subsystem with the processing power to provide multiple RAID functions simultaneously, including parity rebuilds and snapshot synchronization.
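The aggregate-then-subdivide idea can be sketched in a few lines of Python. This is only an illustration, assuming concatenation rather than striping or RAID; the class name ConcatenatedPool, the device names, and the block counts are all hypothetical and do not describe any particular product.

```python
class ConcatenatedPool:
    """Hypothetical pool: devices joined end to end, then split into virtual drives."""

    def __init__(self, devices):
        # devices: list of (device_name, block_count) pairs, in concatenation order
        self.devices = list(devices)
        self.total_blocks = sum(count for _, count in self.devices)
        self.next_free = 0          # first unassigned block in the pooled address space
        self.virtual_drives = {}    # drive name -> (pool offset, length in blocks)

    def carve(self, name, blocks):
        """Parcel out the next `blocks` blocks of the pool as a virtual drive."""
        if self.next_free + blocks > self.total_blocks:
            raise ValueError("not enough pooled capacity")
        self.virtual_drives[name] = (self.next_free, blocks)
        self.next_free += blocks

    def resolve(self, name, virtual_block):
        """Translate a virtual drive block into (device name, device block)."""
        offset, length = self.virtual_drives[name]
        if not 0 <= virtual_block < length:
            raise IndexError("block outside virtual drive")
        pool_block = offset + virtual_block
        for device_name, count in self.devices:
            if pool_block < count:
                return device_name, pool_block
            pool_block -= count
        raise RuntimeError("pool bookkeeping is inconsistent")


pool = ConcatenatedPool([("disk0", 100), ("disk1", 100)])
pool.carve("vd0", 150)   # spans disk0 and part of disk1
pool.carve("vd1", 50)    # lives entirely on disk1
```

In this sketch a virtual drive can straddle a device boundary, which is exactly what the server using it never needs to know.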
Data sharing is a completely different technology than storage sharing. Whereas storage sharing gives each server its own virtual private storage container, data sharing involves the access of multiple systems to the same addressable storage location(s). Data sharing is on the same level as server-cluster technology in terms of its complexity. However, from our perspective in early 2000, the concept of data sharing in a storage network appears like a dust storm on the horizon that one can see coming from far away, bringing unavoidable turmoil.
Sharing data on a disk drive
The simplest model for data sharing is a single disk drive that is accessed through a single port. First, we'll examine the physical connections used to do this and then the logical side of the equation. As commands from multiple systems are received in the drive, they are placed in a buffer or queue where they may be sorted or serviced in a first in, first out (FIFO) fashion. The figure shows servers A, B, C, and D connected on a Fibre Channel loop network to a Fibre Channel drive.
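The two queuing options mentioned above, FIFO service and sorted service, can be illustrated with a small Python sketch. The CommandQueue class and the choice of block address as the sort key are assumptions made for illustration; a real drive's firmware may order commands on entirely different criteria.

```python
from collections import deque


class CommandQueue:
    """Hypothetical drive-side queue of (system, block address) commands."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, system, block):
        self._queue.append((system, block))

    def drain_fifo(self):
        """Service commands strictly in arrival order."""
        commands = list(self._queue)
        self._queue.clear()
        return commands

    def drain_sorted(self):
        """Service commands in ascending block order (arrival order breaks ties)."""
        commands = sorted(self._queue, key=lambda command: command[1])
        self._queue.clear()
        return commands


queue = CommandQueue()
for system, block in [("A", 900), ("B", 10), ("C", 500), ("D", 10)]:
    queue.enqueue(system, block)
```

Sorting by address trades fairness in arrival order for less head movement, which is one reason a drive might prefer it when several systems are active.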
The systems in the figure can be used for different applications. For instance, System A is a file server, System B is a database server, System C is an email server, and System D is a web server. As each system on this network wants to access the drive, it arbitrates for control of the loop, logs into the drive's L-port, and begins transferring data.
Shared data operations on a single device
Now, we'll take a quick look at the anticipated operating requirements for a shared disk drive and consider how it manages access from multiple systems. Assume there is a single partition in the drive that supports all four systems. In other words, the four systems share a single logical drive. There is only one partition and one addressable logical drive on the storage network being shared by all four systems.
The figure illustrates how the four systems could access the shared logical drive. System A is working with file A, system B is working with file B, and systems C and D are working on file C. In fact, any system could be working on any file on the disk drive at any time.
So, how much processing power is needed in this disk drive? To begin with, the disk drive needs to support simultaneous communications with the four systems. Even though only one server at a time is logged in, the login process is very fast compared to the time it takes to perform disk I/O operations. That means the disk drive is constantly changing its login partner and queuing commands for them. In this case, a dual-ported drive would make a lot of sense to allow concurrent communications with more than one system, but it is not explicitly used in this example in order to simplify the model.
I/O operations for each system may take several logins. With multiple sessions taking multiple logins each, it is clear that the disk drive needs to incorporate some sort of communications management that independently tracks the operations for each system. This raises a subtle but important point: Because the disk drive can service multiple systems, it needs to implement an intelligent error-recovery mechanism that allows it to recover data or sessions as fast as possible for each individual system connection. An error that causes all communications sessions to fail with all systems would negatively affect performance on the network. Such a multi-system, error-recovery mechanism requires more processing power in a disk drive than a simple single master or host-dependent error and alerting mechanism.
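A minimal sketch of this per-system tracking follows. It assumes that each command carries an identifier and that recovery simply means handing one system's outstanding commands back for retry; the SessionTable class and its method names are hypothetical.

```python
class SessionTable:
    """Hypothetical tracker of outstanding commands, kept separately per system."""

    def __init__(self):
        self.pending = {}   # system name -> list of outstanding command ids

    def submit(self, system, command_id):
        self.pending.setdefault(system, []).append(command_id)

    def complete(self, system, command_id):
        self.pending[system].remove(command_id)

    def recover(self, system):
        """Reset one system's session and return its commands for retry.

        Sessions belonging to other systems are left untouched.
        """
        return self.pending.pop(system, [])


sessions = SessionTable()
sessions.submit("A", 1)
sessions.submit("A", 2)
sessions.submit("B", 3)
retry = sessions.recover("A")   # only system A's work is disturbed
```

The point of the design is isolation: an error on one connection costs that connection a retry, while the other systems continue as if nothing had happened.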
Now that access to the drive is possible and can be managed by the drive, it is probably necessary to implement some form of security. While systems A, B, C, and D all have access to this drive, there may be other systems that need to be explicitly restricted from accessing it. There is obviously a need for higher intelligence to be able to support a variety of security methods, from simple password security to complex encryption and authentication.
With reliable and secure communications in place, one can turn to the usual storage problem: performance. There is little doubt that caching for such a drive could be a major benefit, as it is for most drives. The value of the cache is not the question, but rather how the cache is allocated and what form of cache should be used. Does the data access require a most recently used cache or a read-ahead cache? On the write side, does it need write back or write through?
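The write-side question can be made concrete with a short sketch. It models the media and the cache as plain dictionaries, which is an assumption made purely for illustration; the CachedDrive class is hypothetical.

```python
class CachedDrive:
    """Hypothetical drive model contrasting write-through and write-back."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.media = {}      # block -> data that has reached the disk
        self.cache = {}      # block -> most recent data held in memory
        self.dirty = set()   # blocks cached but not yet on the media

    def write(self, block, data):
        self.cache[block] = data
        if self.write_back:
            self.dirty.add(block)      # acknowledge now, write to media later
        else:
            self.media[block] = data   # write-through: media updated immediately

    def flush(self):
        """Push every dirty block to the media (only matters for write-back)."""
        for block in sorted(self.dirty):
            self.media[block] = self.cache[block]
        self.dirty.clear()
```

Write-back answers the host sooner but leaves a window in which the only copy of new data is in memory, so it is usually paired with some form of protected cache.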
Each system using the drive could employ slightly different access requirements that would affect the selection of the cache. In a single disk drive, the question is how to provide a balance of resources and algorithms that provides benefits for each of the different applications working on the drive. One approach would be to provide separate virtual caches that can be managed as a single memory resource by the cache controller.
Managing cache as a single shared resource results from the high probability that access to the disk by all four systems will not be equally distributed during short periods of time. Instead, it is highly likely that one or two systems will determine the I/O operations during a short time interval. However, over a longer period of time, say several seconds, that could change, and a different set of systems would be driving I/O operations. Trying to understand how to adapt a device cache to fit this kind of access is a challenge and would probably require a fair amount of processing power to manage it. In the future, intelligent storage devices and subsystems may be differentiated by the effectiveness of their caching.
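One way to picture a cache managed as a single resource is a least recently used structure that all systems feed, with bookkeeping that shows how much of it each system occupies at any moment. The sketch below is an assumption-laden illustration, not a description of any shipping controller; the SharedCache class, its capacity, and the fetch callback are hypothetical.

```python
from collections import Counter, OrderedDict


class SharedCache:
    """Hypothetical read cache shared by all systems, with LRU eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # block -> (system that loaded it, data)

    def read(self, system, block, fetch):
        if block in self.entries:
            self.entries.move_to_end(block)      # mark as most recently used
            return self.entries[block][1]
        data = fetch(block)                      # cache miss: go to the media
        self.entries[block] = (system, data)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict the least recently used
        return data

    def usage(self):
        """Count cached blocks by the system that brought them in."""
        return Counter(owner for owner, _ in self.entries.values())
```

Because eviction is global, a system that drives most of the I/O during a short interval naturally ends up holding most of the cache, and it gives that space back when another system becomes busy.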
So, we can see that providing data sharing in something as simple as a disk drive requires much more processing power than it would for any typical disk drive working with a single server. Data sharing is also likely to increase the requirement for memory to be used as cache. It is not impossible to implement data sharing in a disk drive, but it might not be economically feasible either. For that reason, it might be easier to think about data sharing as being more practical for a storage subsystem with the additional physical space and resources available. Also, if data sharing is done on a large scale with large amounts of data, it would probably be necessary to use a large intelligent back-end storage subsystem that could accommodate the storage and memory requirements, as well as provide the processing power required to perform all the required functions.
The value of data sharing
The value of data sharing does not result from its role as a storage application nearly as much as it does from being a data management application, as its name implies. Data sharing is a way to maintain data currency, which is the ability to maintain a current view of data. Sometimes data is replicated, creating "working copies" for different applications and purposes. Unfortunately, creating multiple copies of data can lead to problems. Specifically, it can result in having data that is no longer relevant and can sometimes create a "Who's on first?" scenario, where it becomes difficult to determine which working copy is the best one to use for a particular application.
Managing a single source of data
Multiple versions of data are a natural byproduct of distributed data processing. Data generated by one system for use on other systems represents data at a certain point in time. At a simple level, copying or emailing a file creates a new version of data. In a more complex environment, a data version can be created as a result of a database export operation. In both cases, the versions cannot be updated by the original application, and the potential exists to start generating "renegade" data sets that do not resemble the "real" data controlled by the original application.
Problems with multiple versions of data
The errors and consequences of having multiple versions of data become greater as you move up the food chain in the business organization. For example, a corporate accounting document that uses the wrong data could misstate corporate financial information, resulting in ill-advised decisions and general confusion in the organization.
Many systems managers experience serious difficulties generating data versions and moving data between systems. The load on the processors generating versions can be problematic for systems managers. Data transfers are often preceded by the initial system creating an export operation, as it is with database systems, or some other format conversion, both of which require processing resources. Similarly, large data transfers can consume a substantial amount of network bandwidth, take a long time to run, and negatively affect other network processes.
One application that is particularly problematic is data warehousing. Typically, a data warehouse is built from data from many sources. Various applications process data in the warehouse and create subsets of the data that are needed by other systems in the corporation. Each of these systems depends on the timely delivery of information by the data warehouse. Depending on how the information is distributed, several different versions of the same information may be used by different machines throughout the company.
Data sharing as a solution
Database managers working with enterprise resource planning (ERP) systems understand the value of integrating applications across the organization. Still, the issue of getting the right versions of data to these integrated applications is a problem. So, the goal of achieving "normalized" network data through data sharing, similar to the concept of normalizing data in a database, is powerful. Normalization refers to the concept in relational database technology where an identified data object is represented in only one location in the database, eliminating redundant representations that can result in inconsistent internal data. Normalized network data can be very beneficial not only to individuals and work teams, but also to the corporation as a whole.
Consolidating data management resources
Just as storage sharing pools physical storage components and allows them to be consolidated and centrally managed, data sharing pools information and delivers the ability to manage and protect it as a single resource. However, unlike storage sharing, where each system must manage the storage capacity allocated to it, data sharing allows data management to be accomplished by a single manager, which could significantly reduce the burden of managing data on multiple disparate systems. This means that individual file systems and database systems would have to cooperate or give up jurisdiction over the storage spaces they manage.
Space allocation for data sharing
The notion of the I/O stack is useful in understanding how data sharing can work. A hypothetical I/O stack could be represented like this:
In this stack, the data/file layer and allocation layer functions are traditionally performed by file or database systems. The block translation layer is performed in part by the file or database system and again, recursively, by any downstream functions such as volume managers or host I/O controllers that provide virtualization.
Storage sharing, or pooling, is primarily a matter of block translation technology that is already done by various products at various points in the I/O path. The difference between storage sharing and data sharing is that data sharing, at a minimum, requires the allocation layer to be jointly managed by the participating file and database systems or managed by an independent entity.
Traditionally, the allocation function is performed as a discrete function of the file or database system. However, it is at least theoretically possible for database and file systems to be restricted to the data/file layer, where they represent a hierarchy of files, directories, and database entities to application systems, while another process manages the combined storage allocation of all systems accessing the shared storage resource. In other words, the allocation layer could be performed at the storage device or subsystem, particularly an intelligent back-end storage subsystem.
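The idea of an independent allocation layer can be sketched as a single manager that hands out blocks to any participating file or database system. Everything here is hypothetical: the AllocationManager class, the client names, and the simple free list are assumptions chosen to keep the example short.

```python
class AllocationManager:
    """Hypothetical allocation layer running in the storage subsystem."""

    def __init__(self, total_blocks):
        self.free = list(range(total_blocks))   # unallocated block numbers
        self.owner = {}                         # block -> (client, object name)

    def allocate(self, client, name, count):
        """Give `count` blocks to a client's file or database object."""
        if count > len(self.free):
            raise ValueError("shared storage is full")
        blocks = self.free[:count]
        del self.free[:count]
        for block in blocks:
            self.owner[block] = (client, name)
        return blocks

    def release(self, blocks):
        """Return blocks to the shared free list."""
        for block in blocks:
            del self.owner[block]
            self.free.append(block)


manager = AllocationManager(total_blocks=8)
file_blocks = manager.allocate("file-system-A", "report.doc", 3)
table_blocks = manager.allocate("database-B", "orders", 2)
```

Because only one entity ever decides where data lands, two clients can never be handed the same block, which is the property that separately managed allocation layers cannot guarantee on a shared drive.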
A unified file-system view
One difficulty at the core of data sharing is satisfying the requirement for multiple users/applications working concurrently to get a single, unified view of the data that everybody can work with. This means that each participating file system has to have some way of determining where file and directory objects are stored. At first, this might sound a bit silly, but it is a very real problem.
A single file system can manage the allocation of storage locations on multiple real or logical disk volumes. In reality, this is done through the use of device virtualization, which creates a virtual drive for the file system to work with, as illustrated.
But this is not the same as having multiple systems managing the allocation of storage locations on a single shared drive (real or virtual). Multiple file systems can access a shared storage resource in a SAN, but they have to work together, or at least "agree" on which file system controls the allocation of space (see figure below).
The scenario illustrated here works only under specialized conditions. Each participating file system needs to be able to identify when another file system is already managing the storage, and each participating file system needs to adhere to the access rules. Non-managing file systems are not allowed to write files and update the file system information. A single, consistent order needs to be maintained in order for all participating systems to be able to find and read the file-system information located on the shared drive.
This example sheds light on one of the current problems with using Windows NT in SANs: Windows NT systems do not recognize when other file systems are already managing the storage resource and simply attempt to write their own allocation information. The results of this are unpredictable and unlikely to be pleasant; there is a good chance data will be lost.
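The access rules described above can be sketched as a simple ownership check. The sketch assumes the shared drive carries a marker naming the managing file system; that marker, the SharedDrive class, and the function names are hypothetical. A real implementation would also need the claim to be atomic across the network, which this single-process example ignores.

```python
class SharedDrive:
    """Hypothetical shared drive with a marker naming its managing file system."""

    def __init__(self):
        self.manager = None          # id of the file system controlling allocation
        self.allocation_table = {}   # file name -> list of blocks


def claim(drive, fs_id):
    """Become the manager only if nobody manages the drive yet."""
    if drive.manager is None:
        drive.manager = fs_id
    return drive.manager == fs_id


def write_file(drive, fs_id, name, blocks):
    """Only the managing file system may update allocation information."""
    if drive.manager != fs_id:
        raise PermissionError(f"{fs_id} does not manage this drive")
    drive.allocation_table[name] = blocks


def read_file(drive, name):
    """Any participating file system may read through the shared table."""
    return drive.allocation_table[name]
```

A file system that skips the check in write_file and writes its own allocation information anyway is, in effect, the misbehaving participant described above.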
File-system organizational data
When a file system accesses a storage resource, it presumes to know where to find the information it needs about the data that is placed there. This is normally accomplished by setting aside a reserved range of storage locations that the file system uses for storing its internal organizational information. In general, the file system needs two kinds of organizational data: structural and content information.
The structural information is used to help the file system maintain integrity through redundant entries and to provide regular locations that it can count on to find information it needs to function. The idea is that if the file system cannot read the data it needs in the place it expects to due to data corruption or some other reason, it can look elsewhere to find another copy.
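The look-elsewhere behavior can be sketched as follows. The example assumes that each copy of the structural record carries a CRC-32 checksum and that the file system knows a fixed list of locations to try; the record layout, the locations, and the function names are all hypothetical rather than taken from any particular file system.

```python
import zlib


def make_record(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte CRC-32 so corruption can be detected."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def read_record(raw):
    """Return the payload if the checksum matches, otherwise None."""
    if raw is None or len(raw) < 4:
        return None
    stored = int.from_bytes(raw[:4], "big")
    payload = raw[4:]
    return payload if zlib.crc32(payload) == stored else None


def find_structural_info(drive, locations):
    """Try each well-known location in turn until a valid copy is found."""
    for location in locations:
        payload = read_record(drive.get(location))
        if payload is not None:
            return location, payload
    raise IOError("no valid copy of the structural information")
```

Here the drive is modeled as a dictionary from block address to raw bytes, so a missing or damaged primary copy simply causes the search to move on to the next reserved location.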
Content information is basically the directory structure used to locate the hierarchy of directories and files that typically make up an operating system. The most important directory is the root directory. A file system can typically find everything else in the storage resource if it knows how to find the root. The diagram below shows a file system locating its organizational data on a drive.
Typically, file systems work on locally attached drives. By nature, storage networking extends the idea of local to include SAN-based drives. This isn't necessarily a big step, but it is interesting to think that file-system organizational data is now found by using a network address that is potentially available to every system on the SAN. The notion that file-system organizational data is so readily accessible on the SAN could certainly impact the deployment of security and protocols on the SAN. The figure below shows a file system locating its organizational data on a real or virtual drive in a SAN.
The only new element is the presence of the SAN, which, for the purposes of locating data, translates into an address on a storage network. In Fibre Channel networks, the network addressing is transparent to the file system and is handled by the device drivers and host I/O controllers that translate SCSI target-LUN addresses to Fibre Channel network addresses. However, this does not mean that the file system cannot incorporate network addresses as part of its inherent ability to locate its organizational data on a real or virtual drive connected to a storage network. In other words, a network address can be part of the information that a file system uses to locate its organizational data.
Distributed or centralized organizational data
Beyond adding a network location to the method of locating organizational data, there is the notion that file-system organizational data can be distributed over multiple resources.
The idea here is simply an extension of the previous concept, where the location of a real or virtual drive can include a network address component. Similarly, the location of organizational data within a file-system structure can also contain a network component. Readers familiar with linear algebra will recognize the network address as an additional vector that is added to the address. This opens the door for the potential of file systems to be built on top of network structures, as opposed to device structures. The figure below shows a file system that is able to find its organizational data scattered across multiple real or virtual drives in the SAN.
Notice that this approach allows a file system to span multiple real or virtual devices, without first needing them to be aggregated by a single virtualization point in the I/O path.
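A small sketch shows what it might look like for every pointer in a file system to carry a network component. The Location tuple, the port names, and the nested-dictionary model of the SAN are assumptions made for illustration only; they do not correspond to real Fibre Channel address formats.

```python
from typing import NamedTuple


class Location(NamedTuple):
    port: str    # hypothetical network address of a real or virtual drive
    lun: int     # logical unit on that drive
    block: int   # block within the logical unit


def read(san, location):
    """Fetch whatever is stored at a network-qualified location."""
    return san[location.port][location.lun][location.block]


def resolve(san, root, path):
    """Walk directories that may live on different drives in the SAN."""
    location = root
    for name in path:
        directory = read(san, location)   # maps entry name -> Location
        location = directory[name]
    return read(san, location)


# Hypothetical SAN contents: the root, a subdirectory, and a file on three drives.
san = {
    "port-1": {0: {0: {"docs": Location("port-2", 0, 16)}}},
    "port-2": {0: {16: {"plan.txt": Location("port-3", 1, 40)}}},
    "port-3": {1: {40: b"quarterly plan"}},
}
root = Location("port-1", 0, 0)
contents = resolve(san, root, ["docs", "plan.txt"])
```

No single virtualization point assembles the three drives into one volume here; the file system itself follows the network component of each pointer.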
Marc Farley is vice president of marketing at SanCastle Technologies, which designs and builds Gigabit networking switching fabrics. He is the author of the book Building Storage Networks (McGraw Hill, 2000), as well as numerous white papers and magazine articles on storage I/O technology.
This article was excerpted from Building Storage Networks by Marc Farley. Copyright 2000 by the McGraw Hill Companies. Reprinted with permission.