Getting a grip on data sharing requires understanding shared storage, shared data, data access, and data movement techniques.
Jim Tummins, Durk Watts, and Brad Stamas
Not so long ago, it was popular for relatively independent departments within an enterprise to process and manage their own data--in effect creating islands of data and information. In today's business environments, these islands have become liabilities. Instead, "interoperability" among departmental servers has become critical. Sharing data throughout the enterprise can provide complete, accurate, and consistent information for timely business decisions.
In concept, data sharing represents a significant opportunity in today's enterprise. However, data sharing is often poorly understood and difficult to implement. A clear understanding of the issues involved enables better evaluation of available and future solutions.
There are several ways to share data, though few represent "true" data sharing, in which a single copy of the data is available to everyone. For reasons explained below, all of these techniques are fundamentally software-based.
Data sharing defined
Data sharing is the process of accessing information by multiple applications or users (see figure). This high-level definition is generally accepted in the industry. However, there may be differing views, which result from subtleties among storage, data, and information definitions. The result is a great deal of confusion. That said, various elements are driving the need for data sharing today:
- Rapid growth of duplicated information
- Increased interdependency among mission-critical applications
- Conversions from one platform to another
- Application growth on new platforms
- Increased heterogeneity
- Persistence of legacy systems
- Need for current, up-to-date information
- Growth of and dependency on data marts and warehouses
- High costs of data administration and storage management
Essentially, the need for data sharing is driven by the multiplicity of applications hosted on multiple types of platforms and operating on related information. As a result, organizations are saddled with high administration and support costs (for duplicate storage resources, multiple management products, multiple systems administration efforts, etc.), which prevent new applications from being introduced.
To learn how to implement data sharing--and to uncover potential constraints--it is important to first understand how data is accessed from storage devices.
There are three basic methods of access: raw, file-system, and database. All three require knowledge of both logical and physical data structures; the configuration, capabilities and geometry of the storage devices; and I/O protocols used by each storage device. The forms of data access differ from each other in terms of the degree and type of assistance provided to the application, with raw access offering the least assistance and database access offering the most.
All forms, however, rely on the operating system's block I/O services, which provide a low-level bridge between the operating system and the physical storage devices. All systems have some form of block I/O services (e.g., the SCSI driver on Unix platforms or the I/O subsystem on MVS platforms).
The three forms of access also provide different access control mechanisms, caching services, and lock management services. Each form also generates unique meta-data for the organization and access of data under its control.
The result is distinct domains of data, i.e., database, file-systems, and raw access. Applications access data in these domains through data-access service layers using the layer's API.
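The three forms of access can be sketched in a few lines of code. This is an illustrative sketch only: a regular file stands in for a raw device, and SQLite stands in for an enterprise database service. Note how each form shifts more of the structural bookkeeping away from the application.

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()

# Raw access: the application addresses bytes directly and must impose
# its own record layout. (A regular file stands in for a raw device.)
fd = os.open(os.path.join(tmp, "raw.dev"), os.O_RDWR | os.O_CREAT)
os.write(fd, b"ACCT0001\x00\x00\x00\x2a")    # record format is the app's problem
os.lseek(fd, 0, os.SEEK_SET)
record = os.read(fd, 12)
os.close(fd)

# File-system access: the OS supplies naming, byte streams, and access control.
with open(os.path.join(tmp, "accounts.txt"), "w") as f:
    f.write("ACCT0001,42\n")

# Database access: the service supplies logical structure, queries, and locking.
db = sqlite3.connect(os.path.join(tmp, "accounts.db"))
db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
db.execute("INSERT INTO accounts VALUES ('ACCT0001', 42)")
balance = db.execute(
    "SELECT balance FROM accounts WHERE id = 'ACCT0001'").fetchone()[0]
db.close()
```

Each layer also generates its own meta-data (inodes, catalog tables, etc.), which is why the resulting data domains are distinct.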
Application access to stored data is defined and controlled by the data-access service provider, such as a file system or database. Therefore, multiple applications that need access to data controlled by a particular data-access service must have an interface with that service.
This distinction is taken one step further with file systems and databases. For example, DB2 and Oracle employ different APIs, logical structures, ways of relating data items to one another, and so forth. As a result, Oracle data is different from DB2 data, and Oracle can't be used to access data under DB2's control. Vendors, not just platforms, further segregate data domains.
In the past, no single platform could provide all possible data-access services sufficiently and simultaneously. Different factors drove the development, implementation, and use of applications. Consequently, applications, their data, and their dependent data access services tended to coalesce on individual platforms--hence the model of multiple, specialized data servers, which continues to be used in most organizations today. For example, there may be one or more file servers, an e-mail server, database server, etc. In this environment, processing platforms, storage devices, applications, and administration tools are generally distinct from one another.
Operating with distinct data servers results in distinct storage resources, which means multiple vendors, management methodologies, and maintenance approaches.
Shared storage addresses a number of problems associated with distinct data servers. Benefits accrue (arguably) due to economies of scale, a single storage supplier, reduced administration, reduced amount of storage, flexible allocation, and ease of upgrade.
Shared storage objectives include:
- Providing shared physical storage assets for multiple users
- Lower price and total cost of ownership
- Just-in-time capacity management (re-assign capacity to new hosts as requirements change)
- Storage-based assists (such as remote dual copy)
- Enabling centralized physical storage management
- Leveraging existing resources and procedures
- Standardizing storage management procedures
Shared storage allows independent disk storage to be aggregated and centralized. Storage capacity is shared, but data is not. While different data types reside within the same physical storage subsystem, partitioned storage by itself does not enable data sharing. However, shared storage is a requisite step for data sharing.
For data to be shared by multiple applications, the applications must be able to access common or equivalent data. Multiple-application access to data can be supported through solutions that either enable access to a single instance of data, called data access, or through solutions that enable applications access to equivalent images of data, referred to as data movement (see figure below).
While multiple-application access to a single instance of data may appear to be the most efficient solution (due to reduced storage space and administration), it often isn't (due to technical complexity). Instead, many organizations employ both data access and data movement solutions to enable data to be used by multiple applications.
One approach to data sharing is through data-access services. This method is used by Network File System (NFS), databases, many client/server applications, and clustered processing systems. Methods of this type are often described as "true" or "direct" data sharing because only a single, common instance of data is shared among multiple applications. However, these methods provide "true" data sharing only to applications that use APIs defined by their own data-access services. Data-access sharing techniques are fundamentally software-based.
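The single-instance model can be illustrated with a small sketch, using SQLite as a stand-in for a shared database service. Two connections play the role of two applications; because both go through the same data-access service's API, the service's own lock manager mediates their updates, and there is only one copy of the data.

```python
import os
import sqlite3
import tempfile

# One shared instance of the data, accessed through one service's API.
path = os.path.join(tempfile.mkdtemp(), "shared.db")

app_a = sqlite3.connect(path)          # "application A"
app_a.execute("CREATE TABLE inventory (part TEXT PRIMARY KEY, qty INTEGER)")
app_a.execute("INSERT INTO inventory VALUES ('widget', 100)")
app_a.commit()

app_b = sqlite3.connect(path)          # "application B"
app_b.execute("UPDATE inventory SET qty = qty - 10 WHERE part = 'widget'")
app_b.commit()

# A sees B's committed change immediately: there is no second copy to reconcile.
qty = app_a.execute(
    "SELECT qty FROM inventory WHERE part = 'widget'").fetchone()[0]
```

The same pattern underlies NFS and clustered database access: sharing works only for applications written to the controlling service's API.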
In contrast, data movement, otherwise known as "indirect" data sharing, uses tools to copy data from one location to another so that other applications can access equivalent data.
The tools differ from one another in terms of sophistication (e.g., direction of data movement, granularity of information moved, selected data movement versus bulk data movement, ability to transform data, and ability to synchronize independent data stores). See the sidebar for a list of available data movement utilities.
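The simplest point on that spectrum, a one-way copy tool, can be sketched as follows. This is a minimal illustration, not a product: it transfers a file from "System A" to "System B" only when the target image differs from the source, which is the basic behavior copy management tools refine with scheduling and subsetting.

```python
import hashlib
import os
import shutil
import tempfile

def copy_if_changed(source, target):
    """One-way data movement: refresh the target image only when the
    source has changed, creating an equivalent copy (not shared data)."""
    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()
    if not os.path.exists(target) or digest(source) != digest(target):
        shutil.copy2(source, target)
        return True
    return False

tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "system_a.dat")   # source on "System A"
dst = os.path.join(tmp, "system_b.dat")   # target on "System B"
with open(src, "w") as f:
    f.write("customer records v1")

moved_first = copy_if_changed(src, dst)   # target missing: data moves
moved_again = copy_if_changed(src, dst)   # images equivalent: nothing moves
```

Note that after the copy there are two independent images; keeping them coherent is exactly the synchronization problem the more sophisticated tools address.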
At first glance, the data-access approach may appear to be the better choice since only one copy of data is involved, which means lower storage and administration costs. However, this approach is complex and can be expensive. In contrast, data movement methods are generally less complex to implement and often offer better performance than data-access alternatives.
In fact, according to surveys conducted by Gartner Group and other consulting firms, the use of data movement tools is increasing. This is in part due to increased use of software that allows near-real-time and asynchronous access of equivalent data.
No silver bullet
No single method addresses data sharing across all platforms and application architectures. However, a wide range of methods is available--each enabling data sharing within defined and limited contexts.
The method of choice depends on the purpose of the application, type of host and storage platform, level of data-access granularity, degree of access protection needed, level of data coherency needed across multiple data stores or applications, and kind of dependencies that exist within the data to be shared--not to mention cost and performance issues.
The block I/O interface (SCSI, ESCON, etc.) makes it possible to move blocks between the host and storage devices. Without interpreting the data contained within a block, it is impossible to know the relationship between any two blocks, which sequences of blocks comprise a file, which application is accessing data, or how to provide access control.
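The opacity of the block interface is easy to demonstrate. In this sketch, a flat byte store stands in for a disk volume: two unrelated records are written and then read back as fixed-size blocks, the only view a block I/O interface provides.

```python
import os
import tempfile

BLOCK = 16  # illustrative block size; real devices use 512-byte or larger blocks

# Write two unrelated records into one flat byte store (our stand-in volume).
store = os.path.join(tempfile.mkdtemp(), "volume.img")
with open(store, "wb") as vol:
    vol.write(b"INVOICE-7731".ljust(BLOCK, b"."))
    vol.write(b"PAYROLL-Q3".ljust(BLOCK, b"."))

# Read it back the way a block interface serves it: anonymous, fixed-size blocks.
with open(store, "rb") as vol:
    blocks = [vol.read(BLOCK) for _ in range(2)]

# At this level the blocks are opaque byte runs. Nothing in the interface
# says which blocks belong together, which file they form, or who may read
# them; that knowledge lives in file-system or database meta-data above.
```

This is why storage-level sharing schemes must take on some of the interpretation work normally done by data-access services.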
From a purely physical view, data sharing faces obstacles. For example, to share data via a storage device, the device must assume some of the functions of data access services. And if the storage device supports data sharing through data movement methods, an understanding of the logical organization of source and target data is needed.
The next opportunity beyond data sharing is providing intelligent information management and information accessing services. To that end, storage systems need to become more intelligent, assuming storage management and access management roles. This process began with advanced storage controllers and will continue with storage-area-network technologies.
[Figure: Data sharing is the process of accessing information by multiple applications or users.]
[Figure: Applications can access data via raw access, file-system access, or database access.]
[Figure: Data access enables application access to a single instance of data, while data movement techniques enable applications to access equivalent images of data.]
Data movement methods and tools
The Gartner Group, an IT consulting firm in Stamford, CT, distinguishes several types of data movement utilities:
File transfer programs move entire files in one direction, from System A to System B.
Copy management tools physically move entire files, subsets of files, or databases in one direction from System A to System B. Copy management tools may utilize file transfer programs. These tools provide capabilities to better schedule file movement (as opposed to the user-written scripts typically used with file transfer programs).
Replication uses database logs or triggers to identify changes since the last replication interval. Changes are sent as insert, update, or delete transactions from source to target. Replication is commonly supported in a homogeneous environment (e.g., Oracle to Oracle, Notes to Notes, etc.).
Data propagation tools move all changes (as opposed to net changes) between heterogeneous environments, including non-relational and relational. These tools support data transformation functions, such as date-format conversion.
Data synchronization tools exchange differences between two data stores to achieve information synchronization. Data transformation is also supported. Synchronization tools are generally used when two or more independently updated data stores need to be synchronized periodically.
Extraction and transformation tools are designed to support efficient bulk movement of information from one platform to another.
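The log-and-replay mechanics behind replication can be sketched with SQLite, which here stands in for both the source and target databases. This is an illustration of the technique only: a trigger captures each change in a change-log table (playing the role of the database log), and the logged changes are then replayed at the target as insert and update transactions.

```python
import sqlite3

# Source database: triggers capture each change in a log table,
# standing in for the database log that replication tools read.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER);
    CREATE TABLE change_log (op TEXT, id TEXT, balance INTEGER);
    CREATE TRIGGER log_ins AFTER INSERT ON accounts
        BEGIN INSERT INTO change_log VALUES ('insert', NEW.id, NEW.balance); END;
    CREATE TRIGGER log_upd AFTER UPDATE ON accounts
        BEGIN INSERT INTO change_log VALUES ('update', NEW.id, NEW.balance); END;
""")
src.execute("INSERT INTO accounts VALUES ('A1', 100)")
src.execute("UPDATE accounts SET balance = 75 WHERE id = 'A1'")

# Target database: replay the logged changes as transactions.
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
for op, ident, balance in src.execute("SELECT op, id, balance FROM change_log"):
    if op == "insert":
        dst.execute("INSERT INTO accounts VALUES (?, ?)", (ident, balance))
    elif op == "update":
        dst.execute("UPDATE accounts SET balance = ? WHERE id = ?",
                    (balance, ident))
dst.commit()

replicated = dst.execute(
    "SELECT balance FROM accounts WHERE id = 'A1'").fetchone()[0]
```

Commercial replication tools add interval scheduling, conflict handling, and delete propagation on top of this basic capture-and-apply loop.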
Jim Tummins and Durk Watts are technical marketing managers and Brad Stamas is a consulting analyst for communication and I/O technologies, at Storage Technology Corp., in Louisville, CO.