The Role of SANs in Data Warehousing
SANs can augment LANs to meet data warehousing demands.
By Dino Balafas
Today's competitive business world is changing rapidly. Acquisitions, geographically scattered offices, and efforts to integrate diverse networking infrastructures are all commonplace. And the data used to run these businesses is growing exponentially. Making data available quickly and cost-effectively to such diverse operations is critical to business success.
For many IT organizations, data warehouses are the answer. The Gartner Group, an IT consulting firm in Stamford, CT, estimates the data warehousing market will grow at a rate of 31% per year, reaching $5.5 billion by 2000.
The idea of combining a company's information in a central data bank, though not new, is rapidly becoming a competitive necessity. As the migration to a central system occurs, new IT needs emerge. The storage area network, or SAN, is an effective way of meeting these needs in a data warehousing environment.
Enterprise networks today are broadly heterogeneous, incorporating legacy systems (OS/390), open systems (primarily Windows NT and Unix), and proprietary midrange systems. These systems support different communication protocols and storage methods, resulting in "islands of information" because of the difficulty in sharing information.
Data warehousing is a method of consolidating data across platforms, in effect "bridging" islands of information. A data warehouse organizes disparate information from different disciplines within a company, helping decision makers to understand how their enterprise's products are faring against the competition's, to analyze market trends and buying habits, and to consolidate enterprise information for future research or sale.
Why use a data warehouse? In almost every large enterprise, mainframes collect huge amounts of data resulting from the organization's daily transactions. Data warehousing moves the data to open-systems platforms, where it is loaded into databases for subsequent analysis.
Data warehousing has six key requirements:
- Distance
- Heterogeneity (systems and protocols)
- Data integrity
- Performance
- Return on investment (ROI)
- Security
Data warehouses constantly stretch the limits of the local area networks (LANs) on which they reside. LANs generally can provide distance, heterogeneity, and data integrity, but their ability to adapt to the remaining three requirements (performance, ROI, and security) is limited. SANs, which provide a high-speed subnet--a network within a network--among heterogeneous storage resources and servers, can augment LANs to deal effectively with all six requirements.
A SAN does three important things. First, it externalizes storage. By attaching storage devices to the network rather than to a particular server bus, the SAN makes all storage accessible to all servers. Second, SANs centralize storage. All storage elements and all related storage I/O operations become the domain of the storage network. Third, a SAN clusters servers to maximize data sharing and efficiency. The major benefits of SANs in a data warehousing environment are tied to these three capabilities.
SANs add performance, ROI, and security. They boost performance because they can be constructed from local interconnects such as Fibre Channel, SCSI, ESCON, and HIPPI or from dedicated wide-area interconnects such as SONET, OC-3, OC-12, and DS-3/E3. This accommodates high capacity and split-second access by linking highly scalable hosts and storage servers with high-speed (and sometimes long-distance) networking. The ability to daisy-chain a variety of host storage devices across the campus--or across the country--promises more scalable, manageable, and highly available data storage.
In terms of ROI, SANs are intended to work with, or augment, primary networks (LANs and WANs)--not replace them. Off-loading storage operations to a SAN frees the primary network of the often debilitating contention that occurs between I/O traffic and other network traffic. The organization also gains the SAN's high-performance data access and movement capabilities plus the SAN's ability to use storage efficiently. The SAN allows users to continue to scale their networks with more storage capacity without reconfiguring their LANs or buying additional RAID servers (which offer storage bundled with sometimes unneeded CPUs).
Because SANs are built with channel protocols, information can be moved at greater speeds. When remote data movement is required, a SAN can be configured to use WAN technologies. This provides greater security than the Internet offers while still delivering high data movement capability. In addition, because remote access can lead to device consolidation, savings can be gained from the reduction of floor space, device duplication, and personnel requirements.
Since storage management is the domain of the SAN, handling storage applications is the SAN's greatest contribution to a data warehousing environment. SANs reduce the storage management hassles often encountered with data warehouses. The following applications benefit the most from SANs:
- Backup and restore
- Archiving and retrieval
- Data migration
- Data sharing
Each of these applications benefits greatly from a SAN's performance, security, and ROI characteristics, which LANs usually lack. SANs provide new levels of performance and flexibility in backup and restore, making it possible to back up data from different servers to the same automated tape library, for example. As the demands on data warehouses increase, backup and restore capabilities become more important and must be handled quickly and securely. High performance is the key to reducing recovery times, and SANs make it possible to better use available bandwidth.
As data warehouses grow, archiving data to less expensive, less immediately accessible storage has emerged as a way to manage data more cost-effectively. Archiving is typically a function of the age of the data and the need to access it. Archive networks are configured either within a company's multiple locations or in conjunction with a business recovery vendor. SAN solutions support effective and efficient archiving from different kinds of servers to the same storage system. A SAN can provide local- and wide-area connections and the necessary gateway and conversion functions.
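The notion that archiving depends on the age of the data and the need to access it can be sketched as a simple policy function. This is only an illustration: the thresholds (90 days, five reads in the last month), the tier names, and the DataSet record are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataSet:
    """Hypothetical description of one data set in the warehouse."""
    name: str
    last_modified: date
    reads_last_30_days: int


def choose_tier(ds: DataSet, today: date,
                max_age_days: int = 90, min_reads: int = 5) -> str:
    """Return 'online' or 'archive' for a data set (illustrative policy)."""
    age_days = (today - ds.last_modified).days
    # Keep data online if it is recent OR is still being read regularly.
    if age_days <= max_age_days or ds.reads_last_30_days >= min_reads:
        return "online"
    # Otherwise it is both old and rarely accessed: move it to cheaper storage.
    return "archive"
```

A real policy would likely weigh more factors, such as regulatory retention periods, but the two inputs shown are the ones named above.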
Data migration (moving data from one storage system or data center to another) in a data warehousing environment presents an IT organization with several issues: the available time window, the effect of the move on online resources, and the reliability of the data after the move. A SAN enables data movement while maintaining integrity and without affecting online (or LAN) resources.
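One common way to check the reliability of data after a move is to compare a checksum taken before the copy with one taken afterward. The sketch below assumes ordinary files and uses SHA-256; both choices are assumptions made for illustration, and a SAN product may verify integrity quite differently.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large files need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(src: Path, dst: Path) -> str:
    """Copy src to dst, then verify the copy; return the verified checksum."""
    before = sha256_of(src)
    shutil.copyfile(src, dst)
    after = sha256_of(dst)
    if before != after:
        # Remove the bad copy so it cannot be mistaken for good data.
        dst.unlink()
        raise IOError(f"checksum mismatch migrating {src} to {dst}")
    return after
```

The source is left in place, so a failed verification costs only time, not data.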
Beyond shared storage, the ultimate goal of data warehousing is data sharing--the extraction, movement, or loading of data among environments. There are multiple ways to share data today: network transfer (typically, TCP/IP), controller-based shared-storage transfer, and channel transfer (e.g., ESCON). SANs use channel transfer technologies to provide better performance than LANs without being locked to a particular storage vendor.
How to Implement a SAN
To realize the advantages SANs offer data warehouse environments, organizations need to consider several special requirements.
SANs need to be implemented as separate subnets, so the critical I/O traffic between server and storage is not blocked or delayed by other kinds of traffic. One way to achieve this separation is to provide different classes of service on a shared physical network. Fibre Channel has this capability, as does ATM. And emerging technologies like IPv6 will further enhance this prioritizing capability.
Attaching heterogeneous storage devices directly to a network requires a special kind of networking device that can support the classes of service that are established as well as channel and network protocols. These devices should also provide a fault-tolerant architecture, guaranteed data delivery and integrity, load leveling, data compression, and alternate path routing.
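Two of the features listed above, load leveling and alternate path routing, can be illustrated with a small sketch. It assumes the simplest possible scheme, round-robin selection that skips any link marked as down; the LinkGroup class and the link names are hypothetical, and real networking devices use more sophisticated methods.

```python
from itertools import cycle
from typing import Dict, List


class LinkGroup:
    """Round-robin load leveling across links, skipping links that are down."""

    def __init__(self, names: List[str]) -> None:
        self.up: Dict[str, bool] = {name: True for name in names}
        self._order = cycle(names)
        self._count = len(names)

    def mark_down(self, name: str) -> None:
        self.up[name] = False

    def mark_up(self, name: str) -> None:
        self.up[name] = True

    def next_link(self) -> str:
        """Return the next healthy link, or raise if every link is down."""
        # Look at each link at most once per call.
        for _ in range(self._count):
            name = next(self._order)
            if self.up[name]:
                return name
        raise RuntimeError("no links available")
```

When a link fails, traffic simply flows over the remaining links; when it is marked up again, it rejoins the rotation.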
Security needs to be addressed differently in a SAN environment because storage devices are not protected behind servers as they are in traditional architectures. Each installation has to be examined, and security has to be designed to fit the situation.
As data warehouses proliferate, IT organizations will face relentless demands for network up-time, data accessibility, and system management--all at a lower cost. SANs address these needs by taking advantage of today's network and channel technologies, providing an optimum match for data warehousing applications.
Building a Data Mart
The following example illustrates a SAN application in a relatively simple data warehousing environment. A large bank in a major metropolitan area wants to build a data mart to better understand how its customers use banking services. In the future, the bank will combine multiple data marts into a comprehensive data warehouse for the entire organization.
The intent of the first data mart is to share data between an OS/390 mainframe and an AIX system more than 100 miles away. The mainframe stores customer information on the activities of the bank's credit card division. The bank wants to use the data to determine the buying practices of customers, as a means for contacting customers and for future direct marketing efforts.
The networking requirement is to move data from a DB2 database on the mainframe to a DB2/6000 database data mart on an SP2. The volume of data is up to 25GB per day. Daily transactions to the database must not be interrupted, and a great deal of batch processing must be done at night. Therefore, there are only a few hours to move the data over 100 miles.
On the mainframe side, the hardware is configured to move data over an ESCON channel, spread it across four load-leveled T-1 links, and then convert it on the other side to SCSI, where it is attached to the SP2 (see figure). Network channel linking devices on either end of the T-1 lines make the necessary channel conversions.
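A rough, back-of-the-envelope estimate shows how the daily volume relates to the link capacity. The sketch uses the nominal T-1 line rate of 1.544 Mbit/s and treats 1GB as one billion bytes. It ignores framing and protocol overhead, and the compression ratio is a hypothetical parameter, since the article does not state one.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T-1 line rate


def transfer_hours(gigabytes: float, links: int = 4,
                   compression_ratio: float = 1.0) -> float:
    """Estimate hours needed to move a volume of data over parallel T-1 links.

    compression_ratio is an assumed figure (3.0 means 3:1); overhead is ignored.
    """
    bits_on_the_wire = gigabytes * 1e9 * 8 / compression_ratio
    seconds = bits_on_the_wire / (links * T1_BITS_PER_SECOND)
    return seconds / 3600


if __name__ == "__main__":
    print(f"uncompressed:    {transfer_hours(25):.1f} hours")
    print(f"3:1 compression: {transfer_hours(25, compression_ratio=3.0):.1f} hours")
```

Under these assumptions, 25GB sent uncompressed would need roughly nine hours on four T-1 links, which suggests that the data compression performed by the networking devices plays a large part in fitting the transfer into a nightly window.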
File movement software in the mainframe and SP2 extracts records from the DB2 database (the operational database) and buffers them in memory for movement. From the buffers, the records are transported across the network to the loading function on the destination system, avoiding intermediate disk-storage phases. The software mimics Unix Named Pipes technology to automate the whole process and obviate the need for flat files.
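The named-pipe idea can be shown on a single Unix machine: an extract step writes records into a pipe while a load step reads them as they arrive, so no flat file is ever written to disk. The sketch below is not the file movement software described above, which works across a network. The function names and sample records are hypothetical, and os.mkfifo is available only on Unix-like systems.

```python
import os
import tempfile
import threading
from typing import List


def extract(fifo_path: str, records: List[str]) -> None:
    """Producer: write records into the named pipe, one per line."""
    # Opening for write blocks until a reader opens the other end.
    with open(fifo_path, "w") as pipe:
        for record in records:
            pipe.write(record + "\n")


def load(fifo_path: str) -> List[str]:
    """Consumer: read records from the pipe until the writer closes it."""
    loaded = []
    with open(fifo_path) as pipe:
        for line in pipe:
            loaded.append(line.rstrip("\n"))
    return loaded


def run_transfer(records: List[str]) -> List[str]:
    """Stream records from extract() to load() through a named pipe."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "extract.pipe")
    os.mkfifo(fifo_path)  # creates the pipe; Unix only
    try:
        writer = threading.Thread(target=extract, args=(fifo_path, records))
        writer.start()
        loaded = load(fifo_path)
        writer.join()
        return loaded
    finally:
        os.remove(fifo_path)
        os.rmdir(workdir)
```

Calling run_transfer(["1001,VISA,42.50", "1002,VISA,17.25"]) returns the same two records, having passed them through the pipe rather than through an intermediate file.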
This solution offers the bank a long-haul data mart option for its data and, with it, the benefit of remote disaster recovery. In addition, the high transfer rate allows the bank to minimize the downtime of its operational database, a key business need and regulatory requirement.
Dino Balafas is senior product manager at Computer Network Technology (CNT), in Minneapolis, MN.