Storage Strategies for Data Warehouses
Scalability may be the key to a solid data warehouse storage foundation, but there are a number of other factors to consider.
By John Haystead
In marketing, information is power, and in business, power means profit. One important marketing and business tool, data warehouses, has the potential to provide companies with plenty of both. With the ability to zero in on individual product prospects, to segment customers by product type, and to provide detailed databases for "micro-marketing," trend analysis, fraud detection, and profitability forecasting, data warehouses play an essential role in the global marketplace. Despite their promise, however, many first-generation data warehouses and their smaller cousins--data marts and data malls--have not lived up to their builders` expectations.
Before they can provide powerful information, data warehouses must be equipped with powerful information management and storage tools. And, more importantly, each element of the total data warehouse solution (i.e., the CPU, software, connectivity, networking and, above all, storage) must interoperate efficiently with and maximize each subsystem`s capabilities--something many systems fail to do. Says Carson Schmidt, director of massively parallel processing engineering at NCR Corp., "We see a lot of commodity players trying to play in a specialized market; products, which when combined, result in data warehouse systems and architectures that ultimately can`t meet the user`s need." As a result, a new level of attention is being paid to the specific needs of data warehouse applications.
Summed up in one word, the single greatest factor distinguishing data warehouse storage requirements from other applications is scalability. Although many of the functional requirements of a data warehouse are similar to those of other high-end storage applications such as on-line transaction processing (OLTP), the gigabytes of new data accumulated on a daily basis and the growth rate of data warehouse storage requirements are unparalleled. Warehouses with hundreds of gigabytes are already common, and terabyte-size systems are rapidly catching up. To put the growth rate in perspective, "drive capacities are increasing at about 50% per year, yet data warehouse users are adding new storage capacity even faster," notes Schmidt.
The trend toward greater and greater storage capacity will continue. One major data warehouse user--Sears, Roebuck and Co--already has a sales and inventory data warehouse with more than 14TB of storage. Using EMC`s Symmetrix enterprise storage systems connected to a 48-node NCR WorldMark 5100 Scalable Data Warehouse (SDW) running NCR`s Teradata software, the system supports 2,700 managers, buyers, and other users with daily updates on sales of thousands of products in Sears` 2,000+ stores. The company`s data warehouse storage requirements have doubled or tripled every two to three years since the systems were first implemented in 1994, according to Ken DeWitt, Sears` vice president of corporate information systems.
Thinking Too Small
According to Peter Hodge, EMC`s data warehouse market requirements manager, one common mistake made by many first-time data warehouse implementers is underestimating their initial storage requirements and growth rates. "Their systems fail because they haven`t adequately planned for growth and can`t meet their users` demand for data or accessibility when they roll out the system."
"Ultimately," says Hodge, "until the system is up and running, you don`t really know what your requirements are going to be, so you`re much better off to start with a readily-scalable enterprise storage system than to try to do so after the fact."
To illustrate the point, Hodge describes one company`s recent planning session. Initially, the company projected 12 months of detailed data, but later agreed that adequate trend analysis would actually require 27 months of data. Then, concerns for data availability were raised, and the company decided to mirror all data, effectively doubling the anticipated storage requirement yet again. According to Wayne Eckerson, senior consultant at the Patricia Seybold Group, a consulting firm in Boston, "As a general rule of thumb, companies should purchase three to four times more storage than their most optimistic initial projections."
Many users recognize the importance of providing for linear scalability in their initial storage-system model. "While [users] may initially want to start small with an entry-level capability of 100GB or less," says Anton Murphy, manager of worldwide alliances at Storage Computer, "they recognize they will eventually want to grow the system to hundreds of gigabytes or even terabytes." Many users build their corporate data warehouses by starting with and later combining departmental-size data marts.
Data warehouse systems must scale not only in capacity but also in performance. This is a drawback for many existing storage solutions because performance scalability involves trading off the four "R`s" of storage--writing, reading, recalculating, and reliability. Many storage systems are optimized for some, but not all, of these elements.
RAID configuration is one case where tradeoffs must be made. For a number of reasons, RAID parity vs. mirroring is still an issue in data warehouse architecture design. For example, a key characteristic distinguishing data warehouses from OLTP is the length of time that data is kept. In contrast to the 60- to 90-day turnaround of an OLTP system, data warehouses typically contain 5-to 10-years` worth of data. Ed McCarren, manager of cross-industry storage solutions marketing at Digital Equipment, says this makes data integrity and security a crucial concern.
Another important consideration when selecting the appropriate RAID level is system up-time. While up-time has always been a business-critical requirement for OLTP systems, mandating some level of RAID protection and often requiring full redundancy, it is now a growing consideration for data warehouse applications. "While system availability may not have been a critical consideration for data warehouse applications 10 years ago, it definitely is today," says NCR`s Schmidt. This fact, notes Schmidt, is reflected in increasingly demanding requirements for drive reliability. "We now require a mean time between failure of at least 300,000 hours."
But up-time and data availability must be weighed against other important factors such as ease of access, cost, and system architecture. OLTP systems are generally optimized for random access and update of both indices and data. As a result, they can be easily optimized for parity-RAID configurations such as RAID 0+1 and RAID-5, both of which are cost-effective and offer high random-write performance.
Data warehouses, on the other hand, are usually constructed to service both random and sequential I/O write-access patterns. Random access is used for indices, while the data itself--such as table scans--usually involves sequential access.
In fact, data warehouse servers are often sequentially written to from other transaction servers. In these environments, there can be a significant write penalty associated with RAID-5, with many implementers opting instead for the bandwidth benefits of data mirroring (RAID-1). Despite the costs associated with using twice as many disks, Schmidt says RAID-1 can still offer the best price/performance in many applications.
Big Questions, Big Data
Getting data into the warehouse is only half the job. End-users` key criterion is query performance.
While the use of large caches and drive-based parity generators can greatly reduce parity-RAID write penalties, because data warehouse applications are analysis- oriented (vs. transaction-oriented), they are extremely I/O-intensive at the query level. In complex queries, every disk in a system may be accessed. "These bandwidth-intensive queries require Mbps data rates, as opposed to the request-per-second I/O-intensive environment of OLTP systems," says McCarren.
"Ultimately," says Jim Dietz, NCR`s director of scalable data warehouse product management, "data warehouse systems will tend toward RAID-1. Doing parity groups across a terabyte of storage with RAID-5 usually doesn`t make sense, especially as per-megabyte storage costs continue to drop dramatically." Dietz predicts the shift to RAID-1 will begin in earnest with the availability of 18GB to 36GB drives. While EMC`s Hodge believes RAID-5 (which EMC refers to as RAID-S) is still appropriate for many data warehouse applications, he also agrees that it has its limitations. "If the user is really concerned about availability, we typically recommend they go with RAID-1."
In all data warehouse environments, capacity and performance are directly related. For example, warehouse servers do not usually sort through and calculate responses to large complex queries from scratch. Instead, they use on-line analytical processing, which pre-calculates and stores the answers to questions for later retrieval and compilation. Although the availability of this pre-calculated multidimensional data improves performance, it also quickly increases the amount of storage space needed.
Likewise, while data mirroring enables more users to simultaneously access the same data without slowing response times, it usually means a doubling or tripling of storage capacity.
Ultimately, says McCarren, storage resources need to be tailored to meet different I/O requirements. Digital offers two versions of its StorageWorks RAID Array 10000: one optimized for transaction processing and the other geared toward high-bandwidth, decision-support applications like data warehouses. The systems support from 200GB to multiple-terabyte environments with disk mirroring, Digital`s Adaptive Parity RAID (3/5), and intermixed arrays.
Similarly, Artecon` SuperFlex 5500 dual-active RAID controller system for Windows NT supports mixed RAID levels (0, 1, 0+1, 3, and 5). The system incorporates from 32MB to 128MB of cache per controller with UltraSCSI connectivity to both the host and drives to allow for scalable performance and load balancing among several logical drives.
Off-loading functionality and management tasks from the network or host CPU through storage-based software tools can greatly enhance performance. "In a sense," says EMC`s Hodge, "you`re buying valuable bandwidth and CPU time that can be applied to other important operations. Users want to be able to move large quantities of data without impacting the rest of their environment."
Likewise, users want to be able to address and assign storage resources to multiple hosts according to their business needs. It may not be ideal for an off-line decision-support warehouse system to access an operational database, but users may want to replicate storage areas within a system or from one system to another. This would allow, for example, a production database running on one server and a data warehouse on another to access the same storage platform. Several vendors, NCR`s Dietz adds, are focusing on disaster recovery and reload software, growing requirements in data warehouse storage systems.
Some examples of storage-based management tools include EMC`s TimeFinder and Symmetrix Multihost Transfer Facility (SMTF). TimeFinder copies active production volumes in background mode, allowing simultaneous tasks to run in parallel, while SMTF off-loads bulk file transfers between hosts over the network. Similarly, Storage Computer`s OmniForce real-time mirroring software independently replicates data for multiple-attached host environments.
Other tools, such as those from Hitachi Data Systems (HDS), allow client/server storage to be administered and controlled from a central console. SNMP facilitates the configuration and management of multiple networked systems and applications from a single point of control, while configuration management software provides disk-storage analysis and monitors multi-vendor environments for component-level failure.
Show Me the Data
Efficient database management is also a critical component of an optimized data warehouse solution. In fact, according to Artecon`s Naegel, "if there`s a current limitation on the size of a workable data warehouse, it`s the ability of database software to efficiently manage the data."
Says Dan Harrington, NCR`s vice president of data warehouse marketing, "A growing number of customers are telling us that their existing databases are hitting performance ceilings, which is limiting the value of their data warehouses." Harrington says problems range from restrictions placed on the number of users allowed to access the warehouse, stipulations on the type of data that can be used for analysis, and limitations on the amount of data that can be retrieved.
A recent study conducted by HDS revealed that companies often restrict power users (those requiring extremely complex query information) from accessing the data warehouse during the day. One case in point was a DB2 implementation for which 95% of the warehouse`s CPU and I/O resources were being consumed by only 5% of the queries. Says Jerry Shattner, president of HDS Canada, "Clearly, we need to provide ways to deliver these complex results to power users without impacting the general user population."
As for database software, it must work well with the host platform and overall storage system architecture. Says Artecon` Naegel, "This means that both hardware and software must be configured to take advantage of the type of data you will be accessing." For example, table-intensive databases are best optimized with access patterns that make maximum use of parallel hardware. Since requests are simultaneously made to a number of database servers, with results collated at workstations, the database software should have high parallel-table-scan performance vs. random-access-search of indices for particular data.
Motivated by the results of its user survey, HDS has developed the Nucleus Series database system, which the company claims reduces query times by as much as 99%, enabling users to search large databases in seconds or minutes instead of hours or days. Based on Sand Technology Systems` Nucleus database technology, the Nucleus Series package includes the database system, the HDS 5700 family of storage servers, and a 64-bit SMP server with up to 12 parallel processors.
Core technologies include a bit-compression storage mechanism combined with mathematical formulas designed to rapidly search the compressed, bit-encoded data. Unlike other relational database architectures, which store tabular information as literal records, the Nucleus Series stores tables using a columnar organization.
As users and developers decide how to best wire data warehouse systems together, they must choose between --or mix and match--SCSI and Fibre Channel for both host-to-target and "in-the-box" connectivity. So far, Fibre Channel has not made major inroads into the data warehousing arena. One reason, according to EMC`s Hodge, is simply that fully capable data warehouse systems can be implemented today without Fibre Channel. However, he predicts that as the technology matures, "it will be a powerful tool that users should be ready to implement."
Digital`s McCarren contends that UltraSCSI is still adequate for most data warehouses, but that Fibre Channel will be an important component in evolving data warehouse environments, particularly in data mining applications." Looking ahead, NCR`s Schmidt expects "switched versions will be of considerable interest."
Fibre Channel may be particularly useful for data mirroring applications at remote locations, for which its superior connectivity distances become a factor. For example, Fibre Channel will be a key enabling technology for storage-area networks.
In the real world, data warehouses are not discreet entities, but integral parts of a company`s overall IT infrastructure, often requiring the integration of several different host platforms and networks. Therefore, users need storage solutions that will interoperate with their current environments.
Many businesses are moving toward a tiered data warehouse/mart strategy. "This requires both efficient, centralized management and an acceptable cost structure, regardless of whether databases are distributed throughout departments or geographic areas," says Digital`s McCarren. Sears Roebuck, for example, is looking at a tiered approach. In addition to its sales and marketing system, the company has a customer analysis and credit data warehouse, and plans call for additional financial data and maintenance agreement warehouses.
Although Sears` DeWitt believes a totally integrated data warehouse would be too large to support, he does expect to share information across the databases as needed. Accordingly, he says, "it`s essential that we have a shared storage-management architecture. When you start having larger and larger repositories of information, as well as legacy client-server operating systems, you can`t continue to pass data from one to another."
DeWitt foresees an architecture that stores detailed data in a common data warehouse, which serves as a data source for individual data marts. The data marts will in turn store summary data for analysis. DeWitt expects such a system to pose many challenges in addition to cost and implementation time, including dealing with data redundancy, data definition, and data cleansing. "Ultimately, in terms of access and data sharing, you need to choose whether direct access or a data replication or summarization strategy will be more efficient. You then need to decide whether you`re going to have physical or logical connectivity and whether this will be via local or wide-area networks."
As users look ahead toward combining departmental data marts into company-wide data warehouses, they must also pay close attention to the initial design of their data structures. As noted by Naegel, "If the metadata isn`t exactly the same between databases, there will be significant additional work necessary to access and compare data sets. This is where the real challenges and cost factors lie."
SSD Boosts Data Warehouse Access
By storing database information in high-performance DRAM memory, solid-state disk (SSD) technology can provide near-instantaneous, multi-stream information access to data warehouses. SSD access times range from 40 to 100 microseconds, compared to 10 to 15 milliseconds for a hard-disk drive.
"The advantage of SSD for data warehouse applications is that it provides a system-memory-class performance device in the storage pipeline--one that will become increasingly important as network speeds continue to approach internal system speeds," says Tom Fisher, product line business development manager for Quantum Corp.`s SSD Group.
Although storing a complete data warehouse database in DRAM would be prohibitively expensive, SSD can serve as a fast and efficient media for storing indexes and lookup tables, boosting overall database transaction rates by as much as 5 to 10 times.
Says Fisher, "There`s a growing trend toward highly interconnected databases, which allow searches by just about any field or classification. With anywhere from 20 to 100 times more indices to database files, access to these databases can be dramatically improved by SSD, which removes hard-drive search-and-sort latencies."
Frequently accessed database records can also be stored on SSD, further increasing overall system transaction rates by eliminating numerous transfers and re-transfers of the same information.
Although SSD prices continue to come down, per megabyte SSD is still significantly more expensive than magnetic disk memory ($25 to $50 per MB for SSD vs. $.05 to $.10 per MB for hard disks). However, Dennis Waid, president of Peripheral Research, a consulting firm in Santa Barbara, CA, points out price is not the deciding factor in many applications. "There should be a cost-performance budget analysis that compares the performance factors with the total system cost," he says. --JH
John Haystead is a freelance writer in Hollis, NH, and a regular contributor to InfoStor.