What to Look for in Data Mart Storage
The keys to a successful data mart include data availability, storage performance, and scalability.
By William W. Reed
In the past, each functional department within a company collected its own data. Manufacturing was the keeper of parts and supplier data; marketing had all the information about recent trade shows and ad campaigns; and sales knew how much product was sold when and where. Although you could get useful data from your own department, you could not access that of other departments for a complete picture of your company's business.
Companies soon began to realize the value of gleaning data from many sources, across many departments. In retrospect, the benefits were obvious: competitive advantage and increased profitability. If all departments had access to a set of comprehensive consolidated data, each would work more effectively.
Data warehouses seemed to be the answer, but they solved one problem and introduced another. They enabled companies to store multiple terabytes of information consolidated from disparate sources, but implementation was both costly and slow (often taking two or more years at a cost of several million dollars), undermining any expected return on investment. And the databases became so large that they were cumbersome and ineffective. The data was there, but it was hard to get and hard to use.
Today, data marts--application-specific or single-subject data warehouses--are rapidly solving the problems created by such mega-warehouses. By focusing on particular applications or business problems, data marts are more effective than data warehouses and are more likely to result in a higher return on investment. Data marts provide the data access users need, but more quickly and at a significantly lower cost than data warehouses. Typically, it takes about 90 days and less than one million dollars to implement a data mart. Because of these advantages, data marts are the fastest-growing segment of the data-warehousing industry. According to The Meta Group, an industry research firm in Stamford, CT, data marts constitute 50% of all data warehouse solutions being implemented today.
But there is a cost. When departments build stand-alone data marts instead of a data warehouse, they no longer have access to, or share, a set of consolidated data. The distributed data-mart model addresses this problem. By building bite-size "chunks" of data and then linking multiple data marts together, it combines the advantages of the comprehensive, everything-included data warehouse with the focused targets of smaller, topic-specific databases.
Distributed Data-Mart Model
Early data marts were independent, stand-alone systems. Newer data marts take advantage of client/server computing to provide communication among data marts, so users have access to information across many topics. The fact that the data comes from various marts is transparent to the user.
A simple distributed data mart is shown on page 22. The top row shows where data is captured (e.g., information obtained when a customer buys a retail product). This data is then cleaned, transformed, and stored in the relevant data mart (e.g., the sales data mart). Users can then download data of interest through the server to the workstation (e.g., manufacturing identifies fluctuations in sales and adjusts production accordingly). The distributed data-mart model allows fast access to each data mart and the sharing of information across data marts.
Note that there are potential bottlenecks even in the distributed system. To work, stored data must be available whenever it is needed; the bandwidth of the system must be broad enough to allow rapid and frequent downloads of updated information; and the storage system must be able to accommodate ever-increasing amounts of captured data and more users without affecting performance.
Storage for Data Marts
The average cost of implementing a single data mart is less than one million dollars--of which about one third is hardware costs and half of that, storage costs. Selecting the correct storage element is crucial in creating an effective, efficient, and useful data mart. The key criteria are availability, performance, and scalability.
- Availability. First and foremost, the data must be available whenever it is needed. System downtime directly translates into lost time and productivity. Not only is downtime costly, but it also makes the data mart less effective, which lessens the data mart's credibility as a useful tool. Whether the data is being accessed by sales teams working in different time zones or by a production crew trying to identify variations in a new process, the storage system must be available 24 hours a day, every day of the week, every week of the year.
- Performance. Performance is critical, not just at the initial rollout of a data mart but as more and more simultaneous users access the system and as data volumes increase. Performance affects two distinct areas. First, users must have a quick response to any and all queries. Whether the user is analyzing straightforward information in a spreadsheet or performing sophisticated data mining and analysis, the system must support an exploratory, nonlinear questioning approach. The results of an initial inquiry, for example, may trigger several questions or ideas that users need to explore immediately, and they need quick responses to these new queries so they can follow through on new lines of thought.
Bandwidth must also be sufficient to allow rapid and frequent downloads of updated information. Data capture is ongoing, and the best analysis comes from current information--last night's information may not be current enough. If the system can't handle the necessary downloads, uploads, or batch processes fast enough, users will be tempted to use old data. Sufficient bandwidth is also needed for downloading data during system backups.
- Scalability. Data marts are growing rapidly because the amount of data (and the need to analyze and understand the data) is growing rapidly. According to some market research reports, the capacity growth rate of data marts exceeds 55% per year. In general, users do not want to delete old data or relegate it to difficult-to-access archives. Storage systems must be scalable so that a company's initial investment in data-mart hardware and software is protected as storage capacity needs increase.
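To see what a 55% annual growth rate implies for storage planning, the short Python sketch below compounds an initial capacity over several years. The 100GB starting size and the five-year horizon are illustrative assumptions, not figures from the research cited above, and the function name is hypothetical.

```python
def projected_capacity_gb(initial_gb: float, annual_growth: float, years: int) -> list[float]:
    """Return the capacity needed at year 0 through `years`, compounding annually."""
    return [initial_gb * (1 + annual_growth) ** year for year in range(years + 1)]


if __name__ == "__main__":
    # Assumed starting point: a 100GB data mart growing 55% per year.
    for year, capacity in enumerate(projected_capacity_gb(100.0, 0.55, 5)):
        print(f"year {year}: {capacity:,.0f}GB")
```

Under these assumptions, capacity needs grow roughly ninefold in five years, which is why a storage system that cannot scale in place may have to be replaced well before the end of its useful life.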
Disk Arrays for Data Marts
Disk-array storage devices must meet all three criteria for data marts: 100% availability, high performance, and scalability.
- Availability. Disk arrays must be reliable and resilient. They should be built from standard modular components with customer-replaceable and hot-swappable disks, power supplies, fan modules, and controllers. Full redundancy with automatic failover is a must.
- Performance. Data marts use larger page sizes than on-line transaction processing (OLTP), and rather than following a set of stepped questions, queries tend to be random. And, of course, when more people use the system, it can bog down. With well-designed disk arrays, multi-threaded I/O operations and increased read/write cache can help boost performance without having to retool. Adding a second active RAID controller or cache can increase performance as the user count increases (see figure 2).
For maximum RAID performance, look for an array with a dedicated CPU, which offloads processing from the host CPU so that the host does not become bogged down with storage controller and RAID tasks.
Also, look for arrays with Ultra SCSI and Fibre Channel interface options so you can match the bandwidth required with the appropriate interconnect technology or upgrade from Ultra SCSI to Fibre Channel array controllers when a bandwidth boost is needed.
- Scalability. Data marts need to be scalable in terms of capacity and performance. They should scale quickly and easily, while protecting a company's initial investment.
Data mart hardware costs can be as high as $300,000, up to half of which goes toward the storage system itself. Choosing the best storage solution is crucial to building an effective, efficient, and useful data mart--one that will continue to be useful as system demands increase.
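The budget arithmetic behind these figures can be written out as a brief Python sketch. It treats the numbers quoted in this article as upper bounds (a total under one million dollars, about one third for hardware, up to half of hardware for storage); the function name and the exact shares passed in are assumptions for illustration only.

```python
def storage_budget(total_cost: float, hardware_share: float, storage_share: float) -> float:
    """Return the storage portion of a data-mart budget.

    hardware_share is the fraction of the total spent on hardware;
    storage_share is the fraction of the hardware spend that goes to storage.
    """
    return total_cost * hardware_share * storage_share


if __name__ == "__main__":
    # Upper bound from the overall figures: $1 million, one third hardware, half of that storage.
    print(f"${storage_budget(1_000_000, 1 / 3, 0.5):,.0f}")
    # Upper bound from the hardware figure: $300,000 hardware, up to half for storage.
    print(f"${storage_budget(300_000, 1.0, 0.5):,.0f}")
```

Either way, the storage system accounts for something on the order of $150,000 or more, a large enough share to justify careful selection.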
Enhanced Performance for Data Marts
- Multi-threaded I/O operations
- Large read/write cache
- Fast/Wide UltraSCSI host interface: 40MBps
- Fibre Channel interface: 100MBps
- 7,200rpm or 10,000rpm disk drives with up to 50 spindles per controller
- 200MBps drive channel capacity
- Internal 133MBps PCI bus architecture
- Dual-active controller option
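As a rough illustration of what the interface ratings above mean in practice, the Python sketch below estimates how long a data transfer would take at each rated speed. It assumes the rated figure is sustained for the whole transfer and that 1GB equals 1,000MB; real throughput will be lower because of protocol overhead and disk limits. The 10GB transfer size and the function name are hypothetical.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Estimate transfer time in minutes at a sustained rate (1GB = 1,000MB assumed)."""
    seconds = (size_gb * 1000.0) / rate_mb_per_s
    return seconds / 60.0


if __name__ == "__main__":
    # Rated peak speeds taken from the list above; the 10GB load is an assumed example.
    interfaces = {"Fast/Wide UltraSCSI": 40.0, "Fibre Channel": 100.0}
    for name, rate in interfaces.items():
        print(f"{name}: {transfer_minutes(10.0, rate):.1f} minutes for 10GB")
```

At rated speeds, the same load moves in well under half the time over Fibre Channel, which is the kind of comparison to make when deciding whether an interface upgrade is warranted.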
The distributed data-mart model takes advantage of client/server computing to provide communication among data marts so users can access information across many topics.
Figure 2: Adding a second active RAID controller or additional cache can increase array performance.
William W. Reed is managing director of Symbios' MetaStor business unit in Wichita, KS.