How to Manage Terabyte Data Warehouses
Managing large warehouses starts with RAID and disaster recovery decisions and may end with a SAN implementation.
By Dave Coombs
If your data warehouse is a terabyte or larger, then you're wrestling with more than just the sheer volume of data to be stored. Data at this scale almost invariably means the warehouse has attained mission-critical status, so performance and application availability are critical. If the data is important enough to store in this quantity, users need assurance that they can access it quickly, even in the event of a system failure.
Implementing a terabyte storage system with the requisite performance, availability, and recoverability requires planning and preparation--not simply slapping a lot of disk arrays together. It requires a storage infrastructure, starting with a foundation of RAID. Along the way, you will have to make decisions about the configuration of the RAID storage, backup processes, disaster recovery, storage management, and distributed deployment. And you'll have to balance issues of cost, performance, and manageability.
To lay the groundwork for availability and recoverability, multi-terabyte data warehouses should be built on a RAID infrastructure. Typically, organizations use either mirroring (RAID-1) or parity RAID (RAID-3 or RAID-5) for very large databases. The choice comes down to striking a balance between cost and availability. Disk mirroring delivers higher availability than parity RAID, but it requires double the raw disk capacity--a full second copy of every byte--making it more costly than parity options.
Parity RAID subsystems combine disk striping with parity calculations to enable storage systems to recover data in the event of a drive failure. The data availability of parity RAID is slightly less than that of disk mirroring; however, parity RAID requires only 20% to 25% more storage capacity than a non-RAID system, making it a less costly option than mirroring. For all but the most mission-critical systems, trading a little less availability for more capacity or lower cost is often a viable option.
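The capacity trade-off between mirroring and parity RAID can be made concrete with a little arithmetic. The sketch below is purely illustrative; the stripe-group widths are hypothetical examples, not figures from any particular vendor:

```python
# Illustrative comparison of raw-capacity overhead for mirroring vs. parity
# RAID. Stripe-group widths below are hypothetical, not vendor figures.

def raid1_raw_capacity(usable_gb):
    """Mirroring stores a full second copy of the data: 100% overhead."""
    return usable_gb * 2

def raid5_raw_capacity(usable_gb, data_disks_per_group):
    """Parity RAID adds one parity disk's worth of capacity per stripe group."""
    overhead = 1 / data_disks_per_group
    return usable_gb * (1 + overhead)

usable = 1000  # a warehouse with 1TB (1,000GB) of usable data

print(raid1_raw_capacity(usable))      # 2000 GB raw -- double the disks
print(raid5_raw_capacity(usable, 4))   # 1250 GB raw -- 25% extra (4+1 groups)
print(raid5_raw_capacity(usable, 5))   # 1200 GB raw -- 20% extra (5+1 groups)
```

The 20% to 25% figures in the text correspond to stripe groups of five or four data disks plus one parity disk, respectively.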
Another key consideration is whether to use one giant disk array or multiple small arrays. Usually organizations configure terabytes of storage out of multiple medium-sized (100GB to 300GB) storage arrays, each served by its own controller, rather than one large array served by a single controller. The decision to deploy a multi-controller model often comes down to a question of performance.
A database that is spread across multiple controllers allows for a higher degree of parallelism, better load balancing, and better performance. The multiple-controller strategy is also scalable, which in turn increases ROI. As disks are added to an array controller, performance increases linearly until a point of diminishing returns, eventually leveling off. While an individual medium-sized array typically bottlenecks at a lower performance level than a large single-controller array, lower costs and higher parallelism enable a combination of multiple medium-sized arrays to exceed the performance and scale of a single large array. By adding controllers that balance performance, connectivity, and cost as capacity grows, you can scale to multi-terabyte size while achieving the greatest cost/performance benefits.
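A toy throughput model illustrates why several medium arrays can outrun one large array. All of the numbers below (per-disk throughput, controller ceilings, disk counts) are assumptions invented for the sketch:

```python
# Toy model: one large single-controller array vs. several medium arrays,
# each with its own controller. All throughput figures are hypothetical.

def array_throughput(disks, per_disk_mbps, controller_cap_mbps):
    """Aggregate throughput grows with disk count until the controller saturates."""
    return min(disks * per_disk_mbps, controller_cap_mbps)

PER_DISK = 10  # MB/s per disk (assumed)

# One large array: 60 disks behind a single controller capped at 400 MB/s.
single = array_throughput(60, PER_DISK, controller_cap_mbps=400)

# Four medium arrays: 15 disks each, each controller capped at 150 MB/s.
multi = 4 * array_throughput(15, PER_DISK, controller_cap_mbps=150)

print(single)  # 400 -- bottlenecked by the one controller
print(multi)   # 600 -- parallel controllers keep scaling
```

The single controller saturates while the multi-controller configuration continues to scale with added arrays.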
Once an organization assembles the basic RAID storage infrastructure, it must then consider its disaster recovery strategy. Since terabyte data stores generally serve mission-critical functions, business continuance and disaster tolerance are critical factors.
Multiple disaster recovery options are available, including host standby and extended clustering techniques. The most commonly deployed disaster recovery technique, however, is a strategy that stores copies of data in an off-site location.
RAID recoverability by itself addresses individual disk drive or component failures within an array. However, RAID does not protect against problems such as fires or floods, or regional disasters such as earthquakes. Despite the recoverability of RAID, it is no substitute for a comprehensive backup and recovery strategy.
There are two dimensions to the backup and recovery problem: the time it takes to back up a given volume of data and the time it takes to restore it. As databases grow to the terabyte level, restoration time becomes as great a concern as total backup time, or greater.
The combination of faster tape drives and multiple tape subsystems running in parallel enables organizations to back up and restore systems much more quickly. Backup procedures based on daily incremental backups in a sequential rotation, or procedures that back up only changed data, speed the process by eliminating the need to fully back up a terabyte or more of data every day.
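Some back-of-the-envelope arithmetic shows why parallel drives and incremental backups matter. The drive speed, drive counts, and 5% daily change rate below are hypothetical assumptions, not figures for any particular tape product:

```python
# Backup-window arithmetic. Drive speed, drive counts, and change rate
# are hypothetical assumptions for illustration.

def backup_hours(data_gb, drive_mbps, num_drives):
    """Hours to stream data_gb across num_drives tape drives in parallel,
    assuming perfect parallelism and no compression."""
    total_mb = data_gb * 1024
    seconds = total_mb / (drive_mbps * num_drives)
    return seconds / 3600

# Full 1TB backup on one 5 MB/s drive vs. eight drives in parallel,
# vs. an incremental pass over the ~5% of data that changed.
print(round(backup_hours(1024, 5, 1), 1))  # ~58.3 hours -- hopeless
print(round(backup_hours(1024, 5, 8), 1))  # ~7.3 hours -- fits a window
print(round(backup_hours(51, 5, 8), 2))    # ~0.36 hours for the incremental
```

Parallelism attacks total streaming time; incremental strategies attack the volume that must be streamed at all.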
Compounding the problem is the around-the-clock (24x7) nature of today's business environment, which demands continuous application availability. Storage managers used to complain about the shrinking overnight window for backing up critical business systems: 12 hours to 8 hours to 6 hours. At the speed data could be copied to backup devices, the job could not be completed in the allotted time.
Now, storage managers are faced with backing up terabytes of data. With the advent of the Internet and electronic commerce, the overnight backup window has disappeared. Systems need to be up around the clock, which prevents businesses from taking their systems off-line for backup.
Clones and Snapshots
In these situations, organizations have adopted a clone approach. This approach uses mirroring to continuously copy information to a clone or several clones of the storage system. A cloned copy can then be backed up in the conventional way without affecting the primary storage system, which continues to operate in its normal fashion. While cloning greatly increases the amount of disk capacity required, steadily falling storage prices minimize associated expenses.
"Snapshot backup" is another increasingly popular approach. Snapshot backup involves making an immediate copy of the metadata for a particular disk, which creates a virtual image of the disk. With snapshot techniques, database updates are cached until the backup is completed. This allows normal operations to continue during the backup procedure. Snapshot backups can be fast and inexpensive for systems that cannot be taken off-line for backup purposes.
Fibre Channel and SANs
Distributing and managing storage today is difficult due to the lack of a high-speed storage interconnect. Fibre Channel, however, promises to change the way organizations deploy storage, enabling them to logically consolidate and manage both the glass house data warehouse and distributed server-based data marts. With Fibre Channel, organizations will be able to effectively deploy storage area networks (SANs)--dedicated storage subnets that can support many storage arrays spread across a campus or small metropolitan area.
The Fibre Channel/SAN combination will effectively centralize the management of stored data--in essence, "recentralizing" physically distributed storage. In conjunction with cloning and snapshot backup techniques, it will also enhance the recovery of data and ensure business continuance.
Today, data warehouse storage infrastructures that support terabytes of mission-critical data are becoming common among large companies. The largest business data warehouses are hitting 30TB or more. Add to this the terabytes of storage attached to data marts, departmental and workgroup servers, and the terabytes more embedded in desktop systems, and organizations already have massive amounts of storage capacity. The challenge will be how to scale and manage this volume of storage to ensure that critical data is always available and that businesses can tolerate disruptions.
With the cost per megabyte of disk storage falling 60% per year, the problem isn't having enough capacity to throw at the task, but creating a data warehouse storage infrastructure and organizing terabytes of data in the most effective way to ensure the information is always available.
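A 60% annual decline compounds quickly: the same budget buys roughly two and a half times more capacity each year. The starting price below is a hypothetical figure chosen only to show the compounding:

```python
# Compounding a 60%/year decline in cost per megabyte. The $0.10/MB
# starting price is a hypothetical assumption for illustration.

def cost_after_years(cost_per_mb, annual_decline, years):
    """Price after compounding the annual percentage decline."""
    return cost_per_mb * (1 - annual_decline) ** years

start = 0.10  # hypothetical $/MB starting point
for year in range(4):
    print(year, round(cost_after_years(start, 0.60, year), 4))
# each year costs 40% of the year before: 0.1, 0.04, 0.016, 0.0064
```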
The result? More effective decisions, at a lower cost, based on efficient use of all of an organization's data.
Dave Coombs is vice president, storage sales and marketing, at Digital Equipment Corp., in Maynard, MA.