Despite all the marketing talk about “intelligence” in the storage network, we still have a ways to go. The truth is that most storage devices today are simply not as aware as they should be of applications, data access patterns, and workflows.
Established vendors have built general-purpose, block-based storage arrays capable of running a wide spectrum of workloads. However, these systems are not optimized for any particular workload and have no intelligence about the application, its data formats, or its access patterns. On the other end of the spectrum, especially over the past five years, there has been a trend toward more-specialized storage appliances. These systems combine application intelligence, or workload optimization functionality, with core storage capabilities to deliver tailored solutions for particular applications or business needs.
While NAS is probably the oldest example of specialized storage appliances replacing general-purpose computers, more recently content-addressed storage (CAS) has evolved into a specialized class of storage focused on the requirements of archival and compliance data. Also, with the growth in high performance computing (HPC) applications, vendors such as DataDirect Networks and Isilon have delivered storage systems optimized for specific I/O profiles, such as large-block, sequential I/O. As another example of the trend toward specialized storage appliances, a number of vendors, such as Teneros, are delivering appliances tailored for continuous availability in e-mail environments.
This trend toward specialized storage architectures and devices is occurring in the database space, too. In fact, several key drivers are transforming how large-scale databases (greater than 1TB) are stored, managed, and scaled. Five factors are leading to the emergence of a new class of database storage optimized for data warehousing and business intelligence workloads:
Users are facing a tsunami of structured data—Based on Taneja Group research, many end users’ databases, particularly data warehouses, are doubling in size every year. The primary driver for this growth in database size comes from the line of business. Business decision makers recognize the value of maintaining more historical data online longer for better analytics and decision-making purposes. A secondary driver fueling the size of databases is a tightening regulatory and compliance environment. The need to keep more data online longer exacerbates issues of database performance, scalability and management, and makes general-purpose storage approaches less attractive.
The need for speed—The need for more database performance is insatiable. Database and storage administrators are being asked to manage much larger databases and storage environments, while improving data loading times and query responses and delivering deeper data analytics. Unfortunately, the overall performance and response time of current RDBMSs degrade as the database size increases, particularly as databases grow beyond 1TB. Techniques such as database archiving allow IT to prune the size of a database to improve performance, but don’t necessarily allow that data to be kept online and fully “query-able.” IT faces huge challenges in coaxing sufficient I/O throughput and response times out of the underlying storage system to meet the requirements of large data warehouse implementations. Clearly, the overall throughput and response time of the underlying storage infrastructure directly affect what end users see in terms of response time.
Current database scalability approaches have significant drawbacks—Three architectural approaches to scaling database performance have emerged: buy a larger symmetric multi-processor (SMP) server to run the database, implement a clustered shared-disk database architecture such as Oracle Real Application Clusters (RAC), or deploy a massively parallel processing (MPP) architecture (e.g., Teradata). SMP systems are by far the most common deployment model for OLTP databases and small data warehouses or data marts, but a high-end SMP server can cost more than $1 million and cannot be modularly scaled on demand. Clustered databases offer the promise of near-linear scalability, but require laborious partitioning to reduce synchronization overhead and achieve optimum performance for data-intensive workloads. MPP systems that partition data and parallelize queries have emerged as the de facto approach for large-scale data warehouses. However, traditional MPP systems require constant tuning and repartitioning, and as a result ongoing OPEX costs can run into the tens of millions of dollars for a large-scale data warehouse. There is no silver-bullet approach that offers low acquisition cost, scalability, and ease of management.
OPEX costs mount for tuning and managing large databases—As the database size grows, the administrative overhead of managing a database grows exponentially along two dimensions—database management/tuning and storage management/tuning. The type of tuning and management required to maintain a large-scale database requires highly skilled professionals. As can be imagined, as the database grows, the amount a business must spend to maintain and grow it increases dramatically. The cost of administering a large-scale database does not grow linearly or in proportion to the database size; instead, the OPEX costs scale exponentially as the size of the database grows. OPEX costs can be the number one inhibitor to growing a very large database.
Databases and storage are becoming more intertwined—Increasingly, storage administrators must have a working knowledge of the database architecture, table layout, and how the database places data on disk to deliver the desired performance SLA. As a result, database vendors such as Oracle incorporate core storage features like automatic volume management into their database kernels as a way to more tightly couple storage with the database engine. A data warehouse appliance takes this convergence to its endpoint by collapsing database intelligence and moving it closer to the physical storage to minimize network roundtrips and gain performance. This convergence of storage design and host-level software is not unprecedented. File systems have evolved to the point where they are now considered extensions of the underlying storage infrastructure. Furthermore, NAS appliances subsume file systems as a key component of a NAS system. It is natural for databases and storage to become more tightly coupled as the need for optimum performance grows.
Data warehouse appliances
The Taneja Group has begun tracking how this historical trend toward specialized storage appliances is being applied to structured data. We have identified an emerging category of “data warehouse appliances” over the past three years. Although the term “data warehouse appliance” is recognized in DBA circles, the term has almost no meaning or mindshare within the storage community. However, data warehouse appliances have far-reaching implications regarding how structured data will be managed and how access to that data will be scaled in the future. Ultimately, we see data warehouse appliances morphing into a new class of storage in much the same way that NAS and CAS became new types of storage.
The origins of the term “data warehouse appliance” can be traced back to 2002 or 2003 when Foster Hinshaw, the founder of Netezza and now founder and CEO of Dataupia, coined the term. Essentially, a data warehouse appliance is a turnkey, fully integrated stack of CPU, memory, storage, operating system (OS), and RDBMS software that is purpose-built and optimized for data warehousing and business intelligence workloads. It uses massive parallelism such as MPP architectures to optimize query processing. Through its knowledge of SQL and relational data structures, a data warehouse appliance is architected to remove all the bottlenecks to data flow so that the only remaining limit is the disk speed. Through standard interfaces such as SQL and ODBC, it is fully compatible with existing business intelligence (BI) and packaged third-party applications, tools, and data.
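To make the MPP idea concrete, here is a minimal sketch of how a query can be parallelized over partitioned data. It is an illustration only, not any vendor's implementation: the function names, the four-node default, and the sample rows are all hypothetical, and threads stand in for the independent nodes an appliance would use. Rows are distributed by customer ID, each "node" scans only its own partition, and a coordinator merges the partial results.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, node_count):
    """Assign each (customer_id, amount) row to a node by key modulo node count."""
    nodes = [[] for _ in range(node_count)]
    for customer_id, amount in rows:
        nodes[customer_id % node_count].append((customer_id, amount))
    return nodes


def local_sum(node_rows):
    """Work done on one node: scan only the local rows and build partial totals."""
    totals = defaultdict(int)
    for customer_id, amount in node_rows:
        totals[customer_id] += amount
    return dict(totals)


def parallel_sum_by_customer(rows, node_count=4):
    """Coordinator: fan the scan out to every node, then merge the partial results."""
    nodes = partition_rows(rows, node_count)
    # Threads are a stand-in for separate nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(local_sum, nodes))
    merged = {}
    for partial in partials:
        # Data is partitioned on the grouping key, so keys never overlap across nodes.
        merged.update(partial)
    return merged


# Hypothetical sample data: (customer_id, sale_amount)
sales = [(1, 100), (2, 50), (1, 25), (7, 10), (2, 5)]
result = parallel_sum_by_customer(sales)
```

Because the data is partitioned on the same key the query groups by, no node needs rows held by another node. When a query groups or joins on a different key, rows have to be redistributed between nodes, which is one reason traditional MPP systems demand ongoing tuning and repartitioning.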
At its core, a data warehouse appliance simplifies the deployment, scaling, and management of the database and storage infrastructure. Ultimately, the vision of a data warehouse appliance is to provide a self-managing, self-tuning, plug-and-play database system that can be scaled out in a modular, cost-effective manner. To that end, data warehouse appliances are defined by four criteria:
- Workload optimized: A data warehouse appliance is optimized to deliver excellent performance for large-block reads, long table scans, complex queries, and other common activities in data warehousing;
- Extreme scalability: A data warehouse appliance is designed to scale and perform well on large data sets. In fact, the sweet spot for all data warehouse appliances on the market today is databases over 1TB in size;
- Highly reliable: A data warehouse appliance must be completely fault-tolerant and not be susceptible to a single point of failure; and
- Simplicity of operation: A data warehouse appliance must be simple to install, set up, configure, tune, and maintain. In fact, these appliances promise to eliminate or significantly minimize mundane tuning, data partitioning, and storage provisioning tasks.
A number of vendors are shipping data warehouse appliances. The original data warehouse appliances came from Netezza. However, since Netezza’s market entry, several other firms such as DATAllegro, Dataupia, and Kognitio have entered the market with variations on the original concept.
Although architectural approaches to data warehouse appliances vary widely, there are four main points for assessing different vendors’ approaches. First, does the data warehouse appliance replace existing database software with its own purpose-built kernel? Most of the data warehouse appliances replace traditional database kernels (e.g., Oracle, IBM DB2, and Microsoft SQL Server) with their own optimized database kernel. One exception is Dataupia. Unlike other data warehouse appliances, Dataupia’s software interoperates with, but does not replace, existing database systems.
Second, does the data warehouse appliance use low-cost industry-standard building blocks or customized ASICs and FPGAs to achieve higher levels of scalability and performance? Netezza, for example, uses custom ASICs and FPGAs to increase performance and scalability, while other vendors (DATAllegro, Dataupia, and Kognitio) use industry-standard building blocks in order to offer the best price/performance. The total cost and overall price/performance of the solution can be directly affected by the underlying components.
Third, does the data warehouse appliance make use of a highly parallelized design to gain greater scalability and performance? All vendors leverage some degree of parallelism to deliver the requisite performance and scalability. However, with any highly complex product, the devil is in the details. End users should scrutinize and understand the various architectural trade-offs and benefits of each approach and assess whether the trade-offs are well-suited to their database workload.
Fourth, what is the entry price of the solution, and can users scale storage capacity in increments that match how their data warehouse is growing? Data warehousing appliance vendors have widely divergent price points. Several solutions are priced from hundreds of thousands of dollars and can easily top out at several million dollars. Moreover, some solutions require users to purchase additional storage capacity in relatively large chunks (sometimes greater than 10TB). As a result, some appliances may be cost-prohibitive for smaller data warehousing deployments. (See table, below.)
Over the next few years, workload-optimized storage appliances, such as data warehouse appliances, will become key elements of the storage infrastructure in most data centers, much the same way that NAS and CAS became data-center staples. Data warehouse appliances represent another point in the historical trend toward more-specialized, workload-optimized storage systems. However, that is not to say that general-purpose storage devices will be replaced or rendered obsolete by these optimized appliances. Workload-optimized storage devices will carve out specific market niches where application-specific scaling, performance, and management requirements are unique and not easily met by general-purpose storage designs.
Large-scale data warehousing represents a significant headache for IT today. The continuing data tsunami, the need to keep structured data online longer, and the insatiable need for faster and more-responsive databases are driving users to consider new storage alternatives. Add to the mix that current database scaling technologies are either cost-prohibitive or too inflexible to meet the ever-increasing demands of the business. Specialized storage approaches, such as data warehouse appliances, offer a novel alternative that provides cost-effective scalability and simplified management of structured content. End users must recognize the new requirements of structured content and be willing to embrace new approaches to solve the problems of scaling and managing large-scale data warehouse implementations today and in the future.
Steve Norall is a senior analyst with the Taneja Group research and consulting firm (www.taneja.com).