Capacity optimization has long been a tale of secondary storage, where large amounts of duplicate data – often the product of disk-to-disk backup – and less demanding I/O patterns rule the roost. The solutions that have risen to the challenges in this domain have been nothing short of impressive, and through innovative combinations of technologies – file single instancing, fixed- and variable-length sub-file deduplication, compression, and more – solutions in this space have demonstrated they can sometimes drive stored data down to 1/20th the size of its original footprint, or better. With such compelling promises of capacity optimization, the market for these solutions has grown beyond $2 billion.
For the IT manager struggling to support nearly out-of-control data growth, these technologies surely look useful elsewhere (i.e., beyond secondary storage). And the most obvious candidate is the real root of all of that data – primary storage. But for the IT manager advancing such an interest beyond fanciful thinking, more often than not, experimentation ends with a “whoops.” Unlike most secondary storage environments, primary storage tends to be extremely performance sensitive, and does not consist of large amounts of nearly identical data that can immediately benefit from the application of deduplication technology.
It takes a unique set of capabilities to optimize the space that stored data takes up in primary storage. In particular, the optimization of primary storage, whether file or block, is very difficult to do without disrupting application performance. Random I/O patterns in primary storage, combined with the stringent performance requirements of production applications, have made optimizing the capacity behind primary storage an almost insurmountable challenge.
Until recently, there was no viable way to overcome these issues. Existing deduplication technologies generally fall short, due to their impact on storage performance, data integrity and/or the data management process. Fortunately for users, a new class of data compression solutions is now emerging that promises to address the specific challenges of primary storage optimization, or PSO. Before we examine these, let’s take a closer look at the specific criteria a solution must meet to satisfy PSO requirements.
While primary storage is a logical place in the storage hierarchy to apply optimization, it has proven challenging for vendors. We believe that a data reduction technology must meet the following criteria to be considered PSO-capable in the enterprise:
• Reliably and consistently reduce primary storage capacity requirements by 50% or more (depending on the data type)
• Do not degrade primary storage performance, in either I/O throughput or latency, whether the workload is fully sequential or completely random
• Completely preserve the original data set
• Provide full transparency (requires no changes to existing IT infrastructure or processes)
Let’s see how the two primary contenders, data deduplication and compression, stack up against these criteria.
Data deduplication technology can reduce the size of a data set by identifying redundant blocks of data and storing just one copy. But while deduplication typically delivers reduction ratios of 12:1 to 20:1 for backup streams, the reduction can drop to no better than 50% (2:1) in most primary storage environments.
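To make the mechanism concrete, here is a minimal Python sketch of fixed-length block deduplication. It is an illustration only, not any vendor's implementation: the 4 KB block size, the SHA-256 fingerprints, and the function names (dedupe, rebuild, stored_bytes) are our assumptions, and the sketch ignores the index and reference metadata that a real system must also store.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen for illustration


def dedupe(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks and keep one copy of each unique block."""
    store = {}  # fingerprint -> block contents (the single stored copy)
    refs = []   # ordered fingerprints, enough to rebuild the original stream
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)  # store only the first occurrence
        refs.append(digest)
    return store, refs


def rebuild(store, refs) -> bytes:
    """Follow the references to reconstruct the original stream."""
    return b"".join(store[digest] for digest in refs)


def stored_bytes(store) -> int:
    """Bytes of unique block data kept (metadata overhead is ignored)."""
    return sum(len(block) for block in store.values())


# Backup-like stream: 20 identical copies of the same 4-block image.
image = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
backup_stream = image * 20
backup_store, backup_refs = dedupe(backup_stream)
backup_ratio = len(backup_stream) / stored_bytes(backup_store)

# Primary-like stream: 80 blocks, every one of them different.
primary_stream = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(80))
primary_store, primary_refs = dedupe(primary_stream)
primary_ratio = len(primary_stream) / stored_bytes(primary_store)

print(f"backup-like:  {backup_ratio:.0f}:1")
print(f"primary-like: {primary_ratio:.0f}:1")
```

In this toy case the stream of repeated copies reduces 20:1 while the stream of unique blocks does not shrink at all, which is the gap between backup data and primary data in miniature.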
More importantly, data deduplication can come with a performance penalty that runs against the grain of primary storage, where storage systems are often over-provisioned just to get a performance-increasing spindle count. While next generation technologies such as solid-state drives (SSDs) are improving performance and enabling organizations to reduce the number of spindles, memory- and latency-intensive deduplication is still not ready for primary storage.
Deduplication simply introduces too much cycle time and overhead on today’s controllers, with processes that revolve around caching data blocks, tracking references across in-memory B-trees or similar indexes, expanding and contracting pattern-matching windows, and caching and rewriting incoming data streams on disk. Inserted into what is traditionally a simple controller-to-disk transfer, where rotational disk latency is measured in milliseconds (or less with SSD), such overhead visibly affects the performance of many application workloads, particularly those with a high degree of active data. So the data to which deduplication can be effectively applied remains limited. The data reduction of deduplication may provide enough benefit to justify extra latency in highly redundant and read-biased data sets, and some users are finding that this is true with user home directories and VMware or Hyper-V boot images, which contain a high degree of overlapping and relatively static data.
Evaluating compression for PSO
Let’s now turn to data compression approaches, and assess their readiness for primary storage optimization. Until recently, data compression could not be performed in real time without degrading performance or compromising data integrity. But new technologies from vendors such as Storwize have altered the equation, making compression not only viable but an attractive option for PSO.
Compression solutions vary, and “state of the art” today means in-the-network (in-band) appliances that can compress data beyond built-in or controller capabilities, and can do so for sets of data distributed across multiple systems. Such in-band solutions can deliver full-speed storage performance without burdening existing controllers (in reality they can actually optimize controller interactions), while simultaneously bringing to bear state-of-the-art compression algorithms.
How does real-time compression work? The approach commonly uses an appliance that sits in-line between a NAS storage array (serving either NFS or CIFS) and users of the data. The appliance applies a standard compression technology such as Lempel-Ziv (LZ), and each file written in compressed format fully preserves the integrity of the original data: all the information needed to access or re-create the original file is contained within the newly compressed file. In contrast to deduplication – which replaces data patterns at file and sub-file levels with pointers to other data, pointers that could in theory reference the wrong data – this “lossless” compression methodology ensures that data integrity is not compromised, which is essential to achieving compliance with key industry regulations such as HIPAA and Sarbanes-Oxley.
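The lossless property is easy to demonstrate. The sketch below uses Python's standard zlib module, whose DEFLATE format combines LZ77 with Huffman coding, as a stand-in for whatever LZ variant a given appliance actually uses. The function names and the sample data are hypothetical, and nothing here reflects a specific product.

```python
import zlib


def compress_file_bytes(original: bytes) -> bytes:
    """Compress a file's contents; the result is self-contained."""
    return zlib.compress(original, 6)  # 6 is a middle-of-the-road level


def restore_file_bytes(compressed: bytes) -> bytes:
    """Re-create the original contents from the compressed bytes alone."""
    return zlib.decompress(compressed)


# Hypothetical text-like file contents with plenty of repetition.
text_like = b"customer_id,order_id,status\n" * 2000

compressed = compress_file_bytes(text_like)
restored = restore_file_bytes(compressed)

# Lossless: the restored bytes are identical to what was written.
assert restored == text_like

reduction_pct = 100 * (len(text_like) - len(compressed)) / len(text_like)
print(f"{len(text_like)} -> {len(compressed)} bytes ({reduction_pct:.1f}% smaller)")
```

Note that decompression needs nothing but the compressed bytes themselves, with no external index or reference table, which is the contrast with deduplication drawn above.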
In-band solutions can deliver even greater integrity by providing end-to-end verification throughout the data path. By preserving data integrity and operating in an appliance, this type of compression satisfies our third and fourth criteria for PSO readiness.
But more importantly, real-time, in-line compression typically reduces primary storage capacity needs by 50% to 90%. As you would expect, the effective compression rate varies with the type of data being stored. For example, database and text files typically achieve compression rates in excess of 80%, while at the other end of the spectrum, PDFs and other formatted documents are usually compressed by no more than 50%.
Compression can also improve the overall performance of the underlying storage system, with an additive effect that more than offsets the minimal appliance overhead. Such solutions compress data on the initial write, which generates fewer and smaller disk I/Os and reduces the disk workload. The data is compressed before it gets to the storage array, which increases the effective size of the storage cache and allows the array to serve more requests from its read/write cache. Reads and writes are faster because they can be serviced in the cache instead of on disk. To accelerate reads even further, in-band appliances can augment array cache with appliance cache.
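Because this discussion quotes some figures as ratios (20:1) and others as percentages (50%, 80%, 90%), a little arithmetic helps when comparing them. The Python sketch below converts between the two and estimates the effective cache size under the simplifying assumption that the array cache holds data in its compressed form. The function names and the 16 GB cache are hypothetical.

```python
def reduction_pct_to_ratio(pct: float) -> float:
    """Convert a capacity reduction percentage to an N:1 ratio (returns N)."""
    if not 0 <= pct < 100:
        raise ValueError("reduction percentage must be in [0, 100)")
    return 100 / (100 - pct)


def ratio_to_reduction_pct(ratio: float) -> float:
    """Convert an N:1 reduction ratio to a capacity reduction percentage."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 100 * (ratio - 1) / ratio


def effective_cache_gb(physical_gb: float, pct: float) -> float:
    """Logical data a cache can hold if its contents are stored compressed."""
    return physical_gb * reduction_pct_to_ratio(pct)


for pct in (50, 80, 90):
    print(f"{pct}% reduction = {reduction_pct_to_ratio(pct):.0f}:1")

print(f"20:1 = {ratio_to_reduction_pct(20):.0f}% reduction")

# Hypothetical 16 GB array cache holding data compressed by 50%.
print(f"effective cache: {effective_cache_gb(16, 50):.0f} GB")
```

Seen this way, a 50% reduction is only 2:1, while a 20:1 backup deduplication ratio corresponds to a 95% reduction, and a cache holding data compressed by 50% behaves like one twice its physical size.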
The primary vendor delivering solutions in this market today is Storwize. In a series of performance tests run jointly by IBM and Storwize, the companies sought to compare the throughput and response time impact of compression on various application workloads. In TPC-C benchmark tests, the compression appliances significantly reduced response time, increased throughput, and reduced CPU and disk utilization on the NAS system, compared to the baseline case in which no compression was applied. The bottom line is that such compression solutions will at minimum maintain, and in many cases enhance, application performance.
Moreover, compression can have a trickle-down benefit throughout the lifecycle of stored data, beyond just primary storage. Unlike deduplicated data, which must be re-inflated when it is accessed, compressed data can remain compressed: there is no need to “re-inflate” or decompress the data when it is accessed, as long as the compression appliance is in-band. Since compression remains in place as data is moved across tiers, compressed data can optimize storage even around sticky issues such as compliance. In addition, when moving into nearline or offline repositories, compressed data can still be optimized by deduplication algorithms from other vendors.
Benefits through the data lifecycle
As an enabling PSO technology, real-time compression delivers some compelling benefits:
- Reduced CAPEX. Compression reduces storage capacity needs, meaning that fewer dollars will be spent on storage infrastructure.
- Energy savings. By significantly shrinking the data footprint at the point of origin, compression reduces space, power and cooling requirements.
- Lower OPEX. The data reduction benefits and associated OPEX savings cascade throughout the entire data lifecycle, as data files are managed, migrated, replicated, backed up and archived.
- Faster backup and archiving. Backup and archiving operations are much faster because less data needs to be moved.
- Increased storage performance. The throughput and response time of many applications will improve due to an increase in effective storage cache size.
The rapidly growing data stores of today’s primary storage environments, coupled with the mandate to maintain or enhance performance service levels, make the optimization of primary storage a must. The stakes are high, both for storage managers and their companies’ bottom lines. Storage managers can no longer afford to address the dual challenges of runaway capacity growth and tougher, user-driven SLAs by throwing more capital and operating resources at the challenges. Moreover, the battle between capacity and performance is being raised to a new level – SSD technology is poised in the wings, waiting for the right storage architectures and price points to become mainstream. But when it does, the smaller capacities of SSD will exacerbate the capacity problem facing storage managers. Adding SSD may well mean that over-provisioning capacity simply to increase performance is a practice of the past, and will bring new pressure for storage administrators to optimize primary capacity.
Data deduplication technology from vendors such as EMC (Data Domain), Exagrid, FalconStor, NetApp, Quantum and Sepaton has done a great job on nearline and backup data. For some vendors with advanced deduplication and optimization architectures – such as GreenBytes, Ocarina and Permabit – deduplication is expanding its ability to optimize data in the primary storage repository. But deduplication is not yet the solution of choice for the mission-critical production workloads of primary storage.
In-band compression is a viable alternative to data deduplication. We recommend that users weigh the cost implications of on-disk data in their primary storage infrastructure, and evaluate external PSO appliances that compress data in-line and in real time, before it gets to the storage array. In addition, to avoid compromising data integrity and regulatory compliance, IT managers should consider only solutions that provide lossless compression.
Users who choose the right data compression solution stand to reap a number of benefits, including increased storage efficiency and reduced capacity and costs across the entire data lifecycle. In a broader context, a capacity optimization strategy combining data compression for primary storage with data deduplication for secondary storage will enable organizations to maximize the returns on their storage-specific CAPEX and OPEX investments.
Jeff Boles and Jeff Byrne are senior analysts with the Taneja Group research and consulting firm.