BY ERIC BURGENER
Data is growing at explosive rates, but the real news around that topic concerns new technologies designed to help users manage this growth.
Storage capacity optimization (SCO) technologies are being used to reduce the amount of physical storage capacity required to store a given amount of data. They include enhanced compression, file-level single-instancing, and data de-duplication, among others.
Generally defined as capacity optimization approaches that achieve at least a 3:1 data reduction ratio, solutions built around these technologies can often deliver ratios as high as 20:1 or greater, depending on the data types. Savings accrue in the areas of storage infrastructure, management overhead, and energy and floor-space costs, while significantly increased storage densities can enable new tiering strategies that provide the foundation for a variety of operational improvements.
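To make the arithmetic behind those ratios concrete, here is a brief sketch (a hypothetical illustration, not figures from any vendor): at the 3:1 floor, 100TB of logical data requires roughly 33TB of physical capacity, while at 20:1 it requires only 5TB.

```python
def raw_capacity_needed(logical_tb: float, ratio: float) -> float:
    """Physical capacity required to hold `logical_tb` of logical data
    at a given capacity optimization ratio (3.0 means 3:1)."""
    return logical_tb / ratio

print(raw_capacity_needed(100, 3))   # roughly 33.3 TB
print(raw_capacity_needed(100, 20))  # 5.0 TB
```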
Since its initial introduction in 2004 by Data Domain, SCO technology has achieved strong penetration in both large and small enterprises, with purchase intent on the increase for 2009. Today, SCO is primarily deployed in secondary storage applications such as backup and archive—market segments Taneja Group refers to as secondary storage optimization (SSO).
Data migrates to different storage tiers as it ages, traversing the network. This makes the network a potentially advantageous location for deploying SCO technology.
The other major market segment within SCO is primary storage optimization (PSO), where vendors are applying capacity optimization technologies against primary application environments, such as databases, home directories, online image repositories, and other unstructured data environments that don’t have anything to do with data protection. Most of the major industry players have entered the SSO space, whether by acquisition or OEM agreements, including EMC, HDS, HP, IBM, NetApp, Sun, and Symantec, and there are a number of smaller vendors in this space as well, including Active Circle, Data Domain, Exagrid, FalconStor, Hifn, NEC, Overland, Permabit, Quantum, Sepaton, and Tarmin. PSO is a space that is still primarily dominated by smaller companies.
PSO is defined separately from SSO because primary storage is different from secondary storage in two key ways:
- Access latencies for primary storage are generally much more stringent than for secondary storage. Because SCO processing introduces latency, approaches and algorithms that have worked successfully in the SSO space do not necessarily translate well for use with primary storage.
- Primary storage generally exhibits significantly less redundancy than secondary storage. SSO algorithms that are just looking for data redundancies do not necessarily produce the highest capacity optimization ratios when used against primary storage. These two characteristics have led a number of vendors to introduce capacity optimization algorithms that are specific not only to primary storage but also to specific data types in an effort to achieve higher capacity optimization ratios.
A lot has happened in the PSO market in the last year, including two recent developments that are likely to herald some significant changes in the industry over the next two years. In this article, we’ll review the available architectures, trends and issues in this emerging space, and examine two critical developments.
PSO products have developed along lines similar to those used with SSO products. There are in-line and post-processing architectures, generic and application content-aware algorithms, and different locations where the technology can be deployed. Since early 2007, when there was only one vendor (Storwize) focusing on PSO, there has been a trend towards architectural proliferation, with six vendor offerings in the market now representing a variety of different approaches. With thousands of production deployments in the market today, some trends are starting to emerge that end users can use to help guide future deployments.
First, let’s take a look at the different architectures:
- In-line vs. post-processing. With in-line processing, data is capacity optimized in real time so that it is already in capacity optimized form before it is ever written to a storage target. In post-processing approaches, data is first written to the storage target in its original form, then a secondary process picks that data up, capacity-optimizes it, and writes it back to primary storage. In-line approaches require less overall raw storage capacity, but processing speed may be an issue as there is a concern that primary application performance may be negatively impacted. Post-processing approaches introduce no additional latencies that may impact primary applications, but they do require more storage, with the actual amount of incremental storage depending on how quickly the data is processed into capacity optimized form.
In-line vendors include greenBytes and Storwize, while post-processing vendors include Ocarina Networks. Two vendors, NetApp and Hifn, offer PSO technologies that can be deployed either in-line or post-processing, depending upon implementation or configuration.
- Generic and content-aware algorithms. Generic approaches use the same capacity optimization algorithms against all data types, whereas content-aware algorithms first identify the data type and then apply an algorithm that was specifically developed for that particular data type. Generic approaches introduce less processing latency, but they may not offer capacity optimization ratios as high as content-aware approaches. Content-aware solutions, however, may be limited in the number of data types they handle. Vendors leveraging generic algorithms include greenBytes, Hifn, NetApp and Storwize, while only Ocarina Networks deploys a content-aware approach.
- Different deployment locations. Source-based solutions capacity optimize the data on the source that created the data, whereas target-based solutions use off-host resources to capacity-optimize the data. Source-based approaches draw on host-based resources, and can potentially impact application server performance in ways that target-based approaches do not. Although source-based solutions are available in the SSO market today, most notably integrated with backup clients, all currently available PSO products are target-based. That is likely to change over the next 6 to 12 months, however, with major vendors such as Microsoft and Sun talking about possibly integrating SCO technologies into their operating system platforms, as well as the recent introduction of hardware-based PSO on a card by Hifn, an OEM vendor which does not sell SCO solutions directly to end users.
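The in-line model described above can be sketched in a few lines. This is a toy illustration under stated assumptions (fixed-size chunks and an in-memory index; shipping products use far more sophisticated chunking and persistence), not any vendor's implementation. Data is fingerprinted before it is written, so only unique chunks ever reach the store; a post-processing design would instead land the raw data first and run the same de-duplication pass as a secondary job.

```python
import hashlib

class DedupStore:
    """Toy block-level de-duplication store (hypothetical example)."""

    def __init__(self, chunk_size: int = 4096):
        self.chunk_size = chunk_size
        self.blocks = {}   # fingerprint -> unique chunk bytes
        self.files = {}    # file name -> ordered list of fingerprints

    def write_inline(self, name: str, data: bytes) -> None:
        """In-line path: de-duplicate before anything hits the store."""
        fps = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.blocks.setdefault(fp, chunk)  # store each unique chunk once
            fps.append(fp)
        self.files[name] = fps

    def read(self, name: str) -> bytes:
        """Reassemble a file from its fingerprint list."""
        return b"".join(self.blocks[fp] for fp in self.files[name])
```

Writing 8KB of identical bytes stores only a single 4KB chunk, while reads reassemble the original data transparently.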
PSO issues and trends
When SCO technologies were first introduced, end users were concerned about a number of issues. Performance (throughput and latency) was a concern for both in-line and post-processing solutions, and those evaluating in-line approaches additionally focused on the latencies introduced by PSO algorithm processing as data was being written to primary storage.
Data reliability was also top of mind: Could solutions offer a way to validate that the data that was initially written into the capacity-optimized store was the data that you actually got back out?
And finally, how were availability issues addressed? If PSO solutions failed, how did this impact data availability, and how quickly could recoveries occur?
For most products in the market today, these issues have been addressed. Solutions that can handle wire speeds of up to 800MBps are available, making them applicable to even many of the most mission-critical applications. Added latencies during reads (and re-conversion) are well under 10 milliseconds for many of these solutions. Data fingerprinting based on 128- or 256-bit hashing algorithms minimizes hash collision risks, and separate validation checks, generally performed through a form of checksumming, verify that data is being reliably retrieved. Availability is generally addressed by deploying PSO appliances in pairs, with indexes mirrored between appliances in cases where that is required. Some vendors, such as Storwize, leverage enhanced compression algorithms that do not rely on an index and so have nothing to mirror.
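The fingerprint-plus-checksum scheme above can be sketched as follows. This is a simplified illustration, with SHA-256 standing in for a 256-bit fingerprint and CRC32 for the separate validation check; shipping products use their own formats and on-disk layouts.

```python
import hashlib
import zlib

def store_chunk(index: dict, chunk: bytes) -> str:
    """Fingerprint a chunk with SHA-256 (a 256-bit hash) and keep an
    independent CRC32 checksum for validation on the read path."""
    fp = hashlib.sha256(chunk).hexdigest()
    index[fp] = (chunk, zlib.crc32(chunk))
    return fp

def read_chunk(index: dict, fp: str) -> bytes:
    """Return the chunk only if its stored checksum still matches."""
    chunk, crc = index[fp]
    if zlib.crc32(chunk) != crc:
        raise IOError("checksum mismatch: stored data is corrupt")
    return chunk
```

The fingerprint addresses the chunk; the checksum is a second, independent test that what comes back out is what went in.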
Some applications compress data using standard Lempel-Ziv-based algorithms before storing it, as standard operating procedure. Beginning with the 2007 release, Microsoft Office does this for all data files in Word, PowerPoint, and Excel. Office files are a major contributor to the explosive data growth the industry is experiencing. As more applications take this approach, it may affect the capacity optimization ratios that solutions using enhanced compression techniques can achieve against these data types. This makes it more important than ever to evaluate PSO technologies against your data sets prior to deployment.
While capacity optimization ratios achievable with PSO can vary wildly by data type, users should expect to see lower ratios against primary storage than what has been achieved with SSO against secondary storage. Note, however, that primary storage tends to be significantly more expensive than secondary storage on a $/GB basis, so lower ratios can still result in higher overall savings, depending on the size of the data sets against which PSO is deployed.
And don’t rely on vendor claims of average capacity optimization ratios being achieved with their technologies: The only important metric is what type of capacity optimization ratios you achieve against your data. Most vendors offer “predictor” software that can be run on a server to predict the ratios achievable with their technology to get an idea of the value it offers before you install the complete solution. Discussions with references can indicate the accuracy these predictors have achieved against data sets similar to your own.
With deployments on the rise over the course of the last year, PSO technology is proving itself across a variety of different vertical markets, including financial services, rich media, social networking, medical and life sciences, oil and gas, telecommunications, and manufacturing environments. All PSO vendors support NAS interfaces to their products (greenBytes uniquely also supports an iSCSI interface), so by definition, all of the target data sets are unstructured.
PSO solutions shipping today do seem to be applicable across a variety of different data types, and vendors in this arena seem uniformly concerned that their technologies might be pigeonholed for use against a particular data type. Still, it’s interesting to note that Ocarina Networks has racked up some wins with image-based data, boasting a 20PB installation with Kodak for photographs stored on-line, while Storwize is the only vendor quoting performance data (throughput and capacity optimization ratios) against OLTP database environments (Oracle running on NAS) based on production installations. NetApp doesn’t appear to have developed any particular data type affinities, but the company has packaged its solution, called NetApp DeDupe for FAS, with every Data ONTAP shipment, and so has a larger market footprint than the other vendors. As a recent market entrant, greenBytes is pushing the “green” envelope farther than other players in this space by integrating massive arrays of idle disks (MAID) technology into its in-line, target-based PSO solution. Virtual machine environments present a great opportunity for PSO, and most of the vendors have a number of customers using their technology in this way.
Today, all SCO solutions are software-based. But Hifn's introduction of a hardware approach, based on a data de-duplication ASIC on a card, will make it easier and less expensive for OEMs to integrate SCO technologies into their server and storage platforms. Based on Hifn's Express DR 255 card, this product must be combined by the OEM with separate indexing and data redundancy technologies not available from Hifn to form end-user solutions, so it will likely not be until 2H09 that we start to see products based on this technology. But it is apt to draw a loose analogy with what happened to software-based compression products when hardware-based compression was introduced. Also, the higher performance that presumably will be enabled by running de-duplication in hardware could eventually broaden the applicability of SCO technologies against primary storage, where performance is more of an issue, particularly for in-line solutions.
In the long run, the availability of hardware-based de-duplication will lead to broader deployment of SCO technologies and will drive prices down. It is a shot across the bow of both PSO and SSO vendors today, whose solutions are based on software.
The other game-changing development is the impending release of a network-based SCO technology that the vendor, Riverbed Technology, claims can be used against both primary and secondary storage. Due in the first half of 2009, this approach leverages Riverbed’s Steelhead line of wide area data services (WDS) appliances as the capacity-optimization engine. At this point, there are still a lot of performance and scalability questions around this particular implementation, but if SCO can be applied as a network service, it could effectively recast source- and target-based solutions as niche plays. Niche plays can be profitable business models in the PSO space, given sufficient data volumes, but such a development is likely to change the SCO market landscape.
In gauging the effect of these two developments on the market over the next several years, Taneja Group believes that the availability of network-based SCO, applicable to both primary and secondary storage—and possibly supercharged by hardware-based capacity optimization algorithms—will force end users to think more strategically about how they deploy SCO technology. If data can be capacity-optimized soon after its creation, and kept in that form except when it is being used by applications, this could go a long way towards managing explosive data growth cost-effectively. The network could be an efficient and potentially cost-effective central point where data is capacity-optimized and/or re-expanded prior to use. Ultimately, it will come down to the performance, scalability, and reliability of network-based SCO implementations, but theoretically, the advent of this model is a game-changer.
The value of SCO technology to primary storage is as interesting as its use against secondary storage. If vendors can supply SCO technologies that can be effectively used against both primary and secondary storage, there will be a clear end-user preference for these solutions over more niche-oriented solutions that can just address primary or secondary storage, but not both.
But such unified solutions will not arrive anytime soon. The value to be gained from SCO technology is very much linked to the overall capacity of data that must be retained, regardless of whether that data is primary or secondary. With data growing 50 to 60% a year or more, many enterprises will soon be managing hundreds of terabytes of primary data, if they are not already. At this scale, and given that the cost of primary storage can be 10x the cost of disk-based secondary storage, capacity optimization ratios do not need to be very high before PSO becomes a compelling economic imperative.
PSO’s bright future
Our forecast of these developments should not preclude the tactical purchase of either PSO or SSO solutions today, provided you can show compelling economic cases to do so based on near-term hard cost savings. Given sufficient data volumes, both PSO and SSO technologies offer huge value to enterprises today, but more comprehensive solutions, which cover both primary and secondary storage, could replace niche solutions over the next three to five years.
The PSO market has expanded from a single vendor to six vendors, and some of the major industry players are expected to enter the market this year. PSO has proved its value, and is being widely considered for use in enterprises of all sizes, but it is definitely not yet a mainstream technology. Given the cost of primary storage and explosive data growth rates, though, it’s a technology whose time has come.
ERIC BURGENER is a senior analyst and consultant with the Taneja Group research and consulting firm (www.tanejagroup.com).