Capacity optimization is no longer limited to secondary storage. The advantages are the same when applied to primary storage.
By Eric Burgener
The amount of information generated by enterprises has greatly increased over the last five years. Because enterprises often keep multiple copies of data for recovery and other purposes, storage capacity growth is a multiple of information growth and it is spiraling out of control, hitting rates of 50% to 100% per year or more for many companies. Growth at this level puts undue pressure on IT organizations, not only to pay for and manage all of this storage, but also to find floor space for it as well as to power and cool it.
In the 2004 time frame, technologies began to emerge that allowed information to be stored with much lower raw storage capacity requirements. These technologies, sometimes referred to as capacity optimized storage (COS), have now become widely available, with end-user surveys indicating strong growth for COS products over the next 6 to 12 months. Vendors in this space include Data Domain, Diligent Technologies, FalconStor Software, Hifn, NetApp, Quantum, Sepaton, and others.
COS technologies were originally designed for use against secondary storage, such as that used primarily for data-protection purposes. Secondary storage has certain characteristics that figured heavily in how COS solutions were built. First and foremost, since there was so much redundancy in the data stored for data-protection purposes, COS solutions heavily leveraged technologies such as data de-duplication and single-instancing to achieve their data-reduction results. In addition, since most secondary storage was used in offline rather than online environments, the capacity optimization process did not have to meet the stringent performance requirements of online application environments. Using COS solutions, it is realistic over time to achieve data reduction ratios against secondary storage of 15:1 or greater.
But COS left out a huge amount of data found in all application environments: primary storage, which is different from secondary storage in two critical respects: 1) Primary storage is used in online, performance-sensitive environments that have stringent response-time requirements; and 2) Primary storage has little, if any, of the redundancy that makes technologies like data de-duplication and single-instancing so effective against secondary storage. Recently, however, a few vendors have begun shipping capacity optimization solutions specifically for use against primary storage, and the COS market is now splitting into two separate segments: primary storage optimization (PSO) and secondary storage optimization (SSO).
This article discusses the emerging PSO market, reviews the architectures and technologies, and highlights some of the vendors’ products in this space.
Defining an emerging market
In discussions with end users, Taneja Group has discovered that many companies have tried COS technologies on primary storage in test environments. There was a burning curiosity to see how effective COS technologies would be against the data sets used by online applications. What end users have discovered, and many COS vendors undoubtedly have proven in their own internal testing, is that while COS technology offers huge benefits in cost and floor space savings against secondary storage, it achieves much lower data-reduction ratios against primary storage. Because of their strong showing in the SSO market, data de-duplication vendors may have a natural advantage in going after the PSO market, but they will clearly need to develop new technologies to do so effectively.
Because it has very different customer requirements and requires different technologies, PSO is clearly a separate market from SSO. Up to this point, use of the term COS has been synonymous with secondary storage. But with the advent of solutions specifically targeted at primary storage, it is useful to redefine the COS market. One approach to this might define a new, higher-level market called “storage capacity optimization” (SCO), with its two related sub-markets of PSO and SSO (see figure). The overall customer requirement in the storage capacity optimization market is to reduce the amount of raw storage capacity required to store a given amount of information. The two sub-markets define the different sets of technologies that are required to achieve that for primary versus secondary storage.
Approaches to SCO
Vendors tend to fall into two camps with respect to defining SCO approaches: inline and post-processing approaches. Each approach may be implemented using different architectures. Most solutions offload the capacity optimization processing from the application server, performing it with dedicated resources in a card, appliance, or storage subsystem. However, capacity optimization algorithms embedded in operating systems (such as those offered by Microsoft in Windows) and in backup software agents (such as the Symantec NetBackup PureDisk Agent) leverage host resources. This latter approach is a form of inline capacity optimization.
The inline group believes that maintaining the lowest possible storage requirements at all times is the most important metric. These vendors’ products intercept the data and capacity-optimize it before it is ever written to disk. While this does keep overall storage requirements at their lowest possible levels at all times, it does present a performance challenge. Whatever work is required to capacity-optimize the data must be done without impacting performance in a meaningful way. For applications using secondary storage, the performance bar is low since they are not interactive. For primary storage applications, however, this is much more of an issue. Although performance impacts assert themselves differently for PSO and SSO solutions, vendors must architect their solutions so they do not impact performance, or impact it only minimally from an end-user point of view.
The post-processing group believes that the impact on application performance is the key metric. With SSO solutions, these vendors take approaches that will not impact the performance of the initial backup in any way. Post-processing approaches generally write non-optimized data directly to disk and then make a second pass at the data to perform the capacity optimization. Think of this approach as analogous to a trash compactor that can be used on demand (or on a scheduled basis) to reduce the raw storage capacity required to store any given data set. Policies can be implemented that will perform the capacity optimization when the storage capacity reaches a defined threshold. The downside to this approach is that there must always be enough storage available to initially store each new data set in non-optimized form.
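As a rough illustration, the threshold policy described above can be sketched in a few lines of Python. All names here are hypothetical, and the 3:1 reduction ratio is simply an assumed figure within the range PSO vendors cite:

```python
# Hypothetical sketch of a threshold-triggered post-processing policy.
# compact() stands in for whatever optimization pass a product actually runs.
def should_compact(used_bytes, total_bytes, threshold=0.80):
    """Fire the capacity-optimization policy at a utilization threshold."""
    return used_bytes / total_bytes >= threshold

class Volume:
    def __init__(self, total_bytes):
        self.total = total_bytes
        self.used = 0

    def write(self, nbytes):
        self.used += nbytes               # data lands on disk unoptimized first
        if should_compact(self.used, self.total):
            self.compact(ratio=3.0)       # ...then a second pass compacts it

    def compact(self, ratio):
        self.used = int(self.used / ratio)

vol = Volume(total_bytes=1000)
vol.write(790)     # 79% utilization: below the threshold, no pass runs
assert vol.used == 790
vol.write(20)      # 81% utilization: compaction fires (810 / 3 = 270)
assert vol.used == 270
```

Note how the policy only reclaims space after the fact, which is why enough headroom must always exist to land each new data set in non-optimized form.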
Underneath the covers of each approach, vendors offer different methods to actually perform the capacity optimization. Approaches that examine the data at a lower level of granularity (e.g., sub-file level instead of file level) tend to offer higher capacity-optimization ratios, as do approaches that can apply variable-length windows (when doing data comparisons) instead of just fixed-length windows. Format-aware approaches, such as the tape-format-aware algorithms offered by FalconStor, can offer some additional data reduction relative to non-format-aware approaches. Discussion of how these algorithms work is beyond the scope of this article.
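To make the fixed- versus variable-length distinction concrete, here is a toy Python sketch. The boundary test is deliberately simplistic; shipping products use much stronger rolling hashes (e.g., Rabin fingerprints):

```python
import hashlib

def fixed_chunks(data, size=8):
    """Fixed-length windows: one inserted byte shifts every later chunk."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data, window=4, mask=0x0F):
    """Variable-length windows via a toy content-defined boundary test:
    cut wherever a checksum of the last `window` bytes matches a pattern.
    (Real products use stronger rolling hashes, e.g. Rabin fingerprints.)"""
    chunks, start = [], 0
    for i in range(window, len(data)):
        if (sum(data[i - window:i]) & mask) == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks

# Both schemes are lossless partitions of the input...
data = hashlib.sha256(b'example').digest() * 4
assert b''.join(fixed_chunks(data)) == data
assert b''.join(variable_chunks(data)) == data
# ...but only the variable scheme picks boundaries from content, so it can
# re-synchronize after an insertion instead of shifting every later chunk.
```

Because variable-length boundaries are derived from the data itself, an insertion early in a file perturbs only nearby chunks, which is why variable windows tend to find more duplicates.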
One final comment on general SCO approaches concerns scalability. When looking for redundant data, capacity optimization basically breaks data down into smaller components and looks to eliminate redundancy at the building-block level. A data repository is maintained of all the building blocks that the solution has “seen” before, and when it finds another instance of one of these building blocks, it inserts a reference to the instance of that building block that it is retaining in the repository and removes the duplicate object. Solutions that retain the latest instance of a given building block, as opposed to the original instance (which tends to become fragmented over time), tend to offer better read performance. The larger the data repository, the greater the chance that any given building block already resides in it. Regardless of how large any single repository can be, solutions that can cluster repositories together to build a very large logical repository tend to offer better scalability. For many capacity-optimization vendors, the ability to offer clustering forms the basis of their claim to be an enterprise solution.
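The repository mechanism can be illustrated with a minimal Python sketch, assuming fixed-length building blocks and SHA-256 digests as references (real products vary on both counts):

```python
import hashlib

class ChunkRepository:
    """Toy single-instance store: each unique building block is kept once,
    and duplicates are replaced by a reference (here, the block's digest)."""
    def __init__(self):
        self.blocks = {}              # digest -> one retained block instance

    def store(self, data, block_size=8):
        refs = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)   # keep a single instance
            refs.append(digest)
        return refs                   # the "file" is now a list of references

    def read(self, refs):
        return b''.join(self.blocks[r] for r in refs)

repo = ChunkRepository()
payload = b'ABCDEFGH' * 100                  # highly redundant data
refs = repo.store(payload)
assert repo.read(refs) == payload            # lossless reconstruction
assert len(repo.blocks) == 1                 # 100 duplicate blocks stored once
```

The larger this dictionary grows, the better the odds that an incoming block is already present, which is exactly why clustered repositories improve reduction ratios as well as scalability.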
The first solutions in the PSO space started shipping in 2005 from Storwize. In 2007, NetApp released a product that is now called NetApp De-Duplication for FAS (formerly called Advanced Single Instance Storage, or A-SIS). Although NetApp initially positioned this utility in Data ONTAP for use with secondary storage, in 2008 the company began messaging it for the primary storage market as well. Also this year, Hifn, an OEM supplier of hardware-accelerated security and enhanced compression solutions that appear “under the covers” in many products from enterprise storage suppliers, added PSO to its existing SSO message.
And Ocarina Networks entered the PSO market with an announcement at this month’s Storage Networking World conference.
Interestingly, each of these vendors uses a different architecture. Note that these different approaches characterize how writes to storage are handled. All PSO solutions handle reads of capacity-optimized data at wire speeds.
Storwize uses an inline approach, with an in-band appliance (the STN Appliance) that offers real-time capacity optimization at wire speeds with no impact on application performance. Targeted for use with IP networks, Storwize’s appliance uses patented methods designed specifically for primary storage, based on enhanced Lempel-Ziv algorithms, to achieve data-reduction ratios averaging in the range of 3:1 to 9:1. The STN Appliances maintain caches that can in some instances provide better-than-native read performance. Multiple appliances can be clustered against a large, back-end data repository to support hundreds of terabytes of storage capacity. Deployment is transparent and does not require network reconfiguration. As the pioneer in the PSO market, Storwize offered solutions about two years before other vendors began to address this space.
NetApp and Ocarina Networks offer post-processing approaches to avoid impacting the performance of online applications using primary storage.
NetApp believes that, over the next several years, capacity optimization technologies will migrate into the infrastructure layer and will be available as part of server, storage, and/or operating system platforms. NetApp’s De-Duplication for FAS was recently bundled into its Data ONTAP software at no additional charge, adding to the overall value of NetApp’s storage platforms. Because it resides in a storage server that can support either file- or block-based storage, it can be used against either or both to achieve data reduction ratios of up to 6:1 against primary storage and up to 20:1 against secondary storage.
In NetApp’s case, there are two advantages to implementing capacity optimization as an operating system utility. First, it leverages close integration with NetApp’s WAFL file system to incur extremely low overhead. For reliability purposes, WAFL already calculates a unique checksum, called a fingerprint, for each block of data. To de-duplicate data, NetApp simply uses these fingerprints (which are already calculated anyway) to search for and identify duplicate blocks as a separate batch process, adding no overhead to ordinary file operations. Second, it can be easily integrated with other Data ONTAP features to provide higher-level solutions. For example, integration with NetApp’s Thin Provisioning supports an “autosize” feature that will run capacity optimization algorithms as needed to keep a given volume under a size defined by the administrator. Integration with SnapVault and SnapMirror allows capacity optimization to be leveraged by policy to help minimize storage requirements for snapshots or to minimize the amount of data sent across the network for disaster-recovery purposes.
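A simplified Python sketch of fingerprint-driven batch de-duplication, in the spirit of the scheme described above (the function and data layout are illustrative, not NetApp’s actual implementation). Fingerprints only nominate candidate duplicates; a byte-for-byte compare confirms each match before blocks are shared:

```python
import hashlib
from collections import defaultdict

def batch_dedup(blocks):
    """Run duplicate detection as a separate batch pass over per-block
    fingerprints, so ordinary writes incur no extra comparison work."""
    fingerprints = defaultdict(list)
    for idx, block in enumerate(blocks):
        fp = hashlib.sha256(block).digest()   # stands in for a stored checksum
        fingerprints[fp].append(idx)

    block_map = {}   # duplicate index -> index of the single retained instance
    for candidates in fingerprints.values():
        keeper = candidates[0]
        for dup in candidates[1:]:
            if blocks[dup] == blocks[keeper]:  # confirm before sharing
                block_map[dup] = keeper
    return block_map

blocks = [b'aaaa', b'bbbb', b'aaaa', b'cccc', b'aaaa']
assert batch_dedup(blocks) == {2: 0, 4: 0}   # two duplicates fold into block 0
```

The key property is that the fingerprint lookup is the only per-block cost, and even that runs out of band, which is why this style of de-duplication adds essentially no overhead to the online write path.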
Targeted for use with IP storage, Ocarina Networks is the only vendor so far that offers format-aware optimization against primary storage. Although some SSO vendors have claimed that application-specific approaches to capacity optimization are not effective, this tends to be relevant only for inline capacity optimization. Format-aware capacity optimization takes more CPU cycles and does take slightly longer, but it does not present performance issues when used by out-of-band approaches.
At least one SSO vendor, FalconStor, is also using a post-processing, format-aware approach. (FalconStor uses a tape-format-aware method in its SIR product.)
Ocarina Networks’ format-aware optimization recognizes file types and their contents, de-layers complex compound document types, can optimize already-compressed formats, and can perform de-duplication at the object level. Its out-of-band appliance, called the Optimizer, selects files by policy after they have been written to storage, identifies each file by type, and then routes it to the appropriate format-aware optimizer.
Ocarina offers format-aware optimizers for a number of common primary storage environments, including Microsoft Office workloads that contain home directories, PDFs, and digital photo sets (GIF, JPEG), and Internet e-mail mixes that include blogs, e-mails, and text messages. Ocarina claims to achieve data reduction ratios approximately 3x better than those achieved with enhanced Lempel-Ziv algorithms, but its products are too new to have end users in production who can support those claims.
With its Express DR line of hardware acceleration cards, Hifn offers its OEM customers the option of deploying either inline or post-processing approaches. Hifn’s customers often embed these cards into their virtual tape library (VTL) or backup appliance products, leveraging them as one of several capacity optimization methods that are serially applied against secondary storage. Hifn deploys the same set of proprietary methods, based on its own enhanced Lempel-Ziv compression algorithms, against both primary and secondary storage, but achieves realistic data reduction ratios against primary storage of 2:1 to 4:1. Because Hifn’s cards can handle wire speeds of up to 1GBps, they can also be used in inline solutions.
All of these vendors offer, or will be offering, a predictive tool that can be deployed in an hour or two to estimate the data reduction ratios their PSO solutions can achieve in a particular environment. Since achievable data reduction ratios are very sensitive to the characteristics of different data types, use of such a tool prior to purchase is highly recommended. Storwize and Ocarina Networks offer such tools today, and Hifn plans to offer one later this year. NetApp does not need a separate tool for this, since its capacity optimization functionality ships at no additional charge with its operating system and can be enabled at the volume level to test against snapshots of primary production data without impacting ordinary file operations.
Benefits of PSO
Capacity optimization offers many of the same benefits for primary storage as it does for secondary storage. PSO reduces raw storage capacity growth in primary storage, lowering not only spending for new storage capacity, but also the costs associated with storage management, floor space, power, and cooling. Note that for many IT shops, the cost per terabyte of primary storage is greater than that of secondary storage due to higher performance and reliability requirements. While the data reduction ratios may not be as great with primary storage as with secondary storage, many shops will enjoy greater savings for each “reclaimed” terabyte of primary storage.
Because it lowers the overall capacity required for primary storage, PSO offers enterprises of all sizes other benefits as well:
- Shortened overall backup-and-restore times, since less data must be written to or retrieved from disk for any given data set; and
- In cases where data sets must be shipped across networks, the smaller, capacity-optimized data sets require less bandwidth, thereby reducing network traffic.
Note that PSO can be a complementary technology to SSO. Solutions that use different capacity optimization methods for primary and secondary storage can actually offer additive data reduction advantages. Data reduction ratios with combined use will vary based on the actual solutions used and the workload types. The only way to really understand the benefit that PSO, or a combination of PSO and SSO, will provide is to test it on specific workloads.
Challenges with PSO
Several issues need to be evaluated as PSO technology is considered. First, does it really pose no performance impact in your environment? This is not just a concern for inline approaches. Understand how long it will take a post-processing solution to complete its capacity optimization task. What is the impact (if any) on online application performance during this process? This is less of a concern for out-of-band approaches than for in-band approaches, but keep in mind that out-of-band approaches do actually move data back and forth to primary storage during the process.
Although it is not an unfamiliar challenge, another concern with PSO is how to retrieve capacity-optimized data in the event of a problem with the PSO solution. For hardware-based solutions, simple redundancy at the appliance or card level can be sufficient to handle any single point of failure. Note that, unlike with encryption, there is nothing random about how data is capacity-optimized. The same capacity-optimization methods are predictably used across all models of a certain type in a vendor’s product line, so any other similar model could be used to retrieve the data.
A final concern is one shared by both PSO and SSO. Because the technologies basically refer to a single instance of an object that appears multiple times, if for some reason that object gets corrupted, the damage can potentially be much greater than just losing that object in non-capacity-optimized storage. Depending on the data reduction ratio achieved, loss of a single object could potentially affect thousands of instances of that object in each of the files, file systems, or databases where it also appears. The first line of defense against this is that most customers are already using some form of RAID, providing redundancy against single points of failure at the hardware level.
Over and above the hardware RAID approach, vendors offer two additional and optional methods to address this issue: integrated metadata and multiple spindling.
With the integrated metadata approach, a PSO solution effectively implements a virtualization layer that handles the abstraction of a single physical copy of an object to however many redundant copies exist. Each time the virtualization layer creates a reference to an object, metadata is saved along with that object that effectively enables its re-creation in the event the primary instance becomes corrupted. In a manner similar to how chkdsk can be used to rebuild an NTFS file system block by block, this metadata can be used to re-create any object (albeit at a relatively slow rate).
The multiple spindling approach ensures no single data element exists only on one spindle in a given logical volume. Think of this as “mirroring” at the object level, ensuring any single data element is always available on at least two spindles. Both of these approaches, while offering improved data reliability, do lower the overall data reduction ratio slightly.
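A minimal Python sketch of object-level mirroring across spindles, assuming a least-full placement policy (a simplification of whatever placement logic shipping products actually use):

```python
def place_object(obj_id, spindles, copies=2):
    """Object-level 'mirroring': put each data element on at least `copies`
    spindles, choosing the least-full ones (a simplified placement policy)."""
    targets = sorted(spindles, key=lambda s: len(spindles[s]))[:copies]
    for s in targets:
        spindles[s].append(obj_id)
    return targets

spindles = {'sp0': [], 'sp1': [], 'sp2': []}
for obj in ['blk-a', 'blk-b', 'blk-c']:
    placed = place_object(obj, spindles)
    assert len(placed) == 2            # every element lives on two spindles

# Losing any single spindle still leaves a full copy of every element.
for lost in spindles:
    surviving = {o for s, objs in spindles.items() if s != lost for o in objs}
    assert surviving == {'blk-a', 'blk-b', 'blk-c'}
```

As the surrounding text notes, keeping two physical instances of each element is what erodes the data reduction ratio slightly in exchange for reliability.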
Primary storage is an area ripe for capacity optimization, and enterprises of all sizes have a lot to gain from deploying reliable implementations of this technology over time. The compelling economic payback is achieved against your most expensive storage tier, and the benefits of data reduction then roll through other tiers (nearline, offline). If you are considering deploying both PSO and SSO (which is a smart idea in the long term), choose complementary technologies for each tier to maximize your overall data reduction ratios. SSO has achieved penetration rates of about 20% in the industry (slightly higher in small and medium-sized enterprises) but shows strong purchase intent over the next 6 to 12 months across all segments. As enterprises come to trust SSO, this will help PSO achieve similar penetration rates more rapidly.
As with any emerging technology, there are questions of performance and reliability. Demand solutions that impose no perceivable performance impact. Use high-availability approaches, such as clustering PSO appliances, to help address reliability issues. While there are many reference customers, enterprises have been cautious in their deployment of PSO to date. Decide up front whether inline or post-processing approaches are best suited to your environment and requirements. Check references and use predictor tools where possible. If a PSO solution cannot offer data reduction ratios of at least 3:1 against your particular workloads, it may not pay to implement it over basic compression technologies that are proven and widely available. Expect to achieve realistic data reduction ratios across varied workloads in the 5:1 range with PSO technology.
Eric Burgener is a senior analyst and consultant with the Taneja Group research and consulting firm (www.tanejagroup.com).
The difference between compression and data de-duplication
Often based on Lempel-Ziv (LZ) algorithms, compression uses an encoding scheme to reduce the number of bits required to represent data. The result can then be decoded to retrieve the original data.
There are two types of compression: lossless and lossy. Lossless compression can exactly re-create the original string, while lossy compression can only re-create a close approximation of it; in exchange, lossy compression offers higher data reduction ratios. Data reduction ratios using general-purpose compression are typically not greater than 2:1.
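A quick illustration of lossless compression using Python's zlib module (a DEFLATE implementation built on LZ77 coding): redundant data compresses well and decompresses exactly, while data with no statistical redundancy barely compresses at all:

```python
import os
import zlib

# Redundant text: the LZ coder finds the repeats and emits back-references.
text = b'the quick brown fox jumps over the lazy dog. ' * 50
packed = zlib.compress(text)
assert zlib.decompress(packed) == text     # lossless: exact re-creation
assert len(packed) * 2 < len(text)         # better than 2:1 on this input

# Random bytes have no statistical redundancy to exploit, so the output
# is essentially the input plus a small amount of framing overhead.
noise = os.urandom(2048)
assert len(zlib.compress(noise)) > len(noise) - 64
```

The highly repetitive test string compresses far beyond 2:1; the roughly 2:1 ceiling cited above applies to typical mixed real-world data, not to pathological inputs in either direction.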
Data de-duplication, like compression, takes advantage of the fact that most data has some statistical redundancy to reduce the number of bits required to represent it. Using higher-level algorithms that generally operate at the sub-file level (compression operates at the file level), data de-duplication looks for patterns within files that also appear in other files, and generally achieves much higher data reduction ratios than standard compression.
Some vendors offer global data de-duplication repositories that can be used to de-duplicate data across systems, whereas compression references are specific to one system. Data reduction ratios using data de-duplication can be 15:1 or greater for secondary data sets, such as repeated backups over time.