InfoStor Online Article
It's time for primary storage optimization

Capacity optimization is no longer limited to secondary storage. The advantages are the same when applied to primary storage.

By Eric Burgener

April 11, 2008

The amount of information generated by enterprises has greatly increased over the last five years. Because enterprises often keep multiple copies of data for recovery and other purposes, storage capacity growth is a multiple of information growth and it is spiraling out of control, hitting rates of 50% to 100% per year or more for many companies. Growth at this level puts undue pressure on IT organizations, not only to pay for and manage all of this storage, but also to find floor space for it as well as to power and cool it.

In the 2004 time frame, technologies began to emerge that allowed information to be stored with much lower raw storage capacity requirements. These technologies, sometimes referred to as capacity optimized storage (COS), have now become widely available, with end-user surveys indicating strong growth for COS products over the next six to twelve months. Vendors in this space include Data Domain, Diligent Technologies, FalconStor Software, Hifn, NetApp, Quantum, Sepaton, and others.

COS technologies were originally designed for use against secondary storage such as that used primarily for data-protection purposes. Secondary storage has certain characteristics that figured heavily in how COS solutions were built. First and foremost, since there was so much redundancy in the data stored for data-protection purposes, COS solutions heavily leveraged technologies such as data de-duplication and single instancing to achieve their data-reduction results. In addition, since most secondary storage was stored in offline rather than online environments, the capacity optimization process did not have to meet the stringent performance requirements of online application environments.
Using COS solutions, it is realistic over time to achieve data reduction ratios against secondary storage of 15:1 or greater. But COS left out a huge amount of data found in all application environments: primary storage, which is different from secondary storage in two critical respects: 1) Primary storage is used in online, performance-sensitive environments that have stringent response time requirements; and 2) Primary storage has little if any of the redundancy that makes technologies like data de-duplication and single instancing so effective against secondary storage.

Recently, however, a few vendors have begun shipping capacity optimization solutions specifically for use against primary storage, and the COS market is now splitting into two separate segments: primary storage optimization (PSO) and secondary storage optimization (SSO). This article discusses the emerging PSO market, reviews the architectures and technologies, and highlights some of the vendors' products in this space.

DEFINING AN EMERGING MARKET

Because it has very different customer requirements and requires different technologies, PSO is clearly a separate market from SSO. Up to this point, use of the term COS has been synonymous with secondary storage. But with the advent of solutions specifically targeted at primary storage, it is useful to re-define the COS market. One approach to this might define a new, higher-level market called "storage capacity optimization" (SCO), with its two related sub-markets of PSO and SSO.

The overall customer requirement in the storage capacity optimization market is to reduce the amount of raw storage capacity required to store a given amount of information. The two sub-markets define the different sets of technologies that are required to achieve that for primary vs. secondary storage.

APPROACHES TO SCO

The inline group believes that maintaining the lowest possible storage requirements at all times is the most important metric.
These vendors' products intercept the data and capacity-optimize it before it is ever written to disk. While this does keep overall storage requirements at their lowest possible levels at all times, it does present a performance challenge. Whatever work must be done to capacity-optimize the data must be done so as not to impact performance in a meaningful way. For applications using secondary storage, the performance bar is low since they are not interactive. For primary storage applications, however, this is much more of an issue. Although performance impacts assert themselves differently for PSO and SSO solutions, vendors must architect their solutions so they do not impact performance or impact it only minimally from an end-user point of view.

The post-processing group believes that the impact on application performance is the key metric. With SSO solutions, these vendors take approaches that will not impact the performance of the initial backup in any way. Post-processing approaches generally write non-optimized data directly to disk and then make a second pass at the data to perform the capacity optimization. Think of this approach as analogous to a trash compactor that can be used on-demand (or on a scheduled basis) to reduce the raw storage capacity required to store any given data set. Policies can be implemented that will perform the capacity optimization when the storage capacity reaches a defined threshold. The downside to this approach is that there must always be enough storage available to initially store each new data set in non-optimized form. (See sidebar.)
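The difference between the inline and post-processing write paths described above can be illustrated with a toy model. The following is only a sketch under stated assumptions, not any vendor's implementation: the `BlockStore` class, its method names, the fixed 4 KB block size, and the use of SHA-256 fingerprints are all hypothetical choices made for illustration, and fixed-block de-duplication stands in for whatever optimization a real product performs.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; chosen only for illustration


class BlockStore:
    """Toy store contrasting inline and post-processing capacity optimization."""

    def __init__(self, raw_threshold_bytes):
        self.blocks = {}  # fingerprint -> block bytes (unique blocks only)
        self.files = {}   # file name -> list of fingerprints (optimized form)
        self.raw = {}     # file name -> bytes (non-optimized form)
        self.raw_threshold_bytes = raw_threshold_bytes  # hypothetical policy knob

    def _optimize(self, data):
        """Split data into blocks and keep one copy of each distinct block."""
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fp, block)  # store only if not seen before
            refs.append(fp)
        return refs

    def write_inline(self, name, data):
        """Inline: optimize before anything is stored (work is on the write path)."""
        self.raw.pop(name, None)
        self.files[name] = self._optimize(data)

    def write_post_process(self, name, data):
        """Post-processing: store raw data first, optimize later by policy."""
        self.files.pop(name, None)
        self.raw[name] = data
        if self.raw_bytes() >= self.raw_threshold_bytes:
            self.run_optimization_pass()

    def run_optimization_pass(self):
        """Second pass: convert all raw files to optimized form."""
        for name in list(self.raw):
            self.files[name] = self._optimize(self.raw.pop(name))

    def raw_bytes(self):
        return sum(len(d) for d in self.raw.values())

    def stored_bytes(self):
        """Capacity consumed; this toy never garbage-collects unused blocks."""
        return self.raw_bytes() + sum(len(b) for b in self.blocks.values())

    def read(self, name):
        if name in self.raw:
            return self.raw[name]
        return b"".join(self.blocks[fp] for fp in self.files[name])
```

In this model, `write_inline` never lets non-optimized data consume capacity but does the fingerprinting work on every write, while `write_post_process` returns after a plain copy and needs enough free capacity to hold raw data until the threshold triggers a pass. That is the same trade-off the two groups of vendors weigh differently.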
Underneath the covers of each approach, vendors offer different methods to actually perform the capacity optimization. Approaches that examine the data at a lower level of granularity (e.g., sub-file level instead of file level) tend to offer higher capacity-optimization ratios, as do approaches that can apply variable-length windows (when doing data comparisons) instead of just fixed-length windows. Format-aware approaches, such as the tape-format-aware algorithms offered by FalconStor, can offer some additional data reduction relative to non-format-aware approaches. Discussion of how these algorithms work is beyond the scope of this article.

One final comment on general SCO approaches concerns scalability. When looking for redundant data, capacity optimization basically breaks data down into smaller components and looks to eliminate redundancy at the building-block level. A data repository is maintained of all the building blocks that the solution has "seen" before, and when it finds another instance of one of these building blocks, it inserts a reference to the instance of that building block that it is retaining in the repository and removes the duplicate object. Solutions that retain the latest instance of a given building block, as opposed to the original instance (which tends to become fragmented over time), tend to offer better read performance.

The larger the data repository, the greater the chance that any given building block already resides in it. Regardless of how large any single repository can be, solutions that can cluster repositories together to build a very large logical repository tend to offer better scalability. For many capacity-optimization vendors, the ability to offer clustering forms the basis of their claim to be an enterprise solution.

PSO ARCHITECTURES

And Ocarina Networks entered the PSO market with an announcement at this month's Storage Networking World conference.
Interestingly, each of these vendors uses a different architecture. Note that these different approaches characterize what happens to writes to storage. All PSO solutions handle reads of capacity-optimized data at wire speeds.

Inline approaches

Post-processing approaches

NetApp believes that, over the next several years, capacity optimization technologies will migrate into the infrastructure layer and will be available as part of server, storage, and/or operating system platforms. NetApp's De-Duplication for FAS was recently bundled into its Data ONTAP software at no additional charge, adding to the overall value of NetApp's storage platforms. Because it resides in a storage server that can support either file- or block-based storage, it can be used against either or both to achieve data reduction ratios of up to 6:1 against primary storage and up to 20:1 against secondary storage.

In NetApp's case, there are two advantages to implementing capacity optimization as an operating system utility. First, it leverages a close integration with NetApp's WAFL file system to incur extremely low overhead. For reliability purposes, WAFL already calculates a unique checksum, called a fingerprint, associated with each block of data. To de-duplicate data, NetApp uses these fingerprints (which are calculated anyway) to search for and identify duplicate blocks as a separate batch process, adding no overhead to ordinary file operations.

Second, it can be easily integrated with other Data ONTAP features to provide higher-level solutions. For example, integration with NetApp's Thin Provisioning supports an "autosize" feature that will run capacity optimization algorithms as needed to keep a given volume under a size defined by the administrator.
Integration with SnapVault and SnapMirror allows capacity optimization to be leveraged by policy to help minimize storage requirements for snapshots or to minimize the amount of data sent across the network for disaster-recovery purposes.

Targeted for use with IP storage, Ocarina Networks is the only vendor to date offering format-aware optimization against primary storage. Although some SSO vendors have claimed that application-specific approaches to capacity optimization are not effective, this tends to be relevant only for inline capacity optimization. Format-aware capacity optimization takes more CPU cycles and does take slightly longer, but does not present performance issues when used by out-of-band approaches. At least one SSO vendor, FalconStor, is also using a post-processing, format-aware approach. (FalconStor uses a tape-format-aware method in its SIR product.)

Ocarina Networks' format-aware optimization recognizes file types and their contents, de-layers complex compound document types, can optimize already-compressed formats, and can perform de-duplication at the object level. Ocarina's out-of-band appliance, called the Optimizer, selects files by policy after they have been written to storage, identifies each file by type, and then routes it to the appropriate format-aware optimizer. Ocarina offers format-aware optimizers for a number of common primary storage environments, including Microsoft Office workloads that contain home directories, PDFs, and digital photo sets (GIF, JPEG), and Internet e-mail mixes that include blogs, e-mails, and text messages. Ocarina claims to achieve data reduction ratios approximately 3x better than those achieved with enhanced Lempel-Ziv algorithms, but its products are too new to have end users in production who can support those claims.
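The identify-then-route step described above can be sketched in a few lines. This is a hypothetical illustration only and says nothing about how Ocarina's Optimizer is actually built: the function names, the `OPTIMIZERS` table, and the placeholder handlers are assumptions. Only the file signatures (magic bytes) for JPEG, GIF, PDF, and ZIP-based containers are standard facts, and zlib (a Lempel-Ziv-based compressor) serves as the generic fallback.

```python
import zlib

# Well-known leading byte signatures for a few common formats.
MAGIC = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),  # also used by newer office document formats
]


def identify(data):
    """Return a format label based on the leading bytes, or 'unknown'."""
    for prefix, kind in MAGIC:
        if data.startswith(prefix):
            return kind
    return "unknown"


def generic_optimizer(data):
    """Fallback: plain Lempel-Ziv-style compression via zlib."""
    return zlib.compress(data, 9)


def placeholder_optimizer(data):
    """Stand-in for a format-aware optimizer; a real one would understand
    the format's internal structure. Here it returns the data unchanged."""
    return data


# Hypothetical routing table: format label -> optimizer function.
OPTIMIZERS = {
    "jpeg": placeholder_optimizer,
    "gif": placeholder_optimizer,
    "pdf": placeholder_optimizer,
    "zip-container": placeholder_optimizer,
}


def route(data):
    """Identify the format, then hand the data to the matching optimizer."""
    kind = identify(data)
    optimizer = OPTIMIZERS.get(kind, generic_optimizer)
    return kind, optimizer(data)
```

The design point is the dispatch itself: already-compressed formats such as JPEG gain little from a second generic compression pass, so routing them to a handler that understands the format is where any additional reduction would have to come from.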
Hybrid approaches

All of these vendors offer or will be offering a predictive tool that can be deployed in an hour or two to estimate the data reduction ratios their PSO solutions can achieve in a particular environment. Since achievable data reduction ratios are very sensitive to the characteristics of different data types, use of such a tool prior to purchase is highly recommended. Storwize and Ocarina Networks offer such tools today, and Hifn plans to offer one later this year. NetApp does not need a separate tool for this, since its capacity optimization functionality ships at no additional charge with its operating system, and it can be enabled at the volume level to test it against snapshots of primary production data without impacting ordinary file operations at any time.

BENEFITS OF PSO

Because it lowers the overall capacity of primary storage, PSO offers enterprises of all sizes other benefits as well:
Note that PSO can be a complementary technology to SSO. Solutions that use different capacity optimization methods for primary and secondary storage can actually offer additive data reduction advantages. Data reduction ratios with combined use will vary based on the actual solutions used and the workload types. The only way to really understand the benefit that PSO, or a combination of PSO and SSO, will provide is to test it on specific workloads.

CHALLENGES WITH PSO

Although it is not an unfamiliar challenge, another concern with PSO is how to retrieve capacity-optimized data in the event of a problem with the PSO solution. For hardware-based solutions, simple redundancy at the appliance or card level can be sufficient to handle any single points of failure. Note that unlike with encryption, there is nothing random about how data is capacity-optimized. The same capacity-optimization methods are predictably used across all models of a certain type in a vendor's product line, so any other similar model could be used to retrieve the data.

A final concern is one shared by both PSO and SSO. Because the technologies basically refer to a single instance of an object that appears multiple times, if for some reason that object gets corrupted, the damage can potentially be much greater than just losing that object in non-capacity-optimized storage. Depending on the data reduction ratio achieved, loss of a single object could potentially affect thousands of instances of that object in each of the files, file systems, or databases where it also appears.

The first line of defense against this is that most customers are already using some form of RAID, providing redundancy against single points of failure at the hardware level. Over and above the hardware RAID approach, vendors offer two additional and optional methods to address this issue: integrated metadata and multiple spindling.
With the integrated metadata approach, a PSO solution effectively implements a virtualization layer that handles the abstraction of a single physical copy of an object to however many redundant copies exist. Each time the virtualization layer creates a reference to an object, metadata is saved along with that object that effectively enables its re-creation in the event the primary instance of it becomes corrupted. In a manner similar to how chkdsk can be used to rebuild an NTFS file system block by block, this metadata can be used to re-create any object (albeit at a relatively slow rate).

The multiple spindling approach ensures no single data element exists only on one spindle in a given logical volume. Think of this as "mirroring" at the object level, ensuring any single data element is always available on at least two spindles. Both of these approaches, while offering improved data reliability, do lower the overall data reduction ratio slightly.

RECOMMENDATIONS

As with any emerging technology, there are questions of performance and reliability. Demand solutions that impose no perceivable performance impacts. Use high-availability approaches such as clustering PSO appliances to help address reliability issues. While there are many reference customers, enterprises have been cautious in their deployment of PSO to date.

Decide up-front whether inline or post-processing approaches are best suited to your environment and requirements. Check references and use predictor tools if possible. If a PSO solution cannot offer data reduction ratios of at least 3:1 against your particular workloads, it may not pay to implement it over basic compression technologies that are proven and widely available. Expect to achieve realistic data reduction ratios across varied workloads in the 5:1 range with PSO technology.

Eric Burgener is a senior analyst and consultant with the Taneja Group research and consulting firm.