Data reduction for primary storage: Benefits and options

By George Crump, Storage Switzerland

While usually associated with backups, data reduction for primary storage has been around for quite some time. Operating systems and add-on utilities have been able to compress data on primary storage, either in real time or in the background, since the mid-1990s. But as drive capacity became steadily cheaper, primary storage data reduction became largely irrelevant. In the past few years, however, interest in space-saving technologies has renewed, and many organizations are again considering data reduction technologies for primary data.

What has caused this resurgence and what are some of the options for data center managers to consider?

The resurgence of primary storage data reduction is being driven largely by two conditions: an increase in file retention and a decrease in the availability of "cheap" storage.

While the capacity demands from users continue unabated, the limits of storage are being reached. More files are created, those files are bigger, and the retention of those files is either legally or emotionally required. Users also know that up until now access to capacity has been relatively cheap. In the last year or so many data centers have crossed the line where it's no longer less expensive to add more capacity. Along with the cost to manage multiple storage systems, the impacts of storage on power, cooling and floor space are now measurable variables that have added to the total cost of storing all this data. This has been further compounded by a tight economy and the close scrutiny of storage expenses, forcing storage managers to make better use of their capacity resources.

Data reduction technologies are high on the list of tools to help them achieve that goal.

If IT managers are being honest with themselves, data reduction of primary storage is really treating the symptom, not the problem. Ideally, a large percentage of the data on primary storage should either be deleted or moved off to a secondary storage device. Reality, however, seldom agrees with the idyllic vision of what the data center should look like. Storage is often set up as a service to the users and as a result, they don't want their data moved and certainly they don't want it deleted. Any actions such as these have to be transparent to them. The path of least resistance, then, is to implement a technology that squeezes more capacity out of the same storage space, without moving users' data around. Acceptable data reduction techniques should therefore mean limited change, with most of the optimization being seamless to users.

Optimization techniques

Two optimization techniques are at the top of the list when considering data reduction for primary storage: data compression and data deduplication.

While data deduplication captures much of the headlines, compression may have greater value in the context of primary storage. Deduplication requires redundant data to be effective, which is why backup produces such a good return on the deduplication investment: all those weekly full backups are nearly identical. Primary storage is not, or at least should not be, nearly as redundant as backup data, with the possible exception of virtual machine (VM) images. As a result, deduplication's efficiency on primary data is typically one-third of what it is on backup data.
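The dependence of deduplication on redundancy can be illustrated with a minimal sketch. It assumes fixed-size 4 KB blocks identified by SHA-256 hashes, which is only one of several possible approaches and is not drawn from any particular product. The function name `dedup_ratio` and the two sample data sets are hypothetical: four identical copies of a random blob stand in for weekly full backups, and an unrelated random blob stands in for primary data.

```python
import hashlib
import os


def dedup_ratio(data: bytes, block_size: int = 4096) -> float:
    """Return logical size divided by the size of the unique blocks kept."""
    unique = {}
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # Blocks with the same hash are stored only once.
        unique.setdefault(hashlib.sha256(block).digest(), len(block))
    stored = sum(unique.values())
    return len(data) / stored if stored else 1.0


# Hypothetical workloads: four identical "weekly full backups" versus
# the same volume of unrelated data standing in for primary storage.
weekly_full = os.urandom(64 * 4096)
backup_set = weekly_full * 4
primary_set = os.urandom(4 * 64 * 4096)

backup_ratio = dedup_ratio(backup_set)
primary_ratio = dedup_ratio(primary_set)
print(f"backup-like data:  {backup_ratio:.1f}:1")
print(f"primary-like data: {primary_ratio:.1f}:1")
```

Real primary data falls somewhere between these two extremes, so the sketch shows only the direction of the effect, not the magnitude any given environment would see.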

Data compression, on the other hand, works with almost all data types. While it's not as effective on highly redundant data as deduplication is, most data on primary storage can be compressed. Data compression can also be fine tuned for data types. At the expense of processing resources, special compressors can be leveraged to reduce particular data sets.

Ideally, both data compression and deduplication should be used together to provide maximum reclamation of primary storage capacity.

Where can data reduction occur?

Where the capacity optimization takes place, and what handles the optimization, is another consideration. Today, data reduction is most commonly found on systems that serve files, such as NAS or file servers. The capability is typically either supplied by the provider of the storage hardware or file system, or delivered by a third party as an add-on. In the NAS hardware case, it's usually the file system itself that handles the data reduction tasks. There are also vendors that offer a stand-alone file system or NAS software that can be installed on existing hardware to provide data reduction.

Obviously, the file system approach only works for the data center if the NAS or file system currently in use has this capability. It also means that only that vendor's NAS storage hardware devices are supported. If there is a mix of vendors in the environment, or if the vendor does not currently provide data optimization services in their system, then the user needs to look to third-party ISVs to provide the capability. This sometimes can bring other benefits, such as greater flexibility, a more universal approach to optimization, the ability to move data between different vendors' platforms, as well as the advantage of specialization that these vendors often enjoy. From a product development perspective, they can focus on just data reduction and do not have to maintain a whole file system.

Thus far there has not been much optimization activity on block-based systems. While a file system loaded on a traditional LUN may provide the capability, most storage array hardware does not yet do this. However, this may become an option in the near future. As vendors begin to roll out their automated tiering strategies, which will move blocks of data between tiers of storage, it's not a far leap to assume that they may be able to optimize that data as well.

Active and near-active data

Depending on the research study, as much as 85% of the data on primary storage is no longer being frequently accessed. This has been the case for many years now and has created technology initiatives such as hierarchical storage management (HSM), data archiving and the now infamous information lifecycle management (ILM).

While all of these initiatives have merit and should be explored, the reality is that many data centers need a quick capacity fix now and don't have the time or staff resources to implement a complete data management strategy. As a result, primary storage in the real world typically holds the entire range of data classes: extremely active, near active and, unfortunately, inactive (old) data. The good news is that all of this data can be optimized.

Admitting that primary storage holds this variety of data classes is critical because each optimization strategy has its own unique impact on the storage eco-system. One of the early decisions to make is when data should be optimized. Should it be optimized in real-time as it's being accessed, or should it be optimized after it has become less-frequently accessed?

There are a few solutions that provide real-time data compression and are positioned inline between the storage and its access point. In most cases these systems do not negatively impact performance, because standard, non-content-aware compression is relatively efficient. Also, the hard work of compression is frequently off-loaded to a stand-alone appliance. The data traveling to and from the storage system is already reduced, which lightens the load on that system.

There are even a few real-time deduplication solutions where data is compared to other data as it's being stored. While there is some performance impact in these systems, depending on the workload, it may not matter. Real-time deduplication for primary storage has not yet reached broad acceptance and should be applied carefully. In either case the storage manager has to be prepared to address concerns on how real-time optimization will affect storage performance.

The more common method of implementing data optimization is to optimize the data after it's become stagnant for a period of time. Even if that timeframe is only a few days of inactivity, the chances of that data being accessed again are typically low.

Optimizing data as part of a background process allows very active data to stay in its native form and removes the dispute over data optimization affecting storage performance on very active files or databases. During maintenance times the unoptimized data on the file systems can be examined to see if it now qualifies to be optimized. If it does, it can then be compressed and/or deduplicated. If not (meaning it's still within the active range), it can continue to be stored in its native form.
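The qualification check described above can be sketched as a simple scan. This is an illustration under stated assumptions, not how any specific product works: the three-day threshold `INACTIVE_DAYS` and the function name `optimization_candidates` are hypothetical, and the sketch relies on file access times, which some file systems update lazily or not at all.

```python
import os
import tempfile
import time
from pathlib import Path

INACTIVE_DAYS = 3  # hypothetical policy threshold


def optimization_candidates(root, inactive_days=INACTIVE_DAYS, now=None):
    """Yield files under root whose last access and last modification
    are both older than the inactivity threshold."""
    now = time.time() if now is None else now
    cutoff = now - inactive_days * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = Path(dirpath) / name
            info = path.stat()
            # Assumption: atime is maintained by the file system.
            if max(info.st_atime, info.st_mtime) < cutoff:
                yield path


# Usage with throwaway files: one aged a week, one freshly written.
with tempfile.TemporaryDirectory() as root:
    stale = Path(root) / "old_report.doc"
    fresh = Path(root) / "active.db"
    stale.write_bytes(b"x")
    fresh.write_bytes(b"y")
    week_ago = time.time() - 7 * 86400
    os.utime(stale, (week_ago, week_ago))
    selected = [p.name for p in optimization_candidates(root)]

print(selected)
```

Files that fail the check are simply left alone until the next maintenance window, which is what keeps very active data in its native form.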

Some systems have the ability to differentiate between accessed data and modified data. They can then deliver data that is only being read directly from its optimized form. In most cases data reduction has very little performance impact when data is simply being accessed or read; the heavier workload comes when it first needs to be optimized. Optimization, when done as a secondary process and not performed on all data, still covers most of the data and removes most of the concerns about performance impact.

Data aware

Some data reduction systems, especially those that handle data reduction as a secondary process, also can be instructed to take additional time to understand the type of data they're optimizing. As mentioned earlier, compression in particular can be fine tuned to the data type. Special compression algorithms are available for a variety of data types that don't respond well to standard compression engines. Good examples include audio, video and image files.
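The gap between data that suits a standard compression engine and data that resists it can be shown with a short sketch using Python's built-in `zlib` module. The samples are hypothetical: repeated text stands in for office documents, and random bytes stand in for media files that are already compressed. The function name `compression_ratio` is likewise just for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original size divided by zlib-compressed size."""
    return len(data) / len(zlib.compress(data, level))


# Hypothetical samples: repetitive text versus random bytes, the latter
# standing in for already-compressed audio, video or image files.
text_like = b"quarterly report: revenue, cost, margin\n" * 2000
media_like = os.urandom(len(text_like))

text_ratio = compression_ratio(text_like)
media_ratio = compression_ratio(media_like)
print(f"text-like data:  {text_ratio:.1f}:1")
print(f"media-like data: {media_ratio:.2f}:1")
```

A ratio near 1:1 on the media-like sample is the kind of result that motivates a content-aware compressor for that data type.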

Images are particularly difficult data types to reduce and their storage demands now impact more than just photo sharing web sites. Most organizations store images of documents, photos of employees, job sites, etc. Given the luxury of extra time, some optimization solutions can even go a step further on visual data types and actually reduce the size of the image file. This is typically called a "lossy" data reduction technique, because some of the image quality is lost. As is the case when reducing the resolution of a photo, the lower the resolution, the less space on disk it will consume. While this sounds less than desirable, some of these systems also have the ability to make an image visually lossless, meaning that to the naked eye the image looks the same as it did before compression. As image repositories grow for all types of businesses, these methods will become increasingly important.

The archive alternative

Any discussion about data reduction would not be complete without covering the subject of data management. The downside to reducing the footprint of primary storage is that while there are the same physical components to manage, the amount of data on a typical system continues to increase. A case could be made that optimization makes the situation worse because the problem is no longer "seen" by management in the form of more physical capacity or equipment. Additionally, the gains from primary storage data reduction often end at the primary storage tier. When that data is moved to other tiers of storage or into the data protection process, it often needs to be "re-inflated" to its original size, and then "re-optimized" when it gets to the secondary storage location. While data reduction vendors are moving to address this problem, it's still present today.

The answer is to use data reduction not as the "be all and end all," but as one component in an overall plan, a plan which should include a data archive. The goal of archiving is to move that data out of the primary storage path, which also removes it from the data protection process, but still keeps it readily accessible when needed.

Archive systems have data reduction technologies similar to those of primary storage, but they also add the ability to leverage denser and less expensive drives, and potentially to power those drives down. The end result is to delay the purchase of additional storage capacity even further than would be possible with data reduction alone.

Data reduction for primary storage delivers a hard ROI. After implementation there is, in most cases, at least 50% additional storage capacity. As long as there was an original intention of buying additional storage anyway, these solutions should pay for themselves very quickly. This is an excellent way to kick off a more far-reaching data management strategy.

George Crump is president and founder of Storage Switzerland, an analyst firm focused on storage, virtualization and cloud computing.

This article was originally published on April 28, 2010