Storage administrators can optimize backup assets such as tape libraries, servers, and media to deliver substantial savings in both cost and manageability.
By Stephen Foskett
By paying attention to key metrics such as the ratio of data on tape to data on disk, administrators can quickly identify inefficiencies in backup. Often, a simple change in image retention or backup schedules can deliver tremendous savings over time and can allow hardware purchases to be deferred.
This article presents two key metrics to monitor and gives recommendations for addressing overflowing backup systems. It also illustrates these points with a case study in which a financial institution was able to defer library and media purchases and dramatically reduce off-site tape storage requirements.
During the roaring Internet economy, many IT departments were given carte blanche to purchase whatever hardware was needed to keep infrastructure problems out of the way of business demands.
Today, this attitude is gone: IT managers increasingly must make do with existing equipment, and optimization is demanded. While this crunch affects all areas of IT infrastructure, the ugly challenge of data backup and recovery often gets the least attention.
The irony of backup's second-class status is that it is one area where efficiency can often be improved quite easily, reaping immediate cost savings. Many IT departments struggle to feed new tapes into the backup machine, only to send them off-site to be forgotten. Backups fail often enough, but even when they complete successfully, they may be failing to protect data appropriately.
Most backup administrators try to cover for backup lapses by over-protecting their environments. Daily full backups, duplicate tape copies, and separate table-space, export, and whole-system backups may appear to be a prudent approach to data protection; however, they just lead to escalating costs and oversized hardware requirements. A truly conservative approach balances data protection with cost, uses data classification to dictate protection, and avoids wasteful complexity.
Three demands of data recovery
At a fundamental level, data on disk must be protected appropriately to meet expectations in the areas of file restore, system recovery, and content archiving (see sidebar, "The three-headed monster of data protection," above). File recovery generally has just short-term requirements, and these can usually be met with automated tape rotation using a robotic library. This pool of tapes can be a closed system, requiring new media purchases only when data is added or tapes fail. Automated tape management requires a library sized to hold a set of tapes, normally the entire pool. Relatively new options for file recovery include snapshot technology and disk-to-disk backup.
System image retention and data archives demand off-site storage, and their open-ended nature often requires additional media purchases. A key problem arises when an attempt is made to leverage the rotating short-term pool for longer-term uses. While it would seem efficient to simply ship the rotating pool of data off-site, doing this introduces more problems than it solves.
System recovery in the event of a disaster requires a complete system image, including operating system files, applications, and data. While file-recovery requirements generally call for just a subset of data files to be protected, adding the requirements of system recovery to the short-term pool can easily double the volume of data to be backed up. Since disaster recovery also calls for off-site storage of images, while timely file recovery demands online access, the location of backup tapes also becomes a critical issue.
Archiving makes the situation even worse. Marking a set of system backup tapes for permanent storage and sending them off-site just creates an ever-growing pile of media. The likelihood of actually recovering useful data from a five-year-old tape set is slim; decade-old tapes are virtually useless. This need not be because the media itself has degenerated. It can also be because the format of data on it is no longer relevant.
If each of these three tasks (file, system, and archive protection) is approached separately, significant improvements in resource utilization can be realized. Data classification can allow a subset of data on disk to be backed up daily for file recovery, a separate save group (or even application) can back up whole systems for disaster recovery, and a periodic backup of selected data can be sent off-site as an archive. This approach minimizes the amount of data in a library, the number of tapes purchased, and the number sent for off-site storage, keeping expenses down. It also improves the effectiveness of the backup process, ensuring proper recovery.
Backup software makes it easy to tell if waste is a problem, because it catalogs the files on connected clients and sums up the space used by them. In this way, a backup system can be used as a basic storage resource management (SRM) application.
Data on tape vs. data on disk
The total data on disk to be protected can be determined from the backup application. This can then be compared to the amount of data stored on tape to determine (at a very basic level) the ratio of data stored on tape to that on disk. This is a very useful metric since it shows the overall efficiency of the chosen retention policies at a glance.
The actual value of this metric varies, but the conclusion drawn from it should be obvious. Whether the multiple of tape to disk data is 4:1 or 40:1, it will be the foundation of an assessment of the efficiency of the overall backup process. As a rule of thumb, traditional backup systems and policies will create between five and 15 copies of data on tape for each copy on disk, although exceptions exist.
For example, a site that has 500GB of data to be backed up, plus another 500GB of application and operating system files, has a total of 1TB of data to back up. Applying the logic from above, whole-system images could be taken once a week and sent off-site, with retention of two weeks, yielding 2TB of system backup tapes. A weekly full/incremental schedule for the data could be implemented with another 600GB per week, or 2.4TB for a four-week retention scheme. A small amount would be archived and sent off-site permanently, probably less than 100GB per quarter. This example yields about 4.5TB of data on tape, or a ratio of 4.5:1.
Clearly this is an example of a highly tuned environment, but it would also be highly effective at protecting data. If the media actually indexed by the backup software in this environment showed a ratio of 20:1, there would be an obvious area for improvement. Perhaps it is time to revise retention periods, reclassify data, and use some other mechanism to provide for whole-system disaster recovery.
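The arithmetic behind the 4.5:1 figure can be reproduced in a few lines. This is a minimal sketch, not output from any backup product: the function name `tape_to_disk_ratio` is hypothetical, sizes are in gigabytes with 1TB treated as 1,000GB, and the archive term counts a single quarter at the 100GB upper bound mentioned in the example.

```python
# Worked version of the 1TB example; all sizes in GB (1 TB = 1000 GB here).
DATA_GB = 500      # business data on disk
SYSTEM_GB = 500    # application and operating system files


def tape_to_disk_ratio(tape_gb: float, disk_gb: float) -> float:
    """Return how many GB are held on tape for each GB on disk."""
    if disk_gb <= 0:
        raise ValueError("disk_gb must be positive")
    return tape_gb / disk_gb


disk_gb = DATA_GB + SYSTEM_GB        # 1000 GB to protect
system_images_gb = disk_gb * 2       # weekly whole-system image, two-week retention
file_backups_gb = 600 * 4            # 600 GB per week, four-week retention
archive_gb = 100                     # assumption: one quarter of archives, upper bound

tape_gb = system_images_gb + file_backups_gb + archive_gb
ratio = tape_to_disk_ratio(tape_gb, disk_gb)
print(f"{tape_gb / 1000:.1f} TB on tape, ratio {ratio:.1f}:1")
```

Substituting the totals reported by a real backup catalog for `tape_gb` and `disk_gb` gives the measured ratio to compare against this tuned baseline.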
Another smoking gun is the ratio of the number of tapes kept off-site to the number of tapes calculated above. If our theoretical environment needed 4.5TB of tape storage to meet its data-recovery requirements, yet had 400 50GB DLT tapes (20TB of capacity) stored off-site, a disconnect is immediately apparent.
Unlike media costs, which are a one-time expense, off-site storage of tapes requires a continual flow of money. Storage companies usually charge by the visit, box, and month for tape storage, and these costs can add up. Rotating all tapes off-site, or retaining a huge volume of tapes for archival purposes, also requires excessive media purchases.
This equation helps calculate how much data should be stored off-site. Dividing the total by the average media size yields the approximate number of tapes.
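One way to sketch such a calculation is shown below. It assumes the off-site total is the sum of retained whole-system images plus accumulated archives, which follows the earlier 1TB example rather than any formula given here. The function name `offsite_tape_estimate` and the choice of four retained quarters of archives are illustrative assumptions.

```python
import math


def offsite_tape_estimate(system_image_gb, images_retained,
                          archive_gb_per_quarter, quarters_retained,
                          tape_capacity_gb):
    """Estimate off-site data (GB) and the tapes needed to hold it.

    Assumes off-site data = retained system images + accumulated archives.
    """
    offsite_gb = (system_image_gb * images_retained
                  + archive_gb_per_quarter * quarters_retained)
    tapes = math.ceil(offsite_gb / tape_capacity_gb)  # round up to whole tapes
    return offsite_gb, tapes


# Figures from the earlier example: 1 TB system image kept for two weeks,
# about 100 GB archived per quarter, 50 GB DLT media.
# Assumption: four quarters of archives have accumulated so far.
offsite_gb, tapes_needed = offsite_tape_estimate(1000, 2, 100, 4, 50)
tapes_held = 400  # tapes actually stored off-site in the example

print(f"Expected off-site: {offsite_gb} GB on about {tapes_needed} tapes")
print(f"Actually held: {tapes_held} tapes ({tapes_held * 50} GB of capacity)")
```

The estimate ignores compression and partially filled tapes, so it should be read as an order-of-magnitude check against the vault inventory, not an exact count.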
This is compounded by poor tracking of off-site media. Some companies have lost track of which tapes are off-site and just pay to retain them in bulk. If a tape is unknown, what value does it bring to the business? In this situation, all unknown tapes should be recalled immediately and re-cataloged. This project will probably reap immediate financial dividends since many of those useless tapes can be used for backups, deferring further media purchases.
Reducing the volume of data to be backed up brings many immediate benefits:
- Often, existing hardware and media can be made to handle future growth without new purchases;
- In-library retention periods can be extended, allowing restore requests to be granted more quickly and with less effort; and
- Less data to be backed up means less data traveling through the system and shorter nightly backup windows.
It is also critical to watch for unexpected spikes in media costs and library space due to policy changes.
Simply deciding to archive data monthly instead of yearly could cost thousands of dollars per year for new media and off-site storage.
Recently, a large financial institution asked us to assess their backup environment. They had run out of tape slots in their library and were looking for justification to buy a new one. A look at their backup configuration quickly revealed a different story entirely.
The company used less than 4TB of storage but had 51TB of data under management by Tivoli Storage Manager (TSM), a ratio of 13:1. Additionally, the tapes required by this backup volume filled 91% of their library. Clearly, there was a problem with data-retention policies. By adjusting the lists of files to be backed up and the retention periods for data, the company could reduce its backup volume by more than 50%. This meant that the new library was not needed after all.
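The case-study figures can be checked with the same ratio. In this rough sketch, the 4TB value is the upper bound given above, so the true ratio is at least the one computed. The projected library fill assumes that slot usage scales linearly with backup volume, which is a simplification and not a figure from the assessment.

```python
# Case-study figures; 4 TB is an upper bound on primary storage in use.
disk_tb = 4.0
managed_tb = 51.0
library_fill_pct = 91.0
reduction = 0.50  # "more than 50%" reduction, taken at its lower bound

ratio_before = managed_tb / disk_tb              # about 13:1
managed_after_tb = managed_tb * (1 - reduction)  # backup volume after tuning
ratio_after = managed_after_tb / disk_tb

# Assumption: library slot usage scales linearly with backup volume.
fill_after_pct = library_fill_pct * (1 - reduction)

print(f"Before: {ratio_before:.2f}:1, library {library_fill_pct:.0f}% full")
print(f"After:  {ratio_after:.2f}:1, library about {fill_after_pct:.1f}% full")
```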
An analysis of the company's archiving practices produced further savings. The company had been sending out quarterly full copies of its entire backup pool, requiring the purchase of more than 1,500 tapes per year. By changing to an incremental approach for archiving, the company could reduce this to just 80 tapes per year, after an initial full backup. At $75 per tape, this produced an immediate savings of more than $100,000.
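The savings figure follows directly from the tape counts. A minimal sketch using only the numbers quoted above, with 1,500 tapes taken as the lower bound of "more than 1,500"; it leaves out the one-time initial full backup and any off-site storage fees.

```python
# Annual media cost before and after switching to incremental archives.
TAPE_COST_USD = 75
tapes_per_year_full = 1500        # quarterly full copies (lower bound)
tapes_per_year_incremental = 80   # incremental archives after an initial full

tapes_saved = tapes_per_year_full - tapes_per_year_incremental
annual_savings_usd = tapes_saved * TAPE_COST_USD

print(f"{tapes_saved} fewer tapes per year, saving ${annual_savings_usd:,}")
```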
Changing the content of the archives can also produce substantial savings. Rather than sending off all system data as is, archives should be focused on important data in a useful format. A few SQL exports would be much smaller and more useful than raw backups of the entire data warehouse.
Another side effect of our analysis was in the area of storage utilization. Backup system logs can be a valuable way to assess the amount of usable storage that contains actual data. In this case, storage utilization averaged 37% across the environment, which is a bit higher than normal. However, some individual systems were using less than 10% of their storage, indicating other potential areas of savings.
Although often ignored, improvements in backup systems can yield substantial cost savings with existing equipment. Often, overwhelming backup volume rather than undersized backup hardware is the real problem. Once an environment is assessed for the potential to trim backup volume, library contents, and off-site media storage, significant savings can be realized in deferred purchases of hardware and media, as well as reduced overall storage costs.
Stephen Foskett is a senior consultant at GlassHouse Technologies (www.glasshousetech.com) in Framingham, MA.