Q: My backups are slow and unreliable. I've heard that putting in disk, in the form of a VTL, will solve my problems. Are there any gotchas to watch out for?
By Noemi Greyzdorf
A: A virtual tape library (VTL) is one way to integrate disk into a backup system. Backing up to disk has become a more viable option in the past few years as the cost of disk arrays has come down rapidly (thanks in part to Serial ATA), and many IT organizations have become convinced that implementing disk (typically as a VTL) will eliminate their backup-and-restore woes and save the day. However, this can be a dangerous assumption. Backup-and-restore systems are very complex, and bottlenecks that affect performance can exist anywhere along the data path, from the client server to the storage media used as a target. Before deciding on a VTL or any other disk-based backup solution as the answer to all your prayers, you must take the following three steps:
First, define business requirements (including recovery-time objective, or RTO, and recovery-point objective, or RPO) for applications, data, and systems. This step takes the most time and is the most difficult.
Second, once the requirements have been defined, assess your environment against them. This includes identifying areas that are over- or under-utilized, areas where performance is not optimal, and configurations and policies that don't follow the manufacturers' recommended practices.
Finally, allocate existing resources where they can make the greatest impact, and consider technologies that will have to be added to reach the objectives defined in the business requirements.
Although these three steps seem simple, it is important to pay close attention to each one. If any of them is skipped or done out of order, the overall project may be compromised.
Define business requirements
The terms RTO and RPO are frequently used in industry publications, white papers, and marketing collateral. Both are relatively simple to understand, so the real question is: "How do you implement technology/products to meet your RTO and RPO requirements?"
Basically, RTO relates to the question, "How quickly does something need to be recovered?" The word something is used on purpose. Consider a situation in which data corruption has occurred. How quickly do you need to get that data restored? The answer depends on the type of data and its value to the business. Categorizing applications and data sets to identify what needs to be available in what time and at what level will help in evaluating the current data-protection system and designing a new one.
The term RPO relates to how much data you are willing to lose. Of course, no one is willing to lose any data, yet backups typically occur once per day, so if a database were corrupted, the recovery point would be last night's backup. A more relevant question is: "If data were lost, what would it cost to recover the data created since last night's backup?" Not all data sets require little-to-zero data loss. Again, it is important to categorize data sets and establish criticality.
Once data and applications have been categorized by business criticality, the next step is to define specific criteria for recovery. Too often we focus on faster backups and rarely on restores. Why do a backup if you don't care about the restore?
Consider the following example. An organization has a mail server, a file and print server, a database server (maybe many), and a homegrown application. The requirement may be for the file and print server to restore at the file level within 24 hours. For databases, the data needs to be available within an hour. For the mail server, there is a need to restore individual messages and calendar items within six hours. Finally, for the homegrown application, data loss is unacceptable. The more details you have on these RTO and RPO requirements, the easier it will be to evaluate and design a new solution.
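One way to keep such requirements in a form that can be checked later is a small table of records. The following Python sketch is only an illustration: the class name, field names, and helper functions are assumptions made for this example, and a value of None marks a requirement the example above does not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RecoveryRequirement:
    """One row of a requirements table (hypothetical structure)."""
    system: str
    granularity: str              # what must be restorable
    rto_hours: Optional[float]    # maximum time to restore; None = not stated
    rpo_hours: Optional[float]    # maximum tolerable data loss; None = not stated


# Values taken from the example in the text; unstated ones are left as None.
REQUIREMENTS = [
    RecoveryRequirement("file and print server", "file", 24, None),
    RecoveryRequirement("database server", "database", 1, None),
    RecoveryRequirement("mail server", "message or calendar item", 6, None),
    RecoveryRequirement("homegrown application", "application data", None, 0),
]


def unmet_by_backup_interval(reqs: List[RecoveryRequirement],
                             backup_interval_hours: float = 24.0) -> List[str]:
    """Systems whose stated RPO is tighter than the interval between backups."""
    return [r.system for r in reqs
            if r.rpo_hours is not None and r.rpo_hours < backup_interval_hours]


def by_urgency(reqs: List[RecoveryRequirement]) -> List[RecoveryRequirement]:
    """Sort by RTO, shortest first; systems with no stated RTO go last."""
    return sorted(reqs, key=lambda r: (r.rto_hours is None, r.rto_hours or 0))


if __name__ == "__main__":
    print("Not covered by a nightly backup:", unmet_by_backup_interval(REQUIREMENTS))
    for req in by_urgency(REQUIREMENTS):
        print(req.system, "- RTO (hours):", req.rto_hours)
```

Even a table this small makes the gaps visible: a once-a-day backup cannot satisfy a system that tolerates no data loss, and the entries still set to None are the requirements that remain to be defined.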
Current system's health
Now that the requirements have been defined, the current backup/recovery system can be assessed for performance, availability, recoverability, efficiency, ease of management, scalability, and flexibility. Some IT organizations choose to implement tools that help them gather the necessary data. The purpose of this exercise is to acquire metrics that help identify bottlenecks in the system as well as to provide a foundation for trending and forecasting. Some of the areas where bottlenecks can occur include the following:
- I/O throughput is dependent on the ability of a backup server to push out data to the backup media (disk or tape). If the server is not sized appropriately, it may become a bottleneck;
- Tape drives/libraries or disk arrays used as backup targets may become bottlenecks if the data stream is faster than the tape drive or disk array can handle. Tape drives may also become bottlenecks if the server sending data to them can't keep them streaming, which causes the drives to "shoeshine" (repeatedly stop, reposition, and restart);
- Data travels from the server being backed up (client) to either a backup target or a backup server. The backup server may aggregate a number of slow streams into one stream to improve performance (multiplexing) or send each stream to its own target. Not having appropriate network interfaces in the client servers or backup servers may create a bottleneck. A slow network in general may also cause slowdowns, depending on load; and
- Clients (servers whose data is being backed up) can become bottlenecks in a number of ways. One, their processing power may not be adequate to handle multiple processes, which slows everything down. Two, a client may hold a large number of small files, making the file system the bottleneck. Three, a client may lack appropriate network interface cards (NICs) or host bus adapters (HBAs).
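The reasoning behind the list above is that a backup can run no faster than the slowest stage on its data path. The Python sketch below illustrates that idea under simplifying assumptions: the stage names and every throughput figure are hypothetical placeholders rather than measurements, and the multiplexed rate is approximated as the plain sum of the incoming streams.

```python
from typing import Dict, List, Tuple


def find_bottleneck(stage_mb_per_s: Dict[str, float]) -> Tuple[str, float]:
    """Return the slowest stage and its rate; the end-to-end rate cannot exceed it."""
    stage = min(stage_mb_per_s, key=stage_mb_per_s.get)
    return stage, stage_mb_per_s[stage]


def backup_window_hours(data_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move data_gb at a sustained rate (1 GB taken as 1024 MB)."""
    return data_gb * 1024 / rate_mb_per_s / 3600


def multiplexed_rate(stream_rates_mb_per_s: List[float]) -> float:
    """Idealized combined rate of several client streams sent to one target."""
    return sum(stream_rates_mb_per_s)


def will_shoeshine(stream_mb_per_s: float, drive_min_streaming_mb_per_s: float) -> bool:
    """True if the incoming stream is too slow to keep a tape drive streaming."""
    return stream_mb_per_s < drive_min_streaming_mb_per_s


# Hypothetical sustained rates in MB/s for each stage of one backup path.
DATA_PATH = {
    "client disk read": 40.0,
    "client NIC": 100.0,
    "backup server": 200.0,
    "tape drive": 80.0,
}

if __name__ == "__main__":
    stage, rate = find_bottleneck(DATA_PATH)
    print("Bottleneck:", stage, "at", rate, "MB/s")
    print("Hours to back up 1000 GB:", round(backup_window_hours(1000, rate), 1))
    # A single slow stream may not keep the drive streaming; two combined might.
    print("One stream shoeshines:", will_shoeshine(30.0, 50.0))
    print("Two multiplexed streams shoeshine:",
          will_shoeshine(multiplexed_rate([30.0, 30.0]), 50.0))
```

With figures like these, the client's disk read rate limits the backup, not the tape drive, so replacing the target would not shorten the backup window. Measured rates from your own environment should replace the placeholders before drawing any conclusion.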
In addition to identifying bottlenecks, system assessments help identify areas where resources are underutilized and can be redeployed, as well as areas where resources must be added for the system to function as required.
Designing a data-protection system
Once requirements have been defined and you have ascertained real or potential bottlenecks, it is then time to select technologies and products and plan for implementation. After all this analysis, you might decide that a VTL is what you need. You could also discover that upgrading your network, increasing I/O throughput, or modifying backup schedules will eliminate bottlenecks.
In terms of product selection, some of the more common decision criteria include maturity of the technology, level of integration with the existing system, cost, and management complexity. In the end, keep in mind that backup-and-restore systems are never static, and you will need to continually assess performance and utilization and make adjustments accordingly.