Q: My backups are slow and unreliable. I’ve heard that putting in disk, in the form of a VTL, will solve my problems. Are there any ‘gotchas’ to watch out for?
BY NOEMI GREYZDORF
A virtual tape library (VTL) is one way that disk can be integrated into a backup system. Since backing up to disk has become a more-viable solution in the past few years, with the cost of disk arrays coming down rapidly, IT organizations have become convinced that implementing disk (typically as a VTL) will eliminate their backup-and-restore woes. But this can be a dangerous assumption. Backup-and-restore systems are very complex, and bottlenecks that affect performance exist everywhere along the data path—from the client server to the storage media used as a target. Before deciding on a VTL or any other disk-based backup solution, you must take the following three steps:
First, define business requirements (including recovery-time objective, or RTO, and recovery-point objective, or RPO) for applications, data, and systems. This step takes the most time and is the most difficult. Second, assess your environment against these requirements. This includes identifying areas that are over- or under-utilized, and where performance is not optimal. Finally, allocate existing resources where they can make the greatest impact, and consider technologies that will have to be added to reach the objectives defined in the business requirements.
Unless each of these steps is completed successfully, and in order, the overall project may be compromised.
The terms RTO and RPO are relatively simple to understand, so the question is: “How do you implement technology/products to meet your RTO and RPO requirements?” Basically, RTO relates to the question, “How quickly does something need to be recovered?” The word “something” is used on purpose. Consider a situation in which data corruption has occurred. How quickly do you need to get that data restored? The answer depends on the type of data and its value to the business. Categorizing applications and data sets to identify what needs to be available, how quickly, and at what level will help in evaluating the current data-protection system and designing a new one.
The term RPO relates to how much data you are willing to lose. Of course, no one is willing to lose any data, yet backups typically occur only once per day, so if a database were corrupted, the recovery point would be last night’s backup. A more relevant question is: “If data were lost, what would it cost to recreate the data created since last night’s backup?” Not all data sets require near-zero data loss; again, it is important to categorize data sets and establish criticality.
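To make that cost question concrete, it helps to estimate the exposure between backups. The sketch below is purely illustrative; the change rate and per-gigabyte recreation cost are hypothetical figures you would replace with your own:

```python
# Hypothetical RPO exposure estimate: how much data is at risk between backups?
hours_between_backups = 24        # nightly backup cycle
change_rate_gb_per_hour = 2.0     # assumed average rate of new/changed data
cost_to_recreate_per_gb = 500.0   # assumed cost (labor, downtime) per GB recreated

# Worst case: failure strikes just before the next backup runs.
data_at_risk_gb = hours_between_backups * change_rate_gb_per_hour
worst_case_cost = data_at_risk_gb * cost_to_recreate_per_gb

print(f"Data at risk: {data_at_risk_gb:.0f} GB")
print(f"Worst-case recreation cost: ${worst_case_cost:,.0f}")
```

Running this kind of estimate per data set makes it much easier to justify (or rule out) tighter RPOs such as intra-day snapshots or replication.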
It is now important to define specific criteria for recovery. Too often we focus on faster backups, but rarely do we focus on restores. Why do a backup if you don’t care about the restore?
Consider the following example. An organization has a mail server, a file and print server, database servers, and a homegrown application. The requirement may be for the file and print server to restore at the file level within 24 hours. For databases, the data needs to be available within an hour. For the mail server, there is a need to restore individual messages and calendar items within six hours. For the homegrown application, data loss is unacceptable. The more details you have on these RPO requirements, the easier it will be to evaluate and design a new solution.
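Requirements like these can be captured in a simple catalog that later drives the evaluation and design work. A minimal sketch, using the systems and recovery times from the example above (the data structure itself, and the RPO figures for the first three systems, are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class ProtectionRequirement:
    system: str
    rto_hours: float   # how quickly it must be restored
    rpo_hours: float   # how much data loss is tolerable
    granularity: str   # restore unit: file, message, database, etc.

# Catalog built from the example requirements; nightly-backup RPOs are assumed.
requirements = [
    ProtectionRequirement("file and print server", 24.0, 24.0, "file"),
    ProtectionRequirement("database servers", 1.0, 24.0, "database"),
    ProtectionRequirement("mail server", 6.0, 24.0, "message/calendar item"),
    ProtectionRequirement("homegrown application", 1.0, 0.0, "transaction"),
]

# Sort by urgency: the tightest RTOs and RPOs drive the design.
for req in sorted(requirements, key=lambda r: (r.rpo_hours, r.rto_hours)):
    print(f"{req.system}: {req.granularity}-level restore within {req.rto_hours}h")
```

Even a spreadsheet version of this catalog serves the same purpose; the point is that each system gets explicit, comparable recovery criteria before any product is chosen.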
Now that the requirements have been defined, the current backup/recovery system can be assessed for performance, availability, recoverability, efficiency, ease of management, scalability, and flexibility. Some IT organizations choose to implement tools that help them gather the necessary data. The purpose of this is to acquire metrics that help identify bottlenecks as well as to provide a foundation for trending and forecasting. Some of the areas where bottlenecks can occur include the following:
- I/O throughput is dependent on the ability of a backup server to push out data to the backup media (disk or tape). If the server is not sized appropriately, it may become a bottleneck;
- Tape drives/libraries or disk arrays used as backup targets may become bottlenecks if the data stream is bigger than the tape drive or disk array can handle. Tape drives may also become bottlenecks if the server sending data to them can’t stream it fast enough, causing the drives to “shoeshine” (repeatedly stop, rewind, and restart);
- Data travels from the server being backed up (client) to either a backup target or a backup server. The backup server may aggregate a number of slow streams into one stream to improve performance (multiplexing) or send each stream to its own target. Not having appropriate network interfaces in the client servers or backup servers may create a bottleneck. A slow network in general may also cause slowdowns, depending on load; and
- Clients can become bottlenecks in a number of ways. Their processing power may not be adequate to handle multiple processes, thus causing everything to slow down. The server may have a large number of small files, causing the file system to become the bottleneck. The client servers must have appropriate NICs or HBAs.
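One simple sanity check these assessments enable is whether the backup window is even feasible given the slowest link on the data path. A rough sketch, with all throughput figures and the environment size being assumptions for illustration:

```python
# Back-of-the-envelope backup-window check: the effective rate is capped by
# the slowest component on the data path (client, network, server, target).
def backup_window_ok(total_gb, window_hours, rates_mb_per_s):
    """Return True if the data fits in the window at the bottleneck rate."""
    slowest = min(rates_mb_per_s, key=rates_mb_per_s.get)
    bottleneck_mb_s = rates_mb_per_s[slowest]
    hours_needed = (total_gb * 1024) / bottleneck_mb_s / 3600
    print(f"Bottleneck: {slowest} at {bottleneck_mb_s} MB/s "
          f"-> {hours_needed:.1f}h needed")
    return hours_needed <= window_hours

# Hypothetical environment: 2 TB of data, 8-hour nightly window.
rates = {
    "client NIC": 110,         # roughly Gigabit Ethernet
    "LAN": 100,                # shared network under load
    "backup server I/O": 180,  # server's ability to push data out
    "tape drive": 120,         # assumed native streaming rate
}
print("Fits window:", backup_window_ok(2000, 8, rates))
```

In this hypothetical case the LAN, not the tape drive, is the limiting factor, which is exactly the kind of finding that should precede any decision to buy a VTL.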
In addition to identifying bottlenecks, system assessments help identify areas where resources are being underutilized, and thus can be redeployed, or resources may need to be added to make the system function as required.
Designing a data-protection system
Once requirements have been defined and you have ascertained real or potential bottlenecks, it is then time to select technologies and products and plan for implementation. After all this analysis, you might decide that a VTL is what you need. You could also discover that upgrading your network, increasing I/O throughput, or modifying backup schedules will eliminate bottlenecks.
In terms of product selection, some of the common decision criteria include maturity of the technology, level of integration with existing systems, cost, and management complexity. Keep in mind that backup-and-restore systems are never static: you need to continually assess performance and utilization and adjust accordingly.
For more on disk-based backup/restore trends and VTLs, see the Special Report, p. 26.
Noemi Greyzdorf is a senior solutions consultant at Cambridge Computer (www.cambridgecomputer.com). She can be contacted at firstname.lastname@example.org.