A number of end-user surveys reveal IT managers’ biggest problems with data protection, as well as solutions that small and medium-sized businesses should consider.
By Farid Neema
The explosion of corporate data in distributed environments and the advent of tighter regulations are putting tremendous pressure on corporations to protect and access their data. Typically, IT departments have tried to protect data by using high-availability devices with redundant systems, backing up data regularly to tape, and using data replication techniques. Increasingly, however, more-sophisticated methods of ensuring the integrity and availability of important corporate data are being used.
Data protection is a multi-step workflow of interconnected processes that extend far beyond simple on-site backup to encompass continuous backup and fast restoration, on-site and off-site storage, archiving, and disaster protection and recovery. If your data-protection solution does not address the complete protection lifecycle, your company risks the unacceptable exposure of only partial protection, which could result in the loss of irreplaceable data and costly downtime.
As shown in the figure, protecting data ranks highest among IT managers’ storage challenges, according to a survey conducted by Peripheral Concepts and Coughlin Associates. Among the various data-protection concerns, IT managers rank recovery problems at the top. Asked about the most significant data-protection problems on which they are ready to spend money, the three that top the list, as shown in the figure on p. 26, are improving the time to restore backed-up data, faster recovery from system failure, and easier recovery from disasters.
Current data-protection and recovery tools available to SMBs often offer only piecemeal solutions that are not only expensive to deploy, but also in many cases fail to live up to today’s business requirements. As a result, new backup-and-recovery management technologies are coming to market.
This article provides an overview of current practices and challenges in data protection, exposes some of the drawbacks to traditional solutions, covers newer backup-and-disaster-recovery techniques, and suggests criteria for selecting solutions to meet your requirements.
Backup is essentially copying live data so that it can be restored in the event of a crash or a failure causing loss or corruption of the primary data that usually resides on disk. Backup is a fundamental component of business continuance.
All backup techniques involve creating a copy of the data to be protected. The fundamental requirements for backup are that the process ensures the integrity of the data, allows rapid and simple recovery, and causes minimal disruption to system processing. But there are other attributes that differentiate backup systems, including the following:
- Protection of important files, continuously and in real time;
- Minimal backup windows;
- Point-in-time-based recovery with the ability to roll back to arbitrary versions;
- Options to set high- and low-priority files;
- Multiple backup/replication targets;
- Protection of file servers and transient connected endpoints; and
- Retention of data files for predefined lengths of time.
The box, above, indicates the attributes of the “perfect backup.”
The basic parameters that define the completeness of a backup process are
- The backup time window;
- The recovery point objective (RPO), which determines the periodicity of backup; and
- The restore time, or recovery time objective (RTO).
The backup window defines the length of time the data is inaccessible due to the backup process. In the past, backups were typically performed at night or on the weekend. While this is still the practice for the majority of an IT organization’s data, for the most critical applications a backup window of any length is no longer an option. End-user surveys show that the backup window is still a primary concern and needs to be reduced at more than half of the sites.
Two methods to minimize the backup window of traditional backup are to increase backup speed and to minimize the amount of data to back up. Recent, more-efficient techniques enable transparent generation of a snapshot or point-in-time copy of the data that can be either maintained as a separate mirror or used as the source of a point-in-time backup to tape, or both.
To minimize the amount of data that needs to be backed up, systems allow incremental or differential backups. An incremental backup captures, at set times, an image of the disk sectors that have changed since the last backup, which can be either a full backup or a previous incremental image. In the event of a crash, the user restores the original full image and then overlays it with the disk-sector changes copied as increments. This approach is more efficient in storage usage and backup window, but much less efficient in restoration time.
A differential backup takes an image of the disk sectors that have been modified since the last full disk image. This simplifies the restoration process, since restoring involves only one differential backup and one full backup, but each backup might contain more data. Incremental images provide greater granularity, since they can roll back to a more precise point in time than a differential image, which takes you back only as far as the most recent full image.
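The restore paths differ in a way worth making concrete: an incremental restore must replay every increment since the last full image in order, while a differential restore needs only the full image plus the most recent differential. The sketch below illustrates this in Python using hypothetical block maps (dictionaries of changed sectors), not any particular product's on-disk format:

```python
# Sketch: restoring from incremental vs. differential backups.
# Each backup is modeled as a dict mapping changed sector numbers to contents.

def restore_incremental(full_image, increments):
    """Replay every incremental image, in order, on top of the full image."""
    disk = dict(full_image)
    for inc in increments:          # one pass per increment -> slower restore
        disk.update(inc)
    return disk

def restore_differential(full_image, latest_differential):
    """Only the full image and the most recent differential are needed."""
    disk = dict(full_image)
    disk.update(latest_differential)
    return disk

full = {0: "A", 1: "B", 2: "C"}
incs = [{1: "B1"}, {2: "C2"}, {1: "B3"}]   # three incremental backups
diff = {1: "B3", 2: "C2"}                  # one differential covers all changes

# Both paths reach the same final state; the incremental path needs more steps.
assert restore_incremental(full, incs) == restore_differential(full, diff)
```

The same trade-off the text describes falls out of the sketch: the incremental list stores less per backup but must be replayed in full, while the single differential grows over time but restores in one pass.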
In the past, data protection meant tape backups (and in many cases it still does). Some online protection could be obtained by using RAID to keep data intact and available in the event of a hard drive failure. Most system administrators relied on copying file-based data to tape and then moving some of those tapes off-site. Terabyte-size databases could require hundreds or even thousands of tapes to keep track of. This is still the most common form of data protection in large enterprises, but it is only part of a larger suite of techniques available for safeguarding data.
Keeping track of the location of all these tapes and recycling expired media is a complex administrative task. Other management capabilities include monitoring and reporting successful and failed backups, equipment status, media availability, performance, and resource utilization. A typical large enterprise will back up hundreds of items every day. Administrators need to ensure all relevant data is being backed up. Centralized backup management offers a number of tools that facilitate the administrative tasks, but the cost of traditional file-based backup and recovery remains one of the highest costs of managing storage.
Recent developments in backup-and-recovery processes include disk-to-disk backup and snapshots.
Shift from tape to disk
Disk-to-tape backup/recovery has been the dominant method of data protection for a number of reasons, including low-cost media and ease of moving tapes off-site for disaster protection and archiving. However, tape presents two major problems: It is not reliable, and because it uses sequential access, it cannot provide fast restores. In our surveys, about 45% of the respondents experience unsuccessful backup rates of 10% or more at first attempt, which implies an equal percentage of unsuccessful restores. This can represent a daily problem for some organizations.
With higher-capacity disk drive systems decreasing in price and increasing in reliability, the lure of faster recovery times has led to a shift among IT managers. The number of sites using disks for at least part of their backup operations doubled last year. For the first time, more than half of the surveyed managers are considering the possibility of eventually moving to a tape-less IT operation. This aspiration for eliminating tapes is even more pronounced in very large IT operations. By the end of this year, there will be more data backed up on disk than on tape.
Factors driving the shift to disk-based backup/recovery include
- Faster data recovery;
- Ease-of-use and a rapid ROI;
- Better compliance; and
- The development of a number of disk-based tools that greatly simplify data management, including snapshots, continuous data protection (CDP), storage tiering, hierarchical storage management (HSM), information lifecycle management (ILM), and virtualization techniques.
Snapshots and PIT backups
A point-in-time copy actually creates a separate, physical copy of the disk, while a snapshot is a logical representation that gives the appearance of creating another copy. Point-in-time copy is a simple solution, but one that requires a lot of extra disk capacity. Standard practices require that several point-in-time copies be maintained throughout the day, enabling operations to retrieve an uncorrupted version of the database very close to the point at which the problem occurs.
Snapshot-capable controllers configure a new volume that points to the same location as the original. No data is moved, no additional capacity is required, and the copy is “created” within seconds. Additional capacity is consumed only when the volume is updated: before an update is allowed, the snapshot saves the old data blocks, retaining their original content. Because backups using snapshots are quick and less resource-intensive to create than mirrors, it is possible to make frequent backups and therefore ensure quicker, full restores.
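The copy-on-write behavior just described can be sketched as follows. This is a simplified model for illustration, not a real controller implementation: taking the snapshot copies nothing, and an old block is preserved only the first time the live volume overwrites it.

```python
# Sketch: copy-on-write snapshot. Creating the snapshot moves no data;
# capacity is consumed only when live blocks are first overwritten.

class Volume:
    def __init__(self, blocks):
        self.blocks = blocks            # live data: block number -> contents
        self.snapshots = []

    def snapshot(self):
        snap = {}                       # starts empty: "created" in seconds
        self.snapshots.append(snap)
        return snap

    def write(self, block, data):
        for snap in self.snapshots:
            if block not in snap:       # save old contents once, before overwrite
                snap[block] = self.blocks.get(block)
        self.blocks[block] = data

    def read_snapshot(self, snap, block):
        # Saved (old) block if the live volume has changed, else the live block.
        return snap[block] if block in snap else self.blocks.get(block)

vol = Volume({0: "jan", 1: "feb"})
snap = vol.snapshot()                   # instant, zero-capacity "copy"
vol.write(1, "mar")                     # only now is extra capacity used
assert vol.read_snapshot(snap, 1) == "feb"   # snapshot still sees old data
assert vol.read_snapshot(snap, 0) == "jan"   # unchanged block read from live volume
```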
At the extreme end of the spectrum is continuous backup, or CDP, where data is backed up in real-time whenever any change is made. Continuous backup establishes a journal in which changes to a set of data are recorded with time stamps. The current version of the data can be rolled back to any instant in time. In this way, the effect of a logical error can be undone. If the current data is not available because of a physical error, a full backup can be processed with updates from the continuous backup journal, creating a new up-to-date version of the data.
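A continuous-backup journal can be modeled as a time-stamped log of changes that is replayed up to the desired instant. The toy sketch below (not any vendor's API) shows how a logical error is undone by choosing a restore point just before it happened:

```python
# Sketch: continuous data protection (CDP). Every change is journaled with
# a timestamp; any past state is rebuilt by replaying the journal up to the
# chosen instant.

def record(journal, timestamp, key, value):
    journal.append((timestamp, key, value))

def restore_to(base, journal, instant):
    """Rebuild the data set as it existed at `instant`."""
    state = dict(base)
    for ts, key, value in sorted(journal):
        if ts > instant:
            break                       # ignore changes after the target instant
        state[key] = value
    return state

journal = []
record(journal, 10, "balance", 100)
record(journal, 20, "balance", 250)
record(journal, 30, "balance", -1)      # logical error introduced at t=30

# Undo the logical error by restoring to just before it occurred.
assert restore_to({}, journal, 29) == {"balance": 250}
```

Replaying the same journal on top of a full backup, as the text notes, is also how an up-to-date copy is rebuilt after a physical failure.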
Although most companies have implemented solutions for protecting enterprise data, to a degree, data residing in workstations and laptops is notoriously under-protected, even though it amounts to a large percentage of all corporate data; some estimates put it at 60%. The reason is that the processes have been time-consuming and complex for an average user, and the cost of centralized management prohibitive. However, recent techniques have gone a long way toward making backup and disaster protection of workstations and laptops more affordable.
There are many reasons why a corporation might lose important data. Broadly, they can be broken into the following categories:
- Natural disasters: Floods, earthquakes, hurricanes, and terrorist attacks;
- Security breaches: When an intruder breaches the network, server, or storage defenses of a company;
- Accidental data loss: Users delete, overwrite, or misplace critical files or e-mails, a backup tape is overwritten, power is lost; and
- System failure: A hard drive crash, software bugs, failed software updates and installations, data corruption, viruses, power outages.
There are several reasons for spending money, time, and effort on disaster protection. The single, most important reason is fear of financial loss, not only from lost sales, but also from a potential lawsuit due to your inability to access data required by a court order or government agency. Another driver is the recent wave of regulations that define what information must be retained, for how long, under what conditions, and demand that privacy of the information be ensured. A third driver is loss of productivity, as employees are idle or able to work only in a reduced capacity waiting for systems to be restored after a failure.
Cost of downtime
Downtime costs vary from industry to industry, based on dependency upon technology and typical labor costs. Half of the surveyed sites estimate their business to be at great risk within the first hours of interruption, as shown in the figure, above. Companies in the e-commerce and financial industries accrue an average of nearly $6 million in losses for each hour of downtime. SMBs lose $18,000 per hour of downtime, as shown in the table, above.
RTO and RPO
Two metrics characterize disaster-recovery plans: RTO and RPO. RTO is the time required to recover data after downtime; RPO is the maximum time-window of data loss the business can afford. RTO and RPO help determine what kind of data protection and recovery technology you need.
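As a concrete illustration with hypothetical numbers: since the worst-case data loss equals the interval between backups, the backup interval must not exceed the RPO. A trivial check makes the relationship explicit:

```python
# Sketch: checking a backup schedule against an RPO target. Worst-case data
# loss is the time between backups, so the interval must fit within the RPO.

def meets_rpo(backup_interval_minutes, rpo_minutes):
    return backup_interval_minutes <= rpo_minutes

# A nightly backup cannot satisfy a 15-minute RPO...
assert not meets_rpo(backup_interval_minutes=24 * 60, rpo_minutes=15)
# ...but 10-minute snapshots (or CDP, with an effectively zero interval) can.
assert meets_rpo(backup_interval_minutes=10, rpo_minutes=15)
```

The same reasoning drives technology choice: an RPO measured in minutes forces snapshot, replication, or CDP technologies, while an RPO measured in days is satisfied by a single overnight backup.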
The Cost of Downtime
Most surveyed sites measure RPO and RTO for critical applications in hours, although some measure them in minutes. The latter obviously need mirroring and data-protection technologies that work in real-time. If you measure RPO and RTO in days, a single overnight backup suffices. It is not cost-effective for every application to have guaranteed “five-nines” uptime.
The figure on p. 27 shows the relationship among recovery speed, cost, and applications. IT operations are moving toward having multiple levels of data protection. The survey population aiming at RTO and RPO of less than 20 minutes has doubled since our 2004 surveys.
Traditional disaster protection
Disaster-protection strategies are designed to protect businesses against the catastrophic destruction of their computing facilities. The usual approach is to create duplicate facilities at remote sites. Sometimes the primary systems are duplicated; in other cases, smaller systems may be set up to execute only the most critical business applications.
Backing up data at a remote site provides an increased level of protection. Volume replication often is used to perform mirroring across separate hosts. The key to recovering from disasters is the ability to access, or recover, the backup copies of crucial data and to restore the data to backup systems that will take over the operation of critical applications.
When executing a disaster-recovery plan, data is restored to a backup system. Because this system typically has a different configuration than the original, the recovery process is more complex than a simple data restoration to the original system and requires elaborate coordination with applications. Some solutions incorporate a resynchronize function as part of the replication services to help IT administrators transfer data back to the primary site for normal operations.
To handle a diverse set of failures, IT systems must be able to access and recover something as small as a single e-mail or file to something as large as the data center for an entire site. The systems must be able to recover data to a consistent state when the data is linked to active applications that are continually accessed by users. It might also be important to have the ability to recover data to any point-in-time in the past. Ideally, the ability to restore an image to dissimilar hardware should be a function of the disaster-recovery software to minimize the amount of time required to reconfigure the backup servers.
Most discussions about disaster recovery end at the notion of having systems available during a disaster. However, the real challenge is actually the process of restoring all the systems at the primary site, or new primary site, back to normal afterwards.
For the majority of sites, tape is still the primary means of disaster recovery. The process of retrieving data after a disaster has remained largely unchanged, despite advancements in the tape storage industry. To recover data, the correct tape cartridge has to be located, the correct section of the tape found, and the data retrieved, uploaded, and reintegrated. Most large organizations can only make full copies of their most vital data once a week. In between, they typically copy only the changes to the data. However, reintegrating that incremental data is extremely time-consuming, difficult, and subject to data loss.
Surveys show that most large enterprises have a disaster-recovery process in place, although whether it has actually been tested, or how often it is updated or reviewed, is another question entirely. A disaster-recovery plan that has been written but never tested is essentially worthless. Midrange, and especially smaller, companies are less likely to have a written and tested disaster-protection process.
Disaster protection for SMBs
Historically, protecting against a disaster was seen as a huge task facing IT managers. Consequently, only large corporations could afford it. However, with new technologies the implementation costs can be significantly lower and take much less effort and time.
Some key components of cost-effective disaster protection include virtualization, bare-metal recovery, hardware duplication on standard equipment, and leveraging lower-cost networks.
The opportunity to really change how disaster recovery is done has emerged in large part as a result of developments in storage virtualization, particularly storage virtualization products that rely on commodity hardware and operating systems. These virtualization solutions leverage existing networks and the ability to mirror to/from any storage device attached to the network.
The benefits of using virtual machines in a data center are numerous. Instead of being confined to one operating system on each physical computer, companies can leverage virtual server technology to deploy multiple environments on the same server. Companies can use virtual servers to eliminate the costs of managing and upgrading legacy hardware by migrating older applications onto virtual machines running on new, reliable hardware. They can also consolidate low-use departmental servers onto a single physical server to decrease management complexity.
Test and development groups have long used virtual machines to simplify the creation of realistic test environments, but the introduction of a number of recently announced products has made broader utilization of virtualization practical for such applications as server consolidation and legacy-application support.
To meet disaster-recovery requirements for production systems, administrators must design and implement protection strategies that afford these virtual machines the same safeguards as traditional servers. Simply backing up a host server can be insufficient to ensure data within virtual machines is recoverable. Newer products provide a comprehensive, reliable, data-recovery solution that backs up both a host server and all individual virtual machines on that server.
Virtualization provides major cost and management benefits for corporate data centers. With advances in hardware speed (such as 64-bit processors), multi-processor servers, and the accompanying increase in CPU power, servers will increasingly be capable of supporting larger numbers of virtual machines.
Bare-metal backup and recovery allows restoration of critical data in a matter of minutes by automating processes that would otherwise require manual reconfiguration of hard disks and installation of operating systems. With automation, procedures are more predictable and simpler; users require less training, and data recovery therefore becomes more reliable and less time-consuming.
Dissimilar hardware restoration
To protect against hardware failure and still allow for automated system recovery, many organizations purchase duplicate hardware for the most critical computers. However, maintaining duplicate hardware for an entire site is so cost-prohibitive that only the most critical IT operations can justify it. Moreover, it is difficult to guarantee that the same model will still be available some time after the initial purchase, which means duplicates must be bought at the time of the original system acquisition.
The ability to save and restore servers on any system provides flexibility and significant savings. It enables you to recover a single-processor computer to a multi-processor computer, or recover from expensive Fibre Channel or SCSI drives to ATA or Serial ATA (SATA) storage devices.
Leveraging IP networks
Remote copy allows a second storage system to act as a hot backup or to be placed out of harm’s way, available for the disaster-recovery site to use. Remote-copy systems used to be expensive: the telecommunications needed to support them present IT managers with a high recurring expense, rated high among the impediments to acquiring efficient data protection. These costs have tended to relegate remote copy to high-end applications and very large companies.
Today, however, storage can be distributed over the Internet or an Ethernet network, with many clients or servers able to access many storage units. Remote copy, disk-based backup, and distributed data stores are now much easier to implement and manage.
As was shown in the figure on p. 26, achieving quick recovery is by far the most important data-protection challenge facing the IT manager. Other problems center on reliability, backup window, and cost. Most of these concerns are also among the major impediments to acquiring the data protection that IT managers need, as shown in the figure, below. Other factors include additional management complexity and staff shortages. The best way to measure the value of a data-protection product is to assess its features in the context of these characteristics.
The issue for users is not backup, but how quickly they can recover and be up and running. Fast recovery from disaster is given the highest rating among the selection criteria that guide the choice of IT managers for a business continuance solution, as shown in the figure, right. Recovery-related questions include the following:
- Is disk-based backup and point-in-time recovery available?
- Can the system store critical files locally on disk?
- Can critical files be quickly recovered and work resumed while the rest of the system is restored?
- Does the system allow differential backups?
- Can backup and recovery be initiated from a remote location?
- Can the recovery be executed onto dissimilar hardware?
Tape-based backup processes are not reliable. Reliability problems are primarily attributed to manual intervention, media, robotics libraries, and software, and the management of a large number of tapes leads to confusion. Questions to check include the following:
- Is disk involved, and is the level of redundancy adequate?
- Is the backup system automated?
- Are cross-platform backups consolidated for consistent procedure and policy?
- Is any data management software included, such as SRM, HSM, or ILM?
- Can the system consistently back up virtual servers and virtual machines on a server?
The backup window is raised as a problem by more than half of our surveyed IT managers. Full backups can last days and need to be scheduled during weekends. Is snapshot or CDP technology available? Does the system enable a reduction of the number of full backups? Can open files be backed up? Can the system avoid duplication of redundant data? Does it compress or compact data? Can backups be selective?
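Avoiding duplication of redundant data is commonly done by content addressing: each block is hashed, and a block already present in the store is referenced rather than stored again. The sketch below is a simplified illustration of that idea, not any vendor's implementation:

```python
import hashlib

# Sketch: block-level deduplication. Each unique block is stored once, keyed
# by its content hash; a backup is just the ordered list of block hashes.

def dedup_backup(blocks, store):
    manifest = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)     # store the block only if unseen
        manifest.append(digest)
    return manifest

def restore(manifest, store):
    return [store[digest] for digest in manifest]

store = {}
data = [b"alpha", b"beta", b"alpha", b"alpha"]      # redundant blocks
manifest = dedup_backup(data, store)

assert len(store) == 2                  # only two unique blocks are kept...
assert restore(manifest, store) == data # ...yet the full sequence restores exactly
```

Deduplicating before data leaves the client also shrinks the backup window itself, since redundant blocks never cross the network.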
The real cost of storage is not in the hardware and software, but primarily in the labor involved in managing storage and in the productivity loss. Therefore, the total cost of ownership (TCO) needs to be taken into account, including productivity gains due to increased performance, simplified management, better utilization of resources, and increased data availability.
Questions to be addressed include the following:
- What are the TCO and ROI?
- Does the system include a suite of offerings, such as virtualization, to maximize storage utilization?
- Does the system cover all aspects of data protection compatible with your RTO and RPO goals (e.g., backup, recovery, replication, archiving) or can it integrate with other vendors’ products?
- Is the system capable of dissimilar hardware restoration using off-the-shelf equipment?
- Can the basic system cost-efficiently expand to cover your desktop, departmental, and enterprise needs?
Managing storage is one of the most important costs in administering a networked IT operation. The first important criterion in manageability is the ability to manage all storage components and servers from a centralized point or from any point distributed on the network. Management can automate the creation and distribution of critical information, consolidate management of geographically distributed heterogeneous storage solutions, and measure and communicate outcomes. Also, consolidating data services makes it easier for IT administrators to manage their data than operating via a number of piecemeal solutions. Manageability-related questions include the following:
- Can open files be backed up and replicated transparently?
- Can the system create real-time automated backup-and-recovery points?
- Does the system provide analytical reporting with statistical data?
- Is bare-metal recovery available, and how long does the recovery process take?
- Is the system fast enough, and can it support next-generation hardware and software?
Backup and recovery is a time-consuming, tedious chore. Backup ties up hours of an IT professional’s time, and still more time in the morning reviewing logs and troubleshooting. Recovery is even more time-consuming, as it is unpredictable and causes the most disruption. Your data-protection solution should
- Take minimal time to install;
- Have centralized management;
- Offer a single system interface;
- Allow automated backup and client file restoration; and
- Allow unattended remote server restoration.
Traditional data-protection solutions have long been missing the tools IT professionals need to manage backup and recovery and to match results to SLA and compliance requirements.
Next-generation data protection enables SMBs to implement the backup/recovery and disaster-protection solutions required in today’s business environment. This level of protection was previously available only to those who could afford to spend top dollar on proprietary storage solutions. Now, cost-efficient data-protection systems offer business continuity through advanced, network-efficient replication with integrated snapshots and bare-metal disaster recovery. These systems enable total protection based on quick recovery by making the entire process
Faster: Snapshot, point-in-time restoration and bare-metal disaster recovery;
Simpler: User-friendly, Web-enabled, policy-based management;
Efficient: Minimum impact on LAN and WAN traffic;
Proactive: Users can restore their own files;
Reliable: Fault-tolerant disk configurations; and
Inexpensive: Low initial investment and low TCO.
Farid Neema is president of Peripheral Concepts Inc. The references and figures in this article are from a series of virtualization and data-protection reports based on end-user surveys, published by Peripheral Concepts (www.periconcepts.com) and Coughlin Associates.