The name of the game isn’t backup and recovery; it’s data protection.
Perhaps no single event has influenced the direction of the data-protection/disaster-recovery industry more than the attacks on the World Trade Center and the Pentagon last September.
“It has made people reconsider their data backup-and-recovery procedures,” says Joe Flach, vice president of Eagle Rock Alliance, a management consulting firm specializing in disaster recovery and business continuance planning in West Orange, NJ. “And it has probably accelerated the trend toward disk-based, versus tape-based, backup, a shift that predates the September 11 attacks.”
The appeal of a disk-based data-protection scheme is simple: faster backup and faster restore. “The delay in business operations due to tape-based recovery has become unacceptable,” writes Phil Goodwin in the META Group report, “Backup is dead - Long live the recovery.” Goodwin is program director of server infrastructure strategies at the Stamford, CT-based IT consulting firm.
The answer isn’t necessarily an all-tape or all-disk solution, but a combination, says Flach. The key is to identify risks and then categorize data according to its overall importance within the organization, to determine what can be backed up to tape and what requires disk’s faster restore times. (See “Disk and tape forge new partnership in backup arena,” InfoStor, November 2001, p.1.)
The META Group categorizes risks as either “generalized” (e.g., user error, system failure, and fire) or “regionalized” (e.g., earthquakes, tornadoes, and hurricanes). An effective disaster-recovery plan assesses these risks and then weighs the importance of the data or the application involved, explains Goodwin. “For example, a 30-mile separation may be sufficient to protect data centers from loss due to tornado, but it may be an insufficient distance in hurricane-prone locales.” (For more information, see the META report, “Storage portfolio: Are you ready for a recovery disaster?”, by Phil Goodwin.)
Likewise, some types of applications/data are more critical than others and therefore warrant more sophisticated (and more costly) data-protection/data-recovery schemes. Data in relational databases, for example, is generally deemed more critical than data on end-user PCs or laptops.
Mission-critical data warrants a different set of recovery techniques from non-critical or end-user data types, explains Goodwin. Techniques like point-in-time copy, snapshot, mirroring, and asynchronous/synchronous replication can be used to protect mission-critical and/or organizational-critical data, while on-site tape backup and off-site tape vaulting are common recovery methods for less-critical data (see table).
|Matching recovery methods with application/data categories|
|Snapshot||Mirror||On-site tape||Duplicate tape||Off-site tape vault||Asynchronous remote replication||Synchronous remote replication|
|Source: META Group|
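The pairing of data categories with recovery methods can be sketched as a simple lookup. This is only an illustration, assuming the two-way split described in the paragraph above the table; the category names, the `candidate_methods` function, and the grouping constants are hypothetical and are not taken from the META table itself.

```python
from enum import Enum


class DataCategory(Enum):
    """Illustrative categories; the names are assumptions, not META's."""
    MISSION_CRITICAL = "mission-critical"
    ORGANIZATIONAL_CRITICAL = "organizational-critical"
    LESS_CRITICAL = "less-critical"


# Techniques the article lists for critical data.
CRITICAL_METHODS = [
    "point-in-time copy",
    "snapshot",
    "mirroring",
    "asynchronous replication",
    "synchronous replication",
]

# Techniques the article lists for less-critical data.
LESS_CRITICAL_METHODS = ["on-site tape backup", "off-site tape vaulting"]


def candidate_methods(category):
    """Return the recovery methods worth evaluating for a data category."""
    if category in (DataCategory.MISSION_CRITICAL,
                    DataCategory.ORGANIZATIONAL_CRITICAL):
        return list(CRITICAL_METHODS)
    return list(LESS_CRITICAL_METHODS)


for category in DataCategory:
    print(category.value, "->", ", ".join(candidate_methods(category)))
```

In practice the mapping would be finer-grained and would also weigh the generalized and regionalized risks discussed earlier.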
Goodwin says that storage resource management (SRM) products like BMC Patrol SRM and Computer Associates’ BrightStor SRM can help users determine the relative importance of data within their data centers by identifying data dependencies within application environments.
Additionally, look for other technologies like clustering, journaling, virtualization, and peer-to-peer remote copy to play an increasing role in IT data-protection schemes, and for end users to hand over backup/recovery control of their data to IT administrators. According to META, by 2004, IT administrators, not end users, will be responsible not only for protecting data stored on PCs and laptops, but also for providing a recovery mechanism for that data.
The Cost Of Downtime
The hourly cost of downtime, depending on the application, can run between $80,000 and $2 million, says META’s Goodwin. An online survey by Contingency Research Planning, a division of Eagle Rock Alliance, and Contingency Planning & Management magazine, revealed similar findings early last year.
Of the 163 respondents polled in the survey, 46% said their hourly downtime costs ranged from $50,000 to $1 million, while 100% of the respondents said their company’s survival was at risk within the first 72 hours of a disaster.
According to Contingency Research Planning, the leading cause of computer downtime of 12 hours or more is power outages/surges, followed by storm damage (20%), flood and burst pipes (16%), fire and bombing (9%), and earthquakes (7%).
So, when you consider that many of the financial companies in or around the Trade Center had up to seven days to recover data and to restore critical business applications, you realize that the loss, at least from a business perspective, could have been far worse, says Flach.
“The sheer magnitude of the September 11 disasters made it easier for large organizations, including some Fortune 100 and Fortune 500 companies, to respond,” says Flach. “A more localized attack, one that had not affected so many companies and had not forced the closure of trading markets, would likely have had much more serious business consequences.”
Flach says the extra time enabled IT organizations to “mask” the inefficiencies of their disaster-recovery plans. “Hopefully, we’ve all learned something from this lesson,” he says.
According to Eagle Rock, the typical time-to-recovery objective among IT organizations is 24 hours. META expects four hours to be the “best practice” for the recovery of mission-critical applications/data by 2005.
One of the most common mistakes made by IT administrators when implementing disaster-recovery schemes is putting too much emphasis on backup and not enough on recovery. But that’s changing, as evidenced by the increasing number of recovery-specific announcements from leading backup and software providers (see “Vendors focus on recovery”).
“Many customers overlook recovery entirely,” says META’s Goodwin. Why? IT organizations tend to focus on the cost of implementing a disk-based recovery infrastructure versus that of implementing tape backup, he says. “They really need to compare the cost of downtime versus the cost of the disk infrastructure. It’s one thing to focus on backup, but you’re losing money when you’re down.”
Goodwin says that the expense of implementing a disk-recovery infrastructure pales next to the potential downtime costs (which can range from $80,000 to $2 million per hour) of not having such a system in place. This differential, he explains, justifies the transition to disk-based recovery. Goodwin says that tape will continue to play a key role in data archival and in off-site data-recovery sites.
But perhaps the greatest impact of the transition to disk-based recovery will be on backup and software vendors, not tape manufacturers. According to META, the trend toward disk-based recovery will give birth to data-protection suites, which will ultimately (in the 2004/2005 time frame) make current backup-and-recovery products obsolete.
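Goodwin's comparison of downtime cost against infrastructure cost comes down to a break-even calculation. The sketch below uses the hourly downtime range quoted in the article; the $1 million disk-infrastructure figure and the function name `break_even_hours` are assumptions chosen purely for illustration.

```python
def break_even_hours(disk_infrastructure_cost, hourly_downtime_cost):
    """Hours of avoided downtime needed to pay for the disk infrastructure."""
    if hourly_downtime_cost <= 0:
        raise ValueError("hourly_downtime_cost must be positive")
    return disk_infrastructure_cost / hourly_downtime_cost


# Hypothetical investment; not a figure from the article.
HYPOTHETICAL_DISK_COST = 1_000_000

# Hourly downtime range cited by META's Goodwin.
for hourly_cost in (80_000, 2_000_000):
    hours = break_even_hours(HYPOTHETICAL_DISK_COST, hourly_cost)
    print(f"${hourly_cost:,}/hr downtime -> break-even after {hours} hours avoided")
```

Under these assumed numbers, the investment pays for itself after somewhere between half an hour and 12.5 hours of avoided downtime.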
Early components include various snapshot, mirroring, and remote-mirroring technologies from vendors such as Compaq, EMC, IBM, Network Appliance, and Veritas. New file systems, enhanced SAN management capabilities (including dynamic policy-driven tools), and improved storage virtualization capabilities will be added down the road.
Users still rely on tape
Despite the emerging trend toward disk-based, versus tape-based, backup, many IT organizations remain committed to tape.
For example, Baseline Financial Services, a division of Thomson Financial and a provider of products and services to about 10,000 institutional portfolio managers in 800 organizations throughout North America, still backs up 200GB to 300GB of data to a single Quantum/ATL P4000 tape library each night despite having lost about two days’ worth of data in the attacks on the World Trade Center in September.
Baseline’s headquarters, as well as 175 employees, occupied the 77th and 78th floors of Tower 2. The company also has offices in Philadelphia and San Francisco.
Addison Tso, the company’s systems engineer, says that while the events of September 11 have made disaster recovery a higher priority at Baseline (and freed up more dollars to do it), he doesn’t have any plans to bring disk into the mix. Instead, he plans to stick with tape. Recognizing the importance of having some level of redundancy, Tso says he plans to purchase a second Quantum/ATL P4000 for one of the company’s other locations.
Before the attacks, Baseline maintained a single Quantum/ATL P1000 library equipped with three DLT7000 drives at its data-center operations in Manhattan. Since then, the company has upgraded to a P4000 and Super DLTtape technology.
Similarly, John Wadsworth, director of computer operations at Mohegan Sun, a gaming and entertainment facility in Uncasville, CT, backs up about 8TB of casino and operational data daily to two seven-drive IBM LTO 3584 Ultrascalable tape libraries, one at each of Mohegan Sun’s two campuses. The two facilities are about half a mile apart. Half of the company’s NT servers are at one location, half at the other.
For redundancy, Wadsworth says several of the company’s business-critical systems have a sister system at the other location. This setup allows for fail-over in the event of a failure or disaster.
To speed up the backup process, Wadsworth used to back up some of the clients to disk first and then to tape, but he says that the LTO tape configuration has made this unnecessary.
Choosing the right backup/recovery approach
Making sense of your tape, disk-to-disk, point-in-time, and snapshot options.
By Dianne McAdam
Tape is an economical, well-understood, and pervasive backup medium, but users needing more speed may choose to back up directly to disk or to make use of point-in-time copy or snapshot tools.
Unfortunately, this costs. While disk vendors are happy to supply the extra capacity, a vanilla disk-to-disk solution will double users’ disk costs. But falling prices are making disk-based options more popular.
So, as you rethink your data-protection budgets, think about disk-to-disk, and especially the point-in-time approach. It can reduce backup-and-restore times to virtually zero.
Disk-to-disk copies shorten the period during which data is both online and in backup mode, compared with disk-to-tape operations (see table). The effect on production is insignificant; response time is affected for minutes rather than hours. A typical disk-to-disk backup proceeds as follows:
- Establish a specific time to make a copy of the database volume;
- Place the database in backup mode (or, if it has no hot backup capability, quiesce it);
- Copy the database (disk-to-disk) at the specified time;
- Take the database out of backup mode; and
- Mount the copy on a backup server and let the backups to tape begin.
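The steps above can be sketched as a small orchestration routine. Everything here is hypothetical: `DemoDatabase` only records calls, and `copy_volume` and `stream_to_tape` stand in for whatever array and backup tools a site actually uses. The scheduling step is left to an external scheduler. The point of the sketch is the ordering: the database is held in backup mode only while the disk-to-disk copy runs, and the tape backup happens afterward against the copy.

```python
from contextlib import contextmanager


class DemoDatabase:
    """Hypothetical stand-in for a database; it only records what was asked."""

    def __init__(self):
        self.events = []

    def begin_backup_mode(self):
        self.events.append("enter backup mode")

    def end_backup_mode(self):
        self.events.append("leave backup mode")


@contextmanager
def backup_mode(db):
    """Hold the database in backup mode only for the duration of the block."""
    db.begin_backup_mode()
    try:
        yield
    finally:
        # Runs even if the copy fails, so production is not left in backup mode.
        db.end_backup_mode()


def disk_to_disk_backup(db, copy_volume, stream_to_tape):
    """Copy disk-to-disk inside the backup window, then go to tape outside it."""
    with backup_mode(db):
        copy = copy_volume()
    stream_to_tape(copy)  # runs on the backup server, off the production path
    return copy


db = DemoDatabase()


def copy_volume():
    db.events.append("copy volume disk-to-disk")
    return "volume-copy"


def stream_to_tape(copy):
    db.events.append(f"mount {copy} on backup server, write to tape")


disk_to_disk_backup(db, copy_volume, stream_to_tape)
for event in db.events:
    print(event)
```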
That’s the current state of the art: a little downtime in exchange for a lot of data. That might make a casual observer believe the backup problem is finally resolved, right? Well, not quite.
Point-in-time solutions are now being marketed by vendors such as Compaq, EMC, Hitachi Data Systems, IBM, and XIOtech. These products can reduce backup time to almost zero, because there is no “backup.” After the initial full copy, point-in-time technologies only copy changed data, minimizing the time it takes to make that copy. A typical point-in-time scenario:
- At intervals during the day, establish and split off multiple physical point-in-time copies of the volumes;
- When a restore is required, restart the application pointing it to the last good copy of the database; and
- Bring up the database, and reapply any transactions that occurred since the last point-in-time copy was established.
Because an up-to-date copy of the database is made on disk at close intervals, and users can be moved between the two disk copies, restore time is virtually eliminated.
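The restore logic described here (restart from the last good copy, then reapply later transactions) can be sketched in a few lines. This is a toy model under stated assumptions: copies are plain dictionaries, timestamps are minutes since midnight, and the `PointInTimeCopy` and `restore` names are hypothetical. It also assumes the logged transactions themselves are sound; a real recovery would stop replay before whatever caused the corruption.

```python
from dataclasses import dataclass, field


@dataclass
class PointInTimeCopy:
    taken_at: int                      # minutes since midnight (assumption)
    data: dict = field(default_factory=dict)
    corrupted: bool = False


def restore(copies, transactions):
    """Restart from the newest uncorrupted copy and replay later transactions.

    transactions is a list of (timestamp, key, value) tuples.
    Returns (timestamp of the copy used, restored state).
    """
    good = [c for c in copies if not c.corrupted]
    if not good:
        raise RuntimeError("no uncorrupted point-in-time copy available")
    base = max(good, key=lambda c: c.taken_at)
    state = dict(base.data)
    for timestamp, key, value in sorted(transactions, key=lambda t: t[0]):
        if timestamp > base.taken_at:
            state[key] = value
    return base.taken_at, state


copies = [
    PointInTimeCopy(480, {"a": 1}),
    PointInTimeCopy(720, {"a": 2}),
    PointInTimeCopy(960, {"a": 99}, corrupted=True),  # newest copy is bad
]
transactions = [(500, "a", 2), (800, "b", 5), (1000, "c", 7)]

used, state = restore(copies, transactions)
print(used, state)
```

Here the 960 copy is skipped, the 720 copy becomes the base, and only the two transactions logged after 720 are replayed.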
Point-in-time is a simple solution, but one that requires a lot of extra disk capacity. How many point-in-time copies do you need? The real question is: How lucky do you feel? Databases get corrupted, so standard practices require that several point-in-time copies be maintained throughout the day, enabling operations to retrieve an uncorrupted version of the database very close to the point at which the problem occurs.
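The capacity trade-off can be made concrete with a little arithmetic. The sketch below assumes each point-in-time copy is a full physical copy of the database and that copies are kept for a fixed retention window; the function name and the 24-hour/8-hour figures are illustrative assumptions, not recommendations from the article.

```python
import math


def point_in_time_plan(database_tb, retention_hours, hours_between_copies):
    """Estimate copies kept, extra disk needed, and worst-case replay window."""
    copies = math.ceil(retention_hours / hours_between_copies)
    return {
        "copies_retained": copies,
        "extra_capacity_tb": database_tb * copies,
        # At worst, a failure lands just before the next copy is taken.
        "max_replay_window_hours": hours_between_copies,
    }


plan = point_in_time_plan(database_tb=1, retention_hours=24, hours_between_copies=8)
print(plan)
```

Shortening the interval reduces the number of transactions to reapply after a failure but raises the disk bill in direct proportion.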
Snapshot technologies from vendors such as Compaq, IBM, Network Appliance, and StorageTek provide logical point-in-time copies of volumes or files with initially no additional capacity requirements. Though many use the terms “point-in-time” and “snapshot” interchangeably, “snapshot” defines a unique approach to providing point-in-time copies.
How long does it take to back up 1TB of data?
|Technology||Relative cost||Time to back up||Time to restore||Notes|
|Disk-to-tape||2 to 3 tape cartridges||1 hr||2 to 3 hrs||Traditional, inexpensive, slow.|
|Disk-to-disk||Additional 1TB disk capacity||20 min||20 min||Since no reformatting is required, restores take the same amount of time as backups.|
|Point-in-time||Additional 3+TB disk capacity||20 min; incrementals only take a few min.||Seconds. No restore required.||Restart applications by switching to point-in-time copy.|
|Snapshot||Initially, no additional capacity required.||Seconds||Seconds. No restore required.||Restart applications by switching to snapshot, provided all data blocks are valid.|
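The timings in the table imply sustained transfer rates, which are easy to work out. The sketch below assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes); the function name is hypothetical.

```python
def implied_throughput_mb_per_s(capacity_tb, minutes):
    """Sustained rate needed to move capacity_tb in the given number of minutes."""
    total_bytes = capacity_tb * 10**12
    seconds = minutes * 60
    return total_bytes / seconds / 10**6


for label, minutes in (("disk-to-tape backup", 60), ("disk-to-disk backup", 20)):
    rate = implied_throughput_mb_per_s(1, minutes)
    print(f"{label}: {rate:.1f} MB/s")
```

Rates of this order would normally be reached by running several drives or disk paths in parallel rather than through a single device.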
Both provide a second copy of the data. Point-in-time actually creates a separate physical copy of the disk, while snapshot gives the appearance of creating another copy. Snapshot-capable controllers configure a new volume that points to the same location as the original: no data is moved, no additional capacity is required, and the copy is “created” within seconds. Additional capacity is required only when the volume is updated. Before an update is applied, the snapshot saves the old data blocks so that it retains the original content.
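The behavior described above is commonly called copy-on-write, and a toy model shows why a snapshot costs nothing until the volume changes. This is a minimal sketch, not any vendor's implementation; the `Volume` and `Snapshot` classes are hypothetical and model blocks as list entries.

```python
class Snapshot:
    """Logical copy: holds only the blocks that changed after it was taken."""

    def __init__(self, volume):
        self.volume = volume
        self.saved = {}  # block index -> original content

    def preserve(self, index, old_data):
        # Keep only the first (original) version of each block.
        if index not in self.saved:
            self.saved[index] = old_data

    def read(self, index):
        if index in self.saved:
            return self.saved[index]
        return self.volume.blocks[index]  # unchanged block, shared with volume


class Volume:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        snap = Snapshot(self)  # no data is copied here
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Save the old block for every snapshot before overwriting it.
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data


volume = Volume(["a", "b", "c"])
snap = volume.snapshot()
print(len(snap.saved))                  # extra blocks used right after creation
volume.write(1, "B")
print(volume.blocks[1], snap.read(1))   # live data vs. snapshot view
print(len(snap.saved))                  # extra blocks used after one update
```

The snapshot consumes extra space only for blocks that are overwritten, which matches the "initially, no additional capacity" entry in the table.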
Restarting from snapshot copies can get you past a number of different problems, such as human errors and software failures, and back to a consistent state. Unfortunately, snapshot alone cannot protect against physical storage failures. Since multiple logical snapshots can point at the same physical data blocks, if one of the blocks goes bad, multiple snapshots are thereby invalidated. Logical snapshots therefore must be paired with another technology such as mirroring to reduce the likelihood of physical data loss.
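The shared-block risk, and the way mirroring reduces it, can be illustrated with sets. In this sketch each snapshot is modeled as the set of physical block IDs it references; the snapshot names, block IDs, and function name are all hypothetical.

```python
def invalidated_snapshots(snapshots, bad_blocks):
    """Names of snapshots that reference at least one unreadable block."""
    return sorted(name for name, blocks in snapshots.items() if blocks & bad_blocks)


# Three logical snapshots that share most of their physical blocks.
snapshots = {
    "09:00": {1, 2, 3},
    "12:00": {1, 2, 4},
    "15:00": {1, 5, 4},
}

# Without mirroring: one bad physical block damages every snapshot that uses it.
bad_on_primary = {2}
print(invalidated_snapshots(snapshots, bad_on_primary))

# With mirroring: a block is lost only if it is bad on both sides of the mirror.
bad_on_mirror = set()
lost = bad_on_primary & bad_on_mirror
print(invalidated_snapshots(snapshots, lost))
```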
Tape backups will continue to have their place within the data center: as a second line of defense behind disk-based backups, as well as for the backup of non-mission-critical information, the retention of historical information, and the archiving of data. IT departments need to carefully review their backup strategy and determine the best method for each application.
Vendors focus on recovery
By Heidi Biggar
Vendors such as BakBone Software, StorageTek, and Veritas all recently made announcements that indicate an industry-wide focus on recovery. For example, Veritas, in the first of a series of anticipated acquisitions in this space, purchased the Kernel Group, an Austin, TX-based provider of automated system-recovery software.
This type of software, generically known as bare-metal restore, automates the restore process of critical system information (e.g., user information and file-system structures) and eliminates the time-consuming and often error-ridden process of manually reconfiguring hard drives and installing operating systems.
Veritas will initially offer Kernel’s Bare Metal Restore software as a separate product offering but plans to integrate it into its software management family. Veritas claims that this software will let users do full-system restores in as few as 15 minutes. The software supports Windows, Solaris, HP-UX, and IBM AIX platforms and is disk/tape agnostic.
Meanwhile, later this quarter, BakBone says it will expand its VaultDR bare-metal restore module with support for Unix platforms. The capability will be available as an option to NetVault 6.7 and is priced at $495 per client. BakBone currently offers this capability for Intel-based operating systems.
And StorageTek recently threw its hat into the business-continuance/disaster-recovery market with the unveiling of its Lifeline brand of hardware products (tape and disk) and services.