Rapid storage growth has left many data centers grappling with multiple pools of “virtual” storage volumes from multiple vendors. Through block aggregation across multi-vendor storage pools, virtualization appliances can provide storage administrators with high-level virtualization to simplify the complexities of a heterogeneous environment. Alternatively, system administrators can use the construct of a dynamic disk within Veritas Storage Foundation for Windows (SFW) to leverage block aggregation over multiple vendors’ arrays, just like a virtualization appliance.
Moreover, Veritas SFW provides IT with a number of advantages compared to virtualization appliances. SFW is a pure software solution intended for use by a system administrator, without involving storage or database administrators. Through the construct of dynamic disks, SFW enables system administrators to improve storage resource utilization by combining the capacity of many disk arrays into a single, vendor-agnostic resource—a dynamic disk group that they can easily manage from a central interface.
Under SFW, disks that are upgraded from basic to dynamic contain a database located in a special partition at the end of a disk. That database includes a four-level description schema that contains information about the volume, component, partition, and disk. The database is then replicated over each member of a dynamic disk group. The database makes it possible to virtualize storage resources through block aggregation across multi-vendor storage pools.
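As a rough illustration of how a four-level description can drive block aggregation, the following Python sketch models volumes, components, partitions, and disks as plain records and resolves a logical block on a volume to a physical disk and block. The class names, fields, disk labels, and the simple concatenated layout are assumptions made for illustration only; they do not describe the actual on-disk format of the SFW database.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Disk:
    name: str      # hypothetical label for a LUN on some array
    vendor: str    # hypothetical vendor tag


@dataclass
class Partition:
    disk: Disk
    start_block: int   # first block of this extent on the physical disk
    length: int        # number of blocks in the extent


@dataclass
class Component:
    partitions: List[Partition]   # extents concatenated in order


@dataclass
class Volume:
    name: str
    components: List[Component]

    def resolve(self, logical_block: int) -> Tuple[str, int]:
        """Map a logical block to (disk name, physical block) via the first component."""
        offset = logical_block
        for part in self.components[0].partitions:
            if offset < part.length:
                return part.disk.name, part.start_block + offset
            offset -= part.length
        raise IndexError("logical block beyond end of volume")


# One volume spanning extents on arrays from two different vendors.
disk_a = Disk("diskA", "vendor-1")
disk_b = Disk("diskB", "vendor-2")
vol = Volume("exchange_data",
             [Component([Partition(disk_a, 100, 50), Partition(disk_b, 0, 50)])])
```

In this sketch, logical blocks 0 through 49 land on the first vendor's disk and blocks 50 through 99 on the second, which is the essence of aggregating blocks across arrays behind one logical volume.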
The dynamic disk construct provides IT with a non-disruptive mechanism to isolate and manage logical devices completely free of any physical constraints. Beyond the application of dynamic disks to storage virtualization spanning arrays from multiple vendors, administrators can use Veritas SFW to ensure business continuity and simplify risk management.
When openBench Labs began testing the advanced business continuity functions of SFW, we immediately discovered that our initial scheme of defining dynamic disk groups by storage tiers based on performance characteristics was overly restrictive. SFW leverages the dynamic disk group construct to simplify the configuration and management of critical features, such as volume snapshots and disk mirrors, which ties a volume and its secondary copies to disks in the same group. To follow best practices for storage and utilize less-expensive, lower-performing storage resources for secondary copies of data, we needed to take a broader and more application-centric approach to grouping dynamic disks.
In particular, we needed a dynamic disk group for use with Exchange that contained multiple performance-based tiers of storage, including iSCSI. The Veritas Enterprise Administrator (VEA) console makes it easy to add disks to, or remove them from, a dynamic disk group. Via the VEA console, openBench Labs easily restructured its initial SFW-managed dynamic disk groups to have a broader scope.
Along these lines, small to medium-sized sites that primarily need the advanced functionality of Veritas SFW to manage resources for SQL Server and Exchange will likely find that placing all disks, excluding the OS and boot disks, in a single secondary dynamic group is a simple and effective model. For large enterprises that need to corral a plethora of spindles, a good starting point is to investigate creating dynamic groups that contain all the necessary storage resources to satisfy each class of Service Level Agreement.
Just as global operations, governmental regulations on records retention, and the new focus on e-discovery in civil litigation have changed the nature of backup, so too have these forces altered the fundamental constructs of recovery. The old notion of data recovery from the previous day’s backup is in no way sufficient to satisfy regulatory requirements. For that reason, IT must now plan for the recovery of applications along two dimensions: the time that can elapse before the application is back online and processing transactions, which is dubbed the recovery time objective (RTO), and the amount of data that can be lost in the recovery process, which is dubbed the recovery point objective (RPO).
For every enterprise, the factors of time and data loss create a complex and unique dynamic. For IT, the direct costs associated with the recovery process go up as the RTO is diminished. IT costs also rise as the acceptable amount of data loss for the RPO grows smaller. Indirect costs associated with lost business move in the opposite direction: they rise as the processing outage grows longer.
All of those costs, which are unique for every enterprise, need to be weighed and balanced. While there is no question that data is valuable and downtime is costly, the cost of a disaster-recovery solution for an enterprise must be proportional to the business value that particular enterprise derives from specific IT applications. Without a detailed business risk analysis, it is entirely possible to spend more money to recover from a data-center outage than would be lost by the enterprise over the recovery time period.
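The balancing act described above can be sketched with a few lines of Python. Every figure here is hypothetical: the inverse relationship used for direct costs, the hourly loss rate, and the list of candidate RTOs are assumptions chosen only to show the shape of the trade-off, not data from our testing.

```python
def direct_cost(rto_hours: float) -> float:
    """Hypothetical cost of the recovery solution; it rises as the RTO shrinks."""
    return 200_000.0 / rto_hours


def outage_cost(rto_hours: float, loss_per_hour: float) -> float:
    """Indirect cost of lost business for an outage that lasts the full RTO."""
    return rto_hours * loss_per_hour


def total_cost(rto_hours: float, loss_per_hour: float) -> float:
    return direct_cost(rto_hours) + outage_cost(rto_hours, loss_per_hour)


# Candidate recovery time objectives, in hours (assumed values).
candidates = [1, 2, 4, 8, 24, 48]
best = min(candidates, key=lambda h: total_cost(h, loss_per_hour=5_000.0))
```

With these assumed numbers, the one-hour RTO costs far more to deliver than the business would lose during a somewhat longer outage, so the minimum total cost falls at a mid-range RTO rather than at the fastest option.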
This means that IT needs to analyze all disaster-recovery options along with their costs. The RTO spectrum of recovery options ranges from simple vaulting of daily backup tapes to electronic vaulting and the use of hot sites. RPO options involve the use of log files and culminate with the implementation of two-site, two-phase commits for the highest level of data integrity.
With the new focus on e-discovery, any IT organization running an e-mail application, such as Exchange, needs to take a high-level approach to RTO and RPO backup-and-recovery options. No matter what the company size, IT organizations running Exchange need to maintain uninterrupted and consistent access. This can be a daunting task for a small IT organization, as Exchange I/O involves small, random I/O transactions to multiple databases. Every Exchange Storage Group, which is used for mailboxes and public folders, contains up to five databases (.edb files), along with transaction logs and a checkpoint file.
To meet the level of availability needed for Exchange, IT must recover Exchange Storage Groups or databases significantly faster than by restoring from standard backup media. To accomplish this, the first line of defense is the creation of Storage Group level, point-in-time images of the production databases and transaction logs that remain on the server running Exchange for faster restoration. These snapshot images of Exchange data can be used for a full restoration to the point in time represented by the snapshot. Additionally, IT can use the transaction logs to roll forward from the point-in-time snapshot restoration to the point of failure.
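The difference between a point-in-time restoration and a roll-forward to the point of failure can be sketched as follows. This is a toy model, not Exchange or SFW code: the database is a plain dictionary, and the log record layout (sequence number, key, value) and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

LogRecord = Tuple[int, str, str]   # (sequence number, key, value), hypothetical layout


def restore(snapshot: Dict[str, str], snapshot_seq: int,
            log: List[LogRecord], roll_forward: bool) -> Dict[str, str]:
    """Rebuild a database from a snapshot, optionally replaying later log records."""
    db = dict(snapshot)                    # point-in-time state
    if roll_forward:
        for seq, key, value in sorted(log):
            if seq > snapshot_seq:         # replay only what the snapshot lacks
                db[key] = value
    return db


# Snapshot taken after log record 2; records 3 and 4 arrived before the failure.
snapshot = {"inbox/alice": "3 messages"}
log = [(1, "inbox/alice", "2 messages"),
       (2, "inbox/alice", "3 messages"),
       (3, "inbox/alice", "4 messages"),
       (4, "inbox/bob", "1 message")]

point_in_time = restore(snapshot, 2, log, roll_forward=False)
point_of_failure = restore(snapshot, 2, log, roll_forward=True)
```

The point-in-time result matches the snapshot exactly, while the roll-forward result also contains the two updates recorded after the snapshot was taken.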
Nearly every SAN disk array provides a proprietary snapshot mechanism for that array. Veritas SFW also provides a snapshot mechanism: Veritas FlashSnap. From the VEA console, FlashSnap can utilize any disk with available space in a dynamic disk group, which means snapshots can be configured on multiple vendors’ arrays.
Frequently, disk arrays employ copy-on-write or metadata snapshots, which save disk space and processing time by only copying changed blocks to the snapshot. As a result, the snapshot volume is only useful if the original volume remains online.
FlashSnap does not employ copy-on-write technology. Instead, it uses a split-mirror process to create an exact duplicate of the original volume (i.e., a snapshot) that can be moved from disk to disk and server to server at will. During resynchronization of its snapshots, it nonetheless copies only changed blocks to speed the process. With these snapshots, IT can perform backups, application testing, or any other processing on another server without affecting production performance.
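A minimal sketch of the split-mirror idea follows, assuming a volume modeled as a list of blocks and a set that tracks blocks changed since the split. The class and method names are hypothetical, and the code is not a description of how FlashSnap is implemented; it only shows why a split mirror is a complete copy and why resynchronization can be limited to changed blocks.

```python
class MirroredVolume:
    """Toy volume with one mirror that can be split off and later resynchronized."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.mirror = list(blocks)   # full copy, kept in sync while attached
        self.attached = True
        self.dirty = set()           # blocks changed since the split

    def write(self, index, data):
        self.blocks[index] = data
        if self.attached:
            self.mirror[index] = data
        else:
            self.dirty.add(index)

    def split(self):
        """Detach the mirror; it is now a complete, independent snapshot."""
        self.attached = False
        self.dirty.clear()
        return self.mirror

    def resync(self):
        """Reattach the mirror, copying only the blocks changed since the split."""
        copied = len(self.dirty)
        for index in self.dirty:
            self.mirror[index] = self.blocks[index]
        self.dirty.clear()
        self.attached = True
        return copied


volume = MirroredVolume(["A", "B", "C", "D"])
snap = volume.split()
volume.write(1, "B2")              # production keeps changing after the split
snap_before_resync = list(snap)    # the snapshot still holds every original block
copied = volume.resync()
```

Because the snapshot holds every block rather than only the changed ones, it remains usable even if the original volume is unavailable, which is the contrast with the copy-on-write approach described earlier.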
More importantly, Veritas FlashSnap integrates with the Microsoft Volume Shadow Copy Service (VSS) to create Microsoft supported and guaranteed persistent snapshots of all volumes associated with an Exchange Storage Group or an SQL Server database without taking either application offline. In particular, SFW calls upon VSS to notify the Exchange VSS Writer or SQL VSS Writer to momentarily quiesce the application and allow FlashSnap to take a split-mirror snapshot.
Through a collection of wizards, which can be launched from within the VEA console, all of the processes associated with creating and restoring a snapshot are fully automated. Snapshots of the storage groups can be reattached and resynchronized to match the current state of the storage group with the VSS Refresh Wizard. Moreover, the VSS Snapshot Scheduler Wizard can be invoked to create a fully automated snapshot process based on multiple time intervals.
For many administrators, however, the most important benefit of the wizard-oriented, VSS-based backup solution is that it allows for very rapid recovery of the databases and logs associated with an Exchange storage group. The VSS Restore Wizard can use any snapshot created with the VSS Snapshot Wizard to perform either a point-in-time recovery or a roll-forward recovery to a point of failure. The wizard can operate on an entire storage group or, for a point-of-failure recovery, an individual database within a storage group. Using these wizards within VEA, system administrators have a point-and-click solution for recovering an entire Exchange system in far less time than tape-based backup can provide.
The primary role of snapshot volumes is to provide fast recovery of applications after data corruption on the server on which they are running. Nonetheless, the ease with which a system administrator can move disks managed by SFW from server to server without the assistance of a storage administrator also allows the use of snapshot volumes to store backups remotely. This use of snapshots can provide a low-cost option for companies that have all their production servers in a central site. This option supports a reasonably fast RTO as well.
Large enterprises with multiple data centers, a high volume of transactions, and a low tolerance for data loss may find the optional Veritas Volume Replicator (VVR) package that complements SFW a necessity. The key value proposition of VVR is that it is designed specifically to provide a mechanism for a full disaster-recovery solution utilizing heterogeneous storage over any distance.
VVR integrates a replication function into SFW by passing heartbeat and data messages via UDP or TCP over a LAN or WAN to provide consistent, up-to-date copies of application data on the source (dubbed the primary server by VVR) to remote targets (secondary servers). If the primary server goes down for any reason, then the replicated application data is immediately available on a secondary server. To this end, VVR can replicate data in either synchronous or asynchronous mode.
Synchronous transmissions are needed to keep secondary servers consistently up to date with the source host. VVR can be set to automatically fall back to asynchronous mode whenever a synchronous replication link (RLINK) for communications between a primary and secondary server is down, and then resume synchronous mode when the RLINK is functional.
Nonetheless, to use a secondary host in a disaster-recovery scenario, regardless of whether the data on the secondary host is up to date, the data on that secondary host must at a minimum be consistent with the data on the primary server at some point in time. In the case of a database, consistency requires that the database on the secondary host contain all of the transaction updates in order up to some point in time and none of the updates that occurred after that point.
To maintain the required consistency, VVR preserves write-order fidelity on each data volume in the replication set, no matter what transmission mode is in use. For an Exchange Storage Group, VVR tracks the order of writes made to both the log and data volumes and maintains that order when applying the writes on the secondary host. Without write-order fidelity, the databases in an Exchange Storage Group may not recover successfully on fail-over to the secondary server. Such a fail-over, it should be noted, is not restricted to disasters. With VVR in place, IT can easily transfer the role of the primary server to a secondary host if it becomes necessary to bring down the primary server for maintenance purposes.
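Write-order fidelity can be illustrated with a small sketch in which the secondary host buffers any write that arrives early and applies writes strictly in sequence. The sequence numbers, class name, and volume labels are assumptions for illustration; the source does not describe how VVR tags or buffers writes internally.

```python
class SecondaryApplier:
    """Applies replicated writes in their original order, across all volumes."""

    def __init__(self):
        self.volumes = {}    # volume name -> {block: data}
        self.pending = {}    # sequence number -> (volume, block, data)
        self.applied = []    # sequence numbers in the order they were applied
        self.next_seq = 1

    def receive(self, seq, volume, block, data):
        self.pending[seq] = (volume, block, data)
        # Apply every write that is next in line; hold back anything that is early.
        while self.next_seq in self.pending:
            vol, blk, payload = self.pending.pop(self.next_seq)
            self.volumes.setdefault(vol, {})[blk] = payload
            self.applied.append(self.next_seq)
            self.next_seq += 1


applier = SecondaryApplier()
applier.receive(2, "data", 0, "mailbox row")    # arrives early, so it is held back
held_back = dict(applier.volumes)               # nothing has been applied yet
applier.receive(1, "log", 0, "log record")      # now both apply, log write first
```

At every moment the secondary holds a prefix of the primary's write history, so a database write never appears without the log write that preceded it.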
To replicate volume groups over longer distances, VVR tracks all writes using a Replicator Log volume in a two-phase acknowledgement scheme that utilizes a mix of synchronous and asynchronous writes to ensure data consistency and improve application performance by limiting sources of latency. As part of this scheme, VVR places the log volume, along with all secondary replicated data volumes, into a special Replicated Volume Group (RVG).
While VVR provides complete data integrity and consistency in either asynchronous or synchronous mode, to support a zero data loss RPO, VVR must be run in synchronous mode. On the primary server, each write to the primary data volume generates a second synchronous write to the Replicator Log. These requests are persistently queued in the order that they are received. Should the primary server crash before the write to its data volume has completed, the data can be recovered from the Replicator Log.
In synchronous mode, VVR next transmits the write request as a message to the secondary server over the RLINK. The write to disk on the secondary server is actually an asynchronous background task that does not affect application performance. The application on the primary server is only constrained by the time it takes to transmit the write message and receive an acknowledgement from the secondary server.
Even though the data has not yet been written to disk on the secondary server, it is stored and recoverable from the Replicator Log on the primary server. When in synchronous mode, a zero data loss RPO can be achieved with the secondary data volume by rolling forward any non-committed transactions in the Replicator Log of the primary server. After the secondary server writes the data to its local disk, it sends a second acknowledgement to the primary server. Only when the primary server receives this second data acknowledgement does it discard the write from the primary Replicator Log.
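The sequence of log write, network acknowledgement, and data acknowledgement can be sketched as a pair of toy classes. Everything here is hypothetical, including the class names, the acknowledgement strings, and the in-memory dictionaries standing in for disks and the Replicator Log; the sketch only mirrors the steps described above and makes no claim about VVR's internal design.

```python
class Secondary:
    def __init__(self):
        self.memory = {}   # writes received but not yet on disk
        self.disk = {}

    def receive(self, seq, block, data):
        self.memory[seq] = (block, data)
        return "network-ack"           # first acknowledgement: message received

    def flush(self, seq):
        block, data = self.memory.pop(seq)
        self.disk[block] = data
        return "data-ack"              # second acknowledgement: data is on disk


class Primary:
    def __init__(self, secondary):
        self.secondary = secondary
        self.replicator_log = {}       # sequence number -> (block, data)
        self.disk = {}
        self.seq = 0

    def write(self, block, data):
        self.seq += 1
        self.replicator_log[self.seq] = (block, data)   # queue in the log, in order
        self.disk[block] = data
        if self.secondary.receive(self.seq, block, data) != "network-ack":
            raise RuntimeError("secondary did not acknowledge receipt")
        return self.seq                # the application can continue here

    def on_data_ack(self, seq):
        self.replicator_log.pop(seq, None)   # discard only after the data ack


secondary = Secondary()
primary = Primary(secondary)
seq = primary.write(7, "mail item")
logged_after_write = dict(primary.replicator_log)   # still recoverable from the log
secondary_disk_after_write = dict(secondary.disk)   # not yet on the secondary's disk
if secondary.flush(seq) == "data-ack":
    primary.on_data_ack(seq)
```

Between the two acknowledgements the write exists in the primary's log even though the secondary has not committed it, which is the window the roll-forward from the Replicator Log is meant to cover.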
As a result of this two-phase acknowledgement scheme, consistency is ensured, since the data is recoverable at all phases of a replication transaction. What’s more, for the ultimate in recoverability, VVR can implement a Bunker Replication scheme, which adds another server—the bunker—that only replicates the Replicator Log, in case of total primary server failure. In addition to supporting a zero data loss RPO, this scheme limits overhead latency in synchronous mode to that of network acknowledgement traffic.
To test VVR performance, openBench Labs used the network monitoring facilities of Veritas Storage Foundation for Windows to observe data rates as it copied 30GB of file data from a workstation to a public folder that was being replicated to a secondary server. Data on the secondary server remained consistent with the primary server throughout the test. At the primary server, the I/O at the primary Replicator Log volume, as measured by the number of I/O operations per second and the throughput, varied from that of the primary data volume by about 1.5%. More importantly, data throughput at the secondary disk volume also varied less than 1.5% from the throughput rate at the primary Replicator Log.
To lower the costs of a disaster-recovery plan, Veritas SFW provides wizard-based tools that meet RTO and RPO requirements in a cost-effective manner. SFW wizards remove the guesswork of configuring data replication strategies to minimize the time it takes to make data available after a disaster through such techniques as point-in-time data copies and continuous data replication. SFW wizards also make it easy for system administrators to minimize the amount of data lost. In particular, wizards are available to support the entire process of using logs to roll point-in-time snapshots forward to a point of failure.
What’s more, once the wizards are invoked to create a disaster-recovery configuration with a sufficient level of protection, that DR configuration can be cloned and reused to provide the same level of protection to numerous systems. Thanks to the dynamic disk virtualization construct, there are no physical storage infrastructure constraints. Everything is portable. In particular, the physical storage arrays supporting the original data no longer need to be identical to the physical storage arrays that support the snapshots or replicas.
The bottom line for disaster recovery is that it’s a matter of when, not if. To deal effectively with the problem, without risking new business or running up unnecessary costs, a business continuity plan needs to balance true risks with real costs. To tilt that balance in favor of a quicker and more accurate response, storage virtualization via the use of dynamic disks can lower contingency costs, dramatically in many cases. In this vein, SFW provides system administrators with the power to reinstate order from chaos.
Jack Fegreus is CTO of openBench Labs (www.openbench.com).