Storage area network backup will take place in three stages: LAN-free, integrated media and devices, and server-less backup.
By Marc Farley
Network backup has been a problematic systems management issue for several years. In general, it has been nearly impossible to keep up with the growth of network-resident data that needs backup protection. This article explores how storage area network (SAN) technology offers powerful solutions to the difficult problems that have plagued network backup.
Centralized backup over Fibre Channel
Market research has indicated for many years how important it is to centrally manage storage resources and processes, including backup and recovery. Centrally controlled network backups provide significant savings in the time spent managing backups. Organizations that have implemented dedicated backup backbone networks have appreciated this already. But with SAN technology, the dedicated backup network can be implemented exclusively on the storage I/O side, without having to transfer data back and forth over the data network. In addition, a SAN used as a dedicated backup network has much more available bandwidth.
Table 1 compares three backup approaches. The first approach, data network backup, is the typical way of implementing network backup using the existing data network. The second approach, using a dedicated backup backbone, uses a completely different data network to isolate backup traffic. The third approach, SAN backup, has several variations on a common theme of sending backup traffic over the SAN.
The assumptions in Table 1 include 100Mbps network technology for the first two approaches. The higher throughput expectations indicated assume an FDDI network, as opposed to 100BaseT Ethernet. Gigabit Ethernet would increase the throughput numbers considerably, but it still would have the other characteristics indicated in Table 1 that would impact its performance potential. The SAN throughput indicated is conservative. In reality, SAN transfers could be in the 70MBps to 80MBps range. However, tape subsystems are generally not capable of sustaining this type of performance.
Comparing network protocols
Traditionally, server-to-storage I/O uses a master/slave relationship. The protocols to support cooperative processing between servers and storage are in their infancy and not well developed. By contrast, the history of TCP/IP networking is much longer and much more developed. While TCP/IP networks have developed around the client/server computing model in the last decade, a great deal of their communications capability is based on peer-to-peer communications between systems of similar architecture and capabilities. TCP/IP networks have a great deal of flexibility for the design of distributed services and applications.
Storage networking, by comparison, is attempting to establish higher-level communications capabilities between dissimilar entities: storage devices/subsystems and computer systems. The general idea is that storage subsystems should be able to manage themselves more effectively as intelligent processing entities working together in a network environment, in contrast to the master/slave relationship they have in bus-attached storage. Eventually, the development of intelligent SAN devices and the protocols to use them will be completed, but it will take several years of development and standards work. The use of IP protocols within the SAN could enable the acceleration of cooperative storage management between servers and devices. The section on server-less backup later in this article discusses this promising area in more detail.
Three stages of SAN backup
There are three different ways to apply SAN technology to backup to solve some of the chronic problems in network backup: LAN-free, integrating media and devices, and server-less backup.
Stage 1: LAN-free
Figure 1: LAN-free backup moves backup traffic off the data network and puts it on a SAN.
The first big application of SANs for backup is to move backup traffic off the data network and put it on a SAN. This concept is called LAN-free backup, and is shown in Figure 1. The idea is the same as a dedicated backup network, but this time it's a SAN instead of an IP data network.
A high-level view of the I/O path for LAN-free backup is shown in Figure 2, which illustrates the efficiency of using a SAN, as opposed to routing backup over a data network. The SAN approach requires the data to pass through only one system, whereas moving backup traffic over the data network means it must pass through two systems.
Mixing platforms with LAN-free backup. SAN technology is platform-independent with broad connectivity support across all open-systems server platforms. Unlike bus-attached storage, SANs were designed for use in multiple-initiator, shared-access environments, which means several computers can connect and work on a SAN simultaneously. It follows that multiple independent backup systems can share a single SAN, carrying backup data concurrently for multiple servers to their corresponding backup devices.
Figure 2: SANs improve backup efficiency compared to routing backup over a data network. The SAN approach requires data to pass through only one system.
The servers, backup applications, and backup devices can be any combination. In other words, different backup systems can be deployed to match the requirements of individual servers. For example, a Unix server would likely have a Unix-oriented backup application running on it and an NT server would likely run a backup application from a company with a Windows NT orientation. Figure 3 shows a SAN backup network with three independent backup systems on three different types of servers, sharing the network, each reading or writing to its own tape device.
Figure 3: A SAN can be configured with independent backup systems on different types of servers.
Segregating devices for SAN backup. The SAN device shown in Figure 1 connects all three servers, their storage, and their backup subsystems on the same network. While this looks great, there are potentially serious problems to work out in the way devices are accessed and shared in the SAN. Just as operating systems are not consistent in their level of sophistication for sharing SAN-resident disk volumes, there is no standard method yet for making sure that tape drives are accessed correctly, without errors, by competing systems. Backup vendors do not yet have a standard mechanism for interoperating on the SAN, although there have been solutions proposed for many years. More on that will come later in the discussion of the second stage of SAN backup.
Zoning for backup. While a SAN integrates all the equipment on a single cabling structure, it can also segregate equipment logically into separate I/O paths through zoning. While zoning is not necessarily required for all combinations of servers and devices, it may be necessary for multiplatform SANs. Zoning technology enables multiple initiators to access multiple devices on a single I/O channel, without requiring this support from the server platforms using it. For that reason, the first stage in SAN backup uses zoning to create virtual private backup networks for the various platforms.
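As a rough illustration of the zoning idea, the following Python sketch models zones as named sets of ports and lets an initiator reach a target only when both belong to a common zone. The zone names and port names are invented for the example, and real SAN equipment implements zoning in its own firmware and management tools rather than in code like this.

```python
# Hypothetical zone table: zone name -> set of member ports (all names are invented).
ZONES = {
    "unix_backup": {"unix_server", "unix_disk", "tape_a"},
    "nt_backup": {"nt_server", "nt_disk", "tape_b"},
}


def can_access(initiator, target, zones=ZONES):
    """Return True if initiator and target are members of at least one common zone."""
    return any(initiator in members and target in members
               for members in zones.values())


print(can_access("unix_server", "tape_a"))  # both ports are in the unix_backup zone
print(can_access("unix_server", "tape_b"))  # the ports share no zone
```

The servers themselves need no awareness of the zone table, which matches the point that zoning does not require support from the server platforms.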
Reserve/release. Another way to circumvent the problems of concurrent access to backup devices over a SAN is to use a bridge, router, or other SAN device that provides support for SCSI reserve/release functions. Reserve/release was originally developed for parallel SCSI operations and is also supported by mappings of serial SCSI in storage networks.
The concept of reserve/release works pretty much the way it sounds. An application can reserve a particular backup device and save it for its own use until it no longer needs it, at which time it releases it for other applications. Compared to more sophisticated device and sharing management systems, reserve/release is pretty basic, but at least it provides a way to keep two initiators from fouling up by sending backup data to the same device concurrently.
Figure 4: Bridges or routers can provide support for SCSI reserve/release functions, which can solve the problems associated with concurrent access to backup devices over the SAN.
Figure 4 illustrates a reserve/release function implemented in a SAN device where the SAN device has reserved the access to the backup device for one server and replies to other servers that the device is busy and unavailable.
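The reserve/release behavior described above can be sketched as a small Python model. This is only an illustration under simplifying assumptions: the class name, method names, and the "BUSY" reply string are hypothetical and do not reproduce actual SCSI command or status encodings.

```python
class SharedDevice:
    """Toy model of a backup device behind a bridge/router that honors reserve/release."""

    def __init__(self, name):
        self.name = name
        self.holder = None  # initiator currently holding the reservation, if any

    def reserve(self, initiator):
        """Grant the reservation if the device is free or already held by this initiator."""
        if self.holder in (None, initiator):
            self.holder = initiator
            return True
        return False  # held by another initiator

    def release(self, initiator):
        """Only the current holder can release the device."""
        if self.holder == initiator:
            self.holder = None
            return True
        return False

    def write(self, initiator, data):
        """Accept data only from the initiator holding the reservation."""
        if self.holder != initiator:
            return "BUSY"
        return "OK"


tape = SharedDevice("tape0")
tape.reserve("server1")
print(tape.write("server2", b"backup data"))  # server2 does not hold the reservation
```

Note that nothing here arbitrates who should get the device next; the model only keeps two initiators from writing to the same device at once, which is the limited guarantee the text attributes to reserve/release.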
Automation of media handling for removable media devices, such as tape libraries, is a fairly mature technology that has been implemented for server systems since the early 1990s. The capacity and automation benefits of such equipment are clear: they reduce errors and allow unattended tape changes whenever they are needed. This last point has not been lost on network administrators who no longer have to visit their computer rooms on weekends to change backup tapes. However, with every silver lining comes a cloud: one of the drawbacks of automated tape products has been the difficulty in sharing this relatively expensive resource among applications, not to mention servers.
As SANs increase in popularity, it is highly likely that tape libraries and other media automation devices will also become more popular as SAN technology enables them to be shared by multiple servers. In a multi-drive tape library, it's easy to see how the various drives in the library could be allocated to the various servers in the SAN.
Sharing a library. The robot component of a library automates media changes and movements. While there may be multiple tape drives within a tape library being used by multiple servers, there is usually only one robotic mechanism. So, the question is: How is the robot controlled? Zoning does not provide a solution to this problem. The tape library and its robot are a single entity that cannot be split or shared across zones. Reserve/release provides the basic capability to share a robot.
While one server is using the robot and has it reserved, another server cannot use it. When the first server is done, it releases the robot and another server can access it.
Bridges and routers. Most tape drives and libraries have parallel SCSI interfaces and not native SAN interfaces. Therefore, one should expect that a SCSI-to-SAN bridge or router will be required to connect tape drives, autoloaders, and libraries to the SAN.
Even if a library has an external SAN connection port, it probably has an integrated bridge or router. This situation will change over time as tape library manufacturers integrate SAN technology in more of their products.
For the purposes of the current discussion, bridges/routers can provide the reserve/release mechanism for all the tape devices and the robot in the library. This allows unintelligent tape libraries lacking reserve/release capabilities to be shared in a SAN.
Protecting tapes in shared libraries. While reserve/release allows the robot to be shared, it does not do anything to protect the media inside it. Using reserve/release, any application on any server is free to use the robot and access any of its tapes, including tapes that it does not control. If the application does not recognize the tape as one of its own, it may decide to format it and overwrite it with its own data.
This scenario can be prevented by configuring backup software applications to access only certain slots in the library. For instance, assume there is a 60-slot library being shared by three different servers. To divide the capacity of the library equally among the three servers, you would configure the library control software in each server to access 20 slots. For example, server 1 would access slots 1 to 20, server 2 would access slots 21 to 40, and server 3 would access slots 41 to 60. This way, tapes from different applications can be kept apart by the various software applications. While this approach works, it does not enforce segregation in the library. Changes to the software configurations need to be carefully controlled and managed to prevent slot assignments from overlapping across server boundaries.
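Because the slot assignments are enforced only by configuration, a simple check for overlaps can help when those configurations change. The Python sketch below reproduces the 60-slot, three-server split from the example and flags any overlapping ranges. The function names are hypothetical, and the sketch assumes each server's assignment is a single contiguous range.

```python
def partition_slots(total_slots, servers):
    """Split slots 1..total_slots into equal contiguous ranges, one per server."""
    if total_slots % len(servers) != 0:
        raise ValueError("slots do not divide evenly among servers")
    size = total_slots // len(servers)
    return {name: range(i * size + 1, (i + 1) * size + 1)
            for i, name in enumerate(servers)}


def find_overlaps(assignments):
    """Return pairs of servers whose slot ranges share at least one slot."""
    names = sorted(assignments)
    overlaps = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if set(assignments[first]) & set(assignments[second]):
                overlaps.append((first, second))
    return overlaps


plan = partition_slots(60, ["server1", "server2", "server3"])
for name, slots in plan.items():
    print(name, slots.start, "to", slots.stop - 1)
```

Running find_overlaps on the combined configuration after every change is one way to catch assignments that have drifted across server boundaries.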
When multiple copies of the same backup application are installed on multiple servers sharing a library, it is likely that each server will recognize the tapes from the other servers because they will all have a familiar tape format. In this case, it is less likely that tapes will be overwritten when accessed by a different server, and it may not be necessary to segregate slots as discussed previously.
However, it is very important in this case to ensure a unique naming scheme is used for each server sharing the library to prevent the server's library control software from selecting the wrong tape. Using unique naming for each backup installation is a good idea in any case, but it is imperative for library sharing.
Zoning and reserve/release. Zoning and reserve/release are two different approaches to the same problem. As it turns out, there may be cases where an organization will implement both techniques at the same time. Zoning can do many things, but one of its most important functions is segregating different platforms on the same SAN. Reserve/release, on the other hand, works very well in homogeneous environments where multiple servers using the same backup application (and on the same platform) can share the backup devices on it. Reserve/release sharing depends on each application being a well-behaved citizen, and violations of good citizenship can cause backup failures.
Figure 5: SANs can be implemented with zoning for segregating platform traffic, and with reserve/release functionality to share devices within the zones.
So, it is likely that there will be SANs implemented with zoning for segregating platform traffic and reserve/release to share devices within these zones. For instance, a two-platform SAN with a pair of Unix systems and a pair of NT servers could have two zones in the SAN and two different backup systems, backing up data to a pair of tape libraries through two storage routers or bridges. Figure 5 illustrates this example.
SAN backup vs. other methods. As mentioned, the LAN-free virtual private network backup implementation is similar to the backup backbone network discussed above, but it is also quite similar to the idea of dedicated stand-alone backup discussed previously. The primary difference is the implementation of a high-speed, shared I/O channel to route all the backup traffic to devices in a centralized location. Because all the backup equipment can be centrally located, backup administrators don't have to visit backup systems in multiple locations to verify the correct media is loaded and the devices are in working order. Backup administrators can save hours every day by having all the media and devices in the same location.
Even more important, the performance of SAN backup is excellent. With transfer rates of 100MBps, and most tape drives for servers supporting less than 10MBps, there is no lack of bandwidth to support backup. So, not only can it cost less to manage backup, but there is no performance penalty to pay for achieving it. It's not always possible to scale performance and achieve management efficiencies at the same time, but LAN-free backup does just that.
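The bandwidth claim is easy to check with back-of-the-envelope arithmetic. The Python sketch below uses the 100MBps and 10MBps figures from the text; it ignores protocol overhead and assumes 1GB equals 1,000MB, and the 360GB volume is an arbitrary example value.

```python
SAN_MBPS = 100   # SAN transfer rate quoted in the text, in MBps
TAPE_MBPS = 10   # upper bound for a typical server tape drive, per the text


def drives_supported(link_mbps, drive_mbps):
    """Number of drives that could stream at full speed over one link."""
    return int(link_mbps // drive_mbps)


def hours_to_back_up(gigabytes, rate_mbps):
    """Hours needed to move the given volume at a sustained rate (1 GB = 1000 MB)."""
    return gigabytes * 1000 / rate_mbps / 3600


print(drives_supported(SAN_MBPS, TAPE_MBPS))
print(hours_to_back_up(360, TAPE_MBPS))
```

Under these assumptions one SAN link could feed at least ten such drives at once, so the tape drives, not the SAN, set the pace of a backup.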
While LAN-free backup has some significant advantages, it still has its shortcomings. Reserve/release provides device sharing, but it does not provide centralized control and management of how devices are accessed. One can imagine a backup system where the backups of certain business applications are given higher priorities and device assignments than less important applications. Also, segregating media access by server is certainly not the best solution either, as distinct media collections have to be managed separately. A single integrated media management solution could simplify things a great deal. This topic is discussed in the section on the second stage of SAN backup.
Likewise, improvements are needed for other logical backup components such as operation schedules, device managers, media managers, and metadata access. It should also be pointed out that just because backup data transfers move across the SAN, it does not mean they relieve servers from the problems associated with running backup applications alongside other business applications.
Overall, LAN-free virtual private backup networks are a vast improvement over data network backup solutions. Furthermore, the installation of a LAN-free backup system builds the infrastructure for later stages of backup and data management over SANs.
Stage 2: Integrating media and devices
The second stage of SAN-enabled backup integrates the device and media management components that were intentionally segregated in the previous stage. As discussed previously, the selection of devices for specific backup tasks and the segregation of media into discrete collections are two shortcomings of the LAN-free backup approach.
In contrast, if any system could access any device within the limits set by system-wide policies, it would also be advantageous if similar policies could enable the selected device to use any available media. In fact, it might be more appropriate to select the device based on some combination of media availability and the performance/capacity requirements of the specific operation.
In addition, a consistent logical and physical format across all tapes and the ability to manage them in a meaningful hierarchy can simplify the management effort considerably.
A single integrated SAN backup system can encompass all logical components of a backup system, including operations management, data transfers, error reporting, and metadata processing. Not only can an organization realize cost savings through reduced administration effort, but it can also improve its disaster preparedness if common backup software and hardware are deployed throughout the organization on all platforms.
Media/device standardization. An open systems integrated backup system depends on backup vendors working together and agreeing on standards for identifying and accessing devices and media. Unfortunately, this does not seem to be realistic due to competitive forces. So instead, these systems will probably be somewhat proprietary, with some partnerships occurring between vendors until standards can be adopted.
The standards for implementing some of these functions already exist, but they have not been widely implemented. In the early to mid-1990s some of the large government laboratories began working on a storage management standardization effort. Known as the 1244 Mass Storage Reference Model, this standardization work covered a broad range of storage management functions. By and large, the work done by this group was intended to apply to its particular large-scale data processing needs. However, as the amount of data continues to increase in the commercial data centers, this technology has become more relevant. Commercialized versions of various pieces of this specification are starting to make their way into leading network backup products. Legato sells an implementation of parts of this work in a product called Smart Media, and Veritas' Virtual Media Librarian has also borrowed from the concepts in this work. Time will tell how these products evolve, and whether or not real open-systems interoperability can be achieved. Backup vendors will need to find a way to jointly define interface specifications and test procedures to ensure interoperability in the market.
Integrated SAN backup. Whether this type of implementation is proprietary or open, there is still a common design ideal at work. Figure 6 shows one example of an integrated backup system where multiple servers have their data backed up to a single tape library.
While it is possible to build an integrated backup system with individual tape devices, it is much more likely to deliver management efficiencies by using multiple drive tape libraries. Tape libraries provide the means for the backup system to automatically react to problems as they occur. This can include the requirement for loading blank tapes as well as previously used tapes belonging to the same media set, or pool.
Establishing connections. Although the tape library pictured in Figure 6 has a SAN bridge, or router, for SAN connectivity, it is not providing reserve/release functionality as it did in the previous examples of LAN-free backup. Instead, the bridge/router implements an access gate through which each of the servers can establish backup sessions.
The access gate provides a function similar to reserve/release, but uses a security mechanism as opposed to a device command. This security mechanism is a higher-level software function that can be controlled by a centralized management system.
With a security mechanism in place at the tape library, each server requests the services of a device in the library as it is needed. The library then can control which device it makes available. This approach makes all devices generally available to all servers and establishes an orderly mechanism supporting intelligent subsystem path routing and error recovery as well as the prioritization of applications and the policies for managing their activities.
Besides the access gate, this integrated system adds another global management function, called a connection broker, which generates the keys used by servers when communicating with the access gate. Together, the connection broker and the access gate form the two halves of a key-based security system in an integrated SAN backup system. The precise role and scope of the access gate and the connection broker could differ, depending on the level of intelligence in both components. For example, the connection broker could determine the best device for the server to use and provide that information in the key or directly to the access gate. Similarly, the connection broker could provide the priority level of the server with the key and the access gate could determine the best drive to use.
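One way to picture the key exchange is a signed grant that the broker issues and the gate verifies. The Python sketch below is purely hypothetical: the article does not specify how keys are built, so the use of an HMAC over a shared secret, the key layout, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac

SHARED_SECRET = b"demo-secret"  # assumption: broker and gate share a secret


def issue_key(server, drive, priority, secret=SHARED_SECRET):
    """Connection broker side: sign a (server, drive, priority) grant."""
    grant = f"{server}|{drive}|{priority}"
    tag = hmac.new(secret, grant.encode(), hashlib.sha256).hexdigest()
    return f"{grant}|{tag}"


def admit(key, server, secret=SHARED_SECRET):
    """Access gate side: return the granted drive if the key is valid for this server."""
    try:
        granted_server, drive, priority, tag = key.split("|")
    except ValueError:
        return None  # malformed key
    grant = f"{granted_server}|{drive}|{priority}"
    expected = hmac.new(secret, grant.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(tag, expected) and granted_server == server:
        return drive
    return None


key = issue_key("server1", "drive2", priority=5)
print(admit(key, "server1"))
```

In this variant the broker chooses the drive and embeds it in the key. The alternative mentioned above, where the key carries only a priority and the gate picks the drive, would change what the grant contains but not the signing and checking steps.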
This type of connection mechanism is well suited to large systems with policy management that can enforce priorities during backup operations. For instance, it is theoretically possible to reroute a high-priority operation to another lower-priority connection if needed.
Separation of control and data. The connection broker does not have to be in the path between a server and the device it is working with. It only needs to be able to communicate with both servers and devices. In fact, it is not necessary that the connection broker be connected to the SAN at all; its function could be provided over the data network as long as the access gate is also capable of communicating on the data network. The entire key exchange mechanism could be handled on the data network.
Figure 6: A bridge or router can implement an access gate through which each of the servers can establish backup sessions.
This raises an important architectural point about SAN-oriented backup: the communications that control the operation and the backup data can travel on different paths and networks. This is sometimes referred to as separating the control path and the data path. Today, SANs almost exclusively run the FCP storage protocol. However, network backup systems use the TCP/IP data networking protocol to communicate between backup engines and source system agents. Therefore, it is somewhat likely that SAN backup implementations will use separate control and data paths for some time, until multi-protocol implementations are widely deployed on SANs.
Sharing a robot. One of the concepts central to an integrated SAN backup system is library sharing, including access to the robot. There are two basic ways this access could be shared:
- Brokered access with decentralized robotic control.
- Providing an interface to a centralized media changing service.
The idea of brokering access is exactly the same as described in the previous sections for creating connections between servers and drives. The problem with this approach is that an individual application could load or change tapes in the library, regardless of what other applications or the connection broker are expecting. For instance, a poorly behaved application could unload tapes in the middle of another server's operations.
The other, more controlled, way to share library access is to provide a media changing service using a client/server model. For instance, an application could request that a tape be loaded as part of an upcoming job. If there is heavy competition for the library's resources, the connection broker could determine if this request is a high enough priority to interrupt work in progress. In addition, a connection broker could be configured to restrict access from certain applications to specified media locations in the library as an extra means of protecting high-priority data.
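A centralized media changing service of this kind can be sketched as a priority queue with per-application slot restrictions. The Python model below is a simplified assumption: the class and method names are invented, lower numbers mean higher priority, and the sketch only orders pending requests without modeling the interruption of work already in progress.

```python
import heapq
import itertools


class MediaChangeService:
    """Toy centralized queue for robot requests; lower number = higher priority."""

    def __init__(self, allowed_slots):
        self.allowed_slots = allowed_slots  # application name -> set of slots it may use
        self._queue = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def request_load(self, app, slot, priority):
        """Queue a load request; reject slots outside the application's allowance."""
        if slot not in self.allowed_slots.get(app, set()):
            return False
        heapq.heappush(self._queue, (priority, next(self._order), app, slot))
        return True

    def next_job(self):
        """Return (app, slot) for the most urgent pending request, or None."""
        if not self._queue:
            return None
        _, _, app, slot = heapq.heappop(self._queue)
        return app, slot


service = MediaChangeService({"payroll": {1, 2}, "test": {3}})
service.request_load("test", 3, priority=5)
service.request_load("payroll", 1, priority=1)
print(service.next_job())
```

Because every tape movement goes through one queue, no application can unload a tape behind another application's back, which is the weakness of the decentralized approach.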
Stage 3: Server-less backup
Another technology area being developed for SAN backup systems is one where independent entities on the SAN provide the device-to-device operations on behalf of servers and data management applications. This is called server-less backup, or third-party copy.
Data movers. As discussed previously, one of the major benefits of SANs is the independence of data from systems. Devices in SANs no longer are restricted to belonging to any particular server. That means that a function in any server can access a backup device and perform operations. In fact, the entities accessing devices in a SAN do not have to be servers or systems at all. They can be any type of SAN component, including hubs, routers, switches, or even host I/O controllers. The idea is that an entity that has the intelligence and the necessary hardware can access devices in the SAN and act as a data mover, transferring data from one device to another. A data mover in a SAN could be used to perform data and storage management functions.
Third-party copy transfers. Given the number of processors and resources available in many different storage and SAN devices that could function as data movers, it's not difficult to imagine backup and storage management functions as independent entities. A SAN-enabled backup application could query a source server about the data in its file or database system residing in the SAN, and then initiate a data-mover operation to copy it to a backup device in the network. This is known as third-party copy, and its process is illustrated in Figure 7. The independence of devices and applications is the impetus for the term server-less backup. By removing both hardware control and software resources from a server, the backup load is reduced enormously.
Server-less backup server agents. In Figure 7, the backup application first queries an agent on the server, which identifies the data to back up. But instead of sending a list of files to back up, the agent sends a list of blocks for each of the files to back up. The backup application then transfers the block lists to the data mover in the SAN. The data mover receives the job list and starts transferring the data from disk to tape.
The agent that generates these block lists is an important component of a server-less backup system. This agent has the ability to query the file system on the server and receive back the list of blocks where the file or database object resides. This list is generated and transferred eventually to a data mover. Notice that the data mover function has no knowledge of the file system or database system that it is performing work for; it just reads and writes data.
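The division of labor between the agent and the data mover can be sketched in a few lines of Python. This is a toy model under stated assumptions: the disk is a dictionary of block numbers to bytes, the file system is a dictionary of file names to block numbers, and the function names are hypothetical rather than taken from any product.

```python
def block_list_for(file_table, filename):
    """Agent side: look up which disk blocks hold the file (file-system knowledge lives here)."""
    return list(file_table[filename])


def data_mover_copy(disk, block_list, tape):
    """Data mover side: copy raw blocks from disk to tape in list order, with no file awareness."""
    for block_no in block_list:
        tape.append((block_no, disk[block_no]))
    return len(block_list)


disk = {0: b"aa", 1: b"bb", 2: b"cc", 3: b"dd"}
file_table = {"report.txt": [2, 0]}  # hypothetical file occupying blocks 2 and 0
tape = []
copied = data_mover_copy(disk, block_list_for(file_table, "report.txt"), tape)
print(copied, tape)
```

The data mover function never sees a file name, only block numbers, which is why it can run on any SAN component with enough intelligence to read and write blocks.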
Hot server-less backup. Server-less backup can be cold or hot. Cold backups are much easier to accomplish than hot backups. The performance and centralized management benefits of running cold backups over the SAN may make cold backup a realistic alternative for some organizations.
However, cold backups are probably not realistic for most networks on a regular basis, and hot backups must be employed. Therefore, a critical component of server-less backup is a copy-on-write function that supports hot backups by writing new updates to temporary storage while backup is copying old data. Copy-on-write for server-less backups provides a way to ensure that new data written by an application during backup operations does not create data integrity problems on SAN-attached storage.
In essence, this is the same problem as bus-attached hot backup, except that it is being done in a multi-initiator SAN environment in a distributed fashion where the server-less backup agent conveys block information to an independent data mover. Server-less hot backup creates a point-in-time backup that manages new writes while it is backing up the data.
Hot backup provides the ability for the source system to copy old data to a temporary storage location as new data is being written by the application. As backup progresses, the old data is copied from the temporary storage location, then deleted. However, for server-less backups on a SAN, the copy-on-write process runs in a backup source system while the backup block list is transferred to the data mover machine somewhere in the SAN where it is processed. This is a distributed process involving remote inter-process communications between the source system and the data mover. Things get interesting when an update occurs to a block that was recently updated, and is reflected in the data mover's block list, but has not been backed up yet by the data mover.
The challenges in implementing this are nontrivial, and will require a great deal of work. The process of exchanging the backup block list with an external data mover introduces timing and control problems that need to be carefully considered. The copy-on-write process starts in the source system before it sends a block list to the data mover. Instead of keeping the block list locally in memory, it has to be successfully transferred to the data mover. The data mover has to receive the block list, without errors, and initialize its process soon thereafter to prevent the copy-on-write process from filling up the free disk space of the source system with too many redirected copy-on-write data blocks. As mentioned above, accommodations have to be made to handle the scenario where freshly updated blocks that were indicated in a block list are updated again before the data mover has the chance to transfer the blocks.
Signaling between the copy-on-write process and the data mover to acknowledge transfers of the block list and their subsequent completion or failure has to be handled as a distributed network process in the SAN. The various failure modes of this system need to be anticipated and planned for, including a mechanism for the copy-on-write process to time-out and release its temporary disk storage blocks back to the system to alleviate disk capacity problems.
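The copy-on-write behavior described above, including the case of a block that is updated more than once before it is backed up, can be sketched as follows. This Python model is a single-process simplification with hypothetical names: it does not model the distributed signaling between the source system and the data mover, and the abort method merely stands in for the time-out and release mechanism.

```python
class CopyOnWriteSnapshot:
    """Toy point-in-time view of a disk modeled as {block number: bytes}."""

    def __init__(self, disk, block_list):
        self.disk = disk
        self.pending = set(block_list)  # blocks not yet backed up
        self.saved = {}                 # temporary storage: block number -> old data

    def write(self, block_no, data):
        """Application write: preserve old data once if the block still awaits backup."""
        if block_no in self.pending and block_no not in self.saved:
            self.saved[block_no] = self.disk[block_no]
        self.disk[block_no] = data

    def read_for_backup(self, block_no):
        """Backup read: return point-in-time data, then free any temporary copy."""
        data = self.saved.pop(block_no, self.disk[block_no])
        self.pending.discard(block_no)
        return data

    def abort(self):
        """Give up on the backup (for example after a time-out) and free temporary copies."""
        freed = len(self.saved)
        self.saved.clear()
        self.pending.clear()
        return freed


disk = {0: b"old0", 1: b"old1"}
snapshot = CopyOnWriteSnapshot(disk, [0, 1])
snapshot.write(0, b"new0")
snapshot.write(0, b"newer0")          # second update must not replace the saved copy
print(snapshot.read_for_backup(0))    # point-in-time contents of block 0
```

Only the first write to a pending block saves the old data, so repeated updates do not disturb the point-in-time image, and temporary storage grows only with the number of distinct blocks changed while the backup is outstanding.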
Server-less backup and data sharing. Server-less backup raises an interesting contradiction in SANs regarding the independence of storage as well as the importance of distinguishing between storage and data. Third-party copy works because data movers are able to read data blocks from SAN storage. The independent nature of SAN storage makes this possible.
However, there is an important difference between storage access and data access. While SANs allow great connectivity flexibility, the file systems for sharing data among servers are a different story. Theoretically, data movers can communicate with any storage in the SAN. They are platform-independent, block-access functions. The agents they depend on, however, are completely platform dependent, run on single-server systems, and are not intended to work across multiple servers or clusters. During server-less backup, an update to data from a server that does not have the server-less backup agent running on it will likely corrupt the backup data. The data-mover component would have no way of knowing if something has changed on the storage subsystem.
Figure 8: In this configuration, a server-less backup data mover crosses zone boundaries.
So, the independence of SAN storage that allows data movers to access data does not automatically apply to other servers and systems. One could say that server-less backup restricts data access in the SAN to a single production server, at least until a distributed copy-on-write function can be employed. Realistically, this is not a very big problem today, because SAN-based, record-level data sharing between open systems is not yet a large market requirement. However, this is an area with a great deal of development work, and the coordination between file system and backup technology could become a much more difficult problem to manage in coming years if server-less backup is to be a viable solution.
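The corruption scenario described above can be illustrated with a toy model. It is a sketch only, and it assumes that the agent preserves the original contents of any block it overwrites and that the data mover reads the preserved copy when one exists. The class and method names are hypothetical.

```python
from typing import Dict


class SharedVolume:
    """Toy model of one SAN volume visible to two servers (hypothetical)."""

    def __init__(self, blocks: Dict[int, bytes]) -> None:
        self.blocks = dict(blocks)
        # Point-in-time copies; maintained only by the server running the agent.
        self.preserved: Dict[int, bytes] = {}

    def write_via_agent(self, block: int, data: bytes) -> None:
        # The server with the backup agent saves the old contents first.
        self.preserved.setdefault(block, self.blocks[block])
        self.blocks[block] = data

    def write_bypassing_agent(self, block: int, data: bytes) -> None:
        # A second server writes straight to storage; nothing is preserved.
        self.blocks[block] = data

    def read_for_backup(self, block: int) -> bytes:
        # The data mover prefers the preserved copy when the agent made one.
        return self.preserved.get(block, self.blocks[block])


vol = SharedVolume({0: b"a", 1: b"b"})          # contents when the backup starts
vol.write_via_agent(0, b"A")                    # update seen by the agent
vol.write_bypassing_agent(1, b"B")              # update the agent never sees
image = [vol.read_for_backup(0), vol.read_for_backup(1)]
```

Here `image` mixes the original contents of block 0 with the post-start contents of block 1, so the backup no longer represents a single point in time. The data mover has no signal that anything went wrong.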
Designing SANs for server-less backup. Figure 8 shows a SAN with two servers, two storage subsystems, and two tape libraries segregated into separate zones. A data mover system belongs to both zones and can access all the storage subsystems and tape libraries.
Server-less backup agents in both servers communicate with the data mover and send it block lists for the data that needs to be backed up. The data mover then reads data from those blocks in their respective subsystems and writes it to the corresponding tape drive in a tape library. Again, as in previous examples, the tape library is "fronted" by a SAN bridge or router that connects the SCSI tape drives and library to the SAN. In Figure 8, the data mover is located in a separate system instead of in the bridges/routers.
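A minimal sketch of this arrangement follows. It models zones as sets of device names and tapes as simple lists, and every name in it (the `DataMover` class, the device names, the zone names) is a hypothetical stand-in rather than a real SAN API.

```python
from typing import Dict, List, Set, Tuple

Subsystem = Dict[int, bytes]                 # block number -> block contents
Tape = List[Tuple[str, int, bytes]]          # (source subsystem, block, data)


class DataMover:
    """Hypothetical data mover that is a member of one or more zones."""

    def __init__(self, zones: Dict[str, Set[str]]) -> None:
        self.zones = zones                   # zone name -> devices in that zone
        self.devices: Dict[str, object] = {}

    def attach(self, name: str, device: object) -> None:
        # The mover can only reach devices in a zone it belongs to.
        if not any(name in members for members in self.zones.values()):
            raise PermissionError(f"{name} is outside the data mover's zones")
        self.devices[name] = device

    def backup(self, source: str, block_list: List[int], tape: str) -> int:
        # Read each listed block from the subsystem and append it to the tape.
        disk: Subsystem = self.devices[source]
        out: Tape = self.devices[tape]
        for block in block_list:
            out.append((source, block, disk[block]))
        return len(block_list)


zones = {"zoneA": {"disk1", "tape1"}, "zoneB": {"disk2", "tape2"}}
mover = DataMover(zones)
disk1, disk2 = {0: b"a0", 1: b"a1", 2: b"a2"}, {0: b"b0"}
tape1, tape2 = [], []
for name, dev in (("disk1", disk1), ("tape1", tape1), ("disk2", disk2), ("tape2", tape2)):
    mover.attach(name, dev)

# Each server's agent sends its own block list; the servers never touch the data.
copied_a = mover.backup("disk1", [0, 2], "tape1")
copied_b = mover.backup("disk2", [0], "tape2")
```

The servers in this sketch contribute only the block lists. All reads and writes pass through the data mover, which is the only component that has to span both zones.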
Server-less + integrated SAN backup. Figure 8 shows a SAN with two tape libraries serving two different servers. While this is a beautiful picture for tape library manufacturers, it raises the question of whether a single library could be used instead. The answer is maybe someday, if data-mover functionality can be included as part of an integrated SAN backup system.
Figure 9: A hypothetical structure of a server-less, integrated SAN backup system.
The challenge in combining server-less backup with the total integrated SAN backup system is figuring out how the connection broker, device, robot, and media management of the integrated SAN backup concept can be implemented with server-less backup systems. Server-less backup is being developed with a design goal of minimizing the size and resource requirements of the data mover function so it can be implemented in various SAN devices, some with relatively few resources. For that reason, it might not be realistic to expect the data mover to also take on the work of the connection broker or access gate.
However, the data mover does have the resources to support log-in functions and access keys for establishing a communication session with a library in an integrated SAN backup system. One possibility is that backup traffic would flow through data movers while the negotiation for these sessions and the transfer of communication keys would be managed by the server-less backup application and the connection broker. Figure 9 illustrates a hypothetical structure for such a server-less, integrated SAN backup system.
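One way to picture that division of labor is sketched below. It is purely speculative: the `ConnectionBroker` class, its methods, and the use of random hex strings as access keys are assumptions made for illustration, since the article describes this design only as a possibility.

```python
import secrets
from typing import Dict, Tuple


class ConnectionBroker:
    """Hypothetical broker that hands out per-session access keys for tape drives."""

    def __init__(self) -> None:
        # access key -> (data mover id, drive id)
        self._sessions: Dict[str, Tuple[str, str]] = {}

    def grant(self, mover_id: str, drive_id: str) -> str:
        # Requested by the backup application, not by the data mover itself.
        if any(drive == drive_id for _, drive in self._sessions.values()):
            raise RuntimeError(f"{drive_id} is already in use")
        key = secrets.token_hex(16)
        self._sessions[key] = (mover_id, drive_id)
        return key

    def validate(self, key: str, mover_id: str, drive_id: str) -> bool:
        # What the library would check when the data mover logs in with a key.
        return self._sessions.get(key) == (mover_id, drive_id)

    def release(self, key: str) -> None:
        # Ends the session so another system can use the drive.
        self._sessions.pop(key, None)


broker = ConnectionBroker()
key = broker.grant("mover1", "drive0")               # negotiated by the application
accepted = broker.validate(key, "mover1", "drive0")  # mover presents the key
rejected = broker.validate(key, "mover2", "drive0")  # wrong mover, same key
```

The data mover in this sketch only has to store a key and present it at log-in, which fits the goal of keeping the data mover small. The bookkeeping about which drive is busy stays with the broker.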
There are several ways SAN architectures can solve the serious backup problems facing high-capacity data centers and web sites. This article describes several of them and lays out an evolution for shared, high-speed SAN backup. There is a great deal of work to be done in this area, particularly involving how devices and media are shared by multiple systems in the SAN. As server systems evolve into clustered systems for high availability, the difficulties of scaling backup become even larger. Perhaps we will see server-less backup for clustered servers sometime in our lifetimes, but don't expect it in the near future.
Marc Farley is vice president of marketing at Solution-Soft, in San Jose, and the author of (Osborne/McGraw-Hill).
This article is excerpted from Building Magazine by Marc Farley (Osborne/McGraw-Hill).