What is SAN management, and why do you need it?

SAN management involves multiple layers, from SNMP-based device control to enterprise-level systems/network management frameworks.

Tom Clark

Nearly all major server and storage vendors today are shipping Fibre Channel SAN solutions based on arbitrated loop hubs, fabric switches, or some combination of switches and hubs. This is a significant shift in market adoption over the past several years and marks a major advance in the development of storage area networks (SANs).

Storage networking began with small (two- or three-node) homogeneous single-vendor configurations, typically based on arbitrated loop. Since the supplying vendors owned the design of the server and the storage array, management was typically not too sophisticated. In normal operation, a problem could usually be corrected by identifying the offender out of two or three suspects.

The recent trend, however, is toward more devices in a SAN and a heterogeneous mix of participants. With sometimes over 100 nodes on a single SAN and configurations composed of multiple vendors' host bus adapters (HBAs), disks, and tape subsystems, management has become critical. Management tools have become essential for maintaining uptime, and decisions relating to SAN infrastructure are now based on both functionality and manageability.

What is SAN management?

Management means different things to different people. For example, for network administrators, it relates to data transport or the ability to move user information reliably from one point to another. For them, management issues include bandwidth usage, provisioning redundant links in a meshed topology to guarantee alternate data paths, support for multiple protocols, and error-free delivery. In short, network administrators are concerned with getting data from A to B, but not what happens to the data once it arrives at its destination.

Storage administrators, on the other hand, are more concerned about organizing and placing data once it arrives at its destination. LUN mapping, RAID levels, file integrity, tape backup, and disk usage are day-to-day preoccupations. Storage management assumes the data arrived at B intact and then decides how it is written onto disk or tape; data transport is not an issue.

In a SAN, these views converge. Properly operating SANs require both data transport and data placement management. By placing networking between servers and storage, a SAN broadens traditional storage management to include network administration and encourages traditional network management to extend its reach to data placement and organization. Consequently, network management frameworks such as CA Unicenter and Tivoli are incorporating SAN storage management utilities, while storage platforms such as Veritas and Legato are including modules to monitor the Fibre Channel network transport. The integration of storage and networking management functions is a unique product of SAN evolution and confirms the shift from a server-centric to a data-centric model in the enterprise.

SAN management involves multiple layers (see figure). At the lowest level, management of the SAN interconnect or transport depends on hardware that supports intelligent features, typically Simple Network Management Protocol (SNMP) agents that can report status and respond to commands. At the highest level, enterprise-wide management platforms oversee a variety of functions and are fed status information from multiple networking and storage infrastructures. The interfaces between the different management layers may be as straightforward as event logs or SNMP traps, or as sophisticated as application programming interfaces (APIs) and common information models (CIMs).

Throughout the hierarchy of management layers, the common charter of all management functions is data availability. For enterprise networks, access to data is as essential as a dial tone. Loss of data access disrupts communication, delays business transactions, and ultimately results in lost revenue. For SANs, an efficient high-speed transport has little value if, as data is placed on the disk, it is corrupted due to a faulty RAID algorithm on the storage side. Likewise, a fault-tolerant, high-capacity storage array cannot fulfill its function if, as data moves across the SAN, it is corrupted due to poor signal quality. Management of the SAN, therefore, requires visibility of all aspects of data transport and storage. Partnerships among storage management and SAN interconnect vendors are driving the integration of management functions for more comprehensive SAN management applications.

When is management required?

Ideally, the decision to deploy a managed or unmanaged SAN solution should be determined by the application. If a user application can withstand potential disruptions due to network outages, advanced management features are less necessary. In many early-adopter SAN installations, however, unmanaged SAN configurations were installed by default, either due to lack of manageable products at the time or the assumption that small homogeneous SANs do not require management.

Some of these early installations continue to run mission-critical applications and attempt to address the lack of management by provisioning redundant data paths. This solution is not a substitution for SAN management. Without management visibility, the failure of a backup path may go unnoticed, resulting in system failure if the primary path subsequently goes down.
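The risk described above, a backup path failing silently, can be illustrated with a minimal Python sketch. The `Path` record and the `redundancy_alerts` function are hypothetical names invented for this illustration; how the health of each path is actually probed depends on the hardware and is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str      # hypothetical label, e.g. "primary" or "backup"
    healthy: bool  # result of the most recent link check (assumed input)


def redundancy_alerts(paths):
    """Return alert strings for any loss of redundancy among the paths."""
    failed = [p.name for p in paths if not p.healthy]
    alive = [p.name for p in paths if p.healthy]
    alerts = []
    if failed and alive:
        # Data still flows, so users notice nothing; only monitoring
        # reveals that the next failure will take the system down.
        alerts.append(
            f"WARNING: path(s) {', '.join(failed)} down; "
            f"no redundancy left beyond {', '.join(alive)}"
        )
    elif failed:
        alerts.append("CRITICAL: all paths down; storage unreachable")
    return alerts
```

With both paths healthy the function returns an empty list. If only the backup path is marked unhealthy, it returns a single warning even though applications are still running normally, which is exactly the condition an unmanaged configuration would miss.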

Enterprise and information networks are so dependent on continuous data access that even critical tape backup schedules or planned changes to the network are difficult to accommodate. Often in small single-vendor departmental installations, the time spent identifying and correcting a simple problem may represent an unacceptable disruption to end users.

Consciousness of SAN management surfaces instantly whenever failures occur. Without tools to quickly identify and isolate a problem, the mean time to repair slowly expands to fill the gap between initial symptoms and all the remedial processes required to track down and finally fix the offending cable, transceiver, host bus adapter, or application error. This is not normally the desired method for gaining appreciation of management capabilities.

As larger, more complex SAN installations are deployed in enterprise networks, the integrity of storage transport and organization will be held to the same management criteria as LANs and WANs. In enterprise environments, it is common practice to mandate that all networking infrastructure be, at minimum, SNMP-compliant. If SAN components for both mission-critical data center configurations and small, less-critical departmental applications share common management interfaces, a superstructure management framework can provide visibility throughout the enterprise. Alternatively, with manageable SAN products even a small company or department can benefit from management functionality designed for enterprise-level high availability.

How much management is needed?

Traditionally, storage management has been limited to the organization of data, e.g., through LUN definitions and RAID level assignment or simple status of storage components through SCSI enclosure services (SES) inquiries. Storage administrators, unlike their enterprise networking counterparts, have rarely resorted to protocol-level analysis to troubleshoot problems. With the traditional SCSI connections that typically bind a single server to one or more disk arrays, it is normally sufficient to monitor basic enclosure status: power supply, temperature, and fans.

Introducing networking between server and storage, however, increases the complexity of the storage configuration, requiring much more from the devices that provide the SAN interconnect. Individual Fibre Channel HBAs, transceivers, hubs, switches, routers, and disk controllers may offer different levels of manageability. Some only offer basic enclosure status, while others offer sophisticated diagnostics. The ability to monitor the status of the entire SAN depends on the capabilities of various devices, often from an assortment of interconnect vendors. Combining these capabilities into a robust, managed solution is a challenge assumed by server and storage suppliers, who integrate them into their own management platforms, set higher criteria for interconnect providers, and participate in common efforts such as SNIA (Storage Networking Industry Association) committees and the Fibre Alliance.

In normal operation, a SAN should be transparent to the user. SANs are deployed to solve application problems, not to complicate them. Therefore, the more intricate the SAN, the greater the demand on hardware and software to hide that complexity from the user. To be effective, interconnect devices need to be more intelligent and SAN management applications must be more sensitive to users' expectations.

SAN devices that follow the classic SES management strategy of basic enclosure status and port on/off controls do an adequate job of hiding Fibre Channel issues from the user. Unfortunately, they also hide Fibre Channel issues from the device, making it incapable of responding to anything but basic problems. Administrators need to be notified of fan or power supply failures and higher-level protocol events that could threaten to bring down the SAN. A product designed around elementary enclosure status has no visibility to Fibre Channel protocol events and cannot proactively keep the SAN operational. Protocol recognition requires advanced product design that extends basic device management to more sophisticated problem detection, isolation, and recovery from both link- and protocol-level issues.

In arbitrated loop environments, for example, loop initializations naturally occur as new nodes are added to the loop or previously attached nodes are powered on. Loop initializations typically occur in milliseconds and do not disrupt most applications. Certain loop initialization sequences, however, may be sustained, which can be highly disruptive to user applications. A SAN interconnect device that lacks protocol recognition circuitry cannot respond to these initialization storms and cannot notify administrators when a problem occurs.

A Fibre Channel hub that, at minimum, can sense an initialization storm can send an alert. This is a significant improvement over simple SES management since the administrator will at least know what, if not where, the problem is. Additionally, some hubs' management applications provide wizards that enable administrators to manually step through a diagnostic process to identify the offending node. However, if the process takes several minutes, it is likely that user applications will time out and disrupt end-user activity.

From an administrator's perspective, it is far more useful if the loop hub not only detects the initialization storm, but also automatically isolates the problem node from the production loop. If the loop hub immediately restores normal operation of the entire loop, administrators are shielded from the troubleshooting process and the remaining users can be restored before applications time out.
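The detect-and-isolate behavior can be sketched as a toy Python model. This is an illustration under stated assumptions, not a description of any vendor's firmware: the `LoopHubMonitor` class, its threshold and window values, and the premise that the hub can attribute each loop initialization to the port that originated it are all hypothetical.

```python
from collections import defaultdict, deque


class LoopHubMonitor:
    """Toy model of a hub that counts loop initializations per port."""

    def __init__(self, storm_threshold=5, window_seconds=1.0):
        # Both values are illustrative assumptions, not standard figures.
        self.storm_threshold = storm_threshold
        self.window_seconds = window_seconds
        self.events = defaultdict(deque)  # port -> recent event timestamps
        self.bypassed = set()             # ports removed from the loop
        self.alerts = []                  # messages for the administrator

    def record_initialization(self, port, timestamp):
        """Record one initialization; return True if the port gets bypassed."""
        if port in self.bypassed:
            return False  # already isolated, nothing more to do
        recent = self.events[port]
        recent.append(timestamp)
        # Drop events that have aged out of the sliding window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        if len(recent) >= self.storm_threshold:
            # Sustained initializations: isolate the port and notify.
            self.bypassed.add(port)
            self.alerts.append(
                f"port {port} bypassed after {len(recent)} loop "
                f"initializations within {self.window_seconds}s"
            )
            return True
        return False
```

Occasional initializations, such as a node powering on, never fill the window and are ignored. A port that reinitializes repeatedly within the window is bypassed and an alert is queued, so the administrator learns both what happened and where without stepping through a manual diagnostic.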

Administrators have one central concern: uptime. They do not want to be bothered with time-consuming wizards, be required to know all the various conditions and protocols of an initialization storm, or otherwise be concerned with Fibre Channel internals. They simply want to know when a problem occurs and then leave it up to the infrastructure to bypass or "heal" the failure. This is especially true for departmental SANs, which may not be monitored by onsite support personnel.

In this example, the interconnect product provides a high level of proactive management, but the management GUI simplifies how this activity is translated to the user. The standard "back-of-box" view of the hub or switch, with color-coded status indicators and plain English descriptions of events, hides the complexity of the SAN interconnect while powerful corrective functions work behind the scenes to maintain availability automatically. The administrator can then beacon or flash the port LEDs so customer service personnel can quickly locate the failed port.

In addition to proactive features to ensure uptime, management features provide useful performance data. For example, the data can tell you when a new application or server is brought on-line, so traffic loads can be monitored. It also enables administrators to intelligently allocate devices and redistribute storage resources on different segments to fine-tune SAN operations. If performance data is displayed over time, administrators can accumulate capacity planning information for budgeting and SAN expansion purposes (see screenshot). For small, fairly static SAN configurations, this capability may be overkill, but it is invaluable for dynamic SANs that otherwise would have no way of measuring the effects of adds, moves, and changes.
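As a rough illustration of how per-port performance data becomes capacity-planning information, the following Python sketch converts periodic readings of a cumulative byte counter into link utilization per interval. The function name, the sample format, and the nominal link rate of about 100 MB/s for 1-Gbit Fibre Channel are assumptions made for the example; real devices expose their counters in device-specific ways.

```python
def port_utilization(samples, link_rate_bytes_per_sec):
    """Turn cumulative byte-counter samples into utilization fractions.

    samples: list of (timestamp_seconds, cumulative_bytes), oldest first.
    Returns a list of (timestamp, fraction_of_link_rate), one per interval.
    """
    result = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or b1 < b0:
            # Skip out-of-order samples and counter resets.
            continue
        throughput = (b1 - b0) / elapsed  # bytes per second
        result.append((t1, throughput / link_rate_bytes_per_sec))
    return result


# Hypothetical readings taken 10 seconds apart on one port.
readings = [(0, 0), (10, 250_000_000), (20, 1_000_000_000)]
NOMINAL_RATE = 100_000_000  # assumed nominal payload rate, bytes/second
utilization = port_utilization(readings, NOMINAL_RATE)
```

Here the port runs at roughly a quarter of the assumed link rate in the first interval and three quarters in the second. Stored over weeks, a series like this shows the effect of each add, move, or change on the segment.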

How much management is needed is fundamentally determined by who is supporting the SAN infrastructure. An administrator of a 5- to 20-node storage network may simply want to be notified of a problem, calling a VAR or solutions provider to replace or repair the failed unit.

If the SAN interconnect has auto-recovery capability, the SAN can remain operational while remedial action is taken. Hubs or switches that also support advanced diagnostic tools are probably too complex for direct customer use, but they are valuable tools for customer service.

Enterprise networks, however, often have their own in-house expertise to deal with more extensive network configurations. Just as a technical support staff for an enterprise LAN or WAN may include experts trained in the use of a Sniffer and TCP/IP protocol decode, support staff for large SANs may include experts trained in the use of Fibre Channel analyzers and Fibre Channel protocol decode. Likewise, VARs who service large corporate accounts are encouraged to provide higher levels of expertise to more quickly resolve customer issues.

In these instances, the more management capability, the better. Products that offer auto-recovery and advanced diagnostics via management software help quickly identify complex problems and ensure higher system availability. The trend in Fibre Channel hub and switch design is therefore toward increased SAN health capabilities and more sophisticated diagnostic tools that meet the serviceability requirements of both departmental and enterprise-level SANs.

The scope of SAN management

The relationship between the embedded, intelligent functions in SAN hardware and the user interface supplied by interconnect management software is reflected, at a higher level, by the relationship between interconnect management in general and the upper level storage and systems management platforms. Small and medium SAN installations may not have the requirements, resources, or budget to install larger management frameworks such as Tivoli or CA Unicenter. Tape backups may be performed through operating system utilities instead of more sophisticated backup managers from Veritas or Legato. Large framework platforms and backup managers are, however, very much part of the daily life of an enterprise, so while integration between SAN interconnect management and upper level applications may not always be used, it is important that those integration hooks exist.

For low- and mid-range server/storage markets, a SAN may comprise a single hub or switch with as few as three to eight devices. Despite the relative simplicity of such configurations, SAN management is facilitated by smart features in the interconnect hardware and an intuitive graphical interface to interpret status for the user. As SANs become more complex, a combination of hubs and switches in a single environment is more easily managed if the graphical interface can solicit status from both types of devices. This capability simplifies user requirements for platforms, staffing, and training and presents a consistent look and feel for managing the data transport layer.

Integration with upper level storage and systems management platforms gives administrators additional flexibility in defining the scope of SAN management. A storage management platform may provide the tools for data organization, tape backup, RAID levels, and LUN mapping. It also may be able to launch a hub or switch device manager from the main program. SNMP traps or links to event log error codes may also be used to launch interconnect managers from systems management platforms like Tivoli or CA Unicenter. Having a variety of integration options available allows administrators to start small and then implement larger SANs as application requirements demand.

Per port performance monitoring of hub or switch traffic patterns.

Tom Clark is director of technical marketing at Vixel Corp. (www.vixel.com), in Bothell, WA. He is author of "Designing Storage Area Networks: A Practical Reference for Implementing Fibre Channel SANs," published last month by Addison Wesley Longman. He can be contacted at tclark@vixel.com.

This article was originally published on October 01, 1999