Management Strategies for FC-AL
A solid strategy for administering Fibre Channel-Arbitrated Loop environments ranges from low-level device management to yet-to-come predictive management.
By Tom Clark
Fibre Channel benefits storage networking in many ways. Chief among these is a much higher level of visibility and control than is available with parallel SCSI bus configurations. Leveraging Fibre Channel's advantages requires a comprehensive design strategy and consistent implementation. But the result is worth the effort: integration of network management platforms with storage and a foundation for predictive management.
A comprehensive Fibre Channel-Arbitrated Loop (FC-AL) management strategy has a number of layers: device management, problem detection, problem isolation, recovery, and predictive (proactive) management. Products that offer all or most of the management levels give network storage managers the benefits of a single system view of the storage network and higher levels of operational control.
Management capability in any network device requires additional circuitry, microcode, and application software, all of which raise per-port costs. Network storage managers want manageability, but this necessitates a tradeoff between cost and functionality. The more complex a storage network configuration becomes, the greater the need for manageability.
Unmanaged, low-cost hubs are popular choices for configurations of lower port populations that do not justify higher levels of control. Status LEDs at each port and auto-bypass circuitry (which verifies the signal integrity of an attached node) are often sufficient for small, homogeneous storage networks.
As port densities increase, management needs come to the forefront. A larger 10- to 50-node Arbitrated Loop may use host bus adapters and storage arrays from a variety of vendors. The storage network may include clusters, and cable plants may include a mix of copper, multi-mode, and single-mode fiber. The same loop topology may transport both IP and SCSI-3 protocols. These factors add complexity to the storage network.
In LAN/WAN internetworking, companies often mandate SNMP support, even for small configurations. In fact, SNMP management support is now assumed for every network device, from Ethernet hubs and switches to routers and CSUs/DSUs. Network operating centers want complete visibility and control throughout the network. As Fibre Channel infrastructures evolve for storage networking and become more closely tied to internetworking, companies will demand comprehensive management strategies.
SNMP is the common language of multi-vendor network management. An IP-based protocol, SNMP has a reduced set of commands for soliciting status or setting operational parameters of target devices. An SNMP management platform may poll thousands of devices throughout a routed network. The management platform is the SNMP manager; the managed devices contain SNMP agents.
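The manager/agent split can be sketched in a few lines. This is a toy model only, not an SNMP implementation: the Agent and Manager classes, the device names, and the "port1_state" key are all hypothetical, and no packets are sent.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    """Toy stand-in for the SNMP agent inside a managed device (hypothetical)."""
    name: str
    values: dict = field(default_factory=dict)

    def get(self, key):
        # Loosely corresponds to an SNMP Get: return one status value.
        return self.values[key]

    def set(self, key, value):
        # Loosely corresponds to an SNMP Set: change one operational parameter.
        self.values[key] = value


class Manager:
    """Toy stand-in for the SNMP management platform."""

    def __init__(self, agents):
        self.agents = list(agents)

    def poll(self, key):
        # Solicit the same status value from every managed device.
        return {agent.name: agent.get(key) for agent in self.agents}


hub_a = Agent("hub-a", {"port1_state": "inserted"})
hub_b = Agent("hub-b", {"port1_state": "bypassed"})
manager = Manager([hub_a, hub_b])

status = manager.poll("port1_state")        # one poll cycle
hub_b.set("port1_state", "inserted")        # operator changes a parameter
status_after = manager.poll("port1_state")  # next poll cycle sees the change
```

A real platform would do the same thing over UDP/IP against thousands of agents; the point here is only the division of roles.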
Device status information may include a variety of data points: serial number, vendor ID, enclosure status, port type, port operational state, traffic volumes, error conditions, etc. This information is organized in a management information base (MIB), which is maintained by the management workstation and device agent. The internetworking community sanctions several standard MIBs. If a vendor wants to include device information not specified in a standard MIB, vendor-specific parameters or status can be compiled in an enterprise MIB or MIB extension. Since there is currently no standard MIB for Arbitrated Loop hubs, SNMP data for an FC-AL hub may be available via both standard MIB-II structures and vendor-specific MIB extensions.
Device information and status within a MIB is organized in a hierarchical data structure, the structure of management information (SMI). SMI defines an information tree: the branches lead to various management information bases; the leaves are discrete data about a device's functionality. SMI notation, which is part of the payload of an SNMP query, is an address, pointing to the location of the data requested by the management workstation.
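To make the "address into a tree" idea concrete, here is a minimal sketch that resolves a dotted object identifier against a nested dictionary. The arcs under 1.3.6.1.2.1.1 follow the standard MIB-II system group; the leaf values and the resolve helper are invented for illustration.

```python
# Hypothetical fragment of an information tree. The numeric arcs under
# 1.3.6.1.2.1.1 follow the MIB-II "system" group; the values are made up.
TREE = {
    1: {3: {6: {1: {2: {1: {1: {
        1: {0: "Example FC-AL hub"},   # sysDescr.0
        3: {0: 123456},                # sysUpTime.0 (hundredths of a second)
        5: {0: "hub-a"},               # sysName.0
    }}}}}}}
}


def resolve(oid, tree=TREE):
    """Walk the tree one arc at a time and return the leaf value."""
    node = tree
    for arc in (int(part) for part in oid.strip(".").split(".")):
        if not isinstance(node, dict) or arc not in node:
            raise KeyError(f"no such object: {oid}")
        node = node[arc]
    if isinstance(node, dict):
        raise KeyError(f"{oid} names a branch, not a leaf")
    return node


name = resolve("1.3.6.1.2.1.1.5.0")
```

A vendor's enterprise MIB would hang off a different branch of the same tree, which is why one management station can address both standard and vendor-specific data with the same notation.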
SNMP also enables devices to generate unsolicited status information (a trap). If a pre-configured error condition or threshold is reached, the device initiates an SNMP message, alerting the manager workstation. The application may send pages to the network operator.
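The trap mechanism amounts to a threshold check inside the agent. The sketch below assumes a hypothetical PortAgent with an error counter and a callback in place of a real SNMP trap message; the threshold of 10 is arbitrary.

```python
class PortAgent:
    """Hypothetical agent that raises a trap when an error threshold is crossed."""

    def __init__(self, port, threshold, send_trap):
        self.port = port
        self.threshold = threshold
        self.send_trap = send_trap      # callback standing in for the trap message
        self.errors = 0
        self.alerted = False

    def record_errors(self, count):
        self.errors += count
        # Send one unsolicited message the first time the threshold is reached.
        if self.errors >= self.threshold and not self.alerted:
            self.alerted = True
            self.send_trap({"port": self.port, "errors": self.errors})


received = []                           # stands in for the manager workstation
agent = PortAgent(port=4, threshold=10, send_trap=received.append)
agent.record_errors(6)                  # below threshold: nothing sent
agent.record_errors(5)                  # crosses threshold: one trap
agent.record_errors(3)                  # already alerted: no duplicate
```

Suppressing duplicates is a design choice in this sketch; a real agent might instead re-send at intervals or on each further threshold.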
In addition to multi-vendor SNMP management platforms, hardware vendors often provide their own graphical configuration utilities. These utilities are element managers: they focus on a specific vendor's product set, and their parameters and realtime diagnostics are specific to the hardware elements of that set. Part of a comprehensive management strategy for providing a single-system view involves integrating the element manager into global management platforms.
As TCP/IP has increasingly become the Esperanto of data communications, SNMP has become the standard management protocol. Storage is the last area of data networking to embrace IP; consequently, it is the most recent recipient of SNMP capability. This evolution works well for storage networking management because SNMP offers proven, stable standards-based functionality and is widely supported by the rest of the networking world.
SCSI-3 Enclosure Services (SES)
SCSI is the most prevalent server-to-storage protocol. In legacy SCSI systems, the protocol runs over limited-length parallel cables, with up to 15 devices in a chain. The latest version of SCSI--SCSI-3--has the same disk read/write command set as previous versions, but in a serial format. This format makes Fibre Channel a more flexible replacement for parallel SCSI. Running SCSI-3 over Fibre Channel allows server and storage vendors to offer higher speed, longer distances, and greater populations for storage networks with fewer changes to upper-level protocols.
The ANSI SCSI-3 Enclosure Services (SES) proposal defines a command set for soliciting basic device status from storage enclosures. Similar to SNMP Get and Set commands, SES provides SCSI SEND DIAGNOSTIC and RECEIVE DIAGNOSTIC RESULTS commands to query a device. SES may be used to retrieve power supply status, temperature, fan speed, UPS status, and other parameters from SCSI and proxy-managed non-SCSI devices.
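For readers who want to see what such a query looks like on the wire, the sketch below builds the two 6-byte command descriptor blocks. The opcodes (1Ch and 1Dh), the PCV and PF bit positions, and the enclosure status page code (02h) reflect the published SCSI command layouts as best understood here and should be checked against the standard in use; the helper names are hypothetical, and nothing is sent to a device.

```python
import struct

RECEIVE_DIAGNOSTIC_RESULTS = 0x1C   # SCSI opcode
SEND_DIAGNOSTIC = 0x1D              # SCSI opcode
SES_ENCLOSURE_STATUS_PAGE = 0x02    # SES enclosure control/status page


def receive_diagnostic_results_cdb(page_code, alloc_len):
    """6-byte CDB asking for one diagnostic page (PCV bit set in byte 1)."""
    pcv = 0x01
    # Layout: opcode, flags, page code, 16-bit allocation length, control.
    return struct.pack(">BBBHB", RECEIVE_DIAGNOSTIC_RESULTS, pcv,
                       page_code, alloc_len, 0)


def send_diagnostic_cdb(param_list_len):
    """6-byte CDB announcing a diagnostic page in the data-out buffer (PF bit set)."""
    pf = 0x10
    # Layout: opcode, flags, reserved, 16-bit parameter list length, control.
    return struct.pack(">BBBHB", SEND_DIAGNOSTIC, pf, 0, param_list_len, 0)


status_query = receive_diagnostic_results_cdb(SES_ENCLOSURE_STATUS_PAGE, 512)
control_cmd = send_diagnostic_cdb(64)
```

Each such command is an ordinary SCSI transaction, which is why it competes with data traffic when carried in-band on the loop.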
SES is significant for FC-AL management because it is a potential source of overhead within loop data traffic, i.e., it is in-band management. An Arbitrated Loop hub may be neutral in the deployment of SES within storage networks, since it simply facilitates movement of all SCSI-3 transactions. If a hub supports SES queries, it becomes a participant (a node) in the loop. A management strategy for FC-AL should accommodate non-SNMP management protocols, but also exercise caution in introducing management traffic into the loop.
Arbitrated Loop for storage networks is analogous to shared-media LAN topologies such as FDDI and Token Ring, but has unique features and definitions. To appreciate the challenges of a comprehensive Fibre Channel management strategy, it's necessary to understand how it differs from traditional LAN management.
In-Band vs. Out-of-Band
In traditional LAN/WAN terminology, an alternative data path provides out-of-band management, typically a serial RS-232 or SLIP connection. An out-of-band managed Ethernet hub or router, for example, is typically accessed through a command line interface over a dial-in modem or direct-connect serial cable. Management traffic is "out-of-band" since it does not flow through the primary LAN or WAN interface. Current router, LAN hub, and switch technology provides an out-of-band interface only for worst-case situations--for instance, when the primary interface is down and SNMP management is not possible through the network.
Network operating managers prefer in-band management access methods in LAN/WAN environments. In this case, management traffic (SNMP commands via IP) intermixes with data traffic along the primary interface. Provided there is no catastrophic condition in the network, network managers can configure devices, check status, and monitor network information from management stations anywhere on the network. Managers rely on the devices they are managing to route user data and management traffic. And since routers and switches provide multiple paths through a meshed network, management queries impose less overhead than they would on a single shared link.
Things are somewhat different in the Fibre Channel world. In-band management in an Arbitrated Loop incurs traffic overhead. To solicit information from the loop's nodes, initiating managers have to arbitrate, open, and close repeatedly. Server-to-storage conversations--the most critical exchanges in data-centric networks--are repeatedly punctuated with management traffic. So, although in-band management provides some valid functions in Arbitrated Loop environments (e.g., soliciting SES data), it is not the preferred access method and should be employed sparingly.
Fibre Channel out-of-band management occurs through an Ethernet interface on the switch or hub or through a serial console interface. Out-of-band management via Ethernet has three main advantages: it keeps management traffic off the loop where it will not burden business-critical data; it makes management of an Arbitrated Loop possible even if the loop is down; and it is accessible from anywhere in a routed network. In this sense, it is "in-band" in LAN/WAN terms.
To reduce costs, a single management card can be used to manage multiple hubs in a stack. Additional management cards may be added for automatic failover in the event a primary card fails. Since the management bus is separate from the hubs' loop ports, this solution preserves investment in ports that are available for nodes.
Reactive vs. Proactive Management
Traditional network management platforms focus on providing the tools to monitor, troubleshoot, and reconfigure network components in realtime. Errors are detected and reported to the management platform immediately, which results in a visual notification, an audible alarm, or a page. The network operator can then use standard SNMP tools, the vendor's MIB extensions, or the vendor's management application to diagnose the problem.
This management practice is reactive. Reactive management is essential for day-to-day network operations, but is insufficient for maintaining network stability and preventing lost revenues.
An alternative is an application that predicts and resolves problems before they affect the network. Proactive management platforms use the same SNMP data that reactive platforms employ, so network devices don't need to be altered. The predictive management platform periodically rolls SNMP statistics into relational databases and trends that data over time. Proactive management can thus trend usage, track marginal performance of a device, and provide statistical justification for capacity planning and network upgrades.
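Trending can be as simple as fitting a straight line to periodically polled values and projecting when it will cross a limit. The sketch below uses an ordinary least-squares fit; the sample data, the 80 percent limit, and the function names are assumptions for illustration, and a production platform would pull the samples from its database rather than a literal list.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for (time, value) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def projected_crossing(samples, limit):
    """Time at which the fitted line reaches `limit`, or None if not rising."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    return (limit - intercept) / slope


# Hypothetical daily loop utilization (percent), one sample per day.
utilization = [(0, 40.0), (1, 42.0), (2, 44.0), (3, 46.0)]
slope, intercept = linear_trend(utilization)
day_of_limit = projected_crossing(utilization, 80.0)
```

With these made-up samples the fit rises two points per day, so the projection gives the planner a date to act before the limit is reached rather than after.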
As storage networks evolve on Fibre Channel architectures, vendors will have to provide a path from reactive management to proactive, fault-preventive management. This strategy implies a feature-rich MIB extension that provides more useful information to a proactive management application.
Troubleshooting Arbitrated Loops
Arbitrated Loop offers unique challenges for diagnosing operational problems. Unlike Token Ring or other shared media, transactions on an Arbitrated Loop are not always visible to all nodes.
For example, Token Ring sends a data frame from source to destination; the destination copies the data, marks the frame as copied, and returns the frame to the source. Once the source verifies the data has been copied, it is responsible for removing the frame from the ring. Since the data must traverse the entire ring, any intervening node can view it. All that is required, then, to troubleshoot a Token Ring transaction is to insert an analyzer anywhere on the ring and observe traffic.
An Arbitrated Loop sends a data frame from source to destination, but the destination removes the frame from the loop. Other nodes are unaware of the transaction. Troubleshooting an Arbitrated Loop may therefore require a data trace at each suspect port. Typically, storage network managers insert data analyzers on either side of a port and capture the transaction as it arrives at or leaves the port. This is not only an expensive diagnostic that requires considerable Fibre Channel expertise, but inserting and de-inserting data analyzers may alter the topology of the loop and hide the problem under investigation.
Addressing this unique feature of Arbitrated Loop solves a pressing operational issue of storage network managers: Network storage managers spend 80% of downtime identifying the source of problems. If problems can be identified easily, revenue losses, service disruptions, and staffing requirements shrink dramatically.
Device management is the first level of management strategy. In a hub-based network, "device" refers to the hub enclosure, power supply, fans, and ports. This level may incorporate basic control features if the hub supports configurable parameters, inventory/asset support, type of port or GBIC, microcode version reporting, and management topology mapping.
For a product to be considered "managed," it must at least have device management capabilities. Device management is useful for low-level hub status, such as information about fans or power supplies. It does not provide useful information about the loop. To extend management beyond the enclosure components and basic port state to the entire loop (or multiple loops), the management strategy must be extended to higher levels.
Management tools should enable storage managers to quickly detect and isolate problems and to recover loop activity. Since FC-AL is not a broadcast medium, each port may have to be interrogated to identify the source of a problem. Observing Fibre Channel activity at each port, however, should not interfere with normal loop traffic, i.e., it should be performed out-of-loop. This implies eavesdropping on a port's activity without incurring delays or interfering with user transactions.
In addition, the collective observations at each port should be aggregated to a "loop status." When a network storage manager inserts a device into an Arbitrated Loop hub, for example, a loop initialization sequence begins; the loop momentarily passes through an "OpenInit" state in the process of acquiring an address. If a single port is in an OpenInit state, loop initialization activity may be normal; however, if all ports are hung in an OpenInit state, the initialization sequence has failed and further diagnostics are required.
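The aggregation rule just described can be written down directly. In this sketch the state names ("active", "open_init", "bypassed") and the returned status strings are hypothetical, and each reported state is assumed to have already persisted past whatever timeout the hub uses to decide a port is hung.

```python
def loop_status(port_states):
    """Aggregate per-port observations into one loop-level status."""
    # Bypassed ports are not participating, so they do not count.
    active = [s for s in port_states.values() if s != "bypassed"]
    if not active:
        return "idle"
    stuck = sum(1 for s in active if s == "open_init")
    if stuck == len(active):
        return "init_failed"      # every participating port is hung
    if stuck:
        return "initializing"     # may be normal after an insertion
    return "normal"


inserting = {1: "active", 2: "open_init", 3: "active", 4: "bypassed"}
hung = {1: "open_init", 2: "open_init", 3: "open_init", 4: "bypassed"}
```

The same per-port data yields two very different conclusions, which is the argument for presenting a loop-level view rather than leaving the operator to read port states one by one.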
Problem detection should readily identify a number of conditions:
- Physical connection (signal integrity, transmitter and receiver status, GBIC status)
- Presence of valid Fibre Channel characters
- Identification of Fibre Channel ordered sets issued from any port; start of frame sequences (indicating data transactions); and Arbitrated Loop port addresses
- Recognition of port insertion and removal; loop initialization sequences; loop normal operation; and loop failure
Monitoring these conditions allows problems to be identified and addressed quickly; in fact, the exact cause of the problem can be pinpointed down to the port and node level.
Event logs, which record significant changes in loop status, are useful for unattended or off-hour operations. For example, if a misbehaving HBA intermittently goes out of service, the event log will record that activity over time.
A prerequisite for hub management is the ability to take a problem device off-line and run diagnostics. This allows the loop to regain normal activity while the troubleshooting process continues.
Most Arbitrated Loop hubs automatically isolate (bypass) nodes that lose signal. Loss of signal, however, is not the most common cause of loop disruption. A problem node is more likely to have a valid Fibre Channel signal, but invalid or inappropriate Fibre Channel characters. It is important to identify these invalid characters or sequences and to automatically isolate the port so that the operation of the loop is not affected.
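A bypass policy that goes beyond loss of signal might look like the following sketch. The PortSample fields and the 0.1 percent invalid-character threshold are assumptions chosen for illustration, not values from any hub's firmware.

```python
from dataclasses import dataclass


@dataclass
class PortSample:
    """One observation window at a hub port (hypothetical fields)."""
    signal_present: bool
    total_chars: int
    invalid_chars: int


def should_bypass(sample, max_invalid_ratio=0.001):
    """Decide whether a hub port should be isolated from the loop."""
    if not sample.signal_present:
        return True                      # classic loss-of-signal bypass
    if sample.total_chars == 0:
        return False                     # nothing observed yet
    # Valid signal but too many invalid characters: isolate anyway.
    return sample.invalid_chars / sample.total_chars > max_invalid_ratio


healthy = PortSample(signal_present=True, total_chars=100_000, invalid_chars=5)
noisy = PortSample(signal_present=True, total_chars=100_000, invalid_chars=500)
dark = PortSample(signal_present=False, total_chars=0, invalid_chars=0)
```

The noisy port is the interesting case: a signal-only check would leave it in the loop, while a character-level check takes it out.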
The management application should also include graphical tools, which enable storage managers to manually bypass or insert a node, and the capacity to run diagnostics from the end node to the hub and back (loop-back mode). These tools allow the operator to selectively perform nondisruptive tests on the lobe (port, cable, and end node) and to determine corrective action before the port is re-introduced to the loop.
Recovery from an error condition involves two components: recovery of loop activity and recovery of a node's participation in the loop. Automating the recovery process ensures high data availability and facilitates unattended operation of the storage network.
Loop operation is restored when errant nodes are isolated from the loop. But if the problem node represents gigabytes of business-critical data (e.g., a RAID array), the node must be isolated and service must be restored as quickly as possible.
If, for example, a storage array is generating valid Fibre Channel signals, but its state machine is confused, it makes more sense to "wake" the device by sending it a specific Fibre Channel ordered set than it does to physically power cycle the array or replace the interface. Quickly restoring the array to service gives the operator the opportunity to monitor the device and schedule maintenance during off-peak hours.
The ability to rapidly recover loop operation and attached nodes raises server/storage interconnects to a much higher level of reliability and accelerates the migration from legacy SCSI to storage networking. To deploy large server and storage configurations on Fibre Channel, network storage managers must have confidence that this topology will not only overcome bandwidth, distance, and population issues, but also keep downtime to an absolute minimum.
At the highest stage of the management hierarchy, predictive management's goal is to eliminate downtime (hence, lost revenues) and provide valid data on traffic volumes and patterns for capacity planning.
Predictive management requires hardware and software components. Problem detection, isolation, and recovery features minimize downtime. The ability to proactively verify the status of a port, cable plant, and end node before introducing it into the loop greatly enhances the predictable operation of the storage network. The next step toward eliminating downtime is to accumulate loop and port statistics on a periodic basis and, via a predictive management platform, to trend activity over time.
In an Arbitrated Loop, it is not unusual for a node to initialize a loop. Normally, the loop passes quickly through the initialization sequence and returns to its previous activity. However, if the node is sporadically initializing loops, the HBA may need to be replaced. If this activity is observed over time, the problem can be addressed before it becomes severe and without disrupting service.
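Spotting a node that initializes the loop too often is a matter of counting events over a window. The sketch below assumes a hypothetical event log of (timestamp, port, kind) tuples and an arbitrary limit of two initializations per window.

```python
from collections import Counter


def frequent_initializers(events, window_start, window_end, max_inits):
    """Ports that started more than `max_inits` loop initializations in the window."""
    counts = Counter(
        port for ts, port, kind in events
        if kind == "loop_init" and window_start <= ts < window_end
    )
    return sorted(port for port, n in counts.items() if n > max_inits)


# Hypothetical event log entries: (timestamp, port, kind).
events = [
    (10, "port3", "loop_init"),
    (20, "port3", "loop_init"),
    (30, "port5", "loop_init"),
    (40, "port3", "loop_init"),
    (50, "port3", "insert"),
]
suspects = frequent_initializers(events, 0, 100, max_inits=2)
```

A single initialization, as from port5 here, is treated as normal; only the repeat offender is flagged for attention.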
Since most loop problems occur when introducing new devices, proactive management tools provide significant control over the storage network. There are additional proactive tools to maximize loop operation. For example, before a new RAID array is introduced into an active loop, it is extremely valuable if a port is first put into bypass mode so that the cable and end node can be attached and the Fibre Channel ordered sets can be issued to the node and verified. Once they are verified, the port can be enabled for insertion into the loop.
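The bypass, verify, then insert sequence can be enforced as a small state machine. In this sketch the HubPort class and its state names are hypothetical, and lobe_test stands in for whatever the hub does to issue ordered sets to the node and check the response.

```python
class HubPort:
    """Hypothetical hub port that must pass verification before insertion."""

    def __init__(self):
        self.state = "bypassed"          # new ports start outside the loop

    def verify(self, lobe_test):
        # lobe_test stands in for issuing ordered sets to the attached node
        # and checking the response while the port is still bypassed.
        if self.state != "bypassed":
            raise RuntimeError("verify only while bypassed")
        if lobe_test():
            self.state = "verified"
        return self.state == "verified"

    def insert(self):
        # Refuse to let an unverified node join the active loop.
        if self.state != "verified":
            raise RuntimeError("port has not passed verification")
        self.state = "inserted"


good = HubPort()
good.verify(lambda: True)
good.insert()

bad = HubPort()
bad_ok = bad.verify(lambda: False)       # node fails the lobe test
```

Because insert refuses an unverified port, a faulty new array never reaches the active loop, and the running nodes are never exposed to it.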
Sweep screen provides port and loop detection, and reports detected problems with color-coded indicators on loop, stack, hub, and port views.
Automatic recovery features and port-level diagnostics can be used to recover an attached node without disrupting normal loop operation.
Tom Clark is a senior systems engineer at Vixel Corp. in Bothell, WA.