Onaro's SANscreen Application Insight empowers IT to align a SAN for maximum application performance at the lowest cost by providing an automated solution for discovery, service monitoring, and capacity management analysis.
By Jack Fegreus
When it is time to act on an emerging opportunity or react to a competitive threat, today's real-time enterprise depends on getting the right information, to the right people, in time to act and create value or mitigate risk. That goal has set off an explosion in the volume of data maintained on SANs.
To deal with that expanding volume of data, IT administrators have focused on managing storage devices as individual assets. At best, this has meant investments in storage resource management (SRM) software. More often, however, storage administrators simply rely on a combination of out-of-date spreadsheets and data generated from homegrown scripts to represent the current configuration state of storage resources.
The problem for IT is that no matter how well storage is managed as an isolated asset, only a limited value can be derived from an isolated device. Asset properties and the state of storage devices are not sufficient to address the challenges to delivering IT support as a business service. In fact, most IT organizations lack a clear set of links that tie storage resources with applications and business value, which creates a gulf between IT and corporate executives.
Corporate executives think in terms of business processes. When it comes to the services they need to support those processes, they expect IT to address issues of availability, business continuity, performance, and security. This requires IT administrators to automate data-center processes for building, maintaining, optimizing, and auditing storage networks. That means IT must be able to create policies and procedures that can effectively support service level agreements (SLAs) for storage.
To define and support an SLA for a business process, IT must understand all of the interdependencies among storage devices, hosts, and SAN switches for each application that's a part of the process. The lack of such an overall understanding will negatively impact IT's ability to deliver processes for building, maintaining, optimizing, and auditing storage networks. Unfortunately, the task of developing a service-centric process all too often burdens IT with costly labor-intensive tasks that require detailed application, data-center, and business process knowledge. Onaro's SANscreen Application Insight, however, provides a simpler solution to optimizing storage as a service through the construct of an access path.
An access path consists of physically connected SAN resources and represents a relationship between a particular application on a server, and the application's data on a storage device. This construct of a path is at the heart of SANscreen Service Insight. Using a service intelligence engine and an SQL-based database, Service Insight automatically discovers all of the interrelationships among physical resources on a SAN. As a result, SANscreen provides IT with insight into the end-to-end access paths that support business processes, which is the foundation that IT needs to provide storage as a service.
We opened the Load Analyzer display, with a week's data grouped by device type, and charted data utilization and error rates. At this level of abstraction, approximately 55% of all the data was distributed to host devices and the breakdown of traffic—reads versus writes—was fully consistent with classic heuristics as reads made up 75% of traffic.
In particular, SANscreen Application Insight extends the device attributes associated with an access path with real-time SAN traffic data. Via Application Insight, storage administrators have both the transactional data and the critical analysis perspectives needed to align SAN devices within a service context. Through SANscreen Application Insight, IT acquires the information and tools needed to support a business-process SLA.
Application Insight's ability to understand an application's path to storage, combined with device load information gathered from SAN switch ports, provides visibility into the cost and efficiency of delivering storage services to an application. The result is lower capital costs through improved resource utilization, improved traffic balance, identification of "orphaned" resources, validation of tiered-storage strategies, and improved application performance. Using Onaro's SANscreen Application Insight, openBench Labs was able to look beyond device-centric traffic metrics and correlate overall SAN traffic with quality of service issues.
The ability to group, aggregate, and visualize a massive amount of SAN traffic data—by default, up to 10,000 Fibre Channel switch ports are scanned every 10 seconds—allows SANscreen Application Insight to be thought of as an online analytical processing (OLAP) tool for predictive analysis. By slicing and dicing traffic data from varying path perspectives, we were able to quickly correlate traffic problems, such as host congestion and multi-path availability, with application performance and service-level policies. This enables rapid resolution of performance bottlenecks and proactive optimization of host, array, switch, and fabric traffic distribution and utilization.
Application Insight is, however, more than a monitoring tool to resolve known issues: Building on the service intelligence engine of the software, real-time traffic data can be used to discover the root causes of problems before users are impacted. For that reason, openBench Labs set out to collect and analyze fabric data from a proactive storage-service management perspective, rather than from a reactive asset maintenance one. To that end, we needed to resolve three important questions:
- Could we uncover hidden issues that could lead to potentially severe application availability problems?
- Could we quickly and intuitively resolve those problems?
- Could we pinpoint potential capital savings based on the levels of resource utilization?
Through the Load Analyzer, a storage administrator can perform sophisticated capacity management functions by slicing, dicing, and visualizing all of the SAN traffic data collected from each switch port. That traffic can be aggregated and grouped in a number of ways. Traffic is also a function of time, so all of this analysis can be done over standard intervals—last hour, last day, last week—or a customizable time horizon.
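The grouping and time-horizon analysis described above can be illustrated with a minimal Python sketch. It is not SANscreen's implementation: the sample record layout (`time`, `device_type`, `bytes`), the function name `traffic_distribution`, and the sample values are all assumptions made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def traffic_distribution(samples, group_key, end, horizon):
    """Return each group's share (%) of the bytes seen within the time horizon."""
    start = end - horizon
    totals = defaultdict(int)
    for sample in samples:
        # Keep only samples that fall inside the requested window.
        if start <= sample["time"] <= end:
            totals[sample[group_key]] += sample["bytes"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {group: 100.0 * count / grand_total for group, count in totals.items()}


# Hypothetical port samples; the timestamp and byte counts are invented.
now = datetime(2024, 1, 8, 12, 0)
samples = [
    {"time": now - timedelta(minutes=10), "device_type": "host", "bytes": 600},
    {"time": now - timedelta(minutes=20), "device_type": "storage", "bytes": 300},
    {"time": now - timedelta(minutes=30), "device_type": "tape", "bytes": 100},
    {"time": now - timedelta(days=2), "device_type": "host", "bytes": 9000},
]

last_day = traffic_distribution(samples, "device_type", now, timedelta(days=1))
last_week = traffic_distribution(samples, "device_type", now, timedelta(weeks=1))
```

Changing `group_key` or `horizon` re-slices the same samples, which is the essence of the OLAP-style analysis: the raw data is collected once and the perspective is chosen at query time.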
When openBench Labs drilled down on specific hosts, we readily identified that the server Connors, which was lightly loaded with only 1% of the host traffic distribution, had a 73%/27% traffic-load imbalance over its paths. Looking at the performance of the heavily used path, we found a data framing problem that forced data to be retransmitted and caused the path utilization rate to spike at around 30%.
Application Insight represents the volume of that traffic in terms of a raw count, a distribution percentage based on the devices attached to a switch port, or a utilization percentage based on the physical properties of a specific path. Traffic is related to either application data or to errors. Error-related data is further broken down by cause (e.g., loss of synchronization, loss of signal, general error rate). As with data, the volume of errors is represented by either a raw count or an error-rate percentage.
To simplify the visualization of traffic patterns within a single performance chart, all data counts are presented on a percentage basis over a particular time horizon. Application traffic is represented as a percentage of path utilization; total errors are represented as a percent of total traffic; and each type of error is displayed as a percentage of the number of overall errors. This allows a storage administrator to identify any correlation between errors and traffic utilization. A quick resolution of any such correlation will then leave more bandwidth available for application data traffic.
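As a rough illustration of how raw counts map onto the three percentage scales just described, consider the following Python sketch. The function and field names are hypothetical, and measuring "total traffic" in frames is an assumption; the article does not say which unit SANscreen uses.

```python
def chart_percentages(bytes_moved, capacity_bytes, frames, errors_by_type):
    """Convert raw counts for one path and interval into chartable percentages."""
    total_errors = sum(errors_by_type.values())
    # Application traffic as a percentage of what the path could have carried.
    utilization = 100.0 * bytes_moved / capacity_bytes if capacity_bytes else 0.0
    # Total errors as a percentage of total traffic (assumed here to be frames).
    error_rate = 100.0 * total_errors / frames if frames else 0.0
    # Each error type as a percentage of all errors.
    breakdown = {
        kind: (100.0 * count / total_errors if total_errors else 0.0)
        for kind, count in errors_by_type.items()
    }
    return {
        "utilization_pct": utilization,
        "error_rate_pct": error_rate,
        "error_breakdown_pct": breakdown,
    }


# Invented counts for a single path over one sampling interval.
summary = chart_percentages(
    bytes_moved=300,
    capacity_bytes=1000,
    frames=10000,
    errors_by_type={"sync_loss": 5, "signal_loss": 15, "crc": 30},
)
```

Because every series ends up on a 0 to 100 scale, utilization and error curves can share one chart, which is what makes a correlation between the two easy to spot.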
We opened the Hosts display, with top-line data grouped by application. Selecting MS Exchange as our application, we easily identified two servers running that application. Narrowing the focus to a specific server, we discovered a well-balanced traffic flow across the three switch ports to which the server was attached. Via the Topology display, we could also drill down on the two Hitachi Data Systems arrays and the tape library that were connected to this server by data paths.
Moreover, host-to-storage paths are not the only SAN paths for which traffic balance is of critical importance. To create an extended SAN fabric with multiple switches, it is necessary to connect the switches via high-speed inter-switch links (ISLs), which are often created by bundling several switch ports. That configuration can be costly in terms of the allocation of switch ports and becomes very costly whenever the flow of data is not balanced, the utilization is low, or an ISL operates inefficiently.
A degraded ISL will negatively impact all traffic routed through that link. In particular, on a well-balanced SAN fabric, traffic will be dynamically routed over changing path combinations. As a result, serious application performance issues may be masked as random events to a storage administrator who lacks the tools needed to assess performance in an end-to-end service context. Even worse, the impact of a degraded ISL will cascade down onto any SLA associated with an application that might access data through that link.
Within the Monitor display, openBench Labs found time-consuming tasks, such as setting alerts and creating SAN policies, were greatly simplified through the presentation of easy-to-use menu screens.
In addition to resolving load-balance issues, the Load Analyzer SAN traffic distribution and utilization statistics make it easy for storage administrators to identify busy, under-utilized, and orphaned resources. Once identified, applications utilizing these resources can be analyzed as candidates for migration to higher- or lower-performing equipment or for consolidation, especially to virtual machines. As a result, IT can delay, and perhaps avoid, capital expenditures through increased host and array utilization, rigorously planned storage tiers, and the reclamation of unused devices.
With regard to the path topology, the distribution of SAN traffic is closely related to the notion of risk. If a large number of storage array volumes were to be zoned, mapped, and masked through the same switch port, a failure at that port would cascade down to each volume. Making matters worse, hosts that use multi-path software for redundancy and load-balancing are often configured incorrectly, which increases the risk of failure.
Validating multi-path configurations manually, however, is a very labor-intensive, expensive, and frequently ineffective task. Nonetheless, when IT is faced with such tasks as adding applications or consolidating servers, issues such as availability and business continuity are very important. This makes the question, "How busy is a host?" a critically important issue, even for storage administrators. To answer that and other similar questions, Application Insight adds three options to the SANscreen Client GUI: Load Analyzer, Hosts, and Monitor.
As its name suggests, the Hosts display provides a more host-centric perspective on traffic data. Data associated with multi-pathing risk, load-balancing, and resolving congestion can be initially grouped and aggregated by application, business unit, or operating system.
Within the Hosts display, Application Insight adds new data that includes a Port Count, which is the number of switch ports through which a host sends and receives data. There are also structural data entries that contain useful SLA information, such as the name of the application running on a server, the priority of that application, and whether the server has a redundancy policy in place. More importantly, the Hosts display includes a calculated Balance Index, which is an important measure of risk.
Through the Hosts display, we began an analysis of the host Nadal, which was running the Mapping Service application. We quickly identified a multi-path risk condition, which is flagged by a Balance Index of 100: Nadal's traffic is going through only one HBA, within an active-active environment. This is reinforced in the distribution statistics of the Load Analyzer Detail View. By selecting Analyze Congestion on the host Nadal, we were able to see a bigger picture based on the end-to-end path configuration data discerned by SANscreen Service Insight. Through Analyze Congestion, we see that Nadal is "competing" for the same storage port resources as servers Roddick and Safin. This could degrade the performance of the Mapping Service application and create an SLA issue.
Application Insight calculates the standard deviation from the average traffic over all of a host’s redundant paths using data collected from the switch ports. If redundant data paths on a host all have equal distribution, then the paths are said to be in balance and those paths will have a balance index of 0. A high Balance Index is a good indication that a problem is causing the server traffic to be out of balance across switch ports.
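The exact formula behind the Balance Index is not published in this review, so the following Python sketch is only one plausible reading. It takes the standard deviation of each path's share of the host's traffic and, as an assumption, normalizes it so that perfectly even traffic scores 0 and all traffic on a single path scores 100. The function name `balance_index` and the choice of the population standard deviation are likewise assumptions.

```python
import math
import statistics


def balance_index(path_traffic):
    """Score traffic imbalance across a host's redundant paths on a 0-100 scale."""
    n = len(path_traffic)
    if n < 2:
        raise ValueError("need at least two redundant paths")
    total = sum(path_traffic)
    if total == 0:
        return 0.0  # No traffic at all: nothing is out of balance.
    # Work with each path's share of the traffic so the scale is load-independent.
    shares = [t / total for t in path_traffic]
    spread = statistics.pstdev(shares)
    # Assumed normalization: the spread when one path carries everything.
    worst_case = math.sqrt(n - 1) / n
    return 100.0 * spread / worst_case


balanced = balance_index([50, 50])   # equal distribution
skewed = balance_index([73, 27])     # a 73%/27% split over two paths
single = balance_index([100, 0])     # all traffic on one of two paths
```

Under this assumed normalization, a 73%/27% split over two paths scores about 46, midway between a balanced host and one pushing all of its traffic through a single HBA.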
Within the Hosts display, a Load Analyzer detail view with traffic data can be opened. As a component of the Hosts display, this Load Analyzer detail view is specific to switch ports that are directly connected to the host's host bus adapters (HBAs). A detailed Topology visualization can also be opened within the Hosts display. This Topology detail view uses icons and graphical elements to visualize the SAN as seen from the host. All physical connections appear as lines connecting the icons, which represent switches and storage devices.
For even more detailed analysis, physical connections can be further categorized by the way in which they were logically defined. Color keys identify connections as zoned, masked, or mapped, or as a violation. By selecting a switch in the Topology display, the Load Analyzer window in the Hosts display will provide a traffic distribution analysis for all hosts connected to that switch. This enables a storage administrator to determine all of the hosts sharing a switch.
Finally, the third display introduced by SANscreen Application Insight is that of an alarm Monitor. A storage administrator uses the Monitor to set and edit error conditions along with upper/lower bounds on traffic utilization. These alarm thresholds for performance can be related to total port utilization or broken out by transmit utilization and received utilization. Alarms can also be set on the number of errors, such as signal or synchronization loss, CRC, and the rate of errors.
Whenever any such threshold is violated during a specified sampling time, an alert is automatically triggered. The alert will provide granular information concerning switch ports and the devices attached to the switch ports. In addition to being stored in the database, alerts can be e-mailed or sent to another systems management framework application via SNMP traps.
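The threshold logic described above can be sketched in a few lines of Python. This is an illustration of the general technique and not SANscreen's alerting engine: the `Threshold` class, the metric names, and the alert record layout are hypothetical, and delivery by e-mail or SNMP trap is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    metric: str                    # e.g. "tx_utilization_pct" (hypothetical name)
    lower: Optional[float] = None  # alert when the value drops below this bound
    upper: Optional[float] = None  # alert when the value rises above this bound


def check_sample(port, sample, thresholds):
    """Return one alert record for every threshold the port sample violates."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            continue  # This metric was not collected for this sample.
        if t.upper is not None and value > t.upper:
            bound, limit = "upper", t.upper
        elif t.lower is not None and value < t.lower:
            bound, limit = "lower", t.lower
        else:
            continue
        alerts.append({"port": port, "metric": t.metric, "value": value,
                       "bound": bound, "limit": limit})
    return alerts


# Invented thresholds and one invented sample for a hypothetical switch port.
thresholds = [
    Threshold("tx_utilization_pct", lower=5.0, upper=80.0),
    Threshold("rx_utilization_pct", upper=80.0),
    Threshold("crc_errors", upper=0),
]
sample = {"tx_utilization_pct": 92.0, "rx_utilization_pct": 40.0, "crc_errors": 3}
alerts = check_sample("switch1:port7", sample, thresholds)
```

Keeping transmit and receive utilization as separate metrics mirrors the option to break total port utilization out by direction.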
The ability to automatically discover changes in the SAN fabric and analyze those changes with respect to the flow of data over identifiable access paths from storage devices to switches to servers makes Application Insight an important tool for storage administrators. What distinguishes Application Insight is that it is not just a tool to help resolve known issues and problems. The promise of Application Insight is that it enables IT to identify bottlenecks and re-route data flows proactively to under-utilized devices and ports before end users are stymied by glacial I/O throughput, error conditions, or path failures leading to application brownouts or failures.
Among the key elements of any SAN fabric are the Fibre Channel ports of storage arrays. Through these ports, volumes are "exposed" to hosts. In theory, these storage ports should be some of the most carefully planned and balanced in a fabric. In practice, these ports often end up supporting the aggregate traffic of a random collection of logical volumes as the manual tracking of volume-to-port mappings falls victim to the exponential growth in the number of logical storage devices.
Nonetheless, using Application Insight, a storage administrator can locate any congested ports, analyze the cause of the congestion, and identify the SAN devices that are most likely the cause of the congestion. While a number of factors can cause congestion at a storage port, three issues are at the heart of most congestion problems:
- Too many hosts mapped to logical volumes exported through a port;
- A busy high-traffic application monopolizing the port; and
- A malfunctioning switch port.
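The first two causes in the list above can be screened for with a simple grouping of access paths by storage port, as in this minimal Python sketch. The tuple layout `(host, storage_port, traffic)`, the function names, and the sample values are assumptions for illustration; the third cause, a malfunctioning switch port, would have to be diagnosed from error counters instead.

```python
from collections import defaultdict


def storage_port_contention(access_paths):
    """Return per-host traffic for every storage port shared by two or more hosts."""
    by_port = defaultdict(lambda: defaultdict(int))
    for host, storage_port, traffic in access_paths:
        by_port[storage_port][host] += traffic
    return {port: dict(hosts) for port, hosts in by_port.items() if len(hosts) > 1}


def dominant_host(host_traffic):
    """Return the host contributing the most traffic to a shared storage port."""
    return max(host_traffic, key=host_traffic.get)


# Invented access paths: (host, storage port, traffic volume).
paths = [
    ("host-a", "array1:p0", 100),
    ("host-b", "array1:p0", 700),
    ("host-c", "array1:p0", 200),
    ("host-d", "array1:p1", 50),
]
shared = storage_port_contention(paths)
busiest = dominant_host(shared["array1:p0"])
```

A port with many hosts in its entry points to the first cause, while a port where one host dominates the traffic points to the second.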
What makes this root-cause analysis an easy task to accomplish is SANscreen's construct of a data path as the focal point for the integration of Service Insight, which automatically discovers all of the interrelationships among physical resources, and Application Insight, which extends all the device attributes associated with an access path with real-time SAN traffic data. Organizations using SANscreen can manage storage as a true end-to-end IT service. SANscreen's actionable service-level information is consumable by both storage teams and non-technical storage users, effectively integrating storage into the entire IT service delivery chain.
openBench Labs Scenario
SAN service management software
WHAT WE TESTED
Onaro's SANscreen Service Insight and Application Insight
- Automatic discovery and collection of real-time data (without agents) across all switch ports
- Automatic discovery of service topology and application data paths
- Analysis of traffic loads by host, array, and switch port supports the resolution of bottlenecks, optimization of storage tiers, and discovery of orphaned resources.
HOW WE TESTED
- HP ProLiant ML350 G3 server
- Windows 2003 Server SP2
- To support storage service management, the software provides visibility of the global storage infrastructure, including host-to-storage access paths, storage arrays, switch devices, tapes, and hosts, as well as configuration changes.
- For rapid resolution of QoS issues, SANscreen Service Insight and Application Insight automatically collect, organize, and present both access path and SAN load data in an application-centric context to isolate and diagnose problems.
- Resources that are imbalanced, underutilized, saturated, or orphaned, or do not comply with tiered-storage strategies can be readily identified.