SAN management user case study, part 1

Editor's note: This is the first installment in a four-part series on how a SAN manager at a major U.S. credit card firm solved his SAN management challenges in a storage network with more than 450TB of capacity.

Opening Pandora's box of SAN management

By David Vellante, Wikibon.org

SAN performance and management tools from array vendors fail to provide the requisite heterogeneity, metrics, and interoperability to enable SAN managers to effectively manage performance. And virtualization exacerbates their inadequacies.

This was the assertion of Ryan Perkowski, a SAN practitioner at a major U.S. credit card firm, who addressed the Wikibon community at a recent Peer Incite Research Meeting.

Perkowski shared with nearly 60 Wikibon members in attendance how he and his team dramatically improved their SAN performance management processes by gaining better visibility through SAN management tools from Virtual Instruments and NetApp. It is Perkowski's contention that while these tools were initially difficult to justify, their payback has been substantially higher than expected.

Perkowski's firm is a Cisco shop, and he manages a SAN environment that is about 75% AIX systems with some Windows-based VMware hosts for test and development applications. His firm has grown storage capacity from 30TB to 450TB in three years. Driving that growth has been the acquisition of new credit card customers and accounts. The key applications in the shop are analytics and data warehouse systems based on a 20TB Oracle warehouse and other warehouses including a large SAS instance. Perkowski manages a combination of EMC DMX-3 and DMX-4 arrays, as well as NAS systems.

Whales floating through the SAN

According to Perkowski, these large warehouse applications are like "whales floating through the SAN." They caused frequent and intermittent performance bottlenecks that were difficult to pinpoint, complicated by the fact that the organization leverages the virtualization capabilities of AIX.

The firm was experiencing major SAN performance headaches and what Perkowski referred to as "gray performance issues," meaning that the root cause of a problem was difficult to find. These sporadic, unpredictable slowdowns led to a standard operating procedure in which, whenever a performance problem occurred, the SAN got the blame.

Like managers of virtually all high-performance environments, Perkowski had to over-design the SAN to accommodate these fluctuations in performance. The challenge increasingly became tackling the endless "dark art" of SAN management, which required an unproductive set of activities. Perkowski at one point described this as "grabbing and shaking the crazy black eight ball" to try to find answers. Clearly, the organization was struggling with this problem, especially given its high rate of storage growth.

Visibility, metrics and trending

Metrics Perkowski could gather from his EMC array-based tools were limited to parameters such as cache hit rate, spindle response times, and other array-specific data. What he lacked was a fuller picture, especially from the end user's perspective.

Perkowski initiated a proof-of-concept using NetWisdom from Virtual Instruments and SANscreen from NetApp (which it acquired from Onaro).

NetWisdom is a dedicated monitoring tool that uses a combination of software and hardware to probe the storage network and in particular the components that are problematic, in this case the Oracle data warehouse infrastructure.

SANscreen is a heterogeneous service management suite which, among other things, describes the relationships between a particular application on a given server and its data on a storage device.
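The kind of application-to-storage relationship mapping described above can be illustrated with a small sketch. This is purely hypothetical; the application, host, array, and LUN names below are invented for illustration and do not come from SANscreen's actual data model.

```python
# Hypothetical sketch of an application -> host -> storage relationship map,
# the kind of view a service management suite maintains. All names are
# illustrative assumptions, not SANscreen's real data model.
from dataclasses import dataclass

@dataclass(frozen=True)
class Path:
    application: str
    host: str
    array: str
    lun: str

# Illustrative topology: which app, on which server, maps to which LUN.
paths = [
    Path("oracle-dw", "aix-prod-01", "DMX-4", "0A3F"),
    Path("oracle-dw", "aix-prod-02", "DMX-4", "0A40"),
    Path("sas-analytics", "aix-prod-03", "DMX-3", "01B2"),
]

def storage_for(app):
    """Answer 'where does this application's data live?'"""
    return sorted({(p.array, p.lun) for p in paths if p.application == app})

print(storage_for("oracle-dw"))   # [('DMX-4', '0A3F'), ('DMX-4', '0A40')]
```

The value of such a map is that a performance complaint against an application can be traced immediately to the specific arrays and LUNs involved, rather than blaming "the SAN" wholesale.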

The combination of these two tools immediately gave Perkowski an avalanche of useful metrics about his SAN. By accessing trending data on metrics such as MBps, CRC errors, and logins and logouts, he was able to either confirm or eliminate storage as the bottleneck. Perkowski eventually rolled these tools out as fundamental components of his infrastructure, starting with a single probe around the Oracle data warehouse and adding probes into other systems over time.
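The basic idea behind using trending data to confirm or eliminate storage as the bottleneck can be sketched as a baseline comparison: flag any port whose current reading deviates sharply from its historical trend. This is a minimal illustration, not Virtual Instruments' actual logic; the port names and sample values are invented.

```python
# Minimal sketch (not NetWisdom's actual algorithm) of baseline-versus-current
# anomaly flagging on per-port trending metrics such as MBps.
from statistics import mean, stdev

def flag_anomalies(history, current, threshold=3.0):
    """Flag ports whose current reading is more than `threshold` standard
    deviations above the historical mean for that metric."""
    anomalies = []
    for port, samples in history.items():
        mu, sigma = mean(samples), stdev(samples)
        reading = current[port]
        if sigma and (reading - mu) / sigma > threshold:
            anomalies.append((port, reading, mu))
    return anomalies

# Illustrative per-port MBps samples from the trending window.
history = {
    "fc1/1": [180, 190, 185, 175, 188],
    "fc1/2": [60, 55, 58, 62, 59],
}
current = {"fc1/1": 182, "fc1/2": 410}   # fc1/2 is spiking

print(flag_anomalies(history, current))  # flags fc1/2 only
```

A flagged port confirms a storage-network hot spot; an empty result eliminates the SAN as the culprit and points the investigation elsewhere, which is exactly the "confirm or eliminate" value the trending data provided.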

The results were a dramatic improvement in problem determination and remediation, and a credibility boost for the SAN team. Perkowski shared an example in which the application developer had suggested that the best backup window for a particular system was between 6 p.m. and midnight, prior to an automated batch job that kicked off overnight. However, the backup team was unable to complete the job within the prescribed window. Perkowski asked the backup team to refrain from performing the backup the next evening while he performed his analysis. He found that I/O activity on the SAN spiked from 6 p.m. to midnight, the exact window the application developer had said activity would be lowest and best for the backup.

Perkowski went to the user organization and asked a few questions. As it turned out, the users were all queuing up batch jobs just before they left the building, hitting return at 6p.m. and running their queries into the evening. It was one of the busiest times for the application on the SAN. Perkowski says he never would have been able to gain the visibility to resolve this problem quickly without the third-party SAN management tools.

Justification for SAN tools

The challenge Perkowski sees for SAN practitioners is that the benefits of tools such as NetWisdom are hard to predict prior to installation. His organization can cite the following areas of improvement:

• Much faster and more accurate problem determination
• Better capacity planning
• More efficient provisioning
• Cost savings through better IT productivity and more efficient use of SAN capacity
• Substantially better application performance predictability and quality of service
• Elimination of existing licenses and maintenance fees for array-based management software.

The challenge for SAN managers is that they have no way of knowing the degree to which these tools will save money and improve service levels until they run a proof of concept. In the case of Perkowski's firm, the target applications are revenue generators (e.g., credit card transaction enablers). Users need to understand that the costs of such tools break down roughly as follows:

• Software costs (about $50,000 per probe)
• Splitter costs (about $300 per port)
• Server capacity to run the software
• Additional disk capacity to house the tools and analysis
• Time to install (a few days)

In the case of Perkowski's firm, the total costs roughly equate to $500,000 to cover about half of his 450TB of capacity -- the high-performance half. This works out to roughly $2,000 per TB. The benefits easily outweigh the costs, according to Perkowski, but he had no way to know this going in. As such, a combination of existing pain, faith, and intelligent proofs of concept will reduce risk for SAN managers and deliver potentially substantial benefits.
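The back-of-the-envelope math can be reproduced from the figures above. The per-probe and per-port prices and the $500,000 total come from the article; the probe and port counts below are assumptions chosen only to show how such a budget might be assembled.

```python
# Cost model using the article's cited figures. Probe and port counts are
# illustrative assumptions, not numbers from the article.
software_per_probe = 50_000     # dollars per probe, per the article
splitter_per_port = 300         # dollars per port, per the article

probes = 8                      # assumed
ports = 200                     # assumed

probe_and_splitter = probes * software_per_probe + ports * splitter_per_port
# 8 * 50,000 + 200 * 300 = 460,000; server capacity, disk, and install
# time would bring the total toward the ~$500,000 the article cites.

monitored_tb = 450 / 2          # the high-performance half of 450TB
cost_per_tb = 500_000 / monitored_tb
print(f"${cost_per_tb:,.0f} per TB")
```

At $500,000 over about 225TB, the exact figure is roughly $2,222 per TB, consistent with the article's "roughly $2,000 per TB."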

Action Item: Most array-based SAN management tools are deficient in their ability to detect and help remediate storage network performance problems. Organizations in high-performance, high-growth SAN environments should evaluate heterogeneous tools such as Virtual Instruments' NetWisdom, which can provide valuable metrics, trending data, and end-to-end visibility into performance bottlenecks. Tools such as NetApp's SANscreen are complementary and can simplify change management and capacity planning. The ROI of these tools will be a function of the size of the SAN, its growth rate, and the value of applications to the business.

Dave Vellante is a founder and member of the Wikibon.org community.

This article was originally published on January 27, 2010