Opening Pandora's box of SAN management

By David Vellante, wikibon.org

SAN performance and management tools from disk array vendors fail to provide the requisite heterogeneity, metrics, and interoperability to enable SAN managers to effectively manage performance. And virtualization exacerbates their inadequacies.

This was the assertion of a SAN practitioner at a major US credit card firm, Ryan Perkowski, who addressed the Wikibon.org community at a Peer Incite Research Meeting.

Perkowski discussed how he and his team were able to dramatically improve processes around SAN performance management by gaining better visibility through the use of SAN management tools from Virtual Instruments and NetApp. Perkowski contends that while these tools were initially difficult to cost-justify, their payback has been substantially higher than expected.

NetWisdom provides a real-time and historical view of a SAN

Perkowski's firm is a Cisco shop, and he manages a SAN environment that is about 75% AIX systems with some Windows-based VMware hosts for test and development applications. His firm has grown storage capacity from 30TB to 450TB in three years. Driving that growth has been the acquisition of new credit card customers and accounts. The key applications in the shop are analytics and data warehouse systems based on a 20TB Oracle warehouse and other warehouses, including a large SAS instance. Perkowski manages a combination of EMC DMX-3 and DMX-4 arrays, as well as NAS systems.

Whales floating through the SAN

According to Perkowski, these large warehouse applications are like "whales floating through the SAN." They caused frequent and intermittent performance bottlenecks that were difficult to pinpoint, complicated by the fact that the organization leverages the virtualization capabilities of AIX.

The firm was experiencing major SAN performance headaches and what Perkowski referred to as "gray performance issues," meaning that the root cause of the problem was difficult to find. These sporadic and unpredictable slowdowns led to a standard operating procedure: When a performance problem occurred, the SAN got blamed.

As in virtually all high-performance environments, Perkowski had to over-design the SAN to accommodate these fluctuations in performance. The challenge increasingly became tackling the endless "dark art" of SAN management, which required an unproductive set of activities. Perkowski at one point described this as "grabbing and shaking the crazy black eight ball" to try to find answers. Clearly, the organization was struggling with this problem, especially given its high rate of storage growth.

Visibility, metrics, trending

The metrics Perkowski was able to gather from his EMC array-based tools were limited to parameters such as cache hit rate, spindle response times, and other array-specific data. What he lacked was a fuller picture, especially from the perspective of the end user.

Perkowski initiated a proof-of-concept using Virtual Instruments' NetWisdom and NetApp's SANscreen (which NetApp acquired from Onaro).

NetWisdom is a dedicated monitoring tool that uses a combination of software and hardware to probe the storage network and in particular the components that are problematic, in this case the Oracle data warehouse infrastructure.

SANscreen is a heterogeneous service management suite which, among other things, describes the relationships between a particular application on a given server and its data on a storage device.

The combination of these two tools immediately gave Perkowski an avalanche of useful metrics about his SAN. By accessing trending data on metrics such as MBps, CRC errors, logins and logouts, he was able to either confirm or eliminate storage as the bottleneck. Perkowski eventually rolled these tools out as fundamental components of his infrastructure, initially with a single probe around the Oracle data warehouse, and adding probes into other systems over time.

The results were a dramatic improvement in problem determination and remediation, and a credibility boost for the SAN team. Perkowski shared an example where the application developer had suggested the best backup window for a particular system was between 6 p.m. and midnight, prior to an automated batch job that kicked off overnight. However, the backup team was unable to complete the job within the prescribed window. Perkowski asked the backup team to refrain from performing the backup the next evening while he performed his analysis. He found that I/O activity on the SAN spiked from 6 p.m. to midnight, the exact times the application developer had said activity would be lowest and best for the backup.

Perkowski went to the user organization and asked a few questions. As it turned out, the users were all queuing up batch jobs just before they left the building, hitting return at 6 p.m. and running their queries into the evening. It was one of the busiest times for the application on the SAN. Perkowski says he never would have been able to gain the visibility to resolve this problem quickly without the third-party SAN management tools.
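The kind of check described above, testing a proposed backup window against trended throughput, can be sketched in a few lines of Python. This is only an illustration: the function name, the 24-value hourly profile and the MBps figures are hypothetical and are not taken from NetWisdom, SANscreen or Perkowski's environment.

```python
from typing import Sequence, Tuple


def quietest_window(hourly_mbps: Sequence[float], window_hours: int) -> Tuple[int, float]:
    """Return (start_hour, average_mbps) of the least busy contiguous window.

    hourly_mbps holds 24 average throughput values, index 0 = midnight to 1 a.m.
    The window is allowed to wrap around midnight.
    """
    hours = len(hourly_mbps)
    if hours != 24:
        raise ValueError("expected 24 hourly samples")
    if not 1 <= window_hours <= hours:
        raise ValueError("window_hours must be between 1 and 24")

    best_start, best_avg = 0, float("inf")
    for start in range(hours):
        # Sum the window, wrapping past hour 23 back to hour 0.
        total = sum(hourly_mbps[(start + i) % hours] for i in range(window_hours))
        avg = total / window_hours
        if avg < best_avg:
            best_start, best_avg = start, avg
    return best_start, best_avg


# Hypothetical profile (MBps): overnight batch, quiet morning,
# moderate afternoon, heavy evening queries starting at 6 p.m.
sample_profile = [300.0] * 6 + [150.0] * 6 + [200.0] * 6 + [600.0] * 6

proposed_avg = sum(sample_profile[18:24]) / 6   # the 6 p.m. to midnight window
start_hour, quiet_avg = quietest_window(sample_profile, 6)
print(f"Proposed window average: {proposed_avg:.0f} MBps")
print(f"Quietest 6-hour window starts at hour {start_hour} ({quiet_avg:.0f} MBps)")
```

With a profile shaped like this invented one, the proposed evening window turns out to be the busiest part of the day, which is the pattern Perkowski found in his own trending data.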

Justification for SAN tools

The challenge Perkowski sees for SAN practitioners is that the benefits of tools such as NetWisdom and SANscreen are hard to predict prior to installation. His organization can cite the following areas of improvement:

  • Much faster and more accurate problem determination
  • Better capacity planning
  • More efficient provisioning
  • Cost savings through better IT productivity and more efficient use of SAN capacity
  • Substantially better application performance predictability and quality of service
  • Elimination of existing licenses and maintenance fees for array-based management software

The challenge for SAN managers is that they have no way of knowing the degree to which these tools will save money and improve service levels until they run a proof of concept. In the case of Perkowski's firm, the target applications are revenue generators (e.g., credit card transaction enablers). Users should understand that the cost components of tools such as Virtual Instruments' NetWisdom are roughly as follows:

  • Software costs (about $50,000 per probe)
  • Splitter costs (about $300 per port)
  • Server capacity to run the software
  • Additional disk capacity to house the tools and analysis
  • Time to install (a few days)

In the case of Perkowski's firm, the total costs roughly equated to $500,000 to accommodate about half of his 450TB of capacity -- the high-performance half. This equates to approximately $2,200 per TB of monitored capacity. The benefits easily outweigh the costs, according to Perkowski, but he had no way to know this going in. As such, a combination of existing pain, faith and intelligent proofs of concept will reduce risk for SAN managers and result in potentially substantial benefits.
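For readers who want to rerun the per-TB arithmetic with their own numbers, here is a minimal Python sketch. The $500,000 total, the 450TB capacity and the one-half coverage come from the figures above; the function name and the whole-SAN comparison are illustrative assumptions, not part of Perkowski's analysis.

```python
def cost_per_tb(total_cost_usd: float, total_capacity_tb: float, covered_fraction: float) -> float:
    """Cost of the tooling per TB of capacity that is actually monitored."""
    covered_tb = total_capacity_tb * covered_fraction
    if covered_tb <= 0:
        raise ValueError("covered capacity must be positive")
    return total_cost_usd / covered_tb


# Figures cited in the article: about $500,000 covering about half of 450TB.
monitored_half = cost_per_tb(500_000, 450, 0.5)

# Illustrative comparison only: the same spend averaged over the whole SAN.
whole_san = cost_per_tb(500_000, 450, 1.0)

print(f"Per monitored TB: ${monitored_half:,.0f}")
print(f"Averaged over all capacity: ${whole_san:,.0f}")
```

On the article's rounded inputs this works out to a little over $2,200 per monitored TB, or roughly half that when averaged across all 450TB.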

Most array-based SAN management tools are deficient in their ability to detect and help remediate storage network performance problems. Organizations in high-performance, high-growth SAN environments should evaluate heterogeneous tools such as Virtual Instruments' NetWisdom, which can provide valuable metrics, trending data and end-to-end visibility on performance bottlenecks. Tools such as NetApp's SANscreen are complementary and can simplify change management and capacity planning. The ROI of these tools will be a function of the size of the SAN, its growth rate, and the value of applications to the business.

David Vellante is a founder and member of the wikibon.org community.

Suburban Propane drills deep

By Dave Simpson

Suburban Propane Partners, a nationwide distributor of propane, fuel oil and related products and services (and a marketer of natural gas and electricity), deployed VMware about three years ago, at the same time making a number of upgrades to its data center.

Tom Chorba, Suburban Propane's senior supervisor of networks, says that at the time they had two concerns: How to tell how many servers they could consolidate onto the virtual machines without negatively affecting performance, and how to solve a performance problem.

The company deployed Akorri's BalancePoint software, which provides visibility -- analytics and reporting -- into SANs, as well as physical/virtual server environments.

Regarding the performance issue, the company originally suspected that the problem was related to the Windows environment. But after firing up BalancePoint they discovered that the problem was in the SAN; specifically, contention in the disk groups.

In addition, "BalancePoint's Performance Index showed us we could push our virtual machine hosts harder. We were able to significantly increase the number of servers on our virtual machines and save costs in hardware purchases," says Chorba.

BalancePoint provides historical trending and analysis of the performance impacts of VMotion through its Disk Group Summary feature, which shows a view of infrastructure response time, usage index and throughput.

Suburban Propane currently has 200 servers running on 15 VM hosts, with virtualized applications from vendors such as PeopleSoft, Citrix and SAP, as well as Microsoft Exchange.

"BalancePoint gives us a soup-to-nuts view of both virtual and physical server and storage environments, and provides a combination of holistic and deep drill-down reporting in an easy-to-read dashboard view," says Chorba.

Akorri defines BalancePoint as virtual infrastructure management software for performance and capacity management. The software provides cross-domain (servers and storage) monitoring and analysis to optimize performance via troubleshooting, as well as capacity utilization.

For virtualized environments, BalancePoint supports vSphere (running as a virtual appliance in a vSphere VM) and plugs into vCenter Server. Support for Microsoft's Hyper-V is in the works.



This article was originally published on March 01, 2010