Bank uses monitoring tools to manage 250TB SAN

Benefits include improved problem diagnosis and better performance, response times, and reliability.

By Scott R. Humphrey

The Halifax Bank of Scotland (HBOS), which was formed through the merger of Halifax and Bank of Scotland in 2001, has assets of more than £400 billion and is the UK's largest mortgage and savings provider. With more than 22 million customers, HBOS has a relationship with two out of every five UK households.

Part of the bank's customer commitment entails a massive information technology infrastructure that helps ensure customers have access to all of HBOS's information and banking services at all times. In other words, system downtime is unacceptable. That is no easy task considering that HBOS has more than 67,000 corporate users accessing everything from Microsoft Exchange servers to Oracle and SAS business intelligence and data warehousing applications. Hundreds of transactional and back-office systems support the HBOS retail banking operations.

In its production SAN, HBOS maintains more than 3,000 ports split across two data centers in West Yorkshire, England. The SAN includes more than 300 Sun E15K Solaris servers, 149 Brocade switches, and 250TB of Hitachi Data Systems Lightning 9900 V Series and Thunder 9500 V Series disk arrays in a split fabric configuration. In addition to the Sun servers, HBOS also has IBM p690 Regatta servers running AIX and Hewlett-Packard Integrity Superdome servers running HP-UX. Veritas volume management software is used across the platforms.

And that's just the production environment. HBOS also maintains a scaled-down version of the main SAN fabric that is virtually identical to the production environment, down to the split fabric configuration. All new releases and products are put through rigorous interoperability verification testing there. The test environment alone includes 142 servers.

"We greatly lessen the chance of a problem's working its way into the production system if we have performed testing and analysis in a virtually identical pre-production environment," says Simon Close, HBOS's technical team leader, storage management services. "It has proved to be a key part of our overall IT success." Close is responsible for developing the adoption of SAN best practices and service improvement initiatives.

"It is a huge undertaking to efficiently operate a SAN of this size and complexity," says Close. "Even though we have worked hard to standardize on certain technologies across the enterprise, the reality is that to optimally run a banking operation of our scope requires supporting a diverse heterogeneous computing environment running a multitude of operating systems and a wide variety of applications."

"Application changes, operating system patches, or simply adding servers or storage modules are just a few of the reasons SAN performance can be impacted," says Richard Briggs, senior technical infrastructure developer in HBOS's storage management services group. Briggs was the technical lead on the initial SAN implementation at HBOS and currently serves as chairman of the Brocade UK User Group. "With the size and complexity of our SAN fabric, it is important that we have our own SAN performance monitoring and analysis solution in place to help diagnose hard-to-find problem areas," Briggs explains

"One of the biggest challenges we face today is the perceived performance issues related to our SAN," says Close. "Hardware problems are relatively simple to isolate, diagnose, and fix, but it poses problems when a user tells us that 'an application is slow.' Other than checking the application itself, the possibilities are endless as to where the source of the problem might be."

"None of the tools that we had been using to monitor our SAN's performance could get us to the root of the problem, and some problems-in the early days-went unsolved," admits Briggs. "With the monitoring and analysis tools bundled with each product, we aren't able to dive deep enough into problem areas, so we used a 'process-of-elimination' method that we recognized was not an effective way to handle these issues."

Detailed view of network traffic

That's why HBOS began evaluating SAN monitoring tools and eventually purchased Finisar's NetWisdom SAN performance and analysis solution. NetWisdom is a real-time monitoring system that enables SAN managers to view critical data about their storage networks and to increase performance and reliability.

"NetWisdom allows us to turn our storage service around from being reactive to proactive," says Close. "Whenever performance issues are encountered it is always the SAN that is highlighted as the problem by our customers. NetWisdom enables us to provide evidence to either support or refute the claim, which was not possible before.

"We have recently been able to demonstrate that SAN response times are well within the thresholds that have been dictated and that the problem most likely was within the application itself," Close continues. "Armed with that information, groups can evaluate their application design and re-develop or tune the application accordingly."

"Everything used to be the SAN's fault," Briggs jokes. "But now we have documented plenty of cases where a bad application or bad coding-not the SAN-was the source of the problem. But wherever the problem lies, we are committed to finding it quickly and getting it resolved."

Briggs adds that HBOS now initiates support calls to the appropriate vendor complete with a trace that details the issues to be dealt with. "It really helps jump-start the support process," he says.

Tracking performance levels

According to Close, HBOS is starting to baseline the performance of its systems while they are in pre-production so the company can measure their performance before they are deployed in the production SAN. HBOS can then benchmark each system again once it is brought online, and again six months later, so the organization can continually monitor not only current overall performance but also historical performance levels of individual SAN components.

"By doing this, we are much better able to track performance over time, which often can give us a clue that a server, disk, or switch might be starting to malfunction," Close explains. "By identifying these potential problem areas early in the process we can proactively resolve potential problems before they occur."

According to Close, NetWisdom also helps with storage capacity planning and load balancing. "Using our best practices, we want to design systems that are optimal and that will scale," says Close. "We spread the load across storage ports and subsystems to optimize performance."
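
The article does not say how HBOS decides which workload goes to which port. One common way to spread load, shown here purely as a sketch with hypothetical workload names, port names, and throughput figures, is to place the heaviest workloads first and always choose the port that is currently least loaded.

```python
import heapq

def spread_load(workloads, ports):
    """Greedy placement: assign each (name, mb_per_s) workload to the
    least-loaded port, handling the heaviest workloads first."""
    heap = [(0.0, port) for port in ports]   # (current load, port name)
    heapq.heapify(heap)
    assignment = {}
    for name, load in sorted(workloads, key=lambda w: w[1], reverse=True):
        total, port = heapq.heappop(heap)    # least-loaded port so far
        assignment[name] = port
        heapq.heappush(heap, (total + load, port))
    return assignment

# Hypothetical workloads (MB/s) and storage ports.
workloads = [("oracle", 120.0), ("exchange", 80.0), ("sas", 60.0)]
print(spread_load(workloads, ["port-A", "port-B"]))
# {'oracle': 'port-A', 'exchange': 'port-B', 'sas': 'port-B'}
```

This greedy rule does not guarantee a perfectly even split, but it is simple and usually keeps any single port from becoming a hot spot.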

"From a storage management perspective, NetWisdom allows SAN managers to isolate issues quickly and resolve them in a timely manner," says Neil Collier, technical director at GCH Test & Computer Services Ltd., a Finisar value-added reseller (VAR) working with HBOS. "This enables them to spend more time tuning the SAN and less time fighting fires."

Scott R. Humphrey is president of Humphrey Strategic Communications (www.strategic-pr.com/) in Portland, OR.

This article was originally published on December 01, 2004