Analyzing I/O performance bottlenecks

Adding more hardware is not the answer to eliminating performance bottlenecks in the data center.

By Greg Schulz

Most data centers have bottleneck areas that impact application performance and service delivery to IT customers and users. Possible bottleneck locations include servers, networks, applications, and storage systems. These bottlenecks, I/O-related ones in particular, impact most applications and are not unique to large enterprises or high-performance computing (HPC) environments. The direct impact of data-center I/O performance issues is a general slowing of systems and applications, causing lost productivity time for users of IT services. Indirect impacts include the additional work IT staff must do to troubleshoot, analyze, reconfigure, and react to application delays and service disruptions.

As shown in the figure above, data-center performance bottleneck impacts include the following:

  • Under-utilization of disk storage capacity to compensate for lack of I/O performance;
  • Poor quality of service (QoS), causing service level agreement (SLA) objectives to be missed;
  • Premature infrastructure upgrades, combined with increased management and operating costs; and
  • Inability to meet peak and seasonal workload demands, resulting in lost business opportunity.

I/O bottleneck impacts

Many applications across different industries are sensitive to timely data access and are impacted by common I/O performance bottlenecks. For example, as more users access a popular file, database table, or other stored data item, resource contention will increase. One way resource contention manifests itself is in the form of database “deadlock,” which translates into slower response time and lost productivity. Given the rising use of Internet search engines and online price shopping, some businesses have been forced to create expensive read-only copies of databases. These read-only copies absorb the additional queries so that the resulting bottlenecks do not impact time-sensitive transaction databases.

In addition to increased application workload, the IT operational procedures used to manage and protect data contribute to performance bottlenecks. These procedures generate additional file I/O for virus scanning, database purge and maintenance, data backup, classification, replication, data migration for maintenance and upgrades, and data archiving. The result is that essential data-center management procedures add to performance challenges and impact business productivity.

Generally, as additional activity or application workload (including transactions and file accesses) is performed, I/O bottlenecks result in increased response time or latency. With most performance metrics, more is better; however, in the case of response time or latency, less is better. The figure above shows the impact as more work is performed (dotted curve): the resulting I/O bottlenecks increase response time (solid curve) above acceptable levels. The specific acceptable response-time threshold will vary by application and SLA requirements.

The acceptable threshold level based on performance plans, testing, SLAs, and other factors serves as a guideline between acceptable and poor application performance.

As more workload is added to a system with existing I/O issues, response time will correspondingly increase (as was shown in the figure above). The more severe the bottleneck, the faster response time will deteriorate from acceptable levels. The elimination of bottlenecks enables more work to be performed while maintaining response time below acceptable service-level threshold limits.
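One way to make the shape of that response-time curve concrete is a single-queue (M/M/1) approximation. This is a textbook simplification chosen here for illustration, not a model named in this article, and the 5 ms service time and 20 ms threshold are hypothetical values. A minimal sketch under those assumptions:

```python
def response_time_ms(service_time_ms, arrivals_per_sec):
    """Mean response time under an assumed M/M/1 queue: R = S / (1 - U)."""
    # Utilization U = arrival rate * service time (converted to seconds).
    utilization = arrivals_per_sec * service_time_ms / 1000.0
    if utilization >= 1.0:
        return float("inf")  # saturated: the queue grows without bound
    return service_time_ms / (1.0 - utilization)


def max_rate_within_threshold(service_time_ms, threshold_ms):
    """Highest arrival rate (I/Os per sec) whose mean response time stays at or below the threshold."""
    if threshold_ms <= service_time_ms:
        return 0.0  # even an idle device cannot meet this threshold
    # Solve S / (1 - U) <= T for the arrival rate.
    return (1.0 - service_time_ms / threshold_ms) * 1000.0 / service_time_ms


# Hypothetical device and service-level values, for illustration only.
SERVICE_MS = 5.0
THRESHOLD_MS = 20.0

for rate in (50, 100, 150, 180, 195):
    print(f"{rate:>4} I/Os per sec -> {response_time_ms(SERVICE_MS, rate):7.1f} ms")

print("max rate within threshold:", max_rate_within_threshold(SERVICE_MS, THRESHOLD_MS))
print("after halving service time:", max_rate_within_threshold(SERVICE_MS / 2, THRESHOLD_MS))
```

Under these assumed numbers, the device can absorb at most 150 I/Os per second while staying within the 20 ms threshold, and halving the service time raises that limit to 350. The loop also shows how quickly response time climbs as the workload approaches saturation.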

To compensate for lack of I/O performance and counter the resulting negative impact to users, a common approach is to add more hardware to mask or move the problem. However, this often leads to extra storage capacity being added to make up for a shortfall in I/O performance. The resulting ripple effect is that now more storage needs to be managed, including allocating storage network ports and configuring, tuning, and backing up data. This can result in environments that have storage utilization well below 50% of the actual capacity. The solution is to address the problem rather than moving the bottleneck elsewhere.
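The arithmetic behind that low utilization can be sketched as follows. All of the figures (a 20,000 I/Os-per-second requirement, 10,000 GB of data, and drives rated at 150 I/Os per second and 300 GB each) are hypothetical values chosen for illustration, and the function names are likewise illustrative.

```python
import math


def drives_needed(required_iops, required_gb, iops_per_drive, gb_per_drive):
    """Return (drives to buy, drives needed for performance, drives needed for capacity)."""
    for_performance = math.ceil(required_iops / iops_per_drive)
    for_capacity = math.ceil(required_gb / gb_per_drive)
    # The larger of the two requirements dictates the purchase.
    return max(for_performance, for_capacity), for_performance, for_capacity


def capacity_utilization(required_gb, drive_count, gb_per_drive):
    """Fraction of purchased raw capacity that actually holds data."""
    return required_gb / (drive_count * gb_per_drive)


# Hypothetical workload and drive characteristics.
total, for_perf, for_cap = drives_needed(
    required_iops=20_000, required_gb=10_000, iops_per_drive=150, gb_per_drive=300
)
utilization = capacity_utilization(10_000, total, 300)

print(f"drives for performance: {for_perf}, drives for capacity: {for_cap}")
print(f"capacity utilization: {utilization:.1%}")
```

With these assumed figures, performance calls for 134 drives while capacity alone calls for 34, so only about a quarter of the purchased capacity holds data.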

Another common challenge and cause of I/O bottlenecks is seasonal and/or unplanned workload increases that result in application delays and frustrated customers. The figure shows a workload representing an eCommerce transaction-based system with seasonal spikes in activity (dotted curve). The resulting impact to response time (solid curve) is shown in relation to a threshold line of acceptable response time performance.

Approaches to this problem have been either to do nothing (and incur the service disruptions) or to over-configure by throwing more hardware and software at it. When systems are over-configured to support peak workloads and prevent loss of business revenue, excess storage capacity must be managed throughout the non-peak periods, adding to data-center and management costs. Besides hurting user productivity through poor performance, I/O bottlenecks can result in system instability or unplanned application downtime.
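The trade-off between those two options can be illustrated with a small sketch. The twelve monthly workload values are hypothetical (arbitrary units, with a seasonal spike in the last three months), and `evaluate_sizing` is an illustrative helper, not a tool referenced in this article.

```python
def evaluate_sizing(monthly_workload, provisioned):
    """Return (months where demand exceeds what is provisioned, average utilization)."""
    over_capacity_months = [
        month
        for month, demand in enumerate(monthly_workload, start=1)
        if demand > provisioned
    ]
    # Work actually served is capped at what was provisioned.
    served = sum(min(demand, provisioned) for demand in monthly_workload)
    average_utilization = served / (provisioned * len(monthly_workload))
    return over_capacity_months, average_utilization


# Hypothetical seasonal workload: flat for nine months, then a spike.
WORKLOAD = [100, 100, 100, 100, 100, 100, 100, 100, 100, 150, 300, 400]

for label, provisioned in (("do nothing", 100), ("size for peak", 400)):
    months, utilization = evaluate_sizing(WORKLOAD, provisioned)
    print(f"{label}: over-capacity months={months}, average utilization={utilization:.0%}")
```

In this assumed scenario, sizing for the baseline leaves three months over capacity, while sizing for the peak removes the overload but leaves the configuration mostly idle for the rest of the year.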

Putting a value on the performance of applications and their importance to your business is a necessary step in deciding where and what to focus on for improvement. For example, what is the value of reducing application response time and the associated business benefit of allowing more transactions, reservations, or sales to be made? Likewise, what is the value of improving the productivity of a designer or animator to meet tight deadlines and market schedules? What is the business benefit of enabling a customer to more quickly search for, order, and download streaming video on demand?

Server-I/O performance gap

It should come as no surprise that businesses continue to consume and rely on larger amounts of disk storage. Disk storage and I/O performance fuel the needs of applications to meet SLAs and QoS objectives. Even with efforts to reduce storage capacity or improve capacity utilization, applications leveraging rich content will continue to consume more storage capacity and require additional I/O performance. Similarly, the current trend of making and keeping additional copies of data for regulatory compliance and business continuity is expected to continue. These demands all add up to a need for more I/O performance to keep up with server processor performance improvements.

The continued need for accessing more storage capacity results in an expanding gap between server processing power and storage I/O performance, as shown in the figure. This server-to-I/O performance gap has existed for several decades and continues to widen. The net impact is that bottlenecks associated with the server-to-I/O performance gap result in lost productivity for users who have to wait for transactions, queries, and data access requests to be resolved.

I/O performance bottlenecks are common across most data centers, affecting many applications and industries. It is important to understand the value of performance, including response time, for each environment and particular application. While the cost per raw terabyte may seem relatively inexpensive, the cost for I/O response time performance also needs to be effectively addressed and put into the proper context as part of the data-center QoS cost structure.

There are many approaches to address data-center I/O performance bottlenecks, with most currently centered on adding more hardware.

However, the key to eliminating data-center I/O bottlenecks is to address the problem instead of simply moving or hiding it with more hardware.

Greg Schulz is founder and senior analyst at the StorageIO Group, and author of the book, “Resilient Storage Networks - Designing Flexible Scalable Data Infrastructures” (Elsevier).

This article was originally published on December 01, 2006