Performance characterization of FC-AL loops, part 2
This is the second of a two-part study; part 1 ran in the May issue.
Thomas M. Ruwart
Laboratory for Computational Science and Engineering, University of Minnesota
A loop gets "larger" as the number of devices physically connected to the loop increases. In this study, the number of disk devices is increased from 1 to 96 for a total of 97 nodes including the host bus adapter. Performance measurements are taken with 12, 24, 48, and 96 disks on the loop. In order to get very detailed information about events on the loop, analyzer traces are taken for many of the tests.
There are three points of interest covered in this part of the study. First, how the performance of a single disk is affected by the presence of other non-participating devices on the loop. Second, what happens to the aggregate performance of the loop as congestion increases. And finally, what happens to the performance of the individual benchmark threads as congestion increases.
The overall effect of a highly populated loop depends heavily on the amount of data being transmitted. To test this effect, a single disk is accessed using 128 read and write operations of 1024 bytes, 2048 bytes, and 4 megabytes. A set of access tests is run for loop populations of 12, 24, 48, and 96 disks. The additional disks are not active in the sense that they are never accessed, but they do have a presence on the loop. All I/O operations are time-stamped, and analyzer traces are taken for each access test.
For short transfers the effect was measurable but still relatively small; for large transfers it was not significant. Graph 9 shows the shortest recorded time for each I/O operation of a single disk for read operations of 1024 and 2048 bytes. The I/O time increases steadily as non-active devices are added to the loop. The graph shows two trends: first, and most obvious, the I/O time increases as devices are added; second, as the data transfer size increases, the effect of additional devices becomes negligible.
The measured increase in command operation time amounts to 99 microseconds for 1024-byte read operations when an additional 84 devices are added to the loop. Each of the three phases in a read command (Command, Data, and Status) takes three loop tenancies to complete for a total of nine loop tenancies for a single read command. The increase per loop tenancy is approximately 19 microseconds (84 times the elasticity buffer delay per node of 226 nanoseconds). Therefore, the expected increase in the overall command time is 9 times 19 microseconds or 171 microseconds. The observed value of 99 microseconds is only about half that. So where did the remaining time go?
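The expected-increase arithmetic above can be checked with a short script; the only inputs are the 226-nanosecond elasticity buffer delay per node and the nine loop tenancies per read command, both taken from the text.

```python
# Expected command-time increase when 84 extra devices join the loop,
# using the per-node elasticity buffer delay and the 9 loop tenancies
# per read command (3 each for Command, Data, and Status).

ELASTICITY_DELAY_NS = 226       # elasticity buffer delay per node (from text)
added_devices = 84              # 96 disks minus the original 12
tenancies_per_command = 9       # 3 tenancies per phase, 3 phases

per_tenancy_us = added_devices * ELASTICITY_DELAY_NS / 1000
expected_us = tenancies_per_command * per_tenancy_us
print(per_tenancy_us, expected_us)  # ~19 us per tenancy, ~171 us per command
```

The script reproduces the ~19 microseconds per tenancy and ~171 microseconds per command quoted above; the gap between this prediction and the measured 99 microseconds is what the phase-by-phase dissection below accounts for.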
Using the analyzer traces, it is possible to partially dissect each read operation into its three phases and further into each loop tenancy. Table 1 shows the time for each phase (Command, Data, and Status), as well as the idle time between phases, on a 50-meter loop populated with 12 and 96 devices. The Delta row is the increase or decrease in time for each phase as the population changes from 12 to 96 devices. A single loop tenancy is measured at approximately 20 microseconds when all 96 disk drives are attached to the loop, consistent with the theoretical propagation delay obtained by multiplying the number of devices on the loop by the 226-nanosecond propagation delay per node.
With the addition of 84 devices, the increase in each loop tenancy is approximately 19 microseconds. The Command phase (three loop tenancies) shows an increase of 52 microseconds, which is consistent with the theoretical increase of 57 microseconds (3 times 19). As expected, there is no increase in the idle time between the Command and Data phases. The Data phase increases only 34 microseconds instead of the expected 57. Upon closer inspection, however, the missing time was most likely absorbed in the idle time between the Data and Status phases, which shows up as a decrease in time. The Status phase likewise increases only 34 microseconds, and again the additional time is most likely absorbed in the inter-command idle time. (Using a single 2-channel analyzer as described in Figure 2, it is difficult to capture events that occur on both sides of the sending and receiving devices; a 4-channel analyzer would be required.)
Therefore, the net effect of a loop populated with 96 devices is an increase of about 11-13% in command time for small (1024-byte) read requests. The effect decreases as the request size grows; for 4 MB transfers it was less than 0.1%.
Long Large Loops
The long, large loop is tested by populating a 30-kilometer loop with the same 96 devices used in the large-loop configuration; disks are added incrementally and benchmarks are run as in the large-loop tests. The loop trip time for long loops (7.5 kilometers and above) quickly becomes the dominant factor in the overall delay (see Graphs 9-12). From the previous discussion on large loops, the increase in propagation delay due to large loop populations is on the order of 20 microseconds per loop tenancy. Similarly, from the discussion on the effects of long loop lengths, the increase in propagation delay is approximately 35 to 150 microseconds for loop lengths of 7.5 to 30 kilometers respectively. Together, the propagation delay for a single trip around the loop amounts to approximately 170 microseconds, a value that has been verified with the analyzer traces.
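The ~170-microsecond trip time can be approximated from first principles. The sketch below assumes a fiber propagation delay of about 5 ns/m (light in fiber travels at roughly 2e8 m/s); the 226-nanosecond per-node delay comes from the text.

```python
# Rough loop-trip-time model: fiber propagation plus per-node
# elasticity buffer delay. The 5 ns/m fiber figure is an assumption.

FIBER_NS_PER_M = 5              # assumed: ~2e8 m/s signal speed in fiber
NODE_DELAY_NS = 226             # per-node elasticity buffer delay (from text)

def loop_trip_us(length_km, devices):
    """Approximate one-way trip time around the loop, in microseconds."""
    fiber_ns = length_km * 1000 * FIBER_NS_PER_M
    node_ns = devices * NODE_DELAY_NS
    return (fiber_ns + node_ns) / 1000

print(loop_trip_us(30, 96))     # ~172 us, close to the measured ~170 us
print(loop_trip_us(7.5, 96))    # fiber alone contributes ~37.5 us here
```

For the 30-kilometer, 96-device loop the model gives roughly 172 microseconds, in line with the analyzer-verified value quoted above.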
The access fairness mechanism worked as advertised, as can be seen in the results from the tests that simultaneously accessed 96 devices on the loop. Graph 13 plots the completion times of each of the 96 devices on a 30-kilometer loop in order of increasing priority. If the devices were "unfair," the plot would show the higher-priority devices completing before the lower-priority devices. Graph 14 shows the unfair behavior of 26 disks on a 30-kilometer loop. The unfair devices are older Barracuda 9 disks running a very old version of firmware that did not implement the fairness algorithm. It is clear from Graph 14 that the higher-priority devices get preferential access to the loop, since they finish long before the lower-priority (lower-address) devices.
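The fair and unfair completion patterns can be illustrated with a toy arbitration model. This is a deliberate simplification of the real FC-AL fairness algorithm: each device needs a fixed number of loop tenancies to finish, the highest-address contender always wins arbitration, and under "fair" behavior a device that has won must sit out until every other contender has had a turn in the current window.

```python
# Toy model of FC-AL arbitration. Returns the tenancy count at which
# each device finishes its I/O. Without fairness the highest-priority
# (highest-address) device monopolizes the loop; with a fairness
# window, wins are spread one-per-window and finish times converge.
# This is an illustration, not the actual FC-AL algorithm.

def finish_times(n_devices, ios, fair):
    remaining = {d: ios for d in range(n_devices)}   # tenancies still needed
    window = set(remaining)                          # eligible this window
    t, done = 0, {}
    while remaining:
        contenders = (window & set(remaining)) if fair else set(remaining)
        if fair and not contenders:                  # window exhausted:
            window = set(remaining)                  # open a new one
            contenders = set(window)
        winner = max(contenders)                     # highest address wins
        t += 1
        remaining[winner] -= 1
        window.discard(winner)                       # one win per window
        if remaining[winner] == 0:
            del remaining[winner]
            done[winner] = t
    return done

print(finish_times(4, 3, fair=False))   # {3: 3, 2: 6, 1: 9, 0: 12}
print(finish_times(4, 3, fair=True))    # {3: 9, 2: 10, 1: 11, 0: 12}
```

Without fairness the finish times spread out exactly as in Graph 14 (priority order); with the fairness window all devices finish within a few tenancies of one another, matching the behavior described for Graph 13.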
Long loops show a significant drop in aggregate performance, particularly for write operations in this configuration, primarily because of the limited number of buffers available on the disk drive to receive incoming data. Viewed as a percentage of peak, transaction performance is affected significantly more than bandwidth performance for both reads and writes. In short, extended loop lengths (greater than 5 kilometers) begin to show appreciable performance degradation.
As the number of devices on the loop increases, the propagation delay through the devices introduces a small but noticeable performance degradation. The degradation is more noticeable in transaction rate than in bandwidth.
As loop congestion grows, the performance of each thread degrades uniformly, such that the average performance is the same over all threads. This is due in part to the loop access fairness mechanism. It should be noted that the access fairness algorithm does not guarantee equal performance among all devices on the loop; rather, it guarantees that each device will have an access window within which it can win arbitration for the loop and perform its function. The net effect, however, appears to be an even distribution of performance across all devices on a heavily congested loop.
The factor that contributes the most performance loss is the length of the loop. In terms of scale, the length of the loop can contribute 100 to 1000 times more propagation delay than the elasticity buffers in the devices themselves.
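The 100-to-1000-times scale claim can be sanity-checked with the same assumed 5 ns/m fiber propagation figure used earlier: one kilometer of cable adds roughly as much delay as about 22 nodes, and the full 30-kilometer loop adds several hundred times the per-node delay.

```python
# Ratio of cable propagation delay to per-node elasticity buffer delay.
# FIBER_NS_PER_M is an assumption (~2e8 m/s in fiber); NODE_DELAY_NS
# is the 226 ns per-node figure from the text.

FIBER_NS_PER_M = 5
NODE_DELAY_NS = 226

km_vs_node = 1000 * FIBER_NS_PER_M / NODE_DELAY_NS   # one km vs one node
full_loop_vs_node = 30 * km_vs_node                  # 30 km vs one node
print(round(km_vs_node), round(full_loop_vs_node))   # ~22, ~664
```

The full 30-kilometer loop contributes roughly 660 times the delay of a single node's elasticity buffer, squarely inside the 100-to-1000-times range stated above.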
Based on experiences from this project, researchers at the LCSE are working on ways to better gather and visualize performance data for large and complex storage area network configurations. This project alone generated nearly 10GB of compressed performance data much of which still needs to be analyzed. The LCSE is also currently working with Ancor Communications and Brocade Communication Systems evaluating their respective Fibre Channel switch products. This research is focused on what happens to the performance of large disk subsystems attached to these switches as the cross-sectional bandwidth and cross-sectional transaction rates are increased to the point of overwhelming the capability of the switch.
Graph 9. Effect of node propagation delay on read operations on a 50-meter loop. (Y-axis: read operation time in milliseconds.)
Graph 10. Effect of node propagation delay on read operations on a 30-kilometer loop.
Graph 11. Comparison of the effect of node propagation delay on read operations on 50-meter and 30-kilometer loops.
Graph 12. Comparison of the effect of node propagation delay on read operations as a function of loop length.
Graph 13. The effect of fair device behavior on the order of completion for read operations across 26 devices on a 30-kilometer loop.
Graph 14. The effect of unfair device behavior on the order of completion for read operations across 26 devices on a 30-kilometer loop.
Graph 15. The effect of fair device behavior on the bandwidth performance of read operations across 26 devices on a 30-kilometer loop.
Graph 16. The effect of unfair device behavior on the bandwidth performance of read operations across 26 devices on a 30-kilometer loop.
References
Alan F. Benner. Fibre Channel: Gigabit Communications and I/O for Computer Networks. McGraw-Hill Series on Computer Communications, 1996.
Gary R. Stephens and Jan V. Dedek. Fibre Channel: The Basics. Ancot Corporation.
Robert W. Kembel. The Fibre Channel Consultant: Arbitrated Loop. Connectivity Solutions.
This work was performed at the University of Minnesota Laboratory for Computational Science and Engineering with support from the National Science Foundation and the Department of Energy under grants NSF/ACI 96-19019, DE/B347714, and NSF/CDA-9502979. This work was also partially supported by the Minnesota Supercomputer Institute at the University of Minnesota. Other contributors and supporters include Seagate Technology, Inc., MTI, Vixel Corporation, AMP, Inc., Finisar Corporation, Ciprico, Inc. and Silicon Graphics, Inc.
Thomas M. Ruwart is assistant director of the Laboratory for Computational Science and Engineering at the University of Minnesota.