Part II — Transport protocol considerations for IP SANs

This article, the second in a two-part series, examines the pros and cons of using TCP vs. the User Datagram Protocol for SAN extension.

By Sandy Helton

Transporting storage traffic within the data center uses a combination of the SCSI storage protocol and the Fibre Channel network protocol. Transporting storage traffic across IP networks requires a corresponding assembly of protocols, matching the provisions of SCSI and Fibre Channel. While IP provides the network layer and SCSI provides the block-level storage protocol, the transport protocol can be provided by either TCP or the User Datagram Protocol (UDP). These transport protocols offer very different feature sets.

Part I of the series detailed the requirements for transporting Fibre Channel storage area network (SAN) traffic over managed IP networks (see InfoStor, April 2002, p. 30), while this article examines the pros and cons of using TCP versus UDP for SAN extension.

TCP is a robust, highly resilient transport protocol and, for the majority of networking applications, is the preferred transport protocol for communicating over IP networks. TCP was developed to operate in highly congested, chaotic network environments. It provides a reliable data-delivery interface that enables upper-level applications to assume that all data sent through the network will arrive at the destination without loss and in the same order sent.

In addition to reliable data delivery, the second major feature of TCP is congestion management, an essential capability for systems that must access the public Internet. Congestion is the most frequent cause of data loss on any over-subscribed network.

Congested networks can have packet-loss rates of one packet per thousand or higher. In these environments, congestion management dynamically adapts to data loss across the network: the transport protocol detects the high loss rate and reduces the amount of data being sent, and as conditions improve it raises the data rate to a level that the network can tolerate. This type of highly adaptive system is not necessary for managed IP networks, because these networks are under-subscribed and use Quality of Service (QoS) mechanisms to manage the type and amount of traffic carried across them.
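The back-off and recovery behavior described above can be sketched as an additive-increase, multiplicative-decrease (AIMD) loop, the general scheme behind TCP congestion avoidance. This is a simplified illustration rather than a real TCP implementation; the function name, the starting window, and the loss pattern are all assumptions chosen for the example.

```python
def aimd_step(window, loss_detected, increase=1.0, decrease=0.5, floor=1.0):
    """Return the next congestion window (in segments) after one round trip."""
    if loss_detected:
        # Multiplicative decrease: back off sharply when loss signals congestion.
        return max(floor, window * decrease)
    # Additive increase: probe for spare capacity one segment per round trip.
    return window + increase


window = 64.0  # assumed starting window, in segments
history = []
# Hypothetical loss pattern: one loss in round trip 3, none afterwards.
for rtt in range(10):
    window = aimd_step(window, loss_detected=(rtt == 3))
    history.append(window)

print(history)
```

A single loss halves the window, and six loss-free round trips later it has regained only a fraction of what it gave up. On a long link, where each round trip takes tens of milliseconds, that recovery is correspondingly slow.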

User Datagram Protocol
UDP is a streamlined, connectionless protocol capable of carrying traffic across IP networks at wire speed. Used extensively with applications such as voice-over-IP and streaming video, UDP offers an excellent transport for performance-critical, block-level storage traffic.

Absent the retransmission techniques that allow TCP to correct for errors, UDP must rely on the upper-layer protocol to ensure the integrity of data traveling across the network.

SAN Extension Performance Comparison
Comparing the overall performance of a SAN extension network using TCP versus UDP over a managed IP network yields some interesting results. Because TCP interprets any detected loss as a sign of congestion and reacts immediately by cutting its sending rate, throughput drops dramatically with each lost packet. TCP does recover its throughput, but slowly, which yields sluggish overall performance. UDP, however, does not react at all to a loss of data packets and thus keeps its throughput constant. A TCP-based network's performance is also distance-sensitive because of TCP's window-based flow-control mechanism, which limits the total amount of data that can be sent without receiving an acknowledgement from the receiver. Once the amount of data in flight over a round trip (the bandwidth-delay product) exceeds the window size, throughput falls in inverse proportion to the round-trip delay, and therefore to distance.
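The distance sensitivity of window-based flow control can be illustrated with a short calculation: a sender that may have at most one window of unacknowledged data outstanding can deliver no more than one window per round trip. The sketch below assumes a 64KB window (the classic TCP limit without window scaling) and a 1Gbps link; both values are illustrative assumptions, not figures from the simulation discussed later.

```python
def window_limited_throughput(link_bps, window_bytes, rtt_s):
    """Throughput ceiling in bit/s for a sender limited to one window per round trip."""
    return min(link_bps, window_bytes * 8 / rtt_s)


LINK_BPS = 1e9            # assumed 1Gbps link
WINDOW_BYTES = 64 * 1024  # assumed 64KB window, no window scaling

for rtt_ms in (0.1, 1, 10, 100):
    bps = window_limited_throughput(LINK_BPS, WINDOW_BYTES, rtt_ms / 1000)
    print(f"RTT {rtt_ms:>5} ms -> {bps / 1e6:8.1f} Mbps")
```

At very short round-trip times the link itself is the limit. Once the round trip is long enough that a full window no longer fills the pipe, each tenfold increase in delay cuts the ceiling by a factor of ten.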

The SCSI protocol is responsible for detecting data loss in a UDP-based SAN. It operates with a request/response command sequence and will retry its entire command sequence if it detects the loss of any part of that sequence. The SCSI layer uses a different recovery algorithm than TCP: it continues transmitting command sequences to its target devices while waiting for the completion of prior command sequences. Command sequence chains that are interrelated are held up until prior commands are completed, but frequently multiple commands are concurrently processed by both the SCSI initiators and their target devices.

Performance Examples
An analysis of the impact of errors on overall application performance, in terms of the maximum throughput achievable on a gigabit-per-second link, reveals some interesting conclusions. While actual throughput will vary with application activity, one can model the throughput ceiling that is imposed by the interaction of the application and networking protocols. Transmission errors in managed IP networks may be very rare, but they do occur in practice. In fact, a bit-error rate (BER) of 10^-12 implies, on average, an error every 1,000 seconds when transmitting at the rate of 1Gbps.
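The arithmetic behind that figure is simple enough to check directly: the expected number of bit errors per second is the bit-error rate multiplied by the line rate, and the mean time between errors is its reciprocal. The function name below is illustrative.

```python
def mean_seconds_between_bit_errors(ber, line_rate_bps):
    """Average time between bit errors for a given bit-error rate and line rate."""
    errors_per_second = ber * line_rate_bps
    return 1.0 / errors_per_second


# BER of 10^-12 on a 1Gbps link: roughly one error every 1,000 seconds.
print(mean_seconds_between_bit_errors(1e-12, 1e9))
```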

When TCP is used as the transport protocol, it will mask errors by using error-detection and retransmission mechanisms. However, there will be a significant loss of achievable throughput, as explained earlier.

Let's look at a remote mirroring application scenario involving data replication between a primary storage system and a remote (mirrored) storage system. The primary system is composed of a number of logical unit numbers (LUNs) that have peer LUNs in the remote system. Every time a write operation occurs to a primary LUN, the storage system issues a write to the corresponding remote LUN. The total number of LUNs depends on the size of a LUN and the total amount of storage under management. Large storage systems can have hundreds of LUNs. Each of these LUNs has independent I/O processes (IOPs) associated with it to perform read/write operations.

A UDP transport used between the primary and remote sites will not provide error correction when a transmission error corrupts a packet in a SCSI command sequence (IOP). The IOP will then time out waiting for the sequence to complete and will retry the operation to the remote disk. The total impact on application-level throughput will be determined by two factors: the total number of IOPs running between servers and disks (and correspondingly between primary and remote disks) and the time-out value for which a single process is blocked upon an error. When one of these IOPs is blocked waiting for a time-out, the other unrelated IOPs will continue uninterrupted. Time-out values vary in practice, but one second is a typical value used by SCSI systems for disk I/O. This is quite conservative, as a disk I/O will usually take less than 10ms. This would also be true in a remote mirroring scenario as the additional link latencies might range from a few ms to as much as 50ms—still well under one second.

The figure shows a simulation model of application performance. It depicts the maximum achievable throughput in the presence of errors using both UDP and TCP transports. The x-axis represents packet-error rate (PER), which can be computed from BER assuming typical packet sizes based on Ethernet (1,500-byte) frames.
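The conversion from BER to PER can be sketched as follows, assuming independent bit errors and treating a packet as lost if any one of its bits is corrupted. The 1,500-byte packet size comes from the text; the independence assumption and the function name are ours.

```python
import math


def packet_error_rate(ber, packet_bytes=1500):
    """Probability that at least one bit in a packet is corrupted."""
    bits = packet_bytes * 8
    # 1 - (1 - ber)**bits, written with log1p/expm1 to stay accurate for tiny BERs.
    return -math.expm1(bits * math.log1p(-ber))


for ber in (1e-12, 1e-10, 1e-9):
    print(f"BER {ber:.0e} -> PER {packet_error_rate(ber):.2e}")
```

For small error rates the result is very close to the BER multiplied by the number of bits per packet, so a BER of 10^-12 corresponds to a PER of roughly 10^-8 for 1,500-byte frames.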

Figure: Simulation model of application performance, showing the maximum achievable throughput in the presence of errors for UDP and TCP transports.

UDP performance is modeled over a wide range, from 32 to 256 LUNs (and corresponding concurrent IOPs). For each of these cases, the achievable throughput does not drop until the PER rises to about 10^-5, roughly three orders of magnitude worse than the 10^-8 values expected for metro Ethernet services.
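A rough back-of-the-envelope check suggests why many concurrent IOPs absorb errors so well. This is not the simulation behind the figure. It assumes a saturated 1Gbps link carrying 1,500-byte packets, that every lost packet blocks a different IOP for a full one-second time-out, and that the other IOPs keep running. Under those assumptions, the average number of blocked IOPs is the error arrival rate multiplied by the time-out (Little's law).

```python
def mean_blocked_iops(link_bps, packet_bytes, per, timeout_s):
    """Average number of IOPs stalled in a time-out at any instant."""
    packets_per_second = link_bps / (packet_bytes * 8)
    errors_per_second = packets_per_second * per
    # Little's law: mean number blocked = error arrival rate x time spent blocked.
    return errors_per_second * timeout_s


blocked = mean_blocked_iops(link_bps=1e9, packet_bytes=1500, per=1e-5, timeout_s=1.0)
for luns in (32, 256):
    print(f"{luns} IOPs: {blocked:.2f} blocked on average ({blocked / luns:.1%})")
```

Even at a PER of 10^-5, fewer than one IOP is stalled on average, which is a small share of 32 concurrent processes and a negligible share of 256.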

The figure also shows the performance expected for TCP. Note that this assumes an ideal implementation of TCP capable of running at 1Gbps. In practice, however, TCP implementations today seldom achieve throughput higher than 100Mbps to 200Mbps.

TCP performance is computed using a commonly accepted model ("The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm," by Matthew Mathis et al., Computer Communication Review, ACM SIGCOMM, vol. 27, no. 3, July 1997). The primary variable here is the link latency, or distance; the simulation considers links of 100km, 1,000km, and 10,000km. The corresponding graphs show that TCP throughput begins to degrade at a PER as low as 10^-8 in the case of the longest link, and at a PER of 10^-4 for a 100km link. The throughput achievable with an ideal TCP implementation on a 1,000km link at a PER of 10^-5 is only 3x10^8 bits per second, or 300Mbps, significantly below the 1Gbps achieved by any of the UDP implementations.
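The Mathis model bounds TCP throughput at roughly (MSS / RTT) x C / sqrt(p), where p is the packet-loss probability and C is a constant of order one. The sketch below shows the shape of that calculation. The parameters used for the figure are not stated, so the values here are assumptions: C = sqrt(3/2), a 1,460-byte segment, a round-trip time made up of fiber propagation delay only (about 5 microseconds per km), and a 1Gbps link cap.

```python
import math

FIBER_KM_PER_S = 200_000.0  # assumed propagation speed in fiber (~5 microseconds per km)


def mathis_throughput_bps(distance_km, per, mss_bytes=1460,
                          c=math.sqrt(1.5), link_bps=1e9):
    """TCP throughput ceiling from the Mathis et al. formula, capped at the link rate."""
    rtt_s = 2 * distance_km / FIBER_KM_PER_S  # propagation delay only (assumption)
    ceiling = (mss_bytes * 8 / rtt_s) * c / math.sqrt(per)
    return min(link_bps, ceiling)


for km in (100, 1_000, 10_000):
    for per in (1e-8, 1e-5, 1e-4):
        mbps = mathis_throughput_bps(km, per) / 1e6
        print(f"{km:>6} km, PER {per:.0e}: {mbps:8.1f} Mbps")
```

With these assumed parameters, the 1,000km case at a PER of 10^-5 lands in the same range as the article's 300Mbps, though not at exactly the same value. A different constant C or additional equipment delay would shift the result.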

Managed IP networks provide significantly higher service levels than the public Internet, and these improved attributes can be leveraged by UDP-based SAN extension to provide higher performance levels than those achievable with a TCP-based system. TCP has been shown to achieve better performance than UDP in environments with high packet loss. However, managed IP networks have a packet-loss rate comparable to that of Fibre Channel networks and provide the reliability and quality necessary for mission-critical data storage applications. Over a range of typical application scenarios and error rates, UDP provides excellent application performance that is equal to or better than TCP performance. When such a managed network is used for SAN extension, there is no need for extra error-protection mechanisms, beyond those that are normally employed on the SAN.

Sandy Helton is executive vice president and chief technology officer at SAN Valley Systems (www.sanvalley.com) in Campbell, CA.

This article was originally published on May 01, 2002