High-performance SANs: Does TCP 'stack' up?

Before users can take advantage of IP storage, a number of technical hurdles need to be cleared.

By Sudhir Srinivasan

As IP storage protocols approach standardization, the trend toward IP-based transport for storage area network (SAN) applications is gaining momentum. The ubiquity and attractive economics of IP networks, including the Internet, as well as the nascent 10Gbps Ethernet standard, are driving significant interest and investment in SAN solutions that leverage the advantages of IP.

Responding to this trend, the Internet Engineering Task Force (IETF) is defining new standards for IP storage that leverage existing Fibre Channel and SCSI standards, while tapping the interconnectivity advantages of IP. The standards include:

  • iSCSI—a native IP-based transport service for SCSI transactions;
  • FCIP—a protocol to extend SANs beyond the data center and link SAN islands by tunneling Fibre Channel traffic through an IP network; and
  • iFCP—a gateway-to-gateway protocol that goes beyond FCIP in its IP implementation, linking Fibre Channel SANs using elements of the TCP protocol to manage data transport.

Each of these standards has its target applications. FCIP and iFCP offer the advantage of preserving existing Fibre Channel investment, while iSCSI's pure IP implementation is attractive to users looking to deploy SANs by directly leveraging existing IP infrastructure. (For more information on the three IP storage standards, see "Going the distance with IP storage over WANs," March 2002, p. 23.)

Choosing a transport protocol
IP networks will provide the "pipe" for these next-generation SANs, but how does one achieve reliability comparable to today's SANs? The physical networks underlying IP SANs are likely to be more spread out than traditional Fibre Channel deployments have been, extending beyond the data center to the campus and metro areas and even to the WAN. Operating storage applications over such distributed, heterogeneous, and potentially congested networks requires a transport protocol with strong support for congestion control and reliable delivery. TCP is that protocol.

TCP is a transport protocol (Layer 4) in the OSI stack. In the Fibre Channel protocol stack, the FC-2 layer is the rough equivalent of the network and transport layers.

TCP vs. Fibre Channel
How does TCP stack up against Fibre Channel as a transport protocol? There are some fundamental differences that reflect the different classes of networks for which these two protocols were designed.

TCP is a transport protocol (Layer 4) in the OSI stack model (see figure). The Fibre Channel standard has its own protocol stack as well; in this model, a single layer, FC-2, is the rough equivalent of both the network and transport layers. While FC-2 supports various classes of traffic, including reliable service (Class 1), the most widely deployed service by far is Class 3 (unreliable delivery), which is closer to UDP in functionality. So not only is the Fibre Channel stack inherently "thinner" than the TCP/IP stack, but the overhead incurred in the FC-2 layer is further reduced by the use of Class 3 unreliable service.

TCP is a "byte-stream" protocol, transporting command and data information in a continuous stream of bytes. The TCP payload in Gigabit Ethernet packets coming off the wire must first be reconstructed into the byte stream before the content can be parsed to extract iSCSI protocol data units (PDUs). This re-assembly process typically requires an intermediate buffering area; once the byte stream is reconstructed in this intermediate buffer, the contents (e.g., application data) can then be copied out to the final locations in application memory. Fibre Channel, on the other hand, is a datagram protocol—a frame coming off the wire contains enough information to identify the information unit (IU) that the frame belongs to (an IU being the Fibre Channel equivalent of an iSCSI PDU). Thus, a Fibre Channel device can simply look at the start of a frame and quickly determine what needs to be done with the frame. If the IU is carrying application data, the Fibre Channel device can determine the exact location in application memory where the data needs to be sent; no intermediate buffers are necessary.
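The re-assembly step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each PDU is framed with a hypothetical 4-byte big-endian length prefix (not the real iSCSI header layout), segments are assumed to arrive in order, and the names `StreamReassembler` and `frame` are invented for the example.

```python
import struct

HEADER_LEN = 4  # hypothetical length prefix, NOT the real iSCSI header


def frame(payload):
    """Build one illustrative PDU: 4-byte big-endian length, then payload."""
    return struct.pack(">I", len(payload)) + payload


class StreamReassembler:
    """Accumulates in-order TCP payload bytes and extracts complete PDUs."""

    def __init__(self):
        self._buf = bytearray()  # the intermediate re-assembly buffer

    def feed(self, segment_payload):
        """Append one segment's payload; return every PDU that is now complete."""
        self._buf.extend(segment_payload)
        pdus = []
        while len(self._buf) >= HEADER_LEN:
            (length,) = struct.unpack_from(">I", self._buf, 0)
            end = HEADER_LEN + length
            if len(self._buf) < end:
                break  # PDU straddles a segment boundary; wait for more bytes
            # Second copy: out of the intermediate buffer, into "application memory".
            pdus.append(bytes(self._buf[HEADER_LEN:end]))
            del self._buf[:end]
        return pdus


if __name__ == "__main__":
    stream = frame(b"READ") + frame(b"WRITE-DATA")
    r = StreamReassembler()
    # Segment boundaries need not line up with PDU boundaries.
    for piece in (stream[:3], stream[3:9], stream[9:]):
        print(len(piece), "bytes in ->", r.feed(piece))
```

The point of the sketch is that the receiver cannot tell where a PDU starts by looking at a single segment; it has to buffer and parse the stream, which is the extra copy a datagram-style protocol avoids.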

A key difference between TCP and Fibre Channel lies in the way they implement flow (congestion) control. The flow control in Class 3 service of Fibre Channel, the predominant service in use, operates at each link using a simple buffer credit scheme to provide a generic push-back on a link when the downstream end of the link is saturated. The congestion control is thus neither end-to-end nor capable of differentiating among the various flows moving through the link. In contrast, TCP provides end-to-end congestion control on a per-flow basis through the use of various windows (transmit, receive, and congestion windows). Together, these windows limit how much unacknowledged data the originator of a flow may have outstanding, so the sender throttles its own transmission when the network or the receiver is congested.
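The contrast between the two schemes can be sketched as follows. This is a simplified model, not an implementation of either standard: the class names, the 1,460-byte segment size, and the initial threshold are assumptions, retransmission is not modeled, and the window updates only follow the general shape of TCP's slow-start and congestion-avoidance rules.

```python
MSS = 1460  # assumed maximum segment size, in bytes


class FcLink:
    """Per-link buffer-credit flow control (Class 3 style, simplified)."""

    def __init__(self, credits):
        self.credits = credits  # receive buffers advertised by the next hop

    def try_send_frame(self):
        if self.credits == 0:
            return False  # push-back: every flow sharing this link stalls
        self.credits -= 1
        return True

    def receiver_ready(self):
        self.credits += 1  # the downstream port freed one buffer


class TcpFlow:
    """Per-connection, end-to-end window accounting (simplified)."""

    def __init__(self, rwnd, ssthresh=64 * 1024):
        self.cwnd = MSS           # congestion window, starts at one segment
        self.ssthresh = ssthresh  # slow-start threshold
        self.rwnd = rwnd          # receive window advertised by the peer
        self.in_flight = 0        # bytes sent but not yet acknowledged

    def sendable_bytes(self):
        # The sender may never exceed the smaller of the two windows.
        return max(0, min(self.cwnd, self.rwnd) - self.in_flight)

    def on_send(self, nbytes):
        self.in_flight += nbytes

    def on_ack(self, acked_bytes):
        self.in_flight -= acked_bytes
        if self.cwnd < self.ssthresh:
            self.cwnd += MSS  # slow start: grow by one segment per ACK
        else:
            # Congestion avoidance: roughly one segment per round trip.
            self.cwnd += max(1, MSS * MSS // self.cwnd)

    def on_timeout(self):
        # Loss is read as congestion: halve the threshold, restart from one segment.
        self.ssthresh = max(self.in_flight // 2, 2 * MSS)
        self.cwnd = MSS
```

Note where the state lives: `FcLink` keeps one counter per link, shared by every flow crossing it, whereas `TcpFlow` keeps a set of windows per connection at the end points.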

In summary, the design of the Fibre Channel protocol suite allows for efficient hardware implementation and, consequently, high performance. The question now is whether TCP, with all its complexity, can benefit equally from hardware acceleration to deliver comparable performance.

The TCP challenge
While TCP provides a reliable byte-stream service with features that protect against packet loss, it also presents designers with two critical challenges: stack overhead and direct data placement.

Stack overhead— The sophisticated flow control and error recovery services offered by TCP entail a significant amount of protocol message processing (the "stack overhead"), including:

  • Copying TCP segments into and out of system buffers;
  • Calculating TCP checksums across each data segment/packet;
  • Processing acknowledgments (incoming and outgoing);
  • Detecting out-of-order and lost packets and trimming overlapping segments;
  • Enabling/disabling of retransmission timers and generating/processing retransmissions;
  • Pacing data transmission to stay within permissible windows;
  • Estimating round-trip times; and
  • Updating congestion windows and slow start thresholds.
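Two of the tasks in this list, checksum calculation and round-trip-time estimation, are small enough to sketch. The code below is illustrative rather than a description of any particular stack: the checksum is the standard 16-bit one's-complement Internet checksum but omits the pseudo-header that a real TCP checksum also covers, and the estimator uses the commonly cited smoothing gains of 1/8 and 1/4 while ignoring clock granularity and minimum-timeout rules.

```python
def internet_checksum(data):
    """16-bit one's-complement sum over the data (pseudo-header omitted)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


class RttEstimator:
    """Smoothed round-trip time and retransmission timeout (simplified)."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variation

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def update(self, sample):
        """Fold in one RTT sample (seconds); return the new timeout value."""
        if self.srtt is None:
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        return self.srtt + 4 * self.rttvar
```

Even in this reduced form, the checksum touches every byte of every segment, and the estimator runs on every acknowledged sample, which is why these tasks weigh on the host CPU at high line rates.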

Processing this protocol stack places significant demands on the host CPU, creating a bottleneck that limits not only the performance of TCP itself, but also the overall performance of the application running on top of it. As noted in a recent Gartner Research Brief, the accepted rule of thumb is that each bit per second of line capacity requires about 1Hz of CPU horsepower to run TCP in software. This means a 1Gbps link will consume an entire 1GHz CPU, leaving no cycles to spare for actual application work. By the same rule, a 10Gbps link would demand roughly 10GHz of CPU capacity for TCP stack processing alone.
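The rule of thumb is easy to turn into arithmetic. The sketch below assumes the rule scales linearly with line rate; the function names and the 1,500-byte frame size are chosen for illustration and do not come from the Gartner brief.

```python
HZ_PER_BPS = 1.0  # rule of thumb from the text: about 1 Hz per bit per second


def cpu_hz_for_tcp(link_bps):
    """CPU clock rate consumed by software TCP at a given line rate."""
    return link_bps * HZ_PER_BPS


def cycles_per_frame(frame_bytes):
    """1 Hz per bit/s is one cycle per bit, so eight cycles per byte."""
    return frame_bytes * 8 * HZ_PER_BPS


def cpus_consumed(link_bps, cpu_hz):
    """How many CPUs of a given clock rate the TCP processing would occupy."""
    return cpu_hz_for_tcp(link_bps) / cpu_hz


if __name__ == "__main__":
    for gbps in (1, 2, 10):
        share = cpus_consumed(gbps * 1e9, 1e9)
        print(f"{gbps} Gbps on a 1 GHz CPU: {share:.0f}x of one CPU")
    print("cycles available per 1,500-byte frame:", cycles_per_frame(1500))
```

Read the other way, the rule gives a budget of about 12,000 cycles for each 1,500-byte frame, and all of the per-segment work listed earlier has to fit inside it.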

Direct data placement (DDP)— Another challenge is supporting DDP with TCP. DDP enables data coming into the system to be placed directly into application buffers. Given TCP's byte-stream semantic, however, identification of frame boundaries requires a dedicated data buffer. This is in contrast to Fibre Channel, which supports DDP through the automatic framing of its datagram semantic. However, it is important to note that the lack of support for DDP in TCP primarily impacts the cost of a solution rather than its performance. Assuming a sufficient amount of re-assembly buffer is available to tolerate normal operating conditions (packet loss), the only effect on performance of this re-assembly step is a very small additional latency. Work is currently in progress at research and standards organizations to explore the addition of framing support to TCP and/or to develop TCP-like protocols with built-in framing.

Offloading the stack
Fibre Channel SANs deliver very high levels of performance primarily because the Fibre Channel protocol stack has been implemented completely in silicon. Fibre Channel controller chips implement the Fibre Channel stack all the way from the FC-0 layer through portions of the FC-4 layer in silicon. For TCP to deliver performance comparable to Fibre Channel, it would make sense that a comparable level of "siliconization" is required.

Offloading part or all of the TCP stack from the host CPU frees up cycles for data handling, effectively increasing throughput and potentially decreasing latency. This may not sound too difficult, until you consider the complexity of the TCP protocol—a typical TCP/IP protocol stack contains tens of thousands of lines of code. To meet the daunting challenge of dealing with this stack, adapter board vendors have pursued two approaches: partial offload and full offload.

Partial offload— This approach splits the TCP protocol stack, handling only the most straightforward data-transfer processing in hardware and leaving the rest of the stack, such as error handling and connection management, in software on the host processor. Offloaded functions typically include checksum calculations, acknowledgment processing, and simple segmentation and re-assembly of in-order segments. The host CPU is still required to run a software TCP stack that implements the rest of the functions (e.g., gap management and retransmissions). Such partial offload is available today in high-end Gigabit Ethernet chipsets and network interface cards (NICs), as well as in emerging special-purpose TCP offload adapters using custom-built ASICs.
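The split between hardware and host can be pictured as a simple dispatch rule. The sketch below is a toy model under stated assumptions, not any vendor's design: `Segment` and `PartialOffloadNic` are hypothetical names, the "hardware" handles only in-order segments with a good checksum, and what the host software does with a punted segment is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int                  # sequence number of the first payload byte
    payload: bytes
    checksum_ok: bool = True  # result of the (offloaded) checksum check


class PartialOffloadNic:
    """Fast path in 'hardware' for in-order data; everything else goes to the host."""

    def __init__(self, initial_seq=0):
        self.next_seq = initial_seq
        self.delivered = bytearray()  # stands in for host application memory
        self.punted = []              # segments handed to the host software stack
        self.total = 0

    def receive(self, seg):
        self.total += 1
        if seg.checksum_ok and seg.seq == self.next_seq:
            # In-order segment: handled entirely on the adapter.
            self.delivered.extend(seg.payload)
            self.next_seq += len(seg.payload)
            return "fast"
        # Gap, duplicate, or corrupt segment: the host CPU has to deal with it.
        self.punted.append(seg)
        return "slow"

    def host_share(self):
        """Fraction of received segments that needed the host software stack."""
        return len(self.punted) / self.total if self.total else 0.0
```

On a clean network nearly every segment takes the fast path; as loss and reordering increase, `host_share()` rises and the host CPU is back in the data path.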

While offloading some basic tasks does reduce the burden of TCP protocol processing on the host CPU, partial offload has two limitations:

  • It breaks down when its fundamental premise is invalid (i.e., when network conditions are not ideal); and
  • It is limited in scalability. While it may prove satisfactory in 1Gbps networks with a limited number of connections, increased link speeds and traffic will force the host CPU to devote more cycles to the portion of protocol processing performed by software.

Full offload— As the name suggests, this approach offloads all of the "fast-path" TCP protocol processing from the host CPU. This includes the full range of TCP data manipulation functions, such as data buffering and checksumming, as well as TCP transfer control operations, such as acknowledgment processing, retransmission, gap management, certain timer functions, and the congestion avoidance and flow control algorithms. Protocol processing functions outside the data-transfer phase, such as address resolution, connection establishment/termination, exception handling, and polling for idle connection timeouts, can also be offloaded from the host if necessary. (For block storage applications, these functions are invoked primarily during device setup and almost never thereafter, and thus may not justify offloading.)

The advantage to the full-offload approach is scalability. By completely eliminating host processor intervention for TCP stack processing and offloading those aspects of TCP that are invoked under "lossy" network conditions, this approach enables solutions at multi-gigabit line rates under real-world network conditions.

It is worth noting that, due to the early stage of this technological development, some vendors have chosen to use one or more general-purpose processors for the dedicated task of running the TCP/IP stack. While this does offload the main processor, it adds significant cost to the solution. To scale this approach to multi-gigabit line rates requires the use of multiple processors, which entails significant software complexity and system cost. Furthermore, it is not entirely clear that the performance of such a system will scale very well.

Achieving performance parity with Fibre Channel may demand full offload of the TCP stack to silicon state machines. A pure silicon approach uses pipelined hardware structures for protocol processing through the entire stack (Gigabit Ethernet, IP, TCP, iSCSI, and FCP), enabling wire-speed, full-duplex operation across the entire range of workloads (e.g., block I/O sizes) and lossy network conditions at 1, 2, and 10Gbps.

For the next generation of SANs, IP is a reality—and a driving force for new storage networking applications. However, making the right choices now about how to address the challenge of TCP protocol processing will be a key factor in the eventual success of this new generation of products.

The recent demonstration of a 1Gbps wire-speed solution using iSCSI products in a laboratory setting is very encouraging but represents only the first steps for this fledgling industry. Leveraging the advantages of silicon-based full-offload architectures will be instrumental in creating IP SANs that "stack up" to the real-world challenges of tomorrow's multi-gigabit storage networks.

Sudhir Srinivasan, Ph.D., is a senior architect at Trebia Networks (www.trebia.com) in Acton, MA, and a contributor to the IETF's IP Storage Working Group. He can be contacted at Sudhir.Srinivasan@trebia.com.

This article was originally published on May 01, 2002