Back to the basics: Improve TCP/IP performance in memory-constrained embedded apps - Embedded.com

Developers often believe that because a communication protocol stack is called a TCP/IP stack, porting it to an embedded target provides the target with all TCP/IP functionality and performance. This is far from true.

A TCP/IP stack requires resources such as sockets and buffers to achieve its goal. These resources, however, consume RAM, a scarce resource on an embedded target. Deprived of sufficient resources, a TCP/IP stack will not work better than an RS-232 connection.

When performance is not an issue and the primary requirements are connectivity and functionality, implementing TCP/IP on a target with scarce resources (RAM and CPU) is a viable option.

Today, however, when an Ethernet port is available on a device, expectations are that performance will be on the order of megabits per second. This is achievable on small embedded devices, but only if certain design rules are observed. The goal of this article is to guide the reader through some of the design rules that need to be applied when porting a TCP/IP stack to an embedded device.

Network buffers
A TCP/IP stack places received packets in network buffers to be processed by the upper protocol layers and also places data to send in network buffers for transmission. Network buffers are data structures defined in RAM.

The data portion of the network buffer contains the application data and protocol headers. Figure 1 below illustrates how application data is encapsulated by the various layers of the IP protocol family to create the Layer 2 frame held in the data portion of the network buffer.

Figure 1: Encapsulation process

A buffer contains a header portion used by the protocol stack. This header provides information regarding the contents of the buffer. The data portion contains data that has either been received by the Network Interface Card (NIC) and thus will be processed by the stack, or data that is destined for transmission by the NIC.

Figure 2: Network buffer
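The header/data split described above can be sketched in C. This is only an illustration: the type names, the fields in the header and the 1518-byte data area are assumptions made for the example, not the layout used by any particular stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: the data area holds one maximum-size Ethernet frame. */
#define NET_BUF_DATA_SIZE 1518u

struct net_buf;

/* Hypothetical header: bookkeeping the stack keeps about the buffer contents. */
typedef struct net_buf_hdr {
    struct net_buf *next;   /* link for a free list or a queue */
    uint16_t data_len;      /* number of valid bytes in data[] */
    uint16_t data_offset;   /* where the current layer's payload starts */
    uint8_t  in_use;        /* nonzero while owned by the stack or the NIC */
} net_buf_hdr_t;

/* A network buffer: header portion followed by the data portion. */
typedef struct net_buf {
    net_buf_hdr_t hdr;
    uint8_t data[NET_BUF_DATA_SIZE];
} net_buf_t;

/* Copy a received frame into the buffer.
 * Returns 0 on success, -1 if the frame does not fit. */
int net_buf_store_frame(net_buf_t *buf, const uint8_t *frame, size_t len)
{
    assert(buf != NULL && frame != NULL);
    if (len > NET_BUF_DATA_SIZE) {
        return -1;
    }
    memcpy(buf->data, frame, len);
    buf->hdr.next = NULL;
    buf->hdr.data_len = (uint16_t)len;
    buf->hdr.data_offset = 0;
    buf->hdr.in_use = 1;
    return 0;
}
```

A real stack would typically add fields such as the owning interface, timestamps or protocol-specific offsets to the header, which is where the extra bytes beyond the frame size go.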

The maximum network buffer size is determined by the maximum size of the data that can be transported by the networking technology used. Today, Ethernet is the ubiquitous networking technology used for Local Area Networks (LANs).

Originally, Ethernet standards defined the maximum frame size as 1518 bytes. Removing the Ethernet, IP and TCP encapsulation data (18 bytes of Ethernet header and frame check sequence, 20 bytes of IP header and 20 bytes of TCP header, assuming no options), this leaves a maximum of 1460 bytes for the TCP segment data. A segment is the data structure used to encapsulate TCP data. Carrying an Ethernet frame in one of the TCP/IP stack network buffers requires network buffers of approximately 1600 bytes each. The difference between the Ethernet maximum frame size and the network buffer size is the space required for the network buffer metadata.
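The 1460-byte figure follows from simple subtraction, shown here as a small C sketch. The header sizes assume an untagged Ethernet frame and IPv4 and TCP headers without options; the constant and function names are made up for the example.

```c
#include <assert.h>

enum {
    ETH_MAX_FRAME = 1518, /* maximum Ethernet frame size, from the text */
    ETH_HEADER    = 14,   /* destination MAC + source MAC + EtherType */
    ETH_FCS       = 4,    /* frame check sequence (CRC) */
    IPV4_HEADER   = 20,   /* IPv4 header without options */
    TCP_HEADER    = 20    /* TCP header without options */
};

/* Bytes of TCP payload that fit in one Ethernet frame of the given size. */
int tcp_payload_per_frame(int frame_size)
{
    assert(frame_size >= ETH_HEADER + ETH_FCS + IPV4_HEADER + TCP_HEADER);
    return frame_size - ETH_HEADER - ETH_FCS - IPV4_HEADER - TCP_HEADER;
}
```

Calling `tcp_payload_per_frame(ETH_MAX_FRAME)` gives 1460, the usual Ethernet MSS. IP or TCP options would reduce the payload further.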

It is possible to use smaller network buffers. For example, if the application is not streaming multimedia data but rather transferring small sensor data periodically, network buffers smaller than the maximum allowed are sufficient.

TCP segment size is negotiated between the two devices that are establishing a logical connection. It is known as the Maximum Segment Size (MSS). An embedded system can take advantage of this protocol capability by advertising a smaller MSS, so that smaller network buffers suffice. On an embedded target with 32K RAM, when you account for all the middleware RAM usage, there is not much left for network buffers!

Network operations
Many networking operations affect system performance. For example, network buffers are not released as soon as their task is completed. Within the TCP acknowledgment process, a TCP segment is kept until its reception is acknowledged by the receiving device. If it is not acknowledged within a certain timeframe, the segment is retransmitted and kept again.

If a system has a limited number of network buffers, network congestion (packets being dropped) will affect the usage of these buffers and the total system performance. When all the network buffers are assigned to packets (being transmitted, retransmitted or acknowledging received packets), the TCP/IP stack will slow down while it waits for available resources before resuming a specific function.
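To make the exhaustion scenario concrete, here is a minimal sketch of a fixed-size buffer pool in C. The pool size, the buffer size and all names are assumptions chosen for the example, and the code is not interrupt-safe; a real stack would protect the free list with a lock or by disabling interrupts.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4       /* assumption: a deliberately small pool */
#define POOL_BUF_SIZE 1600 /* approximate buffer size from the text */

typedef struct pool_buf {
    struct pool_buf *next;
    unsigned char data[POOL_BUF_SIZE];
} pool_buf_t;

typedef struct {
    pool_buf_t bufs[POOL_SIZE];
    pool_buf_t *free_list;
    unsigned free_count;
} buf_pool_t;

/* Put every buffer on the free list. */
void pool_init(buf_pool_t *p)
{
    assert(p != NULL);
    p->free_list = NULL;
    for (int i = 0; i < POOL_SIZE; i++) {
        p->bufs[i].next = p->free_list;
        p->free_list = &p->bufs[i];
    }
    p->free_count = POOL_SIZE;
}

/* Take a buffer, or return NULL when the pool is exhausted.
 * On NULL the caller has to wait, retry later or drop the packet. */
pool_buf_t *pool_get(buf_pool_t *p)
{
    pool_buf_t *b = p->free_list;
    if (b == NULL) {
        return NULL;
    }
    p->free_list = b->next;
    p->free_count--;
    return b;
}

/* Return a buffer, e.g. once a transmitted segment has been acknowledged. */
void pool_put(buf_pool_t *p, pool_buf_t *b)
{
    assert(p != NULL && b != NULL);
    b->next = p->free_list;
    p->free_list = b;
    p->free_count++;
}
```

With only four buffers, a fifth `pool_get()` call returns NULL until an earlier buffer is released, which is the point at which the stack stalls.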

The advantage of defining smaller network buffers is that more buffers exist that allow TCP (and UDP) to have more protocol exchanges between the two devices. This is ideal for applications where the information exchanged can be in smaller packets such as a data logging device sending periodic sensor data.

A disadvantage is that each packet carries less data. For streaming applications, this is less than desirable. HTTP, FTP and other such protocols will not perform well with this configuration model.

Ultimately, if there is insufficient RAM to define a few network buffers, the TCP/IP stack will crawl.

TCP Performance
Windowing. TCP has a flow control mechanism called Windowing that is used for Transmit and Receive. A field in the TCP header is used for the Windowing mechanism so that:

1) This Window field indicates the quantity of information (in terms of bytes) that the recipient is able to accept. This enables TCP to control the flow of data.

2) Data receiving capacity is related to memory and to the hardware's processing capacity (network buffers).

3) The maximum size of the window is 65,535 bytes (a 16-bit field).

4) A value of 0 (zero) halts the transmission.

5) The source host sends a series of bytes to the destination host.

Figure 3: TCP Windowing

Several important things to note in Figure 3 above, illustrating TCP Windowing, include:

1) Bytes 1 through 512 have been transmitted (and pushed to the application using the TCP PSH flag) and have been acknowledged by the destination host.

2) The window is 2,048 bytes long.

3) Bytes 513 through 1,536 have been transmitted but have not been acknowledged.

4) Bytes 1,537 through 2,560 can be transmitted immediately.

5) Once an acknowledgement is received for bytes 513 through 1,536, the window will move 1,024 bytes to the right, and bytes 2,561 through 3,584 may then be sent.
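The window arithmetic in the list above can be reproduced with a few lines of C. This is a simplified sketch that uses the byte numbers from the figure directly; it ignores the 32-bit sequence number wraparound that a real TCP implementation must handle, and the structure and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t first_unacked; /* lowest byte sent but not yet acknowledged */
    uint32_t next_to_send;  /* next byte that has not been sent yet */
    uint32_t window;        /* window size in bytes */
} tx_window_t;

/* Bytes that may be sent now without waiting for an acknowledgement. */
uint32_t tx_window_usable(const tx_window_t *w)
{
    uint32_t in_flight = w->next_to_send - w->first_unacked;
    return (in_flight >= w->window) ? 0 : w->window - in_flight;
}

/* Highest byte number the window currently allows the sender to transmit. */
uint32_t tx_window_last_sendable(const tx_window_t *w)
{
    return w->first_unacked + w->window - 1;
}

/* Process an acknowledgement covering every byte up to acked_through. */
void tx_window_on_ack(tx_window_t *w, uint32_t acked_through)
{
    assert(acked_through + 1 >= w->first_unacked);
    assert(acked_through < w->next_to_send);
    w->first_unacked = acked_through + 1; /* window slides to the right */
}
```

Starting from the figure's state (`first_unacked = 513`, `next_to_send = 1537`, `window = 2048`), 1,024 bytes are usable and the last sendable byte is 2,560. After `tx_window_on_ack(&w, 1536)` the last sendable byte becomes 3,584, matching item 5.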

On an embedded device, the window size should be configured in terms of the network buffers available. For example:

With an embedded device that has 8 network buffers and an MSS of 1460, let's reserve 4 buffers for transmission and 4 buffers for reception. Transmit and receive window sizes will each be 4 times 1460 (4 * 1460 = 5840 bytes).

On every packet received, TCP decreases the Receive Window Size by 1460 and advertises the newly calculated Receive Window Size to the transmitting device. Once the stack has processed the packet, the Receive Window Size will be increased by 1460, the network buffer will be released and the Receive Window Size will be advertised with the next packet transmitted.

Typically, the network can transport packets faster than the embedded target can process them. If the receiving device has received 4 packets without being able to process them, the Receive Window Size will be decreased to zero.

A zero Receive Window Size advertised to the transmitting device tells that device to stop transmitting until the receiving device is able to process and free at least one network buffer.
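The receive-side accounting just described can be sketched as follows. The sketch assumes the 4-buffer, 1460-byte example from the text and that every received segment is a full MSS; the names are hypothetical and do not correspond to any particular stack's API.

```c
#include <assert.h>
#include <stdint.h>

#define RX_MSS     1460u /* MSS from the example in the text */
#define RX_BUFFERS 4u    /* buffers reserved for reception in the example */

typedef struct {
    uint32_t window; /* Receive Window Size currently advertised, in bytes */
} rx_window_t;

/* Start with every receive buffer free. */
void rx_window_init(rx_window_t *rx)
{
    rx->window = RX_BUFFERS * RX_MSS;
}

/* A full-size segment arrived and now occupies one network buffer.
 * Returns the window to advertise to the transmitting device. */
uint32_t rx_window_on_receive(rx_window_t *rx)
{
    assert(rx->window >= RX_MSS);
    rx->window -= RX_MSS;
    return rx->window;
}

/* The stack processed a segment and released its network buffer.
 * Returns the window to advertise with the next packet transmitted. */
uint32_t rx_window_on_processed(rx_window_t *rx)
{
    assert(rx->window + RX_MSS <= RX_BUFFERS * RX_MSS);
    rx->window += RX_MSS;
    return rx->window;
}
```

Four calls to `rx_window_on_receive()` with no processing in between bring the advertised window from 5840 down to 0, at which point the sender must stop. One call to `rx_window_on_processed()` reopens it to 1460.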

On the transmit side, the stack will stop if network buffers are not available. Depending on how the stack is designed and configured, the transmitting function will retry, time out or exit (blocking or non-blocking sockets).

UDP does not have such a mechanism. If there are insufficient network buffers to receive the transmitted data, packets are dropped. The application needs to handle these situations.

TCP connection bandwidth product
The number of TCP segments being received/transmitted by a host has an approximate upper bound equal to the TCP window sizes (in packets) multiplied by the number of TCP connections:

Total TCP packets ~= Number of TCP connections * TCP connection window size (in packets)

This is known as the TCP connection bandwidth product.

The number of internal NIC packet buffers/channels limits the target host's overall packet bandwidth. Because most targets are slower consumers, data sent to the target by a faster producer will consume most or all NIC packet buffers/channels, causing some packets to be dropped. However, even when the throughput is exceptionally low, TCP connections should still be able to transfer data via retransmission.

Windowing with multiple sockets
The given Windowing example assumes that the embedded device has one socket (one logical connection) with a foreign host. Imagine a system where multiple parallel connections are required. The discussion above can be applied to each socket.

With proper application code, each connection's throughput is a fraction of the total connection bandwidth. This means that the Window size configured in the TCP/IP stack needs to take into consideration the maximum number of sockets running at any point in time.

Using the same example with 5 sockets and providing a Receive Window size of 5840 bytes to every socket, 20 network buffers (4 buffers per Window * 5 sockets) will have to be configured. Assuming that the largest network buffers possible (about 1600 bytes) are used, this means about 32K of RAM for network buffers (20 * 1600) is required; otherwise, the system will slow down due to excessive retransmission patterns.
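The buffer and RAM budget in this example can be checked with a short C sketch. The function names are made up for illustration, and the 1600-byte buffer size is the approximate figure used in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Receive buffers needed so that every socket can fill its window. */
uint32_t rx_buffers_needed(uint32_t sockets, uint32_t window_bytes, uint32_t mss)
{
    assert(mss > 0);
    uint32_t per_socket = (window_bytes + mss - 1) / mss; /* round up */
    return sockets * per_socket;
}

/* RAM consumed by a given number of network buffers. */
uint32_t buffer_ram_bytes(uint32_t buffers, uint32_t buffer_size)
{
    return buffers * buffer_size;
}
```

For 5 sockets with a 5840-byte window and an MSS of 1460, `rx_buffers_needed()` returns 20, and `buffer_ram_bytes(20, 1600)` returns 32000 bytes, the "about 32K" quoted above.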

How does one find the Tx and Rx window sizes for a system? A reverse calculation is probably what happens most of the time.

When 20 network buffers are reserved for reception and the system needs a maximum of 5 sockets at any point in time, then:

RxWindow Size = (Number of buffers * MSS) / Number of sockets

If the result is less than one MSS, more RAM for additional buffers is required.
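The reverse calculation and the "less than one MSS" check can be written as a small C sketch. The function names are assumptions for the example; the formula is the one given above.

```c
#include <assert.h>
#include <stdint.h>

/* Rx Window Size = (Number of buffers * MSS) / Number of sockets */
uint32_t rx_window_size(uint32_t buffers, uint32_t mss, uint32_t sockets)
{
    assert(sockets > 0);
    return (buffers * mss) / sockets;
}

/* Nonzero when the window can hold at least one full segment. */
int rx_window_is_sufficient(uint32_t window_bytes, uint32_t mss)
{
    return window_bytes >= mss;
}
```

With 20 buffers, an MSS of 1460 and 5 sockets, the result is 5840 bytes per socket. With only 4 buffers shared by the same 5 sockets, the result drops to 1168 bytes, which is below one MSS and signals that more buffers are needed.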

Delayed Acknowledgement
Another important factor needs to be taken into consideration with TCP: the network congestion state. TCP keeps each network buffer transmitted until it is acknowledged by the receiving host. When packets are dropped or never delivered because of a number of network problems, TCP retransmits the packets. This means that unacknowledged buffers are set aside and used for this purpose.

TCP does not necessarily acknowledge every packet received, a situation called Delayed Acknowledgement. Without delayed acknowledgement, half of the buffers used for transmission are used for acknowledging every received packet. With delayed acknowledgement, this number is reduced to 33%.

Knowing the number of buffers that can be used for transmission, based on the quantity of RAM that can be used for network buffers and the maximum number of sockets in use at any point in time, the Transmit Window Size can be calculated:

Without Delayed Acknowledgement:
Tx Window Size = (Number of buffers * MSS) / (Number of sockets * 2)

With Delayed Acknowledgement:
Tx Window Size = (Number of buffers * MSS) / (Number of sockets * 1.5)
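Both formulas can be evaluated with integer arithmetic, which is convenient on targets without floating-point hardware. In this sketch the division by 1.5 is rewritten as multiplying by 2 and dividing by 3, and the result is rounded down; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Without delayed acknowledgement:
 * Tx Window Size = (buffers * MSS) / (sockets * 2) */
uint32_t tx_window_no_delayed_ack(uint32_t buffers, uint32_t mss, uint32_t sockets)
{
    assert(sockets > 0);
    return (buffers * mss) / (sockets * 2u);
}

/* With delayed acknowledgement:
 * Tx Window Size = (buffers * MSS) / (sockets * 1.5)
 *                = (buffers * MSS * 2) / (sockets * 3) */
uint32_t tx_window_delayed_ack(uint32_t buffers, uint32_t mss, uint32_t sockets)
{
    assert(sockets > 0);
    return (buffers * mss * 2u) / (sockets * 3u);
}
```

For 20 transmit buffers, an MSS of 1460 and 5 sockets, this gives 2920 bytes without delayed acknowledgement and 3893 bytes with it. In practice the value would likely be rounded down to a whole number of segments.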

Note that a similar analysis can be done with a UDP application. Flow control and congestion control, instead of being implemented in the Transport Layer protocol, are moved to the Application Layer.

One example is TFTP (Trivial File Transfer Protocol). Acknowledgement and retransmission are part of any data communications protocol. If they are not performed by the communication protocols, the application must take care of them.

It is the developer's decision to use UDP or TCP. If TCP is not required, it can be removed from the stack (reducing the application code space); however, the application will then need to take care of the network problems responsible for the non-delivery of packets.

DMA and CPU speed
As stated previously, most embedded targets are slow consumers. Packets generated by a faster producer and received by the target will consume most or all NIC network buffers and some packets will be dropped. Hardware features such as DMA and CPU speed can improve this situation. The latter is straightforward: the faster the target can receive and process the packets, the faster the network buffers can be freed.

DMA support for the NIC is another means to improve packet processing. When packets are transferred quickly to and from the stack, network performance improves. DMA also relieves the CPU from the transfer task, allowing the CPU to perform more of the protocol processing.

Conclusion
When implementing a TCP/IP stack, the design intentions need to be clear. If the goal is to use the Local Area Network without any consideration for performance, a TCP/IP stack or a subset of it can be implemented with very little RAM (approximately 32K).

However, if the application requires the capabilities of the TCP protocol at a few megabits per second, a more complete TCP/IP stack is required. In this case, embedded system requirements are in the range of 96K of RAM; resources need to be allocated to the protocol stack so that it can perform its duties.

Christian E. Legare is Vice-President of Micrium Technologies Corporation. He has a Master's degree in Electrical Engineering from the University of Sherbrooke, Quebec, Canada. He can be reached at christian.legare@micrium.com.
