Reworking the TCP/IP stack for use on embedded IoT devices


Developers often believe that since a communication protocol stack is called a TCP/IP stack, porting it to an embedded target provides the target with all TCP/IP functionalities and performance. This is far from true.

A TCP/IP stack requires resources such as sockets and buffers to do its job. These resources, however, consume RAM, a scarce resource on an embedded target. Deprived of sufficient resources, a TCP/IP stack will perform no better than an RS-232 connection.

When performance is not an issue and the primary requirements are connectivity and functionality, implementing TCP/IP on a target with scarce resources (RAM and CPU) is an option. Today, however, when an Ethernet port is available on a device, the expectation is that performance will be on the order of megabits per second. This is achievable on small embedded devices, provided certain design rules are observed.

Using the Transmission Control Protocol (TCP) as an example, this article presents the design rules to consider when porting a TCP/IP stack to an embedded device.

Network buffers
A TCP/IP stack places received packets in network buffers to be processed by the upper protocol layers, and also places data to send in network buffers for transmission. Network buffers are data structures defined in RAM.

A buffer contains a header portion used by the protocol stack. This header provides information regarding the contents of the buffer. The data portion contains data that has either been received by the Network Interface Card (NIC) and thus will be processed by the stack, or data that is destined for transmission by the NIC.

Figure 1 – Network buffer

The data portion of the network buffer contains the protocol data and protocol headers. For example:

Figure 2 – Encapsulation process

The maximum network buffer size is determined by the maximum size of the data that can be transported by the networking technology used. Today, Ethernet is the ubiquitous networking technology used for Local Area Networks (LANs).

Originally, Ethernet standards defined the maximum frame size as 1518 bytes. Removing the Ethernet, IP and TCP encapsulation data (the 14-byte Ethernet header, the 4-byte frame check sequence, and the 20-byte IP and TCP headers) leaves a maximum of 1460 bytes of data in the TCP segment. A segment is the data structure used to encapsulate TCP data. Carrying an Ethernet frame in one of the TCP/IP stack network buffers requires network buffers of approximately 1600 bytes each. The difference between the Ethernet maximum frame size and the network buffer size is the space required for the network buffer metadata.
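The arithmetic behind the 1460-byte figure can be checked with a short sketch. The header sizes are the standard minimums (no VLAN tag, no IPv4 or TCP options); the constant and function names are illustrative only and are not taken from any particular stack.

```python
# Sketch: derive the largest TCP payload that fits in a classic Ethernet frame.
# Assumes no VLAN tag, no IPv4 options and no TCP options.
ETH_MAX_FRAME = 1518   # bytes, maximum frame size cited in the article
ETH_HEADER = 14        # destination MAC + source MAC + EtherType
ETH_FCS = 4            # frame check sequence (CRC)
IPV4_HEADER = 20       # minimum IPv4 header
TCP_HEADER = 20        # minimum TCP header


def max_tcp_payload(frame_size: int = ETH_MAX_FRAME) -> int:
    """Return the TCP data bytes left after removing all encapsulation."""
    ip_packet = frame_size - ETH_HEADER - ETH_FCS   # 1500 for a 1518-byte frame
    return ip_packet - IPV4_HEADER - TCP_HEADER


print(max_tcp_payload())  # 1460
```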

It is possible to use smaller network buffers. For example, if the application periodically transfers small amounts of sensor data rather than streaming multimedia data, the buffers can be smaller than the maximum allowed.

TCP segment size is negotiated between the two devices that are establishing a logical connection. It is known as the Maximum Segment Size (MSS). An embedded system can take advantage of this protocol capability. On an embedded target with 32K of RAM, once the RAM used by all the middleware is accounted for, there is not much left for network buffers!

Network operations
Many networking operations affect system performance. For example, network buffers are not released as soon as their task is completed. Within the TCP acknowledgment process, a TCP segment is kept until its reception is acknowledged by the receiving device. If it is not acknowledged within a certain timeframe, the segment is retransmitted and kept again.

If a system has a limited number of network buffers, network congestion (packets being dropped) will affect the usage of these buffers and the total system performance. When all the network buffers are assigned to packets (being transmitted, retransmitted or acknowledging received packets), the TCP/IP stack will slow down while it waits for available resources before resuming a specific function.

The advantage of defining smaller network buffers is that more buffers exist, which allows TCP (and UDP) to have more protocol exchanges between the two devices. This is ideal for applications where the information is exchanged in small packets, such as a data logging device sending periodic sensor data.

A disadvantage is that each packet carries less data. For streaming applications, this is less than desirable. HTTP, FTP and other such protocols will not perform well with this configuration model.

Ultimately, if there is insufficient RAM to define a few network buffers, the TCP/IP stack will crawl.
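The trade-off between buffer size and buffer count can be illustrated with a rough sketch. The 82-byte metadata overhead is an assumption derived from the figures above (about 1600 bytes per buffer for a 1518-byte frame), and the 16K budget and the 256-byte frame size are arbitrary illustration values.

```python
# Sketch: how many network buffers fit in a given RAM budget.
# BUFFER_OVERHEAD is an assumed value (1600 - 1518), not a figure from any stack.
BUFFER_OVERHEAD = 82


def buffers_for_budget(ram_bytes: int, frame_capacity: int) -> int:
    """Return the number of buffers of the given frame capacity that fit in ram_bytes."""
    buffer_size = frame_capacity + BUFFER_OVERHEAD
    return ram_bytes // buffer_size


budget = 16 * 1024                        # hypothetical 16K set aside for buffers
print(buffers_for_budget(budget, 1518))   # 10 full-size buffers
print(buffers_for_budget(budget, 256))    # 48 small buffers
```

More small buffers allow more exchanges in flight, but each one carries less data, which is the trade-off described above.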

TCP Performance
Windowing. TCP has a flow control mechanism called Windowing that is used for Transmit and Receive. A field in the TCP header is used for the Windowing mechanism so that:

  1. The Window field indicates the quantity of information (in terms of bytes) that the recipient is able to accept. This enables TCP to control the flow of data.
  2. Data receiving capacity is related to memory and to the hardware’s processing capacity (network buffers).
  3. The maximum size of the window is 65,535 bytes (a 16-bit field).
  4. A value of 0 (zero) halts the transmission.
  5. The source host sends a series of bytes to the destination host.

Figure 3 – TCP Windowing

Within Figure 3, the following occurs:

  1. Bytes 1 through 512 have been transmitted (and pushed to the application using the TCP PSH flag) and have been acknowledged by the destination host.
  2. The window is 2,048 bytes long.
  3. Bytes 513 through 1,536 have been transmitted but have not been acknowledged.
  4. Bytes 1,537 through 2,560 can be transmitted immediately.
  5. Once an acknowledgement is received for bytes 513 through 1,536, the window will move 1,024 bytes to the right, and bytes 2,561 through 3,584 may then be sent.
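The byte ranges in the Figure 3 walk-through can be reproduced with a small model of the sender's window. This is only a sketch: the class and method names are hypothetical, and real stacks track sequence numbers modulo 2^32 rather than with plain integers.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SendWindow:
    last_acked: int   # highest byte number acknowledged by the receiver
    last_sent: int    # highest byte number transmitted so far
    size: int         # window size advertised by the receiver, in bytes

    def right_edge(self) -> int:
        """Highest byte number the sender is currently allowed to transmit."""
        return self.last_acked + self.size

    def in_flight(self) -> Tuple[int, int]:
        """Byte range transmitted but not yet acknowledged."""
        return (self.last_acked + 1, self.last_sent)

    def sendable(self) -> Tuple[int, int]:
        """Byte range that can be transmitted immediately."""
        return (self.last_sent + 1, self.right_edge())

    def acknowledge(self, up_to: int) -> None:
        """Slide the window to the right when an acknowledgement arrives."""
        self.last_acked = max(self.last_acked, up_to)


window = SendWindow(last_acked=512, last_sent=1536, size=2048)
print(window.in_flight())    # (513, 1536)
print(window.sendable())     # (1537, 2560)

old_edge = window.right_edge()
window.acknowledge(1536)     # bytes 513 through 1,536 are acknowledged
print((old_edge + 1, window.right_edge()))  # (2561, 3584) is newly opened
```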

On an embedded device, the window size should be configured in terms of the network buffers available. For example, with an embedded device that has eight network buffers with an MSS of 1460, let’s reserve 4 buffers for transmission and 4 buffers for reception. Transmit and receive window sizes will be 4 times 1460 (4 * 1460 = 5840 bytes).

Each time a packet is received, TCP decreases the Receive Window Size by 1460 and advertises the newly calculated Receive Window Size to the transmitting device. Once the stack has processed the packet, the Receive Window Size is increased by 1460, the network buffer is released, and the new Receive Window Size is advertised with the next packet transmitted.

Typically, the network can transport packets faster than the embedded target can process them. If the receiving device has received four packets without being able to process them, the Receive Window Size will be decreased to zero. A zero Receive Window Size advertised to the transmitting device tells that device to stop transmitting until the receiving device is able to process and free at least one network buffer. On the transmit side, the stack will stop if network buffers are not available. Depending on how the stack is designed and configured, the transmitting function will retry, time out or exit (blocking/non-blocking sockets).
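The receive-side bookkeeping described above can be sketched as follows, using the example of four receive buffers and an MSS of 1460. The class and method names are hypothetical; an actual stack keeps this state inside its connection control block.

```python
class ReceiveWindow:
    """Track the receive window as a count of free network buffers."""

    def __init__(self, num_buffers: int, mss: int) -> None:
        self.num_buffers = num_buffers
        self.mss = mss
        self.free_buffers = num_buffers

    @property
    def advertised(self) -> int:
        """Window size, in bytes, to advertise in the next outgoing packet."""
        return self.free_buffers * self.mss

    def on_packet_received(self) -> bool:
        """Consume one buffer; return False if none is free (window is zero)."""
        if self.free_buffers == 0:
            return False
        self.free_buffers -= 1
        return True

    def on_packet_processed(self) -> None:
        """Release one buffer once the stack has processed its packet."""
        if self.free_buffers < self.num_buffers:
            self.free_buffers += 1


rx = ReceiveWindow(num_buffers=4, mss=1460)
history = [rx.advertised]
for _ in range(4):               # four packets arrive before any is processed
    rx.on_packet_received()
    history.append(rx.advertised)
print(history)                   # [5840, 4380, 2920, 1460, 0]

rx.on_packet_processed()         # one buffer is freed, the window reopens
print(rx.advertised)             # 1460
```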

UDP does not have such a mechanism. If there are insufficient network buffers to receive the transmitted data, packets are dropped. The application needs to handle these situations.

TCP connection bandwidth product
The number of TCP segments being received/transmitted by a host has an approximate upper bound equal to the TCP window sizes (in packets) multiplied by the number of TCP connections:

Tot # TCP Pkts ~= Tot # TCP Conns * TCP Conn Win Sizes

This is the TCP connection bandwidth product.

The number of internal NIC packet buffers/channels limits the target host's overall packet bandwidth. Because most targets are slow consumers, data sent to the target by a faster producer will consume most or all of the NIC packet buffers/channels, and some packets will be dropped. However, even when throughput is exceptionally low, TCP connections should still be able to transfer data via retransmission.

Windowing with multiple sockets
The Windowing example given above assumes that the embedded device has one socket (one logical connection) with a foreign host. Imagine a system where multiple parallel connections are required.

The discussion above can be applied to each socket. With proper application code, each connection's throughput is a fraction of the total connection bandwidth. This means that the Window size configured in the TCP/IP stack needs to take into consideration the maximum number of sockets in use at any point in time.

Using the same example with 5 sockets and providing a Receive Window size of 5840 bytes to every socket, 20 network buffers (4 buffers per Window * 5 sockets) will have to be configured. Assuming that the largest network buffers possible (about 1600 bytes) are used, this means about 32K of RAM for network buffers (20 * 1600) is required; otherwise, the system will slow down due to excessive retransmissions.

A reverse calculation is probably what happens most of the time. How does one find the Tx and Rx window sizes for a system?

When 20 network buffers are reserved for reception and the system needs a maximum of 5 sockets at any point in time, then:

Rx Window Size = (Number of buffers * MSS) / Number of sockets

If the result is less than one MSS, more RAM for additional buffers is required.
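As a worked instance of the formula, the sketch below plugs in the 20 buffers and 5 sockets from the example, with an MSS of 1460. The function names are illustrative and are not part of any stack's API.

```python
def rx_window_size(num_buffers: int, mss: int, num_sockets: int) -> int:
    """Rx Window Size = (Number of buffers * MSS) / Number of sockets."""
    if num_sockets <= 0:
        raise ValueError("num_sockets must be positive")
    return (num_buffers * mss) // num_sockets


def needs_more_ram(window: int, mss: int) -> bool:
    """A window smaller than one MSS means more buffers (more RAM) are required."""
    return window < mss


window = rx_window_size(num_buffers=20, mss=1460, num_sockets=5)
print(window)                          # 5840, i.e. 4 buffers per socket
print(needs_more_ram(window, 1460))    # False

small = rx_window_size(num_buffers=4, mss=1460, num_sockets=5)
print(small)                           # 1168
print(needs_more_ram(small, 1460))     # True
```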

Delayed Acknowledgement
Another important factor needs to be taken into consideration with TCP: the network congestion state. TCP keeps each transmitted network buffer until it is acknowledged by the receiving host. When packets are dropped or never delivered because of network problems, TCP retransmits the packets. This means that unacknowledged buffers are set aside and used for this purpose.

TCP does not necessarily acknowledge every packet received, a technique called Delayed Acknowledgement. Without delayed acknowledgement, half of the buffers used for transmission are used for acknowledging every received packet. With delayed acknowledgement (typically one acknowledgement for every two packets received), this number is reduced to 33%.

Knowing the number of buffers that can be used for transmission, based on the quantity of RAM that can be used for network buffers and the maximum number of sockets in use at any point in time, the Transmit Window Size can be calculated:

Without Delayed Acknowledgement:

Tx Window Size = (Number of buffers * MSS) / (Number of sockets * 2)

With Delayed Acknowledgement:

Tx Window Size = (Number of buffers * MSS) / (Number of sockets * 1.5)
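The two formulas can be compared with a short sketch. The values (20 transmit buffers, an MSS of 1460 and 5 sockets) are illustration values mirroring the receive-side example, and the function name is hypothetical. Integer arithmetic is used so the result is a whole number of bytes.

```python
def tx_window_size(num_buffers: int, mss: int, num_sockets: int,
                   delayed_ack: bool) -> int:
    """Return the Transmit Window Size in bytes, rounded down.

    Without delayed acknowledgement the divisor is (sockets * 2);
    with delayed acknowledgement it is (sockets * 1.5), computed here
    as (total * 2) / (sockets * 3) to stay in integer arithmetic.
    """
    if num_sockets <= 0:
        raise ValueError("num_sockets must be positive")
    total = num_buffers * mss
    if delayed_ack:
        return (total * 2) // (num_sockets * 3)
    return total // (num_sockets * 2)


print(tx_window_size(20, 1460, 5, delayed_ack=False))  # 2920
print(tx_window_size(20, 1460, 5, delayed_ack=True))   # 3893
```

With delayed acknowledgement, fewer buffers are tied up by acknowledgements, so the same buffer pool supports a larger transmit window.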

Note that a similar analysis can be done with a UDP application. Flow control and congestion control, instead of being implemented in the Transport Layer protocol, are moved to the Application Layer protocol; TFTP (Trivial File Transfer Protocol) is one example. Acknowledgement and retransmission are part of any data communications protocol. If they are not performed by the communication protocols, the application must take care of them.

It is the developer’s decision to use UDP or TCP. If TCP is not required, it can be removed from the stack (reducing the application's ROM usage); however, the application will then need to handle the network problems responsible for the non-delivery of packets.
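What "taking care of it" in the application can look like is sketched below as a stop-and-wait scheme, similar in spirit to TFTP: each block is retransmitted until it is acknowledged. This is a simulation only. The `deliver` callback stands in for a UDP send followed by a wait for an acknowledgement with a timeout, and all names are hypothetical.

```python
from typing import Callable, List, Set


def send_stop_and_wait(blocks: List[bytes],
                       deliver: Callable[[int, bytes], bool],
                       max_retries: int = 3) -> int:
    """Send each block in order, retransmitting until it is acknowledged.

    deliver(number, block) returns True when an acknowledgement came back
    in time. Returns the total number of transmissions performed.
    """
    transmissions = 0
    for number, block in enumerate(blocks, start=1):
        for _attempt in range(max_retries + 1):
            transmissions += 1
            if deliver(number, block):
                break                      # acknowledged, move to next block
        else:
            raise TimeoutError(f"block {number} was never acknowledged")
    return transmissions


# Simulated link that drops the first transmission of block 2.
received: List[bytes] = []
dropped_once: Set[int] = set()


def lossy_link(number: int, block: bytes) -> bool:
    if number == 2 and number not in dropped_once:
        dropped_once.add(number)
        return False                       # packet lost, no acknowledgement
    received.append(block)
    return True


count = send_stop_and_wait([b"a", b"b", b"c"], lossy_link)
print(count)      # 4 transmissions for 3 blocks
print(received)   # [b'a', b'b', b'c']
```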

DMA and CPU speed
As stated previously, most targets are slow consumers. Packets generated by a faster producer and received by the target will consume most or all NIC network buffers, and some packets will be dropped. Hardware features such as DMA and CPU speed can improve this situation. The latter is straightforward: the faster the target can receive and process the packets, the faster the network buffers can be freed.

DMA support for the NIC is another means to improve packet processing. When packets are transferred quickly to and from the stack, network performance improves. DMA also relieves the CPU of the transfer task, allowing the CPU to perform more of the protocol processing.

When implementing a TCP/IP stack, the design intentions need to be clear. If the goal is to use the Local Area Network without any consideration for performance, a TCP/IP stack or a subset of it can be implemented with very little RAM (approximately 32K).

However, if the application requires the capabilities of the TCP protocol at a few megabits per second, a more complete TCP/IP stack is required. In this case, where embedded system requirements are in the range of 96K of RAM, sufficient resources need to be allocated to the protocol stack so that it can perform its duties.

Christian Legare is Executive Vice-President and Chief Technology Officer at Micrium. He has a Master's degree in Electrical Engineering from the University of Sherbrooke, Quebec, Canada. In his 22 years in the telecom industry, he deployed networks and taught classes. Christian was involved as an executive in large-scale organizations as well as start-ups, mainly in Engineering and R&D. Christian was in charge of an IP (Internet Protocol) certification program at the International Institute of Telecom (IIT) in Montreal, Canada as their IP systems expert. Mr. Legare joined Micrium, home of the uC/OS-II and uC/OS-III real-time kernels, in 2002 as Executive Vice-President and Chief Technology Officer, and was instrumental in the development of the majority of the kernel services.

This paper was presented at the Embedded Systems Conference as part of a class taught by Christian Legare on “Achieving TCP-IP performance in embedded systems (ESC-106).”
