
Building better memory management for high performance wired/wireless networks: Part 1

Memory management is a fundamental technique in embedded systems design. Standard “C” memory allocation techniques such as malloc/free are available in most embedded systems development environments.

However, because of the memory fragmentation typically associated with these functions, and the havoc that can result from their misuse, the embedded development community has long held that dynamic allocation should be discouraged, if not banned outright.

This is particularly true for mission-critical and safety-critical embedded systems, where response times need to be predictable and bounded. The typical solution is to allocate, during system initialization, a fixed-size pool or stack of buffers.

After that, buffer allocations are requested from the pool, which allows for almost constant processing time. However, this method also has several problems.

The buffer size is typically chosen as the maximum possible request size foreseen by the intended application, e.g. the maximum transfer unit (MTU) of a protocol data unit (PDU). This results in low overall memory utilization, because many requests require far less memory than the buffer size.

Network designs' memory allocation problems
This problem is magnified in protocol stacks embedded within high-speed network interface cards. Typically such stacks process variable-size protocol data units (PDUs)/service data units (SDUs) on their input and output ports, requiring high-speed processing.

One of the key factors in realizing the required high-speed processing of PDUs/SDUs is an efficient buffer management scheme. In many instances, a communication link's raw capacity is bottlenecked by processing delays in the network interface card, the host processor hardware, or the operating system.

In this article we describe what we think is an efficient memory management system designed specifically for high-speed network interface cards, particularly data link layers (layer 2 of the protocol stack).

We specifically target layer 2 implementations, as these are good candidates for software implementation on embedded processors. Our variable-buffer scheme is based on structuring a memory area into equal-sized units, each of which we call a partition. A pool is defined by the configuration (N, S), where N is the number of partitions and S is the size of each partition in bytes.

Full details of this scheme, which has been implemented in SySDSoft's LTE and WiMAX protocol stacks, are described later in this first of two articles.

In Part 2 of this series, we detail the results of various tests and demonstrations we performed to compare network implementations based on variable-size versus more traditional fixed-size memory pools. Our general conclusion is that this scheme exhibits excellent performance in LTE and WiMAX protocol stacks.

Typical processing of the Protocol stack data path
The data path of a protocol stack is typically structured as multiple layers separated by well-defined interfaces. In abstract terms, a protocol sublayer receives a service data unit (SDU) from an upper layer/sublayer and processes it, producing its own protocol data unit (PDU).

Typical processing involves packing multiple SDUs into one PDU, fragmenting a large SDU into a number of PDUs, header and subheader formation/manipulation, ciphering, and buffering for automatic repeat request (ARQ) processing, among other operations.

Many of these operations require the allocation of an appropriate buffer space that is used to perform a certain function on the SDUs/PDUs. If we take as an example the MAC layer of a WiMAX system, the buffer is reserved until the PDU is passed to the physical layer or the upper layer (depending on the direction of the transfer). After processing by the lower or upper layer, the buffer is typically released back to the pool of available buffers and can be reallocated.

Due to the well-known problems with dynamic allocation (e.g. malloc/free) such as leaks and fragmentation [1, 2, 3, 4], embedded developers typically use customized memory pool management libraries. Leaks can wreak havoc at run time, and fragmentation can result in non-deterministic response times, whereas predictable timing is of paramount importance in mission-critical systems.

Moreover, the efficiency of memory pool management determines to a large extent the maximum throughput that can be attained by a network interface card. There is a tradeoff between the amount of memory available for PDU/SDU processing and the achievable peak throughput. A larger memory that is efficiently utilized will most probably yield higher throughput. On the other hand, memory is one of the most expensive and power-consuming resources, and these factors need to be carefully considered in chip design/fabrication.

As stated previously, fixed-size pools are usually allocated at system initialization. In the context of the protocol stack example, the buffer size of the pool is taken to be the maximum transfer unit (MTU) of the protocol considered.

However, in many situations the data can be much smaller than the MTU. For example, for Ethernet/WiMAX the MTU is 1518 bytes, whereas typical IP packets average between 400 and 600 bytes, and some packets are much smaller depending on the application mix. ICMP, DNS, online chat, and VoIP packets, for instance, are typically smaller than 100 bytes.

Designing a pool to provide the highest throughput on the assumption that all processing occurs at MTU sizes can thus waste memory resources. This motivated us to design a scheme that utilizes the available buffers more efficiently.

Below we outline the familiar fixed-size pools and our newly devised partitioned pools for highly efficient memory management. Both implementations depend on a basic data structure called a “memory segment”.

A memory segment is a relatively large chunk (buffer) of memory allocated either as a static array of the required size or dynamically at system start-up using malloc or an equivalent. Alternatively, most RTOSes provide a memory-segment handling library in case more advanced processing is needed.
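
As a concrete illustration, a memory segment can be as simple as a statically reserved byte array plus a cursor that hands out chunks during initialization. The sketch below is our own minimal rendering; the names and sizes are hypothetical, not taken from any particular implementation:

/* Minimal memory-segment sketch: a statically reserved byte array that
 * pool code carves buffers out of during initialization. The segment
 * could equally be obtained once at start-up with malloc(). */
#include <stddef.h>
#include <stdint.h>

#define SEGMENT_SIZE (64 * 1024)   /* hypothetical segment size */

static uint8_t mem_segment[SEGMENT_SIZE];
static size_t  mem_segment_used;

/* Hand out a chunk of the segment; intended for use only at init time. */
static void *segment_carve(size_t bytes)
{
    if (mem_segment_used + bytes > SEGMENT_SIZE)
        return NULL;               /* segment exhausted */
    void *p = &mem_segment[mem_segment_used];
    mem_segment_used += bytes;
    return p;
}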

Fixed-size Memory Pools
The fixed memory pool is a data structure that maintains the pool's internal variables and provides the interface to the fixed-pool management functions. The data structure provides four main operations: Initialize, Destroy, Allocate, and Free.

A user initializes the pool with the required maximum size for each buffer and the number of buffers needed. The pool is maintained as a queue with a number of entries equal to the required number of buffers. The queue is populated with pointers to the buffers, which are allocated from a user-specified memory segment when the pool is initialized. The memory pool also contains a mutex for synchronization and thread-safe operation, as well as simple counters to collect statistics.

Figure 1: Fixed-size Memory Pool Implementation

As shown in Figure 1 above, the queue is the central structure for the allocate and free operations. Allocation requests are performed as a “dequeue” operation from the queue head. If the queue is empty, the allocation fails.

The allocated buffers are used for a random period depending on the operation performed and then released/freed. The buffer release/free is realized via an “enqueue” operation performed at the queue tail. The queue can be implemented as a linked list and thus enqueue/dequeue operations are O(1).
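
A minimal C sketch of such a fixed-size pool is shown below. It is our own simplification: the mutex and statistics counters are omitted, and a ring buffer of pointers is used in place of a linked list, which still gives O(1) enqueue/dequeue:

/* Simplified fixed-size pool sketch (no mutex, no statistics).
 * Buffer pointers live in a circular queue: allocate dequeues from the
 * head, free enqueues at the tail, both in O(1). */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    void   **queue;      /* array of buffer pointers           */
    size_t   capacity;   /* number of buffers in the pool      */
    size_t   head, tail; /* dequeue from head, enqueue at tail */
    size_t   count;      /* buffers currently available        */
} fixed_pool_t;

int fixed_pool_init(fixed_pool_t *p, uint8_t *segment,
                    size_t buf_size, size_t num_bufs)
{
    p->queue = malloc(num_bufs * sizeof(void *));
    if (p->queue == NULL)
        return -1;
    p->capacity = num_bufs;
    p->head = p->tail = 0;
    p->count = num_bufs;
    for (size_t i = 0; i < num_bufs; i++)      /* carve fixed-size buffers */
        p->queue[i] = segment + i * buf_size;  /* out of the given segment */
    return 0;
}

void *fixed_pool_alloc(fixed_pool_t *p)
{
    if (p->count == 0)
        return NULL;                           /* pool exhausted */
    void *buf = p->queue[p->head];
    p->head = (p->head + 1) % p->capacity;
    p->count--;
    return buf;
}

void fixed_pool_free(fixed_pool_t *p, void *buf)
{
    p->queue[p->tail] = buf;
    p->tail = (p->tail + 1) % p->capacity;
    p->count++;
}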

The main advantage of fixed-size buffer pools is that the implementation is very simple and allows fast allocation and release. The overhead of the fixed-size memory pool is also small: just the queue structure, the other member variables, and whatever metadata may be associated with each buffer.

The main disadvantage is that if the usage pattern is such that the processed data has large variance and contains a mix of small and large PDUs, then memory utilization will be low, since a buffer allocated for a small PDU still occupies the full buffer.

For example, if the MTU size is 1500 bytes, we set the fixed-buffer size to 1540 bytes to allow for possible header/trailer additions, and the protocol receives an 80-byte IP packet for processing, then the utilization of that buffer is about 5%.

Since the data mix is not usually known in advance, fixed-size memory pools are typically designed with worst case assumptions. In order to support high throughput at high traffic loads, the designer will need to increase the number of buffers or face packet drops. The increase in the memory size impacts the implementation cost of the system.

Variable-size Memory Pools
The variable-size memory pool is a data structure that, as in the fixed-size pools, supports the functions Initialize, Destroy, Allocate, and Free. The basic concept in the variable-size pool is to consider the memory as being constructed of a number N of consecutive partitions of a relatively small, user-configurable fixed size S.

A request for the allocation of B bytes is translated into the allocation of ⌈B/S⌉ consecutive partitions from the pool. The Initialize function is provided with the number of partitions N and the size S of each one. As will be shown in the performance evaluation study in Part 2, the smaller the size of the partition, the better the attained performance in terms of achieved throughput.

This is expected, as a smaller granularity allows more flexible and efficient allocations. However, for reasons that will be explained below, excessively small partitions carry a large overhead and typically result in low overall memory utilization.
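
The sizing arithmetic behind this tradeoff is simple. The short example below uses hypothetical numbers to show how a request for B bytes maps to ⌈B/S⌉ partitions and how the per-partition overhead H (introduced below) grows relative to the pool footprint as S shrinks:

/* Sizing arithmetic sketch (symbol names follow the article; the
 * numbers are hypothetical). A request for B bytes consumes
 * ceil(B / S) partitions, and each partition carries H bytes of
 * overhead, so the pool footprint is N * (S + H) bytes. */
#include <stdio.h>

int main(void)
{
    unsigned S = 128;    /* partition size in bytes, user-configurable */
    unsigned H = 8;      /* assumed per-partition overhead             */
    unsigned N = 1024;   /* number of partitions                       */
    unsigned B = 300;    /* example allocation request                 */

    unsigned parts     = (B + S - 1) / S;   /* ceil(B / S) = 3 here    */
    unsigned footprint = N * (S + H);       /* total pool memory       */

    printf("request of %u bytes -> %u partitions (%u bytes reserved)\n",
           B, parts, parts * S);
    printf("pool footprint: %u bytes, %.1f%% of it overhead\n",
           footprint, 100.0 * (N * H) / footprint);
    return 0;
}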

Figure 2: Variable-size Memory Pool Implementation

In our implementation, the variable-size pool is implemented via three main structures, as shown in Figure 2 above. The first structure is a buffer with a total size of N × (S + H) bytes, where H is the size of the overhead associated with each partition (see below for more information).

The buffer is allocated from a given memory segment at the time the pool is initialized. The second main structure is an array of “partition group” (PG) structures. A PG is an abstraction for a consecutive number of partitions that are either allocated or free. The structure maintains a pointer to the individual partition memory and an indication of whether the PG is allocated.

Also, each structure contains a pointer to the “other end” of the partition group (indicated by the blue dotted lines in Figure 2). The other-end pointer is necessary for merging and splitting partitions upon the free and allocate operations, respectively, as will be explained below. To remove the need for handling special cases in the PG merge/split, the PG array contains a dummy head node and a dummy tail node.

During construction, each partition is a distinct PG containing only itself, and its other-end pointer points to itself. After initialization of all PG structures, we have one large free PG (FPG) comprising all the partitions. The other-end pointers of the first and last PG entries in the array point to each other. This situation is shown in Figure 3 below.

Figure 3: Initialization of the partition group structure: one big free partition containing all buffers.

The third fundamental structure in the variable memory pool is a linked list of free partition groups. The list maintains pointers to the first members of unallocated partition groups. The list is used to search for a free partition group with a size that can accommodate an incoming allocation request. Currently, the list is searched in a linear fashion until the first such partition group is found.

The variable memory pool structure contains other maintenance variables and a mutex to control shared access. To support the free operation, each buffer contains a metadata header with a pointer to the PG to which it belongs, as shown in Figure 2.
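
The sketch below captures these three structures in C. It is a simplified rendering of the description above, not the actual implementation: the field names are ours, and the mutex and statistics are omitted.

/* Structure sketch of the variable-size pool. Each partition group (PG)
 * records whether it is allocated, where its partition's memory starts,
 * and a pointer to the PG at the other end of its group. Free partition
 * groups are additionally linked into the FPG list used by the allocator. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct partition_group {
    struct partition_group *other_end;  /* first <-> last PG of the group */
    struct partition_group *next_free;  /* FPG list links (free PGs only) */
    struct partition_group *prev_free;
    uint8_t                *mem;        /* start of this partition's data */
    bool                    allocated;
} partition_group_t;

/* Per-buffer metadata header placed in front of the returned buffer,
 * so that the free operation can find the owning partition group. */
typedef struct {
    partition_group_t *owner_pg;
} buffer_header_t;

typedef struct {
    uint8_t           *buffer;    /* N * (S + H) bytes from the segment   */
    partition_group_t *pg;        /* array of N PGs plus dummy head/tail  */
    partition_group_t *fpg_head;  /* linked list of free partition groups */
    size_t             num_parts; /* N */
    size_t             part_size; /* S */
} var_pool_t;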

The Allocate Operation
The allocate function is invoked when the user wishes to allocate a buffer of B bytes from a particular memory pool. The function calculates the number of partitions P = ⌈B/S⌉ corresponding to B bytes and starts by searching the FPG linked list for a partition group that contains a number of free partitions greater than or equal to P.

The first FPG meeting the criterion is retrieved. If it comprises exactly the requested number of partitions, the whole PG is allocated to the request and removed from the FPG list. If it comprises more partitions than requested, it is split into two partition groups.

One of the two groups, with size equal to the allocation request, is reserved for the new request, whereas the remaining part of the original PG stays in the free partition group list. The splitting is done in a back-to-front fashion, as shown in Figure 4 below.

This avoids unnecessary manipulation of the FPG linked list by eliminating the need to remove the original PG and then insert a new one. During this operation, the other-end pointers of the two new partition groups must be adjusted.

Figure 4: Split of a partition group in a back-to-front fashion upon an allocate operation for B bytes.

The overhead associated with the allocate operation is mainly the search of the FPG linked list for an FPG that can accommodate the requested size.
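
Building on the structure sketch above, a simplified allocate might look as follows. This is our own rendering: the placement of the per-buffer header inside the first partition, and therefore the use of ⌈(B + header)/S⌉ rather than ⌈B/S⌉, is an assumption made to keep the sketch self-consistent.

/* Simplified allocate: linear search of the FPG list; on a split the
 * allocated partitions are taken from the back of the free group so the
 * part that stays free keeps its place in the FPG list (back-to-front). */
static size_t pg_count(const partition_group_t *first)
{
    return (size_t)(first->other_end - first) + 1;  /* PGs form an array */
}

void *var_pool_alloc(var_pool_t *pool, size_t bytes)
{
    size_t needed = (bytes + sizeof(buffer_header_t) + pool->part_size - 1)
                    / pool->part_size;              /* ceil((B + header) / S) */

    for (partition_group_t *fpg = pool->fpg_head; fpg; fpg = fpg->next_free) {
        size_t have = pg_count(fpg);
        if (have < needed)
            continue;

        partition_group_t *first;
        if (have == needed) {
            /* Exact fit: unlink the whole group from the FPG list. */
            if (fpg->prev_free) fpg->prev_free->next_free = fpg->next_free;
            else                pool->fpg_head = fpg->next_free;
            if (fpg->next_free) fpg->next_free->prev_free = fpg->prev_free;
            first = fpg;
        } else {
            /* Back-to-front split: the front part stays free and keeps its
             * FPG entry; the tail part becomes the new allocation. */
            partition_group_t *free_last = fpg->other_end - needed;
            first = free_last + 1;
            first->other_end = fpg->other_end;      /* rewire both groups' */
            fpg->other_end->other_end = first;      /* other-end pointers  */
            fpg->other_end = free_last;
            free_last->other_end = fpg;
        }
        first->allocated = true;
        first->other_end->allocated = true;

        /* Write the metadata header at the start of the first partition
         * (assumes partitions are suitably aligned) and return the rest. */
        buffer_header_t *hdr = (buffer_header_t *)first->mem;
        hdr->owner_pg = first;
        return hdr + 1;
    }
    return NULL;                                    /* no fit found */
}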

The Free Operation
The free operation is passed a pointer to the buffer to be released and has no direct information about PGs. The metadata header in the buffer keeps a pointer to the PG to which it belongs.

In this way, we retrieve the PG and mark it as free. We then check whether the left and right neighboring partition groups are free. Note that at the two borders of the partition group array we have the dummy head and dummy tail partition groups, which are always marked allocated. The dummy head and tail eliminate special handling when freeing the first or last partitions.

We have four distinct cases: 1) both neighbors are allocated, in which case the freed PG is simply added to the FPG linked list; 2) and 3) either the left or the right neighbor is free, in which case the two partition groups can be merged and the corresponding neighbor's entry in the FPG linked list is updated; and 4) both the left and right neighbors are free, in which case the three groups can be merged together to form one larger partition group.

In this last case, the entry of the right neighbor is removed from the FPG linked list and the entry of the left neighbor is updated. The merge operation is simply done by updating the other-end pointer in the partition group entries. The overhead of the free operation is mainly the logic associated with the merge operation.
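
Again building on the earlier sketch, a simplified free with eager merging might look like this. It is our rendering, and the FPG list is assumed to be doubly linked so that entries can be unlinked in O(1).

/* Simplified free: the buffer's metadata header leads back to its
 * partition group; free neighbors on either side are merged by rewiring
 * the other-end pointers. The dummy head and tail PGs are permanently
 * marked allocated, so the border cases need no special handling. */
static void fpg_remove(var_pool_t *pool, partition_group_t *pg)
{
    if (pg->prev_free) pg->prev_free->next_free = pg->next_free;
    else               pool->fpg_head = pg->next_free;
    if (pg->next_free) pg->next_free->prev_free = pg->prev_free;
}

static void fpg_push(var_pool_t *pool, partition_group_t *pg)
{
    pg->prev_free = NULL;
    pg->next_free = pool->fpg_head;
    if (pool->fpg_head) pool->fpg_head->prev_free = pg;
    pool->fpg_head = pg;
}

void var_pool_free(var_pool_t *pool, void *buf)
{
    buffer_header_t   *hdr   = (buffer_header_t *)buf - 1;
    partition_group_t *first = hdr->owner_pg;
    partition_group_t *last  = first->other_end;

    first->allocated = last->allocated = false;

    partition_group_t *left  = first - 1;  /* last PG of the left group   */
    partition_group_t *right = last + 1;   /* first PG of the right group */

    if (!left->allocated) {
        /* Cases 2/4: merge with the free left neighbor, whose FPG entry
         * stays in place and simply grows. */
        first = left->other_end;
        first->other_end = last;
        last->other_end  = first;
    }
    if (!right->allocated) {
        /* Cases 3/4: merge with the free right neighbor, dropping its entry. */
        fpg_remove(pool, right);
        last = right->other_end;
        first->other_end = last;
        last->other_end  = first;
    }
    if (left->allocated)
        /* Cases 1 and 3: the merged (or lone) group needs a new FPG entry. */
        fpg_push(pool, first);
}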

There is, however, a configuration option in the implementation to delay merging until an allocation request fails. In that case, we search for FPGs that can be merged with their neighbors to construct a larger FPG that can satisfy the request. In other words, the merge is performed “on demand”.

To read Part 2, go to Performance evaluation

References
[1] B. Stroustrup, Programming: Principles and Practice Using C++, Chapter 25, “Embedded Systems Programming,” Addison-Wesley, December 2008.
[2] D. Saks, “The yin and yang of dynamic allocation,” Embedded Systems Design, May 2008.
[3] D. Saks, “Dynamic allocation in C and C++,” Embedded Systems Design, June 2008.
[4] N. Murphy, “Safe Memory Utilization,” Embedded Systems Design, April 2000.

Khaled Fouad Elsayed is the CTO of the embedded wireless business unit at SySDSoft and a full professor of communication networks at Cairo University. Khaled received his B.Sc. degree with honors and his M.Sc. from Cairo University, and his Ph.D. in Computer Science and Engineering from North Carolina State University. He has published over 50 technical papers, has served as chair or technical committee member of various IEEE and IFIP conferences, and was a technical editor of IEEE Communications Magazine.

Mohamed Galal is a Software Specialist in the Embedded Wireless business unit at SySDSoft and has been with the company since 2004. Currently he is a member of the LTE User Equipment (UE) project team, where his main role is designing, developing, and testing features in the Non Access Stratum (NAS) layer. Previously, he was the Software Lead of the Mobile WiMAX Medium Access Control (MAC) program for both Mobile Station (MS) and Base Station (BS). Mohamed received his B.Sc. in Communications and Electronics Engineering from Cairo University.
