Improving performance for dynamic memory allocation -

Improving performance for dynamic memory allocation

In many embedded systems applications, developers are often caught on the horns of a dilemma. On the one hand, they are constrained to use the limited memory space available as efficiently as possible. But at the same time, they are required to get as much performance out of the hardware resources (including memory) as is possible.

So it's no surprise that embedded systems developers spend much of their time trying to manage the available memory resources so as to achieve the best combination of performance and minimum memory use. To do this, extensive use is made of a variety of static and dynamic memory allocation techniques.

In static memory allocation, it's necessary to predict–and limit–in advance what the maximum number of objects will be for a given pool, or seen from another perspective, the amount of memory.

This often leads to a large amount of memory being “wasted,” and/or constraints on the number of objects that would limit functionality.

Another problem with static memory allocation has to do with security: allocating more memory on the stack than is available can result in a crash due to stack overflow.

A third problem is that the memory stored on the stack is automatically deallocated when the function that created it returns, and thus the function must copy the data if they should be available to other parts of the program after it returns.

But despite these drawbacks, static memory allocation is still an important part of the embedded systems developer's repertoire because it's much faster than the allocation on the heap, as the data is added and removed in a last-in-first-out manner. Also, the memory on the stack is automatically reclaimed when the function exits, which can be convenient for the programmer.

But on balance, the few advantages of static allocations are far outweighed by its disadvantages in favor of the dynamic approach. Allocating memory dynamically means allocating it from a large pool of unused memory called heap (also called free store).

The word dynamic, related to the memory allocation, refers to two things:

• Allocating memory during the runtime of the program.

• The object allocated has a dynamic lifetime as it is not deallocated when a function returns.

But while dynamic memory allocation is more versatile and secure than static memory allocation, it, too, has some performance-related issues. One has to do with memory leaks. As the memory is not freed automatically, the programmer has to take care of this task and remember to free the memory. It's also necessary to take the time to find a block of unused memory of sufficient size to fulfill the request.

Directly connected to the lookup of a free block is the internal and external fragmentation. Trying to optimize reduce the fragmentation is not an easy task and adds complexity to the algorithms. Fragmentation can create the possibility that it may not be possible to allocate a contiguous block of memory of the requested size if the heap becomes too fragmented.

Dynamic preallocation
This article describes a new approach–the use of a preallocated heap–that combines the best of both approaches with few of the disadvantages. In short, it achieves both good performance and extreme versatility.

To implement this preallocated heap scheme, a new layer is needed between the application and the heap, designed to allow the application to have a “dedicated” heap space, allocated up-front (as in static allocation) and not deallocated until the overall execution of the application ends.

This last point is very important because the allocated memory is not just available to a single function (as in static allocation) but to the all application and throughout the lifetime of the application. Moreover, the programmer does not need to care about freeing the memory (avoid memory leaks) because it will be done when the application ends.

The rest of the article gives some hints for implementing a preallocated heap and covers the algorithms and data structures used to implement such an idea.

The problem of contiguous memory
Looking for contiguous memory segments (“memory pools”) in order to fulfill the request of the user can be very time consuming.

The practice of memory pools is well known in the CS community. For example there is the kmem_cache_ create function in the Linux kernel implementation. However, its implementation has some associated disadvantages:

• It's not possible to hide the use of a memory pool. Basic memory pools are initialized statically with a finite amount of memory and a call to the initialization function cannot be omitted.

• It's generally used for only a well known and fixed memory size

To resolve such issues, I have developed a memory management process that provides the following benefits as it relates to dynamic memory allocation:

• It can be dedicated to dealing with the parameters of the developers particular design; in this case, that is our own application.

• It works for every requested memory size.

• It's configurable via an external configuration file.

• It's able to reduce the internal and external fragmentation

• It has a statistics collector process that helps the developer to tune the configuration file.

Solving the problem
The preallocation technique I have come up with is based on the use of two data structures: the linked list and a hash.

The memory, initially allocated using a configuration file modifiable by the user, is organized in blocks and each block is divided into several segments.

The preallocation is done using the information found in the external configuration provided by the user, such as (1) how many blocks the dedicated heap should be divided into, (2) how many segments in each block, and (3) the size of the segment.

The application developer then chooses the number of segments and the size of each segment building the memory block. My recommendation is to select powers of two (16, 32, 64, and so on). All these memory blocks, taken together, constitute the preallocated memory area.

View the full-size image

For example, in Figure 1 , the dedicated heap has been divided into four blocks. Each block contains the same number of segments (128) but of different sizes: the first one has segments of 512 bytes, the second one of 1,024 bytes (1K), the third one of 2,048 (2K) byte and the last one of 4,096 bytes (4K). As a result the total amount of memory preallocated is given by a simple operation: (128*512) + (128*1,024) + (128*2,048) + (128*4,096).

The strategy I use to minimize memory fragmentation is very straightforward: when the user is asking for a certain amount of memory (calling it X ), the lookup phase is done in the memory block containing segments of the closest size greater than X .

For instance, let's imagine that the image in Figure 1 is our current configuration and that the user is asking for a segment of 768 bytes; the lookup is done in the pool that contains 1,024 byte segments.

Data structures
First, let's deal with free memory segments. The data structure used to manage each memory block is a linked list having also two more pointers, one for the head of the list and one for the tail.

View the full-size image

Each list element (shown in Figure 2 ) is a data type containing the following information:

1. The memory address of the segment.

2. The size of the segment.

3. An identifier of the block.

4. A pointer to the same data type in order to make the linking.

In the preallocation scheme I propose, a memory pool is a data type that contains the following information:

1. The number of segments contained in the pool.

2. The size of each segment inside the pool.

When the allocation function is called, the first operation to be performed is the lookup. This step is done in the block containing the segments of that are (1) closest in proximity to and (2) greater in size than the memory request.

Listing 1 shows how the lookup is performed. When the suitable block has been found, the first operation to be performed is to check if the block is empty. If the TAIL pointer is NULL , the block has no more segments; in other words, no more memory is left. The insert into the list is performed at head while the remove is done at the tail.

View the full-size image

If all checks pass, the last operation performed an insertion into the linked list and adds a new element into the hash data structure, named busyhash , which I'll discuss in more detail later.

The algorithmic complexity of this operation is constant O(1) because it's not affected by the size of the list. [O(1) is an indicator that an operation is one of the slowest growing functions in Big O notation, used in computational complexity theory to describe an algorithm's usage of computational resources (usually running time or memory) with respect to the size of the input data.]

From a computation perspective, the only operation that must be synchronized–to avoid race conditions in multithreading environment–is the update of the TAIL pointer as it's the only one used to retrieve an element from the list. Such operation is performed using the mutual exclusion provided by the pthread library.

On the other hand, the piece of code that has to be locked in order to avoid race conditions in a multithreading application is just the update of the pointer to the TAIL of the list and can be easily ignored.

Making a Busyhash of allocated memory segments
The busyhash is a hash-table storing the segments already allocated.

The element used as the index for the hash-table is the address pointer of the memory segment, because it's of course unique and will be passed into the free function as a parameter, while the hashed element is the instance of the segment data type.

When the user invokes the free function, the lookup is done inside the hash-table, where the operation has complexity time O(1). For multithreading application, you can follow the same approach with the same complexity as that of that the linked list.

Once the element is found in the hash-table during the free operation, it's added again to the linked list containing the free segments. The pool to which it belongs is well-known thanks to the field that the element stores within itself. For example, using the allocation done in Figure 1, when a segment of 512 bytes is freed, the field “stackNumber” inside its data type is accessed and used to address the correct memory block, as shown in Figure 3 .

View the full-size image

The free function is as straightforward to implement as an ordinary alloc function:

1. The element is removed from the hash-table and stored in a local variable.

2. The HEAD->prev is then pointed to the new free element.

3. The HEAD points to the new free element.

The code in Listing 2 shows the sequence that is followed once the segment is removed from the busyhash.

View the full-size image

Busyhash caveats
When using busyhash to preallocation, there could be situations when, even with a perfectly configured and tuned configuration, errors occur that result in an empty block of memory. These errors cannot be seen during tests.

Of course, this type of error does not mean that no memory is left in the system. More likely the memory allocator with the current configuration is not able to find free segments from the preallocated pool.

Following are three different behaviors (and some caveats on their use ) supported by the library I developed, and which can only be used one at a time:

1. Return the user a NULL pointer, like malloc() behavior, and log an error message of OutOfMemory :

a. Not very useful in the release version of the application.

b. Could be useful to study how greedy for memory the application is.

c. With reference to (b), there is the chance to rearrange the configuration file, which is very important during user's application development as the memory requirements could not be known from the beginning.

2. Use a bigger segment than usual to satisfy the request. While it's the fastest way to solve the issue, it's not the best one (fragmentation is best).

3. Allocate more space on the fly for that particular memory block:

a. It would be the best way to proceed; especially with a release version of the user's application.

b. It has some implications with performance.

Of course, it's difficult to say which behavior to choose because it depends partly on the life time of the user's application. But generally two “rules of thumb” can be applied:

• If you're deploying the application, you should not even consider using the first approach, because it will cause your application to fail.

• If you're testing the application and the behavior of the memory allocator, you can use the first approach because it will help you tune the external configuration file

Collecting stats on memory allocation
To facilitate implementation of this procedure, we've built up a library of useful functions. One of these is for automated collection of statistics to provide some feedback to and guidelines for the developer and the user.

Using the information provided through the stats-collection process, the user can tune the behavior of the memory management using the external configuration file, increasing its performance.

The stats collection should be used only during the development of the user's applications or testing them for the simple reason that it's obviously a time-consuming process.

Keeping allocation simple
In order not to add complexity to the development process and to keep the C programming language style, the two functions used to allocate and free memory from the virtual memory are very similar to the standard C malloc and free functions:

void* pMalloc(size_t size)void pFree(void* pointer)   

Two more functions should be used in order to make the library works correctly:

void init()void shutdown()   

The first one is used to initialize the system, reading the start-up configuration from the external file, and the other function is in charge of freeing all the preallocated memory.

Benchmarking the Busyhash
To effectively assess the performance of the Busyhash approach, run benchmarks divided into two test cases:

• Sequential calls to the new malloc function and to the standard malloc.

• Parallel calls to the new malloc function and to the standard malloc.

Both situations share some features:

• Requested segment size.

• Number of blocks.

• Number of segments per block.

• Number of calls to the memory allocation function.

Single-thread applications
Figures 4 and 5 contain comparisons of performances in the single-threaded test case, comparing the standard way of allocating memory on the heap with the “new” way using the developed library.

View the full-size image

View the full-size image

It's clear from the graphs that the behavior of the preallocation is much better than using the standard C malloc() . Something important to point out is the degradation of performance of the standard C malloc() . Using the preallocation, the time spent grows linearly with the number of operations, while if using the standard C malloc() it does not, mainly because of memory fragmentation.

Multithreads applications
Figures 6 and 7 contain performance comparisons between the standard way of allocating memory on the heap and the “new” way using the developed library in the multithreaded test case.

View the full-size image

View the full-size image

The mechanism used to avoid race conditions in order to synchronize the operations is the mutual exclusion given by the standard POSIX thread implementation, pthread.h .

Something to notice in the graphs is the quality of the algorithms and data structures used. The synchronized sections in the code have no impact at all on the overall system performance.

Encouraging results
Based on the performance illustrated in the Figures 3 through 7, I am encouraged with the initial results obtained in applications employing this busyhash approach to dynamic memory allocation, so much so that from both management and performance points of view, I think the use of this library should be the routine in many embedded applications.

The developed library can be obtain from the download page of website.

Marco Varlese received his degree in computer science at the University of Salerno (Italy). He worked for three years on UTM (Unified Treat Management) applications for an Italian company (Connect Srl). Nowadays, he is a software engineer at the R&D in the performance products division of the embedded communication group of the Intel Corporation. His main activities are related to the kernel development (linux/unix/windows) and low-level driver acceleration.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.