Tackling memory allocation in multicore and multithreaded applications

Andrei Gorine and Konstantin Knizhnik, McObject LLC

May 29, 2006


Multithreaded applications that allocate and free large numbers of objects (using the standard new and delete operators) often face the same problem: performance degradation on multicore and multiprocessor systems. An application will run fine on a single CPU, but placing it on a system with two or more processors yields not the expected doubling of performance, but a ten-fold slowdown.

When adding CPUs significantly decreases application speed, the culprit is often the software's memory allocator. The standard system memory allocators—such as the Windows C runtime or glibc allocator—use a mutex to prevent concurrent access to allocator structures, in order to preserve these structures' consistency.

Locking a mutex requires only a few atomic assembler instructions, and is not a performance problem when there are no lock conflicts. But on multiprocessor and multicore platforms, different threads concurrently invoke the same memory allocator, and much processing time is wasted on lock contention and the context switching it causes.

Even if threads in an application work autonomously (each thread accessing only objects created by itself) and so require no synchronization, there is still only one memory allocator, and this results in a lot of conflicts between threads. The result is that instead of the expected linear performance increase, adding CPUs leads to exponential performance degradation.

The solution to the problem
One solution to this problem is to build a custom memory manager that avoids locking conflicts. The simplest way of avoiding the conflicts would be for each thread to allocate objects on the stack rather than in dynamic memory. From the performance standpoint that would be almost the ideal solution. However, this simplified approach is not always viable: thread stack size is limited, so allocating large objects, or a large number of smaller objects, on the stack is impractical.

A more practical approach is to provide a separate memory allocator for each thread (a thread-local allocator) so that each allocator manages memory independently of the others. Most modern operating systems support the concept of per-thread storage, or a memory pool that is assigned to an individual thread. Thread-local storage provides a convenient way for different threads referring to the same global variable to actually refer to different memory locations, making the variable local to the thread.

For example, the Windows API provides the TlsAlloc and TlsFree functions to allocate and free a thread-local storage (TLS) index. After the index is allocated, each thread can use it to access its own TLS "slot" for that index, storing a value in the slot and later retrieving it via the TlsSetValue and TlsGetValue functions.

The POSIX threads API, Pthreads, is similar to Win32: pthread_key_create creates a key that can later be associated with thread-specific data via pthread_setspecific. The data can be retrieved using pthread_getspecific. The key must eventually be destroyed with pthread_key_delete.

A thread-local allocator associates a memory block with each thread and services the thread's memory requests from this block, without interfering with other threads' allocation requests.

The custom thread-local allocator creates and maintains a number of linked lists (chains) of same-size blocks. Blocks are, in turn, made out of "pages" allocated by a general-purpose memory manager (the standard C-runtime malloc() in our case).

The pages are evenly divided into blocks of a particular size. For example, the allocator could, by default, create five chains with 32, 48, 64, 128 and 256 byte blocks, each made out of 4K pages. The allocator shown in our example below is optimized for the allocation of small objects, but the page size, the number of chains, and the block sizes can easily be changed to better fit application requirements.

When the allocator needs to allocate memory, it simply rounds the allocation size up to the nearest block size, "unlinks" the block from the appropriate chain and returns the pointer to the block to the application. If the appropriate chain is empty, the allocator calls upon a general-purpose memory manager to allocate a new page, then adds new blocks into the linked-list.

Objects that exceed the size of the largest block are allocated with the general-purpose allocator. Therefore, to avoid falling back to the general-purpose function, the allocator's parameters may have to be tuned to suit the application's allocation pattern.

As long as all objects are allocated and de-allocated locally (i.e. by the same thread), this algorithm does not require any synchronization mechanism at all. As a result, its performance is excellent and it scales across multiple processors exceptionally well. The reality, however, is that objects are sometimes shared across threads: for example, an object may be allocated in one thread, used, and finally de-allocated in another.

Reducing synchronization conflicts
This makes it impossible to avoid synchronization issues completely. In order to reduce the number of synchronization conflicts (and their attendant overhead), the custom thread-local algorithm maintains a "pending free requests list" for each thread: when an object allocated in one thread is de-allocated by another thread, the de-allocating thread links the object into this list (Figure 1, below).

Figure 1

The pending request list must of course be protected by a mutex, to arbitrate conflicts between threads that need to access the same pending request list. Each thread periodically checks the pending requests list and when a certain threshold is reached, it de-allocates its share of objects on the list at once. As the test results below demonstrate, the number of synchronization conflicts is reduced significantly:

a) often the object is freed by the same thread that allocated it, and
b) when the object is de-allocated by a different thread, it does not interfere with all other threads, but only with those that need to use the same pending requests list.

Test results
The complete allocator source code package is available at McObject's Web site for free re-use. It includes both a gcc makefile and a makefile for the Microsoft Visual C++ compiler (makefile.mvc). We have created four tests to measure the performance of this custom allocator versus the performance of the standard C runtime malloc() on the same tasks, using both single-processor and multi-processor/multi-core systems.

The custom allocator's API provides three functions with syntax similar to the standard C runtime allocation API:

void* thread_alloc(size_t size);
void* thread_realloc(void* addr, size_t size);
void thread_free(void* addr);

In order to use the custom allocator instead of the standard C runtime malloc(), realloc() and free() routines, the application should include threadalloc.h. Applications written in C++ can also redefine the default new and delete operators to use the custom allocator by defining the REDEFINE_DEFAULT_NEW_OPERATOR macro in the allocator's header file (Figure 2, below).

Figure 2

The first pair of tests, stdalloc_test1 and threadalloc_test1, compares performance when the allocation pattern is thread-local: all de-allocations are performed by the same thread as the original allocations. The second pair, stdalloc_test2 and threadalloc_test2, compares performance when objects are allocated by one thread (called a producer) and freed by another (a consumer).

We ran the tests on Linux for x86_64. Our test results (Table 1, below) show that the custom allocator is up to three times faster than the standard C-runtime allocator when running on a single-CPU machine.

Table 1

This gain can be explained by the fact that the custom allocator presented here is very "lightweight": allocating/de-allocating an object is as simple as adding and removing a linked-list element, which requires just a few machine instructions.

This, in itself, is an argument for using a custom memory allocator, rather than default allocators, in performance-hungry embedded systems applications. However, in the multi-core environment, the benefits of such an allocator truly shine through, with a performance gain of up to 10 times.

Eschewing standard system memory allocators is not such a radical idea, as these standard allocators are often very expensive, in performance terms, with prohibitive memory consumption. As a result, embedded systems developers often create multiple custom allocators for a single complex application.

For example, the eXtremeDB in-memory embedded database and its SQL extension eXtremeSQL employs numerous allocation algorithms to address specific internal tasks such as infrastructure for data layout, heap management, and SQL parsers and optimizers.

Certainly there are many factors other than memory allocators that limit applications' scalability on multi-core platforms. However, the solution presented here addresses one important cause of latency, and offers a clear and simple route to enhancing software scalability and performance on multi-core systems.

Andrei Gorine is Co-Founder and Principal Architect, and Konstantin Knizhnik is a software engineer at McObject LLC.
