Tackling memory allocation in multicore and multithreaded applications

Multithreaded applications that allocate and free large numbers ofobjects (using standard new and delete operators) often face the sameproblem: performance degradation on multicore and multiprocessorsystems. An application will run fine with single CPU, but placing iton a system with two or more processors yields not the expecteddoubling of performance, but a ten-fold slowdown.

When adding CPUs significantly decreases application speed, theculprit is often the software's memory allocator. The standard systemmemory allocators—such as the Windows C runtime or glibc allocator—use a mutex to prevent concurrent access to allocatorstructures, in order to preserve these structures' consistency.

Locking a mutex invokes only a few atomic assembler instructions,and is not a performance problem when there are no lock conflicts. Buton multiprocessor and multicore platforms, different threadsconcurrently invoke the same memory allocator, with the result thatmuch processing time is spent performing context switching.

Even if threads in an application work autonomously (each threadaccessing only objects created by itself) and so require nosynchronization, there is still only one memory allocator, and thisresults in a lot of conflicts between threads. The result is thatinstead of the expected linear performance increase, adding CPUs leadsto exponential performance degradation.

The solution to the problem
One solution to this problem is to build a custom memory manager thatavoids locking conflicts. The simplest way of avoiding the conflictswould be for each thread to allocate objects on the stack rather thanin dynamic memory. From the performance standpoint that would be almostthe ideal solution. However, this simplified approach is not alwaysviable: thread stack size is limited and therefore allocating largeobjects or a large number of smaller objects on the stack isprohibited.

A more practical approach is to provide a separate memory allocatorfor each thread ” a thread local allocator ” so that each allocatormanages memory independently of the others. Most modern operatingsystems support the concept of per-thread storage, or a memory poolthat is assigned to an individual thread. Thread-local storage providesa convenient way for different threads referring to the same globalvariable to actually refer to different memory locations, making thevariable local to the thread.

For example, the Windows API provides TlsAlloc and TlsFree API toallocate and free a thread local storage (TLS) index. After the indexis allocated, each thread can use it to access its own TLS “slot” forthat index and to store a value in the slot and, later, to retrieve thevalue via TlsSetValue/TlsGetValue functions.

The POSIX implementation of threads, Pthreads, is similar to Win32in that  pthread_key_create creates a key, that can later be associated with thread-specific datavia pthread_setspecific .The data can be retrieved using pthread_getspecific .The key must be destroyed with pthread_key_delete .

A thread-local allocator associates a memory block with each threadand services the thread's memory requests from this block, withoutinterfering with other threads' allocation requests.

The custom thread-local allocator creates and maintains a number oflinked-lists (chains) of same-size blocks. Blocks are, in turn, madeout of “pages” allocated by a general-purpose memory manager {standardC-runtime malloc()in our case}.

The pages are evenly divided into blocks of a particular size. Forexample, the allocator could, by default, create five chains with 32,48, 64, 128 and 256 byte blocks, each made out of 4K pages. Theallocator shown in our example below is optimized for the allocation ofsmall objects, but the page size, the number of chains, and the blocksizes can easily be changed to better fit application requirements.

When the allocator needs to allocate memory, it simply rounds theallocation size up to the nearest block size, “unlinks” the block fromthe appropriate chain and returns the pointer to the block to theapplication. If the appropriate chain is empty, the allocator callsupon a general-purpose memory manager to allocate a new page, then addsnew blocks into the linked-list.

Objects that exceed the size of the largest block are allocated witha general-purpose allocator. Therefore, in order to avoid using thegeneral purpose function, the allocator parameters may have to bechanged to suit the application's allocation pattern.

As long as all objects are allocated and de-allocated locally (i.e.by the same thread), this algorithm does not require anysynchronization mechanism at all. As a result, its performance isexcellent and it scales across multiple processors exceptionally well.The reality, however, is that objects are sometimes shared acrossthreads– for example, an object may be allocated in one thread, used,and finally de-allocated in another.

Reducing synchronization conflicts
This makes it impossible to avoid synchronization issues completely. Inorder to reduce the number of synchronization conflicts (and theirattendant overhead) the custom thread-local algorithm maintains a”pending free requests list” for each thread: when an object allocatedin one thread is being de-allocated by another thread, thede-allocating links the object into this list (Figure1, below ).

Figure1

The pending request list must of course be protected by a mutex, toarbitrate conflicts between threads that need to access the samepending request list. Each thread periodically checks the pendingrequests list and when a certain threshold is reached, it de-allocatesits share of objects on the list at once. As the test results belowdemonstrate, the number of synchronization conflicts is reducedsignificantly:

a) often the object is freedby the same thread that had allocated it, and
b) when the object isde-allocated by a different thread, it does not interfere with allother threads, but only with those that need to use the same pendingrequests list

Test results
The complete allocator source code package is available at McObject's Web site for free re-use. It includes both a gcc makefile ,and the makefile to build with the Microsoft Visual C++ compiler (makefile.mvc ).We have created four tests to measure the performance of this customallocator versus the performance of the standard C runtime malloc() onthe same tasks, using both single processor andmulti-processor/multi-core systems.

The custom allocator's API provides three functions with syntaxsimilar to the standard C runtime allocation API:

void* thread_alloc(size_t size);
void* thread_realloc(void* addr, size_t size);
void thread_free(void* addr);

In order to use the custom allocator instead of the standard Cruntime malloc() ,realloc() and free() routings, the application should include the threadalloc.h .Applications written in C++ could also redefine the default new anddelete operators to use the custom allocator by defining theREDEFINE_DEFAULT_NEW_OPERATOR macros in the allocator's header file (Figure 2 below .)

Figure2

The first pair of tests, called stdalloc_test1 and threadalloc_test1 ,compare performance when the allocation pattern is thread-local: allde-allocations are performed by the same thread as the originalallocations. The second pair of tests, and stdalloc_test2threadalloc_test2 compare performance when objects are allocated by one thread(called a producer) and freed by another (a consumer).

We ran the test on Linux 2.6.15.5 for x86_64. Our test results (Table 1. below ) show that the customallocator is up to three times faster than the standard C-runtimeallocator when running on a single CPU machine.

Table1

This gain can be explained by the fact that the custom allocatorpresented here is very “lightweight”: allocating/de-allocating anobject is as simple as adding and removing a linked-list element, whichrequires just a few machine instructions.

This, in itself, is anargument for using a custom memory allocator, rather than defaultallocators, in performance-hungry embedded systems applications.However, in the multi-core environment, the benefits of such anallocator truly shine through, with a performance gain of up to 10times.

Conclusion
Eschewing standard system memory allocators is not such a radical idea,as these standard allocators are often very expensive, in performanceterms, with prohibitive memory consumption. As a result, embeddedsystems developers often create multiple custom allocators for a singlecomplex application.

For example, the eXtremeDB in-memory embedded database and its SQLextension eXtremeSQL employs numerous allocation algorithms to addressspecific internal tasks such as infrastructure for data layout, heapmanagement, and SQL parsers and optimizers.

Certainly there are many factors other than memory allocators thatlimit applications' scalability on multi-core platforms. However thesolution presented here addresses one important cause of latency, andpresents a clear and simply route to enhance software scalability andperformance on multi-core systems.

Andrei Gorine is Co-Founder and Principal Architect, and Konstantin Knizhnik is a software engineeer at McObject LLC.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.