There are two venues for many-core machines to gain higher performance: increasing the number of processors or the number of vector units in one SIMD processor. A truly scalable memory management algorithm should take advantage for both venues.
However, most of past research, on scalable memory allocators such as atomic operation based lock-free algorithms, can be scalable with number of processors growing, but have poor scalability with the number of vector units in one SIMD processor growing.
As a result, they are not truly scalable in many-core architecture. In this paper, we introduce our proposed solution used in the design of XMalloc, an truly scalable, efficient lockfree memory allocator. We present:
(1) Our solution for transforming traditional atomic CAS(Compare-And-Swap) based lock-free algorithm to be truly scalable for many-core architecture, and.
(2) A hierarchical cache-like buffer solution to reduce the average latency for accessing non-scalable or slow resource such as the memory system in many-core machine.
We used XMalloc as a memory allocator for NVIDIA Tesla C1600 with 240 processing units. Our experimental results show that XMalloc achieves very good scalability in terms of the number of processors and the number of vector units in each SIMD processor. Our truly scalability lock-free solution achieve 211 times speedup comparing to the common lock-free solution.
(To read this external content in full download the complete paper from the author archives at the University of Illinois at Urbana-Champaign .)