A portable OpenMP runtime library based on MCAPI/MRAPI: Part 1
Editor’s Note: The authors expand on a previous article on “What the new OpenMP standard brings to embedded multicore software design,” describing in more detail how OpenMP, the Multicore Communications API (MCAPI), and the Multicore Resource API (MRAPI) can be used to program multicore embedded systems.
Productive programming of modern embedded systems is a challenge. For example, consider a smartphone that uses multiple processor cores with differing capabilities. The functions of these cores vary dynamically with application requirements, and different cores run different OSes.
One OS might handle the user interface plus file and data management, while another, running on a core invisible to the user, might manage low-level activities such as connecting and handling calls. It is complicated for a single OS to track and manage all resources and operations, hence the need for more than one OS. This raises the question of how to enable communication between cores, share resources, and synchronize accesses. Other difficult programming questions include:
1. Can conventional thread creation and management techniques that were originally developed for general-purpose processors be used for embedded platforms? Hint: Resources are scarce in embedded systems; the maximum number of cores available on an embedded platform is presently 64.
2. What are the challenges that embedded platforms pose with regard to memory accesses? Hint: Embedded systems contain multiple memory spaces, each dedicated to a core. Each of these memories maintains a distinct address space that is not accessible from other threads.
3. Can conventional, general purpose synchronization be used for embedded systems as well? Hint: Embedded systems, especially heterogeneous ones, are asymmetric multiprocessor (AMP)-based architectures; processors are loosely coupled, with each processor having its own OS and memory.
In the rest of this article we will discuss these challenges in detail and investigate how OpenMP in combination with MCAPI (Multicore Communications Application Programming Interface) can address them.
To begin with, let’s look at an MCAPI-associated application programming interface called the Multicore Resource Management API (MRAPI), which we have used at length in our project.
MRAPI (Table 1) provides a set of primitives for both on- and off-chip resources, which include homogeneous/heterogeneous cores, accelerators, and memory regions. Domains and Nodes define the overall granularity of the resources. An MCA Domain is a global entity that can consist of one or more MCA Nodes. The major difference between MCA Nodes and POSIX threads is that Nodes offer higher-level semantics than threads, thus hiding the real entities of execution.
MRAPI supports two types of memory: shared memory and remote memory. Shared memory provides the ability to allocate, access, and delete on- and off-chip memory. Unlike POSIX threads, where threads have no control over processor affinity, MRAPI shared memory lets programmers specify attributes of the shared memory, such as whether it resides in on-chip SRAM or off-chip DDR memory.
Modern embedded systems often consist of heterogeneous cores whose local memory address spaces may not be directly accessible by other nodes. MRAPI remote memory enables data movement between these memory spaces without consuming CPU cycles, using DMA, Serial RapidIO (SRIO), or a software cache instead. MRAPI keeps the data movement operation hidden from end users.
MRAPI synchronization inherits the essential feature set of other thread libraries for shared memory programming, including mutexes, semaphores, and reader/writer locks. But unlike POSIX threads, the MRAPI synchronization primitives provide richer functionality suited to the characteristics of embedded systems. For example, locks can be shared by all nodes or by only a group of nodes. Using MRAPI metadata primitives, we can gather information about the hardware and application execution statistics that can be used for debugging and profiling.
As noted in our earlier article, OpenMP and MRAPI share common mapping characteristics. The nodes map naturally to OpenMP threads and tasks. We adopt the MRAPI synchronization primitives to implement the OpenMP synchronization directives, such as barrier and critical.
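To make the mapping from lock primitives to OpenMP directives more concrete, here is a minimal sketch of how a critical section and a barrier could be built on top of a mutex. It is not the runtime's actual code: POSIX threads stand in for MRAPI nodes and mutexes so that the example compiles without an MRAPI installation, and every name in it (critical_enter, team_barrier_t, run_team, MAX_TEAM) is hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_TEAM 16  /* hypothetical cap for this sketch */

/* Assumption: a pthread mutex stands in for an MRAPI mutex shared by the team. */
static pthread_mutex_t critical_lock = PTHREAD_MUTEX_INITIALIZER;

/* Rough equivalent of entering and leaving "#pragma omp critical". */
void critical_enter(void) { pthread_mutex_lock(&critical_lock); }
void critical_exit(void)  { pthread_mutex_unlock(&critical_lock); }

/* Counting barrier, the rough equivalent of "#pragma omp barrier". */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  released;
    int             team_size;
    int             arrived;
    unsigned        generation;  /* distinguishes successive uses of the barrier */
} team_barrier_t;

void team_barrier_init(team_barrier_t *b, int team_size) {
    assert(team_size >= 1);
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->released, NULL);
    b->team_size = team_size;
    b->arrived = 0;
    b->generation = 0;
}

void team_barrier_wait(team_barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    unsigned my_generation = b->generation;
    if (++b->arrived == b->team_size) {
        /* Last arrival: reset the count and release everyone. */
        b->arrived = 0;
        b->generation++;
        pthread_cond_broadcast(&b->released);
    } else {
        while (my_generation == b->generation)
            pthread_cond_wait(&b->released, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}

typedef struct {
    team_barrier_t *barrier;
    long           *sum;
    int             iters;
    long            seen;
} worker_arg_t;

static void *worker(void *p) {
    worker_arg_t *w = p;
    for (int i = 0; i < w->iters; i++) {
        critical_enter();
        (*w->sum)++;          /* protected update */
        critical_exit();
    }
    team_barrier_wait(w->barrier);
    w->seen = *w->sum;        /* no thread writes sum after the barrier */
    return NULL;
}

/* Returns 1 if every thread observed the full total after the barrier. */
int run_team(int team_size, int iters) {
    assert(team_size >= 1 && team_size <= MAX_TEAM);
    pthread_t threads[MAX_TEAM];
    worker_arg_t args[MAX_TEAM];
    team_barrier_t barrier;
    long sum = 0;
    int ok = 1;
    team_barrier_init(&barrier, team_size);
    for (int t = 0; t < team_size; t++) {
        args[t] = (worker_arg_t){ &barrier, &sum, iters, 0 };
        pthread_create(&threads[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < team_size; t++) {
        pthread_join(threads[t], NULL);
        ok = ok && (args[t].seen == (long)team_size * iters);
    }
    pthread_mutex_destroy(&barrier.lock);
    pthread_cond_destroy(&barrier.released);
    return ok;
}
```

Calling run_team(4, 1000) starts four threads that each increment the shared sum 1,000 times inside the critical section and then meet at the barrier. In an MRAPI-based runtime the mutex calls would be replaced by the corresponding MRAPI primitives; the condition variable used here is purely a pthreads convenience.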
MRAPI shared memory and remote memory enhance the OpenMP memory model, which typically provides relaxed consistency. OpenMP is a well-adopted standard for shared memory parallel programming, although several research efforts are exploring its use for distributed memory spaces. The programming challenges can be broken down into several pieces.
Thread Creation and Management
It is a struggle to manage the limited resources available in embedded systems, not to mention the costly engineering effort involved. Unlike an HPC system, an embedded system cannot afford thread oversubscription. Oversubscription can sometimes improve performance through better load balancing and CPU utilization, but this is only the case for HPC systems.
As mentioned earlier, MRAPI nodes and their corresponding primitives can be used to create and manage the OpenMP threads. In a traditional OpenMP implementation, all threads in a team are expected to be identical. However, this may not be the case for embedded systems, where threads run on different cores that could be heterogeneous. MRAPI nodes relax this condition, allowing each node in a team to be distinct (for example, node one may be the CPU while node two may be an accelerator) and to have its own attributes with particular data structures. MCA nodes may also create multiple OpenMP threads inside the nodes to support nested parallelism.
The concept of a thread pool is pretty straightforward; the number of threads in the pool is pre-defined and it is usually much larger than the total number of CPUs on the platform. However, the pool consumes plenty of resources (such as memory and CPU cycles) that are not abundant in embedded systems. Moreover, this may lead to thread oversubscription, which embedded systems cannot afford.
To handle this situation, we can use an elastic thread pool that uses the MRAPI metadata primitives to query the number of nodes available on the platform and creates only as many threads as required. Thread oversubscription is not allowed to occur. If the size of the worker team requested by the programmer is smaller than the number of threads available in the pool, the idle threads go back to sleep, which further saves energy. This approach avoids wasting system resources, which is vital when programming embedded systems.
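The elastic pool idea can be sketched in a few dozen lines of C. This is an illustration under stated assumptions rather than the runtime's implementation: POSIX threads stand in for MRAPI nodes, sysconf() stands in for the MRAPI metadata query, the master thread only dispatches work instead of joining the team, and all names (elastic_pool_t, pool_init, pool_run, record_hit, MAX_POOL) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <unistd.h>

#define MAX_POOL 64  /* hypothetical upper bound for this sketch */

typedef void (*region_fn)(int worker_id, void *arg);

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  wake;   /* workers sleep here between parallel regions */
    pthread_cond_t  done;   /* master sleeps here until the team finishes */
    pthread_t       threads[MAX_POOL];
    int             pool_size, team_size, remaining, next_id, shutdown;
    unsigned        epoch;  /* incremented once per parallel region */
    region_fn       work;
    void           *arg;
} elastic_pool_t;

/* Assumption: sysconf (POSIX) stands in for an MRAPI metadata query. */
int query_available_nodes(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n < 1 ? 1 : (n > MAX_POOL ? MAX_POOL : (int)n);
}

static void *worker_main(void *p) {
    elastic_pool_t *pool = p;
    unsigned seen = 0;
    pthread_mutex_lock(&pool->lock);
    int id = pool->next_id++;
    for (;;) {
        while (!pool->shutdown && pool->epoch == seen)
            pthread_cond_wait(&pool->wake, &pool->lock);
        if (pool->shutdown) break;
        seen = pool->epoch;
        if (id >= pool->team_size) continue;  /* not in this team: sleep again */
        region_fn work = pool->work;
        void *arg = pool->arg;
        pthread_mutex_unlock(&pool->lock);
        work(id, arg);
        pthread_mutex_lock(&pool->lock);
        if (--pool->remaining == 0) pthread_cond_signal(&pool->done);
    }
    pthread_mutex_unlock(&pool->lock);
    return NULL;
}

/* Creates one thread per available node, never more. Returns the pool size. */
int pool_init(elastic_pool_t *pool, int available_nodes) {
    assert(pool != NULL);
    if (available_nodes < 1) available_nodes = 1;
    if (available_nodes > MAX_POOL) available_nodes = MAX_POOL;
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->wake, NULL);
    pthread_cond_init(&pool->done, NULL);
    pool->pool_size = available_nodes;
    pool->team_size = pool->remaining = pool->next_id = pool->shutdown = 0;
    pool->epoch = 0;
    pool->work = NULL;
    pool->arg = NULL;
    for (int i = 0; i < pool->pool_size; i++)
        pthread_create(&pool->threads[i], NULL, worker_main, pool);
    return pool->pool_size;
}

/* Runs one region on min(requested, pool_size) workers. Returns the team size. */
int pool_run(elastic_pool_t *pool, int requested, region_fn work, void *arg) {
    int team = requested < pool->pool_size ? requested : pool->pool_size;
    if (team < 1) team = 1;
    pthread_mutex_lock(&pool->lock);
    pool->work = work;
    pool->arg = arg;
    pool->team_size = pool->remaining = team;
    pool->epoch++;
    pthread_cond_broadcast(&pool->wake);
    while (pool->remaining > 0)
        pthread_cond_wait(&pool->done, &pool->lock);
    pthread_mutex_unlock(&pool->lock);
    return team;
}

void pool_destroy(elastic_pool_t *pool) {
    pthread_mutex_lock(&pool->lock);
    pool->shutdown = 1;
    pthread_cond_broadcast(&pool->wake);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->pool_size; i++)
        pthread_join(pool->threads[i], NULL);
}

/* Example region body: each worker bumps its own slot in an int array. */
void record_hit(int worker_id, void *arg) { ((int *)arg)[worker_id]++; }
```

A caller would typically pass query_available_nodes() to pool_init and then call pool_run once per parallel region. Requesting a team larger than the pool is clamped to the pool size, so oversubscription cannot occur, and workers outside the requested team return to waiting on the condition variable without doing any work.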