Editor’s Note: The authors expand on a previous article on “What the new OpenMP standard brings to embedded multicore software design,” describing in more detail how OpenMP, the Multicore Communications API (MCAPI), and the Multicore Resource API (MRAPI) can be used to program multicore embedded systems.
Productive programming of modern embedded systems is a challenge. For example, consider a smart phone that uses multiple processor cores with different capabilities. The roles of these cores vary dynamically based on application requirements, and different cores run different OSes.
One OS might handle the user interface plus file and data management, while another core, invisible to the user, might manage low-level activities such as connecting and handling calls. Tracking and managing all resources and operations is too complicated for a single OS, hence the need for more than one. How do we enable communication between cores, share resources, and synchronize accesses? Other complicated programming questions include:
1. Can conventional thread creation and management techniques that were originally developed for general purpose processors be used for embedded platforms? Hint: Resources are scarce in embedded systems – the maximum number of cores available on an embedded platform is presently 64.
2. What are the challenges that embedded platforms pose with regard to memory accesses? Hint: Embedded systems contain multiple memory spaces, each dedicated to a core. These memories maintain distinct address spaces that are not accessible from other threads.
3. Can conventional, general purpose synchronization be used for embedded systems as well? Hint: Embedded systems, especially heterogeneous ones, are asymmetric multiprocessor (AMP)-based architectures; processors are loosely coupled, with each processor having its own OS and memory.
In the rest of this article we will discuss these challenges in detail and investigate how OpenMP, in combination with MCAPI (Multicore Communications API), can address them.
To begin with, let’s look at an MCAPI-associated application programming interface called the Multicore Resource Management API (MRAPI), which we have used at length in our project.
MRAPI (Table 1) provides a set of primitives for both on- and off-chip resources, including homogeneous/heterogeneous cores, accelerators, and memory regions. Domains and Nodes define the overall granularity of the resources. An MCA Domain is a global entity that can consist of one or more MCA Nodes. The major difference between MCA Nodes and POSIX threads is that Nodes offer higher-level semantics than threads, thus hiding the real entities of execution.
MRAPI supports two types of memory: shared memory and remote memory. Shared memory provides the ability to allocate, access, and delete on- and off-chip memory. Unlike POSIX shared memory, which gives the programmer no control over where data is physically placed, MRAPI shared memory allows programmers to specify attributes such as on-chip SRAM or off-chip DDR memory.
Modern embedded systems often consist of heterogeneous cores with local memory address spaces that may not be directly accessible by other nodes. MRAPI remote memory enables data movement between these memory spaces without consuming CPU cycles, instead using DMA, Serial RapidIO (SRIO), or a software cache. MRAPI keeps the data movement operation hidden from end users.
MRAPI synchronization inherits the essential feature set of other thread libraries for shared memory programming, including mutexes, semaphores, and reader/writer locks. But unlike POSIX threads, the MRAPI synchronization primitives provide richer functionality tailored to the characteristics of embedded systems. For example, locks can be shared by all nodes or by only a group of nodes. Using MRAPI metadata primitives, we can gather information about hardware and application execution statistics for debugging and profiling purposes.
As noted in our earlier article, OpenMP and MRAPI share common mapping characteristics. Nodes map naturally to OpenMP threads and tasks. We adopt the MRAPI synchronization primitives to implement the OpenMP synchronization directives, such as barrier and critical.
MRAPI shared memory and remote memory enhance the OpenMP memory model, which provides a relaxed-consistency model and is a well-adopted standard for shared memory parallel programming. Several research efforts are also exploring the use of OpenMP for distributed memory spaces [1, 2]. The programming challenges can be broken down into several pieces.
Thread Creation and Management
Managing the limited resources available in embedded systems is a struggle, not to mention the costly engineering effort involved. Unlike an HPC system, an embedded system cannot afford thread oversubscription. Oversubscription can sometimes improve performance through better load balancing and CPU utilization, but only on HPC systems.
As mentioned earlier, MRAPI nodes and their corresponding primitives can be used to create and manage the OpenMP threads. In a traditional OpenMP implementation, all threads in a team are expected to be identical. However, this may not be the case for embedded systems, where threads run on different cores that could be heterogeneous. MRAPI nodes relax this condition, allowing each node in a team to be distinct (for example, node one may be the CPU while node two may be an accelerator), and each node may have its own attributes with particular data structures. MCA nodes may also create multiple OpenMP threads inside a node to support nested parallelism.
The concept of a thread pool is pretty straightforward; the number of threads in the pool is pre-defined and it is usually much larger than the total number of CPUs on the platform. However, the pool consumes plenty of resources (such as memory and CPU cycles) that are not abundant in embedded systems. Moreover, this may lead to thread oversubscription, which embedded systems cannot afford.
To handle this situation, we can use an elastic thread pool that uses the MRAPI metadata primitives to query the number of nodes available on the platform and creates only as many threads as required, so thread oversubscription cannot occur. If the size of the worker team requested by the programmer is less than the number of threads available in the pool, the idle threads go back to sleep, which further saves system energy. This approach guarantees that no system resources are wasted, which is vital when programming embedded systems.
Optimizing Thread Waiting and Awakening
Once the thread pool has been created, there are idle threads in the pool waiting to be scheduled. The utilization of the thread pool must be managed efficiently, since idle threads affect runtime performance significantly. Thread waiting/awakening is handled by condition variables and signaling in traditional high-performance computing implementations.
Condition variables, as provided by POSIX threads, require that a thread wait explicitly on the condition variable to receive the signal. While waiting, the CPU cycles are released and can be scheduled for other jobs. The main disadvantage of this approach is the large context-switch overhead between the sleep and wake-up states, which degrades performance. Features such as conditional waiting/signaling are not currently specified in the MRAPI API.
Moreover, a common characteristic of embedded systems is that they are not time-sharing; each task is tied to run on a core/thread in order to meet real-time requirements. Thus when a thread goes to sleep, other tasks cannot take advantage of that processor.
The answer to this conundrum is spin waiting. Figure 1 shows the state transition diagram for an MRAPI node (thread). As shown in the figure, there are a total of five states. When a new node is created as part of the initialization phase, the node is set to the spin-waiting state, i.e., it is waiting for new tasks to be assigned.
Once it receives a new task, the state changes to ready, which means it can be dispatched by the system scheduler, after which the state changes to executing. When the execution is complete, the node goes back to spin waiting, in which the thread polls for a new task, and this cycle is repeated. After all the tasks have finished their execution, the nodes reach the terminate state, i.e., the fork-join stage.
A point to note is that the spin flag will not become a bottleneck as the system scales, since each thread waits only on its own task rather than competing for global tasks. The advantage of the spin-waiting mechanism in embedded systems is that the thread does not need to switch between sleep and wake-up states each time, thus avoiding unnecessary context switches. Also, in order to avoid false sharing, the spin variable is aligned to the cache line size.
So we see that different strategies need to be employed to create, optimize, and handle threads in embedded systems in order to efficiently utilize the limited resources.
Memory Management
The cost and performance of an embedded system depend heavily on memory. The memory available in embedded systems is only on the order of a few MBs, unlike a traditional PC. The memory hierarchy consumes a large amount of chip area as well as energy, vital resources in embedded systems. Embedded platforms are very sensitive to memory usage.
Inappropriate allocation or deallocation of memory can lead to memory leaks that can degrade system performance significantly. Software programmers need to take extra caution to prevent memory leaks and must ensure that their programs handle memory usage meticulously.
Since we are taking a high-level approach to programming strategies for embedded systems, we will consider how OpenMP can address the complex memory hierarchy of embedded systems. We also need to define the interactions between OpenMP and MRAPI.
As we already know, OpenMP provides a relaxed-consistency, shared memory model, meaning all threads access the same global shared memory but can have their own temporary view of private data. Data coherency and consistency are required only at certain points in the execution flow.
It is the programmer’s responsibility to protect global shared data. The relaxed-consistency model is trivial to achieve on general purpose CPUs, since the memory model is typically cache-coherent. But embedded systems often lack coherent cache systems; cache coherency is not automatically maintained by the hardware. Embedded systems consist of on-chip and off-chip shared memory along with local memory with separate address spaces.
As long as the memory is shared, OpenMP or POSIX will work, but there is a problem when physical memory is shared by several threads operating under different operating systems; POSIX will not help. Such memories usually maintain distinct address spaces that are not accessible from other threads.
Under such circumstances, we need smarter APIs that can communicate between more than one core operating on more than one OS (MCAPI is useful here!) and manage the available resources (MRAPI is useful here!). MRAPI shared memory inherits the semantics of POSIX shared memory, but provides the ability to manage access to physically shared memory between heterogeneous threads running on different cores and OSes.
Part 2 of this series discusses in detail how to exploit MRAPI capabilities. We also discuss synchronization primitives and some evaluation results.
1. Ayon Basumallik, Seung-Jai Min, and Rudolf Eigenmann, “Programming distributed memory systems using OpenMP”, Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2007), pp. 1-8.
2. Barbara Chapman, Lei Huang, Eric Biscondi, Eric Stotzer, Ashish Shrivastava, and Alan Gatherer, “Implementing OpenMP on a high performance embedded multicore MPSoC”, Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009), pp. 1-8.
Sunita Chandrasekaran is a Postdoctoral Fellow in the High Performance Computing and Tools (HPCTools) research group at the University of Houston, Texas, USA. Her current work spans HPC, Exascale solutions, accelerators, and heterogeneous and multicore embedded technology solutions. Her research interests include parallel programming, reconfigurable computing, accelerators, and runtime support.
Her research contributions include exploring newer approaches to building effective toolchains addressing programmer productivity and performance while targeting current HPC and embedded systems. She is a member of the Multicore Association (MCA), OpenMP, and OpenACC. Sunita earned a Ph.D. in Computer Science Engineering from Nanyang Technological University (NTU), Singapore, in the area of developing tools and algorithms to ease programming on FPGAs. She earned a B.E. in Electrical & Electronics from Anna University, India.
Barbara Chapman is a professor of Computer Science at the University of Houston, where she teaches and performs research on a range of HPC-related themes. Her research group has developed OpenUH, an open source reference compiler for OpenMP with Fortran, C, and C++ that also supports Co-Array Fortran (CAF) and CUDA. In 2001, she founded cOMPunity, a not-for-profit organization that enables research participation in the development and maintenance of the OpenMP industry standard for shared memory parallel programming. She is a member of the Multicore Association (MCA), where she collaborates with industry to define low-level programming interfaces for heterogeneous computers.
She is also a member of OpenACC, a programming standard for parallel computing developed by Cray, CAPS, Nvidia, and PGI. Her group also works with colleagues in the U.S. DoD and the U.S. DoE to help define and promote the OpenSHMEM programming interface. Barbara has conducted research on parallel programming languages and compiler technology for more than 15 years, has written two books, published numerous papers, and edited volumes on related topics. She earned a B.Sc. (First Class Honors) in mathematics from Canterbury University, New Zealand, and a Ph.D. in computer science from Queen’s University, Belfast.