What the new OpenMP standard brings to embedded multicore software design
Editor's note: The authors describe how the OpenMP shared memory multicore programming model is being adapted to the needs of embedded systems designs.
Multicore embedded systems are widely used in telecommunication systems, robotics, automotive vision systems, medical applications, life critical systems, and more. Today, in order to provide high throughput, low latency and energy-efficient solutions, such systems usually consist of heterogeneous cores operating on different ISAs, operating systems, and dedicated memory systems.
Although multicore systems have a lot of potential, the limited availability of multicore software development tools and standards has slowed their full adoption. Programmers may be required to write low-level code, schedule work units, and manage synchronization between cores if they are to reap significant benefits from these systems.
As system complexity increases, it is unrealistic to expect programmers to handle all the low-level details of managing data movement, DMA, cache coherency, and utilizing synchronization primitives explicitly in order to exploit application concurrency. Handling these manually is not only time consuming, but also an error-prone and laborious task. Some of the existing approaches that address this issue include defining language extensions or using parallel programming libraries, but these are low-level programming strategies that demand complex programming even from an expert programmer. They also require a thorough understanding of the low-level details of the hardware platform under consideration.
Currently, hardware vendors supply vendor-specific development tools that are tied to the details of the device they were originally designed for; this may preclude use of the software on any future device even from the same family. This is a major concern. Software portability is needed.
In this article, we discuss a possible ‘silver bullet’ solution to programming multicore systems. We cannot solve the overall problem, but we can solve several pieces of it.
Summary of OpenMP
We first look at OpenMP, one of the most familiar parallel programming models, and its functionality. We then discuss what more is needed before programmers can use this powerful model to solve software problems in the embedded world.
OpenMP is a shared memory programming model that can easily boost the performance of a code. OpenMP is a collection of directives, library routines, and environment variables that may be used in conjunction with C, C++, or Fortran to express a shared memory parallel computation.
Over three major revisions of the standard, OpenMP has evolved to support a powerful set of capabilities for intra-node parallel programming and has achieved widespread use. The standard is governed by the OpenMP Architecture Review Board (ARB), in which a group of major computer hardware and software vendors, such as AMD, IBM, Intel, Cray, HP, Fujitsu, NVIDIA, and Texas Instruments, jointly define it. A number of OpenMP implementations are available from vendors and open source communities, including Intel, IBM, GNU (GCC), and OpenUH.
OpenMP uses a fork-join model for parallel execution (Figure 1). The program starts with a single thread, the master thread. When a parallel region is entered, the master creates a team of parallel worker threads; the statements within the parallel block are executed in parallel by the threads of the team; at the end of the parallel region, all threads are synchronized (joined). After this stage, serial execution resumes.
Major features of OpenMP include parallel loops and sections, and the identification of tasks that may be dynamically scheduled for execution. Data in parallel regions may be shared by all the threads or remain private to each thread. As a result, it can help the application developer reduce a code’s memory footprint.
Synchronization constructs are used to coordinate data accesses as well as specify certain task schedules. OpenMP compilers translate OpenMP directives into multithreaded code containing calls to functions in a custom runtime library. The runtime library manages the execution of the multithreaded code on the multicore system, and includes functions for thread creation and management, loop and task scheduling, and synchronizations.
OpenMP helps programmers by reducing overall coding time. All the user needs to do is annotate the program code with OpenMP directives. The programmer does not need to rewrite the code from scratch, although certain portions of the code may need to be rewritten. OpenMP is good at retaining most of the code structure while still expressing parallelism. The implementation must then work out the detailed mapping of the computation. Most of the details are left to the compiler and the OpenMP runtime library, reducing the burden on the programmer. Adding OpenMP directives does not break the serial code; however, an incorrect combination of directives may lead to parallel bugs.
The main advantage of OpenMP is its maturity. It has been evolving since 1997. The most recent version of OpenMP is 3.1, ratified in 2011. Currently the ARB is working toward extending OpenMP for special purpose devices (accelerators), including DSPs and GPUs; the new specification will be released soon. Major enhancements in the new specification include support for accelerators and coprocessors, the addition of newer features to OpenMP tasks, and the addition of high-level affinity support.
A simple code example
Here is a simple code snippet illustrating how OpenMP could be used on an embarrassingly parallel Monte Carlo program code:
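#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int i, points = 0, num_steps = 1000000;   /* sample count: assumed value */
    double a, b, pi;

    #pragma omp parallel
    {
        unsigned short x[3];   /* erand48() seed; declared here so it is private */
        x[0] = 1;
        x[1] = 1;
        x[2] = omp_get_thread_num();   /* distinct random stream per thread */
        printf("What thread number am I %d\n", x[2]);
        #pragma omp for private(a,b) reduction(+:points)
        for (i = 0; i < num_steps; i++) {
            a = erand48(x);
            b = erand48(x);
            if (a*a + b*b <= 1.0) points++;   /* point lies inside the quarter circle */
        }
    }
    pi = 4.0 * points / num_steps;
    printf("Estimate of pi: %7.5f, Points %d, Num_steps %d\n", pi, points, num_steps);
    return 0;
}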
The listing above shows the parallel version of the code. A few OpenMP directives have been inserted, with only minor alterations to the original serial code. The code can still be compiled serially: the compiler simply ignores the pragmas, although x can then no longer be set from omp_get_thread_num() and needs to be initialized to 0 instead. The presence of the parallel directive ensures that the master thread forks a number of child threads. Each thread is allocated part of the work, and each thread’s contribution is added to the variable points. The reduction clause adds up the individual thread contributions and avoids the race condition that would arise if all threads updated points directly.
In the multicore world, currently most (but not all) of the hardware and software implementations are based on proprietary solutions. This is acceptable if we are looking at only a single processor. But unfortunately that is not the case. Multicore systems exist in a variety of flavors. A mixture of different types of cores is available on the same platform. These complicated systems demand better software solutions. If the multicore embedded industry is to quickly adopt these multicore devices, one of the key factors to consider is moving from proprietary solutions to open standards.
Multicore Association (MCA)
The Multicore Association (MCA) was formed in 2005 to provide open standards for multicore platforms. Its main objective is to achieve software portability across a wide range of multicore systems and to foster a broader ecosystem. The association has put together a cohesive set of APIs to standardize communication (MCAPI), resource sharing (MRAPI), and virtualization spanning cores on different chips. The association has also put together another working group to define APIs for the tasking model, called MTAPI. This is currently under development. We have been actively participating in the MTAPI working group, which plans to release the specification soon.
MCA APIs enable system developers to write portable program code that will scale across different architectures and operating systems. APIs such as MCAPI enable efficient movement and sharing of data between cores so that the cores can run in parallel. If data is shared (i.e., by reference), the MCAPI implementation can provide data locking and synchronization. The MRAPI shared memory primitive inherits the semantics of shared memory in POSIX threads, but provides the ability to manage access to physically coherent shared memory between heterogeneous threads running on different cores and operating systems.
A point to note here is that the MCA APIs are still low-level, library-based interfaces (compared to OpenMP), which makes programming tedious, although commercial tools are available that simplify the process. Moreover, MCA APIs cannot be used to exploit the fine-grained data parallelism available on embedded platforms, whereas a high-level model like OpenMP can be a suitable solution. However, there is a problem. OpenMP was originally meant for HPC and general-purpose programming, and standard implementations are quite heavyweight, so it cannot be used directly for embedded systems. Moreover, an implementation is typically tightly coupled with certain operating systems (e.g., SMP GNU/Linux), libraries, and platform-specific thread APIs (e.g., POSIX or Solaris threads). These dependencies severely obstruct deployment of OpenMP in embedded systems.
Our goal is to enable OpenMP to serve as a vehicle for productive programming of heterogeneous embedded systems by making use of MCA APIs without compromising performance. One of the potential directions is to exploit the capabilities of the MCA APIs to support an implementation of the de facto shared memory programming standard, OpenMP. This requires careful selection of appropriate characteristics of the MCA APIs, determining the translation strategy, and delivering the corresponding compiler and runtime implementations.
Typically an OpenMP compiler translates an OpenMP directive into multithreaded code containing function calls to a customized runtime library. This runtime library manages the parallel execution on the multicore systems and includes functions for thread creation, management, work scheduling, and synchronization. To meet these responsibilities, the runtime library usually relies on system components such as operating systems, hardware, and thread libraries. Some embedded systems lack some of these features.
Hence we plan to use MCA APIs (MRAPI) to capture the essential capabilities required to manage resources in embedded systems. These resources include heterogeneous/homogeneous chips, hardware accelerators, and memory regions. MRAPI can be used to store all of the runtime data structures and program data with shared data scope so that all nodes working on different devices can share the data.
Figure 2 shows the basic idea of MRAPI (as described on the MCA website). Using MCAPI with zero-copy for shared data could simplify the process.
Key design choices
Thread waiting and awakening need to be handled efficiently, since thread idleness can strongly affect the performance of an implementation meant for embedded systems, especially when few threads are available. Traditionally, condition variables and signal handling techniques are used in HPC implementations. These may not be suitable for embedded systems because they can incur expensive overhead, such as context switches that take hundreds of cycles. Moreover, condition variables and signal handling are not implemented in the MRAPI library. Hence we plan to consider a spin_lock mechanism for embedded systems. The advantages of a spin lock are that the implementation is simple and lightweight, so a thread does not need to switch between sleep and wake-up states every time it waits, and that it uses only a small amount of memory, which can fit into a small but fast coherent shared cache (e.g., L2), thus minimizing cache read misses.
OpenMP relies heavily on barrier synchronization to coordinate the work among threads. Most OpenMP constructs involve an implicit barrier, which is a primary source of overhead. For example, implicit barriers are required at the end of parallel regions; they are also used implicitly at the end of work-sharing constructs. Thus a good barrier implementation is important to achieve good performance and scalability. Barrier algorithms include the centralized barrier, the centralized blocking barrier, and the tournament barrier. We plan to explore an optimal barrier strategy that is not too heavyweight, especially in terms of memory consumption, for an embedded platform. We will adopt the spin_lock mechanism in place of condition variables and mutexes to implement the strategy.
For compilation, we plan to use our in-house OpenUH compiler, which can translate a given code into an intermediate file using a source-to-source approach. This intermediate file will be bare C code with runtime library function calls. The OpenMP transformation, which lowers OpenMP pragmas into the corresponding bare code with runtime library calls, is performed mainly in two steps: OMP_Prelower and LOWER_MP. The intermediate file will serve as the input to the backend compiler for the multicore platform being considered. During the linking phase, the linker will link all the object code together with the runtime libraries libOpenMP and libMCA, which were previously compiled by the native compiler.
Figure 3 shows how OpenMP can act together with the MCA APIs. Generally the runtime library layer relies on lower-level components such as operating systems, thread libraries, and hardware threads. But we know that embedded systems may lack some of these features. These systems are complex, with special purpose accelerators that do not rely on any operating system but are tied to processors that run multiple operating systems. We show in the figure that the runtime library does not rely directly on the operating system of each core but relies instead on the uniform interfaces defined by the MCA.
We plan to evaluate our implementation with suitable embedded benchmarks such as EEMBC and MiBench. The primary focus of this solution will be to develop an implementation that will ease multicore programming. This solution also will address portability issues and ensure that the implementation does not depend on any dedicated operating system or hardware.
OpenMP and MTAPI
The MCA launched a new working group, the Multicore Task Management API (MTAPI) working group, which is creating an industry-standard specification supporting the coordination of tasks on embedded parallel systems. MTAPI is expected to abstract hardware details and let the software developer focus on creating the parallel solution. But like the other MCA APIs, MTAPI is a low-level model. We need a high-level model to improve programmer productivity. Hence we plan to use OpenMP, which introduced the concept of tasks and the task construct in version 3.0 (2008). However, OpenMP by itself cannot target platforms that combine multiple cores, accelerators, and operating systems. So, using OpenMP as the high-level model, we can consider MTAPI as the target of our OpenMP translation. This will help achieve not only programmability but also portability.
Modifying a given code for a given platform may be the most direct way to obtain the best parallelism, but it is a painful procedure. Our goal is to make parallel programming easy, flexible, and ready to use.
Programming using OpenMP is a pleasure (depending on your experience), primarily because it has the ability to express parallelism in a very natural way. Standards such as the MCA APIs will allow vendors to work together, especially as devices move toward smarter, more powerful, and more run-time efficient platforms. We aim to use these APIs as the target of the translation of our high-level model, OpenMP. This will enable us to develop a portable, programmable, and productive solution for programming multicore embedded systems.
Sunita Chandrasekaran is a Postdoctoral Fellow at the High Performance Computing and Tools (HPCTools) research group at the University of Houston, Texas, USA. Her current area of work spans HPC, Exascale solutions accelerators, heterogeneous and multicore embedded technology solutions. Her research interests include parallel programming, reconfigurable computing, accelerators, and runtime support. Her research contributions include exploring newer approaches to build effective toolchains addressing programmer productivity and performance while targeting current HPC and embedded systems. She is a member of the Multicore Association (MCA). Sunita earned a Ph.D. in Computer Science Engineering from Nanyang Technological University (NTU), Singapore, in the area of developing tools and algorithms to ease programming on FPGAs. She earned a B.E. in Electrical & Electronics from Anna University, India.
Barbara Chapman is a professor of Computer Science at the University of Houston, where she teaches and performs research on a range of HPC-related themes. Her research group has developed OpenUH, an open source reference compiler for OpenMP with Fortran, C and C++ that also supports Co-Array Fortran (CAF) and CUDA. In 2001, she founded cOMPunity, a not-for-profit organization that enables research participation in the development and maintenance of the OpenMP industry standard for shared memory parallel programming. She is a member of the Multicore Association (MCA), where she collaborates with industry to define low-level programming interfaces for heterogeneous computers. Her group also works with colleagues in the U.S. DoD and the U.S. DoE to help define and promote the OpenSHMEM programming interface. Barbara has conducted research on parallel programming languages and compiler technology for more than 15 years, and has written two books, published numerous papers, and edited volumes on related topics. She earned a B.Sc. (First Class Honors) in mathematics from Canterbury University, New Zealand, and a Ph.D. in computer science at Queen’s University, Belfast.