In the our previous article, A Portable OpenMP Runtime Library Based on MCA APIs for Embedded Systems – Part 2 , we discussed means to achieve better programming strategies for multicore embedded systems. We highlighted the barrier implementation strategies and noted that there is room for improvement.
We considered a dual core Power Architecture multicore platform from Freescale and evaluated our implementation. Measuring the overhead of the OpenMP directives while evaluating libGOMP and libEOMP implementations, we figured that there was little to no overhead and in some cases our libEOMP was faring better. We also evaluated libEOMP using some of the benchmark codes from MiBench benchmark suite. (libEOMPis a term that we have given to our novel runtime implementation intended totarget multicore embedded platforms; ‘E’ stands for Embedded.)
In this article, we will explore better programming strategies and show how the MCA layer could be used for an OpenMP implementation targeting multicore embedded systems.
Typically an OpenMP compiler translates OpenMP directives into multithreaded code that consists of functions calls to a customized runtime library. Although the division of work between an OpenMP compiler and a runtime library is implementation-defined, the essential functionality provided by a runtime includes:
- Efficient handling of threads creation, synchronization, and management.
- Parallel distributing loop iterations among threads when assisting the compiler to transform work-sharing construct.
- Fulfilling the OpenMP memory model and managing the shared memory.
- Supporting runtime library routines and environment variables.
Our runtime design inherits basic features from some of the commonly used open-source OpenMP implementations such as OpenUH , but since the features are not tailored for embedded systems, we’ve had to adopt several strategies to customize the implementations for embedded systems.
Optimized thread creation
Earlier we mentioned that we create a thread pool by querying the number of nodes available on the platform, using MRAPI metadata primitive. This allows us to generate only the number of threads required in the thread pool, avoiding thread oversubscription. At a later stage, if the programmer specifies a number of threads less than the size of the thread pool, CPU cycles will be freed by handling the idle threads efficiently.
Handling idle threads efficiently
Once threads are created, there are idle threads waiting to be scheduled in the pool. These idle threads should neither consume too much system resource nor have a long time delay when they receive a wake-up signal.
Traditionally, a master thread handles the idle threads; i.e., if a worker thread finishes its execution, it declares itself as idle by changing the status of the message queue. It then sleeps on a conditional wait until a wake-up signal with a new microtask is received.
This is conventionally accomplished using the POSIX threads mutex lock and conditional variables. This is called a ‘master-initiated’ approach, as it is the master thread’s responsibility to wake up all the worker threads.
The advantage of this approach is that CPU cycles will be released when worker threads are idle using conditional wait. This method is widely used in conventional OpenMP runtime implementations with large thread pools, as it would be catastrophic if hundreds of idle threads are busy with polling new microtasks. This is especially important when we are dealing with embedded systems where the resources are scarce. Idle thread polling is also a power hog.
However, a drawback of the master-initiated approach is that all worker threads will compete for the mutex lock when they receive the wake-up signal. That is, worker threads have to acquire the lock before they can proceed to execute the new microtask. This will cause a large performance overhead when the system scales up.
In our runtime design, we introduce a worker-initiated idle-threads-handling mechanism. The key idea is to pass the duty of waking up the idle threads to the worker threads, thus freeing the master thread of this responsibility and avoiding the need of using mutex locks.
Specifically, each worker thread maintains a private message queue. When the master thread assigns a new microtask, it puts the new microtask into the queue. When a worker thread finishes its execution, it will check for a new microtask from its private queue. Therefore worker threads do not need to compete for the mutex lock, and there is no signaling overhead associated with inter-process communication.
It is still possible that this technique will increase the workload for the worker threads. However, such an overhead will not count into the critical path, as the idle worker threads initialize the communication instead of the busy master thread. The cost of this kind of polling operation is negligible since it performs a read operation from a local message queue and no inter-process communication takes place.
Furthermore, we also introduce an interval, δ, between two polling attempts. The value of δ could be based on either randomness or the Poisson distribution. In our evaluation, we choose a heuristics value that equals to the average overhead of the runtime to finish a microtask. Therefore, the idle threads can sleep until the next microtask is available. It will be interesting to see if this made a difference compared to vendor-specific OpenMP implementation, when we evaluate our implementation.
Better barrier implementations
In the previous article, weconcluded that centralized barrier is not the way to go. OpenMP heavilyrelies on barrier operations to synchronize threads. Implicit barriersare required at the end of the parallel region; they are also usedimplicitly at the end of the work-sharing construct. The OpenMPdevelopers use explicit barriers to synchronize the threads in the team.Therefor an efficient barrier implementation is essential to achievegood performance and scalability.
In our runtime design, weemploy the tournament algorithm  to implement the OpenMP barrier.Compared to the traditional centralized barrier and blocking barrieralgorithms that require O(n) operations on critical path, the tournament barrier only takes log(n) operations, where n is the number of processes. As a result, it is clear that the tournament barrier offers better scalability.
Algorithm 1 below shows the tournament barrier. The idea of the tournamentalgorithm comes from the tournament game. Two threads play against eachother in each game. The role of each thread, (i.e. the winner, loser, orchampion) is predefined. The winners from each round play against eachother until there is a champion, who will wake-up all threads andrelease the barrier. Let' see how this barrier implementation fares onthe target platform.
Data : round_t rounds[p][log P ]
while 1 do
if round->role & LOSER then
round->opponent = current_thd->sense;
while champion_sense ≠ current_thd->sense do
else if round->role & WINNER then
while round→flag ≠ current_thd->sense do
else if round->role & CHAMPION then
while round_flag ≠ current_thd->sense do
champion_sense = current_thd->sense;
In the next article, we will discuss the results obtained from the optimized strategies discussed above. It will be interesting to see how the optimizations perform on more than just a dual core processor used as the testing platform earlier. Although we intend to keep the overhead minimum as the primary necessity of this work, however creating this MxAPI-based translation layer will be a good start before we move towards heterogeneous platforms.
Itwill be interesting to evaluate how each of our design choices farewhen targeting a variety of platforms. That will provide opportunitiesto tweak our implementation and develop an almost portable one. But wemust not forget that multicore embedded systems are heterogeneous too.When dealing with heterogeneous systems, we are looking at devices thatcould be running on multiple OSes, so providing software support getscomplicated.
The potential benefits of heterogeneity are great.As our individual processing cores are simplified, and all manner ofavailable parallelism must necessarily be exploited to provide desiredlevels of performance, the availability of a variety of devices withdifferent strengths and weaknesses will greatly expand the range ofalgorithms that may be successfully parallelized.
However,success in the embedded marketplace will depend heavily on the kind ofsupport provided to application developers. The growing complexity andparallelism in hardware has led to higher demands on applicationsoftware, in turn leading to critical needs for more productive forms ofhardware and software design.
In the next article, we will talkabout results obtained while evaluating our programming techniques on aneight- core embedded platform, but the target still will be ahomogeneous one. Later in this series we will explore what it takes toprogram a heterogeneous embedded platform using MxAPI as the translationlayer.
2. J. M. Mellor-Crummey and M. L. Scott. Algorithms for Scalable Synchronization on Shared-memory Multiprocessors. ACM Trans. Comput. Syst., 9(1):21–65, Feb. 1991. ISSN 0734-2071.
Sunita Chandrasekaran is a Postdoctoral Fellow at the High Performance Computing and Tools(HPCTools) research group at the University of Houston, Texas, USA. Hercurrent area of work spans HPC, Exascale solutions accelerators,heterogeneous and multicore embedded technology solutions. Her researchinterests include parallel programming, reconfigurable computing,accelerators, and runtime support. Her research contributions includeexploring newer approaches to build effective toolchain addressingprogrammer productivity and performance while targeting current HPC andembedded systems. She is a member of the Multicore Association (MCA),OpenMP and OpenACC. Sunita earned a Ph.D. in Computer ScienceEngineering from Nanyang Technological University (NTU), Singapore, inthe area of developing tools and algorithms to ease programming onFPGAs. She earned a B.E in Electrical & Electronics from AnnaUniversity, India.
Cheng Wang is currently a PhD studentin the Department of Computer Science at the University of Houston. Hisresearch interests are high performance computing, compilers, parallelprogramming models and multicore embedded systems. He is a studentmember of IEEE and ACM. Cheng Wang received the BS degree from XidianUniversity in 2010.
Barbara Chapman is a professor ofComputer Science at the University of Houston, where she teaches andperforms research on a range of HPC-related themes. Her research grouphas developed OpenUH, an open source reference compiler for OpenMP withFortran, C and C++ that also supports Co-Array Fortran (CAF) andCUDA. In 2001, she founded cOMPunity, a not-for-profit organization thatenables research participation in the development and maintenance ofthe OpenMP industry standard for shared memory parallel programming. Sheis a member of the Multicore Association (MCA), where she collaborateswith industry to define low-level programming interfaces forheterogeneous computers. She is also the member of OpenACC standard, aprogramming standard for parallel computing developed by Cray, CAPS,Nvidia and PGI. Her group also works with colleagues in the U.S. DoD andthe U.S. DoE to help define and promote the OpenSHMEM programminginterface.
Barbara has conducted research on parallel programminglanguages and compiler technology for more than 15 years, and haswritten two books, published numerous papers, and edited volumes onrelated topics. She earned a B.Sc. (First Class Honors) in mathematicsfrom Canterbury University, New Zealand, and a Ph.D. in computer scienceat Queen’s University, Belfast.