
A portable OpenMP runtime library based on the embedded MCA APIs: Part 2

In our earlier article on building a portable OpenMP runtime library for embedded multicore designs based on the Multicore Association (MCA) APIs, we described how the feature sets of the two programming models could be configured to work together, how to create threads in an optimal manner, and how to handle the memory system efficiently.

In this article we discuss another important consideration: a key implementation challenge relating to the synchronization primitives used in these two multicore programming models.

In the previous article, “A Portable OpenMP Runtime Library Based on MCA APIs for Embedded Systems – Part 1”, an overview of MCA APIs was given. We established a correlation between the feature sets of OpenMP and MCA. We also discussed methodologies to create threads in an optimal manner and to handle the memory system efficiently in the more limited resource environment of most embedded designs.

Synchronization primitives
The OpenMP synchronization primitives include the ‘master’, ‘single’, ‘critical’, and ‘barrier’ constructs. OpenMP relies heavily on barrier operations to synchronize the threads of a parallel region. A work-sharing construct has an implicit barrier at its end, and OpenMP also provides an explicit barrier construct for finer control and coordination of work among threads.
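As a quick, purely illustrative example (not taken from the article's benchmarks), the short C program below uses each of these constructs; a compiler typically lowers every directive shown here into runtime library calls, which is the subject of the rest of this section.

#include <stdio.h>
#include <omp.h>

/* Illustrative use of the OpenMP synchronization constructs. */
int main(void)
{
    int sum = 0;

    #pragma omp parallel
    {
        #pragma omp single
        printf("executed by exactly one thread\n");   /* 'single' region */

        #pragma omp critical
        sum += omp_get_thread_num();                  /* one thread at a time */

        #pragma omp barrier                           /* explicit barrier */

        #pragma omp master
        printf("sum = %d\n", sum);                    /* master thread only */
    }
    return 0;
}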

The synchronization constructs are typically translated into runtime library calls during compilation. Hence an effective barrier implementation at runtime enables better performance and scalability.

Traditional OpenMP runtime implementations for high-performance computing domains rely largely on POSIX thread synchronization primitives, such as mutexes and semaphores. However, there are several obstacles to adopting similar approaches for embedded systems, so we have considered alternate ways to implement the barrier construct for embedded platforms [1]. For experimental purposes we chose the centralized blocking barrier, which is based on a centralized shared thread counter, mutexes, and condition variables, and is adopted by many current barrier implementations.

In the centralized blocking barrier, each thread atomically updates a shared counter once it reaches the barrier. All threads block on a condition wait until the value of the counter equals the team size; the last thread then sends a signal to wake up all the other threads. This is a good approach for high-performance computing domains but not for embedded platforms, so we tweaked this approach and call our version the centralized barrier.
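For reference, a minimal sketch of such a centralized blocking barrier, written with POSIX mutexes and condition variables (the variable names are ours, not from any actual runtime), looks roughly like this:

#include <pthread.h>

/* Centralized blocking barrier: shared counter + mutex + condition variable. */
static pthread_mutex_t bar_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  bar_cond  = PTHREAD_COND_INITIALIZER;
static int bar_count = 0;
static int bar_generation = 0;   /* distinguishes successive barrier episodes */

void blocking_barrier(int team_size)
{
    pthread_mutex_lock(&bar_mutex);
    int my_generation = bar_generation;

    if (++bar_count == team_size) {
        /* Last thread: reset the counter and wake everyone up. */
        bar_count = 0;
        bar_generation++;
        pthread_cond_broadcast(&bar_cond);
    } else {
        /* Block until the last thread signals the end of this episode. */
        while (my_generation == bar_generation)
            pthread_cond_wait(&bar_cond, &bar_mutex);
    }
    pthread_mutex_unlock(&bar_mutex);
}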

In our implementation, each thread still updates a shared counter and waits for its value to equal the number of threads in the team. Instead of using a condition wait, however, each thread spins, continuously checking a flag until the barrier point is reached. A spin lock requires few resources to block a thread, so it does not exhaust the already limited resources available on an embedded platform.

This barrier implementation also uses less memory. An early evaluation of the two approaches showed that the centralized barrier was ~10x and ~22x faster than the centralized blocking approach for 4 and 32 threads respectively, although the centralized barrier does not scale well in general.

The centralized barrier strategy suffers from both read and write contention on shared variables (all threads contend for the same set of variables), and we believe that locking the global counter also hampers scalability. Even so, its results were better than those of the centralized blocking approach at 32 threads, probably because no overhead is incurred for signal handling and context switches.

We used MRAPI synchronization primitives for mapping strategies and coordinating access to shared resources. Figure 1 shows the pseudo-code for the centralized barrier implementation.

We are continuing to brainstorm and improve the barrier construct implementation by exploring other algorithms such as tree barrier, tournament barrier, and so on.

global_barrier_flag = 0;   // sense flag, toggled at the end of each barrier episode
global_count = 0;          // number of threads that have reached the barrier

void __ompc_barrier(void)
{
   …
   int barrier_flag = 0;
   if (active_team_size > 1) {
      barrier_flag = global_barrier_flag;        // record the current sense
      mrapi_mutex_lock(…);
      global_count++;                            // count this thread's arrival
      mrapi_mutex_unlock(…);
      if (global_count == active_team_size) {
         global_count = 0;                       // last thread resets the counter
         global_barrier_flag = barrier_flag ^ 1; // flip the sense to release the waiters
      } else {
         while (barrier_flag == global_barrier_flag)
            ;                                    // spin until the last thread flips the sense
      }
   }
}

Figure 1: Pseudo code for barrier implementation

We also provide support for other constructs, such as ‘critical’, which defines a critical section of code that only one thread may execute at a time. When the critical construct is encountered, the critical section is outlined and two runtime library calls, __ompc_critical and __ompc_end_critical, are inserted at the beginning and at the end of the critical section respectively.

The former is implemented as an MRAPI mutex_lock, and the latter as an MRAPI mutex_unlock. The ‘single’ construct specifies that the encapsulated code can only be executed by a single thread. Therefore, only one of the threads that encounter the single construct will execute the code within that region.

The basic idea is that each thread tries to update a global counter that is protected by an MRAPI mutex. Only the first thread that acquires the mutex updates the counter and receives a flag telling it to execute the single region. The ‘master’ construct specifies that only the master thread executes the code. Since the node id is stored in the MRAPI resource tree, it is fairly easy to determine which thread is the master thread.
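In the same pseudo-code style as Figure 1 (MRAPI arguments elided with …, counter and helper names ours), the critical and single support can be sketched as follows:

// Sketch of 'critical' and 'single' support over MRAPI mutexes.

void __ompc_critical(void)
{
   mrapi_mutex_lock(…);      // one thread at a time enters the section
}

void __ompc_end_critical(void)
{
   mrapi_mutex_unlock(…);    // release the section for the next thread
}

// Returns 1 only for the first thread to reach this 'single' region;
// the counter would be reset again at the next barrier.
int __ompc_single(void)
{
   int do_it = 0;
   mrapi_mutex_lock(…);
   if (single_count == 0) {  // global counter protected by the mutex
      single_count++;
      do_it = 1;             // this thread executes the single region
   }
   mrapi_mutex_unlock(…);
   return do_it;
}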

Our implementation also provides support for the work-sharing constructs, which underpin the data parallelism that is widely exploited in today’s multicore embedded systems. We are in the process of improving these results; they will be discussed in future articles.

Architecture and compilation overview
Our target architecture is the Freescale P1022 Reference Design Kit (RDK), a state-of-the-art dual-core Power Architecture multicore platform. It supports 36-bit physical addressing and double-precision floating point. The memory hierarchy consists of three levels: 32KB instruction and data L1 caches, a 256KB shared L2 cache, and 512MB of 64-bit off-chip DDR memory.

We used the OpenUH compiler [2] to perform a source-to-source translation of a given application into an intermediate file: bare C code containing runtime library function calls. This is fed into the backend native compiler, the Power Architecture GCC toolchain for the Freescale e500v2 processor, to obtain object code. The linker then combines the object code with the OpenMP and MRAPI runtime libraries, which were previously compiled by the native compiler.
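To give a feel for what that intermediate file looks like, here is a hedged sketch of how a simple parallel region might be outlined. The entry-point name __ompc_fork and its prototype are our illustration, modeled on the __ompc_ prefix used in Figure 1, and are not necessarily OpenUH’s exact signatures.

#include <stdio.h>
#include <omp.h>

// Assumed runtime entry point (illustrative prototype, provided by the runtime
// library at link time): spawn the team, run the outlined region, then join.
void __ompc_fork(void (*entry)(void *), void *frame);

// Application routine called inside the parallel region.
static void work(int tid) { printf("hello from thread %d\n", tid); }

// Original source:
//    #pragma omp parallel
//    {  work(omp_get_thread_num());  }
//
// After the source-to-source pass, the region body becomes a plain C function:
static void __ompregion_main_1(void *frame)
{
    (void)frame;                           // shared-variable frame, unused here
    work(omp_get_thread_num());            // outlined body of the parallel region
}

int main(void)
{
    __ompc_fork(__ompregion_main_1, NULL); // replaces the parallel directive
    return 0;
}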

Results and evaluation
We performed several evaluation steps. To make the discussion a little easier, we named our implementation libEOMP. The EPCC microbenchmark suite [3] is a set of programs that measures the overhead of the OpenMP directives and evaluates different OpenMP runtime library implementations. EPCC consists of two benchmarks: syncbench evaluates the overhead of each of the OpenMP directives, while schedbench measures the loop scheduling overhead using different chunk sizes. We discuss the syncbench evaluation results here. Once we improve our implementation of the work-sharing constructs, we will evaluate it with schedbench.

For each of the OpenMP directives, we ran the experiment ten times with 10,000 iterations, and calculated the average time taken. We compared our results with the vendor-specific OpenMP runtime library libGOMP.
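As a rough sketch of how a syncbench-style measurement works (this is not the actual EPCC code; the delay() workload and constants below are ours), the overhead of a directive is the per-iteration time of a loop containing that directive minus the per-iteration time of a reference loop without it:

#include <stdio.h>
#include <omp.h>

#define ITERS 10000

// Small synthetic workload standing in for EPCC's delay() routine.
static void delay(int n)
{
    volatile double a = 0.0;
    for (int i = 0; i < n; i++)
        a += i * 0.5;
}

int main(void)
{
    double t0, ref, test;

    // Reference: the workload alone, no directive.
    t0 = omp_get_wtime();
    for (int i = 0; i < ITERS; i++)
        delay(500);
    ref = (omp_get_wtime() - t0) / ITERS;

    // Test: the same workload with a barrier on every iteration.
    t0 = omp_get_wtime();
    #pragma omp parallel
    {
        for (int i = 0; i < ITERS; i++) {
            delay(500);
            #pragma omp barrier
        }
    }
    test = (omp_get_wtime() - t0) / ITERS;

    printf("approximate barrier overhead: %g s\n", test - ref);
    return 0;
}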

Table 1 shows that the overall performance of libEOMP is quite competitive with libGOMP. The time difference is less than 1 microsecond, barely noticeable by programmers. This motivates us to further improve libEOMP and achieve better results.

Table 1 also shows that directives such as the parallel and single constructs in libEOMP even outperformed those in libGOMP. This is due to the sophisticated thread creation and optimized thread management technique that we used to implement the parallel construct.

We also see that the presence of the implicit barrier hidden in most of the OpenMP directives contributes to the results achieved. For example, the overhead of the parallel construct is dominated by two barrier constructs, one at the beginning of the parallel region and one at the end of the parallel region.

At the beginning of the parallel region, the barrier construct ensures that all the worker threads are ready for execution. At the end of the parallel region, as per the OpenMP specification, there is a need for a barrier construct that ensures that all the worker threads have finished execution.

So we could see that the implementation of the barrier construct is crucial for optimum performance. We plan to improve our current barrier implementation by exploring other algorithms suitable for embedded platforms.

Table 1: Average execution time (µs) for EPCC syncbench

We also evaluated libEOMP using some of the benchmark codes from the MiBench benchmark suite [4]. The suite has a variety of benchmark codes chosen from different domains. We first parallelized these benchmarks using OpenMP, since the original version of the code in MiBench is sequential.

In the FFT algorithm, log N stages are needed for a given wave of length N, and the tasks within each stage are equally distributed among the threads. For the DGEMM and Dijkstra algorithms, the data size is evenly divided among the threads.
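As an illustration of that partitioning scheme (not the actual MiBench-derived code), a DGEMM-style kernel can divide the rows of the output matrix evenly among the threads inside a parallel region:

#include <omp.h>

/* DGEMM-style kernel: the rows of C are divided evenly among the threads. */
void dgemm(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int chunk    = (n + nthreads - 1) / nthreads;   /* rows per thread */
        int lo = tid * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;

        for (int i = lo; i < hi; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }
}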

Figure 2 shows a comparison of execution times using different data sizes and numbers of threads. From the figure we see that libEOMP achieves almost the same performance as libGOMP. For the Dijkstra algorithm, libEOMP outperforms libGOMP due to the efficient implementations of the parallel and single constructs, as discussed earlier. Our implementation also performs well when the data size increases dramatically (the data size in FFT is increased to 8 Mbytes), thus demonstrating good scalability.

Figure 2a: Execution time of FFT

Figure 2b: Execution time of DGEMM

Figure 2c: Execution time of Dijkstra

Multicore processors continue to evolve, placing a greater burden on software developers. There are plenty of supporting software tools and infrastructure that aim to help developers obtain maximum performance on multicore embedded architectures.

But the complexity of hardware architectures is growing at a rate that leaves programmers without the luxury of considering newer languages or building new tools to exploit the rich feature sets of the hardware. Leaving aside issues such as debugging, the growth of hardware complexity is outstripping the ability of software development to keep up. MCA tries to address this concern by developing industry standards for communication and synchronization that target systems with heterogeneous cores.

On the OpenMP side, we were able to leverage MCA functionalities to create a new runtime-based implementation for multicore embedded devices. Performance-wise, we noticed that the additional layer did not incur any additional overhead when compared to a vendor-specific runtime implementation.

By adding the new MCA layer we provide source-level portability that does not rely on any specific architecture or OS. This may not be the case if we use a vendor-specific library like libGOMP. Now that there is a prototype implementation in place, we will concentrate on improving our implementation strategy to obtain better performance and scalability.

Standards take a long time to develop and establish. There is a better chance of more and more programmers embracing such standards if OS, processor, and tool vendors provide the MCA APIs as part of their implementations.

Read A portable OpenMP runtime library based on MCAPI/MRAPI: Part 1

References
1. R. Nanjegowda, O. Hernandez, B. Chapman, and H. H. Jin. “Scalability Evaluation of Barrier Algorithms for OpenMP.” In Proc. of the 5th International Workshop on OpenMP: Evolving OpenMP in an Age of Extreme Parallelism, IWOMP ’09, pages 42–52. Springer-Verlag, 2009.

2. C. Liao, O. Hernandez, B. M. Chapman, W. Chen, and W. Zheng. “OpenUH: An Optimizing, Portable OpenMP Compiler.” Concurrency and Computation: Practice and Experience, 19(18):2317–2332, 2007.

3. J. Bull. “Measuring Synchronization and Scheduling Overheads in OpenMP.” In Proc. of the First European Workshop on OpenMP, pages 99–105, 1999.

4. M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. “MiBench: A Free, Commercially Representative Embedded Benchmark Suite.” In Proc. of the IEEE 4th Annual Workshop on Workload Characterization (WWC-4), 2001.

Sunita Chandrasekaran is a Postdoctoral Fellow in the High Performance Computing and Tools (HPCTools) research group at the University of Houston, Texas, USA. Her current work spans HPC, exascale solutions, accelerators, and heterogeneous and multicore embedded technology. Her research interests include parallel programming, reconfigurable computing, accelerators, and runtime support. Her research contributions include exploring newer approaches to building effective toolchains that address programmer productivity and performance while targeting current HPC and embedded systems. She is a member of the Multicore Association (MCA), OpenMP, and OpenACC.

Sunita earned a Ph.D. in Computer Science Engineering from Nanyang Technological University (NTU), Singapore, in the area of developing tools and algorithms to ease programming on FPGAs. She earned a B.E. in Electrical & Electronics from Anna University, India.

Cheng Wang is currently a PhD student in the Department of Computer Science at the University of Houston. His research interests are high performance computing, compilers, parallel programming models, and multicore embedded systems. He is a student member of IEEE and ACM. Cheng Wang received his BS degree from Xidian University in 2010.

Barbara Chapman is a professor of Computer Science at the University of Houston, where she teaches and performs research on a range of HPC-related themes. Her research group has developed OpenUH, an open source reference compiler for OpenMP with Fortran, C, and C++ that also supports Co-Array Fortran (CAF) and CUDA. In 2001, she founded cOMPunity, a not-for-profit organization that enables research participation in the development and maintenance of the OpenMP industry standard for shared memory parallel programming. She is a member of the Multicore Association (MCA), where she collaborates with industry to define low-level programming interfaces for heterogeneous computers. She is also a member of OpenACC, a programming standard for parallel computing developed by Cray, CAPS, Nvidia, and PGI. Her group also works with colleagues in the U.S. DoD and the U.S. DoE to help define and promote the OpenSHMEM programming interface.

Barbara has conducted research on parallel programming languages and compiler technology for more than 15 years, and has written two books, published numerous papers, and edited volumes on related topics. She earned a B.Sc. (First Class Honors) in mathematics from Canterbury University, New Zealand, and a Ph.D. in computer science from Queen’s University, Belfast.
