A portable OpenMP runtime library based on the embedded MCA APIs: Part 2

Sunita Chandrasekaran, Cheng Wang, and Barbara Chapman, University of Houston

April 22, 2013

In our earlier article on building a portable OpenMP runtime library for embedded multicore designs based on the Multicore Association (MCA) APIs, we described how the feature sets of the two programming models can be configured to work together, how to create threads in an optimal manner, and how to handle the memory system efficiently.

In this article we discuss another important consideration: how to deal with a key implementation challenge relating to synchronization of primitives used in these two multicore programming models.

In the previous article, “A Portable OpenMP Runtime Library Based on MCA APIs for Embedded Systems – Part 1”, we gave an overview of the MCA APIs and established a correlation between the feature sets of OpenMP and MCA. We also discussed methodologies for creating threads in an optimal manner and for handling the memory system efficiently within the more limited resource environment of most embedded designs.

Synchronization primitives
The OpenMP synchronization primitives include the ‘master’, ‘single’, ‘critical’, and ‘barrier’ constructs. OpenMP relies heavily on barrier operations to synchronize threads in a parallel region: every work-sharing construct has an implicit barrier at its end, while explicit barriers give the programmer finer control over the coordination of work among threads.

The synchronization constructs are typically translated into runtime library calls during compilation. Hence an effective barrier implementation at runtime enables better performance and scalability.
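As a concrete illustration, an explicit barrier in an OpenMP region could be lowered roughly as follows. The outlining is simplified and the do_work, do_more_work, and outlined_parallel_region names are illustrative; only the call to __ompc_barrier reflects the runtime routine shown in Figure 1 below.

/* OpenMP source */
#pragma omp parallel
{
   do_work(omp_get_thread_num());
   #pragma omp barrier             /* explicit barrier */
   do_more_work();
}                                  /* implicit barrier at the end of the region */

/* Simplified shape of the compiler-translated code: the barriers become
   calls into the OpenMP runtime library. */
void outlined_parallel_region(void)
{
   do_work(omp_get_thread_num());
   __ompc_barrier();               /* explicit barrier */
   do_more_work();
   __ompc_barrier();               /* implicit barrier at the end of the region */
}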

Traditional OpenMP runtime implementations for high-performance computing domains usually rely largely on POSIX thread synchronization primitives, such as mutexes and semaphores. However, there are several obstacles to adopting similar approaches for embedded systems, so we considered alternative approaches to implementing the barrier construct for embedded platforms [1]. For experimental purposes we chose a centralized blocking barrier, based on a centralized shared thread counter, mutexes, and condition variables, which is the scheme adopted by many current barrier implementations.

In the centralized blocking barrier, each thread atomically updates a shared counter once it reaches the barrier. All threads then block on a condition wait until the value of the counter equals the team size, at which point the last thread signals to wake up all the other threads. This is a good approach for high-performance computing domains but not for embedded platforms, so we tweaked this barrier approach and call our version the centralized barrier.
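For reference, a minimal sketch of such a centralized blocking barrier built directly on the POSIX primitives mentioned above might look like the following; the variable and function names and the team_size parameter are illustrative rather than taken from our runtime.

#include <pthread.h>

static pthread_mutex_t bar_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  bar_cond  = PTHREAD_COND_INITIALIZER;
static int bar_count = 0;   /* threads that have reached the barrier */
static int bar_cycle = 0;   /* distinguishes successive barrier episodes */

void blocking_barrier(int team_size)
{
   pthread_mutex_lock(&bar_mutex);
   int my_cycle = bar_cycle;
   if (++bar_count == team_size) {
      /* last thread: reset the counter and wake up all waiting threads */
      bar_count = 0;
      bar_cycle ^= 1;
      pthread_cond_broadcast(&bar_cond);
   } else {
      /* block on the condition variable until the last thread signals */
      while (my_cycle == bar_cycle)
         pthread_cond_wait(&bar_cond, &bar_mutex);
   }
   pthread_mutex_unlock(&bar_mutex);
}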

In our implementation, each thread still updates a shared counter and waits for the value to equal the number of threads in the team. Instead of using a condition wait, however, each thread spins, continuously checking a flag until the barrier point is reached. This spin-waiting requires few resources to block a thread, so it does not exhaust the already limited resources available on an embedded platform.

This barrier implementation also uses a smaller amount of memory. An early evaluation of the two approaches showed that the centralized barrier was ~10x and ~22x faster than the centralized blocking approach for 4 and 32 threads respectively, although the centralized barrier approach does not scale well in general.

The centralized barrier strategy suffers from both read and write contention for shared variables (all threads contend for the same set of variables), and we believe that locking the global counter also hampers scalability. Even so, the results were better than those of the centralized blocking approach for 32 threads, probably because no overhead was incurred for signal handling and context switches.

We used MRAPI synchronization primitives for mapping strategies and coordinating access to shared resources. Figure 1 shows the pseudo-code for the centralized barrier implementation.

We are continuing to brainstorm and improve the barrier construct implementation by exploring other algorithms such as tree barrier, tournament barrier, and so on.

global_barrier_flag = 0;
global_count = 0;  // number of threads that have reached the barrier

void __ompc_barrier(void)
{
   ...
   int my_count;
   int barrier_flag = 0;
   if (active_team_size > 1) {
      barrier_flag = global_barrier_flag;   // remember the current sense
      mrapi_mutex_lock(...);
      my_count = ++global_count;
      mrapi_mutex_unlock(...);
      if (my_count == active_team_size) {
         // last thread to arrive: reset the counter and release the
         // waiting threads by flipping the sense flag
         global_count = 0;
         global_barrier_flag = barrier_flag ^ 1;
      } else {
         // spin until the last thread flips the sense flag
         while (barrier_flag == global_barrier_flag);
      }
   }
}


Figure 1: Pseudo-code for the centralized barrier implementation

We also provide support for other constructs, such as ‘critical’, which defines a section of code that only one thread can execute at a time. When the critical construct is encountered, the critical section is outlined and two runtime library calls, ompc_critical and ompc_end_critical, are inserted at the beginning and at the end of the critical section respectively.

The former is implemented as an MRAPI mutex_lock, and the latter as an MRAPI mutex_unlock. The ‘single’ construct specifies that the enclosed code is executed by only one thread: of the threads that encounter the single construct, only one will execute the code within that region.

The basic idea is that each thread tries to update a global counter that is protected by an MRAPI mutex. Only the first thread to gain access to the mutex updates the counter, and the runtime returns a flag telling that thread to execute the single region, as sketched below. The ‘master’ construct specifies that only the master thread executes the enclosed code; since the node id is stored in the MRAPI resource tree, it is straightforward to determine which thread is the master thread.
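A minimal sketch of how these runtime entry points could sit on top of MRAPI follows. The MRAPI call arguments are elided as in Figure 1; the ompc_single name and the single_count variable are hypothetical illustrations, and a real implementation must also reset the counter between successive single regions.

// 'critical': the two runtime calls simply bracket the outlined section
// with an MRAPI mutex (arguments elided as in Figure 1).
void ompc_critical(void)     { mrapi_mutex_lock(...);   }
void ompc_end_critical(void) { mrapi_mutex_unlock(...); }

// 'single': each thread tries to update a global counter under an MRAPI
// mutex; only the first thread to do so gets back a non-zero flag and
// therefore executes the single region.
int single_count = 0;

int ompc_single(void)
{
   int is_first = 0;
   mrapi_mutex_lock(...);
   if (single_count == 0) {
      single_count++;
      is_first = 1;
   }
   mrapi_mutex_unlock(...);
   return is_first;
}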

Our implementation also provides support for the work-sharing constructs, which express the data parallelism that is widely exploited in today’s multicore embedded systems. We are in the process of improving these results, which will be discussed in future articles.

Architecture and compilation overview
Our target architecture is the Freescale P1022 Reference Design Kit (RDK), a state-of-the-art dual-core Power Architecture platform. It supports 36-bit physical addressing and double-precision floating point. The memory hierarchy consists of three levels: 32KB L1 instruction and data caches, a 256KB shared L2 cache, and 512MB of 64-bit off-chip DDR memory.

We used the OpenUH compiler [2] to perform a source-to-source translation of a given application into an intermediate file, which is bare C code containing the runtime library calls. This file is fed into the backend native compiler, the Power Architecture GCC toolchain for the Freescale e500v2 core, to obtain object code. The linker then links the object code with the runtime libraries for OpenMP and MRAPI, which were previously compiled by the native compiler.
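To make the flow concrete, the intermediate file produced by the source-to-source pass has roughly the following shape for a parallel region. This is only an illustration; the outlined-function and fork-call names below are hypothetical rather than OpenUH’s actual identifiers.

/* Original OpenMP source */
#pragma omp parallel
{
   compute(omp_get_thread_num());
}

/* Illustrative shape of the translated bare-C intermediate file: the
   parallel region is outlined into a plain function and replaced by a
   call into the OpenMP runtime, which runs it on the team of threads. */
static void outlined_region_0(void)
{
   compute(omp_get_thread_num());
}

void translated_main(void)
{
   ompc_fork(outlined_region_0);   /* hypothetical runtime entry point */
}

Because the intermediate file is plain C, the backend GCC toolchain needs no OpenMP support of its own; all OpenMP semantics are carried by the runtime library calls.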

