Thread synchronization techniques for better multicore system power/performance tradeoffs

Faheem Sheikh, Mentor Graphics

June 12, 2012

A four-core use-case
Let’s consider a parallel multithreaded application use-case consisting of eight child threads on a four-core system. Figure 2 below shows a sequence diagram of the RTOS fork-join services supporting this use-case, where each column represents a single core. On core zero (C0), the parent thread forks (Thread_Fork) eight child threads one by one without relinquishing control of the processor.

While the parent thread is forking children, three cores are free in the system, and they run the first three child threads without waiting for the parent to finish. The fourth child has to wait until the parent has created all of the threads.

This is required to ensure the parent does not hop between cores while creating child threads. As soon as the parent has finished spawning the children, it is preempted by the fourth child; from then on, full utilization of all the cores is depicted by the rectangular shapes. Idle time on a core is indicated by the absence of any rectangle (empty space).



Figure 2: Sequence diagram showing proposed scheme implementing fork-join.
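
As a rough illustration, the parent's side of this use-case might look like the sketch below. The Thread_Fork and Thread_Join names come from the description above, but their prototypes, the thread_entry_t type, and child_work are assumptions made here for illustration; they are not the actual Nucleus RTOS API.

/* Sketch of the parent's fork-join sequence from Figure 2. The
 * Thread_Fork/Thread_Join prototypes, thread_entry_t, and child_work
 * are hypothetical; the real Nucleus RTOS services differ. */
typedef void (*thread_entry_t)(void *arg);

extern int  Thread_Fork(int child_id, thread_entry_t entry, void *arg);
extern void Thread_Join(int child_id);

#define NUM_CHILDREN 8

static void child_work(void *arg)
{
    /* ... per-child computation ... */
}

static void parent_thread(void *arg)
{
    int i;

    /* Fork all eight children back to back; the parent keeps its core
     * until every Thread_Fork returns, so thread creation does not hop
     * between cores. The three idle cores pick up the first three
     * children immediately; the fourth child runs once forking ends. */
    for (i = 0; i < NUM_CHILDREN; i++)
        Thread_Fork(i, child_work, arg);

    /* Join the children in order. For children that have already
     * finished, this reduces to a flag check in the thread control
     * block, with no suspension of the parent. */
    for (i = 0; i < NUM_CHILDREN; i++)
        Thread_Join(i);
}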

Once the parent gets control back (on whichever core is free), some of the children will have already completed their work, and Thread_Join requires only a status check on a flag in the thread control block.

This is the case for children zero through six, where no synchronization is required because of the priority relationship between parent and child threads. The last child thread to join (thread seven) has not yet finished, and since the parent is the lowest-priority thread, the fact that it is running means the system is now lightly loaded.

Therefore, the core on which the parent is running is taken to a low-power mode. When thread seven finishes, it wakes up this core using a hardware-assisted event, which also serves the synchronization purpose.
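
On the Cortex-A9 this hardware-assisted event maps naturally onto the ARMv7 WFE/SEV instructions. The sketch below shows one way the flag check and the low-power wait might fit together; the finished flag and the tcb layout are assumptions for illustration, not the actual Nucleus implementation.

/* Sketch of the flag-based join with a hardware-assisted sleep/wake.
 * The 'finished' flag and tcb layout are hypothetical; WFE and SEV are
 * the ARMv7 wait-for-event/send-event instructions available on the
 * Cortex-A9 MPCore. */
struct tcb {
    volatile int finished;          /* set by the child on completion */
    /* ... other thread control block fields ... */
};

/* Parent side, reached from Thread_Join for a still-running child. */
static void join_wait(struct tcb *child)
{
    while (!child->finished) {
        /* Put the core into a low-power state until an event arrives.
         * If the child's SEV fires between the flag check and this
         * instruction, the event register makes WFE return at once. */
        __asm__ volatile("wfe" ::: "memory");
    }
}

/* Child side, executed when its work is done. */
static void signal_parent(struct tcb *self)
{
    self->finished = 1;
    __asm__ volatile("dsb" ::: "memory");   /* make the store visible */
    __asm__ volatile("sev");                /* wake any core in WFE   */
}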

We can compare this use-case with its counterpart, which relies on RTOS constructs such as message passing or events for thread synchronization. The first thing to note is that since a child typically inherits its parent’s priority, the parent will be able to create all the children, but then child four has to wait until the parent calls Thread_Join against the first child thread.

If the parent suspends on this call because the completion message/event has not yet arrived from the first child, child four gets a chance to execute on the core vacated by the parent. Of course, this depends on how long child one takes to complete. As a result, time that should be spent on computation is wasted on internal OS logic, increasing synchronization overhead. In addition, every child thread generates a software event message upon completion that the RTOS scheduler has to process, resulting in further overhead.

When this happens, the parent, in accordance with the program logic, goes through the join routine and is suspended whenever the corresponding child has not finished, again resulting in overhead and low utilization of the core. If power management capabilities are not built into the RTOS, this scheme will also consume more power.
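
For contrast, the message-passing join path might look something like the sketch below. The queue calls are generic placeholders standing in for whatever RTOS message or event service is used; each completing child posts a message, and the parent suspends on the queue whenever the next child has not yet posted.

/* Sketch of the message-passing (MP) based join used for comparison.
 * msg_queue_send()/msg_queue_receive() are generic placeholder calls,
 * not a specific Nucleus API. */
#define NUM_CHILDREN 8

extern void msg_queue_send(int queue_id, int msg);
extern int  msg_queue_receive(int queue_id);   /* suspends if empty */

/* Child side: every completion posts a message that the scheduler
 * must deliver, whether or not the parent is currently waiting. */
static void child_done(int queue_id, int child_id)
{
    msg_queue_send(queue_id, child_id);
}

/* Parent side: suspends on the queue each time the next completion
 * message has not yet arrived, paying a context switch every time. */
static void join_all(int queue_id)
{
    int joined;

    for (joined = 0; joined < NUM_CHILDREN; joined++)
        (void)msg_queue_receive(queue_id);
}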

To verify the proposed solution discussed in this article, a CPU-bound dot product application was developed and tested on an ARM Cortex-A9 MPCore with four identical cores, each operating at 400 MHz. The RTOS used was the Nucleus RTOS from Mentor Graphics, compiled with Mentor’s Sourcery GNU tools for ARM EABI.
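
The article does not include the benchmark source, but a dot product of this kind would typically be split into equal slices, one per child thread, using the fork-join services sketched earlier; the slicing and the partial-sum array below are assumptions for illustration.

/* Sketch of how the dot product might be partitioned across child
 * threads with the fork-join services sketched above. The slicing and
 * partial-sum reduction are assumptions; the benchmark source is not
 * given in the article. */
extern int  Thread_Fork(int child_id, void (*entry)(void *), void *arg);
extern void Thread_Join(int child_id);

#define N            4096           /* problem size: 2048/4096/8192   */
#define NUM_THREADS  8              /* 2, 4, 8 or 16 in the tests     */

static double a[N], b[N];
static double partial[NUM_THREADS];

struct slice { int id, start, len; };

static void dot_slice(void *arg)
{
    struct slice *s = arg;
    double sum = 0.0;
    int i;

    for (i = s->start; i < s->start + s->len; i++)
        sum += a[i] * b[i];

    partial[s->id] = sum;           /* reduced by the parent after join */
}

static double parallel_dot(void)
{
    struct slice s[NUM_THREADS];
    double result = 0.0;
    int t, chunk = N / NUM_THREADS;

    /* Fork one child per slice, join them all, then reduce. */
    for (t = 0; t < NUM_THREADS; t++) {
        s[t].id = t;
        s[t].start = t * chunk;
        s[t].len = (t == NUM_THREADS - 1) ? N - t * chunk : chunk;
        Thread_Fork(t, dot_slice, &s[t]);
    }
    for (t = 0; t < NUM_THREADS; t++)
        Thread_Join(t);
    for (t = 0; t < NUM_THREADS; t++)
        result += partial[t];

    return result;
}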

The results for the proposed and the message-passing (MP) based fork-join schemes are shown in Figure 3 below for matrix sizes of 2048, 4096, and 8192.



Figure 3: Dot product runtime comparison.

Since overhead is a function of the number of threads, each matrix size is evaluated with 2, 4, 8, and 16 threads. The time taken in seconds by the dot product application is plotted on the vertical axis.

Note that in all cases the technique suggested in this work results in lower execution time, with an average of around three percent savings in overhead, in addition to the promise of power savings. Since overhead is only a fraction of the absolute execution time in a highly parallelizable application such as dot product, the savings here are significant.

Faheem Sheikh joined the Embedded Systems Division of Mentor Graphics in 2007, where he is working as a senior technical lead. His current focus is software research and development for symmetric multiprocessor architectures. Faheem has a Master’s degree (2005) and a PhD (2009) in computer engineering from Lahore University of Management Sciences, Pakistan. He has more than ten technical publications in leading international conferences and journals.
