Threading and Parallel Programming Constructs used in multicore systems development: Part 3
By Shameem Akhter and Jason Roberts, Intel Corp.
Important in the parallel programming environment, condition variables
are based on
Dijkstra's semaphore semantics, with the
exception that no
stored value is associated with the operation.
This means condition variables do not contain the actual condition
to test; a shared data state is used instead to maintain the condition
for threads. A thread waits or wakes up other cooperative threads until
a condition is satisfied.
The condition variables are preferable to locks when pooling
requires and needs some scheduling behavior among threads. To operate
on shared data, condition variable C, uses a lock, L. Three basic
atomic operations are performed on a condition variable C:
* wait(L):
Atomically releases the lock and waits, where wait returns the lock
been acquired again
* signal(L): Enables one of the
waiting threads to run, where signal returns the lock is still acquired
* broadcast(L): Enables all of the
waiting threads to run, where broadcast returns the lock is still
acquired
To control a pool of threads, use of a signal function is
recommended. The penalty for using a broadcast-based signaling function
could be severe and extra caution needs to be undertaken before waking
up all waiting threads.
For some instances, however, broadcast signaling can be effective.
As an example, a "write" lock might allow all "readers" to proceed at
the same time by using a broadcast mechanism.
To show the use of a condition variable for a synchronization
problem, the pseudocode in Figure
4.11 below solves the producer-consumer problem discussed
earlier. A variable LC is used to maintain the
association between condition variable C and an associated lock L.

 |
| Figure
4.11 Use of a Condition Variable for the Producer-Consumer Problem
Monitors, |
For structured synchronization, a higher level construct is
introduced for simplifying the use of condition variables and locks,
known as a monitor. The purpose of the monitor is to simplify the
complexity of primitive synchronization operations and remove the
implementation details from application developers.
The compiler for the language that supports monitors automatically
inserts lock operations at the beginning and the end of each
synchronization-aware routine.
Most recent programming languages do not support monitor explicitly,
rather they expose lock and unlock operations to the developers. The
Java language supports explicit monitor objects along with synchronized
blocks inside a method. In Java, the monitor is maintained by the
"synchronized" constructs, such as

where the "condition" primitives are used by wait(), notify(), or
notifyAll() methods. Do not confuse this with the Monitor object in the
Java SDK though. The Java Monitor object is used to perform resource
management in Java Management Extension (JMX). Similarly, the monitor
object in C# is used as lock construct.
Messages
The message is a special method of communication to transfer
information or a signal from one domain to another. The definition of
domain is different for different scenarios.
For multi-threading environments, the domain is referred to as the
boundary of a thread. The three M's of message passing are
multi-granularity, multithreading, and multitasking (Ang 1996).
In general, the conceptual representations of messages get
associated with processes rather than threads. From a message-sharing
perspective, messages get shared using an intra-process, inter-process,
or process-process approach, as shown in Figure 4.12 below.
 |
| Figure
4.12 Message Passing Model |
Two threads that communicate with messages and reside in the same
process use intra-process messaging. Two threads that communicate and
reside in different processes use inter-process messaging.
From the developer's perspective, the most common form of messaging
is the process-process approach, when two processes communicate with
each other rather than depending on the thread.
In general, the messaging could be devised according to the memory
model of the environment where the messaging operation takes place.
Messaging for the shared memory model must be synchronous, whereas for
the distributed model messaging can be asynchronous. These operations
can be viewed at a somewhat different angle.
When there is nothing to do after sending the message and the sender
has to wait for the reply to come, the operations need to be
synchronous, whereas if the sender does not need to wait for the reply
to arrive and in order to proceed then the operation can be
asynchronous.
The generic form of message communication can be represented as
follows:

The generic form of message passing gives the impression to
developers that there must be some interface used to perform message
passing. The most common interface is the Message
Passing Interface
(MPI). MPI is used as the medium of communication, as illustrated
in Figure 4.13, below.
 |
| Figure
4.13 Basic MPI Communication Environment |
To synchronize operations of threads, semaphores, locks, and
condition variables are used. These synchronization primitives convey
status and access information.
To communicate data, they use thread messaging. In thread messaging,
synchronization remains explicit, as there is acknowledgement after
receiving messages.
The acknowledgement avoids primitive synchronization errors, such as
deadlocks or race conditions. The basic operational concepts of
messaging remain the same for all operational models. From an
implementation point of view, the generic client-server model can be
used for all messaging models.
Inside hardware, message processing occurs in relationship with the
size of the message. Small messages are transferred between processor
registers and if a message is too large for processor registers, caches
get used. Larger messages require main memory. In the case of the
largest messages, the system might use processor-external DMA, as shown
in Figure 4.14 below.
 |
| Figure
4.14 System Components Associated with Size of Messages |
Flow Control-based Concepts
In the parallel computing domain, some restraining mechanisms allow
synchronization among multiple attributes or actions in a system. These
are mainly applicable for shared-memory multiprocessor or multi-core
environments. The following sections covers only two of these concepts,
fence and barrier.
Fence.
The fence mechanism is implemented using instructions and in fact, most
of the languages and systems refer to this mechanism as a fence
instruction. On a shared memory multiprocessor or multi-core
environment, a fence instruction ensures consistent memory operations.
At execution time, the fence instruction guarantees completeness of
all pre-fence memory operations and halts all post-fence memory
operations until the completion of fence instruction cycles. This fence
mechanism ensures proper memory mapping from software to hardware
memory models, as shown in Figure
4.15, below.
The semantics of the fence instruction depend on the architecture.
The software memory model implicitly supports fence instructions. Using
fence instructions explicitly could be error-prone and it is better to
rely on compiler technologies. Due to the performance penalty of fence
instructions, the number of memory fences needs to be optimized.
 |
| Figure
4.15 Fence Mechanism |
Barrier.
The barrier mechanism is a synchronization method by which threads in
the same set keep collaborating with respect to a logical computational
point in the control flow of operations. Through this method, a thread
from an operational set has to wait for all other threads in that set
to complete in order to be able to proceed to the next execution step.
This method guarantees that no threads proceed beyond an execution
logical point until all threads have arrived at that logical point.
Barrier synchronization is one of the common operations for shared
memory multiprocessor and multi-core environments.
Due to the aspect of waiting for a barrier control point in the
execution flow, the barrier synchronization wait function for ith
thread can be represented as
(Wbarrier)i = f ((Tbarrier)i, (Rthread)i)
where Wbarrier is the wait time for a
thread, Tbarrier is the number of threads has
arrived, and Rthread is the arrival rate of
threads.
For performance consideration and to keep the wait time within a
reasonable timing window before hitting a performance penalty, special
consideration must be given to the granularity of tasks. Otherwise, the
implementation might suffer significantly.
Implementation-dependent Threading
Features
The functionalities and features of threads in different
environments
are very similar; however the semantics could be different. That is why
the conceptual representations of threads in Windows and Linux remain
the same, even though the way some concepts are implemented could be
different.
Also, with the threading APIs of Win32, Win64, and POSIX
threads
(Pthreads), the semantics are different as well. Windows threading
APIs
are implemented and maintained by Microsoft and work on Windows only,
whereas the implementation of Pthreads APIs allows developers to
implement threads on multiple platforms.
The IEEE only defined the Pthreads APIs and let the implementation
be done by OS developers. Due to the implementation issues of Pthreads,
not all features exist in Pthreads APIs. Developers use Pthreads as a
wrapper of their own thread implementations. There exists a native
Linux Pthreads library similar to Windows native threads, known as
Native
POSIX Thread Library (NPTL).
Consider the different mechanisms used to signal threads in Windows
and in POSIX threads. Windows uses an event model to signal one or more
threads that an event has occurred. However, no counterpart to Windows
events is implemented in POSIX threads. Instead, condition variables
are used for this purpose.
These differences are not necessarily limited to cross-library
boundaries. There may be variations within a single library as well.
For example, in the Windows Win32 API, Microsoft has implemented two
versions of a mutex.
The first version, simply referred to as a mutex, provides one
method for providing synchronized access to a critical section of the
code. The other mechanism, referred to as a CriticalSection,
essentially does the same thing, with a completely different API.
What's the difference?
The conventional mutex object in Windows is a kernel mechanism. As a
result, it requires a user-mode to kernel-mode transition to work. This
is an expensive operation, but provides a more powerful synchronization
mechanism that can be used across process boundaries.
However, in many applications, synchronization is only needed within
a single process. Therefore, the ability for a mutex to work across
process boundaries is unnecessary, and leads to wasted overhead. To
remove overhead associated with the standard mutex Microsoft
implemented the CriticalSection, which provides a user-level locking
mechanism. This eliminates any need to make a system call, resulting in
much faster locking.
Even though different threading libraries will have different ways
of implementing the features described here, the key to being able to
successfully develop multithreaded applications is to understand the
fundamental concepts behind these features.
To read Part 1, go to Synchronization
and critical sections
To read Part 2, go to Synchronization
Primitives
This series of three articles was excerpted
from Multi-Core
Programming by
Shameem Akhter and Jason Roberts, published by Intel Press. Copyright
© 2006 Intel Corporation. All rights reserved. This book can be
purchased on line.
Shameem Akhter is a platform architect at Intel
Corp., focusing on single socket multicore architecture and
performance analysis. Jason Roberts is a senior software engineer at
Intel, working on a number of different multithreaded software products
ranging from desktop, handheld and embedded DSP platforms.