Threading and Parallel Programming Constructs used in multicore systems development: Part 2
By Shameem Akhter and Jason Roberts, Intel Corp.
As in traditional computing environment, in the
parallel
programming realm synchronization is typically performed by
three types of primitives:
semaphores, locks,
and
condition variables.
The use of these primitives depends on the application requirements.
These synchronization primitives are
implemented by atomic operations and use
appropriate memory fences.
A memory fence, sometimes called a memory barrier, is a processor
dependent operation that guarantees that threads see other threads'
memory operations by maintaining reasonable order. To hide the
granularity of these synchronization primitives, higher level
synchronizations are used. That way application developers have to
concern themselves less about internal details.
Semaphores
Semaphores, the first set of software-oriented primitives to accomplish
mutual exclusion of parallel process synchronization, were introduced
by the well known mathematician Edsger Dijkstra
in his 1968 paper, "The
Structure of the "THE"-Multiprogramming System" (Dijkstra 1968).
Dijkstra illustrated that synchronization can be achieved by using
only traditional machine instructions or hierarchical structure. He
proposed that a semaphore can be represented by an integer, sem, and
showed that a semaphore can be bounded by two basic atomic operations,
P (proberen, which means test) and V (verhogen, which means increment).
These atomic operations are also referred as synchronizing
primitives. Even though the details of Dijkstra's semaphore
representation have evolved, the fundamental principle remains same.
Where, P represents the "potential delay" or "wait" and V represents
the "barrier removal" or "release" of a thread. These two synchronizing
primitives can be represented for a semaphore s as follows:
where semaphore value sem is initialized with the value 0 or 1
before the parallel processes get started. In Dijkstra's
representation, T referred to processes. Threads are used here instead
to be more precise and to remain consistent about the differences
between threads and processes.
The P operation blocks the calling thread if the value remains 0,
whereas the V operation, independent of P operation, signals a blocked
thread to allow it to resume operation.
These P and V operations are "indivisible actions" and perform
simultaneously. The positive value of sem represents the number of
threads that can proceed without blocking, and the negative number
refers to the number of blocked threads.
When the sem value becomes zero, no thread is waiting, and if a
thread needs to decrement, the thread gets blocked and keeps itself in
a waiting list. When the value of sem gets restricted to only 0 and 1,
the semaphore is a binary semaphore.
To use semaphore, you can consider semaphore as a counter, which
supports two atomic operations. Implementation of semaphores varies.
From a usability perspective, two kinds of semaphores exist: strong and
weak.
These represent the success of individual calls on P. A strong
semaphore maintains First-Come-First-Serve (FCFS) model and
provides guarantee to threads to calls on P and avoid starvation.
And a weak semaphore is the one which does not provide any guarantee
of service to a particular thread and the thread might starve. For
example, in POSIX, the semaphores
can get into starvation status and implemented differently than what
Dijkstra defined and considered as a weak semaphore (Reek 2002).
According to Dijkstra, the mutual exclusion of parallel threads
using P and V atomic operations represented as follows:
Semaphores are largely of historical interest. They are the
unstructured "goto" statements
of multi-threaded programming. Most
programming environments provide higher-level structured
synchronization primitives.
However, like the goto, semaphores are occasionally the best
primitive to use. A typical use of a semaphore is protecting a shared
resource of which at most n instances are allowed to exist
simultaneously. The semaphore starts out with value n. A thread that
needs to acquire an instance of the resource performs operation P. It
releases the resource using operation V.
Let's examine how semaphores might be used for the producer-consumer
problem and whether the problem can be resolved using semaphores or
not. Producer-consumer is a classic synchronization
problem, also known as the bounded-buffer
problem.
Here a producer function generates data to be placed in a shared
buffer and a consumer function receives the data out of the buffer and
operates on it, where both producer and consumer functions execute
concurrently.
Pseudo-code using a semaphore
for the producer-consumer problem is shown in Figure 4.7 below.
 |
| Figure
4.7 Pseudo-code of Producer-Consumer Problem |
Here neither producer nor consumer maintains any order. If the
producer function operates forever prior to the consumer function then
the system would require an infinite capacity and that is not possible.
That is why the buffer size needs to be within a boundary to handle
this type of scenario and make sure that if the producer gets ahead of
the consumer then the time allocated for the producer must be
restricted. The problem of synchronization can be removed by adding one
more semaphores in the previous solution shown in Figure 4.7 above.
Adding the semaphore would maintain the boundary of buffer as shown
in Figure 4.8 below, where
sEmpty and sFull retain the constraints of buffer capacity for
operating threads.
 |
| Figure
4.8 Dual Semaphores Solution for Producer-Consumer Problem |
Instead of using two independent semaphores and having a
constraint-based solution, the solution in Figure 4.8 above can be implemented
using other synchronization primitives as well. The following sections
discuss how to solve the producer-consumer problem using locks and
conditional variables primitives.
Locks
Locks are similar to semaphores except that a single thread handles a
lock at one instance. Two basic atomic operations get performed on a
lock:
* acquire():
Atomically waits for the lock state to be unlocked and sets the lock
state to lock.
* release():
Atomically changes the lock state from locked to unlocked.
At most one thread acquires the lock. A thread has to acquire a lock
before using a shared resource; otherwise it waits until the lock
becomes available.
When one thread wants to access shared data, it first acquires the
lock, exclusively performs operations on the shared data and later
releases the lock for other threads to use.
The level of granularity can be either
coarse or fine depending on the type of shared data that needs to be
protected from threads. The coarse granular locks have higher lock
contention than finer granular ones.
To remove issues with lock granularity, most of the processors
support the Compare and Swap (CAS)
operation, which provides a way to implement lock-free synchronization.
The atomic CAS operations guarantee that the shared data remains
synchronized among threads.
If you require the use of locks, it is recommended that you use the
lock inside a critical section with a single entry and single exit, as
shown in Figure 4.9 below.
 |
| Figure
4.9 A Lock Used Inside a Critical Section |
From an implementation perspective, it is always safe to use
explicit locks rather than relying on implicit locks. In general a lock
must not be held for a long periods of time. The explicit locks are
defined by the developer, whereas implicit locks come from the
underlying framework used, such as database engines provides lock the
maintain data consistency.
In the produce-consumer problem, if the consumer wants to consume a
shared data before the producer produces, it must wait. To use locks
for the producer-consumer problem, the consumer must loop until the
data is ready from the producer. The reason for looping is that the
lock does not support any wait operation, whereas Condition Variables
does.
Lock Types
An application can have different types of locks according to the
constructs required to accomplish the task. You must avoid mixing lock
types within a given task. For this reason, special attention is
required when using any third party library.
If your application has some third party dependency for a resource R
and the third party uses lock type L for R, then if you need to use a
lock mechanism for R, you must use lock type L rather any other lock
type. The following sections cover these locks and define their
purposes.
Mutexes. The
mutex is the simplest lock an implementation can use. Some texts use
the mutex as the basis to describe locks in general. The release of a
mutex does not depend on the release() operation only. A timer
attribute can be added with a mutex.
If the timer expires before a release operation, the mutex releases
the code block or shared memory to other threads. A try-finally clause
can be used to make sure that the mutex gets released when an exception
occurs. The use of a timer or try-finally clause helps to prevent a
deadlock scenario.
Recursive Locks. Recursive
locks are locks that may be repeatedly acquired by the thread that
currently owns the lock without causing the thread to deadlock. No
other thread may acquire a recursive lock until the owner releases it
once for each time the owner acquired it.
Thus when using a recursive lock, be sure to balance acquire
operations with release operations. The best way to do this is to
lexically balance the operations around single-entry single-exit
blocks, as was shown for ordinary locks.
The recursive lock is most useful inside a recursive function. In
general, the recursive locks are slower than nonrecursive locks. An
example of recursive locks use is shown in Figure 4.10 below.
 |
| Figure
4.10 An Example of Recursive Lock Use |
Read-Write Locks. Read-Write
locks are also called shared-exclusive or multiple-read/single-write
locks or non-mutual exclusion semaphores. Read-write locks allow
simultaneous read access to multiple threads but limit the write access
to only one thread.
This type of lock can be used efficiently for those instances where
multiple threads need to read shared data simultaneously but do not
necessarily need to perform a write operation. For lengthy shared data,
it is sometimes better to break the data into smaller segments and
operate multiple read-write locks on the dataset rather than having a
data lock for a longer period of time.
Spin Locks.
Spin locks are non-blocking locks owned by a thread. Waiting threads
must "spin," that is, poll the state of a lock rather than get blocked.
Spin locks are used mostly on multiprocessor systems.
This is because while the thread spins in a single-core processor
system, no process resources are available to run the other thread that
will release the lock. The appropriate condition for using spin locks
is whenever the hold time of a lock is less than the time of blocking
and waking up a thread.
The change of control for threads involves context switching of
threads and updating thread data structures, which could require more
instruction cycles than spin locks. The spin time of spin locks should
be limited to about 50 to 100 percent of a thread context switch
(Kleiman
1996) and should not be held during calls to other subsystems.
Improper use of spin locks might cause thread starvation. Think
carefully before using this locking mechanism. The thread starvation
problem of spin locks can be alleviated by using a queuing technique,
where every waiting thread to spin on a separate local flag in memory
using
First-In, First-Out (FIFO) or
queue construct.
Next in Part 3, Condition
variables, messages and flow control
To read Part 1, go to Synchronization
and critical sections.
This series of three articles was excerpted from Multi-Core
Programming by
Shameem Akhter and Jason Roberts, published by Intel Press. Copyright
© 2006 Intel Corporation. All rights reserved. This book can be
purchased on line.
Shameem
Akhter is a platform architect at Intel Corp., focusing on single
socket multicore architecture and performance analysis. Jason Roberts
is a senior software engineer at Intel, working on a number of
different multithreaded software products ranging from desktop,
handheld and embedded DSP platforms.
References:
(Reek 2002) Reek, Kenneth A. 2002. The
Well-Tempered Semaphore: Theme with Variations. ACM SIGSE Bulletin
(March), Pp. 356-359.
(Kleiman
1996) Kleiman, Steve, Devang Shah
and Bart Smaaslders, 1996 Programming with Threads, Prentice Hall