Threading and Parallel Programming Constructs used in multicore systems development: Part 2

As in the traditional computing environment, synchronization in the parallel programming realm is typically performed by three types of primitives: semaphores, locks, and condition variables.
A memory fence, sometimes called a memory barrier, is a processor-dependent operation that guarantees that threads see other threads' memory operations in a consistent order. To hide the granularity of these synchronization primitives, higher-level synchronization constructs are used, so that application developers need to concern themselves less with internal details.
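To make the idea concrete, here is a minimal sketch of a fence in C11; the publish/consume functions and the flag-plus-payload layout are illustrative assumptions, not from the text:

    #include <stdatomic.h>

    int data;                  /* ordinary (non-atomic) payload                */
    atomic_int ready = 0;      /* flag telling the consumer the payload is set */

    void publish(int v) {
        data = v;                                     /* 1: write the payload */
        atomic_thread_fence(memory_order_release);    /* 2: order it first    */
        atomic_store_explicit(&ready, 1, memory_order_relaxed);
    }

    int consume(void) {
        while (!atomic_load_explicit(&ready, memory_order_relaxed))
            ;                                         /* wait for the flag    */
        atomic_thread_fence(memory_order_acquire);    /* pairs with release   */
        return data;                                  /* guaranteed to see v  */
    }

The paired fences guarantee that once the consumer observes ready == 1, it also observes the write to data, regardless of how the processor would otherwise reorder the operations.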
Semaphores, the first set of software-oriented primitives for accomplishing mutual exclusion in parallel process synchronization, were introduced by the well-known mathematician Edsger Dijkstra in his 1968 paper, "The Structure of the "THE"-Multiprogramming System" (Dijkstra 1968).
Dijkstra illustrated that synchronization can be achieved using only traditional machine instructions or a hierarchical structure. He proposed that a semaphore can be represented by an integer, sem, and be operated on by two basic atomic operations, P (proberen, which means test) and V (verhogen, which means increment).
These atomic operations are also referred to as synchronizing primitives. Even though the details of Dijkstra's semaphore representation have evolved, the fundamental principle remains the same: P represents the "potential delay" or "wait" of a thread, and V represents the "barrier removal" or "release" of a thread. These two synchronizing primitives can be represented for a semaphore s as follows:
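The book's listing is not reproduced here; the following is a minimal busy-waiting sketch in C11 under the assumption that sem starts at 1. Real implementations block the delayed thread on a waiting list, as described below, rather than letting it spin:

    #include <stdatomic.h>

    atomic_int sem = 1;   /* initialized to 0 or 1 before the threads start */

    void P(void) {        /* proberen: test, then decrement                 */
        int v;
        do {
            while ((v = atomic_load(&sem)) <= 0)
                ;         /* the calling thread (Dijkstra's T) is delayed   */
        } while (!atomic_compare_exchange_weak(&sem, &v, v - 1));
    }

    void V(void) {        /* verhogen: increment, releasing one thread      */
        atomic_fetch_add(&sem, 1);
    }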
where the semaphore value sem is initialized to 0 or 1 before the parallel threads start. In Dijkstra's representation, T referred to processes; threads are used here instead, both to be more precise and to remain consistent about the differences between threads and processes.
The P operation blocks the calling thread as long as the value remains 0, whereas the V operation, independent of the P operation, signals a blocked thread to allow it to resume.
These P and V operations are indivisible actions: threads that invoke them simultaneously are serviced one at a time. A positive value of sem represents the number of threads that can proceed without blocking; a negative value represents the number of blocked threads.
When the sem value is zero, no thread is waiting, and a thread that then needs to decrement sem blocks and places itself on a waiting list. When the value of sem is restricted to 0 and 1, the semaphore is a binary semaphore.
In use, a semaphore can be treated simply as a counter that supports these two atomic operations. Implementations of semaphores vary. From a usability perspective, two kinds of semaphores exist: strong and weak.
The two kinds differ in whether individual calls on P are guaranteed to succeed. A strong semaphore maintains a First-Come-First-Served (FCFS) order among blocked threads, guaranteeing that every call on P is eventually serviced and thereby avoiding starvation.
A weak semaphore provides no such guarantee of service to a particular thread, so a thread might starve. POSIX semaphores, for example, are implemented differently from Dijkstra's definition and can leave threads starving, so they are considered weak semaphores (Reek 2002).
According to Dijkstra, the mutual exclusion of parallel threads using the P and V atomic operations is represented as follows:
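The original listing is not reproduced; a runnable analogue using POSIX semaphores follows. The worker function, the shared counter, and the thread count are my illustrative choices:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    sem_t s;                       /* binary semaphore guarding the counter */
    long counter = 0;              /* shared data in the critical section   */

    void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            sem_wait(&s);          /* P: at most one thread passes          */
            counter++;             /* critical section                      */
            sem_post(&s);          /* V: barrier removal                    */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        sem_init(&s, 0, 1);        /* s starts at 1, as Dijkstra described  */
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* always prints 400000       */
        return 0;
    }

Without the P and V calls, the unsynchronized increments would race and the final count would usually be less than 400000.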
Semaphores are largely of historical interest. They are the unstructured "goto" statements of multi-threaded programming. Most programming environments provide higher-level structured synchronization primitives.
However, like the goto, semaphores are occasionally the best primitive to use. A typical use of a semaphore is protecting a shared resource of which at most n instances are allowed to exist simultaneously. The semaphore starts out with value n. A thread that needs to acquire an instance of the resource performs operation P. It releases the resource using operation V.
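A hedged sketch of that n-instance pattern with POSIX semaphores (pool, N, and use_instance are illustrative names):

    #include <semaphore.h>

    #define N 4           /* at most N instances may be in use at once         */
    sem_t pool;           /* initialize once at startup: sem_init(&pool, 0, N) */

    void use_instance(void) {
        sem_wait(&pool);  /* P: blocks when all N instances are taken          */
        /* ... use one instance of the shared resource ... */
        sem_post(&pool);  /* V: return the instance to the pool                */
    }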
Let's examine how semaphores might be applied to the producer-consumer problem, and whether the problem can be solved with semaphores at all. Producer-consumer is a classic synchronization problem, also known as the bounded-buffer problem.
Here a producer function generates data and places it in a shared buffer, and a consumer function takes the data out of the buffer and operates on it; the producer and consumer functions execute concurrently.
Pseudo-code using a semaphore for the producer-consumer problem is shown in Figure 4.7 below.
|Figure 4.7 Pseudo-code of Producer-Consumer Problem|
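The pseudo-code itself is not reproduced here; in the spirit of the figure, a single-semaphore version using POSIX semaphores might look like this (names are mine, and the buffer operations are left as placeholder comments):

    #include <semaphore.h>

    sem_t s;    /* counts items produced but not yet consumed;
                   initialized to 0 with sem_init(&s, 0, 0)     */

    void *producer(void *arg) {
        for (;;) {
            /* produce the next item and append it to the shared buffer */
            sem_post(&s);      /* V: one more item is available          */
        }
    }

    void *consumer(void *arg) {
        for (;;) {
            sem_wait(&s);      /* P: block until an item exists          */
            /* remove the next item from the buffer and operate on it    */
        }
    }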
Here neither the producer nor the consumer maintains any order. If the producer ran forever before the consumer started, the system would require infinite buffer capacity, which is impossible. That is why the buffer size must be bounded, ensuring that when the producer gets ahead of the consumer, the producer is held back. This synchronization problem can be solved by adding one more semaphore to the solution shown in Figure 4.7 above.
The added semaphore maintains the buffer boundary, as shown in Figure 4.8 below, where sEmpty and sFull enforce the buffer-capacity constraints on the operating threads.
|Figure 4.8 Dual Semaphores Solution for Producer-Consumer Problem|
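The figure is not reproduced here; a hedged reconstruction of the dual-semaphore pattern follows, with BUF_SIZE as an assumed capacity:

    #include <semaphore.h>

    #define BUF_SIZE 8
    sem_t sEmpty;              /* free slots; initialized to BUF_SIZE      */
    sem_t sFull;               /* filled slots; initialized to 0           */

    void init_buffer(void) {
        sem_init(&sEmpty, 0, BUF_SIZE);
        sem_init(&sFull, 0, 0);
    }

    void *producer(void *arg) {
        for (;;) {
            sem_wait(&sEmpty); /* blocks while the buffer is full          */
            /* put the next item into the shared buffer                    */
            sem_post(&sFull);
        }
    }

    void *consumer(void *arg) {
        for (;;) {
            sem_wait(&sFull);  /* blocks while the buffer is empty         */
            /* take the next item from the shared buffer                   */
            sem_post(&sEmpty);
        }
    }

The producer can now run at most BUF_SIZE items ahead of the consumer before it blocks, which is exactly the bound the single-semaphore version lacked.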
Instead of using two independent semaphores in a constraint-based solution, the scheme in Figure 4.8 above can also be implemented with other synchronization primitives. The following sections discuss how to solve the producer-consumer problem using locks and condition variables.
Locks are similar to semaphores except that only a single thread can hold a lock at any one instant. Two basic atomic operations are performed on a lock:
* acquire(): Atomically waits for the lock state to become unlocked, then sets the lock state to locked.
* release(): Atomically changes the lock state from locked to unlocked.
At most one thread holds the lock at a time. A thread must acquire the lock before using a shared resource; otherwise it waits until the lock becomes available. When a thread wants to access shared data, it first acquires the lock, exclusively performs its operations on the shared data, and then releases the lock for other threads to use.
The level of lock granularity can be either coarse or fine depending on the type of shared data that needs to be protected. Coarse-grained locks incur higher lock contention than fine-grained ones.
To avoid these granularity issues, most processors support the Compare and Swap (CAS) operation, which provides a way to implement lock-free synchronization. Atomic CAS operations guarantee that shared data remains consistent among threads.
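A minimal lock-free sketch using the C11 compare-exchange (the counter and increment names are mine):

    #include <stdatomic.h>

    atomic_long counter = 0;

    void increment(void) {
        long old = atomic_load(&counter);
        /* CAS: install old + 1 only if no other thread changed counter
           since we read it; on failure, old is refreshed and we retry. */
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
            ;
    }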
If you require the use of locks, it is recommended that you use the lock inside a critical section with a single entry and single exit, as shown in Figure 4.9 below.
|Figure 4.9 A Lock Used Inside a Critical Section|
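The figure is not reproduced here; a minimal single-entry, single-exit critical section with a POSIX mutex might look like this:

    #include <pthread.h>

    pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
    long shared = 0;

    void update_shared(void) {
        pthread_mutex_lock(&L);     /* single entry into the critical section */
        shared++;                   /* all access to the shared data is here  */
        pthread_mutex_unlock(&L);   /* single exit: the lock is always freed  */
    }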
From an implementation perspective, it is always safer to use explicit locks than to rely on implicit ones, and in general a lock must not be held for long periods of time. Explicit locks are defined by the developer, whereas implicit locks come from the underlying framework in use; database engines, for example, provide locks to maintain data consistency.
In the producer-consumer problem, if the consumer wants to consume shared data before the producer has produced it, the consumer must wait. To use locks for the producer-consumer problem, the consumer must loop until the data is ready, because a lock supports no wait operation, whereas condition variables do.
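A sketch of that looping pattern, assuming a simple data_ready flag (all names are mine):

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    int data_ready = 0;            /* set by the producer under the lock    */
    int data;

    void produce_item(int v) {
        pthread_mutex_lock(&m);
        data = v;
        data_ready = 1;
        pthread_mutex_unlock(&m);
    }

    int consume_item(void) {
        for (;;) {                 /* loop: a lock offers no wait operation */
            pthread_mutex_lock(&m);
            if (data_ready) {
                int v = data;
                pthread_mutex_unlock(&m);
                return v;
            }
            pthread_mutex_unlock(&m);  /* not ready: release and poll again */
        }
    }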
An application can use different types of locks according to the constructs required to accomplish its task, but you must avoid mixing lock types within a given task. For this reason, special attention is required when using any third-party library: if your application depends on a third party for a resource R, and the third party uses lock type L for R, then any lock you apply to R must also be of type L rather than any other lock type. The following sections cover these locks and define their purposes.
Mutexes. The mutex is the simplest lock an implementation can use, and some texts use the mutex as the basis for describing locks in general. The release of a mutex does not depend on the release() operation alone: a timer attribute can be added to a mutex, and if the timer expires before a release operation, the mutex releases the code block or shared memory to other threads. A try-finally clause can also be used to make sure that the mutex is released when an exception occurs. Either technique, the timer or the try-finally clause, helps to prevent deadlock scenarios.
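C itself has no try-finally, and POSIX mutexes have no timed release, but POSIX does offer a bounded acquire, pthread_mutex_timedlock, which provides comparable deadlock protection; in this sketch (guarded_work is my name) the caller gives up if the lock cannot be taken within one second:

    #include <pthread.h>
    #include <time.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    int guarded_work(void) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 1;                   /* wait at most one second   */
        if (pthread_mutex_timedlock(&m, &deadline) != 0)
            return -1;                          /* recover, don't deadlock   */
        /* ... operate on the shared state ... */
        pthread_mutex_unlock(&m);               /* the "finally" step        */
        return 0;
    }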
Recursive Locks. Recursive locks are locks that may be repeatedly acquired by the thread that currently owns the lock without causing the thread to deadlock. No other thread may acquire a recursive lock until the owner releases it once for each time the owner acquired it.
Thus when using a recursive lock, be sure to balance acquire operations with release operations. The best way to do this is to lexically balance the operations around single-entry single-exit blocks, as was shown for ordinary locks.
A recursive lock is most useful inside a recursive function, though in general recursive locks are slower than nonrecursive locks. An example of recursive lock use is shown in Figure 4.10 below.
|Figure 4.10 An Example of Recursive Lock Use|
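The figure is not reproduced here; a hedged POSIX version follows, in which add_tree re-acquires the lock on each recursive call:

    #include <pthread.h>

    pthread_mutex_t r;              /* initialized as recursive in main      */
    long total = 0;

    void add_tree(int depth) {
        pthread_mutex_lock(&r);     /* re-acquired on every recursive call;
                                       a non-recursive mutex would deadlock  */
        total += depth;
        if (depth > 0)
            add_tree(depth - 1);
        pthread_mutex_unlock(&r);   /* one release per acquire               */
    }

    int main(void) {
        pthread_mutexattr_t a;
        pthread_mutexattr_init(&a);
        pthread_mutexattr_settype(&a, PTHREAD_MUTEX_RECURSIVE);
        pthread_mutex_init(&r, &a);
        add_tree(3);
        return 0;
    }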
Read-Write Locks. Read-write locks are also called shared-exclusive locks, multiple-read/single-write locks, or non-mutual-exclusion semaphores. Read-write locks allow simultaneous read access by multiple threads but limit write access to a single thread.
This type of lock is efficient in situations where multiple threads need to read shared data simultaneously but do not necessarily need to write it. For lengthy shared data, it is sometimes better to break the data into smaller segments and operate multiple read-write locks over the dataset rather than hold one data lock for a long period of time.
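A brief sketch with the POSIX read-write lock type (the table and its accessor names are mine):

    #include <pthread.h>

    pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
    int table[100];

    int lookup(int i) {
        pthread_rwlock_rdlock(&rw);   /* many readers may hold this at once */
        int v = table[i];
        pthread_rwlock_unlock(&rw);
        return v;
    }

    void store(int i, int v) {
        pthread_rwlock_wrlock(&rw);   /* a writer gets exclusive access     */
        table[i] = v;
        pthread_rwlock_unlock(&rw);
    }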
Spin Locks. Spin locks are non-blocking locks owned by a thread. Waiting threads must "spin," that is, poll the state of the lock rather than block. Spin locks are used mostly on multiprocessor systems, because while a thread spins on a single-core processor, no processor resources are available to run the other thread, the one that would release the lock. The appropriate condition for using spin locks is when the hold time of a lock is less than the cost of blocking and waking a thread. Transferring control between threads involves context switching and updating thread data structures, which can require more instruction cycles than simply spinning. The spin time of spin locks should be limited to about 50 to 100 percent of a thread context switch (Kleiman 1996), and spin locks should not be held during calls to other subsystems. Improper use of spin locks can cause thread starvation, so think carefully before using this mechanism. The starvation problem can be alleviated by a queuing technique, in which every waiting thread spins on a separate local flag in memory, serviced in First-In, First-Out (FIFO) order.
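For reference, a minimal test-and-set spin lock in C11, without the FIFO queuing refinement, looks like this:

    #include <stdatomic.h>

    typedef struct {
        atomic_flag held;   /* initialize with: spinlock_t l = { ATOMIC_FLAG_INIT }; */
    } spinlock_t;

    void spin_lock(spinlock_t *l) {
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;               /* spin: poll the flag rather than block        */
    }

    void spin_unlock(spinlock_t *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }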
Next in Part 3: Condition variables, messages and flow control.
To read Part 1, go to Synchronization and critical sections.
This series of three articles was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts, published by Intel Press. Copyright © 2006 Intel Corporation. All rights reserved. This book can be purchased online.
Shameem Akhter is a platform architect at Intel Corp., focusing on single-socket multicore architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, working on a number of multithreaded software products ranging from desktop to handheld and embedded DSP platforms.
(Dijkstra 1968) Dijkstra, Edsger W. 1968. "The Structure of the "THE"-Multiprogramming System." Communications of the ACM 11:5 (May), pp. 341-346.
(Reek 2002) Reek, Kenneth A. 2002. "The Well-Tempered Semaphore: Theme with Variations." ACM SIGCSE Bulletin 34:1 (March), pp. 356-359.
(Kleiman 1996) Kleiman, Steve, Devang Shah, and Bart Smaalders. 1996. Programming with Threads. Prentice Hall.