A memory fence, sometimes called a memory barrier, is a processor-dependent operation that guarantees that threads see other threads' memory operations by maintaining a reasonable order. To hide the granularity of these synchronization primitives, higher-level synchronization constructs are used, so that application developers need concern themselves less with internal details.
Semaphores, the first set of software-oriented primitives to accomplish mutual exclusion in parallel process synchronization, were introduced by the well-known mathematician Edsger Dijkstra in his 1968 paper, “The Structure of the “THE”-Multiprogramming System” (Dijkstra 1968).
Dijkstra illustrated that synchronization can be achieved using only traditional machine instructions and a hierarchical structure. He proposed that a semaphore can be represented by an integer, sem, and showed that a semaphore can be operated on by two basic atomic operations, P (proberen, which means test) and V (verhogen, which means increment).
These atomic operations are also referred to as synchronizing primitives. Even though the details of Dijkstra's semaphore representation have evolved, the fundamental principle remains the same: P represents the “potential delay” or “wait” of a thread, and V represents the “barrier removal” or “release” of a thread. These two synchronizing primitives can be represented for a semaphore s as follows:
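The original pseudo-code for the two operations is not reproduced here; as a rough sketch, P and V might be modeled in Python as below. The class name and the use of a condition variable are illustrative, and this sketch keeps sem non-negative, whereas some formulations let it go negative to count blocked threads.

```python
import threading

class DijkstraSemaphore:
    """Illustrative semaphore: sem counts the available permits."""
    def __init__(self, initial=1):
        self.sem = initial                  # initialized to 0 or 1
        self._cond = threading.Condition()

    def P(self):
        # proberen ("test"): wait until sem > 0, then decrement atomically
        with self._cond:
            while self.sem <= 0:
                self._cond.wait()
            self.sem -= 1

    def V(self):
        # verhogen ("increment"): increment and wake one blocked thread
        with self._cond:
            self.sem += 1
            self._cond.notify()
```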
where the semaphore value sem is initialized to 0 or 1 before the parallel processes get started. In Dijkstra's representation, T referred to processes. Threads are used here instead, to be more precise and to remain consistent about the differences between threads and processes.
The P operation blocks the calling thread if the value remains 0, whereas the V operation, independent of the P operation, signals a blocked thread to allow it to resume operation.
These P and V operations are “indivisible actions”; each executes atomically. A positive value of sem represents the number of threads that can proceed without blocking, and a negative value refers to the number of blocked threads.
When the sem value becomes zero, no thread is waiting; if a thread then needs to decrement it, the thread blocks and places itself on a waiting list. When the value of sem is restricted to only 0 and 1, the semaphore is a binary semaphore.
To use a semaphore, you can consider it a counter that supports two atomic operations. Implementations of semaphores vary. From a usability perspective, two kinds of semaphores exist: strong and weak.
These differ in the guarantees they make about individual calls on P. A strong semaphore maintains a First-Come-First-Served (FCFS) model, guaranteeing that threads' calls on P are eventually served and thereby avoiding starvation.
A weak semaphore is one that does not provide any guarantee of service to a particular thread, so a thread might starve. For example, POSIX semaphores are implemented differently from what Dijkstra defined, can lead to starvation, and are therefore considered weak semaphores (Reek 2002).
According to Dijkstra, the mutual exclusion of parallel threads using the P and V atomic operations can be represented as follows:
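As a concrete illustration (not Dijkstra's original notation), mutual exclusion with a binary semaphore can be sketched in Python, where the acquire and release methods of threading.Semaphore play the roles of P and V:

```python
import threading

mutex = threading.Semaphore(1)  # binary semaphore, initialized to 1
counter = 0                     # shared data

def worker(iterations):
    global counter
    for _ in range(iterations):
        mutex.acquire()         # P: enter the critical section
        counter += 1            # at most one thread updates at a time
        mutex.release()         # V: leave the critical section

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter now equals exactly 4 * 10000; without the semaphore,
# interleaved updates could lose increments
```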
Semaphores are largely of historical interest. They are the unstructured “goto” statements of multi-threaded programming. Most programming environments provide higher-level structured synchronization primitives.
However, like the goto, semaphores are occasionally the best primitive to use. A typical use of a semaphore is protecting a shared resource of which at most n instances are allowed to exist simultaneously. The semaphore starts out with the value n. A thread that needs to acquire an instance of the resource performs operation P. It releases the resource using operation V.
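This pattern can be sketched in Python, assuming a hypothetical pool of n = 3 resource instances (the counters and names are illustrative):

```python
import threading

N = 3                              # at most n = 3 instances in use at once
pool = threading.Semaphore(N)      # the semaphore starts out with value n
in_use = 0
peak = 0
guard = threading.Lock()           # protects the two bookkeeping counters

def use_instance():
    global in_use, peak
    pool.acquire()                 # P: take one instance (blocks if none left)
    with guard:
        in_use += 1
        peak = max(peak, in_use)
    # ... work with the resource instance here ...
    with guard:
        in_use -= 1
    pool.release()                 # V: return the instance to the pool

threads = [threading.Thread(target=use_instance) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# peak never exceeds N, even with 8 competing threads
```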
Let's examine how semaphores might be used for the producer-consumer problem and whether the problem can be resolved using semaphores or not. Producer-consumer is a classic synchronization problem, also known as the bounded-buffer problem.
Here a producer function generates data to be placed in a shared buffer, and a consumer function receives the data out of the buffer and operates on it, where both producer and consumer functions execute concurrently.
Pseudo-code using a semaphore for the producer-consumer problem is shown in Figure 4.7 below.
|Figure 4.7 Pseudo-code of Producer-Consumer Problem|
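Figure 4.7 itself is not reproduced here; a minimal Python rendering of the idea it describes (a single semaphore counting available items, with illustrative names) might look like:

```python
import threading
from collections import deque

buffer = deque()                   # shared buffer (unbounded in this version)
sItems = threading.Semaphore(0)    # counts items available to consume

def producer(n):
    for i in range(n):
        buffer.append(i)           # place data in the shared buffer
        sItems.release()           # V: signal that one item is available

consumed = []

def consumer(n):
    for _ in range(n):
        sItems.acquire()           # P: block until an item exists
        consumed.append(buffer.popleft())

p = threading.Thread(target=producer, args=(5,))
c = threading.Thread(target=consumer, args=(5,))
c.start()
p.start()
p.join()
c.join()
```

Note that nothing here limits how far the producer can run ahead of the consumer, which is exactly the weakness discussed next.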
Here neither the producer nor the consumer maintains any ordering with respect to the other. If the producer function ran forever before the consumer function, the system would require infinite buffer capacity, which is not possible.
That is why the buffer size needs to be bounded to handle this type of scenario: if the producer gets ahead of the consumer, the producer must be held back. This synchronization problem can be solved by adding one more semaphore to the solution shown in Figure 4.7 above.
Adding the semaphore maintains the boundary of the buffer, as shown in Figure 4.8 below, where sEmpty and sFull retain the constraints of buffer capacity for the operating threads.
|Figure 4.8 Dual Semaphores Solution for Producer-Consumer Problem|
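Figure 4.8 itself is not reproduced; the dual-semaphore scheme it describes can be sketched in Python, with sEmpty counting free slots and sFull counting filled ones (buffer capacity chosen arbitrarily):

```python
import threading
from collections import deque

CAPACITY = 4
buffer = deque()
sEmpty = threading.Semaphore(CAPACITY)  # free slots remaining in the buffer
sFull = threading.Semaphore(0)          # items available to consume

def producer(items):
    for item in items:
        sEmpty.acquire()                # P: wait for a free slot
        buffer.append(item)
        sFull.release()                 # V: one more item available

consumed = []

def consumer(count):
    for _ in range(count):
        sFull.acquire()                 # P: wait for an item
        consumed.append(buffer.popleft())
        sEmpty.release()                # V: one more slot free

p = threading.Thread(target=producer, args=(list(range(10)),))
c = threading.Thread(target=consumer, args=(10,))
p.start()
c.start()
p.join()
c.join()
```

The producer can never be more than CAPACITY items ahead of the consumer, because sEmpty blocks it once the buffer is full.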
Instead of using two independent semaphores and having a constraint-based solution, the solution in Figure 4.8 above can be implemented using other synchronization primitives as well. The following sections discuss how to solve the producer-consumer problem using locks and condition variables.
Locks are similar to semaphores except that only a single thread can hold a lock at any one time. Two basic atomic operations get performed on a lock:
* acquire(): Atomically waits for the lock state to become unlocked and then sets the lock state to locked.
* release(): Atomically changes the lock state from locked to unlocked.
At most one thread holds the lock at a time. A thread has to acquire a lock before using a shared resource; otherwise it waits until the lock becomes available.
When one thread wants to access shared data, it first acquires the lock, exclusively performs operations on the shared data, and later releases the lock for other threads to use.
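In Python, for instance, this acquire/operate/release sequence looks like the following; the with statement bounds the critical section with a single entry and exit and guarantees the release:

```python
import threading

lock = threading.Lock()
balance = 0                     # shared data

def deposit(amount, times):
    global balance
    for _ in range(times):
        with lock:              # acquire(): wait for unlocked, set to locked
            balance += amount   # exclusive access to the shared data
        # release() happens automatically when the with block exits

threads = [threading.Thread(target=deposit, args=(1, 5000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# balance is exactly 4 * 5000; unsynchronized updates could lose some
```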
The level of granularity can be either coarse or fine depending on the type of shared data that needs to be protected from threads. Coarse-grained locks have higher lock contention than fine-grained ones.
To mitigate issues with lock granularity, most processors support the Compare and Swap (CAS) operation, which provides a way to implement lock-free synchronization. Atomic CAS operations guarantee that shared data remains synchronized among threads.
If you require the use of locks, it is recommended that you use the lock inside a critical section with a single entry and single exit, as shown in Figure 4.9 below.
|Figure 4.9 A Lock Used Inside a Critical Section|
From an implementation perspective, it is always safer to use explicit locks rather than relying on implicit locks. In general, a lock must not be held for long periods of time. Explicit locks are defined by the developer, whereas implicit locks come from the underlying framework, such as a database engine that provides locks to maintain data consistency.
In the producer-consumer problem, if the consumer wants to consume shared data before the producer produces it, the consumer must wait. To use locks for the producer-consumer problem, the consumer must loop until the data is ready from the producer. The reason for looping is that the lock does not support any wait operation, whereas condition variables do.
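A sketch of this busy-wait approach in Python (names illustrative) shows why it is wasteful compared to a condition variable's wait:

```python
import threading

lock = threading.Lock()
data = None
ready = False

def producer():
    global data, ready
    with lock:
        data = 42              # produce the shared data
        ready = True

results = []

def consumer():
    while True:                # must loop: a lock offers no wait operation
        with lock:
            if ready:
                results.append(data)
                return
        # not ready yet -- release the lock and spin around again

c = threading.Thread(target=consumer)
p = threading.Thread(target=producer)
c.start()
p.start()
p.join()
c.join()
```

The consumer burns CPU cycles re-checking the flag; a condition variable would let it block until the producer signals.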
An application can have different types of locks according to the constructs required to accomplish the task. You must avoid mixing lock types within a given task. For this reason, special attention is required when using any third-party library.
If your application has a third-party dependency for a resource R, and the third party uses lock type L for R, then if you need to use a lock mechanism for R, you must use lock type L rather than any other lock type. The following sections cover these locks and define their purposes.
Mutexes. The mutex is the simplest lock an implementation can use. Some texts use the mutex as the basis to describe locks in general. The release of a mutex does not depend on the release() operation only; a timer attribute can be added to a mutex.
If the timer expires before a release operation, the mutex releases the code block or shared memory to other threads. A try-finally clause can be used to make sure that the mutex gets released when an exception occurs. The use of a timer or try-finally clause helps to prevent a deadlock scenario.
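In Python's threading module, for example, the timeout is expressed on the acquire rather than on the hold time, but the try-finally idiom is the same (the function and variable names are illustrative):

```python
import threading

mutex = threading.Lock()

def guarded_update(shared):
    # bound the wait so a stuck owner cannot block this thread forever
    if not mutex.acquire(timeout=1.0):
        return False                     # could not get the mutex in time
    try:
        shared.append(len(shared))       # critical section
    finally:
        mutex.release()                  # released even if an exception occurs
    return True

shared = []
ok = guarded_update(shared)
```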
Recursive Locks. Recursive locks are locks that may be repeatedly acquired by the thread that currently owns the lock without causing that thread to deadlock. No other thread may acquire a recursive lock until the owner releases it once for each time the owner acquired it.
Thus, when using a recursive lock, be sure to balance acquire operations with release operations. The best way to do this is to lexically balance the operations around single-entry, single-exit blocks, as was shown for ordinary locks.
The recursive lock is most useful inside a recursive function. In general, recursive locks are slower than nonrecursive locks. An example of recursive lock use is shown in Figure 4.10 below.
|Figure 4.10 An Example of Recursive Lock Use|
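Figure 4.10 itself is not reproduced; one way to illustrate the idea in Python uses threading.RLock inside a recursive function, where each nested call re-acquires the lock the thread already owns:

```python
import threading

rlock = threading.RLock()       # may be re-acquired by its current owner
depths = []                     # records each nesting level reached

def factorial(n):
    with rlock:                 # every recursive call re-acquires rlock;
        depths.append(n)        # a plain Lock here would self-deadlock
        if n <= 1:
            return 1
        return n * factorial(n - 1)

result = factorial(5)
```

The with statement lexically balances each acquire with a release, so the lock is fully released exactly when the outermost call returns.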
Read-Write Locks. Read-write locks are also called shared-exclusive locks, multiple-read/single-write locks, or non-mutual-exclusion semaphores. Read-write locks allow simultaneous read access to multiple threads but limit write access to only one thread.
This type of lock can be used efficiently in those instances where multiple threads need to read shared data simultaneously but do not necessarily need to perform a write operation. For lengthy shared data, it is sometimes better to break the data into smaller segments and operate multiple read-write locks on the dataset rather than have a single data lock for a longer period of time.
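Python's standard library has no read-write lock, but a minimal sketch of one (reader-preferring, so writers can starve under heavy read load) conveys the idea:

```python
import threading

class ReadWriteLock:
    """Minimal multiple-read/single-write lock sketch."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:                  # briefly excludes a writer while registering
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()   # last reader lets a writer proceed

    def acquire_write(self):
        self._cond.acquire()              # excludes other writers and new readers
        while self._readers > 0:
            self._cond.wait()             # wait for active readers to drain

    def release_write(self):
        self._cond.release()

rw = ReadWriteLock()
rw.acquire_read()
rw.acquire_read()                 # two readers may hold the lock at once
readers_at_peak = rw._readers
rw.release_read()
rw.release_read()
rw.acquire_write()                # exclusive once all readers are gone
rw.release_write()
```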
Spin Locks. Spin locks are non-blocking locks owned by a thread. Waiting threads must “spin,” that is, poll the state of the lock, rather than get blocked. Spin locks are used mostly on multiprocessor systems.
This is because while a thread spins on a single-core processor system, no processor resources are available to run the other thread that will release the lock. The appropriate condition for using spin locks is whenever the hold time of a lock is less than the time required to block and wake up a thread.
The change of control between threads involves context switching and updating thread data structures, which can require more instruction cycles than spinning. The spin time of spin locks should be limited to about 50 to 100 percent of a thread context switch (Kleiman 1996), and spin locks should not be held during calls to other subsystems.
Improper use of spin locks might cause thread starvation. Think carefully before using this locking mechanism. The thread starvation problem of spin locks can be alleviated by using a queuing technique, where each waiting thread spins on a separate local flag in memory, serviced in First-In, First-Out (FIFO) order.
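The basic (non-queued) spinning behavior described above can be sketched as a toy test-and-set lock in Python. This is purely illustrative: real spin locks rely on hardware atomic instructions, and a threading.Lock probed with blocking=False merely stands in for the atomic flag.

```python
import threading

class SpinLock:
    """Toy spin lock: waiters poll ("spin") instead of blocking."""
    def __init__(self):
        self._flag = threading.Lock()   # stands in for an atomic flag

    def acquire(self):
        # test-and-set loop: keep polling until the flag is grabbed
        while not self._flag.acquire(blocking=False):
            pass                        # spin: burn cycles rather than sleep

    def release(self):
        self._flag.release()

spin = SpinLock()
counter = 0

def bump(times):
    global counter
    for _ in range(times):
        spin.acquire()
        counter += 1                    # short critical section suits spinning
        spin.release()

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```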
Next in Part 3: Condition variables, messages and flow control.
To read Part 1, go to Synchronization and critical sections.
This series of three articles was excerpted from Multi-Core Programming by Shameem Akhter and Jason Roberts, published by Intel Press. Copyright © 2006 Intel Corporation. All rights reserved. This book can be purchased online.
Shameem Akhter is a platform architect at Intel Corp., focusing on single-socket multicore architecture and performance analysis. Jason Roberts is a senior software engineer at Intel, working on a number of different multithreaded software products ranging from desktop and handheld to embedded DSP platforms.
(Reek 2002) Reek, Kenneth A. 2002. “The Well-Tempered Semaphore: Theme with Variations.” ACM SIGCSE Bulletin (March), pp. 356-359.
(Kleiman 1996) Kleiman, Steve, Devang Shah, and Bart Smaalders. 1996. Programming with Threads. Prentice Hall.