Threading and Parallel Programming Constructs used in multicore systems development: Part 3 -

Threading and Parallel Programming Constructs used in multicore systems development: Part 3

Important in the parallel programming environment, condition variablesare based on Dijkstra's semaphore semantics, with theexception that nostored value is associated with the operation.

This means condition variables do not contain the actual conditionto test; a shared data state is used instead to maintain the conditionfor threads. A thread waits or wakes up other cooperative threads untila condition is satisfied.

The condition variables are preferable to locks when poolingrequires and needs some scheduling behavior among threads. To operateon shared data, condition variable C , uses a lock, L . Three basicatomic operations are performed on a condition variable C :

* wait(L ):Atomically releases the lock and waits, where wait returns the lockbeen acquired again
* signal(L ): Enables one of thewaiting threads to run, where signal returns the lock is still acquired
* broadcast(L ): Enables all of thewaiting threads to run, where broadcast returns the lock is stillacquired

To control a pool of threads, use of a signal function isrecommended. The penalty for using a broadcast-based signaling functioncould be severe and extra caution needs to be undertaken before wakingup all waiting threads.

For some instances, however, broadcast signaling can be effective.As an example, a “write” lock might allow all “readers” to proceed atthe same time by using a broadcast mechanism.

To show the use of a condition variable for a synchronizationproblem, the pseudocode in Figure4.11 below solves the producer-consumer problem discussedearlier. A variable LC is used to maintain theassociation between condition variable C and an associated lock L .

Figure4.11 Use of a Condition Variable for the Producer-Consumer ProblemMonitors ,

For structured synchronization, a higher level construct isintroduced for simplifying the use of condition variables and locks,known as a monitor. The purpose of the monitor is to simplify thecomplexity of primitive synchronization operations and remove theimplementation details from application developers.

The compiler for the language that supports monitors automaticallyinserts lock operations at the beginning and the end of eachsynchronization-aware routine.

Most recent programming languages do not support monitor explicitly,rather they expose lock and unlock operations to the developers. TheJava language supports explicit monitor objects along with synchronizedblocks inside a method. In Java, the monitor is maintained by the”synchronized” constructs, such as

where the “condition” primitives are used by wait(), notify(), ornotifyAll() methods. Do not confuse this with the Monitor object in theJava SDK though. The Java Monitor object is used to perform resourcemanagement in Java Management Extension (JMX). Similarly, the monitorobject in C# is used as lock construct.

The message is a special method of communication to transferinformation or a signal from one domain to another. The definition ofdomain is different for different scenarios.

For multi-threading environments, the domain is referred to as theboundary of a thread. The three M's of message passing aremulti-granularity, multithreading, and multitasking (Ang 1996).

In general, the conceptual representations of messages getassociated with processes rather than threads. From a message-sharingperspective, messages get shared using an intra-process, inter-process,or process-process approach, as shown in Figure 4.12 below.

Figure4.12 Message Passing Model

Two threads that communicate with messages and reside in the sameprocess use intra-process messaging. Two threads that communicate andreside in different processes use inter-process messaging.

From the developer's perspective, the most common form of messagingis the process-process approach, when two processes communicate witheach other rather than depending on the thread.

In general, the messaging could be devised according to the memorymodel of the environment where the messaging operation takes place.Messaging for the shared memory model must be synchronous, whereas forthe distributed model messaging can be asynchronous. These operationscan be viewed at a somewhat different angle.

When there is nothing to do after sending the message and the senderhas to wait for the reply to come, the operations need to besynchronous, whereas if the sender does not need to wait for the replyto arrive and in order to proceed then the operation can beasynchronous.

The generic form of message communication can be represented asfollows:

The generic form of message passing gives the impression todevelopers that there must be some interface used to perform messagepassing. The most common interface is the MessagePassing Interface(MPI). MPI is used as the medium of communication, as illustratedin Figure 4.13, below .

Figure4.13 Basic MPI Communication Environment

To synchronize operations of threads, semaphores, locks, andcondition variables are used. These synchronization primitives conveystatus and access information.

To communicate data, they use thread messaging. In thread messaging,synchronization remains explicit, as there is acknowledgement afterreceiving messages.

The acknowledgement avoids primitive synchronization errors, such asdeadlocks or race conditions. The basic operational concepts ofmessaging remain the same for all operational models. From animplementation point of view, the generic client-server model can beused for all messaging models.

Inside hardware, message processing occurs in relationship with thesize of the message. Small messages are transferred between processorregisters and if a message is too large for processor registers, cachesget used. Larger messages require main memory. In the case of thelargest messages, the system might use processor-external DMA, as shownin Figure 4.14 below .

Figure4.14 System Components Associated with Size of Messages

Flow Control-based Concepts
In the parallel computing domain, some restraining mechanisms allowsynchronization among multiple attributes or actions in a system. Theseare mainly applicable for shared-memory multiprocessor or multi-coreenvironments. The following sections covers only two of these concepts,fence and barrier .

Fence. The fence mechanism is implemented using instructions and in fact, mostof the languages and systems refer to this mechanism as a fenceinstruction. On a shared memory multiprocessor or multi-coreenvironment, a fence instruction ensures consistent memory operations.

At execution time, the fence instruction guarantees completeness ofall pre-fence memory operations and halts all post-fence memoryoperations until the completion of fence instruction cycles. This fencemechanism ensures proper memory mapping from software to hardwarememory models, as shown in Figure4.15, below .

The semantics of the fence instruction depend on the architecture.The software memory model implicitly supports fence instructions. Usingfence instructions explicitly could be error-prone and it is better torely on compiler technologies. Due to the performance penalty of fenceinstructions, the number of memory fences needs to be optimized.

Figure4.15 Fence Mechanism

Barrier .The barrier mechanism is a synchronization method by which threads inthe same set keep collaborating with respect to a logical computationalpoint in the control flow of operations. Through this method, a threadfrom an operational set has to wait for all other threads in that setto complete in order to be able to proceed to the next execution step.

This method guarantees that no threads proceed beyond an executionlogical point until all threads have arrived at that logical point.Barrier synchronization is one of the common operations for sharedmemory multiprocessor and multi-core environments.

Due to the aspect of waiting for a barrier control point in theexecution flow, the barrier synchronization wait function for iththread can be represented as

(W barrier ) i = f ((T barrier )i, (R thread ) i )

where W barrier is the wait time for athread, T barrier is the number of threads hasarrived, and R thread is the arrival rate ofthreads.

For performance consideration and to keep the wait time within areasonable timing window before hitting a performance penalty, specialconsideration must be given to the granularity of tasks. Otherwise, theimplementation might suffer significantly.

Implementation-dependent ThreadingFeatures
The functionalities and features of threads in differentenvironmentsare very similar; however the semantics could be different. That is whythe conceptual representations of threads in Windows and Linux remainthe same, even though the way some concepts are implemented could bedifferent.

Also, with the threading APIs of Win32, Win64, and POSIXthreads(Pthreads), the semantics are different as well. Windows threadingAPIsare implemented and maintained by Microsoft and work on Windows only,whereas the implementation of Pthreads APIs allows developers toimplement threads on multiple platforms.

The IEEE only defined the Pthreads APIs and let the implementationbe done by OS developers. Due to the implementation issues of Pthreads,not all features exist in Pthreads APIs. Developers use Pthreads as awrapper of their own thread implementations. There exists a nativeLinux Pthreads library similar to Windows native threads, known asNativePOSIX Thread Library (NPTL).

Consider the different mechanisms used to signal threads in Windowsand in POSIX threads. Windows uses an event model to signal one or morethreads that an event has occurred. However, no counterpart to Windowsevents is implemented in POSIX threads. Instead, condition variablesare used for this purpose.

These differences are not necessarily limited to cross-libraryboundaries. There may be variations within a single library as well.For example, in the Windows Win32 API, Microsoft has implemented twoversions of a mutex.

The first version, simply referred to as a mutex, provides onemethod for providing synchronized access to a critical section of thecode. The other mechanism, referred to as a CriticalSection,essentially does the same thing, with a completely different API.What's the difference?

The conventional mutex object in Windows is a kernel mechanism. As aresult, it requires a user-mode to kernel-mode transition to work. Thisis an expensive operation, but provides a more powerful synchronizationmechanism that can be used across process boundaries.

However, in many applications, synchronization is only needed withina single process. Therefore, the ability for a mutex to work acrossprocess boundaries is unnecessary, and leads to wasted overhead. Toremove overhead associated with the standard mutex Microsoftimplemented the CriticalSection, which provides a user-level lockingmechanism. This eliminates any need to make a system call, resulting inmuch faster locking.

Even though different threading libraries will have different waysof implementing the features described here, the key to being able tosuccessfully develop multithreaded applications is to understand thefundamental concepts behind these features.

To read Part 1, go to Synchronizationand critical sections
To read Part 2 , go to SynchronizationPrimitives

This series of three articles was excerptedfrom Multi-CoreProgramming byShameem Akhter and Jason Roberts, published by Intel Press. Copyright© 2006 Intel Corporation. All rights reserved. This book can bepurchased on line.

Shameem Akhter is a platform architect at IntelCorp., focusing on single socket multicore architecture andperformance analysis. Jason Roberts is a senior software engineer atIntel, working on a number of different multithreaded software productsranging from desktop, handheld and embedded DSP platforms.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.