By Rishiyur S. Nikhil (Bluespec) and Arvind (MIT)
A well-known method for parallel programming is to extend a sequential
language with threads. These may be regarded as multiple sequential
programs that have the same view of all data-two threads may hold the
address of the same variable and they both access the contents of the
variable with the standard notations for variable access.
Local variables of a procedure invocation in one thread are private,
or inaccessible to other threads, in much the same way that each
recursive invocation of a procedure in a sequential program refers to
its own copy of local variables.
Threads
One notation to express creation of a new thread replaces a function
call with a fork:
F(i)
replaced by
fork F(i)
When the machine encounters a normal procedure call, it suspends
execution at that point, executes F to completion, returns, and
continues execution of the sequel. When the machine encounters a fork,
on the other hand, it simultaneously executes F and the sequel.
The two threads are called the "parent" and "child" threads,
respectively. For example, we could express a parallel loop as follows
(we switch from Fortran to C notation, because thread-and-lock based
programming is more common in C than in Fortran):
int main (. . .)
{
float A [N] ;
for (i = 0; i
< N; i++) {
fork BODY(i,
A);
}
. . .
. . .
}
void BODY (int i, float A[])
{
A[i] =
f(A[i]);
}
At the end of the for loop, we have N child threads running in
parallel, executing BODY(O,A), ..., BODY(N-1,A) respectively. Each one,
in parallel, computes with one element of A. Note that each thread has
its copy of i but, because of C's procedure call semantics, A refers to
the same array for all of them.
A fundamental problem with explicit threads is the choice of thread
granularity; that is, where should the programmer place fork calls for
maximum advantage? The cost of a fork is typically hundreds to
thousands of times the cost of a procedure call.
If the program generates too many threads, it may slow down because
of the overhead of thread creation and context-switching between
threads. If the program generates too few threads, then processors may
remain idle. In many programs, particularly recursive programs, it is
difficult for the programmer to find the right balance by placing forks
in just the right places.
The task is complicated by the fact that the right placement may
depend on input data, it may depend on the number of processors in the
machine on which the program is run, it may depend on the other load on
those processors, and so on; there is no single "correct" solution.
Programmers sometimes deal with this issue by writing all kinds of
conditional code that, at run-time, chooses between a fork or an
ordinary procedure call based on querying the load on the system. This
greatly obscures the original clarity of the program.
Locks
When there is a danger of a race condition, data that is shared by
multiple threads is usually protected by a lock data structure: a
thread acquires the lock, manipulates the shared data, and then
releases the lock. When a thread tries to acquire a lock, the
implementation forces it to wait if the lock is currently held by
another thread.
When the lock is released, only one of possibly many waiting threads
is allowed to acquire it. This ensures that the shared data is not
simultaneously manipulated by multiple threads and so there are no race
conditions.
Normally, only one thread may hold the lock at a time, although
there are many variations: for example, read-write locks permit many
threads that merely read the shared data to hold the lock
simultaneously, while any thread that updates the shared data must
acquire the lock exclusively.
Let us illustrate the use of locks by extending the above example.
In the above code, the parent thread has no way of knowing when the
child threads have completed; the loop simply forks the child threads
and the parent goes on.
Thus, the parent does not know when it is safe to assume that all
the new values in A are ready. One solution is to have a counter,
initialized to N, that is decremented by each loop iteration when it
has completed its work. The parent thread, observing the counter drop
to zero, infers that all child threads have completed and that A is
ready.
int main (...)
{
float A[N];
int counter = N;
for (i = 0; i < N; i++) {
fork BODY(i,
A, & counter);
}
...
while (counter > 0) { /* spin */};
...
... /* Now, A is ready */
...
...
}
void BODY(int i, float A[], int
*counter)
{
A [i] = f (A [i]) ;
*counter = *counter - 1;
}
Unfortunately, there is a race condition: each thread, in trying to
decrement the counter, reads it, subtracts one, and stores it. Now, it
is possible that two threads simultaneously read it, substract one, and
store it.
In this case the counter, instead of being decremented by two, is
only decremented by one. The counter will thus never reach zero and the
parent thread will be stuck in its spin loop.
Things could be even worse. Depending on the processor architecture
and the compiler, the counter-decrement statement of BODY may not
compile into a single read followed by a single write.
For example, counter may have a 4-byte representation, and the
compiler may produce two 2-byte reads and two 2-byte writes. If these
finer granularity operations from different threads are interleaved
there could be even more possibilities for the actual value stored and
read in the counter variable.
We introduce a lock to avoid this:
int main (...)
{
float A[N] ;
int counter = N;
lock lk;
for (i = 0; i < N; i++)
fork BODY(i,
A, & counter, &
lk);
}
...
... /* May do other work, unrelated to
A */
...
while (counter > 0) { /* spin */ }
/* Now, A is ready for use */
...
...
}
void BODY(int i,
float A[], int
*counter, lock*1k)
{
A[i] = f(A[i]);
wait(lk);
/* acquire the lock */
*counter = *counter - 1;
signal(lk);
/* release the lock */
}
The lock operations ensure that the entire read-decrement-write
sequence of the enclosed statement is performed in its entirety by one
thread before another thread can attempt it. This eliminates the race
condition.
Threads (or processes, or tasks) and locks have been used for
parallel programming for decades. They were originally developed for
operating systems, even on uniprocessor machines, in order to express
computations that are conceptually if not physically parallel.
For example, an operating system must "simultaneously" execute many
users' programs, as well as respond to unpredictable events from disks,
networks, printers, terminals, etc. This is most easily expressed as a
parallel program, even though the threads may be time-multiplexed on a
single processor.
While threads and locks have long been used in operating systems,
they started becoming visible in users' programs since the early 1980s
when so-called bus-based shared-memory multiprocessors started becoming
available.
In these machines, a small number of processors (typically less than
a dozen) are plugged into a common bus sharing a common memory system.
Users could now write parallel programs that ran faster because they
actually ran in parallel on multiple processors, without
time-multiplexing.
Higher-Level Constructs for
Shared-Memory Programming
Coordinating the interaction of parallel threads, e.g., by protecting
shared data with locks, is called synchronization of a parallel
program. It is somewhat tricky to be convinced that a thread-based
parallel program has adequate synchronization. For example, a purist
may complain that access to the counter in the spin loop (while) in the
parent thread has not been protected by lock acquisition.
Fortunately, in this case, the omission is benign-since the loop
only reads the counter and since the counter value decrements
monotonically from N down to zero, reading a zero value is still a
correct guarantee that the child processes have completed. In general,
however, all accesses to shared data should be protected by lock
acquisition.
(Actually, some C compilers
perform "optimizations" that may prevent
the parent thread from ever reading zero! Because reading the counter
variable is not protected by a lock, some C compilers might feel free
to make a copy of counter that is "closer" to the parent thread, e.g.,
in a register or some other nearby memory. In this case, the decrements
down to zero are never seen in the local copy. Declaring the variable
counter to be volatile can sometimes prevent this behavior.)
To prevent inadvertent synchronization errors many languages provide
higher-level constructs to package up common parallel paradigms. For
example, the paradigm of executing iterations of a loop in parallel and
waiting for all iterations to complete (also called a barrier) is so
common that some languages introduce a special parallel loop construct:
doall i = 1, N
A(i) = f(A(i))
end do
The compiler now manages all the details and will often use a more
sophisticated and efficient barrier implementations than the simple
counter described above.
Another common language construct is the monitor [17]. Here, the programmer declares
a shared data structure along with a collection of procedures to
manipulate it. The data structure is not directly visible to the rest
of the program, so it can only be manipulated by calling the monitor
procedures.
The compiler ensures that these procedures implicitly acquire a lock
on entry when they are called and release the lock on exit just before
returning, ensuring one-at-a-time access to the shared data. This
approach to structured synchronization is used in the language Modula [29], in the POSIX threads
extensions to C and Fortran [19],
and most recently in synchronized methods in Java [8].
Another popular language construct is the future (for example, in
the language Multilisp [15]).
In parallel programs, we often encounter the following scenario: we
wish to fork a function, which will ultimately return some result, but
we do not need the result immediately, since there is other useful work
to be done before we need the result.
However, when we eventually do need the result, we need some way of
waiting for it, since it may not be ready yet. Futures are a notation
to express this. For example, suppose we want a program that
recursively traverses a binary tree, adding up some integer information
available at each tree node. We can express this using futures, as
follows:
typedef struct tree
int info;{
struct tree *left, *right; }
*treeptr;
int tree _Add(treeptr tp)
{
future int x;
int y;
if (tp == NULL)
return 0;
else {
future x = tree-Add (tp->left);
y = tree Add (tp->right);
return touch (x) + y + tp->info;
}
}
The declaration future int x conceptually declares x to be an
integer together with a lock that is initially in the locked state. If
the tree is not empty, the future call forks a call to traverse the
left subtree. This forked call will ultimately deliver its integer
result into the integer component of x and signal the lock indicating
that the x is ready.
Meanwhile, the parent thread goes on to traverse the right subtree.
Note that this happens recursively at every level of the tree and so it
can generate threads exponentially. After the right subtree's value has
been computed (y), we wait for the result x by executing touch(x),
which waits for the lock to be signalled and then returns the integer
value in x. This is combined into the final answer and returned.
Constructs like doall, monitors, and futures make it easier to write
robust parallel programs because they hide the details of lock
manipulation completely; one is less likely to forget inadvertently to
protect shared data access with suitable lock operations.
Recently, two languages, Jade [28]
and Cilk [l2], have gone
further. Both rely on programmer annotations to indicate what can be
executed in parallel, but both provide a guarantee that parallel
execution is faithful to the semantics of sequential execution (where
the parallelism annotations are ignored). This is a powerful guarantee
because it allows one to debug the program in sequential mode using
standard tools.
In Jade, the programmer specifies that a section of code can be
executed in parallel (analagous to forking it), but must also declare
what data is accessed by the section and how (read or write). This
declaration is executable, so that it can handle dynamic data
structures.
The run-time system checks that data accesses by the parallel code
indeed conform to the declarations and introduces all the
synchronization necessary to preserve the sequential semantics. Cilk
provides extensions to C for forking threads and to wait for their
completion. A thread-fork is basically an annotation on a procedure
call.
This induces a certain precedence between the parallel activities of
a program-it is a directed acyclic graph (DAG) or partial order. The
scheduler in Cilk's run-time system is carefully designed so that
threads access memory according to this partial order, preserving the
sequential semantics.
Next in Part 4: Explicit
parallelism - message passing
To read Part 1, go to How sequential languages obscure
parallelism.
To read Part 2, go to How to achieve parallel execution?
Rishiyur
S. Nikhil and Arvind are the coauthors of Implicit
Parallel Programming in pH, published by Elsevier/Morgan
Kaufmann.
Rishiyur is Chief Technical
Officer at ESL specialist Bluespec Inc. Previous to its
acquisition by Broadcom he led the Bluespec technology team at
Sandburst Corp. For the previous 9 years he was at Cambridge Research
Laboratory (DEC/Compaq), including over a year as Acting Director.
Earlier, he was an Associate Professor of Computer Science and
Engineering at MIT. He holds several patents in functional programming,
dataflow and multithreaded architectures, parallel processing and
compiling.
Arvind is the Johnson Professor of
Computer Science and Engineering at the Massachusetts Institute of
Technology. His work laid the foundations for Bluespec, centering on
high-level specification and description of architectures and protocols
using Term Rewriting Systems (TRSs),
encompassing hardware synthesis as well as verification of
implementations against TRS specifications. Previously, he contributed
to the development of dynamic dataflow architectures, the implicitly
parallel programming languages Id and pH, and compilation of these
languages on parallel machines. He was also president and founder of
Sandburst Corp.
To
read more on this topic on Embedded.com go to More
about multicores, multithreading and multiprocessing.
References:
[1] "Microsoft:
Parallel Programming Model Ten Years off," by Rick Merritt,
Embedded.com, July 23, 2007
[2] "The Problem with Threads" by
Edward A. Lee, IEEE Computer, 39:5, May 2006, pp 33-42
[3] "Why
threads are a bad idea (for most purposes)," J.K. Ousterhout,
Invited Talk, USENIX Technical Conference, January, 1996
[4] S. Ahuja, N. Carriero, and
D. Gelernter. "Linda
and Friends," IEEE Computer, 19(8):26-34, August 1986.
[5] Z. Ariola and Arvind. "Properties
of a First-order Functional Language with Sharing." Theoretical
Computer Science, 146(1-2):69-108, July 1995.
[6] J. Armstrong. "The Development of
Erlang." In Proc. Intl. Conf. on Functional Programming (ICFP),
Amsterdam, The Netherlands, pages 196-203, 1997.
[7] J. Armstrong, R. Virding,
and M. Williams. "Concurrent
Programming in Erlang." Prentice Hall, 1993. ISBN: 0-13-285792-8.
[8] K. Arnold and J. Gosling. The Java
Programming Language, second edition. Addison-Wesley, 1998. ISBN
0-201-31006-6.
[9] Arvind, A. Caro, J.-W.
Maessen, and S. Aditya. A
Multithreaded Substrate and Compilation Model for the Implicitly
Parallel Language pH. In Proc. Wkshp. on Languages and Compilers
for Parallel Computing (LCPC), August 1996.
[10] Arvind, K. P. Gostelow,
and W. Plouffe. The
(preliminary) Id Report. Technical Report 114, Dept. of Information
and Computer Science, Univ. of California, Irvine CA, USA, 1978.
[11] Arvind and R. S. Nikhil. Executing
a Program on the MIT Tagged-Token Dataflow Architecture. IEEE
Trans. on Computers, 39(3):300-318, March 1990.
[12] R. D. Blumofe, C. F.
Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime
System. In Proc. 5th. ACM Symp. on Principles and Practice
of Parallel Programming (PPOPP), Santa Barbara CA, USA, pages 207-216,
July 19-21 1995.
[13] D. E. Culler, A. Dusseau,
S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K.
Yelick. Parallel
Programming in Split-C. In Proc. Supercomputing 93, Portland OR,
USA, November 1993.
[14] A. Geist, A. Begeulin, J.
Dongarra, W. Jiang, R. Manchek, and V. S. Sundaram. PVM
Parallel Virtual Machine. A User's Guide and Tutorial for Network
Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
[15] R. H. Halstead. Multilisp:
A
Language for Concurrent Symbolic Computation. ACM Trans. on
Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
[16] High Performance
Fortran Forum. High Performance Fortran Language Specification,
Version 1.0. Technical Report CRPC-TR92225, Center for Research on
Parallel Computation, Rice University, Houston TX, USA, May 3 1993.
[17] C. Hoare. Monitors:
an Operating System Structuring Concept. Communications of the ACM,
17(10):549-557, October 1974.
[18] R. Hughes. Why
Functional Programming Matters. The Computer Journal, 32(2):98-107,
1989.
[19] Threads
standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
[20] Message Passing Interface
Forum. MPI:
A Message-Passing Interface Standard, May 1994.
[21] R. S. Nikhil. "Cid: A Parallel
'Shared-memory' C for Distributed Memory Machines." In Proc.
7th. Ann. Wkshp. on Languages and Compilers for Parallel Computing
(LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390,
August 8-10 1994.
[22] R. S. Nikhil and Arvind. Id:
a Language with Implicit Parallelism. In A Comparative Study of
Parallel Programming Languages: The Salishan Problems, John Feo
(editor), pages 169-215. North Holland, 1992.
[23] R. S. Nikhil, K. Pingali,
and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation
Structures Group, MIT Lab. for Computer Science, Cambridge MA 02139,
USA, July 1986.
[24] OpenMP Architecture Review
Board. OpenMP
Specifications, 1997 (Fortran), 1998 (C/C++)
[25] S. L. Peyton Jones.
The Implementation of Functional
Programming Languages. Prentice Hall, 1987.
[26] S. L. Peyton Jones
(editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W.
Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M.
Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and
P. Wadler. Haskell
98: A Non-strict, Purely Functional Language, February 1999.
[27] M. C. Rinard and P. C.
Diniz. Commutativity
Analysis: A New Analysis Framework for Parallelizing Compilers. ACM
Trans. on Programming Languages and Systems (TOPLAS), 19(6):942-991,
November 1997.
[28] M. C. Rinard, D. J.
Scales, and M. S. Lam. Jade: A
High-Level, MachineIndependent Language for Parallel Programming.
IEEE Computer, 26(6):28-38, June 1993.
[29] N. Wirth.
Programming in Modula-2. Springer Verlag, Berlin, Germany, 1982.