By Rishiyur S. Nikhil (Bluespec) and Arvind (MIT)
Implicit Producer-Consumer Parallelism
Most of the parallelism in matrix multiplication comes from data
parallelism-an independent inner product is computed at each of the n2
components of the result matrix.
Another kind of parallelism that is very important is
producer-consumer parallelism, and this arises naturally in functional
languages with nonstrict data structures. Suppose, instead of scaling
vector a by 5, we have a more complicated mapping function f:
b = array (1,n)
[ (i, f (A!i)) | i <- [1.. n]]
The computation of all elements of b may be done in parallel, but
they may take widely differing times. In languages with strict data
structures (most existing programming languages), a consumer of b must
wait until all components have been defined (i.e., until the slowest
component computation has completed) because b is simply not available
to the consumer until it is fully defined.
In languages like pH with nonstrict data structures, on the other
hand, a consumer may be given a reference to b as soon as it is
allocated and may attempt to read its components immediately.
These reads automatically wait as long as the components are empty
and automatically resume as they become filled. Thus, the consumer can
overlap some of its work with the computation of b's components.
For a more dramatic example of producer-consumer parallelism,
suppose we have one computation that produces a list of sample readings
(perhaps obtaining them from some measuring instrument) and another
computation that analyzes these readings looking for interesting
events. In pH,
producer instrument =
if no-more-readings(instrument)
then []
else
let
x = get_reading(instrument)
in
x : produce (instrument)
analyzer [] = []
analyzer (x:xs) = if
is_interesting(x)
then x : analyzer(xs)
else analyzer(xs)
The notation "x:xs" is pronounced "x cons xs" and represents a list
with x in front, followed by the remaining list, xs. The notation []
represents an empty list. The producer function gets a reading x from
the instrument and produces a list containing x followed by more
readings, until the instrument indicates that there are no more
readings.
The analyzer function handles two cases, and this is indicated by
matching the actual argument against one of the two patterns [] and
(x:xs). In the former case, it returns an empty list of interesting
readings.
In the latter case, it checks if the first reading x is interesting.
If so, it returns it, followed by the result of analyzing the remaining
readings xs. Otherwise, it discards x and simply returns the result of
analyzing the remaining readings.
Now, the interesting producer-consumer parallelism occurs when we
plug these functions together:
events = analyzer (producer
(a_particular_instrument))
Suppose the instrument yielded a reading every ten seconds, there
are a hundred readings, and it takes five seconds to analyze each
reading. With conventional ("strict") data structures, the producer
function would, after 1000 seconds or so, return a list of 100
elements.
At this point, the analyzer would begin its work and, about 500
seconds later, eventually return a list of interesting readings.
Overall, the computation takes 1500 seconds or more.
With nonstrict data structures, however, as soon as the instrument
has yielded its first reading, the producer can return the first
element of the list, leaving the "tail" of the list temporarily empty,
to be filled in later when it gets the next reading.
Having returned this first reading, the analyzer can examine it
immediately. If it is interesting, the analyzer, in turn, can
immediately return the first element of its result list, leaving its
tail empty, to be filled in later.
There is an implicit synchronization necessary here. In particular,
the analyzer, having examined the first reading, may attempt to proceed
to the next reading, which may not be available yet.
The semantics of nonstrict data structures specify that the analyzer
automatically blocks until the tail of its input list is no longer
empty (i.e., has been filled in by
the producer).
One can visualize this as a chain that is gradually produced by the
producer and incrementally consumed by the analyzer as it becomes
available. The work of the analyzer and producer can thus be
overlapped, instead of being run seriatim. Thus, each 5-second activity
of the analyzer can take place during the 10-second wait for the next
instrument reading, and the overall computation can be completed in
about 1000 seconds.
Although we have illustrated producer-consumer parallelism with a
simple, linear data structure (a list), the same principle applies to
arbitrary data structures in pH and nonstrict functional languages.
For example, the producer may organize the instrument readings in a
sorted tree. The picture changes to one where the fringe, or frontier,
of the tree grows incrementally as readings arrive. The analyzer, in
parallel, pursues all branches up to the current fringe, consuming
values at the fringe as it moves outward.
Similarly, the production of different elements of an array may take
varied times, and a consumer of the array can simultaneously pick up
those values that are ready at any given point in time.
Next in Part 6:
So, why
aren't we
all using functional languages yet?
To read Part 1, go to
How sequential languages obscure
parallelism.
To read Part 2, go to
How to achieve parallel execution?
To read Part 3, go to
Explicit Parallelism: Shared-Memory
Programming with Threads and Locks
To read Part 4, go to
Explicit Parallelism:
Message-Passing programming
Rishiyur
S. Nikhil and Arvind are the coauthors of Implicit
Parallel Programming in pH, published by Elsevier/Morgan
Kaufmann.
Rishiyur is Chief Technical
Officer at ESL specialist Bluespec Inc. Previous to its
acquisition by Broadcom he led the Bluespec technology team at
Sandburst Corp. For the previous 9 years he was at Cambridge Research
Laboratory (DEC/Compaq), including over a year as Acting Director.
Earlier, he was an Associate Professor of Computer Science and
Engineering at MIT. He holds several patents in functional programming,
dataflow and multithreaded architectures, parallel processing and
compiling.
Arvind is the Johnson Professor of
Computer Science and Engineering at the Massachusetts Institute of
Technology. His work laid the foundations for Bluespec, centering on
high-level specification and description of architectures and protocols
using Term Rewriting Systems (TRSs),
encompassing hardware synthesis as well as verification of
implementations against TRS specifications. Previously, he contributed
to the development of dynamic dataflow architectures, the implicitly
parallel programming languages Id and pH, and compilation of these
languages on parallel machines. He was also president and founder of
Sandburst Corp.
To
read more on this topic on Embedded.com go to More
about multicores, multithreading and multiprocessing.
References:
[1] "Microsoft:
Parallel Programming Model Ten Years off," by Rick Merritt,
Embedded.com, July 23, 2007
[2] "The Problem with Threads" by
Edward A. Lee, IEEE Computer, 39:5, May 2006, pp 33-42
[3] "Why
threads are a bad idea (for most purposes)," J.K. Ousterhout,
Invited Talk, USENIX Technical Conference, January, 1996
[4] S. Ahuja, N. Carriero, and
D. Gelernter. "Linda
and Friends," IEEE Computer, 19(8):26-34, August 1986.
[5] Z. Ariola and Arvind. "Properties
of a First-order Functional Language with Sharing." Theoretical
Computer Science, 146(1-2):69-108, July 1995.
[6] J. Armstrong. "The Development of
Erlang." In Proc. Intl. Conf. on Functional Programming (ICFP),
Amsterdam, The Netherlands, pages 196-203, 1997.
[7] J. Armstrong, R. Virding,
and M. Williams. "Concurrent
Programming in Erlang." Prentice Hall, 1993. ISBN: 0-13-285792-8.
[8] K. Arnold and J. Gosling. The Java
Programming Language, second edition. Addison-Wesley, 1998. ISBN
0-201-31006-6.
[9] Arvind, A. Caro, J.-W.
Maessen, and S. Aditya. A
Multithreaded Substrate and Compilation Model for the Implicitly
Parallel Language pH. In Proc. Wkshp. on Languages and Compilers
for Parallel Computing (LCPC), August 1996.
[10] Arvind, K. P. Gostelow,
and W. Plouffe. The
(preliminary) Id Report. Technical Report 114, Dept. of Information
and Computer Science, Univ. of California, Irvine CA, USA, 1978.
[11] Arvind and R. S. Nikhil. Executing
a Program on the MIT Tagged-Token Dataflow Architecture. IEEE
Trans. on Computers, 39(3):300-318, March 1990.
[12] R. D. Blumofe, C. F.
Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime
System. In Proc. 5th. ACM Symp. on Principles and Practice
of Parallel Programming (PPOPP), Santa Barbara CA, USA, pages 207-216,
July 19-21 1995.
[13] D. E. Culler, A. Dusseau,
S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K.
Yelick. Parallel
Programming in Split-C. In Proc. Supercomputing 93, Portland OR,
USA, November 1993.
[14] A. Geist, A. Begeulin, J.
Dongarra, W. Jiang, R. Manchek, and V. S. Sundaram. PVM
Parallel Virtual Machine. A User's Guide and Tutorial for Network
Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
[15] R. H. Halstead. Multilisp:
A
Language for Concurrent Symbolic Computation. ACM Trans. on
Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
[16] High Performance
Fortran Forum. High Performance Fortran Language Specification,
Version 1.0. Technical Report CRPC-TR92225, Center for Research on
Parallel Computation, Rice University, Houston TX, USA, May 3 1993.
[17] C. Hoare. Monitors:
an Operating System Structuring Concept. Communications of the ACM,
17(10):549-557, October 1974.
[18] R. Hughes. Why
Functional Programming Matters. The Computer Journal, 32(2):98-107,
1989.
[19] Threads
standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
[20] Message Passing Interface
Forum. MPI:
A Message-Passing Interface Standard, May 1994.
[21] R. S. Nikhil. "Cid: A Parallel
'Shared-memory' C for Distributed Memory Machines." In Proc.
7th. Ann. Wkshp. on Languages and Compilers for Parallel Computing
(LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390,
August 8-10 1994.
[22] R. S. Nikhil and Arvind. Id:
a Language with Implicit Parallelism. In A Comparative Study of
Parallel Programming Languages: The Salishan Problems, John Feo
(editor), pages 169-215. North Holland, 1992.
[23] R. S. Nikhil, K. Pingali,
and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation
Structures Group, MIT Lab. for Computer Science, Cambridge MA 02139,
USA, July 1986.
[24] OpenMP Architecture Review
Board. OpenMP
Specifications, 1997 (Fortran), 1998 (C/C++)
[25] S. L. Peyton Jones.
The Implementation of Functional
Programming Languages. Prentice Hall, 1987.
[26] S. L. Peyton Jones
(editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W.
Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M.
Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and
P. Wadler. Haskell
98: A Non-strict, Purely Functional Language, February 1999.
[27] M. C. Rinard and P. C.
Diniz. Commutativity
Analysis: A New Analysis Framework for Parallelizing Compilers. ACM
Trans. on Programming Languages and Systems (TOPLAS), 19(6):942-991,
November 1997.
[28] M. C. Rinard, D. J.
Scales, and M. S. Lam. Jade: A
High-Level, MachineIndependent Language for Parallel Programming.
IEEE Computer, 26(6):28-38, June 1993.
[29] N. Wirth.
Programming in Modula-2. Springer Verlag, Berlin, Germany, 1982.