So far in this series, we have seen that automatic parallelization of sequential programs is hard because precise dependence analysis is hard; data parallel programming languages have a limited application space and choosing good data distributions is hard; and programming with explicitly parallel threads and locks is difficult because it is hard to avoid races and deadlocks.
The message-passing model for parallel programming consists of a collection of conventional sequential programs that can communicate by sending and receiving messages.
This model is popular because it closely mirrors many parallel machines available today, called “multicomputers” and “clusters”, which consist of a collection of conventional computers, each of which contains an input-output device that is connected to a shared network.
Each computer is either a traditional uniprocessor or a small SMP (symmetric multiprocessor), typically with four or fewer processors. Some multicomputers have been designed as parallel machines and have very sophisticated, fast interconnection networks, but even a collection of workstations on a local-area network (a configuration found in most offices today) can be regarded as a multicomputer.
In a multicomputer, since each computer (or node) is independent, a program on node Nd2 has no way to directly read a variable x on node Nd1. Instead, node Nd1 can communicate the value in variable x to node Nd2. The program on node Nd1 executes:

    send(Nd2, x)

while the program on node Nd2 executes:

    receive(Nd1, y)

as a result of which a chunk of data is effectively copied from variable x in node Nd1 into variable y in node Nd2.
For the moment, think of send as depositing a message into the network, which will ultimately be delivered to the recipient; in particular, after executing send, a node in the multicomputer can continue to compute, even though the recipient may not yet have attempted to execute receive.
These are called “nonblocking” or “asynchronous” sends (we will discuss this issue in a later section). The receive call is blocking; that is, execution waits at that statement until a message is actually received.
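To make these semantics concrete, here is a minimal sketch in Python (the two-node setup and all names are ours, not part of any message-passing library): an unbounded queue plays the role of the network, so a send deposits a message and returns immediately, while a receive blocks until a message arrives.

```python
import queue
import threading

channel = queue.Queue()          # unbounded queue models the network

def node1():
    x = 42
    channel.put(x)               # send(Nd2, x): deposits and returns at once
    # Nd1 is free to continue computing here.

def node2(result):
    y = channel.get()            # receive(Nd1, y): blocks until data arrives
    result.append(y)

result = []
t1 = threading.Thread(target=node1)
t2 = threading.Thread(target=node2, args=(result,))
t2.start(); t1.start()           # the receiver may well start first
t1.join(); t2.join()
print(result[0])                 # -> 42
```

Note that starting the receiver first is harmless: it simply waits at its receive until the sender's message arrives.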
|Figure 1.6: Layout of matrices in a multicomputer (each rectangle is a separate computer).|
Matrix Multiplication with Message-Passing
Let us reconsider matrix multiplication of n x n matrices. Assume that our multicomputer has exactly n x n nodes, and assume that the (i,j)'th node contains the variables A(i,j), B(i,j), and C(i,j), as shown in Figure 1.6 above. One could imagine that each node computes an inner product; let us concentrate on the node (4,6):
private real k % (each node has a private variable k)
C(4,6) = 0.0
do k = 1, n
    C(4,6) = C(4,6) + A(4,k) * B(k,6)
Unfortunately, in the message-passing model, this node (4,6) cannot directly access the components that it needs: A(4,1), B(1,6), A(4,2), B(2,6), … So, the system is organized as in Figure 1.7 below.
Every node has variables called aij and bij, which we assume have been preloaded with the appropriate matrix components, and a variable cij which will contain C(i,j) after the matrix multiplication. Each node also has variables my_i and my_j which we assume have been preloaded with the node's indices. Finally, assume that each node contains two 1 x n vectors called ai and bj.
|Figure 1.7: Multicomputer matrix multiplication, local memories only.|
We assume that nodes can communicate as shown in the figure; that is, node (i,j) communicates with nodes (i-1,j), (i+1,j), (i,j-1), and (i,j+1), which we will henceforth abbreviate as up, down, left, and right, respectively. We assume the communication “wraps around” at the edges so that, for example, the “up neighbor” of a node in the top row is the corresponding node in the same column in the bottom row.
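The wraparound rule is easy to state in code. The sketch below is a hypothetical helper of our own (using 0-based indices rather than the article's 1-based ones) that computes the four neighbors of node (i, j) on an n x n torus:

```python
def neighbors(i, j, n):
    """Up, down, left, and right neighbors of node (i, j), with wraparound."""
    up    = ((i - 1) % n, j)
    down  = ((i + 1) % n, j)
    left  = (i, (j - 1) % n)
    right = (i, (j + 1) % n)
    return up, down, left, right

# The "up neighbor" of a node in the top row (row 0 here) is the
# corresponding node in the bottom row of the same column:
print(neighbors(0, 2, 4)[0])   # -> (3, 2)
```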
It is important to be aware that this does not mean that the machine's physical communication network has this two-dimensional topology, just that nodes are capable of communicating in this pattern. In general, a machine's physical network topology has very little to do with a program's abstract communication patterns.
The first thing we do is to collect, in each node, a copy of its entire row of matrix A, using the vector ai. Here is the code (all nodes execute this code simultaneously):

j = my_j
ai(j) = aij
do jj = 1, n-1
    send(left, aij)
    receive(right, aij)
    j = j + 1
    if (j .gt. n) j = 1
    ai(j) = aij
The net effect is to “slide” matrix A to the left (with wraparound at the edge). Each node, when it receives a component of A from the right, deposits it into the correct slot of ai. In particular, node (i,j) receives A(i,j+1), A(i,j+2), A(i,j+3), … in turn.
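The sliding phase can be checked with a small global simulation (a sketch of our own in Python, using 0-based indices; the article's code runs separately on each node): after n-1 circular left shifts, every node has deposited the whole of its row of A into ai.

```python
n = 4
A = [[10 * i + j for j in range(n)] for i in range(n)]   # sample matrix

# aij[i][j] is node (i,j)'s local scalar; ai[i][j] is its length-n vector ai.
aij = [row[:] for row in A]
ai = [[[None] * n for _ in range(n)] for _ in range(n)]
for i in range(n):
    for j in range(n):
        ai[i][j][j] = aij[i][j]              # ai(j) = aij

for step in range(n - 1):
    # All nodes send aij left and receive from the right: globally,
    # a circular left shift of every row.
    aij = [[aij[i][(j + 1) % n] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            slot = (j + step + 1) % n        # the running index j, wrapped
            ai[i][j][slot] = aij[i][j]

# Every node (i,j) now holds the entire i'th row of A:
print(ai[2][0] == A[2])   # -> True
```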
Similarly, we can slide matrix B upward (with wraparound at the edge). As each node receives a component of B from below, it is incorporated into bj. In particular, node (i,j) receives B(i+1,j), B(i+2,j), B(i+3,j), … in turn. Here is the code (all nodes execute this code simultaneously):
i = my_i
bj(i) = bij
do ii = 1, n-1
    send(up, bij)
    receive(down, bij)
    i = i + 1
    if (i .gt. n) i = 1
    bj(i) = bij
At this point, the (i,j)'th node contains, in its ai and bj variables, the i'th row of A and the j'th column of B. It can now perform the inner product (for which no communication is needed; it is a purely local computation):
cij = 0.0
do k = 1, n
    cij = cij + ai(k) * bj(k)
The cijs now represent the desired result matrix.

Message-Passing Matrix Multiplication, Version 2
If one regards communication as the dominant time-consuming operation (as indeed was the case in first-generation parallel machines), then our program takes about 2N time steps: N steps in each of the two loops that shifted A and B. In fact, the two shifts can be performed together in a single loop:
i = my_i
bj(i) = bij
j = my_j
ai(j) = aij
do iijj = 1, n-1
    send(up, bij)
    send(left, aij)
    receive(down, bij)
    receive(right, aij)
    i = i + 1
    if (i .gt. n) i = 1
    j = j + 1
    if (j .gt. n) j = 1
    bj(i) = bij
    ai(j) = aij
Assuming that the two sends, upward and leftward, take place in parallel, this combined loop takes only N communication time steps.
A small optimization can replace the two vectors ai and bj in each node by a single vector ab. We initialize all components to 1.0. For each value received from the south, the node multiplies it into ab(i).

For each value received from the east, the node multiplies it into ab(j). At the end of the communication phase, ab will contain all the individual products in the inner product. These can now be summed for the final value of C(i,j).
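The single-vector trick can be checked in isolation. Here is a sketch of our own for one node, with sample values: initializing ab to 1.0 and multiplying in each arriving value leaves ab holding exactly the individual products of the inner product.

```python
n = 4
row_a = [2.0, 3.0, 5.0, 7.0]   # i'th row of A, as received from the east
col_b = [1.0, 4.0, 6.0, 8.0]   # j'th column of B, as received from the south

ab = [1.0] * n                  # initialize all components to 1.0
for k in range(n):
    ab[k] *= row_a[k]           # value from the east multiplied into ab
    ab[k] *= col_b[k]           # value from the south multiplied into ab

cij = sum(ab)                   # summing gives the final C(i,j)
print(cij == sum(a * b for a, b in zip(row_a, col_b)))   # -> True
```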
Message-Passing Matrix Multiplication, Version 3
The matrix multiply programs still use storage proportional to n³, since there are n² nodes and each node contains a vector or two of size n. A very clever optimization that reduces this to n² storage is possible if we observe that the additions in the inner product can be performed in a different order without changing the final result of the computation.
We first show the solution and then convince ourselves that it is correct. To avoid getting buried in subscripts and ellipses, we will work specifically with 4 x 4 matrices. Figure 1.8 below shows the initial configuration.
|Figure 1.8: Multicomputer matrix multiplication, n² storage.|
This configuration can always be achieved from that of Figure 1.7 by writing a prelude that rotates the four rows of A to the left by 0, 1, 2, and 3 steps, respectively, and rotates the four columns of B upward by 0, 1, 2, and 3 steps, respectively. Each node now executes the following code:
cij = aij * bij
do k = 1, n-1
    send(left, aij)
    send(up, bij)
    receive(right, aij)
    receive(down, bij)
    cij = cij + aij * bij
To convince ourselves that this works, let us focus on the node computing C(2,3) (highlighted in the figure). Initially,

cij = A(2,4) x B(4,3)

In the first iteration of the loop, we slide all A values left and all B values up. So, this node receives A(2,1) and B(1,3) from the right and below, respectively, and the next partial sum is:

cij = (A(2,4) x B(4,3)) + (A(2,1) x B(1,3))

In the next iteration, the node receives A(2,2) and B(2,3) from the right and below, respectively, and the next partial sum is:

cij = (A(2,4) x B(4,3)) + (A(2,1) x B(1,3)) + (A(2,2) x B(2,3))

In the final iteration, the node receives A(2,3) and B(3,3) from the right and below, respectively, and the final sum is:

cij = (A(2,4) x B(4,3)) + (A(2,1) x B(1,3)) + (A(2,2) x B(2,3)) + (A(2,3) x B(3,3))

which is clearly the same as the inner product, which is usually written with the summands in the following order:

cij = (A(2,1) x B(1,3)) + (A(2,2) x B(2,3)) + (A(2,3) x B(3,3)) + (A(2,4) x B(4,3))
This solution is sometimes called a “systolic” solution, because its behavior suggests the pumping of the heart: all nodes communicate, then compute, then communicate, then compute, and so on. (It is also sometimes called the “just in time,” or “zero inventory” solution!)
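The systolic scheme can be verified end to end with a global simulation (a sketch of our own in Python, 0-based indices; the prelude rotation and the slide directions follow Figure 1.8 and the per-node code above):

```python
n = 4
A = [[i * n + j + 1 for j in range(n)] for i in range(n)]
B = [[(i + 1) * (j + 2) for j in range(n)] for i in range(n)]

# Prelude: row i of A rotated left by i, column j of B rotated up by j,
# so node (i,j) starts with A(i, i+j) and B(i+j, j), indices mod n.
a = [[A[i][(i + j) % n] for j in range(n)] for i in range(n)]
b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
c = [[a[i][j] * b[i][j] for j in range(n)] for i in range(n)]

for _ in range(n - 1):
    # Communicate: slide all A values left and all B values up...
    a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
    b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    # ...then compute: accumulate the local product.
    for i in range(n):
        for j in range(n):
            c[i][j] += a[i][j] * b[i][j]

ref = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]
print(c == ref)   # -> True
```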
Communication Issues and Their Effect on Programming
Multicomputer systems have many variations in the semantics of communication. One aspect is synchronicity. At one extreme, even if node Nd1 attempts to send a message very early, it is not allowed to proceed until node Nd2 attempts to receive.
This is called “synchronous communication”, or “blocking sends”, and is analogous to handing a letter directly in person to someone; if that person is not ready yet, we have to wait.
At the other extreme, node Nd1 can send a message at any time and proceed, with the network undertaking to transport and hold on to the message until Nd2 eventually tries to receive it. This is called “asynchronous communication”, or “nonblocking sends”, and is analogous to a pipe of unbounded capacity through which we send letters to the addressee.
In between these extremes, the network may bound the number of messages allowed in transit between nodes to some threshold c. Nd1 can send up to c messages before other nodes attempt to consume them; however, any attempt to send more messages will cause Nd1 to block until other nodes drain the network by receiving messages.
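A bounded channel of this kind can be sketched with Python's standard queue module (an illustration of our own, with c = 2): a Queue with maxsize=c accepts up to c in-transit messages and makes any further send block until a receive drains one.

```python
import queue

c = 2
chan = queue.Queue(maxsize=c)      # at most c messages in transit

chan.put("m1")                     # send: ok, 1 message in transit
chan.put("m2")                     # send: ok, channel now full
full_now = chan.full()             # a third blocking put() would wait here

blocked = False
try:
    chan.put("m3", block=False)    # probe instead of deadlocking the example
except queue.Full:
    blocked = True

chan.get()                         # a receive drains the channel...
chan.put("m3")                     # ...and the pending send can proceed
print(full_now, blocked)           # -> True True
```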
In the program examples above, we assumed asynchronous communications. Suppose, instead, the network only allowed one message to be in transit between nodes; once a node has done a send, it cannot do another send until the first message has been consumed by a receive.
Unfortunately, our program will get stuck (or deadlocked): every node sends to the left and then tries sending upward, but since no node is yet trying to receive from the right, none of the original leftward sends can complete, and all the nodes will wait forever. So, we have to rewrite our program as follows:
cij = aij * bij
do k = 1, n-1
    send(left, aij)
    receive(right, aij)
    send(up, bij)
    receive(down, bij)
    cij = cij + aij * bij
and this solves the problem.
Other variations in network properties include: whether two consecutive messages sent from Nd1 to Nd2 will arrive in the same order; whether messages are always delivered (or sometimes lost); whether they are sometimes duplicated; whether they are sometimes corrupted (for example, a message transmitted via a modem over a phone line or over a satellite link is often corrupted), and so on.
Another concern is protection: if two independent parallel programs are running simultaneously on the same parallel machine, some mechanism is necessary to ensure that one program's messages are not delivered to the other program and vice versa.
Most of these variations are architecture specific and not of direct concern to the programmer, because they are usually hidden under layers of “protocol” software that provide the illusion of a reliable, private, nonblocking, order-preserving network. These layers of software can add significant overhead to sending and receiving a message.
Communication overhead usually has two components: a fixed overhead that is paid whether we send 1 byte or 1000 bytes, and a per-byte overhead. Existing multicomputers vary greatly in the overheads they impose, particularly the fixed overhead.
For workstations using standard network communications (e.g., TCP over Ethernet), the fixed overhead can cost 10⁵-10⁶ instructions; almost all of this overhead is typically in operating-system-mediated communication protocol software, not in actual transit time on the network. On specially built multicomputers and clusters, which often have custom-built networks and network interfaces and much leaner, custom communication software, the fixed overhead can still cost 10²-10³ instructions.
Per-byte overheads are visible as limitations on bandwidth: the number of bytes per second that can be pushed through a channel. Bandwidths of standard local-area networks have been improving (from 10 Mbits/s to 100 Mbits/s and, now, to 1 Gbit/s).
Multicomputer networks have also continually improved, with bandwidths typically an order of magnitude greater than current local-area network technology. However, even these bandwidths are lower than main memory bandwidths by another order of magnitude.
These performance properties are impossible to hide, and they have a profound impact on how we program multicomputers. It should be clear that the only programs that will see a performance improvement from executing on a multicomputer are those that achieve a “good” computation-to-communication ratio.
In other words, each node needs to perform a lot of computation for each communication, so that the communication overhead is amortized over a lot of useful work. This naturally leads to “blocked” program structures, where each node works on large blocks of data and communicates infrequently, in large blocks of data.
(This use of the word “block” refers to the fact that we are dealing with a chunk, or block, of data, and is unrelated to its use in phrases like “blocking sends”, where it means that the program is blocked, or stuck, until the send operation completes.)
|Figure 1.9: A 4 x 4 matrix multiplication.|
For matrix multiplication, chunking up the data is particularly easy, leading to a so-called “blocked” matrix multiply program. Consider the 4 x 4 matrix multiplication shown in Figure 1.9 above. An example component in the result is x with value:
x = ap + bq + cr + ds
However, it is also possible to view it as shown in Figure 1.10 below, where
X = A x P + C x R
and “x” and “+” are themselves matrix multiplication and addition on 2 x 2 matrices, respectively (we leave it as an exercise for the reader to verify that x still has the value ap + bq + cr + ds).
|Figure 1.10: Viewing 4 x 4 matrices as 2 x 2 blocks.|
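The blockwise identity is easy to check directly. In the sketch below (our own; matmult, matadd, and block are hypothetical helpers, and the sample matrices are arbitrary), the top-left 2 x 2 block of a 4 x 4 product is computed from 2 x 2 blocks and compared against the elementwise definition:

```python
def matmult(X, Y):
    """Plain elementwise-definition matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block(M, bi, bj, N):
    """The (bi, bj)'th N x N block of M (0-based block indices)."""
    return [row[bj * N:(bj + 1) * N] for row in M[bi * N:(bi + 1) * N]]

M1 = [[i * 4 + j + 1 for j in range(4)] for i in range(4)]
M2 = [[(i + 2) * (j + 1) for j in range(4)] for i in range(4)]

# Top-left 2 x 2 block of the product, computed blockwise:
X = matadd(matmult(block(M1, 0, 0, 2), block(M2, 0, 0, 2)),
           matmult(block(M1, 0, 1, 2), block(M2, 1, 0, 2)))
print(X == block(matmult(M1, M2), 0, 0, 2))   # -> True
```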
This insight allows us to generalize our systolic matrix multiplication program trivially to work on blocks of size N, instead of individual words:
real aij(N,N), bij(N,N), cij(N,N)

cij = matmult(aij, bij, N)
do k = 1, n/N - 1
    send(left, aij, N*N)
    send(up, bij, N*N)
    receive(right, aij, N*N)
    receive(down, bij, N*N)
    cij = matadd(cij, matmult(aij, bij, N), N)
This kind of “blocking” has two advantages. First, the fixed communication overhead for each send is amortized over N x N elements (four, in the 2 x 2 example), instead of one. Second, each element that is received by a node is used N times without further communication, once for each of the N local inner products that it participates in.
Now, the loop does an N x N matrix multiplication and addition for every two sends and two receives. By making N sufficiently large, we can clearly make the computation-to-communication ratio as high as we want.
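The ratio claim is simple arithmetic (a back-of-the-envelope sketch of our own, counting an N x N matrix multiply-add as roughly 2N³ floating-point operations against the 4N² words moved by the two sends and two receives):

```python
def flops_per_word(N):
    flops = 2 * N**3       # one N x N matrix multiply-add, roughly
    words = 4 * N * N      # 2 sends + 2 receives of N*N words each
    return flops / words   # grows linearly with N (it equals N / 2)

print(flops_per_word(2))     # -> 1.0
print(flops_per_word(100))   # -> 50.0
```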
Of course, the question arises: does the application program really need such large matrices? And the answer is yes; the usual reason for turning to parallel processing in the first place is speed on large data sets.
In closing, we observe that although we have described blocking as an important consideration in programming multicomputers, it turns out to be important even on uniprocessors and on conventional bus-based shared-memory multiprocessors.
The reasons, at a certain abstract level, are the same: minimizing data movement between “near” and “far” memory. Modern uniprocessors and bus-based multiprocessors typically have a memory hierarchy: disks (for paging virtual memory), main memory, and one or more levels of caches.
Data is moved automatically on cache misses and page faults, and these movements have overheads. Because a blocked algorithm uses the elements of a subblock multiple times, it reduces the number of cache misses and page faults, thereby improving performance, often dramatically.
Message-Passing Programming Is Not Easy
We hope the reader will agree that obtaining the final, efficient, message-passing program from the original mathematical specification of matrix multiplication is nontrivial. It takes careful study to convince oneself that the final solution actually works.
If this is the case even for something as simple, structured, and straightforward as matrix multiplication, one can imagine how difficult it is to write programs with more complicated, dynamically unfolding, unpredictable structure, such as sparse matrix programs, or symbolic programs that use graphs and trees instead of matrices.
Next in Part 5: Implicit parallel programming – Declarative languages
To read Part 1, go to How sequential languages obscure parallelism.
To read Part 2, go to How to achieve parallel execution?
To read Part 3, go to Explicit Parallelism: Shared-Memory Programming with Threads and Locks.
Rishiyur is Chief Technical Officer at ESL specialist Bluespec Inc. Previous to its acquisition by Broadcom, he led the Bluespec technology team at Sandburst Corp. For the previous 9 years he was at Cambridge Research Laboratory (DEC/Compaq), including over a year as Acting Director. Earlier, he was an Associate Professor of Computer Science and Engineering at MIT. He holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing, and compiling.
Arvind is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology. His work laid the foundations for Bluespec, centering on high-level specification and description of architectures and protocols using Term Rewriting Systems (TRSs), encompassing hardware synthesis as well as verification of implementations against TRS specifications. Previously, he contributed to the development of dynamic dataflow architectures, the implicitly parallel programming languages Id and pH, and compilation of these languages on parallel machines. He was also president and founder of Sandburst Corp.
To read more on this topic on Embedded.com, go to More about multicores, multithreading and multiprocessing.
“Microsoft: Parallel Programming Model Ten Years off,” by Rick Merritt, Embedded.com, July 23, 2007
“The Problem with Threads,” by Edward A. Lee, IEEE Computer, 39:5, May 2006, pp 33-42
“Why threads are a bad idea (for most purposes),” J. K. Ousterhout, Invited Talk, USENIX Technical Conference, January 1996
Thank you for the excellent article.
I believe the communication to computation ratio is the biggest challenge with high-performance computing these days.
It's worth mentioning that although clusters are the most popular choice for parallel computing, there are emerging alternatives, such as FPGAs and the use of multiple custom processors on a single chip, for solving major computational problems while using direct memory access, thereby increasing throughput. Although this area currently suffers from a lack of higher-level tools, we can see constant progress.
– Yossi Deutsch
Zichron Yahakov, Israel