So far in this series, we have seen that automatic parallelization of sequential programs is hard because precise dependence analysis is hard; data-parallel programming languages have a limited application space, and choosing good data distributions is hard; and programming with explicitly parallel threads and locks is difficult because it is hard to avoid races and deadlocks.
The message-passing model for parallel programming consists of a collection of conventional sequential programs that can communicate by sending and receiving messages.
This model is popular because it closely mirrors many parallel machines available today, called "multicomputers" and "clusters", which consist of a collection of conventional computers each of which contains an input-output device that is connected to a shared network.
Each computer is either a traditional uniprocessor or a small SMP (symmetric multiprocessor), typically with four or fewer processors. Some multicomputers have been designed as parallel machines and have very sophisticated, fast interconnection networks, but even a collection of workstations on a local-area network (a configuration found in most offices today) can be regarded as a multicomputer.
In a multicomputer, since each computer (or node) is independent, a program on node Nd2 has no direct way to read a variable x on node Nd1. Instead, node Nd1 must communicate the value of variable x to node Nd2. The program on node Nd1 executes:

send x to Nd2

while the program on node Nd2 executes:

receive y from Nd1

as a result of which a chunk of data is effectively copied from variable x in node Nd1 into variable y in node Nd2.
For the moment, think of send as depositing a message into the network, which will ultimately be delivered to the recipient; in particular, after executing send, a node in the multicomputer can continue to compute, even though the recipient may not yet have attempted to execute receive.
These are called "nonblocking" or "asynchronous" sends (we will discuss this issue in a later section). The receive call is blocking; that is, execution waits at that statement until a message is actually received.
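This asynchronous-send/blocking-receive behavior can be sketched in Python using one thread per node and a queue as each node's mailbox. The `send` and `receive` functions and the node names below are illustrative stand-ins, not a real message-passing API: `put` returns immediately (the sender keeps computing), while `get` blocks until a message arrives.

```python
import threading
import queue

# One mailbox per node; send() deposits a message into the recipient's
# mailbox and returns immediately, while receive() blocks until a
# message is available.
mailbox = {"Nd1": queue.Queue(), "Nd2": queue.Queue()}

def send(dest, value):
    mailbox[dest].put(value)      # nonblocking: returns at once

def receive(me):
    return mailbox[me].get()      # blocking: waits for a message

def node1():
    x = 42
    send("Nd2", x)                # deposit x into the "network"
    # ... node Nd1 is free to continue computing here ...

def node2(result):
    y = receive("Nd2")            # execution waits here until x arrives
    result.append(y)

result = []
t1 = threading.Thread(target=node1)
t2 = threading.Thread(target=node2, args=(result,))
t2.start(); t1.start()            # starting the receiver first is safe:
t1.join(); t2.join()              # it simply blocks until the send occurs
print(result[0])                  # the value of x, copied into y
```

Note that starting the receiver before the sender causes no error; the blocking `receive` simply waits, which is exactly the semantics described above.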
Figure 1.6: Layout of matrices in a multicomputer (each rectangle is a separate computer).
Matrix Multiplication with Message Passing
Let us reconsider matrix multiplication of n x n matrices. Assume that our multicomputer has exactly n x n nodes, and that the (i,j)'th node contains the variables A(i,j), B(i,j), and C(i,j), as shown in Figure 1.6 above. One could imagine that each node computes an inner product; let us concentrate on node (4,6):
private real k        % each node has a private variable k
C(4,6) = 0.0
do k = 1,n
   C(4,6) = C(4,6) + A(4,k) * B(k,6)
end do
Unfortunately, in the message-passing model, this node (4,6) cannot directly access components that it needs: A(4,1), B(1,6), A(4,2), B(2,6), ... So, the system is organized as in Figure 1.7 below. Every node has variables called aij and bij, which we assume have been preloaded with the appropriate matrix components, and a variable cij, which will contain C(i,j) after the matrix multiplication. Each node also has variables my_i and my_j, which we assume have been preloaded with the node's indices. Finally, assume that each node contains two 1 x n vectors called ai and bj.
Figure 1.7: Multicomputer matrix multiplication, local memories only.
It is important to be aware that this does not mean that the machine's physical communication network has this two-dimensional topology, just that nodes are capable of communicating in this pattern. In general, a machine's physical network topology has very little to do with a program's abstract communication patterns.
The first thing we do is to collect, in each node, a copy of its entire row of matrix A, using the vector ai. Here is the code (all nodes execute this code simultaneously):

j = my_j
ai(j) = aij
do jj = 1, n - 1
   send ai(j) to the left neighbor       % wraps around from column 1 to column n
   j = j + 1
   if (j.gt.n) j = 1
   receive ai(j) from the right neighbor
end do
The net effect is to "slide" matrix A to the left (with wraparound at the edge). Each node, when it receives a component of A from the right, deposits it into the correct slot of ai. In particular, node (i,j) receives A(i,j+1), A(i,j+2), A(i,j+3), ... in turn.
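The leftward slide can be checked with a small Python simulation of a single row of n nodes. The `hold` vector and the circular shift below stand in for the simultaneous send-left/receive-right exchanges; all names are illustrative:

```python
# Simulate the leftward slide for a single row of A with n = 5 nodes.
# Node j (0-based here) starts holding only its own component; after
# n - 1 exchange steps, every node's local vector ai holds the whole row.
n = 5
row = [10 * (j + 1) for j in range(n)]    # stand-in values for A(i, 1..n)

ai = [[None] * n for _ in range(n)]       # each node's local vector ai
hold = row[:]                             # the value each node will send next
for node in range(n):
    ai[node][node] = row[node]            # ai(j) = aij

for step in range(n - 1):
    # Every node sends its held value to the left neighbor (with
    # wraparound) and receives from the right neighbor; one circular
    # shift of `hold` models all n sends happening at once.
    hold = hold[1:] + hold[:1]
    for node in range(n):
        j = (node + step + 1) % n         # slot just filled: my_j + step + 1
        ai[node][j] = hold[node]

# Every node now holds a complete copy of the row.
assert all(ai[node] == row for node in range(n))
print("every node now holds the full row:", ai[0])
```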
Similarly, we can slide matrix B upward (with wraparound at the edge). As each node receives a component of B from below, it is incorporated into bj. In particular, node (i,j) receives B(i+1,j), B(i+2,j), B(i+3,j), ... in turn. Here is the code (all nodes execute this code simultaneously):

i = my_i
bj(i) = bij
do ii = 1, n - 1
   send bj(i) to the neighbor above      % wraps around from row 1 to row n
   i = i + 1
   if (i.gt.n) i = 1
   receive bj(i) from the neighbor below
end do
At this point, the (i,j)'th node contains, in its ai and bj variables, the i'th row of A and the j'th column of B. It can now perform the inner product (for which no communication is needed; it is a purely local computation):
cij = 0.0
do k = 1,n
   cij = cij + ai(k) * bj(k)
end do
The cijs now represent the desired result matrix.
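As a sanity check on the whole algorithm, here is a small Python simulation of the n x n grid of nodes. Neighbor messages are modeled by modular indexing rather than a real network, so this is a logic sketch, not a message-passing implementation; the result is compared against a direct triple-loop multiplication.

```python
# End-to-end check of the algorithm on an n x n grid of simulated nodes.
n = 4
A = [[i * n + j + 1 for j in range(n)] for i in range(n)]
B = [[(i + 2) * (j + 1) for j in range(n)] for i in range(n)]

# Phase 1: node (i, j) fills ai with row i of A and bj with column j of B,
# receiving one component per step with wraparound, as in the text.
ai = [[[0] * n for _ in range(n)] for _ in range(n)]
bj = [[[0] * n for _ in range(n)] for _ in range(n)]
for i in range(n):
    for j in range(n):
        for s in range(n):                # s = 0 is the node's own component
            ai[i][j][(j + s) % n] = A[i][(j + s) % n]    # slid in from the right
            bj[i][j][(i + s) % n] = B[(i + s) % n][j]    # slid up from below

# Phase 2: each node computes its inner product purely locally.
C = [[sum(ai[i][j][k] * bj[i][j][k] for k in range(n))
      for j in range(n)] for i in range(n)]

# Reference: direct triple-loop matrix multiplication.
ref = [[sum(A[i][k] * B[k][j] for k in range(n))
        for j in range(n)] for i in range(n)]
assert C == ref
print("distributed result matches direct multiplication")
```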