By Rishiyur S. Nikhil (Bluespec) and Arvind (MIT)
So far in this series, we have seen that automatic parallelization of
sequential programs is hard because precise dependence analysis is
hard; data parallel programming languages have a limited application
space and choosing good data distributions is hard; and programming
with explictly parallel threads and locks is difficult because it is
hard to avoid races and deadlocks.
The message-passing model for parallel programming consists of a
collection of conventional sequential programs that can communicate by
sending and receiving messages.
This model is popular because it closely mirrors many parallel
machines available today, called "multicomputers" and "clusters", which
consist of a collection of conventional computers each of which
contains an input-output device that is connected to a shared network.
Each computer is either a traditional uniprocessor, or a small SMP
(symmetric multiprocessor), typically with four processors or less.
Some multicomputers have been designed as parallel machines and have
very sophisticated, fast interconnection networks, but even a
collection of workstations on a local-area network-a configuration
found in most offices today-can be regarded as a multicomputer.
In a multicomputer, since each computer (or node) is independent, a
program on node Nd2 has no way directly
to read a variable x on node
Nd1. Instead, node Nd1 can communicate the
value in variable x to node
Nd2. The program on
node Nd1 executes:
send(Nd2,
x)
while the program on node Nd2 executes:
receive(Nd1,
y)
as a result of which a chunk of data is effectively copied from
variable x in node Nd1 into variable y in node Nd2.
For the moment, think of send as depositing a message into the
network, which will ultimately be delivered to the recipient; in
particular, after executing send, a node in the multicomputer can
continue to compute, even though the recipient may not yet have
attempted to execute receive.
These are called "nonblocking" or "asynchronous" sends (we will
discuss this issue in a later section). The receive call is blocking;
that is, execution waits at that statement until a message is actually
received.
 |
| Figure
1.6: Layout of matrices in a multicomputer (each rectangle is a
separate computer). |
Matrix Multiplication with
Message-Passing
Let us reconsider matrix multiplication of n x n matrices. Assume that
our multicomputer has exactly n x n nodes, and assume that the (i,j)'th
node contains the variables A(i,j), B(i,j), and C(i,j), as shown in
Figure 1.6 above. One could
imagine that each node computes an inner
product; let us concentrate on the node (4,6):
private real k
% (each node has a
private variable k)
C(4,6) = 0.0
do k = 1,n
C(4,6) = C(4,6) + A(4,k) * B(k,6)
end do
Unfortunately, in the message-passing model, this node (4,6) cannot
directly access components that it needs: A(4,1), B(1,6), A(4,2),
B(2,6), ... So, the system is organized as in Figure 1.7 below.
Every node has variables called aij and bij which we assume
have
been preloaded with the appropriate matrix components, and variable cij
which will contain C(i,j) after the matrix multiplication. Each node
also has variables my_i and my_j which we assume have been
preloaded with the node's indices. Finally, assume that each node
contains two 1 x n vectors called ai and bj.
 |
| Figure
1.7: Multicomputer matrix multiplication, local memories only. |
We assume that nodes can communicate as shown in the figure;
that
is, node (i,j) communicates with nodes (i-l,j), (i+l,j), (i,j-1), and
(i,j+1), which we will henceforth abbreviate as up, down, left, and
right, respectively.We assume the communication "wraps around" at the
edges so that, for
example, the "up neighbor" of a node in the top row is the
corresponding node in the same column in the bottom row.
It is important to be aware that this does not mean that the
machine's physical communication network has this two-dimensional
topology, just that nodes are capable of communicating in this pattern.
In general, a machine's physical network topology has very little to do
with a program's abstract communication patterns.
The first thing we do is to collect, in each node, a copy of its
entire row of matrix A, using the vector ai. Here is the code (all
nodes executed this code simultaneously):
j =
my_j
ai(j) = aij
do jj = 1, n - 1
send(left, ai(j))
j = j + 1
if (j.gt.n) j = 1
receive(right, ai(j))
end do
The net effect is to "slide" matrix A to the left (with wraparound
at the edge). Each node, when it receives a component of A from
the
right, deposits it into the correct slot of ai. In particular, node
(i,j) receives Ai,j+1, Ai,j+2, Ai,j+3,
... in turn.
Similarly, we can slide matrix B upward (with wraparound at the
edge). As each node receives a component of B from below, it is
incorporated into bj. In particular, node (i, j) receives Bi+1,j,
Bi+2,j, Bi+3,j, ... in turn. Here is the code
(all nodes executed
this code simultaneously):
i = my_i
bj(i) = bij
do ii = 1, n-1
send(up, bj(i))
i = i + 1
if (i.gt.n) i = 1
receive(down, bj(i))
end do
At this point, the (i,j )'th node contains, in its ai and bi
variables, the i'th row of A and j'th column of B. It can now perform
the inner product (for which no communication is needed-it is a purely
local computation):
cij = 0.0
do k = 1,n
cij = cij + ai(k) * bj(k)
end do
The cijs now represent the desired
result matrix.