By Rishiyur S. Nikhil (Bluespec) and Arvind (MIT)
Since it is hazardous and nontrivial for a programmer to take an
existing sequential program and simply annotate some parts of it (such
as loops) to run in parallel, what alternatives are open to us? There
are at least three approaches that have been, or are being pursued.
One approach is automatic
parallelization. We could retain existing sequential
languages
(for ease of programming and familiarity) and rely on an automatic
system such as a compiler to analyze and transform our sequential
programs safely into parallel ones, for example to convert our original
matrix multiplication program automatically into the final version.
Because it is systematic, it would presumably not accidentally
introduce race conditions the way a human might.
One popular variant of this approach is to make parallelization
relatively easy for the compiler by providing "data parallel" extensions in an
existing sequential language. These primitives package up common
paradigms for manipulating large data collections such as arrays.
Another possibility is explicit
parallelism. We could shift the burden of specifying
parallel
execution to the programmer. There are two major approaches:
Shared-memory
programming with parallel threads and locks. We could ask the
programmer to specify what should be executed in parallel (like our
doall loop above), but we could give the programmer tools and
methodologies that permit manipulation of shared state in a disciplined
way, thereby preventing unwanted race conditions.
Message-passing.
We could move to a radically different model of parallelism, where a
parallel program consists of N totally independent, conventional,
sequential programs, augmented with communication primitives that allow
them to send messages to each other. Since the N programs are
independent, they never directly manipulate each other's state.
(However, it is still possible to
have race conditions, because a processor can indirectly read and write
locations in another processor by sending it messages requesting it to
do so.)
A third option open to us is implicit parallelism, using state-free,
or declarative languages. We could pursue programming notations that
avoid the notion of state entirely, thereby avoiding parallelization
problems.
After all, the mathematical description of matrix multiplication
says nothing about state; it was only our encoding into a programming
notation that introduced it-perhaps we could move to programming
notations that are closer to mathematical notation. A related question
is, are such state-free notations adequate to express all programming
tasks?
And, of course, we could attempt a mixture of these approaches. The
rest of this series outlines all these approaches. We will end with a
discussion of parallel Haskell (pH), which may be regarded as a mixture
of two approaches: it is predominantly a declarative language (in fact,
a functional language), but it also permits disciplined use of shared
state when necessary.
Automatic Parallelization of
Sequential Programs
The goal of taking programs in traditional sequential programming
languages and converting them automatically into parallel programs has
been pursued by researchers since the early 1970s. That goal remains
elusive; however, the insights, terminology and technology that emerged
from that research have proven to be quite valuable.
We start by developing a deeper understanding of how reuse of
storage prevents parallel execution. Consider the following sequence of
assignment statements:
x = a +
b
% S1
y = 2 *
x
% S2
x = a -
b
% S3
print(x+y)
% S4
We focus on the variable x. Statements Sl and S3 are called defs of
x (for definitions) and statements S2 and S4 are called uses of x. We
say that there is a true data dependence from Sl to S2 and from S3 to
S4 due to x (
this is also called a
flow dependence or def-use
dependence).
In each case, a value is communicated (via x) from one statement to
the next and so, clearly, Sl must be executed before S2 and S3 must be
executed before S4.
Consider S2 and S3: there is no value being communicated, but, since
x is being reused, we must execute them in order so that S2 reads the
old value of x before S3 clobbers it with a new value.
This situation, where no value is communicated, but where statements
must still be ordered because of reuse of storage, is called an
antidependence (or a use-def
dependence).
Similarly, there is also the notion of a def-def dependence (or
output dependence), but we need not explore that further here.
The
fourth case (use-use) is clearly not a problem for parallelism.
<>It is important to note that an antidependence can be
eliminated by
introducing new storage locations. For example, we can change our
program fragment so that S3 uses a new variable z instead (
and we also
have to change S4 correspondingly):
x = a +
b
% S1
y = 2 *
x
% S2
z = a -
b
% S3
print(x+y)
% S4
Now, S3 is no longer restricted to follow S2, or even S1, for that
matter. By eliminating antidependencies, we obtain more parallelism. In
fact, this was exactly what happened in our matrix-multiplication
program-the conflict on variable s was an antidependency, and by
eliminating it [by replacing s with
C(i,j)], we obtained more parallelism.
This method is known in the literature as renaming. We can now phrase our
question more technically: Is it possible for a compiler to take a
sequential program and automatically perform renaming transformations
to convert it into a parallel program?
In order to do this, a compiler would first have to identify all
antidependencies. This seems trivial in our four-line program fragment
above, but it gets very complicated when we encounter aliasing-the fact
that the same storage location may be referred to by syntactically
distinct names. For example, let us package our four-line fragment into
a larger context:
subroutine foo(x,z,
a,b)
x=a+b
% Sl
y=2*x
% S2
z=a-b
% S3
print(z+y)
% S4
end
program bar
call foo(w,w,
6.001, 6.002)
stop
end
[
For those more familiar with C++,
the subroutine header corresponds
to:
void foo (float
&x, float
&z, float a, float b)]
Now, variables x and z actually refer to the same location
(corresponding to w) in the
main program, and we have our
antidependency all over again. Note that the antidependence may not
exist for other calls, such as:
call foo(v,w,
6.003, 6.004)
where v and w are distinct variables. It gets even more complicated
when variable locations are computed, as in indexed arrays:
program glurph
real U(10) % (U is an array of 10 elements)
read(i,j)
call foo(U(i), U(j), 6.003, 6.004)
Now, the antidependence exists if i = j, and since these values are
read from the input during program execution, the compiler has no hope
of deciding the issue. Or, consider this context:
do i = 1,10
do j = 1,10
call foo(U(i),
U(j), 6.003, 6.004)
end do
end do
Here, the antidependence occurs only in 10% of the iterations
(whenever i = j). But, to be conservative, the compiler must assume the
antidependence, and force S3 to be sequenced after S2.
One might imagine that the compiler produces two versions of the
body of foo, one sequential and the other parallel; at the entry to
foo, it could insert code that will test for aliasing and invoke the
appropriate body. However, note that this test would be executed a
hundred times, once per call to foo, which is a significant overhead.
The last example also introduces the idea that, in general, the
compiler has to perform subscript analysis to recognize dependences.
For example, in this loop:
do i = 1,N
do j = 2,N
Q(i,j) = P(j)
+ Q(i,J -1)
end do
end do
the compiler would have to realize that there is a dependence from
one j iteration to the next, but not from one i iteration to the next
(so, the i iterations can be executed in parallel, but within each such
iteration, the j loop must execute sequentially). Further, loop
transformations can sometimes eliminate dependences. For example,
s = 0.0
do 10 i = 1,N
P(i) = s
s=s+5.0
end do
appears to be a sequential loop because there is a dependence on s
from one iteration to the next, but the loop can be transformed, using
a technique called induction variable analysis, which recognizes that s
"tracks" i; that is, it is a function of i. Here is a transformed
version of the loop:
do i = 1,N
P(i) = (i-1) * 5.0
end do
s=N*5.0
Now, there is no dependence from one iteration to the next and the
iterations can be executed in parallel.
Subscript analysis, and aliasing detection in general, are
undecidable problems; that is, a compiler can only hope to obtain
approximate information and must conservatively assume the worst
whenever the information is inexact.
For example, if a compiler cannot prove that x and y point to
different objects, then it must assume that they may point to the same
object. Despite over two decades of research on this topic (which
originally began with so called "vectorizing
compilers"), the goal of automatic parallelization has been
achieved only on programs with relatively simple structure (simple
loops containing matrix computations with simple index expressions;
this includes matrix multiplication).
In fact, one of the effects of this research has been that, instead
of training compilers to recognize parallelism in arbitrary programs,
it has trained people to write programs conforming to the structured
parallelism that compilers can recognize automatically.
The problem is even more difficult for arrays with complex index
structures, such as sparse matrix computations, and for other data
structures such as trees, graphs, and queues which are common in
non-numerical, symbolic programs.
The additional difficulty arises because the data structures are
often dynamically allocated (and, so, not directly visible to the
compiler), often recursive, and rife with shared (aliased)
substructures.
Instead of subscript analysis,
one needs some kind of "path analysis"
to keep track of paths through chains of links between data structures
and, in some sense, to approximates at compile time the structure of
the complex data structure that will be built at run time.
Research in this area is still in its infancy, and there are many
who believe that the problem is intractable. One promising new approach
is ["commutativity analysis"] which takes advantage of the
encapsulation of state provided in object-oriented languages. Because
of this encapsulation, the only code that can modify a particular
object is in the methods of that object.
The basic idea of the analysis is to look at the methods associated
with an object and see if they commute pairwise, that is, if a method
call f followed by a method call g is equivalent to g followed by f. If
they commute, the methods can be called in parallel instead of
sequentially.