CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Making the transition from sequential to implicit parallel programming: Part 2
How to Achieve Parallel Execution?



Embedded.com
Since it is hazardous and nontrivial for a programmer to take an existing sequential program and simply annotate some parts of it (such as loops) to run in parallel, what alternatives are open to us? There are at least three approaches that have been, or are being pursued.

One approach is automatic parallelization. We could retain existing sequential languages (for ease of programming and familiarity) and rely on an automatic system such as a compiler to analyze and transform our sequential programs safely into parallel ones, for example to convert our original matrix multiplication program automatically into the final version. Because it is systematic, it would presumably not accidentally introduce race conditions the way a human might.

One popular variant of this approach is to make parallelization relatively easy for the compiler by providing "data parallel" extensions in an existing sequential language. These primitives package up common paradigms for manipulating large data collections such as arrays.

Another possibility is explicit parallelism. We could shift the burden of specifying parallel execution to the programmer. There are two major approaches:

Shared-memory programming with parallel threads and locks. We could ask the programmer to specify what should be executed in parallel (like our doall loop above), but we could give the programmer tools and methodologies that permit manipulation of shared state in a disciplined way, thereby preventing unwanted race conditions.

Message-passing. We could move to a radically different model of parallelism, where a parallel program consists of N totally independent, conventional, sequential programs, augmented with communication primitives that allow them to send messages to each other. Since the N programs are independent, they never directly manipulate each other's state.

(However, it is still possible to have race conditions, because a processor can indirectly read and write locations in another processor by sending it messages requesting it to do so.)

A third option open to us is implicit parallelism, using state-free, or declarative languages. We could pursue programming notations that avoid the notion of state entirely, thereby avoiding parallelization problems.

After all, the mathematical description of matrix multiplication says nothing about state; it was only our encoding into a programming notation that introduced it-perhaps we could move to programming notations that are closer to mathematical notation. A related question is, are such state-free notations adequate to express all programming tasks?

And, of course, we could attempt a mixture of these approaches. The rest of this series outlines all these approaches. We will end with a discussion of parallel Haskell (pH), which may be regarded as a mixture of two approaches: it is predominantly a declarative language (in fact, a functional language), but it also permits disciplined use of shared state when necessary.

Automatic Parallelization of Sequential Programs
The goal of taking programs in traditional sequential programming languages and converting them automatically into parallel programs has been pursued by researchers since the early 1970s. That goal remains elusive; however, the insights, terminology and technology that emerged from that research have proven to be quite valuable.

We start by developing a deeper understanding of how reuse of storage prevents parallel execution. Consider the following sequence of assignment statements:

    x = a + b                                         % S1
     y = 2 * x                                         % S2
     x = a - b                                          % 
S3  
     print(x+y)                                       % S4

We focus on the variable x. Statements Sl and S3 are called defs of x (for definitions) and statements S2 and S4 are called uses of x. We say that there is a true data dependence from Sl to S2 and from S3 to S4 due to x (this is also called a flow dependence or def-use dependence).

In each case, a value is communicated (via x) from one statement to the next and so, clearly, Sl must be executed before S2 and S3 must be executed before S4.

Consider S2 and S3: there is no value being communicated, but, since x is being reused, we must execute them in order so that S2 reads the old value of x before S3 clobbers it with a new value.

This situation, where no value is communicated, but where statements must still be ordered because of reuse of storage, is called an antidependence (or a use-def dependence).

Similarly, there is also the notion of a def-def dependence (or output dependence), but we need not explore that further here. The fourth case (use-use) is clearly not a problem for parallelism.

<>It is important to note that an antidependence can be eliminated by introducing new storage locations. For example, we can change our program fragment so that S3 uses a new variable z instead (and we also have to change S4 correspondingly):
    x = a + b                                         % S1
     y = 2 * x                                         % S2
     z = a - b                                          % 
S3  
     print(x+y)                                       % S
4

Now, S3 is no longer restricted to follow S2, or even S1, for that matter. By eliminating antidependencies, we obtain more parallelism. In fact, this was exactly what happened in our matrix-multiplication program-the conflict on variable s was an antidependency, and by eliminating it [by replacing s with C(i,j)], we obtained more parallelism.

This method is known in the literature as renaming. We can now phrase our question more technically: Is it possible for a compiler to take a sequential program and automatically perform renaming transformations to convert it into a parallel program?

In order to do this, a compiler would first have to identify all antidependencies. This seems trivial in our four-line program fragment above, but it gets very complicated when we encounter aliasing-the fact that the same storage location may be referred to by syntactically distinct names. For example, let us package our four-line fragment into a larger context:

    subroutine foo(x,z, a,b)
    x=a+b                                         % Sl
    y=2*x                                         % S2
    z=a-b                                          % S3
    print(z+y)                                   % S4
    end

    program bar
    call foo(w,w, 6.001, 6.002)
    stop
    end

[For those more familiar with C++, the subroutine header corresponds to:
void foo (float &x, float &z, float a, float b)]

Now, variables x and z actually refer to the same location (corresponding to w) in the main program, and we have our antidependency all over again. Note that the antidependence may not exist for other calls, such as:

    call foo(v,w, 6.003, 6.004)

where v and w are distinct variables. It gets even more complicated when variable locations are computed, as in indexed arrays:

    program glurph
    real U(10)     % (U is an array of 10 elements)

    read(i,j)
    call foo(U(i), U(j), 6.003, 6.004)

Now, the antidependence exists if i = j, and since these values are read from the input during program execution, the compiler has no hope of deciding the issue. Or, consider this context:

        do i = 1,10
            do j = 1,10
                    call foo(U(i), U(j), 6.003, 6.004)
        end do
        end do

Here, the antidependence occurs only in 10% of the iterations (whenever i = j). But, to be conservative, the compiler must assume the antidependence, and force S3 to be sequenced after S2.

One might imagine that the compiler produces two versions of the body of foo, one sequential and the other parallel; at the entry to foo, it could insert code that will test for aliasing and invoke the appropriate body. However, note that this test would be executed a hundred times, once per call to foo, which is a significant overhead.

The last example also introduces the idea that, in general, the compiler has to perform subscript analysis to recognize dependences. For example, in this loop:

        do i = 1,N
                do j = 2,N
                    Q(i,j) = P(j) + Q(i,J -1)
                end do
        end do

the compiler would have to realize that there is a dependence from one j iteration to the next, but not from one i iteration to the next (so, the i iterations can be executed in parallel, but within each such iteration, the j loop must execute sequentially). Further, loop transformations can sometimes eliminate dependences. For example,

        s = 0.0
        do 10 i = 1,N
                P(i) = s
                s=s+5.0
        end do

appears to be a sequential loop because there is a dependence on s from one iteration to the next, but the loop can be transformed, using a technique called induction variable analysis, which recognizes that s "tracks" i; that is, it is a function of i. Here is a transformed version of the loop:

        do i = 1,N
            P(i) = (i-1) * 5.0
        end do
        s=N*5.0

Now, there is no dependence from one iteration to the next and the iterations can be executed in parallel.

Subscript analysis, and aliasing detection in general, are undecidable problems; that is, a compiler can only hope to obtain approximate information and must conservatively assume the worst whenever the information is inexact.

For example, if a compiler cannot prove that x and y point to different objects, then it must assume that they may point to the same object. Despite over two decades of research on this topic (which originally began with so called "vectorizing compilers"), the goal of automatic parallelization has been achieved only on programs with relatively simple structure (simple loops containing matrix computations with simple index expressions; this includes matrix multiplication).

In fact, one of the effects of this research has been that, instead of training compilers to recognize parallelism in arbitrary programs, it has trained people to write programs conforming to the structured parallelism that compilers can recognize automatically.

The problem is even more difficult for arrays with complex index structures, such as sparse matrix computations, and for other data structures such as trees, graphs, and queues which are common in non-numerical, symbolic programs.

The additional difficulty arises because the data structures are often dynamically allocated (and, so, not directly visible to the compiler), often recursive, and rife with shared (aliased) substructures.

Instead of subscript analysis, one needs some kind of "path analysis" to keep track of paths through chains of links between data structures and, in some sense, to approximates at compile time the structure of the complex data structure that will be built at run time.

Research in this area is still in its infancy, and there are many who believe that the problem is intractable. One promising new approach is ["commutativity analysis"] which takes advantage of the encapsulation of state provided in object-oriented languages. Because of this encapsulation, the only code that can modify a particular object is in the methods of that object.

The basic idea of the analysis is to look at the methods associated with an object and see if they commute pairwise, that is, if a method call f followed by a method call g is equivalent to g followed by f. If they commute, the methods can be called in parallel instead of sequentially.

1 | 2

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :