
Making the transition from sequential to implicit parallel programming: Part 2

Since it is hazardous and nontrivial for a programmer to take an existing sequential program and simply annotate some parts of it (such as loops) to run in parallel, what alternatives are open to us? There are at least three approaches that have been, or are being, pursued.

One approach is automatic parallelization. We could retain existing sequential languages (for ease of programming and familiarity) and rely on an automatic system such as a compiler to analyze and transform our sequential programs safely into parallel ones, for example to convert our original matrix multiplication program automatically into the final version. Because such a system is systematic, it would presumably not accidentally introduce race conditions the way a human might.

One popular variant of this approach is to make parallelization relatively easy for the compiler by providing “data parallel” extensions in an existing sequential language. These primitives package up common paradigms for manipulating large data collections such as arrays.

Another possibility is explicit parallelism. We could shift the burden of specifying parallel execution to the programmer. There are two major approaches:

Shared-memory programming with parallel threads and locks. We could ask the programmer to specify what should be executed in parallel (like the doall loop in Part 1), but we could give the programmer tools and methodologies that permit manipulation of shared state in a disciplined way, thereby preventing unwanted race conditions.

Message-passing. We could move to a radically different model of parallelism, where a parallel program consists of N totally independent, conventional, sequential programs, augmented with communication primitives that allow them to send messages to each other. Since the N programs are independent, they never directly manipulate each other's state.

(However, it is still possible to have race conditions, because a processor can indirectly read and write locations in another processor's memory by sending it messages requesting that it do so.)

A third option open to us is implicit parallelism, using state-free, or declarative, languages. We could pursue programming notations that avoid the notion of state entirely, thereby avoiding parallelization problems.

After all, the mathematical description of matrix multiplication says nothing about state; it was only our encoding into a programming notation that introduced it. Perhaps we could move to programming notations that are closer to mathematical notation. A related question is: are such state-free notations adequate to express all programming tasks?

And, of course, we could attempt a mixture of these approaches. The rest of this series outlines all these approaches. We will end with a discussion of parallel Haskell (pH), which may be regarded as a mixture of two approaches: it is predominantly a declarative language (in fact, a functional language), but it also permits disciplined use of shared state when necessary.

Automatic Parallelization of Sequential Programs
The goal of taking programs in traditional sequential programming languages and converting them automatically into parallel programs has been pursued by researchers since the early 1970s. That goal remains elusive; however, the insights, terminology and technology that emerged from that research have proven to be quite valuable.

We start by developing a deeper understanding of how reuse of storage prevents parallel execution. Consider the following sequence of assignment statements:

    x = a + b          % S1
    y = 2 * x          % S2
    x = a - b          % S3
    print(x + y)       % S4

We focus on the variable x. Statements S1 and S3 are called defs of x (for definitions) and statements S2 and S4 are called uses of x. We say that there is a true data dependence from S1 to S2 and from S3 to S4 due to x (this is also called a flow dependence or def-use dependence).

In each case, a value is communicated (via x) from one statement to the next, so clearly S1 must be executed before S2, and S3 before S4.

Consider S2 and S3: there is no value being communicated, but, since x is being reused, we must execute them in order so that S2 reads the old value of x before S3 clobbers it with a new value.

This situation, where no value is communicated, but where statements must still be ordered because of reuse of storage, is called an antidependence (or a use-def dependence).

Similarly, there is also the notion of a def-def dependence (or output dependence); in our fragment, S1 and S3 form such a pair because both write x. We need not explore that further here. The fourth case (use-use) is clearly not a problem for parallelism.

It is important to note that an antidependence can be eliminated by introducing new storage locations. For example, we can change our program fragment so that S3 uses a new variable z instead (and we also have to change S4 correspondingly):
    x = a + b          % S1
    y = 2 * x          % S2
    z = a - b          % S3
    print(z + y)       % S4

Now, S3 is no longer restricted to follow S2, or even S1, for that matter. By eliminating antidependences, we obtain more parallelism. In fact, this is exactly what happened in our matrix-multiplication program: the conflict on the variable s was an antidependence, and by eliminating it [by replacing s with C(i,j)], we obtained more parallelism.

This method is known in the literature as renaming. We can now phrase our question more technically: Is it possible for a compiler to take a sequential program and automatically perform renaming transformations to convert it into a parallel program?

In order to do this, a compiler would first have to identify all antidependences. This seems trivial in our four-line program fragment above, but it gets very complicated when we encounter aliasing: the fact that the same storage location may be referred to by syntactically distinct names. For example, let us package our four-line fragment into a larger context:

    subroutine foo(x,z,a,b)
    x = a + b          % S1
    y = 2 * x          % S2
    z = a - b          % S3
    print(z + y)       % S4
    end

    program bar
    call foo(w, w, 6.001, 6.002)
    stop
    end

[For those more familiar with C++, the subroutine header corresponds to:
void foo(float &x, float &z, float a, float b) ]

Now, variables x and z actually refer to the same location (corresponding to w) in the main program, and we have our antidependence all over again. Note that the antidependence may not exist for other calls, such as:

    call foo(v, w, 6.003, 6.004)

where v and w are distinct variables. It gets even more complicated when variable locations are computed, as in indexed arrays:

    program glurph
    real U(10)               % U is an array of 10 elements

    read(i,j)
    call foo(U(i), U(j), 6.003, 6.004)

Now, the antidependence exists if i = j, and since these values are read from the input during program execution, the compiler has no hope of deciding the issue. Or, consider this context:

       do i = 1,10
           do j = 1,10
               call foo(U(i), U(j), 6.003, 6.004)
           end do
       end do

Here, the antidependence occurs only in 10% of the iterations (whenever i = j). But, to be conservative, the compiler must assume the antidependence, and force S3 to be sequenced after S2.

One might imagine that the compiler produces two versions of the body of foo, one sequential and the other parallel; at the entry to foo, it could insert code that tests for aliasing and invokes the appropriate body. However, note that this test would be executed a hundred times, once per call to foo, which is a significant overhead.
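
A hypothetical sketch of what such compiler-generated dispatch might look like is shown below; the aliasing test and the names foo_seq and foo_par are our own inventions, not the output of any particular compiler:

    subroutine foo(x, z, a, b)
        ! hypothetical runtime test: do the actual arguments for x and z
        ! occupy the same storage location?
        if ( same_location(x, z) ) then
            call foo_seq(x, z, a, b)    ! sequential body: S3 ordered after S2
        else
            call foo_par(x, z, a, b)    ! parallel body: S3 may overlap S1 and S2
        end if
    end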

The last example also introduces the idea that, in general, the compiler has to perform subscript analysis to recognize dependences. For example, in this loop:

       do i = 1,N
           do j = 2,N
               Q(i,j) = P(j) + Q(i,j-1)
           end do
       end do

the compiler would have to realize that there is a dependence from one j iteration to the next, but not from one i iteration to the next (so the i iterations can be executed in parallel, but within each such iteration, the j loop must execute sequentially). Further, loop transformations can sometimes eliminate dependences. For example,

       s = 0.0
       do i = 1,N
           P(i) = s
           s = s + 5.0
       end do

appears to be a sequential loop because there is a dependence on s from one iteration to the next, but the loop can be transformed using a technique called induction variable analysis, which recognizes that s “tracks” i; that is, it is a function of i. Here is a transformed version of the loop:

       do i = 1,N
           P(i) = (i-1) * 5.0
       end do
       s = N * 5.0

Now, there is no dependence from one iteration to the next, and the iterations can be executed in parallel.
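
Using the doall notation from Part 1 (shown here only schematically; doall is not standard Fortran), the transformed loop could then be marked for parallel execution:

       doall i = 1,N
           P(i) = (i-1) * 5.0    ! iterations are independent and may run in parallel
       end do
       s = N * 5.0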

Subscript analysis, and aliasing detection in general, are undecidable problems; that is, a compiler can only hope to obtain approximate information and must conservatively assume the worst whenever the information is inexact.

For example, if a compiler cannot prove that x and y point to different objects, then it must assume that they may point to the same object. Despite over two decades of research on this topic (which originally began with so-called “vectorizing compilers”), the goal of automatic parallelization has been achieved only on programs with relatively simple structure (simple loops containing matrix computations with simple index expressions; this includes matrix multiplication).

In fact, one of the effects of this research has been that, instead of training compilers to recognize parallelism in arbitrary programs, it has trained people to write programs conforming to the structured parallelism that compilers can recognize automatically.

The problem is even more difficult for arrays with complex index structures, such as sparse matrix computations, and for other data structures such as trees, graphs, and queues, which are common in non-numerical, symbolic programs.

The additional difficulty arises because the data structures are often dynamically allocated (and so not directly visible to the compiler), often recursive, and rife with shared (aliased) substructures.

Instead of subscript analysis, one needs some kind of “path analysis” to keep track of paths through chains of links between data structures and, in some sense, to approximate at compile time the structure of the complex data structures that will be built at run time.

Research in this area is still in its infancy, and there are many who believe that the problem is intractable. One promising new approach is commutativity analysis [27], which takes advantage of the encapsulation of state provided in object-oriented languages. Because of this encapsulation, the only code that can modify a particular object is in the methods of that object.

The basic idea of the analysis is to look at the methods associated with an object and see if they commute pairwise, that is, if a method call f followed by a method call g is equivalent to g followed by f. If they commute, the methods can be called in parallel instead of sequentially.
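
Although the technique targets object-oriented languages, its flavor can be sketched even in Fortran 90 module form; the counter type and the add operation below are an invented illustration, not an example taken from the commutativity-analysis work itself:

       module counter_mod
           type counter
               integer :: total
           end type counter
       contains
           subroutine add(c, v)
               type(counter), intent(inout) :: c
               integer, intent(in)          :: v
               ! add(c,x) followed by add(c,y) leaves c in the same state as
               ! add(c,y) followed by add(c,x), so the two calls commute and a
               ! parallelizing compiler may run them concurrently (provided the
               ! update itself is made atomic).
               c%total = c%total + v
           end subroutine add
       end module counter_mod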

Data Parallel Programming Languages
Data parallel languages are typically sequential languages extended with constructs that, by definition, permit parallel operations on large homogeneous collections of data. The principal examples today are Fortran 90 and HPF, or High Performance Fortran [16].

Consider the assignment statement:

       A = B + C

When A, B, and C are integers or floating-point numbers, this statement has the usual meaning. However, in Fortran 90, this statement has a different meaning if the variables are vectors, e.g., declared as follows:

       REAL, DIMENSION(1:100) :: A, B, C

Now, the meaning of the assignment statement is, for all i in the range 1 to 100,

       A(i) = B(i) + C(i)

that is, a complete vector addition. Or, suppose C is a scalar:

       REAL C
       REAL, DIMENSION(1:100) :: A, B

Then the meaning of the assignment statement is to execute 100 assignments of the form A(i) = B(i) + C. Conceptually, one can think of C as being promoted to a vector containing 100 copies of its value, followed by a vector addition.
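
Written out as an explicit loop, the scalar case corresponds to something like the following (a sketch of the meaning only, not necessarily how a compiler implements it):

       do i = 1, 100
           A(i) = B(i) + C    ! the scalar C participates in every element-wise addition
       end do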

These concepts generalize to multiple dimensions: the variables may, in general, be n-dimensional matrices, and lower-dimensional variables are automatically promoted to higher dimensions before the element-wise operations are performed.

The assignment statements can be more sophisticated in picking which elements of an array are involved. For example:

       A(4:100:2) = B(1:49) + 1.0

Here, the first 49 elements of B are selected, and the incremented values are stored into A(4), A(6), …, A(100) (the remaining locations of A are untouched).
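
As an explicit loop, this strided assignment corresponds roughly to the following sketch of its semantics:

       do i = 1, 49
           A(2 + 2*i) = B(i) + 1.0    ! stores into A(4), A(6), ..., A(100)
       end do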

In data parallel assignment statements, all the element-wise computations can be done in parallel. This is by definition, and not merely an observation inferred by reasoning about the statements. The following example illustrates the subtlety:

       A(1:99) = A(2:100) - 1

Consider two of the element-wise assignments involved: A(1) = A(2) - 1 and A(2) = A(3) - 1. The first one reads A(2) and the second one writes A(2). If executed in parallel without care, there is a race condition: the first statement may read the old or the new value of A(2), depending on how things are scheduled.

The language definition eliminates this ambiguity by defining the semantics of the statement as (conceptually) performing all reads of the right-hand side before any of the writes to the left-hand side.

In the general case, the implementation may have to allocate temporary storage for all the results before doing any of the final assignments to the left-hand-side array. For example, it may transform the above assignment into:

       REAL, DIMENSION(1:99) :: TEMP

       TEMP(1:99) = A(2:100) - 1
       A(1:99) = TEMP

A clever compiler may of course avoid introducing such temporary storage for a particular assignment statement, if it sees that there is no danger of a race (as in the assignment statements earlier in this section).

In addition to these data parallel assignments, Fortran 90 also has data parallel loops (which may contain several assignment statements) and a variety of operators to perform reduction operations on arrays, such as finding the sum of the elements in an array, the maximum element, and so on.
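
For example, a few of the intrinsic array operators of Fortran 90 (a brief illustrative sketch):

       REAL, DIMENSION(1:100)   :: A
       REAL, DIMENSION(100,100) :: P, Q, R
       REAL :: total, biggest

       total   = SUM(A)          ! sum of all the elements of A
       biggest = MAXVAL(A)       ! maximum element of A
       R       = MATMUL(P, Q)    ! matrix multiplication as a single primitive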

Since these operators are primitives, they may be implemented by the compiler using parallel algorithms; the programmer is essentially specifying the parallelism at a very high level, leaving it up to the compiler to manage all the tedious coding details.

Fortran 90 was originally developed for vector supercomputers, machines with hardware support to perform element-wise operations on arrays in parallel. However, nowadays the more common computing workhorses are clusters and multicomputers, which are distributed-memory machines.

In such machines it is very important to have a good data distribution: a partitioning of the data into pieces that are distributed across the individual computers of the parallel machine. A bad distribution can result in a lot of communication and, since communication is quite expensive in these machines, this results in poor performance.

High Performance Fortran (HPF) was developed to address this problem. It extends Fortran 90 with directives by which the user can specify how an array is to be partitioned across a parallel machine and how pieces of different arrays are to be “aligned”, or colocated. For example,

       REAL, DIMENSION(100,100) :: A, B

    !HPF$ ALIGN A(I,J) WITH B(J,I)
    !HPF$ PROCESSORS P(4)
    !HPF$ DISTRIBUTE A(BLOCK,*) ONTO P

The second line specifies that, for 1 <= i <= 100 and 1 <= j <= 100, each element A(i,j) should be on the same processor as B(j,i). The third line specifies that, conceptually, there are 4 processors. The last line specifies that the rows of A should be distributed in blocks over the 4 processors.

In particular, the chunk A(1,1) through A(25,100) should be on processor P1, the chunk A(26,1) through A(50,100) on processor P2, the chunk A(51,1) through A(75,100) on processor P3, and the chunk A(76,1) through A(100,100) on processor P4. By the ALIGN directive, this in turn places requirements on how B should be distributed. This is illustrated in Figure 1.5, below.

Figure 1.5. High-Performance Fortran (HPF) array layouts.

From this programmer-supplied information, an HPF compiler does two things. First, it decides how to distribute work across the processors. For example, given an assignment statement:

    A = A + 1

it may decide to perform the 2500 assignments for A(1,1) through A(25,100) on processor P1, the 2500 assignments for the next chunk on processor P2, and so on.
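
Schematically, the code generated for processor P1 might then look like the following local loop nest (a sketch of this “owner computes” work partitioning, not literal compiler output):

       ! on processor P1, which holds the block A(1,1) through A(25,100)
       do j = 1, 100
           do i = 1, 25
               A(i,j) = A(i,j) + 1
           end do
       end do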

Second, it decides what communication, if any, is necessary, and generates the necessary code. In the above assignment statement, no communication at all is necessary since, for each assignment, both the left- and right-hand side data are always on the same processor. However, consider:

    A(1:100,1:99) = A(1:100,2:100) + 1
    A(1:99,1:100) = A(2:100,1:100) + 1

The first statement, again, needs no communication, but the second one does. For example, consider any element assignment A(25,j) = A(26,j) + 1. The right-hand-side element is on processor P2, whereas the left-hand side is on processor P1. Consequently, every element in row 26 needs to be communicated from processor P2 to processor P1.

HPF compilers detect such communication requirements and typically generate appropriate message-passing code. In fact, they usually go further and try to produce a single send-receive of the entire 26th row of 100 elements, rather than 100 individual sends and receives, since the former is much cheaper on most multicomputers (less fixed overhead).

Choosing a good data distribution is not easy. In one part of the computation, it may be appropriate to align A with B in one particular way; in another part of the computation, another alignment may be better. A good alignment of A with B may suggest one distribution across processors, whereas a good alignment of A with C may suggest another.

Arrays may be redistributed dynamically to alleviate such conflicting requirements, but redistribution is not free, and its cost has to be weighed against not redistributing and paying more for communication somewhere else.

Many researchers are skeptical that such data distribution optimization can be done manually by programmers and are working hard to devise automated means of inserting HPF data distribution annotations into plain Fortran 90 code.

Although HPF and Fortran 90 are the principal examples of data parallel programming languages, the same ideas can of course be embedded in other languages. For example, the language Split-C is a data parallel programming language from the University of California at Berkeley and is an extension of C [13].

It is a lower-level language because, unlike in HPF, the programmer is aware of separate threads executing on different processors and of the distinction between local data (on the same computer as a thread) and global data (on a different computer). The language NESL is a purpose-built data parallel language from Carnegie Mellon University that allows nested data parallelism.

We close Part 2 of this series by noting that this style of data parallel programming only addresses a certain class of applications, namely those based on dense rectangular arrays with simple, nested loop parallelism.

Like automatic parallelization in general, these methods are not very appropriate for applications based on sparse matrices or those based on other data structures such as trees and graphs.

Next in Part 3: Explicit parallelism - shared-memory programming with threads and locks
To read Part 1, go to How sequential languages obscure parallelism.

Rishiyur S. Nikhil and Arvind are the coauthors of Implicit Parallel Programming in pH, published by Elsevier/Morgan Kaufmann.

Rishiyur is Chief Technical Officer at ESL specialist Bluespec Inc. Prior to its acquisition by Broadcom, he led the Bluespec technology team at Sandburst Corp. For the previous 9 years he was at the Cambridge Research Laboratory (DEC/Compaq), including over a year as Acting Director. Earlier, he was an Associate Professor of Computer Science and Engineering at MIT. He holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing, and compiling.

Arvind is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology. His work laid the foundations for Bluespec, centering on high-level specification and description of architectures and protocols using Term Rewriting Systems (TRSs), encompassing hardware synthesis as well as verification of implementations against TRS specifications. Previously, he contributed to the development of dynamic dataflow architectures, the implicitly parallel programming languages Id and pH, and the compilation of these languages on parallel machines. He was also president and founder of Sandburst Corp.

To read more on this topic on Embedded.com, go to More about multicores, multithreading and multiprocessing.

References:
[1] "Microsoft: Parallel Programming Model Ten Years Off," by Rick Merritt, Embedded.com, July 23, 2007.
[2] "The Problem with Threads," by Edward A. Lee, IEEE Computer, 39(5), May 2006, pp. 33-42.
[3] "Why threads are a bad idea (for most purposes)," J. K. Ousterhout, Invited Talk, USENIX Technical Conference, January 1996.
[4] S. Ahuja, N. Carriero, and D. Gelernter. "Linda and Friends," IEEE Computer, 19(8):26-34, August 1986.
[5] Z. Ariola and Arvind. "Properties of a First-order Functional Language with Sharing." Theoretical Computer Science, 146(1-2):69-108, July 1995.
[6] J. Armstrong. "The Development of Erlang." In Proc. Intl. Conf. on Functional Programming (ICFP), Amsterdam, The Netherlands, pages 196-203, 1997.
[7] J. Armstrong, R. Virding, and M. Williams. Concurrent Programming in Erlang. Prentice Hall, 1993. ISBN 0-13-285792-8.
[8] K. Arnold and J. Gosling. The Java Programming Language, second edition. Addison-Wesley, 1998. ISBN 0-201-31006-6.
[9] Arvind, A. Caro, J.-W. Maessen, and S. Aditya. A Multithreaded Substrate and Compilation Model for the Implicitly Parallel Language pH. In Proc. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), August 1996.
[10] Arvind, K. P. Gostelow, and W. Plouffe. The (preliminary) Id Report. Technical Report 114, Dept. of Information and Computer Science, Univ. of California, Irvine CA, USA, 1978.
[11] Arvind and R. S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. on Computers, 39(3):300-318, March 1990.
[12] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Proc. 5th ACM Symp. on Principles and Practice of Parallel Programming (PPOPP), Santa Barbara CA, USA, pages 207-216, July 19-21, 1995.
[13] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proc. Supercomputing 93, Portland OR, USA, November 1993.
[14] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam. PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
[15] R. H. Halstead. Multilisp: A Language for Concurrent Symbolic Computation. ACM Trans. on Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
[16] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, Houston TX, USA, May 3, 1993.
[17] C. Hoare. Monitors: an Operating System Structuring Concept. Communications of the ACM, 17(10):549-557, October 1974.
[18] R. Hughes. Why Functional Programming Matters. The Computer Journal, 32(2):98-107, 1989.
[19] Threads standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
[20] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May 1994.
[21] R. S. Nikhil. "Cid: A Parallel 'Shared-memory' C for Distributed Memory Machines." In Proc. 7th Ann. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390, August 8-10, 1994.
[22] R. S. Nikhil and Arvind. Id: a Language with Implicit Parallelism. In A Comparative Study of Parallel Programming Languages: The Salishan Problems, John Feo (editor), pages 169-215. North Holland, 1992.
[23] R. S. Nikhil, K. Pingali, and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation Structures Group, MIT Lab. for Computer Science, Cambridge MA 02139, USA, July 1986.
[24] OpenMP Architecture Review Board. OpenMP Specifications, 1997 (Fortran), 1998 (C/C++).
[25] S. L. Peyton Jones. The Implementation of Functional Programming Languages. Prentice Hall, 1987.
[26] S. L. Peyton Jones (editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W. Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M. Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and P. Wadler. Haskell 98: A Non-strict, Purely Functional Language, February 1999.
[27] M. C. Rinard and P. C. Diniz. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers. ACM Trans. on Programming Languages and Systems (TOPLAS), 19(6):942-991, November 1997.
[28] M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: A High-Level, Machine-Independent Language for Parallel Programming. IEEE Computer, 26(6):28-38, June 1993.
[29] N. Wirth. Programming in Modula-2. Springer Verlag, Berlin, Germany, 1982.
