
Making the transition from sequential to implicit parallel programming: Part 5

Since dealing with state (reuse of storage) causes so much difficulty for parallel execution, we can ask if there are programming notations that avoid the notion of state completely, and whether such notations are adequate for general-purpose programming.

The answer to the first question is yes: such languages are called declarative languages. The answer to the second question is no: some programming tasks are difficult, if not impossible, to express declaratively.

The efficient implementation of pure declarative languages is also very difficult (curiously, the implementation problem is the dual of the automatic parallelization problem, an observation that we will return to shortly). Nevertheless, declarative languages are an excellent basis for more complete parallel programming languages and, indeed, pH, the language advocated in this series of articles, is based on this idea.

A well-known class of declarative languages is that of functional programming languages, such as Lisp, SML, and Haskell [26]. Hughes provides an excellent articulation of the advantages of functional programming in “Why Functional Programming Matters” [18]. A second well-known class is that of logic programming languages, such as Prolog and KL1, and their more modern derivatives, constraint logic programming languages, such as Oz. We will mainly discuss functional languages here (because of their relationship to pH), but many of our remarks also apply to the latter.

As an aside, we note that the adjective “declarative” is subjective. Intuitively it means that the program describes what is to be computed rather than how to compute it. Such a description is “higher level”, or more “abstract”, and hence presumably easier for humans to write and comprehend.

However, no programming language can ever get away completely from the how of problem solving. The most abstract descriptions of computations may not even be executable (such descriptions are the purest form of specifications).

For example, an abstract description of the square root function is “the function which, for argument n, produces x such that x² = n.” An abstract description of a sorting function is “the function which, for argument sequence xs, produces the sequence ys which is a permutation of xs and in which y(i) < y(i+1).”

It is unreasonable to expect any interpreter/compiler to actually execute these specifications systematically and efficiently. Thus, all programming, no matter how high level, fundamentally concerns algorithms, which specify how to compute an answer. So, the adjective “declarative” is mainly an assessment of how much non-problem-specific detail can be suppressed in a programming language.

At one extreme (least declarative) is architecture-specific assembler, where the programmer is intimately concerned with managing storage layout, movement of data between registers and memory, and sequencing.

Most architecture-independent programming languages, such as C, Pascal, and Fortran, abstract away from some storage-layout details and provide higher-level control structures such as if-then-elses, loops, and recursive procedures.

Some object-oriented languages like Java provide even higher-level structures such as data abstraction and inheritance, and further relieve the programmer from storage-management details by managing storage deallocation automatically.

Declarative languages like functional and logic programming languages abstract away from storage layout and storage management details completely, and also abstract away from many sequencing details.

Implicit Parallelism Due to Lack of Anti-dependences

In functional languages, there is no such thing as a variable whose value can change during the computation. Rather, a variable behaves like a mathematical variable, having a unique value that is computed (or “solved”) by the program.

Multiple definitions of a variable are simply not allowed, and all computations are expressed in terms of computing new values from old. For example, given A, a vector of numbers, suppose we want to scale it by 5; that is, the result should be a vector whose i'th component should be 5A(i). In Fortran, we might say:

do i = 1,N
    A(i) = 5 * A(i)
end do

and A itself contains the result. Further, the old value of A is no longer available. In functional languages, such a concept does not exist; instead, we define a new vector based on the old. For example, in pH:

b = array (1,n) [(i, 5 * a!i) | i <- [1..n]]

(In Haskell, array indexing is denoted by the infix “!” operator.) This defines b, a new vector with index range 1 to n, such that b(i) = 5a(i). Nothing has happened to a; it still means exactly what it used to mean. For example, we could now produce a third vector c that is the vector-sum of a and b:

c = array (1,n) [(i, a!i + b!i) | i <- [1..n]]
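
(pH itself is not widely distributed, but because pH shares Haskell's syntax, these definitions can be tried almost verbatim in standard Haskell with the Data.Array library. The following is a minimal sketch; the size n and the contents of a are made up for illustration:

    import Data.Array

    n :: Int
    n = 4

    -- An example vector a with index range 1..n (contents illustrative only).
    a :: Array Int Int
    a = listArray (1, n) [10, 20, 30, 40]

    -- b(i) = 5 * a(i): a new vector defined from the old one; a is unchanged.
    b :: Array Int Int
    b = array (1, n) [(i, 5 * a!i) | i <- [1..n]]

    -- c(i) = a(i) + b(i): the vector-sum, again defined from existing vectors.
    c :: Array Int Int
    c = array (1, n) [(i, a!i + b!i) | i <- [1..n]]

    main :: IO ()
    main = do
        print (elems b)    -- [50,100,150,200]
        print (elems c)    -- [60,120,180,240]

A sequential Haskell system will not evaluate these definitions in parallel; the point of the sketch is only that the functional style, new values defined from old ones, carries over directly.)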

Matrix multiplication may be expressed in pH as follows:

matmult a b = matrix ((1,1),(n,n))
                  [((i,j), ip a i b j) | i <- [1..n], j <- [1..n]]

ip a i b j =
    let
        s = a!(i,1) * b!(1,j)
    in
        for k <- [2..n] do
            next s = s + a!(i,k) * b!(k,j)
        finally s

The first part defines the function matmult, which takes two arguments, the matrices a and b. It produces a new n x n matrix whose (i,j)'th component is the inner product of the i'th row of a with the j'th column of b.

(pH permits us to extract size information from a matrix, but since this representation choice is orthogonal to whether a language is functional or not, and since we want to keep the example simple, we assume that a and b are n x n matrices.)

The second part defines the inner product function ip. The let statement binds the value of a(i,1) x b(1,j) to a variable s. Although the loop may superficially look like the corresponding Fortran loop, it is fundamentally different: it expresses the idea of n separate and concurrent computations:

s1 = a!(i,1) * b!(1,j)
s2 = s1 + a!(i,2) * b!(2,j)
s3 = s2 + a!(i,3) * b!(3,j)
...
sn = s(n-1) + a!(i,n) * b!(n,j)

The first computation is the one outside the loop (the let-binding), and sn, the final s, is the value of the entire loop. In other words, there are n different values of s; the next notation simply relates each sk to its corresponding s(k-1); and the finally term just specifies which s is returned as the result value of the whole loop (and, in turn, the function).

The statements of a loop are all executed concurrently, limited only by data dependencies (although the data dependencies in this particular loop do not admit much parallelism other than doing the multiplications in parallel). Thus, it is not the case that s is a variable that is being updated, as in the Fortran loop; each variable still has a unique definition.

The pH program for matrix multiplication directly expresses the structures in the original figures at the start of this series of articles, illustrating the mathematical definition. In particular, it takes no position on the order in which computations are done, beyond the natural ordering of data dependencies.
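
(Again, the example can be tried in standard Haskell. This is a sketch under the same assumptions as before: n is passed explicitly, and the pH for/next/finally loop is rendered as an ordinary sum over the index list, which is its sequential reading:

    import Data.Array

    type Matrix = Array (Int, Int) Double

    -- matmult n a b: the (i,j)'th component is the inner product of
    -- row i of a with column j of b; a and b are assumed to be n x n.
    matmult :: Int -> Matrix -> Matrix -> Matrix
    matmult n a b =
        array ((1,1), (n,n))
              [((i,j), ip i j) | i <- [1..n], j <- [1..n]]
      where
        -- The inner product s1, s2, ..., sn, written as a sum.
        ip i j = sum [a!(i,k) * b!(k,j) | k <- [1..n]]

    main :: IO ()
    main = print (elems (matmult 2 m m))
      where
        m = listArray ((1,1), (2,2)) [1, 2, 3, 4]

The where-clause plays the role of pH's let, and the chain s1, s2, ..., sn is exactly the data-dependence chain that sum follows.)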

Since variables never change, there cannot ever be any anti-dependences. It is for this reason that functional languages are considered to be “naturally parallel” languages.

Implicit Producer-Consumer Parallelism

Most of the parallelism in matrix multiplication comes from data parallelism: an independent inner product is computed at each of the n² components of the result matrix.

Another kind of parallelism that is very important is producer-consumer parallelism, and this arises naturally in functional languages with nonstrict data structures. Suppose, instead of scaling vector a by 5, we have a more complicated mapping function f:

b = array (1,n) [(i, f (a!i)) | i <- [1..n]]

The computation of all elements of b may be done in parallel, but they may take widely differing times. In languages with strict data structures (most existing programming languages), a consumer of b must wait until all components have been defined (i.e., until the slowest component computation has completed) because b is simply not available to the consumer until it is fully defined.

In languages like pH with nonstrict data structures, on the other hand, a consumer may be given a reference to b as soon as it is allocated and may attempt to read its components immediately.

These reads automatically wait as long as the components are empty and automatically resume as they become filled. Thus, the consumer can overlap some of its work with the computation of b's components.
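
(In pH the components are filled in eagerly and in parallel, and a read of an empty component blocks until it is filled. Sequential Haskell has no such dataflow synchronization, but its laziness gives a consumer the same freedom not to wait for slow components, as in this sketch, where slow is an invented stand-in for an expensive component computation:

    import Data.Array

    -- An artificially expensive component computation.
    slow :: Int -> Int
    slow x = sum [1 .. 50000000 :: Int] `seq` x

    b :: Array Int Int
    b = listArray (1, 3) [1, slow 2, 3]

    main :: IO ()
    main = do
        -- The consumer reads b!1 and b!3 immediately; only the
        -- demand for b!2 forces the expensive computation.
        print (b!1)
        print (b!3)
        print (b!2)

The mechanisms differ, demand-driven evaluation versus eager parallel filling, but in both cases the consumer holds a reference to b before all of its components exist.)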

For a more dramatic example of producer-consumer parallelism, suppose we have one computation that produces a list of sample readings (perhaps obtaining them from some measuring instrument) and another computation that analyzes these readings looking for interesting events. In pH:

producer instrument =
    if no_more_readings(instrument)
    then []
    else
        let
            x = get_reading(instrument)
        in
            x : producer(instrument)

analyzer [] = []
analyzer (x:xs) = if is_interesting(x)
                  then x : analyzer(xs)
                  else analyzer(xs)

The notation “x:xs” is pronounced “x cons xs” and represents a list with x in front, followed by the remaining list, xs. The notation [] represents an empty list. The producer function gets a reading x from the instrument and produces a list containing x followed by more readings, until the instrument indicates that there are no more readings.

The analyzer function handles two cases, and this is indicated by matching the actual argument against one of the two patterns [] and (x:xs). In the former case, it returns an empty list of interesting readings.

In the latter case, it checks if the first reading x is interesting. If so, it returns it, followed by the result of analyzing the remaining readings xs. Otherwise, it discards x and simply returns the result of analyzing the remaining readings.

Now, the interesting producer-consumer parallelism occurs when we plug these functions together:

events = analyzer (producer(a_particular_instrument))
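
(A runnable approximation in standard Haskell models the instrument as a lazy, here infinite, list of readings; the names readings and is_interesting are invented for the sketch. Because Haskell lists are nonstrict, the analyzer consumes readings as they are produced rather than waiting for a complete list; in pH the two functions would additionally run concurrently:

    -- The instrument modeled as an endless stream of readings.
    readings :: [Int]
    readings = [0 ..]

    -- A stand-in for the "is this reading interesting?" test.
    is_interesting :: Int -> Bool
    is_interesting x = x `mod` 7 == 0

    analyzer :: [Int] -> [Int]
    analyzer []     = []
    analyzer (x:xs) = if is_interesting x
                      then x : analyzer xs
                      else analyzer xs

    main :: IO ()
    main = print (take 5 (analyzer readings))

Even though readings never ends, this prints [0,7,14,21,28] at once: the take 5 demand drives the producer just far enough, and the consumer keeps pace with it instead of waiting for a whole list.)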

Suppose the instrument yields a reading every ten seconds, there are a hundred readings, and it takes five seconds to analyze each reading. With conventional (“strict”) data structures, the producer function would, after 1000 seconds or so, return a list of 100 elements.

At this point, the analyzer would begin its work and, about 500 seconds later, eventually return a list of interesting readings. Overall, the computation takes 1500 seconds or more.

With nonstrict data structures, however, as soon as the instrument has yielded its first reading, the producer can return the first element of the list, leaving the “tail” of the list temporarily empty, to be filled in later when it gets the next reading.

As soon as the producer has returned this first reading, the analyzer can examine it. If it is interesting, the analyzer, in turn, can immediately return the first element of its result list, leaving its tail empty, to be filled in later.

There is an implicit synchronization necessary here. In particular, the analyzer, having examined the first reading, may attempt to proceed to the next reading, which may not be available yet.

The semantics of nonstrict data structures specify that the analyzer automatically blocks until the tail of its input list is no longer empty (i.e., has been filled in by the producer).

One can visualize this as a chain that is gradually produced by the producer and incrementally consumed by the analyzer as it becomes available. The work of the analyzer and producer can thus be overlapped, instead of being run seriatim. Thus, each 5-second activity of the analyzer can take place during the 10-second wait for the next instrument reading, and the overall computation can be completed in about 1000 seconds.

Although we have illustrated producer-consumer parallelism with a simple, linear data structure (a list), the same principle applies to arbitrary data structures in pH and nonstrict functional languages.

For example, the producer may organize the instrument readings in a sorted tree. The picture changes to one where the fringe, or frontier, of the tree grows incrementally as readings arrive. The analyzer, in parallel, pursues all branches up to the current fringe, consuming values at the fringe as it moves outward.
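
(A sequential Haskell sketch of the shape of such a computation, with an invented Tree type; in pH the analyzer would genuinely run in parallel with the producer, blocking at whichever fringe nodes are still empty:

    -- A sorted (binary search) tree of readings.
    data Tree a = Leaf | Node (Tree a) a (Tree a)

    insert :: Ord a => a -> Tree a -> Tree a
    insert x Leaf = Node Leaf x Leaf
    insert x t@(Node l y r)
        | x < y     = Node (insert x l) y r
        | x > y     = Node l y (insert x r)
        | otherwise = t

    -- The producer folds readings into a tree; under lazy evaluation,
    -- subtrees are built only as the analyzer demands them.
    produceTree :: Ord a => [a] -> Tree a
    produceTree = foldr insert Leaf

    -- The analyzer walks the tree in order, keeping interesting values.
    analyzeTree :: (a -> Bool) -> Tree a -> [a]
    analyzeTree _ Leaf = []
    analyzeTree p (Node l x r) =
        analyzeTree p l ++ [x | p x] ++ analyzeTree p r

    main :: IO ()
    main = print (analyzeTree even (produceTree [5,3,8,1,4,7,9,2,6]))

Running this prints [2,4,6,8], the interesting (here, even) readings in sorted order.)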

Similarly, the production of different elements of an array may take varied times, and a consumer of the array can simultaneously pick up those values that are ready at any given point in time.

Next in Part 6: So, why aren't we all using functional languages yet?
To read Part 1, go to How sequential languages obscure parallelism.
To read Part 2, go to How to achieve parallel execution?
To read Part 3, go to Explicit Parallelism: Shared-Memory Programming with Threads and Locks.
To read Part 4, go to Explicit Parallelism: Message-Passing Programming.

Rishiyur S. Nikhil and Arvind are the coauthors of Implicit Parallel Programming in pH, published by Elsevier/Morgan Kaufmann.

Rishiyur is Chief Technical Officer at ESL specialist Bluespec Inc. Prior to its acquisition by Broadcom, he led the Bluespec technology team at Sandburst Corp. For the previous nine years he was at Cambridge Research Laboratory (DEC/Compaq), including over a year as Acting Director. Earlier, he was an Associate Professor of Computer Science and Engineering at MIT. He holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing, and compiling.

Arvind is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology. His work laid the foundations for Bluespec, centering on high-level specification and description of architectures and protocols using Term Rewriting Systems (TRSs), encompassing hardware synthesis as well as verification of implementations against TRS specifications. Previously, he contributed to the development of dynamic dataflow architectures, the implicitly parallel programming languages Id and pH, and compilation of these languages on parallel machines. He was also president and founder of Sandburst Corp.

To read more on this topic on Embedded.com, go to More about multicores, multithreading and multiprocessing.

References:
[1] “Microsoft: Parallel Programming Model Ten Years Off,” by Rick Merritt, Embedded.com, July 23, 2007.
[2] “The Problem with Threads,” by Edward A. Lee, IEEE Computer, 39:5, May 2006, pp. 33-42.
[3] “Why threads are a bad idea (for most purposes),” J. K. Ousterhout, Invited Talk, USENIX Technical Conference, January 1996.
[4] S. Ahuja, N. Carriero, and D. Gelernter. “Linda and Friends,” IEEE Computer, 19(8):26-34, August 1986.
[5] Z. Ariola and Arvind. “Properties of a First-order Functional Language with Sharing.” Theoretical Computer Science, 146(1-2):69-108, July 1995.
[6] J. Armstrong. “The Development of Erlang.” In Proc. Intl. Conf. on Functional Programming (ICFP), Amsterdam, The Netherlands, pages 196-203, 1997.
[7] J. Armstrong, R. Virding, and M. Williams. Concurrent Programming in Erlang. Prentice Hall, 1993. ISBN 0-13-285792-8.
[8] K. Arnold and J. Gosling. The Java Programming Language, second edition. Addison-Wesley, 1998. ISBN 0-201-31006-6.
[9] Arvind, A. Caro, J.-W. Maessen, and S. Aditya. A Multithreaded Substrate and Compilation Model for the Implicitly Parallel Language pH. In Proc. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), August 1996.
[10] Arvind, K. P. Gostelow, and W. Plouffe. The (preliminary) Id Report. Technical Report 114, Dept. of Information and Computer Science, Univ. of California, Irvine CA, USA, 1978.
[11] Arvind and R. S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. on Computers, 39(3):300-318, March 1990.
[12] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Proc. 5th ACM Symp. on Principles and Practice of Parallel Programming (PPOPP), Santa Barbara CA, USA, pages 207-216, July 19-21, 1995.
[13] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proc. Supercomputing 93, Portland OR, USA, November 1993.
[14] A. Beguelin, A. Geist, J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam. PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
[15] R. H. Halstead. Multilisp: A Language for Concurrent Symbolic Computation. ACM Trans. on Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
[16] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, Houston TX, USA, May 3, 1993.
[17] C. Hoare. Monitors: an Operating System Structuring Concept. Communications of the ACM, 17(10):549-557, October 1974.
[18] R. Hughes. Why Functional Programming Matters. The Computer Journal, 32(2):98-107, 1989.
[19] Threads standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
[20] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May 1994.
[21] R. S. Nikhil. “Cid: A Parallel 'Shared-memory' C for Distributed Memory Machines.” In Proc. 7th Ann. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390, August 8-10, 1994.
[22] R. S. Nikhil and Arvind. Id: a Language with Implicit Parallelism. In A Comparative Study of Parallel Programming Languages: The Salishan Problems, John Feo (editor), pages 169-215. North Holland, 1992.
[23] R. S. Nikhil, K. Pingali, and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation Structures Group, MIT Lab. for Computer Science, Cambridge MA 02139, USA, July 1986.
[24] OpenMP Architecture Review Board. OpenMP Specifications, 1997 (Fortran), 1998 (C/C++).
[25] S. L. Peyton Jones. The Implementation of Functional Programming Languages. Prentice Hall, 1987.
[26] S. L. Peyton Jones (editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W. Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M. Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and P. Wadler. Haskell 98: A Non-strict, Purely Functional Language, February 1999.
[27] M. C. Rinard and P. C. Diniz. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers. ACM Trans. on Programming Languages and Systems (TOPLAS), 19(6):942-991, November 1997.
[28] M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: A High-Level, Machine-Independent Language for Parallel Programming. IEEE Computer, 26(6):28-38, June 1993.
[29] N. Wirth. Programming in Modula-2. Springer Verlag, Berlin, Germany, 1982.
