
Making the transition from sequential to implicit parallel programming: Part 7

Parallel Haskell (pH) is based on over two decades of research with several versions of the functional language Id [10][23] and experience with their implementation on dataflow and conventional architectures [11][49], as well as alternatives such as Multithreaded Id using P-RISC graphs and the MIT Tagged-Token Dataflow Architecture.

Some years ago, we recognized the semantic similarity between the functional core of Id and the standard nonstrict functional language Haskell. We therefore decided to adopt Haskell for the functional core, extending it as necessary with I-structure and M-structure constructs, as shown in Figure 1.11 below.

The Haskell basis should make pH familiar to a much larger audience and should make exploitation of parallelism an easy step for programmers already familiar with Haskell.

Figure 1.11: Layers of pH.

The Functional Layer
A large subset of pH is its functional core, whose syntax and nonstrict semantics are exactly those of the pure functional language Haskell. The major programming style in pH is functional, and because there is no notion of state, it is trivial to write parallel programs without races in the functional core. In fact, it is impossible to write a program with races; functional programs are guaranteed to be deterministic.

A central feature of the functional core is the higher-order function. The key idea is that functions themselves can be treated as values. Higher-order functions increase the parametric power of a language and, consequently, allow us to write shorter, more abstract programs.

For example, a function to sort a list can take, as one of its arguments, the function used to decide the relative order of any pair of elements in the list. Then, the same sorting function can be used to sort a list of integers in both ascending and descending order, simply by passing in the "<" or ">" function, respectively.

Or, the same sorting function can be used to sort a list of strings in lexicographic order, by passing in the appropriate string comparison function. The experienced functional programmer uses higher-order functions like oxygen: they are an essential part of one's toolkit.
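
Because pH's functional core shares Haskell's syntax, a standard Haskell sketch can illustrate the point. Here the library function sortBy plays the role of the comparison-parameterized sort described above; the example names are ours, chosen for illustration.

import Data.List (sortBy)
import Data.Ord  (comparing)

-- The same higher-order sort, specialized three ways by the
-- comparison function we pass in.
ascending :: [Int]
ascending = sortBy compare [3, 1, 2]          -- [1,2,3]

descending :: [Int]
descending = sortBy (flip compare) [3, 1, 2]  -- [3,2,1]

-- Strings sorted by a different criterion: ["fig","pear","banana"]
byLength :: [String]
byLength = sortBy (comparing length) ["pear", "fig", "banana"]

main :: IO ()
main = do
  print ascending
  print descending
  print byLength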

We have already mentioned the notion of nonstrictness, another central feature of pH. Functions can be executed before arguments are available (unless restricted by data dependencies) and can even return results before the body has completed computing.

Similarly, any component of a data structure may be empty for an indefinite period, allowing producers and consumers to compute concurrently. We motivated this idea from the point of view of producer-consumer parallelism, but this feature also fundamentally adds expressive power to the language, allowing us to write certain recursive definitions of data structures that would otherwise be awkward to express.
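
In Haskell, whose nonstrict semantics pH's functional core mirrors, this shows up as the familiar ability to define a data structure in terms of itself; the consumer reads components as the producer fills them in. A minimal sketch:

-- The tail of 'ones' is the very list being defined.
ones :: [Integer]
ones = 1 : ones

-- The Fibonacci sequence defined in terms of its own elements.
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

main :: IO ()
main = print (take 10 fibs)   -- [0,1,1,2,3,5,8,13,21,34]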

Studies indicate that the practical advantages of nonstrictness (producer-consumer parallelism, definition of cyclic data structures) may accrue even with strict functions, as long as data structures are nonstrict. This could have significant performance advantages: not only can strict function calls be implemented more efficiently, but they also permit better compiler analysis of function bodies, resulting in better code therein. This is still an open area of research and is a possible future direction for pH.

Pervading the entire language is the notion of types and static type checking, although this is orthogonal to parallelism. Every phrase in pH (variable, expression, function, and so on) must have a well-defined data type, and this property is checked statically (i.e., by the compiler) before the program runs.

Static types facilitate the construction of robust programs, because many simple programming errors are caught at compile time and because they give a convenient framework in which to organize the program.

Static types have sometimes earned a bad reputation for increasing the "bureaucracy" that the programmer has to deal with; however, there are three aspects of pH's type system that make it easy to use compared to, say, the static type systems of Pascal or C.

The first is polymorphism: for example, we can write a single function that computes the length of a list, no matter what the list contains (integers, strings, and so on), whereas in Pascal we would have to write several (nearly identical) functions for this purpose, one for integer lists, one for string lists, and so on.

The second convenient feature is type inference: the programmer is relieved of the tedium of having to predeclare the type of every variable used in the program. Instead, the compiler infers these types automatically and, in fact, it is perfectly safe for the programmer to omit all type declarations (although it is highly recommended that the programmer declare the types of at least the top-level functions).

The third convenience is overloading, whereby, for example, we can write a single function to multiply two integer or floating-point matrices, whereas in Pascal we would have to write two (nearly identical) functions for this purpose. Overall, the static type system is invaluable in structuring the way we think about and design programs.
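
A short Haskell sketch (pH shares this type system in its functional core) illustrating all three conveniences; the names len, square, and twice are ours, chosen for illustration:

-- Polymorphism: one length function for lists of any element type.
len :: [a] -> Int
len []     = 0
len (_:xs) = 1 + len xs

-- Overloading: one squaring function works at any numeric type,
-- much as a single matrix multiply could serve integer and
-- floating-point matrices.
square :: Num a => a -> a
square x = x * x

-- Type inference: no declaration needed; the compiler infers
-- twice :: (a -> a) -> a -> a on its own.
twice f x = f (f x)

main :: IO ()
main = do
  print (len "hello")              -- 5
  print (len [1, 2, 3 :: Int])     -- 3
  print (square (3 :: Int))        -- 9
  print (square (1.5 :: Double))   -- 2.25
  print (twice (+ 1) (10 :: Int))  -- 12
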
The I-structure Layer
As noted in Figure 1.11 previously, the second layer of pH adds "I-structures" to the functional core. I-structures are a limited but benign form of state: benign, because it is still impossible to write a program with a race condition, and programs are still guaranteed to be determinate. Data structures in pH can contain I-structure components.

What this means is that when the data structure is allocated, the component is initially left empty. Later, a computation can assign a value into that component, making it full. Any other computation that tries to read that component automatically blocks until the value is available.
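
I-structures themselves are not part of standard Haskell, but the empty/full blocking behavior can be approximated in GHC Haskell with an MVar (the IVar of the monad-par library is closer in spirit). The sketch below only models the blocking read; unlike a real I-structure, nothing enforces the write-once rule.

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar)

main :: IO ()
main = do
  cell <- newEmptyMVar          -- allocate the component: initially empty
  _ <- forkIO $ do              -- a reader started before the value exists
         v <- readMVar cell     -- blocks until the cell is full, leaves it full
         print (v + 1 :: Int)   -- prints 42 once the producer has written
  threadDelay 100000            -- the producer fills the cell some time later
  putMVar cell 41               -- the single assignment: empty -> full
  threadDelay 100000            -- give the reader time to finish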

We have already mentioned this type of synchronization in the context of producer-consumer parallelism in nonstrict functional data structures. The difference here is this: in functional data structures, the allocation of the data structure and the values that fill its components are always specified syntactically together, as a single construct.

For this reason, in functional data structures we never even think of the action of filling in a component as an assignment: that is something inside the implementation. In I-structures, on the other hand, the computation that fills in an empty component may be in some syntactically unrelated part of the program and is thus always expressed explicitly as an assignment.

One common use of I-structures is to implement pure functions efficiently. For example, most functional languages provide a map_array primitive that transforms one array into another by applying a given function to each element, and a foldl_array primitive that computes some overall property of an array, such as the sum of all elements or the maximum element.

Now, suppose we want to both transform an array and compute an overall property. This map_foldl operation is a perfectly well-defined, pure function, and it is of course possible to define it by simply applying both the map and the fold primitives; however, this involves two array traversals. With I-structures, it is possible for the programmer to implement this function directly as a new primitive, using a single array traversal.
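
pH would build such a primitive over I-structure arrays; as a stand-in, here is a Haskell list sketch of the same idea (the name mapFoldl and its argument order are ours; Data.List.mapAccumL is the closest standard relative).

-- Transform each element with f while folding with g, in one pass.
mapFoldl :: (a -> b) -> (c -> b -> c) -> c -> [a] -> ([b], c)
mapFoldl _ _ acc []     = ([], acc)
mapFoldl f g acc (x:xs) = (y : ys, acc')
  where
    y          = f x
    (ys, acc') = mapFoldl f g (g acc y) xs

main :: IO ()
main = print (mapFoldl (* 2) (+) 0 [1, 2, 3 :: Int])
-- ([2,4,6],12): the doubled list and its sum, from a single traversal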

I-structures retain determinacy in pH precisely because of their "write-once" semantics: an I-structure component makes a single transition from empty to full with a value, and no computation can observe this transition, because any attempt to read an empty component will simply block the computation until the component is full and then return the value.

In other words, there is only one possible value that can be obtained by reading an I-structure component, and so there is no chance of nondeterminism.

I-structures are closely related to logic variables in logic programming languages. In fact, I-structures may be used in the same way that logic variables are often used: to implement difference lists, mailboxes, and so on.

The M-structure Layer
The final level of pH adds "M-structures" to the functional and I-structure layers. M-structures allow reuse of storage and, thus, arbitrary manipulation of state.

However, the main difference from assignment statements in imperative languages is that access to M-structure data is implicitly synchronized. This has two advantages.

First, it permits more economical expression of shared-state manipulation and, second, it can usually be implemented more efficiently. For example, consider many parallel computations that are all updating components of a shared array h. With explicit synchronization (e.g., with threads and locks), each update would be expressed as follows:

wait(locks[j]); h[j] = f(h[j]); signal(locks[j]);

In other words, we wait for exclusive access to the j'th component, using a companion array of locks. After performing the update, we release the lock to allow other similar parallel computations to proceed. In pH, we express it as follows:

h!j := f (h !& j)

where the h!j notation means "wait until Full, read, and leave Empty," and the assignment means "write, and make Full."

In addition to simplifying the notation (we don't have to declare an array of locks and explicitly perform waits and signals), the implementation can also be more efficient because of the combined synchronization and data access. For example, in the conventional imperative-language code, each of the operations (wait, update h, and signal) is likely to be a separate transaction with memory.

In the M-structure version, the wait and the read of h can be combined into a single transaction, as can the write of h and the signal, resulting in two transactions, the minimum possible.
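
GHC Haskell does not have M-structures, but the per-component behavior is well approximated by an MVar, whose takeMVar/putMVar protocol is exactly "wait until Full, read, leave Empty" and "write, make Full." A hedged sketch of the update shown above, for a single component (an M-structure array would then be, say, an array of such cells):

import Control.Concurrent.MVar (MVar, newMVar, takeMVar, putMVar)

-- One component of h, modeled as an MVar Int.
update :: MVar Int -> (Int -> Int) -> IO ()
update cell f = do
  x <- takeMVar cell      -- wait until Full, read, leave Empty
  putMVar cell (f x)      -- write, make Full (releases any waiting updater)

main :: IO ()
main = do
  cell <- newMVar 10
  update cell (+ 1)
  update cell (* 2)
  final <- takeMVar cell
  print final             -- 22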

In the functional and I-structure layers, there is no sequencing of operations other than that which is naturally implied by data dependencies. With M-structures, however, sequencing may be necessary to control nondeterminism.

For example, if we want to ensure that a bank deposit computation completes before we perform a withdrawal computation on a shared M-structure, data dependencies are no longer adequate; for this, pH provides ways to explicitly sequence computations.

As discussed earlier in Part 6, several languages, such as ML and Scheme, also have purely functional cores augmented with operations to manipulate state. However, in those languages, state manipulation comes at the expense of parallelism; unwanted race conditions are made into a nonissue by making the entire language strict and totally sequential.

Of course, many researchers have reintroduced limited, user-specified parallelism by extending these languages with traditional parallel constructs such as threads, locks, and message-passing.

Also, as discussed earlier, a recent development in nonstrict functional languages such as Haskell is a technique called monads, which is a way to encapsulate state manipulation. While this is done without making the entire language sequential, the state-manipulation operations are still completely sequenced, so one cannot have a parallel, state-manipulating program.
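
A minimal Haskell sketch of why monadic state is inherently sequenced: each step receives the state left by the previous one, so the two operations below cannot run in parallel. (This uses the State monad from the transformers package; the deposit/withdraw names are ours.)

import Control.Monad.Trans.State (State, get, put, execState)

deposit, withdraw :: Int -> State Int ()
deposit  amt = do { bal <- get; put (bal + amt) }
withdraw amt = do { bal <- get; put (bal - amt) }

main :: IO ()
main = print (execState (deposit 100 >> withdraw 30) 0)   -- 70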

In pH, on the other hand, despite the addition of M-structures, the overall programming style remains implicitly parallel unless otherwise specified, unlike other parallel languages, which are implicitly sequential unless otherwise specified. The difference is qualitative but noticeable: implicit parallelism makes it significantly easier to write massively parallel programs.

The parallelism in pH is pervasive. Not only does it occur at a very fine grain (e.g., reading the head and tail fields of a list record can proceed in parallel), but also at a larger grain (e.g., two functions or loop iterations can execute in parallel). Both of these are important for parallel computer architectures. The large-grain parallelism allows easy distribution of work across the multiple processors of a parallel machine.

The fine-grain parallelism allows each processor to exploit instruction-level parallelism (i.e., the ability to execute instructions of a thread in parallel) and multithreading (i.e., the ability to execute another fine-grain thread while one of its fine-grain threads has to wait, perhaps to access a datum on a different processor).

We know of no other programming language that exposes as much parallelism as effortlessly as pH.

The effect of architecture on programming
While this series is mainly about parallel programming, it is important to recognize that computer architecture can have a major effect on our programming languages and on our programming methodology.

For example, early computers (and even some as recent as the Intel 8086) had small physical address spaces and no architectural support for virtual memory. On these machines, it was still possible to write programs that manipulated huge data structures that would not fit into memory at once.

In effect, the programmer could mimic a virtual memory system by manually moving pieces of the data structure between disk and main memory. This kind of programming was the domain of experts who understood the architecture thoroughly.

Eventually, tools came into existence to automate some of this effort for large-memory programming, using a method called "overlays." However, these tools never came close to being able to automatically translate an arbitrary program that assumed a large memory into code that could work in limited memory.

In other words, the architectural limitation simply could not be hidden from programmers. Ultimately, it required architectural support for virtual memory before the situation changed. Finally, everyday programmers could write programs that were not limited by the processor's physical address space.

A similar situation exists with parallel processing on large systems, servers, desktops, network switches, and many modern mobile and portable consumer devices. A universal requirement of parallel architectures (large, small, and smaller) is that, in order to be physically scalable, they must have distributed memories: the machine's memory must be divided into several independent modules, and there must be many independent paths from processors to memory modules.

The first generation of commercially viable large parallel computers achieved this by simply interconnecting large numbers of conventional sequential computers; each node of the machine consists of a conventional processor with memory and a means to communicate with other nodes.

These multicomputers can be programmed with message-passing, which exactly mirrors the architectural organization. But we have seen how difficult it is to program using the message-passing model; so, while extremely useful and effective programs have been (and continue to be) written this way, it remains the domain of experts.

Just as yesterday's large-memory programming was only feasible for those who understood thoroughly how to manage movement of data between memory and disk, today's parallel programming with message-passing is only feasible for those who understand thoroughly how to manage movement of data between the processing nodes of a multicomputer.

HPF (High Performance Fortran) can be seen as an attempt to automate some of this. But, just as with yesterday's tools for overlays on small-memory machines, today's HPF programmer has to be sophisticated about data distributions, and in any case these tools cannot handle arbitrary, general-purpose programs.

It is our belief that shared-memory parallel programming today is the analog of virtual-memory programming of yesterday: it is the only model feasible for general-purpose programs written by ordinary programmers (and hence for widespread use).

Further, it will not become viable without architectural support. Even though the virtual-memory illusion is mostly implemented in operating-system software, it required some architectural support (translation lookaside buffers, page faults) before it became efficient enough that the average programmer no longer had to think about it.

Similarly, even though the shared-memory illusion may be implemented mostly in run-time system software, some architectural support is needed to make it efficient enough that the average parallel programmer no longer needs to think about it.

(This is not to imply that message-passing must be completely hidden from the programmer; carefully hand-crafted message-passing algorithms are still likely to be advantageous. We simply mean that, by and large, the programmer would prefer a shared-memory programming model.)

To date, there is no consensus on exactly what architectural support is necessary to support the shared-memory illusion on a distributed-memory parallel machine (even though there are some commercial products that choose particular approaches).

The question is further complicated because the required architectural support depends on what parallel programming models one wishes to support, and there is no consensus on that, either (the situation was easier with virtual memory: at least the user-level programming model was not an issue). Architectural support for distributed shared memory is a hot topic of research in many academic, industrial, and government groups all over the world.

Parallel programming for non-parallel programmers
We believe that, in the long run, for general-purpose parallel programming (i.e., arbitrary programs written by ordinary programmers), the language must make it effortless to express huge amounts of parallelism. The reasons are twofold:

1) No matter what architectural support there may be for shared memory, movement of data across the machine will always be slow compared to instruction execution times and local memory access times. Abundant parallelism permits the processor to perform useful work on one thread while another may be waiting for data to be moved.

This is analogous to the situation in operating systems today: when one process encounters a page fault, the processor switches to another process while the first one waits for the page to arrive from disk.

2) In arbitrary parallel programs, synchronization waits (waiting to acquire a lock) are as unpredictable as nonlocal memory accesses; again, we need an adequate number of threads so that the processor can continue useful work while some are waiting for synchronization events.

Since manual synchronization in the face of massive parallelism is too tedious to be feasible (except for programs with extremely simple, rigid structures), this, too, is something that the programmer should not have to manage explicitly.

The bottom line is that, for general-purpose, scalable parallel computing, architectures need to support fine-grain threads with fine-grain synchronization, and programming languages and implementations need to exploit this capability.

Although these architectural features will benefit all parallel programming languages, existing languages do not make it easy to express or to exploit fine-grain parallelism. Declarative programming languages such as pH appear to be the most promising approach toward this goal.

Next in Part 8: Turning pH into a production programming language
To read Part 1, go to How sequential languages obscure parallelism.
To read Part 2, go to How to achieve parallel execution?
To read Part 3, go to Explicit Parallelism: Shared-Memory Programming with Threads and Locks.
To read Part 4, go to Explicit Parallelism: Message-Passing Programming.
To read Part 5, go to Implicit Parallelism: Declarative Programming Languages.
To read Part 6, go to "So, why aren't we using functional languages yet?"

Rishiyur S. Nikhil and Arvind are the coauthors of Implicit Parallel Programming in pH, published by Elsevier/Morgan Kaufmann.

Rishiyur is Chief Technical Officer at ESL specialist Bluespec Inc. Previous to its acquisition by Broadcom, he led the Bluespec technology team at Sandburst Corp. For the previous 9 years he was at Cambridge Research Laboratory (DEC/Compaq), including over a year as Acting Director. Earlier, he was an Associate Professor of Computer Science and Engineering at MIT. He holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing, and compiling.

Arvind is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology. His work laid the foundations for Bluespec, centering on high-level specification and description of architectures and protocols using Term Rewriting Systems (TRSs), encompassing hardware synthesis as well as verification of implementations against TRS specifications. Previously, he contributed to the development of dynamic dataflow architectures, the implicitly parallel programming languages Id and pH, and compilation of these languages on parallel machines. He was also president and founder of Sandburst Corp.

To read more on this topic on Embedded.com, go to More about multicores, multithreading and multiprocessing.

References:
[1] "Microsoft: Parallel Programming Model Ten Years Off," by Rick Merritt, Embedded.com, July 23, 2007.
[2] "The Problem with Threads," by Edward A. Lee, IEEE Computer, 39:5, May 2006, pp. 33-42.
[3] "Why Threads Are a Bad Idea (for Most Purposes)," J. K. Ousterhout, Invited Talk, USENIX Technical Conference, January 1996.
[4] S. Ahuja, N. Carriero, and D. Gelernter. "Linda and Friends," IEEE Computer, 19(8):26-34, August 1986.
[5] Z. Ariola and Arvind. "Properties of a First-order Functional Language with Sharing." Theoretical Computer Science, 146(1-2):69-108, July 1995.
[6] J. Armstrong. "The Development of Erlang." In Proc. Intl. Conf. on Functional Programming (ICFP), Amsterdam, The Netherlands, pages 196-203, 1997.
[7] J. Armstrong, R. Virding, and M. Williams. Concurrent Programming in Erlang. Prentice Hall, 1993. ISBN 0-13-285792-8.
[8] K. Arnold and J. Gosling. The Java Programming Language, second edition. Addison-Wesley, 1998. ISBN 0-201-31006-6.
[9] Arvind, A. Caro, J.-W. Maessen, and S. Aditya. A Multithreaded Substrate and Compilation Model for the Implicitly Parallel Language pH. In Proc. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), August 1996.
[10] Arvind, K. P. Gostelow, and W. Plouffe. The (preliminary) Id Report. Technical Report 114, Dept. of Information and Computer Science, Univ. of California, Irvine CA, USA, 1978.
[11] Arvind and R. S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. on Computers, 39(3):300-318, March 1990.
[12] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Proc. 5th ACM Symp. on Principles and Practice of Parallel Programming (PPOPP), Santa Barbara CA, USA, pages 207-216, July 19-21, 1995.
[13] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proc. Supercomputing 93, Portland OR, USA, November 1993.
[14] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam. PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
[15] R. H. Halstead. Multilisp: A Language for Concurrent Symbolic Computation. ACM Trans. on Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
[16] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, Houston TX, USA, May 3, 1993.
[17] C. Hoare. Monitors: An Operating System Structuring Concept. Communications of the ACM, 17(10):549-557, October 1974.
[18] R. Hughes. Why Functional Programming Matters. The Computer Journal, 32(2):98-107, 1989.
[19] Threads standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
[20] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May 1994.
[21] R. S. Nikhil. "Cid: A Parallel 'Shared-memory' C for Distributed Memory Machines." In Proc. 7th Ann. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390, August 8-10, 1994.
[22] R. S. Nikhil and Arvind. Id: A Language with Implicit Parallelism. In A Comparative Study of Parallel Programming Languages: The Salishan Problems, John Feo (editor), pages 169-215. North Holland, 1992.
[23] R. S. Nikhil, K. Pingali, and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation Structures Group, MIT Lab. for Computer Science, Cambridge MA 02139, USA, July 1986.
[24] OpenMP Architecture Review Board. OpenMP Specifications, 1997 (Fortran), 1998 (C/C++).
[25] S. L. Peyton Jones. The Implementation of Functional Programming Languages. Prentice Hall, 1987.
[26] S. L. Peyton Jones (editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W. Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M. Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and P. Wadler. Haskell 98: A Non-strict, Purely Functional Language, February 1999.
[27] M. C. Rinard and P. C. Diniz. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers. ACM Trans. on Programming Languages and Systems (TOPLAS), 19(6):942-991, November 1997.
[28] M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: A High-Level, Machine-Independent Language for Parallel Programming. IEEE Computer, 26(6):28-38, June 1993.
[29] N. Wirth. Programming in Modula-2. Springer Verlag, Berlin, Germany, 1982.

Reader Response


Very interesting set of articles. There is no denying the power and elegance of functional languages. However, my main concern regarding their use in parallel programming is that they encourage only a coarse-grain, thread-based approach to parallelism at a time when the computer industry should be focusing its research muscle into developing fine-grain, auto-scalable multicore CPUs using an MIMD execution model. My second objection is that the very thing that FP proponents are railing against (state changes) will turn out to be their Achilles' heel with regard to safety-critical applications. The only way to implement a mechanism for the automatic discovery and resolution of data dependencies (a must for reliability) is to adopt a purely reactive software model that allows the use of state variables. This is especially important when maintaining unfamiliar legacy systems where data dependencies are not obvious. One man's opinion, of course.

-Louis Savain
Software Engineer
Miami, FL

