Making the transition from sequential to implicit parallel programming: Part 7
pH: An implicitly parallel, declarative language
By Rishiyur S. Nikhil (Bluespec) and Arvind (MIT)
Embedded.com
(09/24/07, 02:30:00 PM EDT)
Parallel Haskell (pH) is based on over two decades of research with several versions of the functional language Id [10][23] and experience with their implementation on dataflow and conventional architectures [11] [49] as well as alternatives such as Multithreaded Id using P-RISC Graphs and the MIT Tagged-Token Dataflow Architecture.

Some years ago, we recognized the semantic similarity between the functional core of Id and the standard nonstrict functional language Haskell. We, therefore, decided to adopt Haskell for the functional core, extending it as necessary with I-structure and M-structure constructs as shown in Figure 1.11 below.

The Haskell basis should make pH familiar to a much larger audience and should make exploitation of parallelism an easy step for programmers already familiar with Haskell.

Figure 1.11: Layers of pH.

The Functional Layer
A large subset of pH is its functional core, whose syntax and nonstrict semantics are exactly those of the pure functional language Haskell. The major programming style in pH is functional, and because there is no notion of state, it is trivial to write parallel programs without races in the functional core. In fact, it is impossible to write a program with races; functional programs are guaranteed to be deterministic.

A central feature of the functional core is the higher-order function. The key idea is that functions themselves can be treated as values. Higher-order functions increase the parametric power of a language and, consequently, allow us to write shorter, more abstract programs.

For example, a function to sort a list can take, as one of its arguments, the function used to decide the relative order of any pair of elements in the list. Then, the same sorting function can be used to sort a list of integers both in ascending and descending orders, simply by passing in the "<" or ">" functions, respectively.

Or, the same sorting function can be used to sort a list of strings in lexicographic order, by passing in the appropriate string comparison function. The experienced functional programmer uses higher order functions like oxygen - they are an essential part of one's toolkit.

We have already mentioned the notion of nonstrictness, another central feature of pH. Functions can be executed before arguments are available (unless restricted by data dependencies) and can even return results before the body has completed computing.

Similarly, any component of a data structure may be empty for an indefinite period, allowing producers and consumers to compute concurrently. We motivated this idea from the point of view of producer-consumer parallelism, but this feature also fundamentally adds expressive power to the language, allowing us to write certain recursive definitions of data structures that would otherwise be awkward to express.

Studies indicate that the practical advantages of nonstrictness (producer -consumer parallelism, definition of cyclic data structures) may accrue even with strict functions, as long as data structures are nonstrict. This could have significant performance advantages-not only can strict function calls be implemented more efficiently, but they also permit better compiler analysis of function bodies, resulting in better code therein. This is still an open area of research and is a possible future for pH.

pH has types and static type checking. Pervading the entire language is the notion of types and static type checking, although this is orthogonal to parallelism. Every phrase in pH (variable, expression, function, and so on) must have a well-defined data type, and this property is checked statically (i.e., by the compiler) before running the program.

Static types facilitate the construction of robust programs because many simple programming errors are caught at compile time, and because they give a convenient framework in which to organize the program.

Static types have sometimes earned a bad reputation for increasing the "bureaucracy" that the programmer has to deal with; however, there are three aspects of pH's type system that make it easy to use compared to, say, the static type system of Pascal or C.

The first is polymorphism: for example, we can write a single function that computes the length of a list, no matter what the list contains (integers, strings, ...), whereas in Pascal we would have to write several (nearly identical) functions for this purpose, one for integer lists, one for string lists, and so on.

The second convenient feature is type inference: the programmer is relieved of the tedium of having to predeclare the type of every variable used in the program. Instead, the compiler infers these types automatically and, in fact, it is perfectly safe for the programmer to omit all type declarations (although it is highly recommended that the programmer declare the types of at least the top-level functions).

The third convenience is overloading whereby, for example, we can write a single function to multiply two integer or floating-point matrices, whereas in Pascal we would have to write two (nearly identical) functions for this purpose. Overall, the static type system is invaluable in structuring the way we think about and design programs.

The I-structure Layer
As noted in Figure 1.11 previously, the second layer of pH adds "I-Structures" to the functional core. I-Structures are a limited but benign form of state-benign, because it is still impossible to write a program with a race condition and programs are still guaranteed to be determinate. Data structures in pH can contain I-structure components.

What this means is that when the data structure is allocated, the component is initially left empty. Later, a computation can assign a value into that component, making it full. Any other computation that tries to read that component automatically blocks until the value is available.

We have already mentioned this type of synchronization in the context of producer-consumer parallelism in nonstrict functional data structures. The difference here is this: in functional data structures, the allocation of the data structure and the values that fill its components are always specified syntactically together, as a single construct.

For this reason, in functional data structures we never even think of the action of filling in a component as an assignment-that is something inside the implementation. In I-structures, on the other hand, the computation that fills in an empty component may be in some syntactically unrelated part of the program and is thus always expressed explicitly as an assignment.

One common use of I-structures is to implement pure functions efficiently. For example, most functional languages provide a map_array primitive that transforms one array into another by applying a given function to each element and a foldl_array primitive that computes some overall property of an array, such as the sum of all elements, or the maximum element.

Now, suppose we want to both transform an array as well as compute an overall property. This map_foldl operation is a perfectly well-defined, pure function, and it is of course possible to define it by simply applying both the map and the fold primitives; however, this involves two array traversals. With I-structures, it is possible for the programmer to implement this function directly as a new primitive, using a single array traversal.

I-structures retain determinacy in pH precisely because of their "write-once" semantics: an I-structure component makes a single transition from empty to full with a value, and no computation can observe this transition because any attempt to read an empty component will simply block the computation until the component is full and then return the value.

In other words, there is only one possible value that can be obtained by reading an I-structure component, and so there is no chance of nondeterminism.

I-structures are closely related to logic variables in logic programming languages. In fact, I-structures may be used in the same way that logic variables are often used-to implement difference lists, mailboxes, and so on.

The M-structure Layer
The final level of pH adds "M-structures" to the functional and I-structure layers. M-structures allow reuse of storage and, thus, arbitrary manipulation of state.

However, the main difference from assignment statements in imperative languages is that access to M-structure data is implicitly synchronized. This has two advantages.

First, it permits more economical expression of shared state manipulation and, second, it can usually be implemented more efficiently. For example, consider many parallel computations that are all updating components of a shared array h. With explicit synchronization (e.g., with threads and locks), each update would be expressed as follows:

wait(locks[j]);
H[j] = f(H[j]);
signal(locks[j]);

In other words, we wait for exclusive access to the j'th component, using a companion array of locks. After performing the update, we release the lock to allow other similar parallel computations to proceed. In pH, we express it as follows:

h!j := f (h !& j)

where the h!j notation means "wait until Full, read, and leave Empty," and the assigment means "write, and make Full."

In addition to simplifying the notation (we don't have to declare an array of locks and explicitly perform waits and signals), the implementation can also be more efficient because of the combined synchronization and data access. For example, in the conventional imperative language code, each of the operations-wait, update h, and signal-is likely to be a separate transaction with memory.

In the M-structure version, the wait and read h can be combined into a single transaction, as can the write h and signal, resulting in two transactions, the minimum possible.

In the functional and I-structure layers, there is no sequencing of operations other than that which is naturally implied by data dependencies. With M-structures, however, sequencing may be necessary to control nondeterminism.

For example, if we want to ensure that a bank deposit computation completes before we perform a withdrawal computation on a shared M-structure, data dependencies are no longer adequate; for this, pH provides ways to explicitly sequence computations.

As discussed earlier in Part 6, several languages such as ML and Scheme also have purely functional cores augmented with operations to manipulate state. However, in those languages, state manipulation comes at the expense of parallelism; unwanted race conditions are made into a nonissue by making the entire language strict and totally sequential.

Of course, many researchers have reintroduced limited, user-specified parallelism by extending these languages with traditional parallel constructs such as threads, locks and message-passing.

Also, as discussed earlier, a recent development in nonstrict functional languages such as Haskell is a technique called monads, which is a way to encapsulate state manipulation. While this is done without making the entire language sequential, the state manipulation operations are still completely sequenced, so that one cannot have a parallel, state-manipulating program.

In pH, on the other hand, despite the addition of M-structures, the overall programming style remains implicitly parallel unless otherwise specified. Unlike other parallel languages which are implicitly sequential unless otherwise specified, the difference is qualitative, but noticeable - implicit parallelism makes it significantly easier to write massively parallel programs.

The parallelism in pH is pervasive. Not only does it occur at a very fine grain (e.g., reading the head and tail fields of a list record proceed in parallel), but also at a larger grain (e.g., two functions or loop iterations can execute in parallel). Both of these are important for parallel computer architectures. The large-grain parallelism allows easy distribution of work across the multiple processors of a parallel machine.

The fine-grain parallelism allows each processor to exploit instruction level parallelism (i.e., the ability to execute instructions of a thread in parallel) and multithreading (i.e., the ability to execute another fine grain thread while one of its fine grain threads has to wait, perhaps to access a datum on a different processor).

We know of no other programming language that exposes as much parallelism as effortlessly as pH.

The effect of architecture on programming
While this series is mainly about parallel programming, it is important to recognize that computer architecture can have a major effect on our programming languages and on our programming methodology.

For example, early computers (and even some as recent as the Intel 8086), had small physical address spaces and no architectural support for virtual memory. On these machines, it was still possible to write programs that manipulated huge data structures that would not fit into memory at once.

In effect, the programmer could mimic a virtual memory system by manually moving pieces of the array between disk and main memory. This kind of programming was the domain of experts who understood the architecture thoroughly.

Eventually, tools came into existence to automate some of this effort for large-memory programming, using a method called "overlays". However, these tools were never even close to being able automatically to translate an arbitrary program that assumed large memory into code that could work in limited memory.

In other words, the architectural limitation simply could not be hidden from programmers. Ultimately, it required architectural support for virtual memory before the situation changed. Finally, everyday programmers could write programs that were not limited by the processor's physical address space.

A similar situation exists with parallel processing on large systems, servers, desktops, network switches and many modern mobile and portable consumer devices. A universal requirement of parallel architectures "large, small and smaller - is that, in order to be physically scalable, they must have distributed memories -the machine's memory must be divided into several independent modules, and there must be many independent paths from processors to memory modules.

The first generation of commercially viable large parallel computers achieved this by simply interconnecting large numbers of conventional sequential computers; each node of the machine consists of a conventional processor with memory and a means to communicate with other nodes.

These multicomputers can be programmed with message-passing, which exactly mirrors the architectural organization. But we have seen how difficult it is to program using the message-passing model; so, while extremely useful and effective programs have been (and continue to be) written this way, it remains the domain of experts.

Just as yesterday's large memory programming was only feasible for those who understood thoroughly how to manage movement of data, between memory and disk, today's parallel programming with message-passing is only feasible for those who understand thoroughly how to manage movement of data between processing nodes of a multicomputer.

HPF (High Performance Fortran) can be seen as an attempt to automate some of this. But, just as with yesterday's tools for overlays in small-memory machines, today's HPF programmer has to be sophisticated about data distributions, and in any case these tools can't handle arbitrary, general-purpose programs.

It is our belief that shared-memory parallel programming today is the analog of virtual-memory programs of yesterday-it is the only model feasible for general-purpose programs written by ordinary programmers (and hence for widespread use).

Further, it will not become viable without architectural support. Even though the virtual memory illusion is mostly implemented in operating system software, it required some architectural support before it became efficient enough that the average programmer no longer had to think about it (translation lookaside buffers, page faults).

Similarly, even though the shared-memory illusion may be implemented mostly in run-time system software, some architectural support is needed to make it efficient enough that the average parallel programmer no longer needs to think about it.

(This is not to imply that that message-passing must be completely hidden from the programmer carefully hand-crafted message-passing algorithms are still likely to be advantageous; we simply mean that, by and large, the programmer would prefer a shared-memory programming model).

To date, there is no consensus on exactly what architectural support is necessary to support the shared-memory illusion on a distributed memory parallel machine (even though there are some commercial products that choose particular approaches).

The question is further complicated because the required architectural support depends on what parallel programming models one wishes to support, and there is no consensus on that, either (the situation was easier with virtual memory-at least the user-level programming model was not an issue). Architectural support for distributed shared memory is a hot topic of research in many academic, industrial and government groups all over the world.

Parallel programming for non-parallel programmers
We believe that in the long run, for general-purpose parallel programming (i.e., arbitrary programs written by ordinary programmers), the language must make it effortless to express huge amounts of parallelism. The reasons are twofold:

1) No matter what architectural support there may be for shared-memory, movement of data across the machine will always be slow, compared to instruction execution times and local memory access times. Abundant parallelism permits the processor to perform useful work on one thread while another may be waiting for data to be moved.

This is analogous to the situation in operating systems today: when one process encounters a page fault, the processor switches to another process while the first one waits for the page to arrive from disk.

2) In arbitrary parallel programs, synchronization waits (waiting to acquire a lock) are as unpredictable as nonlocal memory accesses; again, we need an adequate number of threads so that the processor can continue useful work while some are waiting for synchronization events.

Since manual synchronization in the face of massive parallelism is too tedious to be feasible (except for programs with extremely simple, rigid; structures), this, too, is something that the programmer should not have to manage explicitly.

The bottom line is that for general-purpose, scalable, parallel computing, architectures need to support fine-grain threads with fine-grain synchronization, and programming languages and implementations need to exploit this capability.

Although these architectural features will benefit all parallel programming languages, existing languages do not make it easy to express or to exploit fine-grain parallelism. Declarative programming languages such as pH appear to be most promising approach towards this goal.

Next in Part 8: Turning pH into a production programming language
To read Part 1, go to How sequential languages obscure parallelism.
To read Part 2, go to How to achieve parallel execution?
To read Part 3, go to Explicit Parallelism: Shared-Memory Programming with Threads and Locks
To read Part 4, go to Explicit Parallelism: Message-Passing programming
To read Part 5, go to Implicit parallelism: Declarative programming languages
To read Part 6, go to "So, why aren't we using functional languages yet?"

Rishiyur S. Nikhil and Arvind are the coauthors of Implicit Parallel Programming in pH, published by Elsevier/Morgan Kaufmann.

Rishiyur is Chief Technical Officer at ESL specialist Bluespec Inc. Previous to its acquisition by Broadcom he led the Bluespec technology team at Sandburst Corp. For the previous 9 years he was at Cambridge Research Laboratory (DEC/Compaq), including over a year as Acting Director. Earlier, he was an Associate Professor of Computer Science and Engineering at MIT. He holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing and compiling.

Arvind is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology. His work laid the foundations for Bluespec, centering on high-level specification and description of architectures and protocols using Term Rewriting Systems (TRSs), encompassing hardware synthesis as well as verification of implementations against TRS specifications. Previously, he contributed to the development of dynamic dataflow architectures, the implicitly parallel programming languages Id and pH, and compilation of these languages on parallel machines. He was also president and founder of Sandburst Corp.

To read more on this topic on Embedded.com go to More about multicores, multithreading and multiprocessing.

References:
[1] "Microsoft: Parallel Programming Model Ten Years off," by Rick Merritt, Embedded.com, July 23, 2007
[2] "The Problem with Threads" by Edward A. Lee, IEEE Computer, 39:5, May 2006, pp 33-42
[3] "Why threads are a bad idea (for most purposes)," J.K. Ousterhout, Invited Talk, USENIX Technical Conference, January, 1996
[4] S. Ahuja, N. Carriero, and D. Gelernter. "Linda and Friends," IEEE Computer, 19(8):26-34, August 1986.
[5] Z. Ariola and Arvind. "Properties of a First-order Functional Language with Sharing." Theoretical Computer Science, 146(1-2):69-108, July 1995.
[6] J. Armstrong. "The Development of Erlang." In Proc. Intl. Conf. on Functional Programming (ICFP), Amsterdam, The Netherlands, pages 196-203, 1997.
[7] J. Armstrong, R. Virding, and M. Williams. "Concurrent Programming in Erlang." Prentice Hall, 1993. ISBN: 0-13-285792-8.
[8] K. Arnold and J. Gosling. The Java Programming Language, second edition. Addison-Wesley, 1998. ISBN 0-201-31006-6.
[9] Arvind, A. Caro, J.-W. Maessen, and S. Aditya. A Multithreaded Substrate and Compilation Model for the Implicitly Parallel Language pH. In Proc. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), August 1996.
[10] Arvind, K. P. Gostelow, and W. Plouffe. The (preliminary) Id Report. Technical Report 114, Dept. of Information and Computer Science, Univ. of California, Irvine CA, USA, 1978.
[11] Arvind and R. S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. on Computers, 39(3):300-318, March 1990.
[12] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Proc. 5th. ACM Symp. on Principles and Practice of Parallel Programming (PPOPP), Santa Barbara CA, USA, pages 207-216, July 19-21 1995.
[13] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proc. Supercomputing 93, Portland OR, USA, November 1993.
[14] A. Geist, A. Begeulin, J. Dongarra, W. Jiang, R. Manchek, and V. S. Sundaram. PVM Parallel Virtual Machine. A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
[15] R. H. Halstead. Multilisp: A Language for Concurrent Symbolic Computation. ACM Trans. on Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
[16] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, Houston TX, USA, May 3 1993.
[17] C. Hoare. Monitors: an Operating System Structuring Concept. Communications of the ACM, 17(10):549-557, October 1974.
[18] R. Hughes. Why Functional Programming Matters. The Computer Journal, 32(2):98-107, 1989.
[19] Threads standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
[20] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May 1994.
[21] R. S. Nikhil. "Cid: A Parallel 'Shared-memory' C for Distributed Memory Machines."  In Proc. 7th. Ann. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390, August 8-10 1994.
[22] R. S. Nikhil and Arvind. Id: a Language with Implicit Parallelism. In A Comparative Study of Parallel Programming Languages: The Salishan Problems, John Feo (editor), pages 169-215. North Holland, 1992.
[23] R. S. Nikhil, K. Pingali, and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation Structures Group, MIT Lab. for Computer Science, Cambridge MA 02139, USA, July 1986.
[24] OpenMP Architecture Review Board. OpenMP Specifications, 1997 (Fortran), 1998 (C/C++)
[25] S. L. Peyton Jones. The Implementation of Functional Programming Languages. Prentice Hall, 1987.
[26] S. L. Peyton Jones (editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W. Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M. Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and P. Wadler. Haskell 98: A Non-strict, Purely Functional Language, February 1999.
[27] M. C. Rinard and P. C. Diniz. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers. ACM Trans. on Programming Languages and Systems (TOPLAS), 19(6):942-991, November 1997.
[28] M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: A High-Level, MachineIndependent Language for Parallel Programming. IEEE Computer, 26(6):28-38, June 1993.
[29] N. Wirth. Programming in Modula-2. Springer Verlag, Berlin, Germany, 1982.

Reader Response


Very interesting set of articles. There is no denying the power and elegance of functional languages. However, my main concern regarding their use in parallel programming is that they encourage only a coarse-grain, thread-based approach to parallelism at a time when the computer industry should be focusing its research muscle into developing fine-grain, auto-scalable multicore CPUs using an MIMD execution model. My second objection is that the very thing that FP proponents are railing against (state changes) will turn out to be their Achilles' heel with regard to safety-critical applications. The only way to implement a mechanism for the automatic discovery and resolution of data dependencies (a must for reliability) is to adopt a purely reactive software model that allows the use of state variables. This is especially important when maintaining unfamiliar legacy systems where data dependencies are not obvious. One man's opinion, of course.

-Louis Savain
Software Engineer
Miami, FL