There are many, many approaches to parallel programming, with a richterminology that can often be quite confusing. One often encountersthree terms to describe parallelism: parallel, concurrent, anddistributed.
Parallel Haskell (pH) sitsin the traditional space of “parallel” and “concurrent” languages. Theoriginal motivation for, and the principal focus of pH research, hascertainly been parallelism for speed, just like traditional parallelprogramming languages.
However, theimplicit parallelism in pH also subsumes theconcurrency in many concurrent programming languages, and pH may thusbe quite suitable for writing concurrent, nondeterministic, reactiveprograms as well.
pH does not address issues of distribution. The closest that any pHresearch has come to addressing distribution is in implementations ofpH for multicomputers and networks of workstations, but thisis still a relatively unexplored area.
In applications with a client-server architecture, pH could ofcourse be used to code either or both sides of the interaction, but itis no different from other languages in this regard. (
There are two aspects of pH that remain fairly unique. First, pH hasno explicit “processes” or “threads” to introduce parallelism; inprinciple, everything executes in parallel except as modulated by dataand control dependencies (conditionals). Since pH's model clearlysubsumes a model with explicit threads, what are the advantages ofexplicit threads? The issues revolve around efficiency ofimplementation.
Second, pH's fine-grain, implicit synchronization on data access, asseen in I-structures and M-structures, described in
The concept of monitors, introduced by Hoare in 1974 , andreproduced in several more recent languages including Java(synchronized classes), is an example of implicit data synchronization- synchronization is associated with a piece of data and happensautomatically on procedure (method) entry and exit.
However, the programmer still chooses which data should be protectedin this manner, and the granularity is often very large. In pH, on theother hand the fine-grain implicit synchronization of I-structures andM-structures enables the effortless exploitation of producer-consumerparallelism in all its forms.
The only class of languages that are similar to pH in its implicitparallelism is the class of parallel logic programming languages (see “
These languages are relational in nature, unlike pH's more familiarfunctional basis. Many of these languages also have parallelism withoutexplicit processes and threads, and they also have implicit,fine-grained data synchronization with logic variables (similar toI-structures). Like pH, these languages are also still researchlanguages.
Finally, we reemphasize that there axe many issues in programminglanguages that are orthogonal to implicit parallelism, which pH doesnot address. We have already mentioned issues connected withdistributed systems-sites, authentication, security, and so on. Theadvent of the World Wide Web has surfaced new research issues such as”mobile” programs (“bots”) and security and information flow.A call to action
pH is based on over a decade of research in language design forimplicit parallelism. During this period, the “look and feel” of thelanguage has evolved significantly as we have increased ourunderstanding of the expressivity issues surrounding topics such asfunctional arrays, I-structures, M-structures, and sequencing.
The most recent visual change was the adoption of Haskell notationfor the functional core, which forms the major part of the language.This change was made both to broaden the appeal of the language tothose already familiar with the standard functional language Haskelland to enable leveraging of existing Haskell compiler infrastructures.
|Figure1.11: Layers of pH.|
The Haskell programmer should find it very easy to program in pH.There are a few traditional Haskell idioms that cannot be used in pH,but in practice we have found these differences to be very minor.
(An example is the practice in Haskell to generate an “infinite”data structure and examine only a finite part of it. This works in alazy evaluator for a pure functional language but results in aninfinite computation in pH. )
Throughout this period, we have had several research implementationsof pH and Id, its predecessor. The current pH implementation is aboutthe third or fourth complete rewrite of the compiler. Earlierimplementations were focused on special hardware architectures known asdataflow machines , whereas the more recent implementations havefocused on stock hardware-first uniprocessors and now SMPs.
Nevertheless, there is always a substantial gap between a researchlanguage implementation and a production implementation. There areseveral bridges to cross before pH can become a widely used productionlanguage:
1) Efficientimplementations. This is mostly a technical question. Most ofthe issues concerning efficient implementation of pH are shared withmany other high-level programming languages, including Java, andwell-known techniques can be applied. A few issues are fairly unique topH and a few other languages, mostly arising out of the nonstrictnessof pH.
2)Interoperability with other languages and systems. This ispartly a technical issue, but it is also a social process because itinvolves standards, interaction models, and the involvement of manyconstituencies of programmers.
3) Evangelism and marketforces. The history of programming languages is replete withexamples of languages that did not succeed on technical merits alone.Some recognition that there needs to be some evangelism done to pushparallel programming in new directions is already emerging at companiessuch as Microsoft
Efficient Implementation of pH
Ultimately, all applications programmers are concerned with efficiency.This has two major dimensions. First, there is programmerproductivity-the speed at which we humans can create, debug, maintainand evolve our programs. And, second, there is the space and time costof program execution-how fast it runs and how much memory it takes.
The history of programming languages ” and the trend towards higherlevel languages – has precisely been the effort to improve programmerproductivity without major sacrifices in program execution efficiency.
As computer systems have become faster and compilers have improved,program efficiency has decreased as a first-order concern. Today, manyapplications are written in “scripting” languages like
These programs perhaps would run much faster if written in
pH continues this trend of aiming towards higher and higher levelsof programming. Nevertheless, there will always be applications whereprogram execution efficiency is paramount, where the programmer wouldlike to extract the absolute maximum performance that a machine iscapable of delivering.
At this “bleeding edge” of performance, very precise control ofresources is necessary. C and Fortran, and even assembler, aretypically the languages of choice in this regime because theircomputational model is quite close to the raw machine model, and hencethe programmer is able to control resources quite precisely.
(But even in this milieu, the recent architectural complexitiesof out-of-order execution, cache hierarchies, and non-uniform memoryaccess make the resource/performance model quite complex even for theexpert programmer ).
The higher the level of the language, the more difficult it is forthe programmer to have a mental model of resource usage, and the moredifficult it is to compile it to efficient code. This trend began inthe 1960s with
Haskell makes resource management even more complex, and pH, withits implicit parallelism, makes it even more so (
1) Automatic storagemanagement
2) Efficient implementation ofpolymorphism and overloading
3) Implicit Parallelism (noexplicit threads)
4) Nonstrictness and fine-grainimplicit data synchronization
The first issue is shared with many other programming languagesincluding Lisp, Scheme,Haskell, ML,and Java. The latter two issues are shared with a few other programminglanguages such as Haskell and concurrent constraint languages.
AutomaticStorage Management : In modern high-level languages data istypically allocated in an area of memory called the heap. As theprogram executes, more and more of the heap is used up. Eventually,when the heap is full, the run-time system engages in an activitycalled garbagecollection (GC).
GC identifies which data are still accessible to the program andwhich data are not and then recycles the memory occupied byinaccessible data to be available once again for allocation as theprogram resumes execution.
The run-time system typically needs to maintain some extrabookkeeping information with data so that it can identify objects inthe heap and determine whether they are accessible by the program ornot. This bookkeeping activity, known as “tagging”, together with thereclamation process itself, constitutes an overhead for such languages.
A second way in which automatic storage management can slow theprogram down is in its interaction with the memory hierarchy. In moderncomputers, because of multiple levels of caches and virtual memory,achieving maximum speed of execution is crucially dependent on”locality of reference” (i.e., the program must try to maintain assmall an active memory “footprint” as possible ).
Allocating data on a heap is usually antithetical to this objective,and the GC activity itself explores large memory areas without anylocality structure.
The problem of efficient implementation of automatic storagemanagement is not unique to pH; it also exists in Java, scriptinglanguages, Lisp, ML, Haskell, and other functional and logicprogramming languages. There is a vast body of literature describingsolutions to these problems.
Dynamic techniques include: clever representations of heap objectsthat reduce or eliminate tagging overhead; techniques to pack heapobjects to reduce their memory footprint; “generational” garbagecollectors that focus on short-lifetime data, thereby reducing thenumber of times that all of memory needs to be scanned for potentialgarbage, and so on.
Static techniques include escape analysis, which attempts toidentify data whose lifetime coincides with the call and return of someprocedure. That data can then be allocated within the procedure's stackframe instead of on the heap and thus the runtime system pays no extraoverhead for allocation and de-allocation.
Lifetime analysis is more general than escape analysis in attemptingto identify exactly when data is allocated and consumed-it can resultin optimizations that pass arguments and results in registers insteadof the heap; that avoid building tuples for multiple arguments andmultiple results; and that aggregate heap allocation of multipleobjects into a single allocation.
Linear Types allow the programmer to ensure that a particular datastructure is passed from operation to operation in a single-threadedfashion (i.e., that it is never shared). This allows the compiler toimplement functional updates to the data structure, which conceptuallycopies the data structure, with imperative in-place updates that do nocopying.
Polymorphismand Overloading: The efficiency issues due to polymorphism andoverloading are similar. Polymorphism often forces a uniformrepresentation of data. A polymorphic procedure on arrays should workon arrays of booleans as well as on arrays of floats, and thisnaturally leads to representations where a float and a boolean occupythe same storage space, resulting in wastage in the latter case.
For an overloaded procedure that works on arrays of booleans and onarrays of floats, the overloading is typically accomplished by passingin a table of methods-a table of float methods if the array containsfloats and a table of boolean methods if the array contains booleans.
The overloaded procedure then invokes the appropriate method fromthe table. Unfortunately, this can add the overhead of a procedure callfor something as simple as an addition operation, which would normallybe a single machine instruction in the nonoverloaded case.
Type analysis is a compiler technique that attempts to detect moreprecisely the types of certain program fragments to eliminatepolymorphism and overloading. In the best case, identification ofprecise types can eliminate uniform representations in favor of naturalmachine representations (this is often called “unboxing” ).
Finally, we should note that many of the optimizations describedabove make it harder to implement a good debugger for the language. Forexample, consider a debugger in which we wish to examine the datapassed into a polymorphic list-sorting function.
In order to display the objects in the input list, the debugger willneed a way to reconstruct the type of those objects. However, since thefunction is polymorphic, the debugger needs to examine the caller ofthe function, and perhaps its caller, and so on, in order to find outthe actual type of the input list.
This task is made more difficult if optimizations have stripped awaytags and other type-identifying information. This problem has beenwell-studied and solutions are described in the literature.
ImplementingImplicit Parallelism: Earlier in
The fundamental problem with explicit forks is that it can be verydifficult for the programmer to choose the placement of fork calls.Standard implementations of fork have a high overhead (
Further, the cost of multiplexing the computer's processors amongstthreads is typically very high. Thus, it is necessary to find just theright balance-too many threads will result in too much forking andthread-switching overhead, whereas too few threads will leaveprocessors idle. Having too few threads can also introduce deadlock.
For example, suppose in a call
This problem, of choosing where to fork, is faced by programmers ofexplicitly parallel languages and by compilers of implicitly parallellanguages like pH. In principle, a pH program can be compiled into asequential programming language with explicit forks liberally sprinkledat every point where there is potential parallelism according to pHsemantics.
Unfortunately, the right number of threads is generally notsyntactically obvious-it depends on the program structure and therun-time synchronization and data dependencies. Further, it can dependon program inputs; that is, there may be no single “correct”identification of fork points in the program.
A first and simple step towards solving this problem is to treatforks as hints rather than imperatives. When encountering a fork call,the run-time system actually forks a thread only if there are idleprocessors available; otherwise, it simply calls the specified functionusing a normal function call.
Thus, the programmer or the compiler can freely introduce forks thatmay potentially specify hundreds or thousands o£ threads,confident that the runtime system will in fact create just enoughthreads to keep the processors of the computer busy.
The language Cid is an example of a parallel language that exploredthis technique . Unfortunately, this solution is not a completesolution. Most importantly, it does not solve the deadlock problemdescribed above, because once a fork is treated as a procedure call,there is no going back.
Another problem, less serious, is that the decision whether or notto fork is taken at the time when the fork is encountered, based on thenumber of idle processors at that time. If the situation changes later,there is no way to revisit the decision (e.g., the system decides notto fork because all processors are busy, but processors become freeshortly after the decision is taken).
A more sophisticated approach is called Lazy Task Creation (or LTC).It was originally pioneered in the language Multilisp  with anefficient implementation by Marc Feeley  but has been furtherdeveloped in Lazy Threads, StackThreadsand in the language Cilk from MIT .
Under LTC a fork statement is always treated as an ordinary functioncall, but the runtime system performs just a little additionalbook-keeping, leaving a small record with enough information to revokethis decision if necessary.
If a processor becomes idle, it is able to examine the runtime datastructures and, if it finds such a record, it can retroactively convertthe procedure call into a fork point, and execute the pending work.
The processor that created the record may not be the same as theprocessor that exploits such a record; in this case, the secondprocessor is said to “steal” the parent from the first.
The key to making this work well is to minimize the overhead of theinitial book-keeping, even if it makes stealing expensive. Therationale is that, while there may be thousands or millions ofpotential fork-points encountered during execution (it depends on theprogram structure and the data), the actual number of thread creationevents (stealing) will be much lower, sometimes as low as the number ofprocessors, which is typically a much smaller number.ImplementingNonstrictness and Fine-Grain Data Synchronization: The finalcomplication in implementing pH efficiently is in the cost ofnonstrictness and implicit data synchronization (I-structures andM-structures).
In these features, a thread of the program attempts to read a value,but it may have to wait because the value may not be available yet(“empty” in the case of I- and M-structures). That thread can spin andwait, or it can suspend itself, which means enqueuing some staterelated to the thread on the empty slot.
Later, when some other thread writes data into the empty slot, itmust re-enable one or more of the threads waiting on that slot. This isa central activity in any pH implementation and is called threadswitching or context switching.
Because of implicit parallelism, nonstrictness, and fine-grain datasynchronization in pH, this activity is extremely frequent and isnecessary even in a uniprocessor implementation.
Consequently, it has to be extremely efficient, barely moreexpensive than a procedure call. Contrast this with conventional threadswitching in POSIX threads (pthreads), or modern operating systems,where the cost is typically two or more orders of magnitude higher thanthe cost of a procedure call.
Implementations of pH must thus pay great attention to optimizingthis operation, and the code sequences to accomplish this must usuallybe carefully hand-crafted. These implementation issues are consideredin great detail in ,  and by others.
A final point to note is that this particular efficiency issue isshared by all implementations of Haskell and other lazy functionallanguages, even though they may be sequential implementations. Therequirement arises from non-strictness and fine-grain datasynchronization, which are orthogonal to parallelism.
Interoperability and ApplicationDomains
This series has focused on the fundamentals of implicit parallelismand, as such, the programming examples illustrate classical algorithmicproblems.
Today, the world of programming is much more diverse, encompassingvaried domains such as Web-based client-server programming, userinterfaces, graphics, distributed computing, and so on. It is almostimpossible for a single programming language to encompass such a richdiversity of application domains; instead, over the last few years wehave seen the development and growth of several mechanisms forinteroperability.
For example a Web browser may be written in one language, but it mayinteract with a “plug-in” that is written in another language. Or, aspreadsheet may be written in one language while it accesses data froma database management system (possibly on a remote server) that iswritten in another language.
Ultimately, the success of a programming language depends not onlyon its core technical advantages, but also on the facilities itprovides for interoperability with the vast existing base of software(screen, keyboard and mouse control;file system access; databaseaccess; network access; and so on ).
Many interoperability mechanisms are quite ad hoc (
Two such standards are
For each component implementation language of interest, an “IDLCompiler” compiles this specification into stubs for each side of theinteraction. These stubs can then be linked in with the actual code ofthe participant in the interaction.
Researchers in the Haskell community have done much research intointerfacing Haskell to COM and CORBA  . These Haskell solutionsshould carry over directly to pH. In fact, it is an open questionwhether pH's implicit parallelism will improve the structure ofinteroperating programs because many of them are reactive in nature,which fits naturally into pH's concurrent, data-driven behavior.
The Final Word
pH represents a radically different way of thinking about parallelprogramming. We believe that it is increasingly in tune withdevelopments in computer hardware, where symmetric multiprocessorscomprising multicore processors are becoming the norm due to thecontinually decreasing prices of semiconductor components and thebarriers to increasing uniprocessor speed.
This vision is not a foolish dream. Despite the pessimism insome other quarters
To read Part 1, go to
To read Part 2, go to
To read Part 3, go to
To read Part 4, go to
To read Part 5, go to
To read Part 6, go to “
To read Part 7, go to “
RishiyurS. Nikhil and Arvind are the coauthors of
Rishiyur is Chief TechnicalOfficer at ESL specialist
Arvind is the Johnson Professor ofComputer Science and Engineering at the Massachusetts Institute ofTechnology. His work laid the foundations for Bluespec, centering onhigh-level specification and description of architectures and protocolsusing
Toread more on this topic on Embedded.com go to
 S. Ahuja, N. Carriero, andD. Gelernter. “Lindaand Friends,” IEEE Computer, 19(8):26-34, August 1986.
 Z. Ariola and Arvind. “
 J. Armstrong. “
 J. Armstrong, R. Virding,and M. Williams. “ConcurrentProgramming in Erlang.” Prentice Hall, 1993. ISBN: 0-13-285792-8.
 K. Arnold and J. Gosling.
 Arvind, A. Caro, J.-W.Maessen, and S. Aditya. AMultithreaded Substrate and Compilation Model for the ImplicitlyParallel Language pH. In Proc. Wkshp. on Languages and Compilersfor Parallel Computing (LCPC), August 1996.
 Arvind, K. P. Gostelow,and W. Plouffe. The(preliminary) Id Report. Technical Report 114, Dept. of Informationand Computer Science, Univ. of California, Irvine CA, USA, 1978.
 Arvind and R. S. Nikhil.
 R. D. Blumofe, C. F.Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou.
 D. E. Culler, A. Dusseau,S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K.Yelick. ParallelProgramming in Split-C. In Proc. Supercomputing 93, Portland OR,USA, November 1993.
 A. Geist, A. Begeulin, J.Dongarra, W. Jiang, R. Manchek, and V. S. Sundaram.
 R. H. Halstead. Multilisp:ALanguage for Concurrent Symbolic Computation. ACM Trans. onProgramming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
 C. Hoare.
 R. Hughes.
 Message Passing InterfaceForum. MPI:A Message-Passing Interface Standard, May 1994.
 R. S. Nikhil. “
 R. S. Nikhil and Arvind.
 R. S. Nikhil, K. Pingali,and Arvind. Id Nouveau. Technical Report CSG Memo 265, ComputationStructures Group, MIT Lab. for Computer Science, Cambridge MA 02139,USA, July 1986.
 OpenMP Architecture ReviewBoard. OpenMPSpecifications, 1997 (Fortran), 1998 (C/C++)
 S. L. Peyton Jones.
 S. L. Peyton Jones(editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W.Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M.Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, andP. Wadler. Haskell98: A Non-strict, Purely Functional Language, February 1999.
 M. C. Rinard and P. C.Diniz. CommutativityAnalysis: A New Analysis Framework for Parallelizing Compilers. ACMTrans. on Programming Languages and Systems (TOPLAS), 19(6):942-991,November 1997.
 M. C. Rinard, D. J.Scales, and M. S. Lam. Jade: AHigh-Level, MachineIndependent Language for Parallel Programming.IEEE Computer, 26(6):28-38, June 1993.