Parallel Haskell (pH) sits in the traditional space of "parallel" and "concurrent" languages. The original motivation for, and the principal focus of pH research, has certainly been parallelism for speed, just like traditional parallel programming languages.
However, the implicit parallelism in pH also subsumes the concurrency in many concurrent programming languages, and pH may thus be quite suitable for writing concurrent, nondeterministic, reactive programs as well.
pH does not address issues of distribution. The closest that any pH research has come to addressing distribution is in implementations of pH for multicomputers and networks of workstations, but this is still a relatively unexplored area.
In applications with a client-server architecture, pH could of course be used to code either or both sides of the interaction, but it is no different from other languages in this regard. (See "Calling Hell from Heaven and Heaven from Hell" for examples of how this has been done with Haskell.)
There are two aspects of pH that remain fairly unique. First, pH has no explicit "processes" or "threads" to introduce parallelism; in principle, everything executes in parallel except as modulated by data and control dependencies (conditionals). Since pH's model clearly subsumes a model with explicit threads, what are the advantages of explicit threads? The issues revolve around efficiency of implementation.
Second, pH's fine-grain, implicit synchronization on data access, as seen in I-structures and M-structures, described in Part 7 of this series, is quite unusual.
The concept of monitors, introduced by Hoare in 1974 [17], and reproduced in several more recent languages including Java (synchronized classes), is an example of implicit data synchronization - synchronization is associated with a piece of data and happens automatically on procedure (method) entry and exit.
However, the programmer still chooses which data should be protected in this manner, and the granularity is often very large. In pH, on the other hand the fine-grain implicit synchronization of I-structures and M-structures enables the effortless exploitation of producer-consumer parallelism in all its forms.
The only class of languages that are similar to pH in its implicit parallelism is the class of parallel logic programming languages (see "Design of the Kernel Language for the Parallel Inference Machine") and their more recent descendants, concurrent constraint languages (see "Object Oriented Concurrent Constraint Programming in Oz").
These languages are relational in nature, unlike pH's more familiar functional basis. Many of these languages also have parallelism without explicit processes and threads, and they also have implicit, fine-grained data synchronization with logic variables (similar to I-structures). Like pH, these languages are also still research languages.
Finally, we reemphasize that there axe many issues in programming languages that are orthogonal to implicit parallelism, which pH does not address. We have already mentioned issues connected with distributed systems-sites, authentication, security, and so on. The advent of the World Wide Web has surfaced new research issues such as "mobile" programs ("bots") and security and information flow.
The most recent visual change was the adoption of Haskell notation for the functional core, which forms the major part of the language. This change was made both to broaden the appeal of the language to those already familiar with the standard functional language Haskell and to enable leveraging of existing Haskell compiler infrastructures.
![]() |
| Figure 1.11: Layers of pH. |
The Haskell programmer should find it very easy to program in pH. There are a few traditional Haskell idioms that cannot be used in pH, but in practice we have found these differences to be very minor.
(An example is the practice in Haskell to generate an "infinite" data structure and examine only a finite part of it. This works in a lazy evaluator for a pure functional language but results in an infinite computation in pH.)
Throughout this period, we have had several research implementations of pH and Id, its predecessor. The current pH implementation is about the third or fourth complete rewrite of the compiler. Earlier implementations were focused on special hardware architectures known as dataflow machines [11], whereas the more recent implementations have focused on stock hardware-first uniprocessors and now SMPs.
Nevertheless, there is always a substantial gap between a research language implementation and a production implementation. There are several bridges to cross before pH can become a widely used production language:
1) Efficient implementations. This is mostly a technical question. Most of the issues concerning efficient implementation of pH are shared with many other high-level programming languages, including Java, and well-known techniques can be applied. A few issues are fairly unique to pH and a few other languages, mostly arising out of the nonstrictness of pH.
2) Interoperability with other languages and systems. This is partly a technical issue, but it is also a social process because it involves standards, interaction models, and the involvement of many constituencies of programmers.
3) Evangelism and market
forces. The history of programming languages is replete with
examples of languages that did not succeed on technical merits alone.
Some recognition that there needs to be some evangelism done to push
parallel programming in new directions is already emerging at companies
such as Microsoft
The history of programming languages " and the trend towards higher level languages - has precisely been the effort to improve programmer productivity without major sacrifices in program execution efficiency.
As computer systems have become faster and compilers have improved, program efficiency has decreased as a first-order concern. Today, many applications are written in "scripting" languages like Perl or Tcl/Tk and in high-level languages like Java.
These programs perhaps would run much faster if written in C or Fortran, and probably even faster if written in assembler (with huge effort), but the premium on programmer productivity is gradually beginning to dominate the equation. In addition, scripting languages and high-level languages make programming accessible to a much larger community. For example, there are many Web site designers who know scripting languages but no other traditional language.
pH continues this trend of aiming towards higher and higher levels of programming. Nevertheless, there will always be applications where program execution efficiency is paramount, where the programmer would like to extract the absolute maximum performance that a machine is capable of delivering.
At this "bleeding edge" of performance, very precise control of resources is necessary. C and Fortran, and even assembler, are typically the languages of choice in this regime because their computational model is quite close to the raw machine model, and hence the programmer is able to control resources quite precisely.
(But even in this milieu, the recent architectural complexities of out-of-order execution, cache hierarchies, and non-uniform memory access make the resource/performance model quite complex even for the expert programmer).
The higher the level of the language, the more difficult it is for the programmer to have a mental model of resource usage, and the more difficult it is to compile it to efficient code. This trend began in the 1960s with Lisp, and continues today with Java.
Haskell makes resource management even more complex, and pH, with its implicit parallelism, makes it even more so (although pH does better than Haskell in that it specifies sharing precisely and permits shared updatable storage). The following features make it more and more difficult to implement a language efficiently:
1) Automatic storage
management
2) Efficient implementation of
polymorphism and overloading
3) Implicit Parallelism (no
explicit threads)
4) Nonstrictness and fine-grain
implicit data synchronization
The first issue is shared with many other programming languages including Lisp, Scheme, Haskell, ML, and Java. The latter two issues are shared with a few other programming languages such as Haskell and concurrent constraint languages.
Automatic Storage Management: In modern high-level languages data is typically allocated in an area of memory called the heap. As the program executes, more and more of the heap is used up. Eventually, when the heap is full, the run-time system engages in an activity called garbage collection (GC).
GC identifies which data are still accessible to the program and which data are not and then recycles the memory occupied by inaccessible data to be available once again for allocation as the program resumes execution.
The run-time system typically needs to maintain some extra bookkeeping information with data so that it can identify objects in the heap and determine whether they are accessible by the program or not. This bookkeeping activity, known as "tagging", together with the reclamation process itself, constitutes an overhead for such languages.
A second way in which automatic storage management can slow the program down is in its interaction with the memory hierarchy. In modern computers, because of multiple levels of caches and virtual memory, achieving maximum speed of execution is crucially dependent on "locality of reference" (i.e., the program must try to maintain as small an active memory "footprint" as possible).
Allocating data on a heap is usually antithetical to this objective, and the GC activity itself explores large memory areas without any locality structure.
The problem of efficient implementation of automatic storage management is not unique to pH; it also exists in Java, scripting languages, Lisp, ML, Haskell, and other functional and logic programming languages. There is a vast body of literature describing solutions to these problems.
Dynamic techniques include: clever representations of heap objects that reduce or eliminate tagging overhead; techniques to pack heap objects to reduce their memory footprint; "generational" garbage collectors that focus on short-lifetime data, thereby reducing the number of times that all of memory needs to be scanned for potential garbage, and so on.
Static techniques include escape analysis, which attempts to identify data whose lifetime coincides with the call and return of some procedure. That data can then be allocated within the procedure's stack frame instead of on the heap and thus the runtime system pays no extra overhead for allocation and de-allocation.
Lifetime analysis is more general than escape analysis in attempting to identify exactly when data is allocated and consumed-it can result in optimizations that pass arguments and results in registers instead of the heap; that avoid building tuples for multiple arguments and multiple results; and that aggregate heap allocation of multiple objects into a single allocation.
Linear Types allow the programmer to ensure that a particular data structure is passed from operation to operation in a single-threaded fashion (i.e., that it is never shared). This allows the compiler to implement functional updates to the data structure, which conceptually copies the data structure, with imperative in-place updates that do no copying.
Polymorphism and Overloading: The efficiency issues due to polymorphism and overloading are similar. Polymorphism often forces a uniform representation of data. A polymorphic procedure on arrays should work on arrays of booleans as well as on arrays of floats, and this naturally leads to representations where a float and a boolean occupy the same storage space, resulting in wastage in the latter case.
For an overloaded procedure that works on arrays of booleans and on arrays of floats, the overloading is typically accomplished by passing in a table of methods-a table of float methods if the array contains floats and a table of boolean methods if the array contains booleans.
The overloaded procedure then invokes the appropriate method from the table. Unfortunately, this can add the overhead of a procedure call for something as simple as an addition operation, which would normally be a single machine instruction in the nonoverloaded case.
Type analysis is a compiler technique that attempts to detect more precisely the types of certain program fragments to eliminate polymorphism and overloading. In the best case, identification of precise types can eliminate uniform representations in favor of natural machine representations (this is often called "unboxing").
Finally, we should note that many of the optimizations described above make it harder to implement a good debugger for the language. For example, consider a debugger in which we wish to examine the data passed into a polymorphic list-sorting function.
In order to display the objects in the input list, the debugger will need a way to reconstruct the type of those objects. However, since the function is polymorphic, the debugger needs to examine the caller of the function, and perhaps its caller, and so on, in order to find out the actual type of the input list.
This task is made more difficult if optimizations have stripped away tags and other type-identifying information. This problem has been well-studied and solutions are described in the literature.
Implementing Implicit Parallelism: Earlier in Part 3 in this series, we described the explicit fork statement of some parallel languages (such as C with P-Threads, or Java). The fork statement allows a programmer to specify the creation of a new, parallel, thread of control.
The fundamental problem with explicit forks is that it can be very difficult for the programmer to choose the placement of fork calls. Standard implementations of fork have a high overhead (it is often hundreds or thousands of times more expensive than a procedure call).
Further, the cost of multiplexing the computer's processors amongst threads is typically very high. Thus, it is necessary to find just the right balance-too many threads will result in too much forking and thread-switching overhead, whereas too few threads will leave processors idle. Having too few threads can also introduce deadlock.
For example, suppose in a call fork f(), the parent thread is supposed to produce data for the child thread. Omitting the fork (thereby converting f(} into an ordinary procedure call), will deadlock the program: the procedure f() will wait for its data, but since it has not returned, the "parent" cannot produce it.
This problem, of choosing where to fork, is faced by programmers of explicitly parallel languages and by compilers of implicitly parallel languages like pH. In principle, a pH program can be compiled into a sequential programming language with explicit forks liberally sprinkled at every point where there is potential parallelism according to pH semantics.
Unfortunately, the right number of threads is generally not syntactically obvious-it depends on the program structure and the run-time synchronization and data dependencies. Further, it can depend on program inputs; that is, there may be no single "correct" identification of fork points in the program.
A first and simple step towards solving this problem is to treat forks as hints rather than imperatives. When encountering a fork call, the run-time system actually forks a thread only if there are idle processors available; otherwise, it simply calls the specified function using a normal function call.
Thus, the programmer or the compiler can freely introduce forks that may potentially specify hundreds or thousands o£ threads, confident that the runtime system will in fact create just enough threads to keep the processors of the computer busy.
The language Cid is an example of a parallel language that explored this technique [21]. Unfortunately, this solution is not a complete solution. Most importantly, it does not solve the deadlock problem described above, because once a fork is treated as a procedure call, there is no going back.
Another problem, less serious, is that the decision whether or not to fork is taken at the time when the fork is encountered, based on the number of idle processors at that time. If the situation changes later, there is no way to revisit the decision (e.g., the system decides not to fork because all processors are busy, but processors become free shortly after the decision is taken).
A more sophisticated approach is called Lazy Task Creation (or LTC). It was originally pioneered in the language Multilisp [15] with an efficient implementation by Marc Feeley [20] but has been further developed in Lazy Threads, StackThreads and in the language Cilk from MIT [12].
Under LTC a fork statement is always treated as an ordinary function call, but the runtime system performs just a little additional book-keeping, leaving a small record with enough information to revoke this decision if necessary.
If a processor becomes idle, it is able to examine the runtime data structures and, if it finds such a record, it can retroactively convert the procedure call into a fork point, and execute the pending work.
The processor that created the record may not be the same as the processor that exploits such a record; in this case, the second processor is said to "steal" the parent from the first.
The key to making this work well is to minimize the overhead of the initial book-keeping, even if it makes stealing expensive. The rationale is that, while there may be thousands or millions of potential fork-points encountered during execution (it depends on the program structure and the data), the actual number of thread creation events (stealing) will be much lower, sometimes as low as the number of processors, which is typically a much smaller number.
In these features, a thread of the program attempts to read a value, but it may have to wait because the value may not be available yet ("empty" in the case of I- and M-structures). That thread can spin and wait, or it can suspend itself, which means enqueuing some state related to the thread on the empty slot.
Later, when some other thread writes data into the empty slot, it must re-enable one or more of the threads waiting on that slot. This is a central activity in any pH implementation and is called thread switching or context switching.
Because of implicit parallelism, nonstrictness, and fine-grain data synchronization in pH, this activity is extremely frequent and is necessary even in a uniprocessor implementation.
Consequently, it has to be extremely efficient, barely more expensive than a procedure call. Contrast this with conventional thread switching in POSIX threads (pthreads), or modern operating systems, where the cost is typically two or more orders of magnitude higher than the cost of a procedure call.
Implementations of pH must thus pay great attention to optimizing this operation, and the code sequences to accomplish this must usually be carefully hand-crafted. These implementation issues are considered in great detail in [25], [9] and by others.
A final point to note is that this particular efficiency issue is shared by all implementations of Haskell and other lazy functional languages, even though they may be sequential implementations. The requirement arises from non-strictness and fine-grain data synchronization, which are orthogonal to parallelism.
Interoperability and Application
Domains
This series has focused on the fundamentals of implicit parallelism
and, as such, the programming examples illustrate classical algorithmic
problems.
Today, the world of programming is much more diverse, encompassing varied domains such as Web-based client-server programming, user interfaces, graphics, distributed computing, and so on. It is almost impossible for a single programming language to encompass such a rich diversity of application domains; instead, over the last few years we have seen the development and growth of several mechanisms for interoperability.
For example a Web browser may be written in one language, but it may interact with a "plug-in" that is written in another language. Or, a spreadsheet may be written in one language while it accesses data from a database management system (possibly on a remote server) that is written in another language.
Ultimately, the success of a programming language depends not only on its core technical advantages, but also on the facilities it provides for interoperability with the vast existing base of software (screen, keyboard and mouse control; file system access; database access; network access; and so on).
Many interoperability mechanisms are quite ad hoc (such as calls to compiled code from many "scripting" languages), but some have been designed with more principles and care.
Two such standards are CORBA from the Object Management Group and COM from Microsoft Corporation. In these standards, the programmer specifies the interaction between two "components", possibly implemented with different programming languages, using a common, language-neutral specification language called an "Interface Definition Language", or IDL.
For each component implementation language of interest, an "IDL Compiler" compiles this specification into stubs for each side of the interaction. These stubs can then be linked in with the actual code of the participant in the interaction.
Researchers in the Haskell community have done much research into interfacing Haskell to COM and CORBA [21] [22]. These Haskell solutions should carry over directly to pH. In fact, it is an open question whether pH's implicit parallelism will improve the structure of interoperating programs because many of them are reactive in nature, which fits naturally into pH's concurrent, data-driven behavior.
The Final Word
pH represents a radically different way of thinking about parallel
programming. We believe that it is increasingly in tune with
developments in computer hardware, where symmetric multiprocessors
comprising multicore processors are becoming the norm due to the
continually decreasing prices of semiconductor components and the
barriers to increasing uniprocessor speed.
This vision is not a foolish dream. Despite the pessimism in
some other quarters (See
"Programmers stuck at C") about the near term emergence of a
viable
parallel programming methodology that will be widely adopted, we
believe pH is a solution that is deployable now, with the right level
of engineering effort. We hope that the vision presented in this series
will excite researchers, students and software engineers to form a
critical mass that can make this vision a reality.
To read Part 1, go to How
sequential languages obscure parallelism.
To read Part 2, go to How to achieve
parallel execution?
To read Part 3, go to Explicit Parallelism:
Shared-Memory Programming with Threads and Locks
To read Part 4, go to Explicit Parallelism:
Message-Passing programming
To read Part 5, go to Implicit
parallelism: Declarative programming languages
To read Part 6, go to "So, why
aren't we using functional languages yet?"
To read Part 7, go to "pH: An
implicitly parallel, declarative language."
Rishiyur S. Nikhil and Arvind are the coauthors of Implicit Parallel Programming in pH, published by Elsevier/Morgan Kaufmann.
Rishiyur is Chief Technical Officer at ESL specialist Bluespec Inc. Previous to its acquisition by Broadcom he led the Bluespec technology team at Sandburst Corp. For the previous 9 years he was at Cambridge Research Laboratory (DEC/Compaq), including over a year as Acting Director. Earlier, he was an Associate Professor of Computer Science and Engineering at MIT. He holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing and compiling.
Arvind is the Johnson Professor of
Computer Science and Engineering at the Massachusetts Institute of
Technology. His work laid the foundations for Bluespec, centering on
high-level specification and description of architectures and protocols
using Term Rewriting Systems (TRSs),
encompassing hardware synthesis as well as verification of
implementations against TRS specifications. Previously, he contributed
to the development of dynamic dataflow architectures, the implicitly
parallel programming languages Id and pH, and compilation of these
languages on parallel machines. He was also president and founder of
Sandburst Corp.
To
read more on this topic on Embedded.com go to More
about multicores, multithreading and multiprocessing.
References:
[1] "Microsoft:
Parallel Programming Model Ten Years off," by Rick Merritt,
Embedded.com, July 23, 2007
[2] "The Problem with Threads" by
Edward A. Lee, IEEE Computer, 39:5, May 2006, pp 33-42
[3] "Why
threads are a bad idea (for most purposes)," J.K. Ousterhout,
Invited Talk, USENIX Technical Conference, January, 1996
[4] S. Ahuja, N. Carriero, and
D. Gelernter. "Linda
and Friends," IEEE Computer, 19(8):26-34, August 1986.
[5] Z. Ariola and Arvind. "Properties
of a First-order Functional Language with Sharing." Theoretical
Computer Science, 146(1-2):69-108, July 1995.
[6] J. Armstrong. "The Development of
Erlang." In Proc. Intl. Conf. on Functional Programming (ICFP),
Amsterdam, The Netherlands, pages 196-203, 1997.
[7] J. Armstrong, R. Virding,
and M. Williams. "Concurrent
Programming in Erlang." Prentice Hall, 1993. ISBN: 0-13-285792-8.
[8] K. Arnold and J. Gosling. The Java
Programming Language, second edition. Addison-Wesley, 1998. ISBN
0-201-31006-6.
[9] Arvind, A. Caro, J.-W.
Maessen, and S. Aditya. A
Multithreaded Substrate and Compilation Model for the Implicitly
Parallel Language pH. In Proc. Wkshp. on Languages and Compilers
for Parallel Computing (LCPC), August 1996.
[10] Arvind, K. P. Gostelow,
and W. Plouffe. The
(preliminary) Id Report. Technical Report 114, Dept. of Information
and Computer Science, Univ. of California, Irvine CA, USA, 1978.
[11] Arvind and R. S. Nikhil. Executing
a Program on the MIT Tagged-Token Dataflow Architecture. IEEE
Trans. on Computers, 39(3):300-318, March 1990.
[12] R. D. Blumofe, C. F.
Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime
System. In Proc. 5th. ACM Symp. on Principles and Practice
of Parallel Programming (PPOPP), Santa Barbara CA, USA, pages 207-216,
July 19-21 1995.
[13] D. E. Culler, A. Dusseau,
S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K.
Yelick. Parallel
Programming in Split-C. In Proc. Supercomputing 93, Portland OR,
USA, November 1993.
[14] A. Geist, A. Begeulin, J.
Dongarra, W. Jiang, R. Manchek, and V. S. Sundaram. PVM
Parallel Virtual Machine. A User's Guide and Tutorial for Network
Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
[15] R. H. Halstead. Multilisp:
A
Language for Concurrent Symbolic Computation. ACM Trans. on
Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
[16] High Performance
Fortran Forum. High Performance Fortran Language Specification,
Version 1.0. Technical Report CRPC-TR92225, Center for Research on
Parallel Computation, Rice University, Houston TX, USA, May 3 1993.
[17] C. Hoare. Monitors:
an Operating System Structuring Concept. Communications of the ACM,
17(10):549-557, October 1974.
[18] R. Hughes. Why
Functional Programming Matters. The Computer Journal, 32(2):98-107,
1989.
[19] Threads
standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
[20] Message Passing Interface
Forum. MPI:
A Message-Passing Interface Standard, May 1994.
[21] R. S. Nikhil. "Cid: A Parallel
'Shared-memory' C for Distributed Memory Machines." In Proc.
7th. Ann. Wkshp. on Languages and Compilers for Parallel Computing
(LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390,
August 8-10 1994.
[22] R. S. Nikhil and Arvind. Id:
a Language with Implicit Parallelism. In A Comparative Study of
Parallel Programming Languages: The Salishan Problems, John Feo
(editor), pages 169-215. North Holland, 1992.
[23] R. S. Nikhil, K. Pingali,
and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation
Structures Group, MIT Lab. for Computer Science, Cambridge MA 02139,
USA, July 1986.
[24] OpenMP Architecture Review
Board. OpenMP
Specifications, 1997 (Fortran), 1998 (C/C++)
[25] S. L. Peyton Jones.
The Implementation of Functional
Programming Languages. Prentice Hall, 1987.
[26] S. L. Peyton Jones
(editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W.
Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M.
Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and
P. Wadler. Haskell
98: A Non-strict, Purely Functional Language, February 1999.
[27] M. C. Rinard and P. C.
Diniz. Commutativity
Analysis: A New Analysis Framework for Parallelizing Compilers. ACM
Trans. on Programming Languages and Systems (TOPLAS), 19(6):942-991,
November 1997.
[28] M. C. Rinard, D. J.
Scales, and M. S. Lam. Jade: A
High-Level, MachineIndependent Language for Parallel Programming.
IEEE Computer, 26(6):28-38, June 1993.
[29] N. Wirth.
Programming in Modula-2. Springer Verlag, Berlin, Germany, 1982.