Wouldn't it be great if our programs ran twice as fast merely by upgrading our workstation, desktop, server, or embedded systems – from portable media players and smart mobile phones to network switches and routers – from one processor up to two, or from two up to four?
Wouldn't it be great if doubling the number of processors meant that it took only half as long to evaluate alternative designs using CAD tools, to complete a modelling run for tomorrow's weather, or to search a large database?
We are all very used to the idea of gaining increased performance by upgrading our workstation with a faster processor, or with a larger or faster memory system. Unfortunately, we do not see this kind of automatic improvement when we upgrade to a multiprocessor.
Although a multiprocessor often allows us to run more programs simultaneously, it seldom allows us to complete a given program in less time, or to run a larger program in a given amount of time, without extraordinary effort.
The problem is that most application programs are written in sequential programming languages, and these programs cannot exploit the extra processors available in a multiprocessor.
We can run separate programs on the different processors to increase the overall throughput of the system, but no single program runs faster by dividing its work amongst multiple processors, the way a parallel program could.
Why are most programs sequential?
The answer to this question has a number of dimensions, one of which is historical – it was only in the late 1990s that multiprocessors began to be commonly affordable and available as commodity hardware, and so until recently there was no widespread interest in writing parallel programs.
But, in addition, most existing approaches to producing parallel programs are very hard, whether they involve automatically converting existing sequential programs into parallel ones, or whether they involve the programmer explicitly writing parallel programs.
In this series of articles we will explore the question of what makes parallel programming so hard, and we will end with an overview of the solution that we propose: implicit parallel programming, embodied in the language pH.
(A follow-on series of articles on Embedded.com will discuss our latest work in this area, based on “…”.)
Most of this parallelism is obscured and eliminated by the time we write the code in a concrete sequential programming language. We will see why it is very difficult for an automatic system to recover this obscured parallelism.
If parallelism is not recoverable from our existing sequential programs, an alternative is to rewrite our programs with parallelism in mind. Three approaches suggest themselves:
1) We can write our programs in sequential languages in certain stylized ways and with certain annotations that make it easy for a compiler to generate parallel code. The design of these idioms and annotations is influenced by our insights into what makes automatic parallelization of legacy sequential programs difficult.
Examples of programming languages in this category are
2) We can abandon automatic parallelization and extend our existing sequential languages with explicit constructs to initiate parallel activities and to coordinate their interactions. There are a large number of programming languages in this category, including
3) We can write our programs in a new programming language where parallelism is implicit. This is the approach advocated in this series of articles, and pH is such a language. pH is parallel from the ground up – there is no sequential core. Its approach is quite the opposite of the previous ones – the very first, simplest programs that a pH programmer writes are parallel, and it is only later that an advanced pH programmer learns judiciously to introduce a little explicit sequencing. As such, exploiting multiple processors is almost trivial in pH.
pH builds on over two decades of experience with
When discussing parallel programming it is important also to keep a proper perspective on expectations of scaling. Parallel computers today come in a wide range of sizes, from uniprocessors to thousands of processors.
It is, in general, unreasonable to expect that a program's speed will automatically scale across such a diversity of platforms. In very large machines a processor may experience several orders of magnitude variation in access times for data that is “near” to the processor versus “far”.
This fundamental change in cost model may require algorithms to be redesigned: how we map work to different processors, how we place data in the machine's memory relative to processors, and how we schedule activities. None of the above approaches to parallel programming can completely insulate us from these considerations.
However, it is reasonable to expect that scaling should happen automatically, without programmer intervention, for the kinds of parallel machines that are likely to be in the vast majority as commodity machines – small “SMPs” (symmetric multiprocessors).
Parallelism in Matrix Multiplication
Many problems, when viewed abstractly, appear to be abundantly parallel. However, this parallelism gets obscured when the problem is expressed in a sequential language. Even the most sophisticated compilers are able to recover only a small part of the obscured parallelism. We illustrate this with the problem of matrix multiplication, shown in Figure 1-1 below.
Figure 1-1. Matrix multiplication.
Matrix multiplication is a computation at the heart of many problems such as scaling, rotation, and displacement transformations in computer graphics. A matrix is just a two-dimensional grid containing numbers in every slot. Given two n x n matrices A and B, their product C = A x B is another n x n matrix, such that:
Cij = inner product of A's row i with B's column j
which can be visualized as shown in Figure 1-1 above. A's row i and B's column j are called vectors – one-dimensional grids of numbers. The inner-product computation can be expressed as:
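The displayed expression did not survive extraction; written out in full (the standard definition, consistent with Figure 1-2), it is:

```
Cij = A(i,1)*B(1,j) + A(i,2)*B(2,j) + ... + A(i,n)*B(n,j)
```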
which can be visualized as shown in Figure 1-2 below, where we have laid B's column j on its side for clarity.
Figure 1-2. Inner product of two vectors.
What can be computed in parallel, and what must be done sequentially? First, it should be clear from Figure 1-1 that each component of C is independent of all other components of C. Thus, we can compute C11 in parallel with C12 in parallel with C21 … and so on; that is, n² inner products in parallel.
For example, if we put up the three matrices on a large blackboard and employed n² lieutenants, one per slot in C, each lieutenant could compute his assigned inner product independently of the others, after which he could walk up and write his answer in his assigned slot of C.
From Figure 1-2 above, it should be clear that all the multiplications can be done in parallel. However, the additions must be done sequentially, because each addition needs the result of a previous one. In other words, each inner-product lieutenant could hire n ensigns to perform one multiplication each and n ensigns to perform one addition each; all the multiplier ensigns would compute in parallel, but the plus ensigns would resemble a relay race.
There is more parallelism available if we recognize that “+” is associative, namely that (x + (y + z)) = ((x + y) + z) – the order of the additions does not matter. The new computation structure is shown in Figure 1-3 below. The “+” ensigns, instead of being organized as a linear relay, are now organized as a tree.
Figure 1-3. Tree summation in inner product.
It is clear that all the “+” ensigns at the leaves of the tree (top row) can operate in parallel, after which all the “+” ensigns at the next level can operate in parallel, and so on until the ensign at the root of the tree produces the final sum. A little calculation shows that the N additions, instead of taking N steps, now take only log2(N) steps (the height of the tree).
(Arithmetic in computers is often done with floating point numbers, for which “+” is not associative, but let us ignore this problem for the moment.)
It turns out that there is another way to obtain parallelism, which is perhaps not so useful in this example, but we mention it here for completeness. By recognizing that “+” is also commutative, namely that (x + y) = (y + x), we could also structure the inner product as shown in Figure 1-4 below.
Figure 1-4. Nondeterministic summation in inner product.
Here, the “lieutenant” employs n “ensigns” to do the multiplications in parallel, as before. Meanwhile, the lieutenant keeps a running total, initially zero. He also keeps a tally of how many multiplication ensigns have checked in, initially zero.
As each ensign finishes his multiplication, he sends a token to the lieutenant, reporting his result. The lieutenant adds it to his running total and increments his tally. When his tally reaches n, the lieutenant knows that all multiplication ensigns have been accounted for, and delivers the final total, which is Cij (for which, presumably, he receives a token medal).
Observe that the additions occur in a nondeterministic order, as and when product tokens arrive. In both our previous versions of the inner product, the summation was deterministic. Now, the summation is again sequential, because only one addition happens at a time.
However, unlike the previous versions, if the multiplication ensigns took widely varying times (alas, in these days of voluntary service the quality of ensigns is very uneven), ensigns who are slow will not hold back the summation of products delivered by fast ensigns; in some situations, this may be preferable to the tree summation.
Matrix Multiplication in a Sequential Language
We have seen that matrix multiplication has abundant parallelism at the algorithmic level. Now let us consider a Fortran program for matrix multiplication (the choice of Fortran here is unimportant; our discussion applies equally to C, C++, Pascal, and other languages):
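The listing itself did not survive extraction; a minimal reconstruction, consistent with the statements quoted in the discussion that follows (the scalar s and the loop structure are taken from that discussion):

```fortran
      do i = 1, N
        do j = 1, N
          s = 0.0
          do k = 1, N
            s = s + A(i,k) * B(k,j)
          end do
          C(i,j) = s
        end do
      end do
```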
This program is totally sequential: exactly one operation is performed at a time, as depicted schematically in the schedule below (imagine time flowing downwards):
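The schedule figure was lost; schematically, it lists one operation per time step, along the lines of:

```
      s = 0.0
      s = s + A(1,1) * B(1,1)
      s = s + A(1,2) * B(2,1)
      ...
      C(1,1) = s
      s = 0.0
      s = s + A(1,1) * B(1,2)
      ...
```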
In other words, while the abstract solution had abundant parallelism, we discarded it completely when we encoded it into a sequential programming language.
So, what makes it difficult to execute our existing programs in parallel? The principal difficulty has to do with state, which manifests itself in the reuse of storage, which, in turn, leads to nondeterminism (or races) if executed in parallel. In our example program, the variable s initially contains 0.0, which is replaced by
(current value of s) + A(1,1) * B(1,1)
which, in turn, is replaced by
(current value of s) + A(1,2) * B(2,1)
and so on. In other words, s is a storage location into which we can read and write values. At any point in time, it contains the most recent value written. Languages that model state by reuse of storage are also known as imperative languages.
As long as our programming languages have this notion of state, it is not trivial to execute things in parallel. In our matrix multiplication program, suppose we indicate that the i and j loops should run in parallel (by using the popular doall notation):
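Again the annotated listing was lost; a sketch, with doall replacing do on the two outer loops:

```fortran
      doall i = 1, N
        doall j = 1, N
          s = 0.0
          do k = 1, N
            s = s + A(i,k) * B(k,j)
          end do
          C(i,j) = s
        end doall
      end doall
```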
Conceptually, we now have the following computations running in parallel:
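The display was lost; presumably it showed the inner-product computations side by side, one per (i, j) slot:

```
      s = 0.0                      s = 0.0
      s = s + A(1,1) * B(1,1)      s = s + A(1,1) * B(1,2)      ...
      s = s + A(1,2) * B(2,1)      s = s + A(1,2) * B(2,2)      ...
      ...                          ...
      C(1,1) = s                   C(1,2) = s                   ...
```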
and so on, for each of the N² combinations of (i, j). Unfortunately, there is a problem, called a race condition. Consider the simultaneous execution of the following two statements:
s = s + A(1,1) * B(1,1)
s = s + A(1,1) * B(1,2)
which occur in the apparently independent inner-product computations for C11 and C12, respectively. Each statement reads a value from storage location s, adds a value to it, and stores it back.
If the reads and writes from different parallel statements are interleaved arbitrarily, these statements will interfere with each other – they continually clobber each other's s values. The values ultimately computed for matrix C are likely to be utterly wrong.
Unfortunately, this error may not be detectable easily, because the error may not be visually apparent (which of us could eyeball large A, B, and C matrices and recognize immediately that the product has been miscalculated?), or because the computation may be buried deep inside a large program and the error ultimately surfaces elsewhere.
Debugging a program with a race condition is a nightmare because the aberrant behavior is often not reproducible, since it is so time-dependent. Suppose, instead, we had written our program this way:
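This listing, too, was lost; a reconstruction consistent with the description that follows (C(i,j) itself replacing the scalar s):

```fortran
      doall i = 1, N
        doall j = 1, N
          C(i,j) = 0.0
          do k = 1, N
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
          end do
        end doall
      end doall
```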
Here, we have eliminated the variable s and used C(i,j) itself as the location to hold the partial sum of the inner product. Now, our concurrent statements are:
C(1,1) = C(1,1) + A(1,1) * B(1,1)
C(1,2) = C(1,2) + A(1,1) * B(1,2)
and there is no longer any interference – the two inner products use two separate locations to hold their partial sums, so there is no danger of them reading and writing each other's values.
This illustrates the difficulties encountered when concurrent computations simultaneously access a shared, updatable storage location. It also illustrates how sensitive the parallelism can be to the original coding style. In fact, a skilled programmer may well prefer the first version because a compiler would:
(a) find it easier to keep s in a register than C(i,j), and
(b) find it easier to avoid repeated calculations of the address of C(i,j).
A powerful argument for sequential languages, therefore, is that the programmer does not have to worry about such race conditions – the sequential semantics of Fortran, for example, enforces a particular, easy-to-understand schedule on the reads and writes to shared locations. This enables the programmer to reason about the state-manipulation behavior of programs and, for debugging, to reproduce the behavior of errant programs exactly.
Next in Part 2: How to achieve parallel execution?
Rishiyur is Chief Technical Officer at ESL specialist Bluespec Inc. Previous to its acquisition by Broadcom, he led the Bluespec technology team at Sandburst Corp. For the previous 9 years he was at Cambridge Research Laboratory (DEC/Compaq), including over a year as Acting Director. Earlier, he was an Associate Professor of Computer Science and Engineering at MIT. He holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing, and compiling.
Arvind is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology. His work laid the foundations for Bluespec, centering on high-level specification and description of architectures and protocols using Term Rewriting Systems (TRSs), encompassing hardware synthesis as well as verification of implementations against TRS specifications. Previously, he contributed to the development of dynamic dataflow architectures, the implicitly parallel programming languages Id and pH, and compilation of these languages on parallel machines. He was also president and founder of Sandburst Corp.
To read more on this topic on Embedded.com go to More about multicores, multithreading and multiprocessing.
“Microsoft: Parallel Programming Model Ten Years Off,” by Rick Merritt, Embedded.com, July 23, 2007.
“The Problem with Threads,” by Edward A. Lee, IEEE Computer, 39:5, May 2006, pp. 33-42.
“Why Threads Are a Bad Idea (for Most Purposes),” J. K. Ousterhout, Invited Talk, USENIX Technical Conference, January 1996.
S. Ahuja, N. Carriero, and D. Gelernter. “Linda and Friends.” IEEE Computer, 19(8):26-34, August 1986.
Z. Ariola and Arvind. “Properties of a First-order Functional Language with Sharing.” Theoretical Computer Science, 146(1-2):69-108, July 1995.
J. Armstrong. “The Development of Erlang.” In Proc. Intl. Conf. on Functional Programming (ICFP), Amsterdam, The Netherlands, pages 196-203, 1997.
J. Armstrong, R. Virding, and M. Williams. Concurrent Programming in Erlang. Prentice Hall, 1993. ISBN 0-13-285792-8.
K. Arnold and J. Gosling. The Java Programming Language, second edition. Addison-Wesley, 1998. ISBN 0-201-31006-6.
Arvind, A. Caro, J.-W. Maessen, and S. Aditya. A Multithreaded Substrate and Compilation Model for the Implicitly Parallel Language pH. In Proc. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), August 1996.
Arvind, K. P. Gostelow, and W. Plouffe. The (preliminary) Id Report. Technical Report 114, Dept. of Information and Computer Science, Univ. of California, Irvine CA, USA, 1978.
Arvind and R. S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. on Computers, 39(3):300-318, March 1990.
R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Proc. 5th ACM Symp. on Principles and Practice of Parallel Programming (PPoPP), Santa Barbara CA, USA, pages 207-216, July 19-21, 1995.
D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proc. Supercomputing 93, Portland OR, USA, November 1993.
A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam. PVM Parallel Virtual Machine: A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
R. H. Halstead. Multilisp: A Language for Concurrent Symbolic Computation. ACM Trans. on Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, Houston TX, USA, May 3, 1993.
C. Hoare. Monitors: An Operating System Structuring Concept. Communications of the ACM, 17(10):549-557, October 1974.
R. Hughes. Why Functional Programming Matters. The Computer Journal, 32(2):98-107, 1989.
Threads standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May 1994.
R. S. Nikhil. “Cid: A Parallel 'Shared-memory' C for Distributed Memory Machines.” In Proc. 7th Ann. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390, August 8-10, 1994.
R. S. Nikhil and Arvind. Id: A Language with Implicit Parallelism. In A Comparative Study of Parallel Programming Languages: The Salishan Problems, John Feo (editor), pages 169-215. North Holland, 1992.
R. S. Nikhil, K. Pingali, and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation Structures Group, MIT Lab. for Computer Science, Cambridge MA 02139, USA, July 1986.
OpenMP Architecture Review Board. OpenMP Specifications, 1997 (Fortran), 1998 (C/C++).
S. L. Peyton Jones. The Implementation of Functional Programming Languages. Prentice Hall, 1987.
S. L. Peyton Jones (editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W. Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M. Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and P. Wadler. Haskell 98: A Non-strict, Purely Functional Language, February 1999.
M. C. Rinard and P. C. Diniz. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers. ACM Trans. on Programming Languages and Systems (TOPLAS), 19(6):942-991, November 1997.
M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: A High-Level, Machine-Independent Language for Parallel Programming. IEEE Computer, 26(6):28-38, June 1993.
N. Wirth. Programming in Modula-2. Springer Verlag, Berlin, Germany, 1982.
The assumption that (x + (y + z)) = ((x + y) + z) is not true for FP numbers, which means that a parallel implementation may give slightly different results (possibly better) than a sequential one. However, a big question is whether it would always give THE SAME result – which is not clear from the article.