
Making the transition from sequential to implicit parallel programming: Part 3

A well-known method for parallel programming is to extend a sequential language with threads. These may be regarded as multiple sequential programs that have the same view of all data: two threads may hold the address of the same variable, and they both access the contents of the variable with the standard notations for variable access.

Local variables of a procedure invocation in one thread are private, or inaccessible to other threads, in much the same way that each recursive invocation of a procedure in a sequential program refers to its own copy of local variables.

Threads
One notation to express the creation of a new thread replaces a function call with a fork:

    F(i)        replaced by        fork F(i)

When the machine encounters a normal procedure call, it suspends execution at that point, executes F to completion, returns, and continues execution of the sequel. When the machine encounters a fork, on the other hand, it simultaneously executes F and the sequel.

The two threads are called the “parent” and “child” threads, respectively. For example, we could express a parallel loop as follows (we switch from Fortran to C notation, because thread-and-lock based programming is more common in C than in Fortran):

           int main (. . .)
           {
                   float A[N];
                   int i;

                   for (i = 0; i < N; i++) {
                           fork BODY(i, A);        /* spawn one child thread per iteration */
                   }
                   . . .
                   . . .
           }

           void BODY (int i, float A[])
           {
                   A[i] = f(A[i]);                 /* each child computes with one element of A */
           }

At the end of the for loop, we have N child threads running in parallel, executing BODY(0,A), …, BODY(N-1,A), respectively. Each one, in parallel, computes with one element of A. Note that each thread has its own copy of i but, because of C's procedure call semantics, A refers to the same array for all of them.

A fundamental problem with explicit threads is the choice of thread granularity; that is, where should the programmer place fork calls for maximum advantage? The cost of a fork is typically hundreds to thousands of times the cost of a procedure call.

If the program generates too many threads, it may slow down because of the overhead of thread creation and context-switching between threads. If the program generates too few threads, then processors may remain idle. In many programs, particularly recursive programs, it is difficult for the programmer to find the right balance by placing forks in just the right places.

The task is complicated by the fact that the right placement may depend on input data, it may depend on the number of processors in the machine on which the program is run, it may depend on the other load on those processors, and so on; there is no single “correct” solution.

Programmers sometimes deal with this issue by writing all kinds of conditional code that, at run-time, chooses between a fork or an ordinary procedure call based on querying the load on the system. This greatly obscures the original clarity of the program.
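
A sketch of what such conditional code might look like, in the fork notation used above; current_load() and THRESHOLD are hypothetical names for illustration, not a real API:

           if (current_load() < THRESHOLD)
                   fork BODY(i, A);     /* processors available: run the body in parallel */
           else
                   BODY(i, A);          /* system already busy: ordinary procedure call */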

Locks
When there is a danger of a race condition, data that is shared by multiple threads is usually protected by a lock data structure: a thread acquires the lock, manipulates the shared data, and then releases the lock. When a thread tries to acquire a lock, the implementation forces it to wait if the lock is currently held by another thread.

When the lock is released, only one of possibly many waiting threads is allowed to acquire it. This ensures that the shared data is not simultaneously manipulated by multiple threads and so there are no race conditions.

Normally, only one thread may hold the lock at a time, although there are many variations: for example, read-write locks permit many threads that merely read the shared data to hold the lock simultaneously, while any thread that updates the shared data must acquire the lock exclusively.
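
As a rough illustration, here is a minimal sketch of the read-write lock discipline using POSIX threads [19]; the shared_value variable and its two accessor functions are illustrative names, not part of the original example:

           #include <pthread.h>

           static int              shared_value = 0;
           static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

           int read_shared(void)
           {
                   pthread_rwlock_rdlock(&rw);     /* shared: many readers may hold it at once */
                   int v = shared_value;
                   pthread_rwlock_unlock(&rw);
                   return v;
           }

           void write_shared(int v)
           {
                   pthread_rwlock_wrlock(&rw);     /* exclusive: waits for all readers and writers */
                   shared_value = v;
                   pthread_rwlock_unlock(&rw);
           }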

Let us illustrate the use of locks by extending the above example. In the above code, the parent thread has no way of knowing when the child threads have completed; the loop simply forks the child threads and the parent goes on.

Thus, the parent does not know when it is safe to assume that all the new values in A are ready. One solution is to have a counter, initialized to N, that is decremented by each loop iteration when it has completed its work. The parent thread, observing the counter drop to zero, infers that all child threads have completed and that A is ready.

     int main (…)
     {
             float A[N];
             int counter = N;
             int i;

             for (i = 0; i < N; i++) {
                     fork BODY(i, A, &counter);     /* each child gets the counter's address */
             }
             …
             while (counter > 0) { /* spin */ }     /* wait for the counter to reach zero */
             …
             … /* Now, A is ready */
             …
     }

     void BODY(int i, float A[], int *counter)
     {
             A[i] = f(A[i]);
             *counter = *counter - 1;               /* signal completion of this iteration */
     }

Unfortunately, there is a race condition: each thread, in trying to decrement the counter, reads it, subtracts one, and stores it. Now, it is possible that two threads simultaneously read it, subtract one, and store it.

In this case the counter, instead of being decremented by two, is only decremented by one. The counter will thus never reach zero and the parent thread will be stuck in its spin loop.
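
To make the lost update concrete, here is one possible interleaving, shown as a comment trace (the counter starts at 2, and r1 and r2 stand for per-thread registers):

           /* Thread 1                      Thread 2
              r1 = counter;   (reads 2)
                                            r2 = counter;   (reads 2)
              r1 = r1 - 1;    (1)
                                            r2 = r2 - 1;    (1)
              counter = r1;   (stores 1)
                                            counter = r2;   (stores 1)
              counter ends at 1 after two "decrements", not 0              */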

Things could be even worse. Depending on the processor architecture and the compiler, the counter-decrement statement of BODY may not compile into a single read followed by a single write.

For example, counter may have a 4-byte representation, and the compiler may produce two 2-byte reads and two 2-byte writes. If these finer-granularity operations from different threads are interleaved, there are even more possibilities for the actual value stored in and read from the counter variable.

We introduce a lock to avoid this:

     int main (…)
     {
             float A[N];
             int counter = N;
             lock lk;
             int i;

             for (i = 0; i < N; i++) {
                     fork BODY(i, A, &counter, &lk);
             }
             …
             … /* May do other work, unrelated to A */
             …
             while (counter > 0) { /* spin */ }
             /* Now, A is ready for use */
             …
     }

     void BODY(int i, float A[], int *counter, lock *lk)
     {
             A[i] = f(A[i]);
             wait(lk);                    /* acquire the lock */
             *counter = *counter - 1;     /* protected read-decrement-write */
             signal(lk);                  /* release the lock */
     }

The lock operations ensure that the entire read-decrement-write sequence of the enclosed statement is performed in its entirety by one thread before another thread can attempt it. This eliminates the race condition.
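
As an aside, on compilers that support C11 atomics the same counter can be maintained without an explicit lock. This is only a sketch under that assumption, with helper names (child_done, parent_wait) invented for illustration; it is not how the example above is written:

           #include <stdatomic.h>

           #define N 100

           static atomic_int counter = N;              /* shared by parent and children */

           void child_done(void)
           {
                   atomic_fetch_sub(&counter, 1);      /* indivisible read-decrement-write */
           }

           void parent_wait(void)
           {
                   while (atomic_load(&counter) > 0) { /* spin: each load re-reads memory */ }
           }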

Threads (or processes, or tasks) and locks have been used for parallel programming for decades. They were originally developed for operating systems, even on uniprocessor machines, in order to express computations that are conceptually if not physically parallel.

For example, an operating system must “simultaneously” execute many users' programs, as well as respond to unpredictable events from disks, networks, printers, terminals, etc. This is most easily expressed as a parallel program, even though the threads may be time-multiplexed on a single processor.

While threads and locks have long been used in operating systems, they started becoming visible in users' programs in the early 1980s, when so-called bus-based shared-memory multiprocessors became available.

In these machines, a small number of processors (typically fewer than a dozen) are plugged into a common bus sharing a common memory system. Users could now write parallel programs that ran faster because they actually ran in parallel on multiple processors, without time-multiplexing.

Higher-Level Constructs for Shared-Memory Programming
Coordinating the interaction of parallel threads, e.g., by protecting shared data with locks, is called synchronization of a parallel program. It is somewhat tricky to convince oneself that a thread-based parallel program has adequate synchronization. For example, a purist may complain that access to the counter in the spin loop (while) in the parent thread has not been protected by lock acquisition.

Fortunately, in this case, the omission is benign: since the loop only reads the counter and since the counter value decrements monotonically from N down to zero, reading a zero value is still a correct guarantee that the child threads have completed. In general, however, all accesses to shared data should be protected by lock acquisition.

(Actually, some C compilers perform “optimizations” that may prevent the parent thread from ever reading zero! Because reading the counter variable is not protected by a lock, some C compilers might feel free to make a copy of counter that is “closer” to the parent thread, e.g., in a register or some other nearby memory. In this case, the decrements down to zero are never seen in the local copy. Declaring the variable counter to be volatile can sometimes prevent this behavior.)
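
In the counter example, the corresponding change would be a one-line sketch like this, assuming the rest of the earlier code is unchanged:

           volatile int counter = N;            /* force every access to go back to memory */
           . . .
           while (counter > 0) { /* spin */ }   /* each iteration now re-reads counter */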

To prevent inadvertent synchronization errors, many languages provide higher-level constructs to package up common parallel paradigms. For example, the paradigm of executing iterations of a loop in parallel and waiting for all iterations to complete (also called a barrier) is so common that some languages introduce a special parallel loop construct:

       doall i = 1, N
               A(i) = f(A(i))
       end do

The compiler now manages all the details and will often use a more sophisticated and efficient barrier implementation than the simple counter described above.
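
For C programmers, a roughly similar construct is available today through OpenMP; this sketch assumes an OpenMP-capable compiler and is not from the original example. The pragma parallelizes the iterations, and an implicit barrier at the end of the loop plays the role of the counter:

           #pragma omp parallel for
           for (int i = 0; i < N; i++) {
                   A[i] = f(A[i]);
           }
           /* implicit barrier: all iterations have completed before execution continues */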

Another common language construct is the monitor [17]. Here, the programmer declares a shared data structure along with a collection of procedures to manipulate it. The data structure is not directly visible to the rest of the program, so it can only be manipulated by calling the monitor procedures.

The compiler ensures that these procedures implicitly acquire a lock on entry when they are called and release the lock on exit just before returning, ensuring one-at-a-time access to the shared data. This approach to structured synchronization is used in the language Modula [29], in the POSIX threads extensions to C and Fortran [19], and most recently in synchronized methods in Java [8].
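
In plain C with POSIX threads, the monitor discipline can be imitated by hand. Here is a minimal sketch (the names shared_counter, counter_decrement, and counter_read are illustrative): the shared data is file-private, and every procedure that touches it acquires the lock on entry and releases it on exit.

           #include <pthread.h>

           static int             shared_counter = 0;
           static pthread_mutex_t counter_lock   = PTHREAD_MUTEX_INITIALIZER;

           void counter_decrement(void)
           {
                   pthread_mutex_lock(&counter_lock);      /* "enter" the monitor */
                   shared_counter = shared_counter - 1;
                   pthread_mutex_unlock(&counter_lock);    /* "leave" the monitor */
           }

           int counter_read(void)
           {
                   pthread_mutex_lock(&counter_lock);
                   int v = shared_counter;
                   pthread_mutex_unlock(&counter_lock);
                   return v;
           }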

Another popular language construct is the future (for example, in the language Multilisp [15]). In parallel programs, we often encounter the following scenario: we wish to fork a function, which will ultimately return some result, but we do not need the result immediately, since there is other useful work to be done before we need the result.

However, when we eventually do need the result, we need some way of waiting for it, since it may not be ready yet. Futures are a notation to express this. For example, suppose we want a program that recursively traverses a binary tree, adding up some integer information available at each tree node. We can express this using futures, as follows:

       typedef struct tree {
               int info;
               struct tree *left, *right;
       } *treeptr;

       int tree_Add(treeptr tp)
       {
               future int x;
               int y;

               if (tp == NULL)
                       return 0;
               else {
                       future x = tree_Add(tp->left);    /* fork traversal of the left subtree */
                       y = tree_Add(tp->right);          /* parent traverses the right subtree */
                       return touch(x) + y + tp->info;   /* wait for x, then combine */
               }
       }

The declaration future int x conceptually declares x to be an integer together with a lock that is initially in the locked state. If the tree is not empty, the future call forks a call to traverse the left subtree. This forked call will ultimately deliver its integer result into the integer component of x and signal the lock, indicating that x is ready.

Meanwhile, the parent thread goes on to traverse the right subtree. Note that this happens recursively at every level of the tree, so it can generate threads exponentially. After the right subtree's value has been computed (y), we wait for the result x by executing touch(x), which waits for the lock to be signalled and then returns the integer value in x. This is combined into the final answer and returned.
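
One way to picture the machinery behind a future is the following sketch built on POSIX threads; the type future_int and the functions future_init, future_set, and future_touch are hypothetical names for what a language implementation might generate, not a real library:

           #include <pthread.h>

           typedef struct {
                   int             value;
                   int             ready;   /* 0 until the child delivers a result */
                   pthread_mutex_t m;
                   pthread_cond_t  cv;
           } future_int;

           void future_init(future_int *f)
           {
                   f->ready = 0;
                   pthread_mutex_init(&f->m, NULL);
                   pthread_cond_init(&f->cv, NULL);
           }

           void future_set(future_int *f, int v)    /* called by the forked child */
           {
                   pthread_mutex_lock(&f->m);
                   f->value = v;
                   f->ready = 1;
                   pthread_cond_signal(&f->cv);     /* "signal the lock": x is now ready */
                   pthread_mutex_unlock(&f->m);
           }

           int future_touch(future_int *f)          /* touch(x): wait, then return the value */
           {
                   pthread_mutex_lock(&f->m);
                   while (!f->ready)
                           pthread_cond_wait(&f->cv, &f->m);
                   int v = f->value;
                   pthread_mutex_unlock(&f->m);
                   return v;
           }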

Constructs like doall, monitors, and futures make it easier to write robust parallel programs because they hide the details of lock manipulation completely; one is less likely to forget inadvertently to protect shared data access with suitable lock operations.

Recently, two languages, Jade [28] and Cilk [12], have gone further. Both rely on programmer annotations to indicate what can be executed in parallel, but both provide a guarantee that parallel execution is faithful to the semantics of sequential execution (where the parallelism annotations are ignored). This is a powerful guarantee because it allows one to debug the program in sequential mode using standard tools.

In Jade, the programmer specifies that a section of code can be executed in parallel (analogous to forking it), but must also declare what data is accessed by the section and how (read or write). This declaration is executable, so that it can handle dynamic data structures.

The run-time system checks that data accesses by the parallel code indeed conform to the declarations and introduces all the synchronization necessary to preserve the sequential semantics. Cilk provides extensions to C for forking threads and waiting for their completion. A thread-fork is basically an annotation on a procedure call.

This induces a certain precedence between the parallel activities of a program: it is a directed acyclic graph (DAG), or partial order. The scheduler in Cilk's run-time system is carefully designed so that threads access memory according to this partial order, preserving the sequential semantics.

Next in Part 4: Explicit parallelism – message passing
To read Part 1, go to How sequential languages obscure parallelism.
To read Part 2, go to How to achieve parallel execution?

Rishiyur S. Nikhil and Arvind are the coauthors of Implicit Parallel Programming in pH, published by Elsevier/Morgan Kaufmann.

Rishiyur is Chief Technical Officer at ESL specialist Bluespec Inc. Previous to its acquisition by Broadcom, he led the Bluespec technology team at Sandburst Corp. For the previous nine years he was at Cambridge Research Laboratory (DEC/Compaq), including over a year as Acting Director. Earlier, he was an Associate Professor of Computer Science and Engineering at MIT. He holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing, and compiling.

Arvind is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology. His work laid the foundations for Bluespec, centering on high-level specification and description of architectures and protocols using Term Rewriting Systems (TRSs), encompassing hardware synthesis as well as verification of implementations against TRS specifications. Previously, he contributed to the development of dynamic dataflow architectures, the implicitly parallel programming languages Id and pH, and compilation of these languages on parallel machines. He was also president and founder of Sandburst Corp.

To read more on this topic on Embedded.com, go to More about multicores, multithreading and multiprocessing.

References:
[1] “Microsoft: Parallel Programming Model Ten Years Off,” by Rick Merritt, Embedded.com, July 23, 2007.
[2] “The Problem with Threads,” by Edward A. Lee, IEEE Computer, 39:5, May 2006, pp 33-42.
[3] “Why threads are a bad idea (for most purposes),” J. K. Ousterhout, Invited Talk, USENIX Technical Conference, January 1996.
[4] S. Ahuja, N. Carriero, and D. Gelernter. “Linda and Friends,” IEEE Computer, 19(8):26-34, August 1986.
[5] Z. Ariola and Arvind. “Properties of a First-order Functional Language with Sharing.” Theoretical Computer Science, 146(1-2):69-108, July 1995.
[6] J. Armstrong. “The Development of Erlang.” In Proc. Intl. Conf. on Functional Programming (ICFP), Amsterdam, The Netherlands, pages 196-203, 1997.
[7] J. Armstrong, R. Virding, and M. Williams. Concurrent Programming in Erlang. Prentice Hall, 1993. ISBN 0-13-285792-8.
[8] K. Arnold and J. Gosling. The Java Programming Language, second edition. Addison-Wesley, 1998. ISBN 0-201-31006-6.
[9] Arvind, A. Caro, J.-W. Maessen, and S. Aditya. A Multithreaded Substrate and Compilation Model for the Implicitly Parallel Language pH. In Proc. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), August 1996.
[10] Arvind, K. P. Gostelow, and W. Plouffe. The (preliminary) Id Report. Technical Report 114, Dept. of Information and Computer Science, Univ. of California, Irvine CA, USA, 1978.
[11] Arvind and R. S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. on Computers, 39(3):300-318, March 1990.
[12] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Proc. 5th ACM Symp. on Principles and Practice of Parallel Programming (PPOPP), Santa Barbara CA, USA, pages 207-216, July 19-21, 1995.
[13] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proc. Supercomputing 93, Portland OR, USA, November 1993.
[14] A. Geist, A. Begeulin, J. Dongarra, W. Jiang, R. Manchek, and V. S. Sundaram. PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
[15] R. H. Halstead. Multilisp: A Language for Concurrent Symbolic Computation. ACM Trans. on Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
[16] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, Houston TX, USA, May 3, 1993.
[17] C. Hoare. Monitors: an Operating System Structuring Concept. Communications of the ACM, 17(10):549-557, October 1974.
[18] R. Hughes. Why Functional Programming Matters. The Computer Journal, 32(2):98-107, 1989.
[19] Threads standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
[20] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May 1994.
[21] R. S. Nikhil. “Cid: A Parallel 'Shared-memory' C for Distributed Memory Machines.” In Proc. 7th Ann. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390, August 8-10, 1994.
[22] R. S. Nikhil and Arvind. Id: a Language with Implicit Parallelism. In A Comparative Study of Parallel Programming Languages: The Salishan Problems, John Feo (editor), pages 169-215. North Holland, 1992.
[23] R. S. Nikhil, K. Pingali, and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation Structures Group, MIT Lab.
