CMP EMBEDDED.COM

Login | Register     Welcome Guest  
HOME DESIGN PRODUCTS COLUMNS E-LEARNING CONFERENCES CODE FORUMS/BLOGS NEWSLETTERS CONTACT FEATURES RSS RSS

Making the transition from sequential to implicit parallel programming: Part 3
Explicit Parallelism: Shared-Memory Programming with Threads and Locks



Embedded.com
A well-known method for parallel programming is to extend a sequential language with threads. These may be regarded as multiple sequential programs that have the same view of all data-two threads may hold the address of the same variable and they both access the contents of the variable with the standard notations for variable access.

Local variables of a procedure invocation in one thread are private, or inaccessible to other threads, in much the same way that each recursive invocation of a procedure in a sequential program refers to its own copy of local variables.

Threads
One notation to express creation of a new thread replaces a function call with a fork:

    F(i)         replaced by         fork F(i)

When the machine encounters a normal procedure call, it suspends execution at that point, executes F to completion, returns, and continues execution of the sequel. When the machine encounters a fork, on the other hand, it simultaneously executes F and the sequel.

The two threads are called the "parent" and "child" threads, respectively. For example, we could express a parallel loop as follows (we switch from Fortran to C notation, because thread-and-lock based programming is more common in C than in Fortran):

            int main (. . .)
            {
                    float A [N] ;

                    for (i = 0; i < N; i++) {
                          fork BODY(i, A);
                    }
            . . .
            . . .
            }

            void BODY (int i, float A[])
            {
                    A[i] = f(A[i]);
            }

At the end of the for loop, we have N child threads running in parallel, executing BODY(O,A), ..., BODY(N-1,A) respectively. Each one, in parallel, computes with one element of A. Note that each thread has its copy of i but, because of C's procedure call semantics, A refers to the same array for all of them.

A fundamental problem with explicit threads is the choice of thread granularity; that is, where should the programmer place fork calls for maximum advantage? The cost of a fork is typically hundreds to thousands of times the cost of a procedure call.

If the program generates too many threads, it may slow down because of the overhead of thread creation and context-switching between threads. If the program generates too few threads, then processors may remain idle. In many programs, particularly recursive programs, it is difficult for the programmer to find the right balance by placing forks in just the right places.

The task is complicated by the fact that the right placement may depend on input data, it may depend on the number of processors in the machine on which the program is run, it may depend on the other load on those processors, and so on; there is no single "correct" solution.

Programmers sometimes deal with this issue by writing all kinds of conditional code that, at run-time, chooses between a fork or an ordinary procedure call based on querying the load on the system. This greatly obscures the original clarity of the program.

Locks
When there is a danger of a race condition, data that is shared by multiple threads is usually protected by a lock data structure: a thread acquires the lock, manipulates the shared data, and then releases the lock. When a thread tries to acquire a lock, the implementation forces it to wait if the lock is currently held by another thread.

When the lock is released, only one of possibly many waiting threads is allowed to acquire it. This ensures that the shared data is not simultaneously manipulated by multiple threads and so there are no race conditions.

Normally, only one thread may hold the lock at a time, although there are many variations: for example, read-write locks permit many threads that merely read the shared data to hold the lock simultaneously, while any thread that updates the shared data must acquire the lock exclusively.

Let us illustrate the use of locks by extending the above example. In the above code, the parent thread has no way of knowing when the child threads have completed; the loop simply forks the child threads and the parent goes on.

Thus, the parent does not know when it is safe to assume that all the new values in A are ready. One solution is to have a counter, initialized to N, that is decremented by each loop iteration when it has completed its work. The parent thread, observing the counter drop to zero, infers that all child threads have completed and that A is ready.

     int main (...)
     {
                float A[N];
                int counter = N;
                for (i = 0; i < N; i++) {
                        fork BODY(i, A, & counter);
                }
                ...
                while (counter > 0) { /* spin */};
                ...
                ... /* Now, A is ready */
                ...
                ...
       }

        void BODY(int i, float A[], int *counter)
        {
                A [i] = f (A [i]) ;
                *counter = *counter - 1;
        }

Unfortunately, there is a race condition: each thread, in trying to decrement the counter, reads it, subtracts one, and stores it. Now, it is possible that two threads simultaneously read it, substract one, and store it.

In this case the counter, instead of being decremented by two, is only decremented by one. The counter will thus never reach zero and the parent thread will be stuck in its spin loop.

Things could be even worse. Depending on the processor architecture and the compiler, the counter-decrement statement of BODY may not compile into a single read followed by a single write.

For example, counter may have a 4-byte representation, and the compiler may produce two 2-byte reads and two 2-byte writes. If these finer granularity operations from different threads are interleaved there could be even more possibilities for the actual value stored and read in the counter variable.

We introduce a lock to avoid this:

int main (...)
{
            float A[N] ;
            int counter = N;
            lock lk;
            for (i = 0; i < N; i++)
                    fork BODY(i, A, & counter, & lk);
            }
            ...
            ... /* May do other work, unrelated to A */
            ...
            while (counter > 0) { /* spin */ }
            /* Now, A is ready for use */
            ...
            ...
    }
    void BODY(int i, float A[], int *counter, lock*1k)
    {
            A[i] = f(A[i]);
            wait(lk);                 /* acquire the lock */
            *counter = *counter - 1;
            signal(lk);             /* release the lock */
}

The lock operations ensure that the entire read-decrement-write sequence of the enclosed statement is performed in its entirety by one thread before another thread can attempt it. This eliminates the race condition.

Threads (or processes, or tasks) and locks have been used for parallel programming for decades. They were originally developed for operating systems, even on uniprocessor machines, in order to express computations that are conceptually if not physically parallel.

For example, an operating system must "simultaneously" execute many users' programs, as well as respond to unpredictable events from disks, networks, printers, terminals, etc. This is most easily expressed as a parallel program, even though the threads may be time-multiplexed on a single processor.

While threads and locks have long been used in operating systems, they started becoming visible in users' programs since the early 1980s when so-called bus-based shared-memory multiprocessors started becoming available.

In these machines, a small number of processors (typically less than a dozen) are plugged into a common bus sharing a common memory system. Users could now write parallel programs that ran faster because they actually ran in parallel on multiple processors, without time-multiplexing.

Higher-Level Constructs for Shared-Memory Programming
Coordinating the interaction of parallel threads, e.g., by protecting shared data with locks, is called synchronization of a parallel program. It is somewhat tricky to be convinced that a thread-based parallel program has adequate synchronization. For example, a purist may complain that access to the counter in the spin loop (while) in the parent thread has not been protected by lock acquisition.

Fortunately, in this case, the omission is benign-since the loop only reads the counter and since the counter value decrements monotonically from N down to zero, reading a zero value is still a correct guarantee that the child processes have completed. In general, however, all accesses to shared data should be protected by lock acquisition.

(Actually, some C compilers perform "optimizations" that may prevent the parent thread from ever reading zero! Because reading the counter variable is not protected by a lock, some C compilers might feel free to make a copy of counter that is "closer" to the parent thread, e.g., in a register or some other nearby memory. In this case, the decrements down to zero are never seen in the local copy. Declaring the variable counter to be volatile can sometimes prevent this behavior.)

To prevent inadvertent synchronization errors many languages provide higher-level constructs to package up common parallel paradigms. For example, the paradigm of executing iterations of a loop in parallel and waiting for all iterations to complete (also called a barrier) is so common that some languages introduce a special parallel loop construct:

        doall  i = 1, N
                  A(i) = f(A(i))
        end do
The compiler now manages all the details and will often use a more sophisticated and efficient barrier implementations than the simple counter described above.

Another common language construct is the monitor [17]. Here, the programmer declares a shared data structure along with a collection of procedures to manipulate it. The data structure is not directly visible to the rest of the program, so it can only be manipulated by calling the monitor procedures.

The compiler ensures that these procedures implicitly acquire a lock on entry when they are called and release the lock on exit just before returning, ensuring one-at-a-time access to the shared data. This approach to structured synchronization is used in the language Modula [29], in the POSIX threads extensions to C and Fortran [19], and most recently in synchronized methods in Java [8].

Another popular language construct is the future (for example, in the language Multilisp [15]). In parallel programs, we often encounter the following scenario: we wish to fork a function, which will ultimately return some result, but we do not need the result immediately, since there is other useful work to be done before we need the result.

However, when we eventually do need the result, we need some way of waiting for it, since it may not be ready yet. Futures are a notation to express this. For example, suppose we want a program that recursively traverses a binary tree, adding up some integer information available at each tree node. We can express this using futures, as follows:

        typedef struct tree
                    int info;{
                    struct tree *left, *right; } *treeptr;

        int tree _Add(treeptr tp)
        {
                    future int x;
                    int y;

                    if (tp == NULL)
                            return 0;
                    else {
                          future x = tree-Add (tp->left);
                          y = tree Add (tp->right);
                          return touch (x) + y + tp->info;
                    }
            }

The declaration future int x conceptually declares x to be an integer together with a lock that is initially in the locked state. If the tree is not empty, the future call forks a call to traverse the left subtree. This forked call will ultimately deliver its integer result into the integer component of x and signal the lock indicating that the x is ready.

Meanwhile, the parent thread goes on to traverse the right subtree. Note that this happens recursively at every level of the tree and so it can generate threads exponentially. After the right subtree's value has been computed (y), we wait for the result x by executing touch(x), which waits for the lock to be signalled and then returns the integer value in x. This is combined into the final answer and returned.

Constructs like doall, monitors, and futures make it easier to write robust parallel programs because they hide the details of lock manipulation completely; one is less likely to forget inadvertently to protect shared data access with suitable lock operations.

Recently, two languages, Jade [28] and Cilk [l2], have gone further. Both rely on programmer annotations to indicate what can be executed in parallel, but both provide a guarantee that parallel execution is faithful to the semantics of sequential execution (where the parallelism annotations are ignored). This is a powerful guarantee because it allows one to debug the program in sequential mode using standard tools.

In Jade, the programmer specifies that a section of code can be executed in parallel (analagous to forking it), but must also declare what data is accessed by the section and how (read or write). This declaration is executable, so that it can handle dynamic data structures.

The run-time system checks that data accesses by the parallel code indeed conform to the declarations and introduces all the synchronization necessary to preserve the sequential semantics. Cilk provides extensions to C for forking threads and to wait for their completion. A thread-fork is basically an annotation on a procedure call.

This induces a certain precedence between the parallel activities of a program-it is a directed acyclic graph (DAG) or partial order. The scheduler in Cilk's run-time system is carefully designed so that threads access memory according to this partial order, preserving the sequential semantics.

Next in Part 4:  Explicit parallelism - message passing
To read Part 1, go to How sequential languages obscure parallelism.
To read Part 2, go to How to achieve parallel execution?

Rishiyur S. Nikhil and Arvind are the coauthors of Implicit Parallel Programming in pH, published by Elsevier/Morgan Kaufmann.

Rishiyur is Chief Technical Officer at ESL specialist Bluespec Inc. Previous to its acquisition by Broadcom he led the Bluespec technology team at Sandburst Corp. For the previous 9 years he was at Cambridge Research Laboratory (DEC/Compaq), including over a year as Acting Director. Earlier, he was an Associate Professor of Computer Science and Engineering at MIT. He holds several patents in functional programming, dataflow and multithreaded architectures, parallel processing and compiling.

Arvind is the Johnson Professor of Computer Science and Engineering at the Massachusetts Institute of Technology. His work laid the foundations for Bluespec, centering on high-level specification and description of architectures and protocols using Term Rewriting Systems (TRSs), encompassing hardware synthesis as well as verification of implementations against TRS specifications. Previously, he contributed to the development of dynamic dataflow architectures, the implicitly parallel programming languages Id and pH, and compilation of these languages on parallel machines. He was also president and founder of Sandburst Corp.

To read more on this topic on Embedded.com go to More about multicores, multithreading and multiprocessing.

References:
[1] "Microsoft: Parallel Programming Model Ten Years off," by Rick Merritt, Embedded.com, July 23, 2007
[2] "The Problem with Threads" by Edward A. Lee, IEEE Computer, 39:5, May 2006, pp 33-42
[3] "Why threads are a bad idea (for most purposes)," J.K. Ousterhout, Invited Talk, USENIX Technical Conference, January, 1996
[4] S. Ahuja, N. Carriero, and D. Gelernter. "Linda and Friends," IEEE Computer, 19(8):26-34, August 1986.
[5] Z. Ariola and Arvind. "Properties of a First-order Functional Language with Sharing." Theoretical Computer Science, 146(1-2):69-108, July 1995.
[6] J. Armstrong. "The Development of Erlang." In Proc. Intl. Conf. on Functional Programming (ICFP), Amsterdam, The Netherlands, pages 196-203, 1997.
[7] J. Armstrong, R. Virding, and M. Williams. "Concurrent Programming in Erlang." Prentice Hall, 1993. ISBN: 0-13-285792-8.
[8] K. Arnold and J. Gosling. The Java Programming Language, second edition. Addison-Wesley, 1998. ISBN 0-201-31006-6.
[9] Arvind, A. Caro, J.-W. Maessen, and S. Aditya. A Multithreaded Substrate and Compilation Model for the Implicitly Parallel Language pH. In Proc. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), August 1996.
[10] Arvind, K. P. Gostelow, and W. Plouffe. The (preliminary) Id Report. Technical Report 114, Dept. of Information and Computer Science, Univ. of California, Irvine CA, USA, 1978.
[11] Arvind and R. S. Nikhil. Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. on Computers, 39(3):300-318, March 1990.
[12] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Proc. 5th. ACM Symp. on Principles and Practice of Parallel Programming (PPOPP), Santa Barbara CA, USA, pages 207-216, July 19-21 1995.
[13] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick. Parallel Programming in Split-C. In Proc. Supercomputing 93, Portland OR, USA, November 1993.
[14] A. Geist, A. Begeulin, J. Dongarra, W. Jiang, R. Manchek, and V. S. Sundaram. PVM Parallel Virtual Machine. A User's Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge MA, USA, 1994.
[15] R. H. Halstead. Multilisp: A Language for Concurrent Symbolic Computation. ACM Trans. on Programming Languages and Systems (TOPLAS), 7(4):501-539, October 1985.
[16] High Performance Fortran Forum. High Performance Fortran Language Specification, Version 1.0. Technical Report CRPC-TR92225, Center for Research on Parallel Computation, Rice University, Houston TX, USA, May 3 1993.
[17] C. Hoare. Monitors: an Operating System Structuring Concept. Communications of the ACM, 17(10):549-557, October 1974.
[18] R. Hughes. Why Functional Programming Matters. The Computer Journal, 32(2):98-107, 1989.
[19] Threads standard POSIX 1003.1c-1995 (also ISO/IEC 9945-1:1996), 1996.
[20] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, May 1994.
[21] R. S. Nikhil. "Cid: A Parallel 'Shared-memory' C for Distributed Memory Machines."  In Proc. 7th. Ann. Wkshp. on Languages and Compilers for Parallel Computing (LCPC), Ithaca NY, USA (Springer Verlag LNCS 892), pages 376-390, August 8-10 1994.
[22] R. S. Nikhil and Arvind. Id: a Language with Implicit Parallelism. In A Comparative Study of Parallel Programming Languages: The Salishan Problems, John Feo (editor), pages 169-215. North Holland, 1992.
[23] R. S. Nikhil, K. Pingali, and Arvind. Id Nouveau. Technical Report CSG Memo 265, Computation Structures Group, MIT Lab. for Computer Science, Cambridge MA 02139, USA, July 1986.
[24] OpenMP Architecture Review Board. OpenMP Specifications, 1997 (Fortran), 1998 (C/C++)
[25] S. L. Peyton Jones. The Implementation of Functional Programming Languages. Prentice Hall, 1987.
[26] S. L. Peyton Jones (editor), J. Hughes (editor), L. Augustsson, D. Barton, B. Boutel, W. Burton, J. Fasel, K. Hammond, R. Hinze, P. Hudak, T. Johnsson, M. Jones, J. Launchbury, E. Meijer, J. Peterson, A. Reid, C. Runciman, and P. Wadler. Haskell 98: A Non-strict, Purely Functional Language, February 1999.
[27] M. C. Rinard and P. C. Diniz. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers. ACM Trans. on Programming Languages and Systems (TOPLAS), 19(6):942-991, November 1997.
[28] M. C. Rinard, D. J. Scales, and M. S. Lam. Jade: A High-Level, MachineIndependent Language for Parallel Programming. IEEE Computer, 26(6):28-38, June 1993.
[29] N. Wirth. Programming in Modula-2. Springer Verlag, Berlin, Germany, 1982.

1

Rate this article: Low High
Current rating
  • .
Embedded.com Career Center
Looking for a new job?
SEARCH JOBS

Browse all jobs

SPONSOR
RECENT JOB POSTINGS





 :