Parallelism joins concurrency for multicore embedded computing

For many of us, the terms parallelism and concurrency are near synonyms, and even if we feel they represent somewhat different concepts, we don't all agree on what makes one construct parallel and the other concurrent. But as we enter the era of multicore, manycore, and GPU-based computing, it is useful to make a distinction between the two, and to see how that distinction affects the embedded programmer's world.

Concurrent programming has a long history in embedded systems, and in that context it represents the approach of using separate threads of control to manage the multiple activities taking place. Often these threads of control have different relative priorities, with interrupt handlers preempting non-interrupt code, and code requiring a higher invocation frequency or an earlier deadline preempting code with lower frequency or later deadlines. For threads of control managing activities with equivalent priorities, often a round-robin scheduling approach is used to ensure no activity gets ignored for too long. None of this requires multiple physical processors, and it can be seen overall as a way to share limited resources in an appropriate way when many activities need to be managed. It can also be seen as a way to simplify the logic of the programming, by allowing any given thread of control to be restricted to managing a single activity. Because introducing concurrent threads of control can simplify the logic of the overall program, it is all right if the language constructs to support this approach are themselves of somewhat heavier weight.
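The article describes this conceptually; as a purely illustrative sketch (not from the original article), here is a minimal Go program in which each goroutine manages a single activity, much as the separate threads of control described above each manage one activity. The activity names and periods are invented, and a real embedded system would also assign the activities different priorities, which plain goroutines do not model.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Each goroutine manages exactly one activity, keeping that activity's
	// logic simple; the runtime scheduler shares the processor(s) among them.
	sensor := time.NewTicker(100 * time.Millisecond) // higher-frequency activity
	logger := time.NewTicker(300 * time.Millisecond) // lower-frequency activity
	go func() {
		for range sensor.C {
			fmt.Println("sample sensor")
		}
	}()
	go func() {
		for range logger.C {
			fmt.Println("flush log to storage")
		}
	}()
	time.Sleep(time.Second) // let both activities run for a while, then exit
}
```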

In contrast with concurrent programming, parallel programming is often more about solving a computationally-intensive problem by splitting it up into pieces, using an overall divide-and-conquer strategy, to take better advantage of multiple processing resources to solve a single problem. Rather than simplifying the logic, introducing parallel programming can often make the logic more complex, and therefore it is important that the parallel programming constructs be very lightweight, both syntactically and at run time, or else the complexity and run-time overhead may outweigh the potential speedup.
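To make the divide-and-conquer idea concrete, here is a small Go sketch of a parallel sum; the function name `parallelSum` and the cutoff of 1024 elements are choices invented for this example, not anything prescribed by the article.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelSum splits the slice in half and sums the halves concurrently,
// recursing until the pieces are small enough to sum sequentially.
func parallelSum(xs []int) int {
	const cutoff = 1024 // below this size, spawning overhead outweighs the gain
	if len(xs) <= cutoff {
		total := 0
		for _, x := range xs {
			total += x
		}
		return total
	}
	mid := len(xs) / 2
	var left int
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		left = parallelSum(xs[:mid]) // one half in a new goroutine
	}()
	right := parallelSum(xs[mid:]) // the other half in the current goroutine
	wg.Wait()
	return left + right
}

func main() {
	data := make([]int, 1_000_000)
	for i := range data {
		data[i] = i
	}
	fmt.Println(parallelSum(data))
}
```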

Scheduling the threads
For both concurrent and parallel programming, the actual number of physical processors may vary from one run of the program to the next, so it is typical to provide a level of abstraction on top of the physical processors, typically provided by a scheduler. For concurrent programming, preemption is often used by the scheduler, where one thread is interrupted in the middle of its execution to allow a higher priority thread to take over the shared processing resource. For parallel programming, more often the goal is simply to get all of the work done in the shortest possible time, so preemption is not used as often, while overall throughput becomes the critical criterion of success.

For concurrent programming used to manage external activities of varying criticalities, a real-time scheduler often relies on priorities assigned using some sort of rate-monotonic or deadline-monotonic analysis, to ensure that all threads will meet their deadlines [3]. In contrast, for parallel programming, an approach called work stealing (Figure 1) has emerged as a robust approach to balancing the load across multiple physical processors, while providing good locality of reference for a given processor, and good separation of access between processors.

Figure 1: Work stealing approach using double-ended queues

Work stealing [2] is based on some relatively simple concepts, but often requires very careful coding of the underlying scheduler to achieve the required low overhead. The basic idea is that computations are broken up by the compiler into very light-weight threads (which we will call picothreads), and in the scheduler each physical processor has a server with its own double-ended queue (often called a deque) of picothreads (Figure 1).

A typical picothread might represent the execution of a single iteration of a loop, or the evaluation of a single subexpression. Picothreads are automatically placed on the tail of the deque of the server that spawned them, and when a server finishes working on a given picothread, it removes the picothread it most recently added to its own deque (from the tail), and starts running that one. This last-in-first-out (LIFO) discipline effectively uses the deque as a stack of work to do.

At some point a server's deque becomes empty, meaning the server has finished the work it spawned locally. At that point the server steals a picothread from one of the other servers, but in this case it removes the oldest picothread, from the head of the other server's deque. That is, for stealing, a first-in-first-out (FIFO) discipline is used. The effect is that when serving its own deque, a server picks up a picothread that is likely manipulating data the associated processor was recently using, while when stealing from another deque, it picks up a picothread that has been languishing there, whose data is likely not sitting in any processor's cache, and is likely not in close physical proximity to the data being manipulated by any other processor.
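The following Go sketch shows the two disciplines side by side: the owning server pushes and pops at the tail, while thieves remove from the head. Production work-stealing deques use carefully designed lock-free algorithms; the mutex, the `Deque` type, and the method names here are simplifications invented purely for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Picothread is a small unit of work that runs to completion.
type Picothread func()

// Deque is a double-ended queue of picothreads. The owning server pushes
// and pops at the tail (LIFO); thieves steal from the head (FIFO).
type Deque struct {
	mu    sync.Mutex
	items []Picothread
}

func (d *Deque) PushTail(p Picothread) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.items = append(d.items, p)
}

// PopTail is used by the owning server: newest work first, which tends
// to touch data that is still warm in this processor's cache.
func (d *Deque) PopTail() (Picothread, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.items) == 0 {
		return nil, false
	}
	p := d.items[len(d.items)-1]
	d.items = d.items[:len(d.items)-1]
	return p, true
}

// StealHead is used by other servers: oldest work first, which is least
// likely to be hot in any processor's cache.
func (d *Deque) StealHead() (Picothread, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.items) == 0 {
		return nil, false
	}
	p := d.items[0]
	d.items = d.items[1:]
	return p, true
}

func main() {
	var d Deque
	for i := 0; i < 3; i++ {
		i := i
		d.PushTail(func() { fmt.Println("picothread", i) })
	}
	if p, ok := d.StealHead(); ok {
		p() // a thief gets picothread 0, the oldest
	}
	for {
		p, ok := d.PopTail()
		if !ok {
			break
		}
		p() // the owner runs 2 then 1, newest first
	}
}
```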

Individual picothreads are small enough that there is generally no need to preempt them before they complete, which means they don’t need a complete stack of their own – they can share a stack provided by the server. To terminate an overall parallel computation prematurely it is often adequate to prevent new picothreads from starting, without trying to interrupt a given picothread in the middle of its execution. In fact, it is sharing a stack and using a run-to-completion approach that makes these picothreads so light-weight relative to the typical preemptible heavier-weight threads used in concurrent programming.
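The run-to-completion model also makes early termination simple, as noted above: stop handing out new picothreads rather than interrupting a running one. A hypothetical Go sketch of that idea, using a shared atomic flag (the names `cancelled` and `serverLoop` are invented for the example):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cancelled is set when the overall parallel computation should stop early.
// No running picothread is interrupted; servers simply stop dequeuing.
var cancelled atomic.Bool

func serverLoop(work []func()) {
	for _, pico := range work {
		if cancelled.Load() {
			return // don't start any new picothreads
		}
		pico() // run to completion on the server's own stack
	}
}

func main() {
	work := []func(){
		func() { fmt.Println("step 1") },
		func() { cancelled.Store(true); fmt.Println("step 2: requesting early termination") },
		func() { fmt.Println("step 3 (never starts)") },
	}
	serverLoop(work)
}
```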

There are many subtleties to work stealing, and this remains an active area of research: deciding which server to steal from, deciding how many picothreads to steal at once, choosing an efficient double-ended queue that supports exclusive access from one end and shared access from the other, dealing with any synchronization between the picothreads, and so on. Nevertheless, the basic approach of work stealing has been adopted very widely, including by various parallel languages (Cilk+, Go, Rust, ParaSail, etc.) and various libraries (the Java fork/join library, OpenMP, Intel’s Threading Building Blocks, among others).

So how does this all pertain to the mobile and embedded programming world? The fact is that multicore hardware has arrived in the mobile and embedded worlds as well, in part because a multicore architecture can often provide the best performance per watt, and power is almost always a major concern in these resource-constrained environments. But most of these environments still have hard or soft real-time requirements, which work stealing by itself does not address. What is needed is a careful integration of more traditional concurrent programming approaches with some aspects of work-stealing-based scheduling.

Combining real-time with work-stealing is a new research area, and there are only a few academic papers focused on this issue so far. Nevertheless, it is clearly becoming more important. Standards such as ARINC 653, which defines a strongly-partitioned architecture for systems of mixed criticality, are being updated to accommodate multicore hardware. Because of concerns that processors sharing a single chip are not as independent as required for strong partitioning, one approach being adopted involves assigning multiple cores to a single partition for its time slice, and then reassigning them all to a different partition when the time slice is done. This would allow the use of a hybrid work-stealing approach, where each partition has its own set of server processes, each with its own double-ended queue (Figure 2). When a partition’s time slice ends, all of the server processes associated with that partition are suspended, and the server processes for the next partition to execute are resumed. So here we have preemption happening to server processes, while the individual picothreads can still use a run-to-completion model by treating the server process like a kind of virtual processor.

Figure 2: Combining real-time and work stealing with a strongly partitioned architecture (Partitions 1, 2, and 3, each with its own set of servers and deques)
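As a rough illustration of that partition-by-partition scheme, here is a hypothetical Go sketch in which a supervisory loop grants each partition's server a time slice in turn. It is heavily simplified: each partition has a single server and a single work queue rather than per-server deques, and the gating is cooperative (between picothreads) rather than the true preemption of server processes that an ARINC 653 system would perform.

```go
package main

import (
	"fmt"
	"time"
)

// Partition owns its own server goroutine and its own queue of picothreads.
type Partition struct {
	name string
	work chan func()
	gate chan struct{} // a token: present only while this partition's slice is active
}

func newPartition(name string) *Partition {
	p := &Partition{name: name, work: make(chan func(), 16), gate: make(chan struct{}, 1)}
	go p.server()
	return p
}

// server runs picothreads, but only while it holds the partition's gate
// token; it pauses between picothreads when the token has been withdrawn.
func (p *Partition) server() {
	for pico := range p.work {
		<-p.gate // wait until our partition's time slice is active
		pico()
		p.gate <- struct{}{} // hand the token back between picothreads
	}
}

func main() {
	parts := []*Partition{newPartition("P1"), newPartition("P2"), newPartition("P3")}
	for _, p := range parts {
		p := p
		for i := 0; i < 3; i++ {
			i := i
			p.work <- func() { fmt.Println(p.name, "picothread", i) }
		}
	}
	// The supervisory loop grants each partition a time slice in turn.
	for cycle := 0; cycle < 3; cycle++ {
		for _, p := range parts {
			p.gate <- struct{}{}              // resume this partition's server
			time.Sleep(10 * time.Millisecond) // its time slice
			<-p.gate                          // suspend it again
		}
	}
}
```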

In a similar fashion prioritized scheduling can be accommodated, by creating separate server processes for each real-time priority. Each has its own dedicated stack, with a lower-priority server process running on a core only when all higher-priority server processes on the core have nothing to do. An alternative, when the real-time requirements are softer, is to use only one server process per core, but with separate deques for different priorities. With this approach, preemption of a running picothread does not occur. However, when the server process chooses a new picothread to execute, priorities would be obeyed: the server would select first from its own highest priority non-empty deque, but steal from another server if the latter had a non-empty deque of higher priority than any of the server’s own non-empty deques.
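The selection rule in that last sentence can be captured in a few lines. The sketch below is an invented, single-threaded illustration (no locking, and the `Server` and `next` names are made up): each server keeps one deque per priority level, pops its own newest work from the tail, and steals the oldest work from a peer only when that peer has something of strictly higher priority than anything the server itself holds.

```go
package main

import "fmt"

// Picothread is a unit of work. A Server holds one deque per priority
// level, with index 0 as the highest priority.
type Picothread func()

type Server struct {
	name   string
	deques [][]Picothread // deques[p] holds picothreads of priority p
}

// highestNonEmpty returns the index of the server's highest-priority
// non-empty deque, or len(deques) if all of its deques are empty.
func (s *Server) highestNonEmpty() int {
	for p, d := range s.deques {
		if len(d) > 0 {
			return p
		}
	}
	return len(s.deques)
}

// next steals from another server (head, FIFO) only if that server has
// pending work of strictly higher priority; otherwise it pops the caller's
// own newest work (tail, LIFO) from its highest-priority non-empty deque.
func next(self *Server, others []*Server) (Picothread, bool) {
	own := self.highestNonEmpty()
	for _, o := range others {
		if p := o.highestNonEmpty(); p < own {
			d := o.deques[p]
			pico := d[0] // steal the oldest
			o.deques[p] = d[1:]
			return pico, true
		}
	}
	if own < len(self.deques) {
		d := self.deques[own]
		pico := d[len(d)-1] // run our own newest
		self.deques[own] = d[:len(d)-1]
		return pico, true
	}
	return nil, false
}

func main() {
	a := &Server{name: "A", deques: make([][]Picothread, 3)}
	b := &Server{name: "B", deques: make([][]Picothread, 3)}
	a.deques[2] = append(a.deques[2], func() { fmt.Println("A: low-priority work") })
	b.deques[0] = append(b.deques[0], func() { fmt.Println("B: high-priority work, stolen by A") })
	if pico, ok := next(a, []*Server{b}); ok {
		pico() // runs B's high-priority picothread first
	}
}
```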

Programming language constructs for concurrency and parallelism
Many programming languages incorporate some notion of concurrent threads, mutual-exclusion locks, and synchronizing signals and waiting. Per Brinch Hansen’s Concurrent Pascal [1] was one of the first languages to incorporate many of these concepts into the language itself. Ada and Java also included concurrent programming concepts from their inception, and many other languages now include these concepts as well. Generally the execution of a concurrent thread corresponds to the asynchronous execution of something approximating a named function or procedure. In Ada, this is the task body for the associated task. In Java, it is the run method of the associated Runnable object. Locks are often associated with some sort of synchronizing object (often called a monitor), where some or all operations on the object automatically acquire a lock on starting the operation and automatically release the lock on completion, thereby ensuring that locks and unlocks are always balanced. In Ada, these are called protected objects and operations, while in Java they are the synchronized methods of a class.
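Go, the language used for the illustrative sketches in this rewrite, has no built-in monitor construct, but the same balanced lock/unlock discipline can be imitated by making every method of a type acquire a mutex on entry and release it on return. The `Counter` type below is a hypothetical example of that pattern, playing the role of an Ada protected object or a Java class with synchronized methods.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter imitates a monitor: every operation acquires the lock on entry
// and releases it on return, so locks and unlocks are always balanced.
type Counter struct {
	mu    sync.Mutex
	value int
}

func (c *Counter) Increment() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.value++
}

func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.value
}

func main() {
	var c Counter
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); c.Increment() }()
	}
	wg.Wait()
	fmt.Println(c.Value()) // always 100, thanks to the mutual exclusion
}
```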

Signaling and waiting are used to handle cases where concurrent threads need to communicate or otherwise cooperate, and one thread must wait for one or more other threads to take some action, or some external event to occur, before it can proceed further. Signaling and waiting is also often mediated by a synchronizing object, with a thread awaiting some change in the state of the object, and a signal being used to indicate that the state has changed and some number of waiting threads should recheck to see whether the synchronizing object is now in the desired state. Conditional critical regions suggested by Hoare and Brinch Hansen represented one of the first language constructs providing this kind of waiting and signaling implicitly based on a Boolean expression. More commonly this is provided by explicit Wait and Signal operations on an object or a condition queue (in Java signaling uses notify or notifyAll). Ada combines the notions of conditional critical regions and monitors by incorporating entries with entry barriers into the protected object construct, eliminating the need for explicit Signal and Wait calls. All of these notions represent what we mean by concurrent programming constructs.
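A minimal sketch of explicit wait/signal, again in Go, using a condition variable; the `Mailbox` type is invented for the example. Note the loop around Wait, which is the "recheck the state after being signaled" step described above; an Ada entry barrier would express the `item != nil` condition declaratively instead.

```go
package main

import (
	"fmt"
	"sync"
)

// Mailbox lets one goroutine wait until another has deposited an item,
// using explicit Wait/Signal on a condition variable.
type Mailbox struct {
	mu   sync.Mutex
	cond *sync.Cond
	item *string
}

func NewMailbox() *Mailbox {
	m := &Mailbox{}
	m.cond = sync.NewCond(&m.mu)
	return m
}

func (m *Mailbox) Put(s string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.item = &s
	m.cond.Signal() // wake a waiter so it rechecks the state
}

func (m *Mailbox) Take() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	for m.item == nil { // always recheck after waking: the state may still not be what we need
		m.cond.Wait()
	}
	s := *m.item
	m.item = nil
	return s
}

func main() {
	m := NewMailbox()
	done := make(chan struct{})
	go func() { fmt.Println(m.Take()); close(done) }()
	m.Put("hello")
	<-done
}
```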

By contrast, a smaller number of languages thus far incorporate what could be considered parallel programming constructs, though that is changing rapidly. As with concurrent programming, parallel programming can be supported by explicit language extensions, standard libraries, or some mixture of these two. A third option with parallel programming is the use of program annotations, such as pragmas, providing direction to the compiler to allow it to automatically parallelize an originally sequential algorithm.

One characteristic that distinguishes parallel programming is that the unit of parallel computation can often be less than the execution of an entire function or procedure, but instead might represent one or more iterations of a loop, or the evaluation of one part of a larger expression. Furthermore, the compiler and the underlying run-time system are more involved in determining what portions of the code can actually run in parallel. This is quite different from traditional concurrent programming constructs, which rely on explicit programmer decisions to determine where the thread boundaries lie.

One of the first widely used languages with general purpose parallel programming constructs was Cilk, designed by Charles Leiserson [2] at MIT, and now supported by Intel as part of their Intel Parallel Studio. Cilk allows the programmer to insert directives such as cilk_spawn and cilk_sync at strategic points in an algorithm, with _spawn causing the evaluation of an expression to be forked off into a separate lightweight thread, and _sync causing the program to wait for locally spawned parallel threads, so the result of their execution can be used. Furthermore, Cilk provides the ability to use cilk_for rather than simply for to indicate that the iterations of the given for-loop are candidates for parallel execution. Other languages now providing similar capabilities include OpenMP, which uses pragmas rather than language extensions to direct the insertion of parallel execution; the language Go from Google, which includes lightweight goroutines for parallel execution with channels for communication; the language Rust from Mozilla Research, which supports large numbers of lightweight tasks communicating using ownership transfer to avoid race conditions; and the language ParaSail from AdaCore, using safe automatic parallelization based on a pointer-free, alias-free approach that simplifies divide-and-conquer algorithms.
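In Go terms, the effect of cilk_for can be approximated by chunking the iteration range across goroutines and waiting on a WaitGroup, which plays the role of the implicit sync at the end of the loop. The `parallelFor` helper below is a hypothetical illustration, not part of any standard library:

```go
package main

import (
	"fmt"
	"sync"
)

// parallelFor runs body(i) for i in [0, n), splitting the range into
// chunks that execute concurrently, then waits for all of them.
func parallelFor(n, chunks int, body func(i int)) {
	var wg sync.WaitGroup
	size := (n + chunks - 1) / chunks
	for start := 0; start < n; start += size {
		end := start + size
		if end > n {
			end = n
		}
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			for i := lo; i < hi; i++ {
				body(i)
			}
		}(start, end)
	}
	wg.Wait() // the implicit "sync" at the end of the parallel loop
}

func main() {
	squares := make([]int, 16)
	parallelFor(len(squares), 4, func(i int) { squares[i] = i * i })
	fmt.Println(squares)
}
```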

All of these parallel languages or extensions have adopted some variant of work stealing for the scheduling of their lightweight threads. And all of these languages make it easier to move from a sequentially-oriented mindset to a parallel-oriented one. Embedded and mobile programmers should begin experimenting with these languages now, to be prepared as real-time prioritized capabilities are merged with work-stealing schedulers, to provide the combination of reactivity and throughput needed for the advanced embedded and mobile applications on the drawing boards for the near future.

S. Tucker Taft is VP and Director of Language Research at AdaCore. He joined AdaCore in 2011 as part of a merger with SofCheck, which he had founded in 2002 to develop advanced static analysis technology. Prior to that he was a Chief Scientist at Intermetrics, Inc. and its follow-ons for 22 years, where in 1990-1995 he led the design of Ada 95. He is recipient of an A.B. Summa Cum Laude degree from Harvard University, where he has more recently taught compiler construction and programming language design.

Further Reading:
1. P. Brinch Hansen (editor), The Origin of Concurrent Programming: From Semaphores to Remote Procedure Calls, Springer, June 2002.

2. R. D. Blumofe and C. E. Leiserson, Scheduling Multithreaded Computations by Work Stealing, Journal of the ACM, 720–748, September 1999.

3. C. Maia, L. Nogueira, L. M. Pinho, Scheduling parallel real-time tasks using a fixed-priority work-stealing algorithm on multiprocessors, 8th IEEE Symposium on Industrial Embedded Systems (SIES), June 2013.

4. S. T. Taft, Systems Programming with Go, Rust, and ParaSail
