Parallelism joins concurrency for multicore embedded computing
For many of us, the terms parallelism and concurrency are near synonyms, and even if we feel they represent somewhat different concepts, we don't all agree on what makes one construct parallel and the other concurrent. But as we enter the era of multicore, manycore, and GPU-based computing, it is useful to make a distinction between these two, and see how those distinctions affect the embedded programmer's world.
Concurrent programming has a long history in embedded systems, and in that context it represents the approach of using separate threads of control to manage the multiple activities taking place. Often these threads of control have different relative priorities, with interrupt handlers preempting non-interrupt code, and code requiring a higher invocation frequency or an earlier deadline preempting code with lower frequency or later deadlines. For threads of control managing activities with equivalent priorities, often a round-robin scheduling approach is used to ensure no activity gets ignored for too long. None of this requires multiple physical processors, and it can be seen overall as a way to share limited resources in an appropriate way when many activities need to be managed. It can also be seen as a way to simplify the logic of the programming, by allowing any given thread of control to be restricted to managing a single activity. Because introducing concurrent threads of control can simplify the logic of the overall program, it is all right if the language constructs to support this approach are themselves of somewhat heavier weight.
In contrast with concurrent programming, parallel programming is often more about solving a computationally-intensive problem by splitting it up into pieces, using an overall divide-and-conquer strategy, to take better advantage of multiple processing resources to solve a single problem. Rather than simplifying the logic, introducing parallel programming can often make the logic more complex, and therefore it is important that the parallel programming constructs be very light weight both syntactically and at run-time, or else the complexity and run-time overhead may outweigh the potential speedup.
Scheduling the threads
For both concurrent and parallel programming, the actual number of physical processors may vary from one run of the program to the next, so it is typical to provide a level of abstraction on top of the physical processors, typically provided by a scheduler. For concurrent programming, preemption is often used by the scheduler, where one thread is interrupted in the middle of its execution to allow a higher priority thread to take over the shared processing resource. For parallel programming, more often the goal is simply to get all of the work done in the shortest possible time, so preemption is not used as often, while overall throughput becomes the critical criterion of success.
For concurrent programming used to manage external activities of varying criticalities, a real-time scheduler often relies on priorities assigned using some sort of rate-monotonic or deadline-monotonic analysis, to ensure that all threads will meet their deadlines . In contrast, for parallel programming, an approach called work stealing (Figure 1) has emerged as a robust approach to balancing the load across multiple physical processors, while providing good locality of reference for a given processor, and good separation of access between processors.
Work stealing  is based on some relatively simple concepts, but often requires very careful coding of the underlying scheduler to achieve the required low overhead. The basic idea is that computations are broken up by the compiler into very light-weight threads (which we will call picothreads), and in the scheduler each physical processor has a server with its own double-ended queue (often called a deque) of picothreads (Figure 1).
A typical picothread might represent the execution of a single iteration of a loop, or the evaluation of a single subexpression. Picothreads are automatically placed on the tail of the deque of the server that spawned them, and when a server finishes working on a given picothread, it removes the picothread it most recently added to its own deque (from the tail), and starts running that one. This last-in-first-out (LIFO) discipline is using the deque effectively as a stack of work to do.
At some point a server’s deque is empty, it having finished the overall computation it was performing. At that point the server steals a picothread from one of the other servers. But in this case it removes the oldest picothread at the head of some other server’s deque. That is, for stealing, a first-in-first-out (FIFO) discipline is used. What this accomplishes is that when serving its own deque, a server is picking up a picothread that will likely be manipulating data that the associated processor was recently using, while when stealing from another deque, it will be picking up a picothread that has been languishing on its deque, and likely will be manipulating data that is not sitting in any processor’s cache, and is likely not in close physical proximity to the data being manipulated by any other processor.
Individual picothreads are small enough that there is generally no need to preempt them before they complete, which means they don’t need a complete stack of their own – they can share a stack provided by the server. To terminate an overall parallel computation prematurely it is often adequate to prevent new picothreads from starting, without trying to interrupt a given picothread in the middle of its execution. In fact, it is sharing a stack and using a run-to-completion approach that makes these picothreads so light-weight relative to the typical preemptible heavier-weight threads used in concurrent programming.
There are many subtleties to work stealing, and this remains an active area of research, in terms of deciding which server to steal from, deciding how many picothreads to steal at once, choosing an efficient double-ended queue to support exclusive access from one end and shared access from the other, how to deal with any synchronization between the picothreads, etc. Nevertheless, the basic approach of work stealing has been adopted very widely, including for various parallel languages (Cilk+, Go, Rust, ParaSail, etc.) and various libraries (Java fork/join library, OpenMP, Intel’s Threaded Building Blocks, among others).
So how does this all pertain to the mobile and embedded programming world? The fact is that multicore hardware has arrived in the mobile and embedded worlds as well, in part because a multicore architecture can often provide the best performance per watt, and power is almost always a major concern in these resource-constrained environments. But most of these environments will still have hard or soft real-time requirements, so work stealing by itself will not provide that. What is needed is a careful integration of more traditional concurrent programming approaches with some aspects of work-stealing-based scheduling.