Multiprocessor scheduling and dealing with tasks, flows, buffers and data dependencies
In general, multiprocessor scheduling is NP-complete [Gar79]. That is,
if we want to minimize total execution time on an arbitrary processor,
we have no known way to find the shortest schedule in polynomial time.
Of course, many NP-complete problems have useful approximations and
heuristics. By taking advantage of what we know about the
multiprocessor structure, by limiting the combinations of process
executions that we consider, or by other simplifications, we can create
a number of simplified but useful multiprocessor scheduling problems.
For example, two-processor multiprocessors can be scheduled optimally
under some conditions.
One of the first multiprocessor algorithms was developed by Stone
[Sto77]. Although he referred to the problem as a scheduling one, it is
more accurately referred to as an allocation problem, since it selected
the CPUs on which to execute processes but only implicitly the times at
which they executed.
He solved the problem using network flow algorithms. Stone's model
considered a network of heterogeneous processors. He found an exact
solution to the two-processor scheduling problem and heuristics to
solve schedules for systems with arbitrary numbers of processors.
 |
| Figure
6-3 Models for Stone's multiprocessor scheduling algorithm. After Stone |
As shown in Figure 6-3 above,
the problem is scheduled in two parts. An intermodule connection graph
describes the time cost of communicating between two processes that are
assigned to different processors; communication between processes on
the same processor has zero cost.
The execution time table specifies the execution time of each
process on each processor; it is possible that not all processes will
be able to run on both processors.
The minimum running time balances the communication cost and the
execution cost. Stone formulates the scheduling problem as one of
finding a cutset of a modified version of the intermodule connection
graph.
Two additional nodes are added to represent the two processors. One
such node is the source of the graph (representing CPU 1) and the other
is the sink (representing CPU 2).
Edges are added from each non-sink node to the source and the sink.
The weight of an edge to the source is equal to the cost of executing
that node's module on CPU 2 (the sink); the weight of an edge to the
sink is equal to the cost of executing that node's module on CPU 1 (the
source).
The cutset divides the intermodule connection graph into two sets,
with the nodes in each set being assigned to the same processor. The
weight of a cutset is the cost of an assignment of the nodes to the two
processors as given by the cutset. To find the allocation that
minimizes the total execution time, we solve a maximum flow problem on
the graph.
Stone extended the problem to n processors by generalizing the
notion of a cutset. The generalized cutset divides the graph into n
disjoint subsets such that no proper subset of a cutset is also a
cutset.
He generalized the node to include n types of distinguished nodes
rather than just the source and sink. His heuristic for solving this
problem iteratively used several two-processor assignments to find the
n-processor assignment.
Why static tasks?
Many embedded systems statically allocate processes to processing
elements. We can efficiently find bounds on the execution time of the
processes in these multiprocessor systems. We will assume that there is
a set of processes with data dependencies between them; in general,
they can form one or more subtasks.
We will also assume that each CPU schedules processes using rate
monotonic scheduling. Although we can easily figure out the schedule if
we don't have data dependencies, the combination of data dependencies
and rate monotonic scheduling makes the problem more challenging,
although tractable.
Minimizing
buffer sizes. Bhattacharyya et al. [Bha97] developed methods to
efficiently schedule synchronous data flow graphs on multiprocessors. Figure 6-4 below shows an SDF graph
with the nodes assigned to processing elements in a multiprocessor. We
are primarily interested in the communication between PEs, since we can
schedule each SDF on a processor using other methods that produce a
sequential schedule.
 |
| Figure
6-4 Models for multiprocessor communication. |
We model the system using an interprocessor communication modeling
(IPC) graph, also shown in the figure. The IPC graph has the same nodes
as the SDF graph. The IPC graph has all the edges of the SDF graph plus
additional edges. We add edges to the IPC graph to model the sequential
schedule on each PE; these edges are shown by the dashed line in the
figure.
The edges in the allocated SDF graph that cross processor boundaries
are known in the IPC graph as IPC edges because they define
interprocessor communication. Any communication across an IPC edge must
use an interprocess communication mechanism to cross the boundary
between the processors.
We can determine whether communication across each IPC edge is
bounded; edges not in a strongly connected component (SCC) are not
bounded. When implementing interprocess communication on an unbounded
edge, we can use a protocol that ensures that the number of tokens
crossing the edge does not exceed a predetermined buffer size. We can
implement interprocess communication on bounded edges by using a
simpler protocol.
The IPC graph may have some redundant edges. An edge e is redundant
if there is another path from source(e) to sink(e) that has a longer
delay than the delay along e. The redundant edges do not have to be
removed in any particular order to ensure that we remove the maximum
number of redundant edges. The asymptotic iteration period T for a
strongly connected IPC graph G is

where C is a cycle through the graph, t(v) is the execution time of
a node v, and delay(C) is the sum of the delays around the path C. T is
also known as the cycle mean. The maximum cycle mean of an IPC graph, lamda max, is the largest cycle
mean for any SCC in the graph. A cycle whose cycle mean is equal to the
maximum is known as a critical cycle.
We can construct a strongly connected synchronization graph by
adding edges between strongly connected components. We add edges that
chain together source SCCs, edges that chain together sink SCCs, and an
edge that connects the overall sink of the graph to the source.
(A strongly connected component is
a source SCC if any edge whose sink is in the strongly connected
component also has its source in the strongly connected component. A
sink SCC is such that any edge whose source is in the SCC also has its
sink in the SCC.)
We need to add delays to the edges, corresponding to buffer memory,
that ensure the system will not deadlock and that we can minimize the
sum of the buffer bounds over all the IPC edges. We can use the added
edges to help us determine these delays - the added edges can be
divided into disjoint sets that help organize the graph.
Delay can be added optimally if the graph has one source SCC and one
sink SCC, and it is heuristic if the graph's structure is more complex.
We can determine the minimum delay on each edge that ensures that the
graph's cycle mean is not exceeded.
 |
| Figure
6-5 RATAN process model. |
We can construct a strongly connected synchronization graph by
adding edges between strongly connected components. We add edges that
chain together source SCCs, edges that chain together sink SCCs, and an
edge that connects the overall sink of the graph to the source.
(A strongly connected component
is a source SCC if any edge whose sink is in the strongly connected
component also has its source in the strongly connected component. A
sink SCC is such that any edge whose source is in the SCC also has its
sink in the SCC.)
We need to add delays to the edges, corresponding to buffer memory,
that ensure the system will not deadlock and that we can minimize the
sum of the buffer bounds over all the IPC edges. We can use the added
edges to help us determine these delays - the added edges can be
divided into disjoint sets that help organize the graph.
Delay can be added optimally if the graph has one source SCC and one
sink SCC, and it is heuristic if the graph's structure is more complex.
We can determine the minimum delay on each edge that ensures that the
graph's cycle mean is not exceeded.
Mathur et al. [Mat98] developed the RATAN tool for analyzing the
rates of multiple-tasking systems. As shown in Figure 6-5 above, a process consists
of a single threaded CDFG-style model. Data dependencies may extend
from a node within one process to a node within another process. When
we look only at the processes themselves, and the edges between them, a
control edge is labeled with [min,max] delays measured from the
activation signal for the process to the start of execution.
Those bounds are specifications on the allowable delay; our goal is
to find execution rates for the processes that satisfy these bounds. A
process starts to execute after all of its enable signals have become
ready. If we denote the delay of an edge i -> j in the graph as dij,
then the delay of a cycle C in the process graph is given by
The mean delay of the cycles is given by

where is the number of edges in C. The maximum mean cycle delay is
known as lambda. In a
strongly connected graph, all nodes execute at the same rate, namely lambda.
We call [rl(X),
ru(X)] the lower and upper bounds on
the rate of a subgraph X. If we have two maximal SCC of the graph, P
and C, and the graph has edges from P to C, then P is a producer and C
is a consumer; therefore the actual rate interval for the consumer C is

Data dependencies and scheduling
The problems created by data dependencies are illustrated in Figure 6-6, below. Here, two
subtasks are divided among three processors. Take, for example,
processing element M1.
 |
| Figure
6-6 Preemption and scheduling. |
This CPU runs two processes that will clearly affect each other's
schedules. But the completion times of the processes on M1
also depends on the behavior of the processes on all the other PEs in
the system. Data dependencies link P1 and P2,
which adds M2 to the set of interrelated PEs. The data
dependency between P3 and P4 also adds M3
to the system.
Getting a process to run faster doesn't always help. Consider Figure 6-7 below; in this example,
changing the computation time of process Px changes the
response time of P3 even though they run on different
processors.
 |
| Figure
6-7 Period shifting. From Yen and Wolf [Yen98] © 1998 IEEE. |
The data dependencies cause the shortened computation time of Px,
resulting in process P2 running sooner and preempting P3.
Ramamritham [Ram90b] proposed scheduling over an unrolled schedule.
He noted that the maximum interval that must be considered for a set of
processes is the least common multiple (LCM) of their periods; in a
longer schedule, the order of processes must necessarily repeat itself.
He then constructed a schedule using the LCM interval.
To develop bounds on the CPUs in the system, we can make use of a
theorem from Lehoczy et al. [Leh89]. They bounded the response times
for a set of independent (no data dependencies) processes running on a
single CPU.
The processes are {P1, P2, ...}, with P1
being the highest-priority process. The minimum period of the ith
process is pi and its worst-case execution time is ci. The worst-case
response time of Pi is wi, which can be computed
as the smallest non-negative root of

The ci term gives the computation time for the process. The x/pj
terms give the fraction of the period of each process that delays the
execution of the ith process. We cannot solve this equation directly
but can use numerical methods to solve it.
We can use the worst-case response times of the various processes to
help us solve the system schedule. But this formula is not sufficient
because it does not take into account the data dependencies either
between the CPUs or within a CPU. To handle those data dependencies, we
need to use a more complex graph algorithm.
Static
scheduling algorithm. Yen and Wolf [Yen98] developed an
algorithm that handles multiple CPUs and data dependencies. This
algorithm models the processes to be run as a task graph that can have
one or more subtasks. Each process in the task graph is given bounds
for its computation time [ci lower, ci upper].
This is more general than a single, fixed computation time, but it
does assume that we can strictly bound the computation time of a
process. Each subtask has a period that is also modeled as an interval
[pi lower, pi upper]. The
architecture of the platform is modeled as a processor graph. The
allocation of each process in the task graph to a processing element is
given in Figure 6-8, below.
 |
| Figure
6-8 An example of reduced computation time leading to longer response
times. From Yen and Wolf [Yen98]. |
The algorithm finds bounds on the start and the finish times of the
processes. Given a process Pi, the start time is bounded by earliest[Pi,
request] and latest[Pi, request]; the end time is bounded by
earliest[Pi, finish] and latest[Pi, finish].
The following code summarizes the delay estimation algorithm. The
maxsep[,] data structure holds the earliest[] and latest[] bounds for
each process; it starts out with infinite (unbounded) times for the
processes. This performance analysis algorithm iteratively tightens
these bounds until they stop changing (or until a predetermined
iteration limit is reached).
maxsep.
lower = maxsep.upper = inf inity;
step = 0; /* keep track of number of iterations */
do {
/* use longest path algorithm to find the request
and finish times */
foreach Pi { Earl iestTimes(Gi); LatestTimes(Gi);
/* handle max constraints */
foreach Pi { MaxSeparations (Gi);
step++;
} while (maxsep has changed and step < limit);
At each iteration, the algorithm performs two types of operations.
The EarliestTimes()/LatestTimes() procedures analyzes data dependencies
using a modified longest-path algorithm.
The MaxSeparations() procedure uses a modified max-constraint
algorithm to look for combinations of process executions that cannot
occur. Each of these procedures is applied separately to each subtask Pi
in the task graph, but the procedures look at the execution times
of all the other subtasks while updating Pi.
 |
| Figure
6-9 An example of phase adjustment and separation analysis. From Yen
and Wolf [Yen98] © 1998 IEEE |
To understand why we need these two steps, consider the example of Figure 6-9, above. A naive analysis
might lead one to conclude that the total time required to finish both
tasks is 80. But two conditions cause it to be shorter. First, P1 cannot
preempt both P2 and P3 in a single execution of
the task. Second, P2 cannot preempt P3 since P2
must finish before P3 can start. Therefore, the worst-case
delay to execute both tasks is only 45.
Phase constraints
We will use phase constraints to summarize the relationships between a
process that is preempted and the process that preempts one. The
algorithm uses two types of phases.
1) The request
phase, which we will call
phase[i,j,r] in the code, describes the smallest interval
between the execution of Pi on one iteration and the
succeeding iteration of Pj.
2) The finishing
phase, which we will call phase[i,j,f]
in the code, describes the smallest interval between the finishing time
of one iteration of Pi and the first request time of the
next iteration of Pj.
 |
| Figure
6-10 Request and finishing phases. |
These definitions are illustrated in Figure 6-10 above. The gray boxes at the
ends of the boxes represent the min/max timing bounds.
Data dependencies are one form of constraint on the relative
execution times of processes. LatestTimes() and EarliestTimes() are
modified longest-path algorithms that determine the timing and slack of
the processes based on their data dependencies. Pseudocode for
LatestTimes() is shown in the following block of code from [Yen98]; it
is similar in form to the EarliestTimes() procedure.
LatestTimes(G)
{
/* initial ize */
foreach (process Pi) {
latest[Pi,request] = 0;
foreach (process Pj) phase[i, j
,r] = 0;
}
foreach (process Pi in topological order) {
wi = worst-case response time of
Pi with phase adjustment phase[i,j,r];
foreach (process Pj such that
priority(Pj) > priority(Pi)) {
latest{Pi,finish] = latest[Pi,request]+wi
calculate
phase[i,j,f] relative to latest[Pi,finish] for each j;
foreach
(immediate successor Pk of Pi) {
delta = latest[Pk,request] 2 latest[Pi,finish];
if (latest[Pk,request] < latest[Pi,finish])
latest[Pk,request] =
latest[Pi,finish]
update phase[k,j ,r] for each process Pj according to
phase[i,j,f] and delta;;
}
}
}
}
The algorithm walks through the graphs in the order in which they
appear in the task graph. The worst-case response time wi is computed
using EQ 6-6, with the modification that the term inside the summation
takes into account the request phase:
After we have computed wi, we then compute the phases
relative to latest [Pi, finish]. There are two cases to
consider, Pj preempts Pi and Pj does
not preempt Pi. Updating the request phases requires looking
ahead one iteration, examining the successor Pks of Pi.
If , then there is slack between the latest finish time of Pi
and the latest request time of Pk. If , then the request
phase must be updated, as shown below:
Page 353 EQ
These relationships are illustrated in Figure 6-11, below.
 |
| Figure
6-11. Relationships related to phases. |
The MaxSeparations() procedure uses combinations of processes that
cannot interfere to tighten the bounds on execution time. It uses the
results of the phase analysis performed in LatestTimes() and
EarliestTimes() to check the separations of phases of processes, taking
preemption into account.
Max constraints model the relationship between a process and its
predators: the initiation time of a process is the max of the finish
times of its predecessors.
Max constraints are harder to solve than the linear constraints
imposed by data dependencies. We can use a modified version of an
algorithm developed by McMillan and Dill [McM92] to solve these
constraints.
To read Part 1, go to The
role of the operating system
Next in Part 3: Event Driven
Scheduling Analysis
Used
with the permission of the publisher, Newnes/Elsevier, this series of
five articles is based on copyrighted material from "High-Performance
Embedded Computing," by Wayne Wolf. The book can be purchased on
line.
Wayne Wolf is professor of
electrical engineering at Princeton University. Prior to joining
Princeton he was with AT&T Bell Laboratories. He has served as
editor in chief of the ACM Transactions
on Embedded Computing and of Design
Automation for Embedded Systems.
References:
[Gar79] M.R. Garey and D.S.
Johnson, Computers and
Intractability: A Guide to the Theory of NP-Completeness," W. H.
Freeman, 1979.
[Sto77] Harold S. Stone, "Multiprocessor scheduling with the aid
of
network flow diagrams," IEEE Transactions of Software Engineering, SE-
3(1) January, 1977
[Bha97] Shuvra S. Bhattacharyya, "Sundararajan Sriram and Edward
Lee, "Optimizing
synchronization in multiprocessor DSP systems," IEEE
Transactions on Signal Processing, 45(6), June, 1997
[Mat98] Anmol Mathur, Ali Dasdan and Rajesh K. Gupta, "Rate
analysis
for embedded systems," ACM Transactions on Design Automation of
electronic systems, 3(3) July, 1998.
[Ram90b] Krithi Ramamritham, "Allocation
and scheduling of complex
periodic tasks," Proceedings, 10th International Conference on
Distributed Computing Systems, IEEE, 1990
[Leh89] J. Lehoczy, L. Sha, and Y. Ding, "The
rate monotonic scheduling
algorithm:exact characterization and average case behavior,"
Proceedings, IEEE Real Time Systems Symposium, IEEE, 1989.
[Yen98] Ti-yen Yen and Wayne Wolf, "Performance analysis of distributed
embedded systems," IEEE Transactions on Parallel and Distributed
Systems, 9(11), November, 1998.
[McM92] K.McMillan and D. Dill, "Algorithms
for interface timing
verification," Proceedings, IEEE International Conference on
Computer
Design, 1992