# What’s different about multiprocessor software? (Part 2)

In general, multiprocessor scheduling is NP-complete [Gar79]. That is, if we want to minimize total execution time on an arbitrary multiprocessor, we have no known way to find the shortest schedule in polynomial time.

Of course, many NP-complete problems have useful approximations and heuristics. By taking advantage of what we know about the multiprocessor structure, by limiting the combinations of process executions that we consider, or by other simplifications, we can create a number of simplified but useful multiprocessor scheduling problems. For example, two-processor multiprocessors can be scheduled optimally under some conditions.

One of the first multiprocessor algorithms was developed by Stone [Sto77]. Although he referred to the problem as a scheduling one, it is more accurately referred to as an allocation problem, since it selected the CPUs on which to execute processes but only implicitly the times at which they executed.

He solved the problem using network flow algorithms. Stone's model considered a network of heterogeneous processors. He found an exact solution to the two-processor scheduling problem and heuristics to solve schedules for systems with arbitrary numbers of processors.

Figure 6-3 Models for Stone's multiprocessor scheduling algorithm. After Stone [Sto77].

As shown in Figure 6-3 above, the problem is specified in two parts. An intermodule connection graph describes the time cost of communicating between two processes that are assigned to different processors; communication between processes on the same processor has zero cost.

The execution time table specifies the execution time of each process on each processor; it is possible that not all processes will be able to run on both processors.

The minimum running time balances the communication cost and the execution cost. Stone formulates the scheduling problem as one of finding a cutset of a modified version of the intermodule connection graph.

Two additional nodes are added to represent the two processors. One such node is the source of the graph (representing CPU 1) and the other is the sink (representing CPU 2).

Edges are added from each non-sink node to the source and the sink. The weight of an edge to the source is equal to the cost of executing that node's module on CPU 2 (the sink); the weight of an edge to the sink is equal to the cost of executing that node's module on CPU 1 (the source).

The cutset divides the intermodule connection graph into two sets, with the nodes in each set being assigned to the same processor. The weight of a cutset is the cost of an assignment of the nodes to the two processors as given by the cutset. To find the allocation that minimizes the total execution time, we solve a maximum flow problem on the graph; by the max-flow min-cut theorem, the maximum flow identifies the minimum-weight cutset.
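To make the construction concrete, here is a small sketch in Python. The two modules, their execution costs, and the communication cost are all invented for illustration, and the max-flow routine is a minimal Edmonds-Karp implementation rather than anything from Stone's paper.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp: repeatedly augment along shortest residual paths."""
    flow = 0
    while True:
        # BFS for an augmenting path in the residual graph
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        # walk back from the sink to find the bottleneck capacity
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] = cap[v].get(u, 0) + bottleneck
        flow += bottleneck

# Hypothetical two-module example: (cost on CPU1, cost on CPU2)
exec_cost = {"A": (5, 10), "B": (4, 3)}
comm = {("A", "B"): 2}   # intermodule communication cost

# Source s stands for CPU1, sink t for CPU2. Each module gets an edge to
# the sink weighted with its CPU1 cost and an edge to the source weighted
# with its CPU2 cost; communication edges carry capacity in both directions.
cap = {"s": {}, "t": {}}
for m, (c1, c2) in exec_cost.items():
    cap[m] = {"s": c2, "t": c1}
    cap["s"][m] = c2
    cap["t"][m] = c1
for (a, b), w in comm.items():
    cap[a][b] = w
    cap[b][a] = w

min_cost = max_flow(cap, "s", "t")
print(min_cost)   # 9: both modules assigned to CPU1 (5 + 4, no communication cut)
```

For this toy instance the cheapest of the four possible assignments puts both modules on CPU1 (5 + 4 = 9, no communication crosses the cut), and the max-flow value recovers exactly that cost.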

Stone extended the problem to n processors by generalizing the notion of a cutset. The generalized cutset divides the graph into n disjoint subsets such that no proper subset of a cutset is also a cutset.

He also generalized the distinguished nodes, using n types of them rather than just the source and sink. His heuristic for solving this problem iteratively applied several two-processor assignments to find the n-processor assignment.

Why static tasks?

Many embedded systems statically allocate processes to processing elements. We can efficiently find bounds on the execution time of the processes in these multiprocessor systems. We will assume that there is a set of processes with data dependencies between them; in general, they can form one or more subtasks.

We will also assume that each CPU schedules processes using rate-monotonic scheduling. Although we can easily figure out the schedule if we don't have data dependencies, the combination of data dependencies and rate-monotonic scheduling makes the problem more challenging, although tractable.

Minimizing buffer sizes. Bhattacharyya et al. [Bha97] developed methods to efficiently schedule synchronous data flow graphs on multiprocessors. Figure 6-4 below shows an SDF graph with the nodes assigned to processing elements in a multiprocessor. We are primarily interested in the communication between PEs, since we can schedule each SDF on a processor using other methods that produce a sequential schedule.

Figure 6-4 Models for multiprocessor communication.

We model the system using an interprocessor communication (IPC) graph, also shown in the figure. The IPC graph has the same nodes as the SDF graph, and it has all the edges of the SDF graph plus additional edges. We add edges to the IPC graph to model the sequential schedule on each PE; these edges are shown by the dashed lines in the figure.

The edges in the allocated SDF graph that cross processor boundaries are known in the IPC graph as IPC edges because they define interprocessor communication. Any communication across an IPC edge must use an interprocess communication mechanism to cross the boundary between the processors.

We can determine whether communication across each IPC edge is bounded; edges not in a strongly connected component (SCC) are not bounded. When implementing interprocess communication on an unbounded edge, we can use a protocol that ensures that the number of tokens crossing the edge does not exceed a predetermined buffer size. We can implement interprocess communication on bounded edges with a simpler protocol.
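As a sketch of that bounded/unbounded distinction, the fragment below computes SCCs with Kosaraju's algorithm and classifies the edges. The three-node graph and its edges are invented: edges whose endpoints share an SCC lie on a feedback cycle, so their token traffic is inherently bounded, while edges that cross SCCs need an explicit buffer-bounding protocol.

```python
def sccs(nodes, edges):
    """Kosaraju's algorithm: return a map from node to its SCC representative."""
    adj = {v: [] for v in nodes}
    radj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        radj[v].append(u)
    order, seen = [], set()

    def dfs(v, graph, out):
        # iterative DFS appending nodes in post-order
        stack = [(v, iter(graph[v]))]
        seen.add(v)
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                out.append(node)
                stack.pop()

    for v in nodes:                 # pass 1: finish order on the graph
        if v not in seen:
            dfs(v, adj, order)
    seen.clear()
    comp = {}
    for v in reversed(order):       # pass 2: reverse graph, reverse finish order
        if v not in seen:
            members = []
            dfs(v, radj, members)
            for m in members:
                comp[m] = v         # v acts as the SCC's representative
    return comp

# Hypothetical IPC graph: a <-> b form a feedback loop, c is feedforward.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "a"), ("b", "c")]
comp = sccs(nodes, edges)
bounded = [(u, v) for u, v in edges if comp[u] == comp[v]]
unbounded = [(u, v) for u, v in edges if comp[u] != comp[v]]
print(bounded, unbounded)
```

Here the a-b edges come back bounded and the b-to-c edge is the one that would need a bounding protocol.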

The IPC graph may have some redundant edges. An edge e is redundant if there is another path from source(e) to sink(e) that has a longer delay than the delay along e. The redundant edges do not have to be removed in any particular order to ensure that we remove the maximum number of redundant edges. The asymptotic iteration period T for a strongly connected IPC graph G is

T = max_C [ ( Σ_{v ∈ C} t(v) ) / delay(C) ]

where C is a cycle through the graph, t(v) is the execution time of a node v, and delay(C) is the sum of the delays around the path C. The quotient for a single cycle is known as its cycle mean. The maximum cycle mean of an IPC graph, λ_max, is the largest cycle mean for any SCC in the graph. A cycle whose cycle mean is equal to the maximum is known as a critical cycle.
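For a toy graph, the maximum cycle mean can be brute-forced directly from that definition. The node execution times and edge delays below are invented, and real tools would use an efficient method such as Karp's algorithm rather than enumerating cycles.

```python
from itertools import permutations

def max_cycle_mean(t, delay):
    """Brute-force max over simple cycles of sum(t(v)) / delay(C).
    Assumes every cycle carries at least one unit of delay (no deadlock).
    Only practical for tiny graphs."""
    nodes = list(t)
    best = None
    for r in range(1, len(nodes) + 1):
        for perm in permutations(nodes, r):
            # close the candidate cycle back to its first node
            edges = list(zip(perm, perm[1:] + perm[:1]))
            if all(e in delay for e in edges):
                mean = sum(t[v] for v in perm) / sum(delay[e] for e in edges)
                best = mean if best is None else max(best, mean)
    return best

# Hypothetical IPC graph: execution times per node, delay (tokens) per edge
t = {"a": 2, "b": 3, "c": 1}
delay = {("a", "b"): 1, ("b", "a"): 1, ("b", "c"): 1,
         ("c", "b"): 1, ("c", "a"): 1}
print(max_cycle_mean(t, delay))   # 2.5, from the critical cycle a -> b -> a
```

The a-b-a loop has cycle mean (2+3)/2 = 2.5, larger than either three-node cycle, so it is the critical cycle.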

We can construct a strongly connected synchronization graph by adding edges between strongly connected components. We add edges that chain together source SCCs, edges that chain together sink SCCs, and an edge that connects the overall sink of the graph to the source.

(A strongly connected component is a source SCC if any edge whose sink is in the strongly connected component also has its source in the strongly connected component. A sink SCC is such that any edge whose source is in the SCC also has its sink in the SCC.)

We need to add delays to the edges, corresponding to buffer memory, that ensure the system will not deadlock and that minimize the sum of the buffer bounds over all the IPC edges. We can use the added edges to help us determine these delays; the added edges can be divided into disjoint sets that help organize the graph.

Delay can be added optimally if the graph has one source SCC and one sink SCC; the method is heuristic if the graph's structure is more complex. We can determine the minimum delay on each edge that ensures that the graph's cycle mean is not exceeded.

Figure 6-5 RATAN process model.


Mathur et al. [Mat98] developed the RATAN tool for analyzing the rates of multi-tasking systems. As shown in Figure 6-5 above, a process consists of a single-threaded CDFG-style model. Data dependencies may extend from a node within one process to a node within another process. When we look only at the processes themselves and the edges between them, a control edge is labeled with [min, max] delays measured from the activation signal for the process to the start of execution.

Those bounds are specifications on the allowable delay; our goal is to find execution rates for the processes that satisfy these bounds. A process starts to execute after all of its enable signals have become ready. If we denote the delay of an edge i → j in the graph as d_ij, then the delay of a cycle C in the process graph is given by

delay(C) = Σ_{(i→j) ∈ C} d_ij

The mean delay of the cycle is given by

delay(C) / |C|

where |C| is the number of edges in C. The maximum mean cycle delay is known as λ. In a strongly connected graph, all nodes execute at the same rate, namely λ.

We call [r_l(X), r_u(X)] the lower and upper bounds on the rate of a subgraph X. If we have two maximal SCCs of the graph, P and C, and the graph has edges from P to C, then P is a producer and C is a consumer; the actual rate interval for the consumer C is therefore

[ min(r_l(P), r_l(C)), min(r_u(P), r_u(C)) ]

since the consumer can run no faster than its producer.
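A minimal sketch of this rate propagation, assuming (as the producer/consumer discussion above suggests) that the consumer's rate bounds are simply clipped by the producer's; the interval values are invented.

```python
def consumer_rate(producer, consumer):
    """Clip the consumer's rate interval [r_l, r_u] by the producer's:
    the consumer cannot consume faster than the producer produces.
    This clipping rule is an assumption drawn from the discussion above,
    not RATAN's exact formulation."""
    (pl, pu), (cl, cu) = producer, consumer
    return (min(pl, cl), min(pu, cu))

P = (2.0, 5.0)   # hypothetical producer SCC rate interval
C = (3.0, 8.0)   # hypothetical consumer SCC rate interval
print(consumer_rate(P, C))   # (2.0, 5.0): the producer limits the consumer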

Data dependencies and scheduling

The problems created by data dependencies are illustrated in Figure 6-6, below. Here, two subtasks are divided among three processors. Take, for example, processing element M1.

Figure 6-6 Preemption and scheduling.

This CPU runs two processes that will clearly affect each other's schedules. But the completion times of the processes on M1 also depend on the behavior of the processes on all the other PEs in the system. Data dependencies link P1 and P2, which adds M2 to the set of interrelated PEs. The data dependency between P3 and P4 also adds M3 to the system.

Getting a process to run faster doesn't always help. Consider Figure 6-7 below; in this example, changing the computation time of process Px changes the response time of P3 even though they run on different processors.

Figure 6-7 Period shifting. From Yen and Wolf [Yen98] © 1998 IEEE.

Because of the data dependencies, the shortened computation time of Px causes process P2 to run sooner, preempting P3.

Ramamritham [Ram90b] proposed scheduling over an unrolled schedule. He noted that the maximum interval that must be considered for a set of processes is the least common multiple (LCM) of their periods; over any longer interval, the pattern of process executions must necessarily repeat itself. He then constructed a schedule using the LCM interval.
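The LCM interval (often called the hyperperiod) is a one-line computation; the periods below are invented.

```python
from math import lcm

periods = [4, 6, 10]          # hypothetical process periods
hyperperiod = lcm(*periods)   # the schedule repeats after this interval
# how many executions of each process the unrolled schedule must place
jobs = [hyperperiod // p for p in periods]
print(hyperperiod, jobs)      # 60 [15, 10, 6]
```

Unrolling over 60 time units, the scheduler must place 15, 10, and 6 jobs of the three processes respectively, after which the whole pattern repeats.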

To develop bounds on the CPUs in the system, we can make use of a theorem from Lehoczky et al. [Leh89]. They bounded the response times for a set of independent (no data dependencies) processes running on a single CPU.

The processes are {P1, P2, …}, with P1 being the highest-priority process. The minimum period of the ith process is pi and its worst-case execution time is ci. The worst-case response time of Pi is wi, which can be computed as the smallest non-negative root of

x = g(x) = ci + Σ_{j<i} ⌈x / pj⌉ cj

The ci term gives the computation time for the process. Each ⌈x/pj⌉ cj term gives the interference from a higher-priority process Pj: the number of times Pj can request service in an interval of length x, multiplied by its computation time. We cannot solve this equation directly but can use numerical methods to solve it.
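The standard numerical method is fixed-point iteration starting from w = ci; the three-task set below is invented, listed in priority order.

```python
from math import ceil

def response_time(i, c, p):
    """Smallest non-negative fixed point of
        w = c[i] + sum_{j<i} ceil(w / p[j]) * c[j]
    found by iterating from w = c[i]. Tasks are indexed in priority
    order (0 = highest). Returns None if w grows past the period,
    treating the period as the deadline."""
    w = c[i]
    while True:
        w_next = c[i] + sum(ceil(w / p[j]) * c[j] for j in range(i))
        if w_next == w:
            return w
        if w_next > p[i]:   # response time exceeds the period: unschedulable
            return None
        w = w_next

# Hypothetical task set: execution times c and periods p, rate-monotonic order
c = [1, 2, 3]
p = [4, 6, 12]
print([response_time(i, c, p) for i in range(3)])   # [1, 3, 10]
```

For the lowest-priority task the iteration climbs 3 → 6 → 7 → 9 → 10 and then stabilizes, so its worst-case response time is 10, within its period of 12.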

We can use the worst-case response times of the various processes to help us solve the system schedule. But this formula is not sufficient because it does not take into account the data dependencies either between the CPUs or within a CPU. To handle those data dependencies, we need to use a more complex graph algorithm.

Static scheduling algorithm. Yen and Wolf [Yen98] developed an algorithm that handles multiple CPUs and data dependencies. This algorithm models the processes to be run as a task graph that can have one or more subtasks. Each process in the task graph is given bounds for its computation time [ci,lower, ci,upper].

This is more general than a single, fixed computation time, but it does assume that we can strictly bound the computation time of a process. Each subtask has a period that is also modeled as an interval [pi,lower, pi,upper]. The architecture of the platform is modeled as a processor graph. The allocation of each process in the task graph to a processing element is given in Figure 6-8, below.

Figure 6-8 An example of reduced computation time leading to longer response times. From Yen and Wolf [Yen98].

The algorithm finds bounds on the start and finish times of the processes. Given a process Pi, the start time is bounded by earliest[Pi, request] and latest[Pi, request]; the end time is bounded by earliest[Pi, finish] and latest[Pi, finish].

The following code summarizes the delay estimation algorithm. The maxsep[,] data structure holds the earliest[] and latest[] bounds for each process; it starts out with infinite (unbounded) times for the processes. This performance analysis algorithm iteratively tightens these bounds until they stop changing (or until a predetermined iteration limit is reached).

```
maxsep.lower = maxsep.upper = infinity;
step = 0; /* keep track of number of iterations */
do {
    /* use longest-path algorithm to find the request and finish times */
    foreach Pi { EarliestTimes(Gi); LatestTimes(Gi); }

    /* handle max constraints */
    foreach Pi { MaxSeparations(Gi); }

    step++;
} while (maxsep has changed && step < limit);
```

At each iteration, the algorithm performs two types of operations. The EarliestTimes()/LatestTimes() procedures analyze data dependencies using a modified longest-path algorithm.

The MaxSeparations() procedure uses a modified max-constraint algorithm to look for combinations of process executions that cannot occur. Each of these procedures is applied separately to each subtask Pi in the task graph, but the procedures look at the execution times of all the other subtasks while updating Pi.

Figure 6-9 An example of phase adjustment and separation analysis. From Yen and Wolf [Yen98] © 1998 IEEE.

To understand why we need these two steps, consider the example of Figure 6-9, above. A naive analysis might lead one to conclude that the total time required to finish both tasks is 80. But two conditions cause it to be shorter. First, P1 cannot preempt both P2 and P3 in a single execution of the task. Second, P2 cannot preempt P3 since P2 must finish before P3 can start. Therefore, the worst-case delay to execute both tasks is only 45.

Phase constraints

We will use phase constraints to summarize the relationships between a process that is preempted and the process that preempts it. The algorithm uses two types of phases.

1) The request phase, which we will call phase[i,j,r] in the code, describes the smallest interval between the execution of Pi on one iteration and the succeeding iteration of Pj.

2) The finishing phase, which we will call phase[i,j,f] in the code, describes the smallest interval between the finishing time of one iteration of Pi and the first request time of the next iteration of Pj.

Figure 6-10 Request and finishing phases.

These definitions are illustrated in Figure 6-10 above. The gray boxes at the ends of the boxes represent the min/max timing bounds.

Data dependencies are one form of constraint on the relative execution times of processes. LatestTimes() and EarliestTimes() are modified longest-path algorithms that determine the timing and slack of the processes based on their data dependencies. Pseudocode for LatestTimes() is shown in the following block of code from [Yen98]; it is similar in form to the EarliestTimes() procedure.

```
LatestTimes(G) {
    /* initialize */
    foreach (process Pi) {
        latest[Pi, request] = 0;
        foreach (process Pj) phase[i, j, r] = 0;
    }
    foreach (process Pi in topological order) {
        wi = worst-case response time of Pi with phase adjustment phase[i, j, r];
        foreach (process Pj such that priority(Pj) > priority(Pi)) {
            latest[Pi, finish] = latest[Pi, request] + wi;
            calculate phase[i, j, f] relative to latest[Pi, finish] for each j;
            foreach (immediate successor Pk of Pi) {
                delta = latest[Pk, request] - latest[Pi, finish];
                if (latest[Pk, request] < latest[Pi, finish])
                    latest[Pk, request] = latest[Pi, finish];
                update phase[k, j, r] for each process Pj according to
                    phase[i, j, f] and delta;
            }
        }
    }
}
```

The algorithm walks through the graphs in the order in which they appear in the task graph. The worst-case response time wi is computed using the response-time equation given earlier, with the modification that the term inside the summation takes into account the request phase:

x = g(x) = ci + Σ_{j<i} ⌈(x - phase[i,j,r]) / pj⌉ cj

After we have computed wi, we then compute the phases relative to latest[Pi, finish]. There are two cases to consider: Pj preempts Pi, and Pj does not preempt Pi. Updating the request phases requires looking ahead one iteration, examining the successors Pk of Pi. If delta = latest[Pk, request] - latest[Pi, finish] > 0, then there is slack between the latest finish time of Pi and the latest request time of Pk, and the next request of Pj moves that much closer to Pk's request. If delta < 0, then latest[Pk, request] is moved up to latest[Pi, finish]. In either case the request phase is updated as

phase[k,j,r] = max(phase[i,j,f] - max(delta, 0), 0)

These relationships are illustrated in Figure 6-11, below.

Figure 6-11 Relationships related to phases.

The MaxSeparations() procedure uses combinations of processes that cannot interfere to tighten the bounds on execution time. It uses the results of the phase analysis performed in LatestTimes() and EarliestTimes() to check the separations of phases of processes, taking preemption into account.

Max constraints model the relationship between a process and its predecessors: the initiation time of a process is the max of the finish times of its predecessors.

Max constraints are harder to solve than the linear constraints imposed by data dependencies. We can use a modified version of an algorithm developed by McMillan and Dill [McM92] to solve these constraints.
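The max constraint itself (though not the McMillan-Dill algorithm) is easy to illustrate: propagate finish times through a task DAG, with each start time the max of its predecessors' finish times. The task graph and execution times below are invented, and preemption is ignored.

```python
def start_finish_times(succ, exec_time):
    """Propagate max constraints through a task DAG: a process may start
    only when all its predecessors have finished, so its start time is
    the max of their finish times (0 if it has no predecessors)."""
    preds = {v: [] for v in exec_time}
    for u, vs in succ.items():
        for v in vs:
            preds[v].append(u)
    start, finish = {}, {}

    def finish_of(v):
        # memoized recursion over the DAG
        if v not in finish:
            start[v] = max((finish_of(u) for u in preds[v]), default=0)
            finish[v] = start[v] + exec_time[v]
        return finish[v]

    for v in exec_time:
        finish_of(v)
    return start, finish

# Hypothetical task graph: P1 -> P3 and P2 -> P3
succ = {"P1": ["P3"], "P2": ["P3"], "P3": []}
exec_time = {"P1": 4, "P2": 6, "P3": 2}
start, finish = start_finish_times(succ, exec_time)
print(start["P3"], finish["P3"])   # 6 8: P3 waits for the slower predecessor
```

P3's start time is governed by the slower of its two predecessors, which is exactly the nonlinearity that makes max constraints harder to analyze than simple chains of data dependencies.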

To read Part 1, go to The role of the operating system.

Next in Part 3: Event Driven Scheduling Analysis.

Used with the permission of the publisher, Newnes/Elsevier, this series of five articles is based on copyrighted material from "High-Performance Embedded Computing," by Wayne Wolf. The book can be purchased online.

Wayne Wolf is professor of electrical engineering at Princeton University. Prior to joining Princeton he was with AT&T Bell Laboratories. He has served as editor-in-chief of ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.

References:

[Gar79] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, 1979.

[Sto77] Harold S. Stone, "Multiprocessor scheduling with the aid of network flow algorithms," IEEE Transactions on Software Engineering, SE-3(1), January 1977.

[Bha97] Shuvra S. Bhattacharyya, Sundararajan Sriram, and Edward A. Lee, "Optimizing synchronization in multiprocessor DSP systems," IEEE Transactions on Signal Processing, 45(6), June 1997.

[Mat98] Anmol Mathur, Ali Dasdan, and Rajesh K. Gupta, "Rate analysis for embedded systems," ACM Transactions on Design Automation of Electronic Systems, 3(3), July 1998.

[Ram90b] Krithi Ramamritham, "Allocation and scheduling of complex periodic tasks," Proceedings, 10th International Conference on Distributed Computing Systems, IEEE, 1990.

[Leh89] J. Lehoczky, L. Sha, and Y. Ding, "The rate monotonic scheduling algorithm: exact characterization and average case behavior," Proceedings, IEEE Real-Time Systems Symposium, IEEE, 1989.

[Yen98] Ti-Yen Yen and Wayne Wolf, "Performance analysis of distributed embedded systems," IEEE Transactions on Parallel and Distributed Systems, 9(11), November 1998.

[McM92] K. McMillan and D. Dill, "Algorithms for interface timing verification," Proceedings, IEEE International Conference on Computer Design, 1992.