What's different about multiprocessor software? (Part 3) - Embedded.com

An alternative approach to multiprocessor performance analysis is SymTA/S. It is based on events [Hen05]. The inputs and outputs to the system and between the processing elements are modeled as events. Unlike rate-monotonic analysis, which assumes one event per period, the SymTA/S model allows a more complex description of when events can and must occur.

SymTA/S starts with a description of the system, including the tasks and their mapping to processing elements, as well as the set of input events. It produces the output events and, as a side effect, the internal events as well.

SymTA/S does not, strictly speaking, provide a schedule for the processes, but it does provide tight bounds on when processes can be scheduled to assure that they meet the input requirements.

Figure 6-12 A jitter event model.

An event is an atomic action. We can describe several categories of events, depending on their timing behavior. A simple event model is defined by a period P; these events happen strictly periodically. A jitter event model has a period and jitter (P, J); these events occur at roughly the described period, but they can appear anywhere within the jitter interval, as shown in Figure 6-12 above. An event function model allows us to vary the number of events in an interval.

An event function counts the number of events in an interval, so it is a piecewise constant function with unit steps – each step adds one more event to the system. We can also define a periodic with jitter event model as

η+(Δt) = ⌈(Δt + J)/P⌉,  η−(Δt) = max(0, ⌊(Δt − J)/P⌋)

where η+ and η− bound the number of events that can occur in any interval of length Δt.

We can also define a minimum distance function and a maximum distance function, which give the minimum and maximum distance between any N events in a stream, where N is two or more. We can describe sporadic events as periodic event streams with the lower bound on the number of events per unit time set to zero.
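The event-count bounds described above can be sketched in a few lines of Python. The function names and the exact form of the bounds follow the usual formulation of the periodic-with-jitter model in the SymTA/S literature; they are assumptions rather than quotes from this article:

```python
import math

def eta_plus(dt, period, jitter):
    """Upper bound on the number of events in any interval of length dt
    for a periodic-with-jitter event model (P, J)."""
    if dt <= 0:
        return 0
    return math.ceil((dt + jitter) / period)

def eta_minus(dt, period, jitter):
    """Lower bound on the number of events in any interval of length dt."""
    if dt <= 0:
        return 0
    return max(0, math.floor((dt - jitter) / period))

def eta_minus_sporadic(dt, period, jitter):
    """A sporadic stream is modeled with a lower bound of zero events."""
    return 0
```

For a strictly periodic stream, set the jitter to zero and both bounds collapse to the simple periodic model.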

As illustrated in Figure 6-13 below, processing elements consume input events and emit output events. SymTA/S concentrates on modeling the timing, not the functionality, between these inputs and outputs. Because these event models are relatively simple, we can easily calculate the timing of an output event given the input event.

Figure 6-13 Input events produce output events.

A strictly periodic input event produces a strictly periodic output event. In the case of an event with jitter, we just add the response time jitter (that is, the difference between the maximum and minimum response times) to the input jitter in order to find the output event's jitter:

J_out = J_in + (R_max − R_min)
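As a quick sketch of this propagation rule (the variable names are mine, not the article's):

```python
def output_jitter(input_jitter, r_min, r_max):
    """Output jitter of a task with response times in [r_min, r_max]:
    the input jitter plus the response-time jitter."""
    return input_jitter + (r_max - r_min)

# The period itself passes through unchanged; only the jitter grows.
```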

We need to be able to analyze the timing of a system of processing elements by computing the relationships between the outputs of one processing element and the input of another. Analysis starts with the definitions of the input events, which are given by the designer; analysis determines the events at the system outputs and along the way generates the internal events between the processing elements.

As illustrated in Figure 6-14 below, some systems may be unanalyzable because they have cyclic scheduling dependencies. The figure shows two input events, but each processing element (R1 and R2) has a task with a missing input.

Figure 6-14 A cyclic scheduling dependency. From Henia et al. [Hen05] © 2005 IEEE.

These PEs communicate, but we need to know the timing of R2 to calculate the output of R1 (which feeds R2), while we need to know the timing of R1 to calculate the output of R2 (which feeds R1). These dependencies between events make it impossible to start and finish the analysis.

Events activate tasks; we can define combinations of events that are required to activate a task. An AND activation requires all of a set of events to activate; an OR activation requires any of a set of events to activate. We do not allow NOT activations because those would turn off events and make analysis intractable.

As shown in Figure 6-15 below, an AND-activation task takes events at its inputs and fires when all its inputs are available. Each input is buffered with a queue, since events may not all arrive at the same time.

However, to ensure that the buffers are of finite size, we require that all the inputs to an AND-activation task have the same arrival rate. The activation jitter of the AND-activation task is determined by the largest jitter of any of the inputs:

J_AND = max_i (J_i)

Figure 6-15 An AND-activation task.
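A minimal sketch of an AND-activation with per-input queues follows; the class and method names are illustrative, not from the article:

```python
from collections import deque

class AndActivation:
    """AND-activated task: one queue per input; the task fires only
    when every input queue holds at least one event."""
    def __init__(self, n_inputs):
        self.queues = [deque() for _ in range(n_inputs)]

    def event(self, i, e):
        """Deliver event e on input i; return an activation tuple or None."""
        self.queues[i].append(e)
        if all(self.queues):
            # One event from each input is consumed per activation.
            return tuple(q.popleft() for q in self.queues)
        return None

def and_activation_jitter(jitters):
    """The activation jitter is the largest jitter of any input."""
    return max(jitters)
```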

An OR-activated task does not require input buffers, since any input event is immediately consumed. However, analysis of OR-activated tasks is more complex than that of AND-activated tasks precisely because OR-activated tasks are more easily activated. Figure 6-16 below illustrates a pair of input events and their resulting OR activations. One input stream has P1 = 4, J1 = 2, while the other has P2 = 3, J2 = 2.

Although each input event stream can be characterized by this event model, the OR-activation stream cannot. The OR combination of these input event streams has an irregular pattern that does not fit any of our analyzable event models. We must therefore use an approximation of the activation stream, as shown in part (d) of Figure 6-16 below, using a periodic with jitter model.

The period of an OR-activation is determined by unrolling the schedule to find the hyperperiod. We find the least common multiple of the input event periods and divide that by the sum of all the input events over that hyperperiod (assuming no jitter). This can be written as

P_OR = lcm(P_1, …, P_n) / Σ_i [lcm(P_1, …, P_n) / P_i]
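Under the stated assumption of jitter-free inputs, the OR-activation period can be computed directly (a sketch; `math.lcm` requires Python 3.9 or later):

```python
from math import lcm

def or_activation_period(periods):
    """OR-activation period: the hyperperiod (LCM of the input periods)
    divided by the number of input events in one hyperperiod."""
    hyper = lcm(*periods)
    n_events = sum(hyper // p for p in periods)
    return hyper / n_events
```

For the two streams of Figure 6-16 (P1 = 4, P2 = 3), the hyperperiod is 12, which contains 3 + 4 = 7 events, giving an OR-activation period of 12/7.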

Figure 6-16 OR activations and approximations. From Henia et al. [Hen05] © 2005 IEEE.

The OR-activation jitter needs to be approximated, as shown in Figure 6-16. We want to find the smallest jitter that satisfies these inequalities:

We can evaluate (EQ 6-12) piecewise in order to bound the OR-activation jitter. Each piece can be written as:

where k_j is a natural number. The ceiling function increases monotonically in Δt, so we need to evaluate it only for the smallest value of Δt, which approaches Δt_j. This gives

Tasks can be connected in cycles – either primitive cycles, where the output of a task feeds its own input, or long cycles, in which a series of tasks folds back on to itself. AND activation is the natural mode for tasks in cycles.

However, an AND-activated task has a larger jitter at its output than at its input in our basic analysis. This would mean that cycles of tasks could never be stable. To analyze such cycles, we cut the cycle, analyze the open-loop system, and check the results in the closed-loop context.

The basic event models do not describe correlations between events. In realistic systems, events can be correlated in a way that allows us to usefully narrow the range of feasible schedules. Correlations often come from context-dependent behavior, which can arise from either of two sources: different types of events and internal modes in the tasks.

We can use these behaviors to define correlations within a single stream, known as an intra-event stream context. We can also find inter-event stream contexts that allow us to find relationships, such as offsets, between events in two different streams.

Kang et al. [Kan99b] developed a tool suite to synthesize distributed implementations of signal processing algorithms. Their methodology is shown in Figure 6-17 below. The designer provides the task graph, the hardware platform architecture, and design constraints such as deadlines. The designer also provides statistical information about process behavior in the form of probability distribution functions.

Figure 6-17 A methodology for distributed software synthesis.

During synthesis, each subgraph of the task graph is treated as an independent channel. Processes in the subgraph are executed quasicyclically. Time is divided into quanta, and each process is assigned a time budget and a position in the schedule known as an interval. Processes may use additional time quanta if the load is low enough that extra quanta are available.

The load threshold estimator determines the schedule for processes. It figures out the throughput required for each channel, allocates processes to processing elements to balance load, and schedules the intervals to minimize latency. Because load is modeled statistically, iterative algorithms must be used to allocate processes.

To analyze a channel, the estimator breaks it into chains, analyzes each chain separately, then combines the estimates into an overall estimate for the complete channel. Because the load may vary, system behavior is validated through simulation.
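Kang et al.'s iterative, statistically driven allocator is more sophisticated than this, but the load-balancing step can be sketched greedily as follows. The process loads here are hypothetical expected values standing in for the article's probability distribution functions:

```python
def balance_load(process_loads, n_pes):
    """Greedy first-fit-decreasing sketch: assign each process, heaviest
    first, to the currently least-loaded processing element."""
    pe_load = [0.0] * n_pes
    assignment = {}
    for name, load in sorted(process_loads.items(), key=lambda kv: -kv[1]):
        pe = min(range(n_pes), key=lambda i: pe_load[i])
        pe_load[pe] += load
        assignment[name] = pe
    return assignment, pe_load
```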

Scheduling with Dynamic Tasks
When tasks arrive at the system dynamically, we cannot guarantee that we will be able to meet the required demand. Unless the source of the tasks limits itself, it may generate more task requests than the system can handle.

A system that takes dynamically allocated tasks must figure out on the fly whether it can accept a task and, in the case of a multiprocessor, which processing element should handle the task. This decision needs to be made quickly. If task allocation is performed by a node that performs other tasks, scheduling overhead eats into the available CPU time for useful work.

Even if task allocation is performed by a dedicated processor, longer decision times delay the task from starting execution, giving it less time to complete before its deadline.

General-purpose multiprocessors commonly receive dynamic tasks from users. The multiprocessor operating system allocates tasks to processors based on system loads. However, general-purpose multiprocessors are often homogeneous, which simplifies many scheduling decisions.

Embedded multiprocessors are often heterogeneous. As a result, not all tasks may run on all nodes. Although this simplifies some scheduling decisions – a task that runs on only one node must run on that node – it complicates other matters.

If a processing element can run a combination of general-purpose tasks and specialized tasks, then it may accept general-purpose tasks before it knows that it will be required to run a specialized task. The operating system can either reserve the processing element for specialized tasks, thus wasting computational capability, or it must be able to move tasks on the fly to make room for the specialized task.

Ramamritham et al. [Ram90a] developed a myopic algorithm for dynamic scheduling of real-time tasks. Their algorithm performs a heuristic search to find a good schedule for a newly arrived task. They assume that the tasks are nonperiodic and executed nonpreemptively; they also assume that there are no data dependencies between tasks. Each task is characterized by its arrival time T_A, its deadline time T_D, its worst-case processing time T_P, and its resource requirements {T_R}.

The factor that chooses which process to schedule next during a search can be encapsulated as a decision function H. Possible decision functions include shortest deadline first, shortest processing time first, minimum earliest start time first, and minimum laxity first.
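These decision functions are easy to state as code. The task fields below ('deadline', 'proc_time', 'earliest_start') are my own naming, and laxity is taken as deadline minus earliest start minus processing time:

```python
def h_shortest_deadline(task):
    return task['deadline']

def h_shortest_processing_time(task):
    return task['proc_time']

def h_min_earliest_start(task):
    return task['earliest_start']

def h_min_laxity(task):
    # Laxity: slack left if the task starts as early as possible.
    return task['deadline'] - task['earliest_start'] - task['proc_time']

def choose_next(tasks, H):
    """Pick the task that minimizes the decision function H."""
    return min(tasks, key=H)
```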

The search algorithm constructs partial schedules during its search, starting from a set of tasks P = {p_i} that need to be scheduled. It adds tasks to the schedule to create more complete schedules. It may also backtrack, removing tasks from the partial schedule to try another solution.

At each point in the search, a task is taken from P and added to the partial schedule S; we call the set of tasks that remain to be scheduled P_R. A partial schedule is strongly feasible if the partial schedule itself is feasible and every possible next choice for a task to execute also creates a feasible partial schedule.

The myopic algorithm considers only a subset of the remaining tasks during search. It keeps P_R sorted by deadlines and takes only the first k tasks in the set. It then evaluates H for these k tasks and chooses the best task to add to the partial schedule. If k is small relative to the total number of tasks, then this process is approximately linear in the total number of tasks.
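A compact sketch of the myopic search under the paper's assumptions (nonpreemptive, no data dependencies) follows; a single shared resource and my own task fields 'arrival', 'proc_time', and 'deadline' stand in for the full resource requirements:

```python
def myopic_schedule(tasks, k=3, H=lambda t: t['deadline']):
    """Heuristic search: consider only the k earliest-deadline unscheduled
    tasks, extend the partial schedule with the one minimizing H, and
    backtrack when an extension is infeasible."""
    def feasible(schedule):
        # Run tasks back to back; every task must finish by its deadline.
        t = 0
        for task in schedule:
            t = max(t, task['arrival']) + task['proc_time']
            if t > task['deadline']:
                return False
        return True

    def search(schedule, remaining):
        if not remaining:
            return schedule
        # Myopic step: look only at the k tasks with the nearest deadlines.
        pool = sorted(remaining, key=lambda t: t['deadline'])[:k]
        for task in sorted(pool, key=H):
            if feasible(schedule + [task]):
                rest = [t for t in remaining if t is not task]
                full = search(schedule + [task], rest)
                if full is not None:
                    return full
        return None  # backtrack

    return search([], list(tasks))
```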

Load balancing is a form of dynamic task allocation in which the tasks come from internal sources. A processing element may decide to shift one of its tasks somewhere else in the system for any of several reasons: it inaccurately estimated its ability to finish all its tasks; it wants to improve cache behavior or some other performance metric; or it needs to accept a task that can only run on a subset of the available PEs.

To balance processor loads, we must be able to stop a task running on one PE, save its state, and restore that state on the new processing element – a procedure often called task or process migration. Migration may be straightforward or difficult, depending on the situation:

1) Homogeneous multiprocessor with shared memory. In this case, we can copy the task's activation record from the old PE to the new PE and restart the task on the new PE.

2) Heterogeneous multiprocessor with shared memory. We have to assume that we have versions of the code available that will run on both the new and old PEs. In this case, we cannot simply copy the activation record from one type of processor to another type.

In certain cases, there may be a straightforward transformation between the activation record information of the two PEs. In general, we must use specialized code in the processes that saves the task state in memory so that we do not need to explicitly transfer the activation record.

3) Multiprocessor with nonshared memory. If the PEs do not execute tasks in a shared memory area, we must copy all of the program data (and possibly code) from the old PE to the new one. This is generally expensive, and it is unlikely that tasks can be usefully migrated without shared memory.

Shin and Chang [Shi89a] developed a load-balancing algorithm for real-time multiprocessors. Each processing element in the system maintains a buddy list of nodes with which it can share tasks; the buddies may be determined, for example, by the communication cost between the processors. To make scheduling decisions quickly, processing elements send information about their state to the PEs on their buddy list. Each PE may be in one of three states: underloaded, medium loaded, or fully loaded. When a PE changes state, it sends an update to all its buddies.

Each PE further organizes the buddy list into a preferred list, which is ordered by the communication distance of the processing element. When a PE wants to move a task to another node, it requests help from the first underloaded PE on its preferred list. By properly constructing the preferred PE lists, the system can ensure that each processing element appears at the head of no more than one other PE's preferred list. This reduces the chance that a PE will be overwhelmed by multiple task migration requests.
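The buddy-list mechanism can be sketched as follows; the three-state encoding and the class and method names are illustrative, not Shin and Chang's notation:

```python
UNDERLOADED, MEDIUM, FULL = range(3)

class ProcessingElement:
    """Each PE broadcasts state changes to its buddies and offloads work
    to the first underloaded PE on its preferred (distance-ordered) list."""
    def __init__(self, name):
        self.name = name
        self.state = UNDERLOADED
        self.buddies = []    # PEs that hear our state-change broadcasts
        self.preferred = []  # our buddies, ordered by communication distance
        self.known = {}      # last state heard from each buddy

    def set_state(self, state):
        if state != self.state:          # broadcast only on a state change
            self.state = state
            for buddy in self.buddies:
                buddy.known[self.name] = state

    def offload_target(self):
        """First underloaded PE on the preferred list, or None."""
        for buddy in self.preferred:
            if self.known.get(buddy.name, buddy.state) == UNDERLOADED:
                return buddy
        return None
```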

Next, in Part 4: Services and Middleware for Embedded Multiprocessors
To read Part 1, go to The role of the operating system.
To read Part 2, go to Multiprocessor Scheduling.

Used with the permission of the publisher, Newnes/Elsevier, this series of five articles is based on copyrighted material from “High-Performance Embedded Computing,” by Wayne Wolf. The book can be purchased online.

Wayne Wolf is professor of electrical engineering at Princeton University. Prior to joining Princeton he was with AT&T Bell Laboratories. He has served as editor in chief of ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.

References:
[Hen05] Rafik Henia, Arne Hamann, Marek Jersak, Razvan Racu, Kai Richter, and Rolf Ernst, “System-level performance analysis: the SymTA/S approach,” IEE Proceedings – Computers and Digital Techniques, 2005.

[Kan99b] D.-I. Kang, R. Gerber, L. Golubchik, J. K. Hollingsworth, and M. Saksena, “A software synthesis tool for distributed embedded systems design,” in Proceedings of the ACM SIGPLAN 1999 Workshop on Languages, Compilers and Tools for Embedded Systems, ACM Press, 1999.

[Ram90a] Krithi Ramamritham, John A. Stankovic, and Perng-Fei Shiah, “Efficient scheduling algorithms for real-time multiprocessor systems,” IEEE Transactions on Parallel and Distributed Systems, April 1990.

[Shi89a] Kang G. Shin and Yi-Chieh Chang, “Load sharing in distributed real-time systems with state-change broadcasts,” IEEE Transactions on Computers, August 1989.
