While real-time operating systems provide apparent concurrency on a single processor, multiprocessor platforms provide true concurrency. The concurrency and performance provided by multiprocessors can be very powerful but also harder to analyze and debug.
The purpose of this series of five articles is (1) to review what is unique about multiprocessor software as compared to both uniprocessor embedded systems and general-purpose systems; (2) to study scheduling and performance analysis of multiple tasks running on a multiprocessor; (3) to consider middleware and software stacks as well as design techniques for them; and (4) to look at design verification of multiprocessor systems.
As we move up to software running on embedded multiprocessors, we face two types of differences. First, how is embedded multiprocessor software different from traditional, general-purpose multiprocessor software? We can borrow many techniques from general-purpose computing, but some of the challenges in embedded computing systems are unique and require new methods.
Second, how is the software in a multiprocessor different from that in a uniprocessor-based system? On the one hand, we would hope that we could port an embedded application from a uniprocessor to a multiprocessor with a minimum of effort, if we use the proper abstractions to design the software. But there are some important, fundamental differences.
The first pervasive difference is that embedded multiprocessors are often heterogeneous, with multiple types of processing elements, specialized memory systems, and irregular communication systems. Heterogeneous multiprocessors are less common in general-purpose computing, and they make life considerably more challenging than in the embedded uniprocessor world. Heterogeneity presents several types of problems.
1) Getting software from several types of processors to work together can present challenges. Endianness is one common compatibility problem; library compatibility is another.
2) The development environments for heterogeneous multiprocessors are often loosely coupled. Programmers may have a hard time learning all the tools for all the component processors. It may be hard to debug problems that span multiple CPU types.
3) Different processors may offer different types of resources and interfaces to those resources. Not only does this complicate programming, but it also makes it harder to decide certain things at runtime.
Another important difference is that delays are much harder to predict in multiprocessors. Delay variations come from several sources: the true concurrency provided by multiprocessors, the larger size of multiprocessors, CPU heterogeneity, and the structure and use of the memory system.
Larger delays and variances in delays result in many problems, including:
1) Delay variations help expose timing-sensitive bugs that can be hard to test for and even harder to fix. A methodology that avoids timing bugs is the best way to solve concurrency-related timing problems.
2) Variations in computation time make it hard to use system resources efficiently and require more decisions to be made at runtime.
3) Large delays for memory accesses make it harder to execute code that performs data-dependent operations.
Scheduling a multiprocessor is substantially more difficult than scheduling a uniprocessor. Optimum scheduling algorithms do not exist for most realistic multiprocessor configurations, so heuristics must be used. Equally important, the information that one processor needs to make good scheduling decisions often resides far away on another processor.
Part of the reason that multiprocessor scheduling is hard is that communication is no longer free. Even direct signaling on a wire can take several clock cycles, and the memory system may take tens of clock cycles to respond to a request for a location in a remote memory.
Because information about the state of other processors takes too long to get, scheduling decisions must be made without full information about the state of those processors. Long delays also cause problems for the software processes that execute on top of the operating system.
Of course, low energy and power consumption are important in multiprocessors, just as in uniprocessors. Solutions to the challenges of embedded multiprocessor software must be found in forms that still permit energy-efficient techniques to be used.
Many of these problems boil down to resource allocation. Resources must be allocated dynamically to ensure that they are used efficiently. Simply determining on the fly which resources are available in a multiprocessor is hard enough; figuring out how to use those resources to satisfy requests is even harder. As discussed later in this series, middleware takes up the task of managing system resources across the multiprocessor.
Figure 6-1. Kernels in the multiprocessor.
Real-Time Multiprocessor Operating Systems
An embedded multiprocessor may or may not have a true multiprocessor operating system. In many cases, the various processors run their own operating systems, which communicate to coordinate their activities. In other cases, a more tightly integrated operating system runs across several processing elements (PEs).
A simple form of multiprocessor operating system is organized with a master and one or more slaves. The master PE determines the schedules for itself and all the slave processors. Each slave PE simply runs the processes assigned to it by the master.
This organizational scheme is conceptually simple and easy to implement. All the information needed for scheduling is kept by the master processor. However, this scheme is better suited to homogeneous architectures that have pools of identical processors.
Figure 6-1 above shows the organization of a multiprocessor operating system in relation to the underlying hardware. Each processor has its own kernel, known as the PE kernel. The kernels are responsible for managing purely local resources, such as devices that are not visible to other processors, and for implementing the decisions made about global resources.
The PE kernel selects the processes to run next and switches contexts as necessary. But the PE kernel may not decide entirely on its own which process runs next. It may receive instructions from a kernel running on another processing element.
The kernel that operates as the master gathers information from the slave PEs. Based on the current state of the slaves and the processes that want to run on the slaves, the master PE kernel then issues commands to the slaves about their schedules. The master PE can also run its own jobs.
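The master/slave division of labor described above can be sketched in a few lines of C. This is a minimal, hypothetical model (the names `master_assign`, `slave_next`, and the data structures are inventions for illustration, not any real kernel's API): the master keeps per-slave run queues, places each ready process on the least-loaded slave, and each slave kernel simply runs whatever it has been assigned.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLAVES 2
#define QUEUE_LEN  4

/* Hypothetical process descriptor: an id plus the slave PE it is assigned to. */
struct process {
    int id;
    int assigned_pe;   /* -1 until the master assigns it */
};

/* Per-slave run queue, written only by the master kernel. */
struct pe_queue {
    int pids[QUEUE_LEN];
    int count;
};

static struct pe_queue run_queues[NUM_SLAVES];

/* Master-side decision: place each ready process on the least-loaded
 * slave.  A real master would also weigh the (possibly stale) state
 * reported back by each slave kernel. */
static void master_assign(struct process *procs, int nprocs)
{
    for (int i = 0; i < nprocs; i++) {
        int target = 0;
        for (int pe = 1; pe < NUM_SLAVES; pe++)
            if (run_queues[pe].count < run_queues[target].count)
                target = pe;
        procs[i].assigned_pe = target;
        run_queues[target].pids[run_queues[target].count++] = procs[i].id;
    }
}

/* Slave-side kernel step: run (here, just pop) the next assigned process. */
static int slave_next(int pe)
{
    if (run_queues[pe].count == 0)
        return -1;
    return run_queues[pe].pids[--run_queues[pe].count];
}
```

The key property the sketch illustrates is that all scheduling state lives on the master: the slaves make no placement decisions of their own.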
One challenge in designing distributed schedulers is that communication is not free, so any processor that makes scheduling decisions about another PE usually has incomplete information about the state of that PE. When a kernel schedules its own processor, it can easily check on the state of that processor.
When a kernel must perform a remote read to check the state of another processor, the amount of information the kernel requests needs to be carefully budgeted.
Vercauteren et al. [Ver96] developed a kernel architecture for custom heterogeneous processors. As shown in Figure 6-2 below, the kernel architecture includes two layers: a scheduling layer and a communication layer.
Figure 6-2. Custom multiprocessor scheduler and communications.
The basic communication operations are implemented by interrupt service routines (ISRs), while the communication layer provides more abstract communication operations.
The communication layer provides two types of communication services. The kernel channel is used only for kernel-to-kernel communication – it has high priority and is optimized for performance. The data channel is used by applications and is more general purpose.
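The priority relationship between the two channels can be sketched as follows. This is not Vercauteren et al.'s actual implementation, just a minimal C model of the idea (all names are hypothetical): a dispatch routine always drains the kernel channel before it touches the application data channel.

```c
#include <assert.h>

#define CH_LEN 8

/* Hypothetical message queue for one communication channel. */
struct channel {
    int msgs[CH_LEN];
    int head, tail;
};

static struct channel kernel_ch;  /* kernel-to-kernel: high priority   */
static struct channel data_ch;    /* application data: general purpose */

static int ch_empty(const struct channel *c) { return c->head == c->tail; }

static void ch_send(struct channel *c, int msg)
{
    c->msgs[c->tail % CH_LEN] = msg;
    c->tail++;
}

static int ch_recv(struct channel *c)
{
    int m = c->msgs[c->head % CH_LEN];
    c->head++;
    return m;
}

/* Communication-layer dispatch: kernel messages always win, mirroring
 * the kernel channel's higher priority. */
static int next_message(void)
{
    if (!ch_empty(&kernel_ch))
        return ch_recv(&kernel_ch);
    if (!ch_empty(&data_ch))
        return ch_recv(&data_ch);
    return -1;  /* nothing pending */
}
```

Even if application data arrived first, a later kernel-channel message is delivered ahead of it, which is what keeps scheduling traffic from being delayed behind bulk data.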
Example: TI's OMAP multiprocessor configuration
Consider the operating systems and communications in TI's OMAP. The OMAPI standard defines some core capabilities for multimedia systems. One of the things that OMAPI does not define is the operating systems used in the multiprocessor. The TI OMAP family implements the OMAPI architecture. The figure below shows the lower layers of the TI OMAP, including the hardware and operating systems.
Operating Systems and Communication in the TI OMAP
The main unifying structure in OMAP is the DSPBridge, which allows the DSP and RISC processor to communicate. The bridge includes a set of hardware primitives that are abstracted by a layer of software. The bridge is organized as a master/slave system in which the ARM is the master and the C55x is the slave.
This fits the nature of most multimedia applications, where the DSP is used to efficiently implement certain key functions while the RISC processor runs the higher levels of the application.
The DSPBridge API implements several functions: it initiates and controls DSP tasks, exchanges messages with the DSP, streams data to and from the DSP, and checks the status of the DSP.
The OMAP hardware provides several mailbox primitives – separately addressable memories that can be accessed by both processors. In the OMAP 5912, two of the mailboxes can be written only by the C55x but read by both it and the ARM, while the other two can be written only by the ARM and read by both processors.
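The write-ownership rule described above can be modeled in a short C sketch. This is an illustrative simulation, not the OMAP register interface (the names `mbox_post`, `mbox_read`, and the struct layout are assumptions): each mailbox records the one processor allowed to write it, while either processor may read.

```c
#include <assert.h>

enum cpu { ARM, C55X };

/* Hypothetical model of one OMAP-style mailbox: a shared word plus
 * the single processor allowed to write it. */
struct mailbox {
    enum cpu writer;   /* only this CPU may post */
    int value;
    int full;          /* set on write, cleared on read */
};

/* Four mailboxes as in the OMAP 5912: two writable only by the C55x,
 * two writable only by the ARM; all readable by both processors. */
static struct mailbox boxes[4] = {
    { C55X, 0, 0 }, { C55X, 0, 0 }, { ARM, 0, 0 }, { ARM, 0, 0 },
};

/* Returns 0 on success, -1 if this CPU does not own the write side. */
static int mbox_post(enum cpu who, int idx, int value)
{
    if (boxes[idx].writer != who)
        return -1;
    boxes[idx].value = value;
    boxes[idx].full  = 1;
    return 0;
}

/* Either CPU may read; returns -1 when the mailbox is empty. */
static int mbox_read(int idx)
{
    if (!boxes[idx].full)
        return -1;
    boxes[idx].full = 0;
    return boxes[idx].value;
}
```

Giving each direction of communication its own write-only mailboxes means neither processor can corrupt a message in flight from the other, which is exactly why the hardware partitions them this way.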
Next in Part 2: Multiprocessor Scheduling
Used with the permission of the publisher, Newnes/Elsevier, this series of five articles is based on copyrighted material from “High-Performance Embedded Computing,” by Wayne Wolf. The book can be purchased online.
Wayne Wolf is professor of electrical engineering at Princeton University. Prior to joining Princeton, he was with AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.
[Ver96] S. Vercauteren, B. Lin, and H. De Man, “A strategy for real-time kernel support in application-specific HW/SW embedded architectures,” in Proceedings, 33rd Design Automation Conference, ACM Press, 1996, pp. 678–682.
For more about multiprocessing issues on Embedded.com, go to More On Multicores and Multiprocessing. To read excerpts from other recent books on embedded hardware and software, go to More on The Embedded Bookshelf.