What's different about multiprocessor software? (Part 4) - Embedded.com

The services available on an embedded multiprocessor can be provided by the operating system or by other software packages; either way, they are the building blocks for applications. Services may include relatively low-level operations, I/O device handling, interprocessor communication, and scheduling. They may also include higher-level services.

Many services are provided by middleware – a term coined for general-purpose systems to describe software that provides services for applications in distributed systems and multiprocessors. Middleware is not the application itself, nor does it describe primitive services provided by the operating system.

Middleware may provide fairly generic data services, such as data transport among processors that may have different endianness or other data formatting issues. Middleware can also provide application-specific services. Middleware is used in embedded systems for several purposes.

1) It provides basic services that allow applications to be developed more quickly. Those services may be tied to a particular processing element or an I/O device. Alternatively, they may provide higher-level communication services.

2) It simplifies porting applications from one embedded platform to another. Middleware standards are particularly useful since the application itself can be moved to any platform that supports the middleware.

3) It ensures that key functions are implemented efficiently and correctly. Rather than rely on users to directly implement all functions, a vendor may provide middleware that showcases the features of the platform.
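
As one illustration of the generic data services mentioned above, the following C sketch shows how a transport layer might move a 32-bit value between processors of different endianness by always placing it on the wire in big-endian order. The function names put_u32_be and get_u32_be are hypothetical and do not come from any particular middleware package.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write v into buf[0..3] in big-endian order.
   Shifts operate on the value, not on memory, so the result does not
   depend on the byte order of the host processor. */
void put_u32_be(uint8_t *buf, uint32_t v)
{
    assert(buf != NULL);
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
}

/* Hypothetical helper: read a big-endian 32-bit value from buf[0..3]. */
uint32_t get_u32_be(const uint8_t *buf)
{
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

Calling put_u32_be(buf, 0x11223344) stores the bytes 0x11, 0x22, 0x33, 0x44 in that order on any host, and get_u32_be(buf) recovers 0x11223344 on the receiving side.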

Middleware and resource allocation
One of the key differences between middleware and software libraries is that middleware manages resources dynamically. In a uniprocessor, the operating system manages the resources on the processor (for example, the CPU itself, the devices, etc.) and software libraries perform computational tasks based on those allocations.

In a distributed system or multiprocessor, middleware allocates system resources, giving requests to the operating systems on the individual processors to implement those decisions.

One reason that resources need to be allocated at runtime, not just statically by designer decisions, is that the tasks performed by the system vary over time.

If we statically allocate resources, we end up with a drastically overdesigned system that is not only very expensive but burns much more power. Dynamic allocation lets us make more efficient use of resources (and hopefully manage cases in which we do not have enough resources to properly handle all the current requests).

Embedded systems increasingly employ middleware because they must perform a complex set of tasks whose resource requirements cannot be easily evaluated statically.

A key trade-off is generality versus efficiency. General-purpose computing systems are built with software stacks that provide useful abstractions at different levels of granularity. Those stacks are often deep and provide a rich set of functions.

The constraints on embedded computing systems – power/energy consumption, memory space, and real-time performance – often dictate that we design software stacks more carefully.

Embedded system designers have experimented with a variety of middleware architectures. Some systems make liberal use of general standards: Internet Protocol (IP), CORBA, and so on. Other systems define their own services and support. The extent to which standard services versus custom services are used to build middleware is a key design decision for embedded multiprocessors and distributed systems.

Standards-based services
A number of middleware systems have been built using various combinations of standard services; the Internet Protocol is one that's often used. CORBA has also been used as a model for distributed embedded services.

The Common Object Request Broker Architecture (CORBA) [Obj06] is widely used as an architecture for middleware services. It is not itself a specific protocol; rather it is a metamodel that describes object-oriented services.

CORBA services are provided by objects that combine functional interfaces as well as data. An interface to an object is defined in an interface definition language (IDL). The IDL specification is language-independent and can be implemented in many different programming languages.

This means that the application and the object need not be implemented in the same programming language. IDL is not a complete programming language; its main job is to define interfaces. Both objects and the variables carried by objects have types.

Figure 6-18. A request to a CORBA object.

As shown in Figure 6-18 above, an object request broker (ORB) connects a client to an object that provides a service. Each object instance has a unique object reference.

The client and object need not reside on the same machine; a request to a remote machine can invoke multiple ORBs. The stub on the client side provides the interface for the client, while the skeleton is the interface to the object.

The object logically appears to be a single entity, but the server may keep a thread pool running to implement object calls for a variety of clients. Because the client and the object use the same protocol, as defined by the IDL, the object can provide a consistent service independent of the processing element on which it is implemented.

CORBA hides a number of implementation details from the application programmer. A CORBA implementation provides load balancing, fault tolerance, and a variety of other capabilities that make the service usable and scalable.

Real-time CORBA
RT-CORBA [Sch00] is a part of the CORBA specification that describes CORBA mechanisms for real-time systems. RT-CORBA is designed for fixed-priority systems, and it allows priorities to be defined in either CORBA or native types.

A priority model describes the priority system being used; it may be either server-declared or come from the client. The server may also declare priority transforms that can transform the priority based on system state.
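
To make the distinction between CORBA and native priorities concrete, here is a minimal C sketch of a linear mapping from the CORBA priority range (0 to 32767 for RTCORBA::Priority) onto a native range. The function corba_to_native and the native range used in the usage note are assumptions for illustration; an actual RT-CORBA implementation supplies its own priority mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CORBA_PRIO_MIN 0
#define CORBA_PRIO_MAX 32767  /* upper bound of RTCORBA::Priority */

/* Hypothetical mapping: scale a CORBA priority linearly onto the
   native range [native_min, native_max]. Returns 0 on success and
   stores the result in *native_out; returns -1 on invalid input. */
int corba_to_native(int corba_prio, int native_min, int native_max,
                    int *native_out)
{
    assert(native_out != NULL);
    if (corba_prio < CORBA_PRIO_MIN || corba_prio > CORBA_PRIO_MAX ||
        native_min > native_max)
        return -1;

    /* 64-bit arithmetic avoids overflow for wide native ranges. */
    int64_t span = (int64_t)native_max - native_min;
    *native_out = native_min +
                  (int)(((int64_t)corba_prio * span) / CORBA_PRIO_MAX);
    return 0;
}
```

With an assumed native range of 1 to 99, CORBA priority 0 maps to native priority 1 and CORBA priority 32767 maps to 99. A priority transform, as described above, could be layered on top by adjusting the CORBA priority before this mapping is applied.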

A thread pool model helps the server control the real-time responsiveness of the object implementations. The threads in the pool may be divided into lanes, with each lane given priority characteristics. Thread pools with lanes allow the server to manage responsiveness across multiple services.

RT-CORBA allows clients to explicitly bind themselves to server services. Standard CORBA provides only implicit binding, which simplifies programming but may introduce unacceptable and variable latency for real-time services.

Wolfe et al. [Wol99] developed the Dynamic Real-Time CORBA system to support real-time requests. A real-time daemon implements the dynamic aspects of the real-time services.

Clients specify constraints during execution using a type of method known as timed distributed method invocation (TDMI). These constraints can describe deadline and importance. The method invocation is handled by the runtime daemon, which stores this information in a data structure managed by the daemon.

Server objects and other entities can examine these characteristics. The kernel also provides a global time service that can be used by system objects to determine their time relative to deadlines. A latency service can be used with objects to determine the times required for communication. It can provide estimated latencies, measured latencies, or analytical latencies. These bounds can be used to help determine schedules and more.

A priority service records the priorities for system objects. A real-time event service exchanges named events. The priorities of events are determined by the priorities of the producers and consumers. The deadline for an event may be either relative to the global clock or relative to an event. The event service is implemented using IP multicasting. The real-time daemon listens to a specified multicast group for events and sends multicast messages to generate events. Each event has a unique identification number.

Middleware for multiprocessor fault tolerance
ARMADA [Abd99] is a middleware system for fault tolerance and quality-of-service. It is organized into three major areas: real-time communications, middleware for group communication and fault tolerance, and dependability tools.

The ARMADA group identifies three major requirements for QoS-oriented communication. First, different connections should be isolated such that errors or malicious behavior on one channel do not starve another channel. Second, service requests should be allowed to differentiate themselves from each other based on urgency. Third, the system should degrade gracefully when overloaded. The communication guarantees are abstracted by a clip, which is an object that guarantees to have delivered a certain number of packets by a certain time.

Each clip has a deadline that specifies the maximum response time for the communication system. The real-time channels that provide these services provide an interface similar to a UNIX socket.

The Real-Time Connection Ordination Protocol manages requests to create and destroy connections. A clip is created for each end of the channel. Each clip includes a message queue at the interface to the objects, a communication handler that schedules operations, and a packet queue at the interface to the channel. The communication handler is scheduled using an earliest-deadline-first (EDF) policy.
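
The earliest-deadline-first selection used by the communication handler can be sketched in C as follows. This is only an illustration of EDF selection over a queue of pending messages; the pending_msg type and the edf_pick function are hypothetical names, not ARMADA's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for a message waiting in a clip's queue. */
typedef struct {
    uint32_t id;        /* message identifier */
    uint64_t deadline;  /* absolute deadline, in clock ticks */
} pending_msg;

/* Return the index of the pending message with the earliest deadline,
   or -1 if the queue is empty. Ties go to the lowest index, which
   preserves arrival order among messages with equal deadlines. */
int edf_pick(const pending_msg *q, size_t n)
{
    assert(q != NULL || n == 0);
    if (n == 0)
        return -1;

    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (q[i].deadline < q[best].deadline)
            best = i;
    }
    return (int)best;
}
```

A linear scan is adequate for short queues; a handler with many pending operations would more likely keep them in a priority queue ordered by deadline.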

ARMADA supports a group multicast service that distributes timed atomic messages. An admission control service and group membership service manage the configuration and operation of the service. The client side watches the system state and sends messages to the server to update its copy of the state.

A real-time primary-backup service allows state to be replicated in order to manage fault tolerance. The service provides two types of consistency: external consistency with copies of the system kept on servers and internal consistency between different objects in the system. The backup service is built on top of the UDP protocol for IP.

The ARMADA project developed a message-level fault injection tool for analysis of reliability properties. A fault injection layer is inserted between the communication system and the protocol to be tested. The fault injection layer can inject new messages, filter messages, or delay messages.
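
A fault injection layer of this kind can be pictured as a per-message decision made between the two layers. The C sketch below covers only the filter and delay cases (injecting new messages is left out), and every name in it (fi_action, fi_rule, fi_decide) is hypothetical rather than taken from the ARMADA tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What the layer does with a message passing through it. */
typedef enum { FI_PASS, FI_DROP, FI_DELAY } fi_action;

/* Hypothetical rule: apply an action to all messages of one type. */
typedef struct {
    uint16_t  msg_type;     /* message type this rule matches */
    fi_action action;       /* what to do with matching messages */
    uint32_t  delay_ticks;  /* used only when action is FI_DELAY */
} fi_rule;

/* Decide the fate of one message. The first matching rule wins;
   messages that match no rule pass through unchanged. For FI_DELAY
   the requested delay is written to *delay_out, otherwise 0. */
fi_action fi_decide(const fi_rule *rules, size_t n,
                    uint16_t msg_type, uint32_t *delay_out)
{
    assert(delay_out != NULL);
    *delay_out = 0;
    for (size_t i = 0; i < n; i++) {
        if (rules[i].msg_type == msg_type) {
            if (rules[i].action == FI_DELAY)
                *delay_out = rules[i].delay_ticks;
            return rules[i].action;
        }
    }
    return FI_PASS;
}
```

Because the decision is driven by a rule table, a test campaign can change the faults it injects without modifying either the communication system or the protocol under test.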

MPI communication middleware
MPI (Message Passing Interface) is a specification for a middleware interface for multiprocessor communication. (MPICH is one well-known implementation of MPI.) It was designed to make it easier to develop and port scientific computing applications on homogeneous processors. It is starting to see some use in embedded computing systems.

MPI provides a rich set of communication services based on a few communication primitives. MPI does not itself define the setup of the parallel system: the number of nodes, mapping of processes or data to nodes, and so on. That setup is provided before the MPI activities start. A minimal MPI program looks something like this:

int r, s;                            /* rank of this node, number of nodes */
MPI_Init(&argc, &argv);              /* initialize */
MPI_Comm_rank(MPI_COMM_WORLD, &r);   /* get the index of this node */
MPI_Comm_size(MPI_COMM_WORLD, &s);   /* get the total number of nodes */
MPI_Finalize();                      /* clean up */

This program simply sets up the system, gets the name (rank) of this node and the total system size, and then leaves MPI. The values returned in r and s were fixed by the system setup before MPI started. A program can be written so that the number of nodes and the assignment of a program to a node can change dramatically without rewriting the program.

The basic MPI communication functions are MPI_Send() and MPI_Recv(). These provide point-to-point, blocking communication. MPI allows these routines to include a data type so that the application can easily distinguish several types of data.

MPI allows the program to create groups of processes. Groups can be defined either by name or by topology. The groups can then perform multicast and broadcast.

The MPI standard is large – it includes about 160 functions. However, a minimal MPI system can be provided with only six functions: MPI_Init(), MPI_Comm_rank(), MPI_Comm_size(), MPI_Send(), MPI_Recv(), and MPI_Finalize(). The other functions are implemented in terms of these primitives.

Figure 6-19. A software stack and services in an embedded multiprocessor.

System-on-chip services
The advent of systems-on-chips (SoCs) has resulted in a new generation of custom middleware that relies less on standard services and models. SoC middleware has been designed from scratch for several reasons.

First, these systems are often power or energy constrained, and any services must be implemented very efficiently. Second, although the SoC may have to work with standard services outside the chip, it is not constrained to use standards within the chip. Third, today's SoCs are composed of a relatively small number of processors. The 50-processor systems of tomorrow may in fact make more use of industry-standard services, but today's systems-on-chips often use customized middleware.

Figure 6-19 above shows a typical software stack for an embedded SoC multiprocessor. This stack has several elements:

1) The hardware abstraction layer (HAL) provides a uniform abstraction for devices and other hardware primitives. The HAL abstracts the rest of the software from both the devices themselves and from certain elements of the processor.

2) The real-time operating system controls basic system resources such as process scheduling and memory.

3) The interprocess communication layer provides abstract communication services. It may, for example, use one function for communication between processes whether they are on the same or different processing elements (PEs).

4) The application-specific libraries provide utilities for computation or communication specific to the application. The application code uses these layers to provide the end service or function.
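
The idea in item 3 of a single communication function that works whether the destination process is local or remote can be sketched in C as below. This is a simulation under stated assumptions, not a real SoC API: process_t, transport, and ipc_send are hypothetical names, and both transports end in a plain memory copy so that the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAILBOX_BYTES 64

/* Hypothetical descriptor for a destination process. */
typedef struct {
    int           pe;                      /* PE hosting the process */
    unsigned char mailbox[MAILBOX_BYTES];  /* receive buffer */
    size_t        len;                     /* bytes currently held */
} process_t;

typedef enum { VIA_SHARED_MEMORY, VIA_INTERCONNECT } transport;

/* Send len bytes to dst. The caller uses the same function regardless
   of where dst runs; the layer picks the transport and reports it in
   *used. Returns 0 on success, -1 if the message does not fit. */
int ipc_send(int local_pe, process_t *dst, const void *data, size_t len,
             transport *used)
{
    assert(dst != NULL && data != NULL && used != NULL);
    if (len > MAILBOX_BYTES)
        return -1;

    *used = (dst->pe == local_pe) ? VIA_SHARED_MEMORY : VIA_INTERCONNECT;

    /* Simulation only: a real layer would hand the remote case to the
       on-chip interconnect or a DMA engine instead of copying here. */
    memcpy(dst->mailbox, data, len);
    dst->len = len;
    return 0;
}
```

The application code never branches on the location of the destination, so processes can be remapped to different PEs without changing the callers.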

Paulin et al. [Pau02a; Pau06] developed the MultiFlex programming environment to support multiple programming models that are supported by hardware accelerators. MultiFlex supports both distributed system object component (DSOC) and symmetric multiprocessing (SMP) models. Each is supported by a high-level programming interface.

Figure 6-20 below shows the MultiFlex architecture and its DSOC and SMP subsystems. Different parts of the system are mapped onto different parts of the architecture: control functions can run on top of an OS on the host processor; some high-performance functions, such as video, can run on accelerators.

Some parallelizable operations can go on a hardware multithreaded set of processors; and some operations can go into DSPs. The DSOC and SMP units in the architecture manage communication between the various subsystems.

Figure 6-20. Object broker and SMP concurrency engine in MultiFlex. From Paulin et al. [Pau06] © 2006 IEEE.

The DSOC model is based on a parallel communicating object model from client to server. It uses a message-passing engine for interprocess communication.

When passing a message, a wrapper on the client side must marshal the data required for the call. Marshaling may require massaging data types, moving data to a more accessible location, and so on. On the server side, another wrapper unmarshals the data to make it usable to the server.

The object request broker coordinates object communication. It must be able to handle many objects that execute in parallel. A server farm holds the resources to execute a large number of object requests. The ORB matches a client request to an available server. The client stalls until the request is satisfied.

The server takes a request and looks up appropriate object servers in a table, then returns the results when they are available. Paulin et al. [Pau06] report that 300 MHz RISC processors can perform about 35 million object calls per second.

The SMP model is implemented using software on top of a hardware concurrency engine. The engine appears as a memory-mapped device with a set of memory-mapped addresses for each concurrency object. A protected region for an object can be entered by writing the appropriate address within that object's address region.

ENSEMBLE [Cad01] is a library for large data transfers that allows overlapping computation and communication. The library is designed for use with an annotated form of Java that allows array accesses and data dependencies to be analyzed for single program, multiple data execution.

The library provides send and receive functions, including specialized versions for contiguous buffers. The emb_fence() function handles the pending sends and receives. Data transfers are handled by the DMA system, allowing the programs to execute concurrently with transfers.

Middleware and Services Architecture of the TI OMAP

The figure above shows the layers of software in an OMAP-based system. The DSP provides a software interface to its functions. The C55x supports a standard, known as eXpressDSP, for describing algorithms. This standard hides some of the memory and interfacing requirements of algorithms from application code.

The DSP resource manager provides the basic API for the DSP functions. It controls tasks, data streams between the DSP and CPU, and memory allocation. It also keeps track of resources such as CPU time, CPU utilization, and memory.

The DSPBridge in the OMAP is an architecture-specific interface. It provides abstract communication but only in the special case of a master CPU and a slave DSP. This is a prime example of a tailored middleware service.

The Nostrum network-on-chip (NoC) is supported by a communications protocol stack [Mil04a]. The basic service provided by the system is to accept a packet with a destination process identifier and to deliver it to its destination.

The system requires three compulsory layers: the physical layer is the network-on-chip; the data link layer handles synchronization, error correction, and flow control; and the network layer handles routing and the mapping of logical addresses to destination process IDs. The network layer provides both datagram and virtual circuit services.

Sgroi et al. [Sgr01] based their on-chip networking design methodology on the Metropolis methodology. They successively refine the protocol stack by adding adapters. Adapters can perform a variety of transformations: behavior adapters allow components with different models of computation or protocols to communicate; channel adapters adapt for discrepancies between the desired characteristics of a channel, such as reliability or performance, and the characteristics of the chosen physical channel.

Benini and De Micheli [Ben01b] developed a methodology for power management of networks-on-chips. They advocate a micronetwork stack with three major levels: the physical layer; an architecture and control layer that consists of the OSI data link, network, and transport layers; and a software layer that consists of system and application software.

At the data link layer, the proper choice of error-correction codes and retransmission schemes is key. The media access control algorithm also has a strong influence on power consumption. At the transport layer, services may be either connection-oriented or connectionless. Flow control also has a strong influence on energy consumption.

To read Part 1, go to The role of the operating system.
To read Part 2, go to Multiprocessor Scheduling.
To read Part 3, go to Event-driven multiprocessor scheduling analysis.

Used with the permission of the publisher, Newnes/Elsevier, this series of five articles is based on copyrighted material from “High-Performance Embedded Computing,” by Wayne Wolf. The book can be purchased online.

Wayne Wolf is professor of electrical engineering at Princeton University. Prior to joining Princeton he was with AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems.

References:
[Obj06] Object Management Group, CORBA Basics, 2006.
[Sch00] Douglas C. Schmidt and Fred Kuhns, “An overview of the real-time CORBA specification,” IEEE Computer, June 2000.
[Wol99] Victor Fay Wolfe, et al., “Expressing and enforcing timing constraints in a dynamic real-time CORBA environment,” Journal of Real-Time Systems, 16, 1999.
[Abd99] T. Abdelzaher, et al., “ARMADA middleware and communications services,” Journal of Real-Time Systems, 16, 1999.
[Pau02a] Pierre G. Paulin and Miguel Santana, “Flexware: a retargetable embedded-software development environment,” IEEE Design and Test of Computers, July/August 2002.
[Pau06] Pierre G. Paulin, et al., “Parallel programming models for a multi-processor SoC platform applied to high speed traffic management,” IEEE Transactions on VLSI Systems, 2006.
[Cad01] Sidney Cadot, et al., “ENSEMBLE: a communications layer for embedded multiprocessor systems,” in Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems, ACM Press, 2001.
[Mil04a] Mikael Millberg, et al., “The Nostrum backbone - a communications protocol stack for networks on chips,” in Proceedings of the VLSI Design Conference, January 2004.
[Sgr01] M. Sgroi, et al., “Addressing the system-on-a-chip interconnect woes through communications-based design,” Proceedings of the Design Automation Conference, ACM Press, 2001.
[Ben01b] Luca Benini and Giovanni De Micheli, “Powering Networks on Chips,” in Proceedings of the 14th Annual Symposium on System Synthesis, IEEE, 2001.
