MPSoC platforms are now commonplace in embeddedcomputing, ranging from wireless and wireline communication , tomultimedia [7, 18], signal processing  and graphics . From thesoftware view-point, there is little consensus on the programmer viewoffered in support of these highly parallel platforms. In many cases,very little support is offered and the programmer is in charge ofexplicitly managing data transfers and synchronization. Clearly, thisapproach is extremely labor-intensive, error-prone and leads to poorlyportable software. For this reason MPSoC platform vendors are devotingan increasing amount of effort to offering more abstract programmerviews through parallel programming libraries and their applicationprogramming interfaces (APIs).
The most common form of support is in essence anincremental evolution of standard sequential programming. Theapplication developer usually start from a single-threaded applicationdeveloped for a uni-processor. Porting the application to amulti-processor target is a work-intensive activity, which can besummarized as follows. First, the single-threaded implementation isanalyzed in terms of its main functional bocks and their datadependencies. This analysis has three main objectives, which can besummarized as follows:
1. Identifying the critical computational kernels ofthe applications.This activity is often supported by profiling tools.
2. Identifying the data-flow between computational kernels. This isdone by hand, exploiting the knowledge of the data structures andthe application flow.
3. Applying a workload partitioning strategy that splits theapplicationin a sub-blocks that can be executed in parallel. This is themost difficult task, as a successful partitioning fully depends on thedeveloper's ingenuity and experience.
The last two steps are far from trivial. In manycases, sequential applications do not show significant potential forparallelization, simply because they have been written in a non-modularfashion.
A typical example of this problem is the common caseof a single loop that creates most of the application workload. In thiscase, the loop must be rewritten, for instance as a set of loopsoperating on disjoint data sets, in order to extract somecoarse-grained parallelism. More in general, many globaltransformations can be applied at this stage, such as merging orsplitting loops, choosing suitable data-structures, replicating data,etc.
Fortunately, there are several application domainswhere simple coarsegrained parallelization strategies are available.For example many multimedia applications operate on a set ofindependent frames (e.g., a sequence of audio samples, independentvideo frame overlays, etc.), and frame- by- frame parallelization ispossible. For instance, if encryption is performed on a data stream onfixed-sized data blocks, it is possible to instantiate multipleidentical encryption tasks, each one operating in parallel on adifferent block.
Generally, application tasks are not fullyindependent, but they share and exchange data, and need to synchronize.Parallel programming API must therefore provide support for datasharing and synchronization.
Programming abstraction is a key requirement: eventhough it is in principle possible to fully expose the hardware to theprogrammer and let her/him use low-level software interfaces tosynchronization and communication primitives, programmer productivitywould be extremely low, and the resulting software would be completelyarchitecture-specific and nonportable.
Thus, parallel programming libraries provide anabstract view of the underlying architecture, througharchitecture-neutral high-level primitives. The two most commonabstractions offered by current libraries aremessage passing (MP) and shared memory (SM).
Message passing and shared memory correspond to “abstract machines,''which are generally a much simplified model of the underlying hardware.The message-passing abstract machine has a number of independentprocessors with disjoint memory spaces. Communication betweenprocessors happens only via explicit messages (send”receive pairs).Special messages can be used for synchronization purposes.
Message passing has first been studied in thehigh-performance multi-processor community, where many techniques havebeen developed for reducing message delivery latency. Several authorsdeveloped light-weight software wrappers (bypassing the operatingsystem) for message-passing interface to reduce the setup cost (e.g.,).
Others have moved message-passing primitives inhardware, ranging from dedicated instructions (e.g., ) to complexco-processors for accelerating message passing (e.g., ). Dedicatedinstructions coupled to a low-latency communication network supportfine granularitymessages, enabling massive amounts of parallelism.
Message-passing implementation in MPSoCs Messagepassing has also entered the world of embedded MPSoC platforms. In thiscontext it is most commonly implemented on top of a shared memoryarchitecture (e.g., TI OMAP , Philips Eclipse , Toshiba Kawasaki, Philips Nexperia ). This approach is a consolidated one alsoin the traditional high-performance parallel computing field, andnumerous implementations of message-passing standards (e.g., MPI) ontoshared memory machines have been reported .
The main issue in MPSoC architectures is that sharedmemory is likely to become a performance/energy bottleneck, even whenDMAs are used to increase the transfer efficiency. This statement iseasily understood if we think that transferring a message from aprocessor to another using a shared memory requires at least one readand one write to that memory.
Let us consider an architecture with one singleshared memory slave on the interconnect. Clearly, since every messagedelivery (i.e., a send”receive pair) require writing then reading themessage (payload and some amount of wrapping information), allprocess-to-processor messaging traffic will have to go through a singlememory, which inevitably becomes the system's bottleneck. Note that anNoC interconnect would not help in this case, because the architecturewill suffer from destination contention, whichcannot be alleviated by any scalable interconnect.
Synchronization poses more subtle and even moredifficult challenges to the efficiency of message passing based onshared memory. This can bebetter understood through an example.
Example: Shared memory message passing .
Consider a send”receive pair where the sender is late, and thereceiver processor is waiting for a message. Let us assume that thereceiver waits on a spin lock, that is it continuously polls asynchronization variable (e.g., a semaphore mapped device), waiting forthe message to arrive. Polling will create a significant amount of readtraffic which will contribute to interconnect saturation.
Moving from active waiting to interrupt-based wakeupwill alleviate the problem, but it will increase the latency of messagedelivery, because of interrupt handling overhead. Moreover,synchronization operations requires locked (atomic) transactions overthe communication fabric. This not only further degrades theperformance, but also limits the scalability of the aboveimplementationsfor more advanced communication architectures.
Several authors have recently proposed support formessage passing on a distributed memory architecture. An interestingcase-study is presented in Ref.  where a turbo-coder is mapped on amessage-passing architecture. On each processing tile an I/O-device isresponsible for transmitting/receiving messages from the communicationarchitecture. Buffer underflow/overflow of the I/O-device has to beavoided in software.
In Ref. , more generic support for messagepassing is provided. A light-weight co-processor, called a memoryserver access point, is added to each processor. The access point linksthe processor to its own local memory, but also to the remote memories.The synchronization is left to the message-passing protocol. The aboveapproaches have limited support for synchronization and limitedflexibility in matching the application to the communicationarchitecture. For example, in Ref.  remote memories are alwaysaccessed with a DMA-like engine even though this is notthe most efficient strategy for small message sizes.
Hardware acceleration in support of message passingis also proposed in the StepNP architecture . In StepNP ahigh-level software abstraction is supported, namely message passingbased on parallel communicating objects. From the programmer'sviewpoint, a message is sent by calling a method of a remote object,similarly to what is done in high-level distributed programmingenvironments like JAVA, CORBA and DCOM.
Most of the operation related to translating a remotemethod invocation in a physical data-transfer (i.e., marshaling andun-marshaling, packetization, etc.) are implemented by hardwareaccelerators in the processor's network interfaces, and therefore theyhave a very low cost in terms of execution cycles. However, all remotemethod calls and returns in a system are managed by a single objectrequest broker (ORB), implemented with dedicated hardware engine.
This design decision, needed to support thehigh-level programming abstraction of the STEP-NP platform, leads to anintuitive system bottleneck, namely the ORB, which has to manage allthe traffic related to all the messages sent and received by all thetasks in thesystem.
Programming a message-passing NoC platform To givethe reader a more concrete understanding of the issues and challengesin supporting message passing for MPSoC, we now discuss in detail aspecific case study. An implementation based on the distributed memoryarchitecture described in Part4 was developed, leveraging scratchpad memories and localhardware semaphores.
Moreover, full porting of standard message-passinglibraries traditionally used in the parallel computing domain mightcause a significant overhead in resource-constrained MPSoCs. For thisreason, we developed a simple but highly optimized message-passinglibrary, custom-tailored for the scratch-pad-based distributed memoryarchitecture we are considering. The most importantfunctions are listed in Table. 7.1,below.
Messages are managed via message queues. Toinstantiate a queue, both the producer and consumer must run aninitialization routine. To initialize the producer, sq_init_ producer is called. Ittakes as arguments the ID of the consumer's, the message size, thenumber of messages in the queue and a binary value. The last argumentspecifies whether the producer should poll the producer's semaphore orsuspend itself until an interrupt is generated.
|Table7.1. Simple message-passing library of APIs|
The consumer is initialized with sq_init_consumer . It requires theID of the consumer's itself, the location of the read buffer and thepoll/suspend flag. In detail, the second parameter indicate the addresswhere the function sq_read will store the message transferred from the producer's message queue.This address can be located eitheron the private memory or on the scratchpad memory.
The producer sends a message with the sq_write(_dma) function. Thisfunction copies the data from *source to a free message block inside the queue buffer. This transfer caneither be carried out by the core or via a DMA transfer (x_dma) . Instead of copying thedata from *source into amessage block, the producer can decide to directly generate data in afree message block. The sq_getToken_write returns a free block in the queue's buffer on which the producercan operate. When data is ready, the producer should mark itsavailability to the consumer with sq_putToken_write .The consumer transfers a message from the producer's queue to a privatemessage buffer with voidsq_read(_dma) . Again, the transfer can beperformed either by a local DMA or by the core itself.
This simple messaging library thus supports:
1. either processor or DMA-initiated data transfersto remotememories,
2. either polling-based or interrupt-based synchronization,
3. flexible allocation of the consumer's message buffer, that is onscratchpad or on a private memory at a higher level of the hierarchy.
Thanks to the high-level APIs, this flexibility canbe effectively used to optimize message passing based applications.Message-passing performance analysis
The message-passing library is carefully optimizedfor the distributed memory architecture showed in Figure 7.3 in Part 4. The libraryimplementationis very lightweight, since it is based on C macros that do notintroduce significant overhead. A producer”consumer exchange of dataprogrammed via the library showed just a 1% overhead with respect tothe programmer'sdirect transfer control.
More interestingly, the library can be used tocompare alternative implementations of message-passing architecturesand for fine architectural tuning. First, we should question whetherthis customized solution (message-oriented architecture and MP library)is competitive with an alternative solution where message passing isimplemented on top ofshared memory.
We are interested in the average time required for aconsumer to obtain a message available on the producer's queue (Figure. 7.4, below ), under theassumption that the queue dimension is always sized big enough tocontain all transferred messages. In other words, the destination queuenever becomes full. We measure the transfer time per message (averagedover 20 messages) and explore different message sizes using either theprocessors or DMAsfor transfers.
|Figure7.4. Analysis of read execution cycles in a producer consumer benchmark.|
Performance of the distributed memory solution can bededuced from Figure 7.4 above by looking at “Scratch Queue Read ''family curves. The comparison iswith “Shared Queue Read '' (asolution wherein the communication queuebetween two processors is stored in shared memory, and thus reflectsmessage-passing implementation on top of a shared-memory architecture).
We can observe that, the longer the message (seeX-axis ) the more theapplication execution time is dominated by memory transfers. Hence,performance gains of “Scratch QueueRead '' become larger, due to theuse of a fast, local scratchpad memory instead of a slow shared memory.Employing a DMA engine for “SharedQueue Read '' does not help a lot.Please note that there are three architectural variants for “ScratchQueue Read '':
(i) basic version ,where data are moved fromproducer's scratchpad to the consumer's private memory by the consumerprocessor itself via single transfers;
(ii) “local buffer'' version ,where data are moved tothe consumer's scratchpad instead of its private memory, thus cuttingdown on memory write costs;
(iii) “local buffer and DMA''version , where data aremoved using dedicated DMA engines, thus making data transfers betweenscratchpad memories more efficient.
The fixed setup cost for programming DMA can clearlybe observed for zero-sized messages. As soon as the message sizeexceeds 25 words, the increased transfer efficiency compared toexplicit copying outweighs the setup cost. From these results, weclearly observe the inefficiency of message-passing implementation ontop of a shared-memory architecture.
|Table7.2. Different message-passing implementations|
We continue our analysis to show that flexibility inthe message-passing library can improve performance results. In fact,the library can exploit several features of the underlying hardwaresuch as processor-versus DMA-driven data transfers or interrupt versusactive polling. The proper architecture configuration can be selectedbased on the application.
Let us consider an application consisting of afunctional pipeline of eight matrix multiplication tasks. Each stage ofthis pipeline takes a matrix as input, multiplies it with a localmatrix and passes the result to the next stage. We iterate the pipeline20 times. We run the benchmark respectively on an architecture witheight and four processors.
In the first case, only one task is executed on eachprocessor, while in the second, we added concurrency by scheduling twotasks on each core. First, we compare three different configurations ofthe message-oriented architecture (Table7.2, above ). We execute the pipeline for two matrix sizes:8×8 and 32×32 elements.
In the latter case, longer messages are transmitted.Analyzing the results in Figure 7.5below, referred to the case where one task runs on eachprocessor, we can observe that a DMA is not always beneficial in termsof throughput. For small messages, the overhead for setting up the DMAtransfer is not justified.
In case of larger messages, the DMA-based solutionoutperforms the processor-driven transfers. Instead, employing a DMAalways leads to an energy reduction, even if the duration of thebenchmark is longer, due to a more power efficient data transfer.Please note that energy of all system components (DMA included) isaccounted for in the energy plot.
|Figure7.5. Comparison of message-passing implementations in a pipelinedbenchmark with eight cores from Table 7.2.|
Furthermore, the way a consumer is notified of thearrival of a message plays an important role, performance- andenergy-wise. The consumer has to wait until the producer releases theconsumer's semaphore. With a single task per processor (Figure 7.5, above ), theoverhead related to the interrupt routine can slows down the system,depend on the communication versus computation ratio and polling is, ingeneral, more efficient. On the contrary, with two tasks per processor (Figure 7.6, below , referred tomatrices of 8×8 elements) the interrupt-based approach performsbetter.
In this case, it is more convenient to suspend thetask because the concurrent task scheduled on the same processor is in”ready'' state. Instead, with active polling, the processor is stalledand the other task cannot be scheduled.
This simple application case study shows theimplementation of message passing should be matched with application'sworkload. This is only possible with a flexible message-passing library.
|Figure7.6. Task scheduling impact on synchronization in a pipelined benchmarkwith four cores from Table 7.2.|
The key property of the shared-memory approach is that communicationoccurs implicitly as a result of conventional memory accessinstructions. Processes have address spaces that can be configured sothat portions ofthem are shared, that is, mapped to a common physical location.
Processesbelonging to a parallel application communicate with each other byreading and writing shared variables. The presence of a memoryhierarchywith locally cached data is a major source of complexity inshared-memoryapproaches.
Broadly speaking, approaches for solving thecache-coherence problem fall into two major classes: hardware-based andsoftware-based. The former imposes cache coherence by adding suitablehardware which guarantees coherence of cached data, whereas the latterimposes coherence by limiting the caching of shared data.
This can be done by the programmer, the compiler, orthe operating system. For a survey of hardware-based cache-coherencesolutions in general-purpose computing, the reader is referred to[5,27,28] software-based cache-coherencesolutions are reviewed in .
Shared-memory in MPSoCs Even though message passinghas received considerable attention recently, shared memory is the mostcommon programmer abstraction in today's MPSoCs. In embedded MPSoCplatforms, shared-memory coherence is often supported only throughsoftware libraries which rely on the definition of non-cacheable memoryregions for shared data or on cache flushing at selected points of theexecution flow.
There are a few exceptions that rely on hardwarecache coherence, especially for platforms which have a high degree ofhomogeneity in computational node architecture . On the contrary,hardware cache coherency is by far the most common architectural optionof homogeneous chip multi-processors which are becomingcommonplace in general-purpose computing .
This bifurcation has taken place for two mainreasons. First, most general-purpose engines are homogeneous, and allprocessors have the same level-1 cache architecture and memorymanagement unit. The cores used in these chip multi-processors areoften derived from older designs with cache-coherency support, thus, noextra design effort is needed to support it.
Second, and most important, a general-purpose chipmultiprocessor is designed to replace a traditional high-performancemicroprocessor, hence it should run efficiently single-processorworkloads.
From the programmer viewpoint, this implies that thehardware architecture should provide a programming model with thesmoothest transition from single thread of execution tomulti-threading. Hardware-based cache coherency fully support auniform, coherent memory space, where threads can communicate throughshared variables and parallelization is thereforerelatively easy.
In contrast, MPSoCs for low-power embeddedapplications are heterogeneous collections of programmable cores whichhave been carefully chosen and customized for a specific class ofapplications. For instance, the widespread SoC platforms for wirelesshandsets feature a dualprocessor architecture, where a RISCmicrocontroller is coupled with a DSP.
The microcontroller performs system control functionsand manages the user interface, while the DSP runs the base-bandsection of the modem. These are very different workloads, and each coreis highly efficient (performance and energy-wise) in executing aspecific workload. Achieving the same level of performance on ageneral-purpose dual-processor engine would be highlyenergy-inefficient as it would require two very powerfulgeneral-purpose cores.
Supporting both hardware-based cache coherency andheterogeneous processing elements is a very challenging task from thearchitectural viewpoint. First, heterogeneous processors work almostalways on disjoint data structures. Hence, most of thecoherency-related operations performed by the hardware at runtime arepure overhead.
Second, different types of processors have verydifferent access patterns to global memory. Conflicting access patternscan put under severe stress a cache-coherent machine. For instance, thestreaming access pattern of signal processors is highly predictable (nodata-dependent control flow), but has very low locality and poor cachebehavior. Last, hardware-based cache coherence heavily relies on thepresence of a shared-bus interconnect.
Many MPSoCs already feature a number of cores andbandwidth requirements that exceed the bandwidth limits of a sharedbus. Extensions of hardware-based cache coherency to high-bandwidthparallel or multi-hop NoCs is difficult andhas not been attempted yet.
Even though shared memory is considered aninherently easier architectural abstraction than message passing fromthe programmer's viewpoint, writing complex and correct parallelprograms on a shared-memory machine is very challenging. For thesereasons, a number of parallel programming libraries for shared-memorymachines have been developed such as Pthreads .
The main purpose of these libraries is to provide ahigh-level interface to parallel programming primitives based on sharedmemory, in order to enhancer the productivity of the software developer.
These APIs offer three main types of functions:
1. creation, management and destruction of parallel threads ofexecution;
2. allocation, deallocation access of shared data-structures;
3. various forms of synchronization.
Besides facilitating the development of complexparallel applications, these libraries provide a number of sidebenefits: code portability across multiple hardware platforms, easierdebugging and documentation, faster parallelization of existingsequential code, better upgrading of existing code base.
The critical distinctive issue in embedded MPSoCs isthat software abstraction is not without a cost, and library-inducedoverhead of hundreds of instructions are not affordable in a number oftightly performance- constrained applications. For these reasons, thelibraries made available in MPSoC development toolkits offer only basicabstractions and functions. In the next part in this series, we willuse the MPSIM virtual platform as a case study for analyzing in detailthe implementation of ashared-memory library in a resource-limited MPSoC context.
Programming ashared-memory NoC platform
We now focus on the shared-memory MPSIM platform (see Part 4). The system V IPClibrary, which was originally developed to write parallel applicationsin the Unix operating system was ported onto MPSIM.
One important advantage of using a standard libraryavailable in uniprocessor Linux and Unix environments is that itbecomes possible for software designers to develop their applicationson host PCs and to easily port their code onto the MPSoC virtualplatform for validation and fine-grained software tuning on the targetarchitecture.
System V IPC is a communication library forheavy-weight processes based on permanent kernel resident objects. Eachobject is ident