The challenges of next-gen multicore networks-on-chip systems: Part 6 - Embedded.com


Parallel programming libraries as described in the previous sections are a viable short-term answer to the quest for increased software productivity for NoC platforms, but ultimately they are only partial extensions to traditional sequential programming languages.

If we reason more abstractly about parallel programming environments as syntactic frameworks for expressing concurrency, parallel computing today is at a level similar to that of sequential programming at the dawn of the structured programming era. Locks and threads are the basic structured constructs of concurrency. Similarly to what happened to sequential programming languages with the object-oriented revolution, there is a need for higher-level abstractions for building large-scale concurrent programs.

The evolution toward high-level concurrent software development environments is going to be a difficult but unavoidable step, because parallelism exploitation will be essential for achieving higher performance. Commercial and systems programming languages will be assessed based on their support for concurrent programming. Existing languages, such as C, are likely to be extended with concurrency support beyond simple library extensions.

Languages that do not evolve to support parallel programming will gradually become useless for high-performance applications. On the other hand, parallel programming is demonstrably more difficult than sequential programming. For example, simultaneous analysis of program context and synchronization has been proved undecidable [31]. Moreover, human programmers experience many difficulties in reasoning about concurrency, especially when analyzing the side effects of the many possible interleavings of parallel tasks.

A well-tried approach to raising the abstraction level without compromising efficiency is to reduce the expressiveness of the language. Taking a more positive viewpoint, this approach implies focusing the software development framework on a model of computation which is both abstract and efficient. The price to be paid is a loss of generality and flexibility, but this is acceptable if the model of computation is well suited to describing a wide spectrum of market-relevant applications.

The most successful example of this approach in the area of parallel programming for NoCs is stream computing. The key idea in stream computing is to organize an application into streams and computational kernels to expose its inherent locality and concurrency. Streams represent the flow of data, while kernels are computational tasks that manipulate and transform data. Many signal- and media-processing algorithms can easily be seen as sequences of transformations applied to a data stream.

In this case, streams are a natural way of expressing the functionality of the application, while at the same time, they match very well the hardware execution engines.

More formally, a stream computation can be represented by a directed graph, where nodes represent computations and edges represent data communication. Both nodes and edges are annotated: nodes with the function they execute, edges with the data they carry. The stream model fully exposes inter-task communication. Intuitively, this is a very desirable feature because it forces the programmer to focus on communication, not only on computation.
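To make this concrete, here is a minimal sketch (in Python, with invented node and edge names) of a stream computation encoded as an annotated directed graph, where each node carries the kernel function it executes and each edge is labeled with the data it carries:

```python
# Illustrative encoding of a stream computation as an annotated graph.
# Node names, kernels, and edge labels are hypothetical, chosen for the example.
stream_graph = {
    "nodes": {
        "decimate": lambda xs: xs[::2],             # keep every other sample
        "square":   lambda xs: [x * x for x in xs], # square each sample
    },
    "edges": [
        ("decimate", "square", "decimated samples"),  # edge annotated with its data
    ],
}

# Executing the nodes in topological order makes every inter-task
# communication explicit as the value flowing along an edge.
data = list(range(6))
for node in ("decimate", "square"):
    data = stream_graph["nodes"][node](data)
print(data)  # prints [0, 4, 16]
```

Because every transfer appears as a labeled edge, the communication volume of the application can be read directly off the graph, which is exactly the property the text highlights.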

By analyzing the communication edges in a stream computation, it is possible to obtain precise estimates of the communication requirements for a given application. This greatly simplifies analysis and mapping of applications onto parallel architectures. In fact, one of the most challenging issues in parallel programming is the analysis of communication, which is critical for performance optimization but is often obscured by the computation-oriented semantics of traditional programming languages.

Stream-oriented models of computation have been studied since the early 1960s (the interested reader is referred to the good survey by Stephens [32]), under the name of dataflow. Numerous languages (e.g., LUCID) have been developed to support dataflow semantics. Probably the best-known semantics for dataflow was proposed by Kahn, and it is known as the Kahn process network (KPN) [32]. In the KPN formalism, computational nodes perform arbitrary computations and communicate through unbounded FIFOs associated with the edges.

A computation consumes one or more data tokens (of arbitrary size) on the input FIFOs and produces one or more data tokens on the output. Since FIFOs are unbounded, writers never block and FIFOs never overflow; computational nodes can be blocked only on empty FIFOs. KPNs have been proved to be Turing-equivalent, hence they are as expressive as any sequential semantics, but they have a much more explicit model for parallel execution of tasks and inter-task communication.
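These semantics can be sketched in Python, using threads as stand-ins for KPN nodes and unbounded queues as the edge FIFOs (the end-of-stream `None` marker is an implementation convention of this sketch, not part of the KPN model):

```python
import threading
import queue

def producer(out_fifo, items):
    # A KPN node that only produces tokens. FIFOs are unbounded,
    # so put() never blocks.
    for x in items:
        out_fifo.put(x)
    out_fifo.put(None)  # end-of-stream marker (sketch convention only)

def scaler(in_fifo, out_fifo, factor):
    # A KPN node: it can block only on an empty input FIFO (blocking get),
    # then consumes one token and produces one token.
    while True:
        token = in_fifo.get()  # the only blocking point in a KPN node
        if token is None:
            out_fifo.put(None)
            return
        out_fifo.put(token * factor)

# Wire a two-node network: producer -> scaler.
a, b = queue.Queue(), queue.Queue()  # unbounded FIFOs on the edges
t1 = threading.Thread(target=producer, args=(a, [1, 2, 3]))
t2 = threading.Thread(target=scaler, args=(a, b, 10))
t1.start(); t2.start(); t1.join(); t2.join()

result = []
while True:
    tok = b.get()
    if tok is None:
        break
    result.append(tok)
print(result)  # prints [10, 20, 30], regardless of the thread schedule
```

The output is the same no matter how the two threads interleave, which previews Kahn's determinacy result discussed next.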

One important result proved by Kahn is that the result of the execution of a KPN is invariant with respect to the time order (schedule) of the node computations, provided that the partial-order rules implicitly created by the FIFO semantics on the channels are respected.

Clearly KPNs are a theoretical model, which is not implementable in practice. The most obvious unrealistic assumption is that of unbounded FIFOs: any real-life system must use bounded FIFOs. Ensuring schedulability of a computation with bounded FIFO space is a much tougher theoretical problem, and it is often solved by restricting the execution semantics of the KPN model.

One of the most well-known KPN specializations featuring an efficient schedulability analysis is Synchronous Dataflow (SDF). In SDF, computational nodes (called actors) have static, non-uniform rates of execution (firing), and firing is atomic and data driven. This model is implementable on real single- and multi-processors, and its properties have been studied extensively [33]. Unfortunately, SDF is not Turing-equivalent; in other words, there are computations that can be expressed by a KPN but cannot be performed in an SDF model.
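The static rates are what make SDF schedulable: for every edge, a balance equation relates the firing counts of producer and consumer, and a valid periodic schedule exists only if the equations have a positive integer solution. A small sketch (the helper name `repetition_vector` is invented for illustration):

```python
from math import gcd

def repetition_vector(prod, cons):
    # For an edge where actor A produces `prod` tokens per firing and
    # actor B consumes `cons` tokens per firing, the balance equation is
    #     qA * prod == qB * cons
    # The smallest positive integer solution (qA, qB) gives how many times
    # each actor fires in one period of the static schedule.
    g = gcd(prod, cons)
    return cons // g, prod // g

qA, qB = repetition_vector(2, 3)
print(qA, qB)       # prints 3 2: A fires 3 times, B twice, per period
assert qA * 2 == qB * 3  # the edge is balanced: no net token accumulation
```

Because these counts are known at compile time, FIFO bounds and schedules can be derived statically, which is exactly the analysis that a general KPN does not admit.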

The mathematical properties of dataflow have been extensively studied by the theoretical computer science community [32], and the importance of a solid theoretical foundation for program verification is widely recognized and understood. Dataflow/streaming languages and programming environments move from these solid theoretical foundations, but they usually make compromises to provide flexibility and to ease code development.

Graphical dataflow programming environments
Many commercial environments supporting stream programming are available and widely adopted (e.g., MATLAB/Simulink, SPW, COSSAP, etc.), and viable prototypes have been demonstrated in research (e.g., Ptolemy). Developing streaming applications in these environments is relatively straightforward. The programmer usually works in a graphical user interface (GUI) to specify the dataflow graph nodes and dependencies. A large library of computational nodes implementing common processing tasks is available (for instance, an FIR filter block with programmable coefficients).

For instance, the MATLAB/Simulink environment provides hundreds of pre-defined actors that can be readily instantiated without any code development. When the behavior of an actor is not available in the library, the programmer moves to a text-based interface where the node behavior can be specified, using pre-specified function calls to read input tokens and produce output tokens. Computation is usually specified in an imperative language, such as C.

Graphical dataflow environments are very effective in enhancing the productivity of the software developer, and this explains their widespread usage in software prototyping. However, in practice they have not been used for final product-code implementation, because they generate quite inefficient code for the target MPSoC architectures. In order to understand the reasons for this shortcoming, we need to look deeper into implementation details.

Probably the most critical issue in translating dataflow specifications into efficient NoC platform code is memory management. Let us consider as an example a video-processing pipeline that operates on a large image frame (which can reach multi-megabyte size for high-definition color images). Each actor in the pipeline performs processing on an incoming frame and passes it to the following actor.

Let us assume that we parallelize the execution by allocating each frame-processing actor on a different processor in a multi-core architecture. The token-passing semantics hide a performance pitfall: passing a frame token implicitly specifies a multi-megabyte data transfer. If the processors have private memories, passing a frame requires a very expensive memory-to-memory copy.

Figure 7.9. Partitioning a large memory frame to improve streaming performance.

Even worse, if the frame does not fit into the on-chip private memory of each processor, an external memory overflow area will have to be defined, and each token passing will then imply a significant amount of off-chip memory traffic, which is extremely slow and energy-inefficient. Moreover, it creates a destination contention bottleneck that no NoC architecture can alleviate, as all background memory transfers must go through a single NoC target interface to off-chip memory.

As a result, the convenient token-passing abstraction breaks down in terms of efficiency, as execution becomes severely memory and communication bound. Intuitively, it is quite clear that this memory bottleneck can be eased if computation is specified in a more memory-aware fashion.

For instance, the computation can be performed on sub-sections of the image by finer-granularity tasks, as shown in Figure 7.9 above. In this case, the full frame resides in main (off-chip) memory, and the image is processed one sub-unit at a time (e.g., stripes and sub-squares). Tokens have much finer granularity, and they can be passed along without clogging the on-chip interconnects and the I/Os.
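The idea can be sketched as a toy Python example (the stripe helper and per-stripe kernel are invented for illustration; a list of rows stands in for the frame residing in off-chip memory):

```python
def stripes(frame, n):
    # Split a frame (a list of rows) into n horizontal stripes.
    # Each stripe becomes one fine-grained token instead of the whole frame.
    step = (len(frame) + n - 1) // n
    for i in range(0, len(frame), step):
        yield frame[i:i + step]

frame = [[10 * r + c for c in range(4)] for r in range(8)]  # toy 8x4 "frame"

out = []
for stripe in stripes(frame, 4):
    # The kernel touches only one stripe at a time, so its working set can
    # fit in on-chip memory; here the "processing" just doubles each pixel.
    out.extend([[2 * p for p in row] for row in stripe])
```

Each token is now a 2-row stripe rather than the full 8-row frame, so the per-transfer volume shrinks by the partitioning factor while the total work is unchanged.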

Unfortunately, traditional prototyping-oriented dataflow environments do not offer analysis and optimization features that help programmers in this difficult balancing act. Hence, the traditional design flow is that an executable specification obtained in a dataflow programming environment has to be translated into efficient code in a slow and painful manual tuning process.

Another critical issue with dataflow programming is global data-dependent control flow. Control flow (conditionals, loops) can be specified in a dataflow environment as part of actor behavior. For instance, conditional execution of an actor's computation can be specified by guarding its code with a conditional. However, in many practical applications a single control-flow decision impacts the execution of a number of actors.

Consider, for example, MPEG video stream processing: different computations are performed on different frames, depending on the frame type (I, P, B). Even worse, asynchronous events (e.g., the user pressing a button) may imply significant changes in the computation to be performed by actors. The “pure” dataflow semantics have been extended to deal with global control flow and asynchronous events, but the efficient mapping of these extensions on target platform hardware is still not mature.

Stream Languages
Stream languages were developed with the purpose of closing the gap between abstraction and efficiency. In other words, they aim at providing high-level abstractions to the programmer for controlling stream manipulation and processing. At the same time, they syntactically enforce a set of rules to facilitate the creation of complex programs that can be compiled very efficiently onto the hardware platform targets.

On one hand, developing a complex stream-processing application using a stream language should be as fast and productive as using a graphical prototyping environment. On the other hand, the compilation of such a program should produce a very efficient executable without needing slow and error-prone manual translation.

The basic rationale of this approach is to create languages that are a better match for communication-dominated NoC-based platforms than traditional imperative languages such as plain C or C++. This matching is obtained by making parallelism and communication explicit at the language level, while at the same time providing high-level primitives to manage them.

Efficient automated generation of executable code is facilitated by restricting the freedom given to the programmer in accessing global resources (e.g., global variables) and in specifying arbitrarily complex control and data flow. In order to give concreteness to these observations, we will now analyze in some detail one of the most well-developed stream languages, StreamIt, proposed by the MIT group led by S. Amarasinghe [34].

A case study: StreamIt
StreamIt is a language explicitly designed for stream application programming. It defines a set of constructs that expose the parallelism and communication of streaming applications, but it avoids explicit references to the topology of the target architecture and the granularity of its computational engines.

A program consists of a number of simple blocks which communicate data in a limited set of patterns. The atomic block in StreamIt is called a Filter: it specifies a single-input, single-output computation. A filter body is specified using a single-threaded, Java-like procedural language. This is the work function, which specifies the Filter's steady-state computation. An example of a simple filter is shown in Figure 7.10, below.

Figure 7.10. Filter construct in StreamIt

A Filter communicates with connected blocks through input and output FIFO queues. These channels support three primitive access functions: (i) pop(), which removes an item from the output end of the channel and returns its value; (ii) peek(i), which returns the value of the item i spaces from the end of the FIFO, without removing it; and (iii) push(x), which writes the value of x to the front of the channel. If x is an object, a copy is enqueued on the channel.
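The three primitives can be modeled with a short Python sketch (the class name is invented, and applying copy-on-push uniformly to all values is an assumption of this sketch; in StreamIt itself these are channel operations used inside a filter's work function):

```python
from collections import deque
import copy

class Channel:
    # Sketch of a StreamIt-style FIFO channel with the three access
    # primitives described in the text.
    def __init__(self):
        self._q = deque()

    def push(self, x):
        # Enqueue a copy, mirroring StreamIt's copy-on-push for objects.
        self._q.append(copy.deepcopy(x))

    def pop(self):
        # Remove and return the oldest item (the next one to be consumed).
        return self._q.popleft()

    def peek(self, i):
        # Return the item i positions past the next item to be popped,
        # without removing anything; peek(0) is the value pop() would return.
        return self._q[i]

ch = Channel()
for v in (1, 2, 3):
    ch.push(v)
assert ch.peek(0) == 1 and ch.peek(2) == 3  # inspection does not consume
assert ch.pop() == 1 and ch.pop() == 2      # pop consumes in FIFO order
```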

Filter input/output operations have statically (compile-time) specified rates of production and consumption of FIFO elements. Note that the work function is triggered when enough samples are in the input FIFO to allow all the input-queue peek operations specified by its body. Filters also contain initialization functions, which serve two purposes: first, they set the initial state of the Filter (such as, for instance, the tap coefficients in an FIR filter); second, they specify the Filter's I/O types and data rates to the compiler.

Filters are compositional. They can be connected via three composite blocks: pipelines, split-joins and feedback loops. Each of these structures also has a single entry and a single exit point, allowing modular recursive composition. The Pipeline construct builds a sequence of streams. It has an initialization function, whose primary purpose is to add component streams in a specified sequence.

There is no work function, because the behavior of a pipeline is fully specified by its components. The SplitJoin construct specifies independent parallel streams that diverge from a common splitter and eventually merge into a common joiner. Similarly to Pipeline, the components of a SplitJoin are specified with successive calls to add from the init function. The splitter specifies how to distribute items from the input of the SplitJoin to the components.

Three types of split are supported: (i) Duplicate, which replicates each data item and sends a copy to each parallel stream; (ii) RoundRobin, which assigns a weight to each parallel stream and sends a number of items to each stream equal to its weight before moving to the next; and (iii) Null, which means that the parallel components do not require inputs.
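The weighted RoundRobin distribution can be sketched in a few lines of Python (the function name is invented, and lists stand in for the per-stream channels a real StreamIt splitter would feed):

```python
def roundrobin_split(items, weights):
    # Weighted RoundRobin splitter sketch: send weights[k] consecutive
    # items to stream k, then move to the next stream, cyclically,
    # until the input is exhausted.
    outs = [[] for _ in weights]
    it = iter(items)
    try:
        while True:
            for k, w in enumerate(weights):
                for _ in range(w):
                    outs[k].append(next(it))
    except StopIteration:
        return outs

a, b = roundrobin_split(range(6), [2, 1])
print(a, b)  # prints [0, 1, 3, 4] [2, 5]
```

With weights [2, 1], the first stream receives two items for every one sent to the second, matching the weighted behavior described above.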

The joiner indicates how the outputs of the parallel streams should be interleaved on the output channel of the SplitJoin. There are two joiner types: RoundRobin and Null. The FeedbackLoop construct enables the specification of cycles in a stream graph. It contains:

(i) a body stream, which is the block around which the “feedback path” will be closed,
(ii) a loop stream, to perform computation on the feedback path,
(iii) a splitter, to distribute items between the feedback path and the output channel at the bottom of the loop, and
(iv) a joiner, which merges items between the feedback path and the input channel.

All these components are specified in the initialization function. Upon initialization, there are no items on the feedback path; they can be provided by an initPath function.

An important characteristic of the StreamIt language is that it cannot specify arbitrary dataflow graphs. Only hierarchical compositions of pipelines, split-joins and feedback loops can be specified. This greatly facilitates compiler analysis and optimization, while forcing the programmer to build well-organized code.

StreamIt also supports constructs for passing irregular, low-volume control information, called messages, between filters and streams. It is possible to send a message from within the body of a filter work function to modify a parameter setting in other filters. The sender continues execution while the message is being delivered: message delivery is asynchronous, and it returns no value.

One important characteristic of the message delivery abstraction is that it is possible to specify latency bounds for the delivery times. Recall that execution is parallel and asynchronous in dataflow languages, and there is no global notion of time. Hence, latencies can be specified only relative to data tokens.

It is therefore possible to specify a latency bound in terms of a number of data tokens starting from a reference token. For instance, it is possible to specify that a message arrives at the receiver's work function K evaluations after the arrival of the data token that triggered message issue by the sender. Various types of messages are supported, including broadcast messages and re-initialization messages.

From this cursory view of the StreamIt language, we can draw a few observations. First, data communication between filters and filter compositions is completely explicit and can be analyzed at compile time. Second, data production and consumption are statically defined at compile time. This greatly eases allocation and scheduling of hardware resources at compilation time (e.g., FIFO sizing and channel sizing).

Moreover, the language's strictly defined firing rules make it much easier to schedule computations and to guard them. Overall, the language is expressive enough to model non-trivial computations, and its strong structure makes it much easier to generate highly efficient executables on an NoC-based tiled architecture.

However, there are some limitations that clearly point to the need for significant additional work in this area. First, there is no way to keep the size of data tokens under control. Hence, the programmer can easily specify a behavior where very large data tokens are passed around. If data tokens cannot fit in the local storage associated with computational nodes, a lot of traffic from/to global memories will be created, with obvious efficiency losses. Second, the language constructs are quite limited, and much of the behavior has to be fully specified at compile time.

Data-dependent global control flow is very difficult to express (messages help, but they are cumbersome). It is quite clear that many choices have been made to limit expressiveness in an effort to ease the job of the compiler. The suitability of StreamIt for describing truly complex application behaviors is therefore not yet proved.

Next in Part 7: Computer-aided software development tools
To read Part 1, go to Why on-chip networking?
To read Part 2, go to SoC objectives and NoC needs.
To read Part 3, go to Once over lightly
To read Part 4, go to Networks-on-chip programming issues
To read Part 5, go to Task-Level Parallel Programming

Used with the permission of the publisher, Newnes/Elsevier, this series of seven articles is based on material from “Networks On Chips: Technology and Tools,” by Luca Benini and Giovanni De Micheli.

Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is professor and director of the Integrated Systems Center at EPFL in Lausanne, Switzerland.

References
[1] F. Boekhorst, “Ambient Intelligence, the Next Paradigm for Consumer Electronics: How will it Affect Silicon?,” International Solid-State Circuits Conference, Vol. 1, 2002, pp. 28-31.

[2] G. Declerck, “A Look into the Future of Nanoelectronics,” IEEE Symposium on VLSI Technology, 2005, pp. 6-10.

[3] W. Weber, J. Rabaey and E. Aarts (Eds.), Ambient Intelligence, Springer, Berlin, Germany, 2005.

[4] S. Borkar, et al., “Platform 2015: Intel Processor and Platform Evolution for the Next Decade,” Intel White Paper, 2005 (PDF).

[5] D. Culler and J. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, 1999.

[6] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann Publishers, 2003.

[7] Philips Semiconductor, Philips Nexperia Platform.

[8] M. Rutten, et al., “Eclipse: Heterogeneous Multiprocessor Architecture for Flexible Media Processing,” International Conference on Parallel and Distributed Processing, 2002, pp. 39-50.

[9] ARM Ltd, MPCore Multiprocessors Family.

[10] B. Ackland, et al., “A Single Chip, 1.6 Billion, 16-b MAC/s Multiprocessor DSP,” IEEE Journal of Solid-State Circuits, Vol. 35, No. 3, 2000, pp. 412-424.

[11] G. Strano, S. Tiralongo and C. Pistritto, “OCP/STBUS Plug-in Methodology,” GSPx Conference, 2004.

[12] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini and R. Zafalon, “Analyzing On-Chip Communication in a MPSoC Environment,” Design and Test in Europe Conference (DATE), 2004, pp. 752-757.

[13] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor and J. M. Mendias, “An Integrated Hardware/Software Approach For Run-Time Scratch-pad Management,” Design Automation Conference, Vol. 2, 2004, pp. 238-243.

[14] D. Pham, et al., “The Design and Implementation of a First-generation CELL Processor,” IEEE International Solid-State Circuits Conference, Vol. 1, 2005, pp. 184-592.

[15] L. Hammond, et al., “The Stanford Hydra CMP,” IEEE Micro, Vol. 20, No. 2, 2000, pp. 71-84.

[16] L. Barroso, et al., “Piranha: A Scalable Architecture Based on Single-chip Multiprocessing,” International Symposium on Computer Architecture, 2000, pp. 282-293.

[17] Intel Semiconductor, IXP2850 Network Processor.

[18] STMicroelectronics, Nomadik Platform.

[19] Texas Instruments, OMAP5910 Platform.

[20] M. Banikazemi, R. Govindaraju, R. Blackmore and D. Panda, “MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems,” IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 10, 2001, pp. 1081-1093.

[21] W. Lee, W. Dally, S. Keckler, N. Carter and A. Chang, “An Efficient Protected Message Interface,” IEEE Computer, Vol. 31, No. 11, 1998, pp. 68-75.

[22] U. Ramachandran, M. Solomon and M. Vernon, “Hardware Support for Interprocess Communication,” IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 3, 1990, pp. 318-329.

[23] H. Arakida, et al., “A 160 mW, 80 nA Standby, MPEG-4 Audiovisual LSI with 16 Mb Embedded DRAM and a 5 GOPS Adaptive Post Filter,” IEEE International Solid-State Circuits Conference, 2003, pp. 62-63.

[24] F. Gilbert, M. Thul and N. Wehn, “Communication Centric Architectures for Turbo-decoding on Embedded Multiprocessors,” Design and Test in Europe Conference, 2003, pp. 356-361.

[25] S. Hand, A. Baghdadi, M. Bonacio, S. Chae and A. Jerraya, “
