The challenges of next-gen multicore networks-on-chip systems: Part 4

Raising the abstraction level for computation and communication specification seems the only way to master the complexity of mapping a large software application onto a multi-processor system-on-chip (MPSoC). Even though the bulk of this book is on architectural and lower-level issues, high-level programming models are needed to support abstraction of hardware and software architectures.

Parallel computer architectures and parallel programming have deep roots in high-performance computing. Early programming abstractions for parallel machines go back almost 60 years.

In the last half-century, the traditional dichotomy between shared memory and message passing as programming models for multi-processor systems has consolidated. For small-to-medium scale multi-processor systems, consensus was reached on cache-coherent architectures based on a shared memory programming model.

In contrast, large-scale high-performance multi-processor systems have converged toward non-uniform memory access (NUMA) architectures based on message passing (MP) [5, 6]. As already discussed, several characteristics differentiate NoCs and MPSoCs from classical multiprocessing platforms, and this view must be carefully revisited.

First, the "on-chip" nature of interconnects reduces the cost of inter-processor communication. The cost of delivering a message on an on-chip network is in fact at least one order of magnitude lower (power- and performance-wise) than that of an off-chip interconnect. NoC platforms feature a growing amount of on-chip memory, and the cost of on-chip memory accesses is also smaller than that of off-chip memory accesses.

Second, NoCs are often deployed in resource-constrained, safety-critical systems. This implies that while performance is obviously important, other cost metrics such as power consumption and predictability must be considered and reflected in the programming model.

Unfortunately, it is not usually possible to optimize all these metrics concurrently, and one quantity must typically be traded off against the other. Third, unlike traditional MP systems, most NoC architectures integrate highly heterogeneous end-nodes. For instance, some platforms are a mix of standard processor cores and application-specific processors such as digital signal processors or micro-controllers [7, 8].

Conversely, other platforms are highly modular and reminiscent of traditional, homogeneous multi-processor architectures [9, 10], but they have a highly application-specific memory hierarchy and input/output interface.

Parallel programming not a "one-size-fits-all"
These issues indicate that parallel programming based on a uniform, coherent shared memory abstraction is not a viable "one-size-fits-all" solution for highly parallel NoCs. This conclusion is strengthened by observing that uniform, coherent shared memory is not supported even by lower-dimension MPSoC platforms on the market today (refer to the following sections), as well as by many forward-looking research prototypes. On the contrary, evidence accumulates in favor of "communication exposed" programming models, where communication costs are made visible (at various levels of abstraction) to the programmer.

In other words, we are witnessing a paradigm shift from computation-centric programming toward communication-centric programming, where most of the developer's ingenuity and the software development system support must be focused on reducing or hiding the cost of communication. This poses significant challenges, especially when considering the inefficiency incurred when mapping high-level programming models (such as message passing) onto generic architectures, in terms of software and communication overhead.

Another important trend in parallel programming for NoC architectures is the distinction between two different types of parallelism: instruction-level parallelism (ILP) and task-level parallelism (TLP). These two types of parallelism are addressed in different ways.

Fine-grained ILP discovery and exploitation is extremely difficult and labor-intensive for application programmers. Hence, much effort has been devoted to creating tools for ILP extraction and hardware architectures that can dynamically extract ILP at execution time. From the programmer's viewpoint, the emphasis should be on adopting a programming style that does not obfuscate ILP with convoluted computation and memory access patterns. Programming language abstractions can facilitate ILP extraction by forcing programmers to use "ILP-friendly" constructs.
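
As a simple illustration (not drawn from any specific platform discussed in this chapter), the C99 fragment below shows what an "ILP-friendly" construct can look like: the restrict qualifiers and the dependence-free loop body let an optimizing compiler, for example one targeting a VLIW core, unroll, software-pipeline or vectorize the loop without conservative assumptions about pointer aliasing.

/* Illustrative "ILP-friendly" loop in C99. Each iteration is
 * independent: no aliasing, no reductions, no data-dependent control
 * flow that would obscure the available parallelism. */
void scale_and_add(float *restrict dst,
                   const float *restrict a,
                   const float *restrict b,
                   float k, int n)
{
    for (int i = 0; i < n; i++) {
        dst[i] = k * a[i] + b[i];
    }
}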

Almost all the commercial and research MPSoC/NoC platforms provide some degree of support for ILP-friendly programming. The key challenges are in shortening the learning curve for the programmer and in ensuring software portability under programming models which are expressive enough to allow concise description of complex functionality. These are conflicting requirements, and a comprehensive solution has not emerged so far.

TLP discovery and exploitation is generally left to the programmer. Automatic TLP extraction is at a very preliminary stage, even though some interesting solutions do exist (they are discussed later in the chapter). Most existing software development environments provide several facilities to help programmers specify task-level parallel computation in an efficient way. To be more precise, parallel computation is only one (and possibly not the most critical) characteristic of parallel applications.

In fact, parallel memory access and synchronization are extremely critical for efficiency. This is especially true in NoCs, where communication latencies are relatively short and on-chip memory is often not sufficient to store the full working set of the application. Thus, the key to efficient parallelization is often not so much in discovering a large number of parallel threads, but in ensuring that their execution is not starved by insufficient I/O bandwidth to main memory, or slowed down by the latency of synchronization and off-chip memory accesses.

This chapter focuses primarily on software support for TLP. This choice has several motivations. First, exploitation of ILP is highly dependent on the target processor architecture, and it is very difficult to derive general approaches and guidelines. Second, ILP extraction is often performed at compile time (e.g., in VLIW architectures) or at run time (e.g., in dynamically scheduled superscalar architectures) without direct involvement by the programmer.

Nevertheless, we shall discuss a few concrete examples of programming environments that support ILP extraction. In these environments, the programmer is offered constructs and libraries that greatly facilitate the automatic extraction, by optimizing compilers, of a high degree of ILP from sequential code.

TLP is supported in various ways. In this chapter, we distinguish three main levels of support, namely: augmentation of traditional sequential programming, communication-exposed programming and automatic parallelization. The first level is commonplace in current software development environments for MPSoCs; the second is at an advanced research stage, with several ongoing technology transfer initiatives; the third is still at an early research stage, not yet mature for industrial exploitation. The remainder of this chapter is divided into three sections, focused on the three support levels mentioned above.

Architecture Template
Before delving into the details of TLP exploitation, we set the stage for a quantitative comparison by defining a general architectural template of a multi-core NoC platform that will be referred to throughout the chapter in a few case studies. The template is shown in Figure 7.1 below: it features a number of (possibly heterogeneous) processing cores, each with a private, fast-access level-1 memory (usually single-cycle).

Level-1 memory can be instantiated as a software-controlled scratchpad, a hardware-controlled cache or a mix of both. These cores and their level-1 memories constitute the basic computing tiles, which are tied to an NoC communication fabric through one or more network interface ports.

Figure 7.1. Multiprocessor platform template

The simplest configuration has one initiator port, but more complex configurations, with multiple initiator and target ports, are possible. The NoC is also connected to a number of targets, which represent on-chip or off-chip memories, and I/Os.

Memories can be associated with a special device, which is used for synchronization. The synchronization device supports atomic read-modify operations, the basic hardware primitive for supporting mutual exclusion. The memory-synchronization device pair is needed to ensure protected memory access when two computing tiles share one memory slave.
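
As a concrete sketch of how such a device is typically used, the C fragment below builds a simple spin lock on top of a hardware test-and-set cell. The base address, the cell semantics (a read returns the previous value and atomically sets the cell) and the macro names are illustrative assumptions, not the actual MPSIM register map.

#include <stdint.h>

/* Hypothetical synchronization slave: each 32-bit word behaves as a
 * test-and-set cell. A read returns the previous value and atomically
 * sets the cell to 1; writing 0 releases it. */
#define SEM_BASE  0x20000000u
#define SEM(i)    ((volatile uint32_t *)(uintptr_t)SEM_BASE + (i))

void lock_acquire(int i)
{
    /* Spin until the read-modify returns 0, i.e. the lock was free.
     * Every poll is a read transaction across the interconnect. */
    while (*SEM(i) != 0)
        ;
}

void lock_release(int i)
{
    *SEM(i) = 0;   /* posted write frees the lock */
}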

An important feature of the architectural template is interrupt support. Interrupts can be issued by writing to an interrupt device, which raises an interrupt to any specified target processor. Interrupt requests are issued to the interrupt device through dedicated point-to-point channels. Multiple interrupt devices can be instantiated for better scalability.

It is important to notice that interrupts can be raised by processing tiles and also by synchronization devices. Thus, it is possible to send an interrupt to a processor not only through an explicit write to the interrupt device from another tile, but also when a lock on a specific memory location is freed.
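
The minimal C sketch below shows the first of these two mechanisms, a tile explicitly interrupting another tile. The register layout and addresses of the interrupt device are assumptions made for illustration only.

#include <stdint.h>

/* Hypothetical memory-mapped interrupt device: writing to the word
 * associated with a core raises an interrupt line towards that core. */
#define IRQ_DEV_BASE  0x21000000u
#define IRQ_RAISE(core) \
    (*((volatile uint32_t *)(uintptr_t)IRQ_DEV_BASE + (core)) = 1u)

/* Example: tile 0 notifies tile 3 that new data is ready, instead of
 * forcing tile 3 to poll a shared flag across the interconnect. */
void notify_tile3(void)
{
    IRQ_RAISE(3);
}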

This generic platform, called MPSIM [12], has been described in cycle-accurate SystemC, and it can be instantiated to model and simulate a wide variety of NoC architectures. We describe in more detail two instantiations which are used in the following sections to quantitatively analyze and compare different parallel programming models on an NoC target.

Shared Memory MPSIM
This architecture consists of a variable number of processor cores (ARM7 simulation models are deployed in our analysis framework) and of a shared memory device, to which the shared address space is mapped, connected via a system interconnect.

As an extension, each processor also has a private memory connected to the NoC, where it can store its own local variables and data structures. Hardware semaphores and slaves for interrupt generation are also connected to the system interconnect.

In order to guarantee data coherence under concurrent multi-processor accesses, the shared memory can be configured to be non-cacheable, but in this case it can only be accessed through NoC transactions, as no caching of shared memory data is allowed.

Alternatively, the shared memory can be declared cacheable, but in this case cache coherence has to be ensured. Hardware coherence support is based on a write-through policy, which comes in two variants: one based on an invalidate policy (Write-Through Invalidate, WTI), the other based on an update policy (Write-Through Update, WTU).

In contrast with the non-cacheable shared memory platform, the cache-coherent platform imposes a very significant restriction on the interconnect architecture: a shared-bus interconnect is required, because cache coherency is ensured through a snoopy protocol, which requires continuous monitoring of bus transactions.

Figure 7.2. Interface and operations of the snoop device for the (a) invalidate and (b) update policies.

The hardware snoop devices, for both the invalidate and the update case, are depicted in Figure 7.2 above. The snoop devices sample the bus signals to detect the transaction which is being performed on the bus, the involved data and the originating core. The input pinout of the snoop device depends on the particular bus protocol supported in the system, and Figure 7.2 reports the specific example of the interface with the STBus shared-bus node from STMicroelectronics [11].

When a write operation is flagged, the corresponding action is performed: invalidation for the WTI policy, rewriting of the data for the WTU one. Write operations are performed in two steps. The first is performed by the core, which drives the proper signals on the bus, while the second is performed by the target memory, which sends its acknowledgment back to the master core to notify operation completion (there can be an explicit and independent response phase in the communication protocol, or a ready-signal assertion in a unified bus communication phase). The write ends only when the second step is completed, and only at that point is the snoop device allowed to consistently interact with the local cache.

Of course, the snoop device must ignore write operations performed by its associated processor core. In our simulation model, synchronization between the core and the snoop device in a computation tile is handled by means of a local hardware semaphore for mutually exclusive access to the cache memory.
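
The behavioral C sketch below summarizes the snoop action just described. The types and the cache/semaphore helper functions are placeholders introduced for illustration; they do not correspond to the actual MPSIM SystemC model or to the STBus interface.

#include <stdint.h>
#include <stdbool.h>

typedef enum { POLICY_WTI, POLICY_WTU } snoop_policy_t;

typedef struct {
    int            my_core_id;   /* core this snoop device belongs to */
    snoop_policy_t policy;
} snoop_dev_t;

/* Placeholder helpers standing in for the local cache and the local
 * hardware semaphore shared between core and snoop device. */
extern bool cache_lookup(uint32_t addr);
extern void cache_invalidate(uint32_t addr);
extern void cache_update(uint32_t addr, uint32_t data);
extern void cache_lock(void);
extern void cache_unlock(void);

/* Called once a bus write has fully completed (both steps). */
void snoop_on_bus_write(snoop_dev_t *snoop, int src_core,
                        uint32_t addr, uint32_t data)
{
    /* Ignore writes issued by the snoop device's own core. */
    if (src_core == snoop->my_core_id)
        return;

    /* Act on the local cache in mutual exclusion with the core. */
    cache_lock();
    if (cache_lookup(addr)) {
        if (snoop->policy == POLICY_WTI)
            cache_invalidate(addr);    /* WTI: drop the stale copy */
        else
            cache_update(addr, data);  /* WTU: rewrite the local copy */
    }
    cache_unlock();
}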

The template followed by this shared memory architecture reflects the design approach of many semiconductor companies to the implementation of shared memory multi-processor architectures. As an example, the MPCore processor implements the ARM11 micro-architecture and can be configured to contain between 1 and 4 processor cores, while supporting fully coherent data caches [9].

Message-Oriented Distributed Memory MPSIM
This instantiation represents a distributed memory MPSoC with lightweight hardware extensions for message passing, as depicted in Figure 7.3 below.

Figure 7.3. Message-oriented distributed memory architecture.

In the proposed architecture, messages can be transmitted directly between scratch-pad memories attached to the processor cores within each computation tile. In order to send a message, a producer writes into the message queue stored in its scratch-pad memory, without generating any traffic on the interconnect.

Then, the consumer is allowed to transfer the message to its own scratch-pad, directly or via a direct memory access (DMA) controller. Scratch-pad memories are therefore connected as slave ports to the communication architecture, and their memory space is visible to the other processors.

As far as synchronization is concerned, when a producer intends to generate a message, it locally checks an integer semaphore which contains the number of free message slots in the queue.

If enough space is available, it decrements the semaphore and stores the message in its scratch-pad. Completion of the write transaction and availability of the message are then signaled to the consumer by incrementing a semaphore located in the consumer's scratch-pad memory.

This single write operation goes through the NoC interconnect. Semaphores are therefore distributed among the processing elements, resulting in two advantages: the read/write traffic to the semaphores is distributed, and the producer (consumer) can locally poll whether space (a message) is available, thereby reducing interconnect traffic.
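
A minimal C sketch of the producer side of this protocol is shown below. The queue depth, the message layout, the addresses and the assumption that the consumer's hardware semaphore increments when written are all illustrative; this is not the actual MPSIM message-passing library.

#include <stdint.h>
#include <string.h>

#define QUEUE_DEPTH 4
typedef struct { uint32_t payload[16]; } msg_t;

/* In the producer's scratch-pad: message slots and a "free slots"
 * counter. The consumer increments free_slots through the NoC after
 * it has drained a slot. */
static msg_t        queue[QUEUE_DEPTH];
static volatile int free_slots = QUEUE_DEPTH;

/* In the consumer's scratch-pad, visible through the NoC as a slave:
 * a hardware semaphore assumed to increment its count when written
 * (address is illustrative). */
#define CONSUMER_AVAIL ((volatile uint32_t *)(uintptr_t)0x30000100u)

void send_message(const msg_t *m, int slot)
{
    /* 1. Local poll: no interconnect traffic while waiting for space. */
    while (free_slots == 0)
        ;
    free_slots--;

    /* 2. Local write of the message body into the scratch-pad queue. */
    memcpy(&queue[slot], m, sizeof(msg_t));

    /* 3. Single posted write across the NoC: signal "one more message
     *    available" to the consumer. */
    *CONSUMER_AVAIL = 1u;
}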

A DMA engine has been attached to each core, as presented in [13], allowing efficient data transfers between the local scratch-pad and non-local memories reachable through the NoC interconnect. The DMA control logic supports multichannel programming, while the DMA transfer engine has a dedicated connection to the scratch-pad memory, allowing fast data transfers to and from it.
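
As an illustration of how such an engine might be driven on the consumer side, the C sketch below programs one DMA channel to pull a message from a remote scratch-pad into the local one, so the transfer can overlap with computation. The register map and channel layout are assumptions; the real programming interface of the engine in [13] is not reproduced here.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical register map for one channel of the per-tile DMA engine. */
typedef struct {
    volatile uint32_t src;    /* source address (NoC address space)       */
    volatile uint32_t dst;    /* destination address (local scratch-pad)  */
    volatile uint32_t len;    /* transfer length in bytes                  */
    volatile uint32_t ctrl;   /* bit 0: start; reads back 0 when done      */
} dma_chan_t;

#define DMA_BASE     0x22000000u
#define DMA_CHAN(n)  ((dma_chan_t *)(uintptr_t)(DMA_BASE + (n) * sizeof(dma_chan_t)))

void dma_fetch(int ch, uint32_t remote_src, uint32_t local_dst, size_t bytes)
{
    dma_chan_t *c = DMA_CHAN(ch);
    c->src  = remote_src;
    c->dst  = local_dst;
    c->len  = (uint32_t)bytes;
    c->ctrl = 1u;             /* kick off the transfer */
}

void dma_wait(int ch)
{
    /* Busy-wait for completion; an interrupt from the engine could be
     * used instead to avoid polling. */
    while (DMA_CHAN(ch)->ctrl & 1u)
        ;
}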

The architectural template of the Cell Processor [14], developed by Sony, IBM and Toshiba, shares many similarities with the distributed memory architecture.

The Cell Processor features eight vector computers equipped with local storage and connected through a data-ring-based system interconnect. The individual processing elements can use this ring NoC to communicate with each other, including the transfer of data between the units acting as peers of the network.

Memory Abstraction Implications on Interconnect Architectures
The analysis of the various platform embodiments described above reveals some interesting inter-dependencies between the memory abstraction support and the interconnect architecture.

More specifically, non-cache-coherent and distributed memory architectures impose weak coupling constraints on the interconnect architecture: these architectures are well-matched to both a shared-bus interconnect and a complex multi-hop network-on-chip. In contrast, supporting cache coherency on an NoC-based interconnect appears to be a very challenging task.

In more detail, non-cacheable shared memory communication is relatively easy to support, from the functional viewpoint, on a multi-hop NoC interconnect. The shared memory bank can be connected to the NoC as a target, and the NoC will route all the shared-memory reads and writes to the corresponding end-node.

Synchronization and atomicity are much more challenging. In MPSIM, for instance, synchronization is supported by a special-purpose slave featuring atomic read-modify operations. Every master willing to get atomic access to a shared memory region must first acquire a lock via a read to the special-purpose slave.

In terms of NoC transactions, shared-memory locking requires one or more read transactions (multiple reads are required in case of access contention) to the synchronization slave. Clearly, this paradigm is neither efficient nor scalable in an NoC context, because it implies many inefficient reads (remember that a read transaction incurs the NoC latency twice) and destination congestion, as all locked accesses must go through lock acquisition on the synchronization slave.

Efficiency can be improved if shared memory targets and the corresponding synchronization targets are not unique. In this way, destination congestion can be alleviated, but even in the best case, locked writing to shared memory has a latency cost which is at least three times the NoC latency (a read-modify to the synchronization slave followed by a posted write to the shared memory target).
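
To make the accounting concrete, the sketch below annotates a locked shared-memory write with an approximate cost, assuming a one-way NoC traversal of T cycles, no contention, and the lock_acquire/lock_release spin-lock sketch given earlier; the figures are illustrative rather than measured.

#include <stdint.h>

/* From the spin-lock sketch shown earlier in this section. */
void lock_acquire(int sem);
void lock_release(int sem);

void locked_shared_write(volatile uint32_t *shared_word, uint32_t value)
{
    lock_acquire(0);       /* read-modify to the sync slave: request + response, about 2T */
    *shared_word = value;  /* posted write to the shared-memory target: about 1T          */
    lock_release(0);       /* posted write releasing the lock (may overlap)                */
}
/* The data reaches shared memory no earlier than roughly 3T after the
 * access starts, versus about 1T for an unlocked posted write; contention
 * adds further round trips to the synchronization slave. */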

Supporting the cache-coherent memory abstraction across an NoC interconnect is even more challenging. In principle, this problem has been solved by directory-based cache-coherency schemes [5], which have been proposed for large-scale multi-computers with multi-hop interconnects.

These schemes have a significant hardware overhead (the directory and directory-management logic), and they trigger a number of network transactions that are completely invisible to the programmer. Directory-based cache-coherent memory hierarchies have never been implemented in an NoC platform, hence their cost and efficiency have not been assessed. On the other hand, several MPSoCs with snoop-based cache coherency have been developed [15, 16].

This scheme requires a shared-bus interconnect, and it is the one supported in our MPSIM platform template. Clearly, it inherits the scalability problems of any bus-based architecture, and it is not easily generalized to a scalable NoC fabric. The development of efficient cache-coherency schemes for NoC targets is still an open research topic.

Finally, the message-passing architecture is clearly well-suited to an NoC interconnect. Distributed communication FIFOs and synchronization eliminate destination-contention bottlenecks (unless they are present at the application level), and most of the communication can be done through posted writes which can be farmed off to DMA engines that run in parallel with the processors in every computational tile.

It is then quite clear, from the architectural viewpoint, that the message-passing architecture is more scalable and better matched to an NoC interconnect. However, adopting a pure message-passing architecture has severe implications on the flexibility of the programming model.

For instance, applications where parallelism comes from having many workers operating in parallel on a very large common data structure (e.g., a high-definition TV frame, where each processor works on a frame window) are not easily coded using strict message-passing semantics. These issues will be dealt with in more detail in Section 7.3.2.

In the following sections, we will use the MPSIM template to quantitatively compare various programming models and the corresponding architectural support. To allow a fair comparison, we will assume that the interconnect is a shared bus.

This choice reduces to some degree the competitive advantage of message-oriented architectures in terms of interconnect scalability, but it allows a more precise assessment of the costs and benefits of programming models, without distortions caused by different interconnect fabrics.

To read Part 1, go to Why on-chip networking?
To read Part 2, go to SoC objectives and NoC needs.
To read Part 3, go to Once over lightly.
Next in Part 5: Task-Level Parallel Programming.

Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on material from "Networks on Chips: Technology and Tools," by Luca Benini and Giovanni De Micheli.

Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is professor and director of the Integrated Systems Center at EPF in Lausanne, Switzerland.

References
[1] F. Boekhorst, "Ambient Intelligence, the Next Paradigm for Consumer Electronics: How will it Affect Silicon?," International Solid-State Circuits Conference, Vol. 1, 2002, pp. 28-31.

[2] G. Declerck, "A Look into the Future of Nanoelectronics," IEEE Symposium on VLSI Technology, 2005, pp. 6-10.

[3] W. Weber, J. Rabaey and E. Aarts (Eds.), Ambient Intelligence, Springer, Berlin, Germany, 2005.

[4] S. Borkar, et al., "Platform 2015: Intel Processor and Platform Evolution for the Next Decade," Intel White Paper, 2005.

[5] D. Culler and J. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, 1999.

[6] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann Publishers, 2003.

[7] Philips Semiconductors, Philips Nexperia Platform.

[8] M. Rutten, et al., "Eclipse: Heterogeneous Multiprocessor Architecture for Flexible Media Processing," International Conference on Parallel and Distributed Processing, 2002, pp. 39-50.

[9] ARM Ltd., MPCore Multiprocessors Family.

[10] B. Ackland, et al., "A Single Chip, 1.6 Billion, 16-b MAC/s Multiprocessor DSP," IEEE Journal of Solid-State Circuits, Vol. 35, No. 3, 2000, pp. 412-424.

[11] G. Strano, S. Tiralongo and C. Pistritto, "OCP/STBUS Plug-in Methodology," GSPx Conference, 2004.

[12] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini and R. Zafalon, "Analyzing On-Chip Communication in a MPSoC Environment," Design and Test in Europe Conference (DATE), 2004, pp. 752-757.

[13] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor and J. M. Mendias, "An Integrated Hardware/Software Approach for Run-Time Scratch-pad Management," Design Automation Conference, Vol. 2, 2004, pp. 238-243.

[14] D. Pham, et al., "The Design and Implementation of a First-generation CELL Processor," IEEE International Solid-State Circuits Conference, Vol. 1, 2005, pp. 184-592.

[15] L. Hammond, et al., "The Stanford Hydra CMP," IEEE Micro, Vol. 20, No. 2, 2000, pp. 71-84.

[16] L. Barroso, et al., "Piranha: A Scalable Architecture Based on Single-chip Multiprocessing," International Symposium on Computer Architecture, 2000, pp. 282-293.

[17] Intel Corporation, IXP2850 Network Processor.

[18] STMicroelectronics, Nomadik Platform.

[19] Texas Instruments, OMAP5910 Platform.

[20] M. Banikazemi, R. Govindaraju, R. Blackmore and D. Panda, "MP-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems," IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 10, 2001, pp. 1081-10.
