The challenges of next-gen multicore networks-on-chip systems: Part 4

Raising the abstraction level for computation and communication specification seems the only way to master the complexity of mapping a large software application onto a multi-processor system-on-chip (MPSoC). Even though the bulk of this book is on architectural and lower-level issues, high-level programming models are needed to support abstraction of hardware and software architectures.
Parallel computer architectures and parallel programming have deep roots in high-performance computing. Early programming abstractions for parallel machines go back almost 60 years. In the last half-century, the traditional dichotomy between shared memory and message passing as programming models for multi-processor systems has consolidated. For small-to-medium scale multi-processor systems, consensus was reached on cache-coherent architectures based on a shared memory programming model.
In contrast, large-scale high-performance multi-processor systems have converged toward non-uniform memory access (NUMA) architectures based on message passing (MP) [5, 6]. As already discussed, several characteristics differentiate NoCs and MPSoCs from classical multiprocessing platforms, and this view must be carefully revisited.
First, the "on-chip'' nature of
interconnects reduces the cost of inter-processor communication. The
cost of delivering a message on an on-chip network is in fact at least
one order of magnitude lower (power-and performance-wise) than that of
an off-chip interconnect. NoC platforms feature a growing amount of
on-chip memory and the cost of on-chip memory accesses is also smaller
with respect to off-chip memories.
Second, NoCs are often deployed in resource-constrained, safety-critical systems. This implies that while performance is obviously important, other cost metrics such as power consumption and predictability must be considered and reflected in the programming model.
Unfortunately, it is not usually possible to optimize all these metrics concurrently, and one quantity must typically be traded off against another. Third, unlike traditional MP systems, most NoC architectures integrate highly heterogeneous end-nodes. For instance, some platforms are a mix of standard processor cores and application-specific processors such as digital signal processors or micro-controllers [7, 8].

Conversely, other platforms are highly modular and reminiscent of traditional, homogeneous multi-processor architectures [9, 10], but they have a highly application-specific memory hierarchy and input/output interface.
Programming not a "one-size-fits-all''
These issues indicate that parallel programming based on a uniform, coherent shared memory abstraction is not a viable "one-size-fits-all'' solution for highly parallel NoCs. This conclusion is strengthened by observing that uniform, coherent shared memory is not supported even by the smaller-scale MPSoC platforms on the market today (refer to the following sections), as well as by many forward-looking research prototypes. On the contrary, evidence is accumulating in favor of "communication exposed'' programming models, where communication costs are made visible (at various levels of abstraction) to the programmer.
In other words, we are witnessing a paradigm shift from computation-centric programming toward communication-centric programming, where most of the developer's ingenuity and the software development system support must be focused on reducing or hiding the cost of communication. This poses significant challenges, especially when considering the inefficiency incurred when mapping high-level programming models (such as message passing) onto generic architectures, in terms of software and communication overhead.
Another important trend in parallel programming for NoC architectures is the distinction between two different types of parallelism: instruction-level parallelism (ILP) and task-level parallelism (TLP). These two types of parallelism are addressed in different ways.
Fine-grained ILP discovery and exploitation is extremely difficult and labor-intensive for application programmers. Hence, much effort has been devoted to creating tools for ILP extraction and hardware architectures that can dynamically extract ILP at execution time. From the programmer's viewpoint, the emphasis should be on adopting a programming style that does not obfuscate ILP with convoluted computation and memory access patterns. Programming language abstractions can facilitate ILP extraction by forcing programmers to use "ILP-friendly'' constructs.
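As a minimal illustration (generic C, not tied to any particular MPSoC toolchain), a loop whose iterations are independent and whose pointers are declared non-aliasing is far easier for an optimizing compiler to unroll, software-pipeline or vectorize than an equivalent version that updates aliased pointers inside the loop body:

```c
/* "ILP-friendly" style: independent iterations, no pointer aliasing
 * (restrict), simple affine indexing.  A VLIW or superscalar compiler
 * can unroll and software-pipeline this loop without further hints. */
void vec_add(const float *restrict a, const float *restrict b,
             float *restrict c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];     /* every iteration is independent */
}
```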
Almost all commercial and research MPSoC/NoC platforms provide some degree of support for ILP-friendly programming. The key challenges are in shortening the learning curve for the programmer and in ensuring software portability under programming models which are expressive enough to allow a synthetic description of complex functionality. These are conflicting requirements, and a comprehensive solution has not emerged so far.
TLP discovery and exploitation is generally left to the programmer. Automatic TLP extraction is at a very preliminary stage, even though some interesting solutions do exist (they are discussed later in the chapter). Most existing software development environments provide several facilities to help programmers specify task-level parallel computation in an efficient way. To be more precise, parallel computation is only one (and possibly not the most critical) characteristic of parallel applications.
In fact, parallel memory access and synchronization are extremely critical for efficiency. This is especially true in NoCs, where communication latencies are relatively short and on-chip memory is often not sufficient to store the full working set of the application. Thus, the key to an efficient parallelization is often not so much in discovering a large number of parallel threads, but in ensuring that their execution is not starved by insufficient I/O bandwidth to main memory, or slowed down by the latency of synchronization and off-chip memory accesses.
This chapter focuses primarily on software support for TLP. This choice has several motivations. First, exploitation of ILP is highly dependent on the target processor architecture, and it is very difficult to derive general approaches and guidelines. Second, ILP extraction is often performed at compile time (e.g., in VLIW architectures) or at run time (e.g., in dynamically scheduled superscalar architectures) without direct involvement by the programmer.
Nevertheless we shall discuss a few concrete examples of programming environments that support ILP extraction. In these environments, the programmer is offered constructs and libraries that greatly facilitate the automatic extraction, by optimizing compilers, of a high degree of ILP from sequential code.
TLP is supported in various ways. In this chapter, we distinguish three main levels of support, namely: augmentation of traditional sequential programming, communication-exposed programming and automatic parallelization. The first level is commonplace in current software development environments for MPSoCs; the second is at an advanced research stage, with several ongoing technology transfer initiatives; the third is still at an early research stage, not yet mature for industrial exploitation. The remainder of this chapter is divided into three sections, focused on the three support levels mentioned above.
Before delving into the details of TLP exploitation, we set the stage for a quantitative comparison by defining a general architectural template of a multi-core NoC platform that will be referred to throughout the chapter in a few case studies. The template is shown in Figure 7.1, below: it features a number of (possibly heterogeneous) processing cores, each with a private, fast-access level-1 memory (usually single-cycle).
Level-1 memory can be instantiated as a
software-controlled scratchpad, a hardware-controlled cache or a mix of
both. These cores and their level-1 memories constitute the basic
computing tiles, which are tied to an NoC communication fabric through
one or more network interface ports.
Figure 7.1. Multiprocessor platform template.
The simplest configuration has one initiator port, but more complex configurations, with multiple initiator and target ports, are possible. The NoC is also connected to a number of targets, which represent on-chip or off-chip memories, and memories can be associated with a special device used for synchronization. The synchronization device supports atomic read-modify operations, the basic hardware primitive for supporting mutual exclusion. The memory-synchronization device pair is needed to ensure protected memory access when two computing tiles share one memory slave.
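A minimal sketch of how a computing tile might use such a synchronization slave is shown below; the base address, the lock layout and the "read returns zero and sets the lock" semantics are illustrative assumptions, not the register map of the actual platform:

```c
#include <stdint.h>

/* Assumed memory map: one word per lock inside the synchronization device. */
#define SYNC_BASE  0x20000000u                    /* placeholder address */
#define LOCK(id)   ((volatile uint32_t *)SYNC_BASE + (id))

/* Acquire: the device performs an atomic read-modify, so a read that
 * returns 0 means "was free, now taken"; any other value means busy. */
static void lock_acquire(unsigned id)
{
    while (*LOCK(id) != 0)
        ;                                         /* each probe is one NoC read */
}

/* Release: a plain (posted) write frees the lock. */
static void lock_release(unsigned id)
{
    *LOCK(id) = 0;
}

/* Protected access to a memory slave shared by two tiles. */
void shared_counter_inc(volatile uint32_t *counter)
{
    lock_acquire(0);
    (*counter)++;
    lock_release(0);
}
```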
An important feature of the architectural template is interrupt support. Interrupts can be issued by writing to an interrupt device, which raises an interrupt to any specified target processor. Interrupt requests are issued to the interrupt device through dedicated point-to-point channels. Multiple interrupt devices can be instantiated for better scalability.
It is important to notice that interrupts can be raised by processing tiles and also by synchronization devices. Thus, it is possible to send an interrupt to a processor not only through explicit write to the interrupt device from another tile, but also when a lock on a specific memory location is freed.
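For example, a tile could notify another processor with a single posted write to the interrupt device; the register offset and the encoding below are invented for illustration only:

```c
#include <stdint.h>

#define IRQ_DEVICE_BASE 0x21000000u                        /* placeholder */
#define IRQ_RAISE (*(volatile uint32_t *)(IRQ_DEVICE_BASE + 0x0))

/* Raise an interrupt towards the processor identified by 'target_core'. */
static inline void raise_interrupt(uint32_t target_core)
{
    IRQ_RAISE = target_core;        /* one posted write across the NoC */
}
```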
This generic platform, called MPSIM, has been described in cycle-accurate SystemC, and it can be instantiated to model and simulate a wide variety of NoC architectures. We describe in more detail two instantiations, which are used in the following sections to quantitatively analyze and compare different parallel programming models on an NoC target.

Shared Memory MPSIM
This architecture consists of a variable number of processor cores (ARM7 simulation models are deployed in our analysis framework) and of a shared memory device, to which the shared address space is mapped, connected via a system interconnect.
As an extension, each processor also has a private memory connected to the NoC where it can store its own local variables and data structures. Hardware semaphores and slaves for interrupt generation are also connected to the system interconnect.
In order to guarantee data coherence in the presence of concurrent multi-processor accesses, the shared memory can be configured as non-cacheable; in this case it can only be accessed through NoC transactions, since no caching of shared memory data is allowed.
Alternatively, the shared memory can be declared cacheable, but in this case cache coherence has to be ensured. Hardware coherence support is based on a write-through policy, which comes in two variants: one based on an invalidate policy (Write-Through Invalidate, WTI), the other based on an update policy (Write-Through Update, WTU).
In contrast with the non-cacheable shared memory platform, the cache-coherent platform imposes a very significant restriction on the interconnect architecture. Namely, a shared-bus interconnect is required, because cache coherency is ensured through a snoopy protocol, which requires continuous monitoring of bus transactions.
Figure 7.2. Interface and operations of the snoop device for the (a) invalidate and (b) update policies.
The hardware snoop devices, for both the invalidate and the update case, are depicted in Figure 7.2, above. The snoop devices sample the bus signals to detect the transaction which is being performed on the bus, the involved data and the originating core. The input pinout of the snoop device depends on the particular bus protocol supported in the system; Figure 7.2 reports the specific example of the interface with the STBus shared-bus node from STMicroelectronics.
When a write operation is flagged, the corresponding action is performed: invalidation for the WTI policy, rewriting of the data for the WTU one. Write operations are performed in two steps. The first is performed by the core, which drives the proper signals on the bus, while the second is performed by the target memory, which sends its acknowledge back to the master core to notify operation completion (there can be an explicit and independent response phase in the communication protocol, or a ready signal assertion in a unified bus communication phase). The write ends only when the second step is completed and the snoop device is allowed to consistently interact with the local cache.
Of course, the snoop device must ignore write operations performed by its associated processor core. In our simulation model, synchronization between the core and the snoop device in a computation tile is handled by means of a local hardware semaphore for mutually exclusive access to the cache memory.
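The following behavioural sketch (C pseudo-model, not RTL; the cache-control and semaphore helpers are placeholders for the tile's cache interface, not an actual API) summarizes the snoop action for the two policies:

```c
#include <stdint.h>

typedef enum { POLICY_WTI, POLICY_WTU } snoop_policy_t;

/* Placeholders for the tile's cache-control and semaphore interface. */
extern int  cache_line_present(uint32_t addr);
extern void cache_invalidate_line(uint32_t addr);
extern void cache_update_word(uint32_t addr, uint32_t data);
extern void cache_sem_acquire(void);   /* core/snoop mutual exclusion */
extern void cache_sem_release(void);

/* Invoked when a bus write has been acknowledged by the target memory. */
void snoop_observe_write(snoop_policy_t policy,
                         unsigned initiator_id, unsigned my_core_id,
                         uint32_t addr, uint32_t data)
{
    if (initiator_id == my_core_id)
        return;                                   /* own writes are ignored */

    cache_sem_acquire();
    if (cache_line_present(addr)) {
        if (policy == POLICY_WTI)
            cache_invalidate_line(addr);          /* WTI: drop the stale copy */
        else
            cache_update_word(addr, data);        /* WTU: rewrite it in place */
    }
    cache_sem_release();
}
```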
The template followed by this shared memory architecture reflects the design approach of many semiconductor companies to the implementation of shared memory multi-processor architectures. As an example, the MPCore processor implements the ARM11 micro-architecture and can be configured to contain between 1 and 4 processor cores, while supporting fully coherent data caches.

Message-Oriented Distributed Memory (MPSIM)
This instantiation represents a distributed memory MPSoC with lightweight hardware extensions for message passing, as depicted in Figure 7.3 below.
Figure 7.3. Message-oriented distributed memory architecture.
In the proposed architecture, messages
can be directly transmitted between scratch-pad memories attached to
the processor cores within each computation tile. In order to send a
message, a producer writes in the message queue stored in its
scratch-pad memory, without generating any traffic on the interconnect.
Then, the consumer is allowed to transfer the message to its own scratch-pad, directly or via a direct memory access (DMA) controller. Scratch-pad memories are therefore connected as slave ports to the communication architecture and their memory space is visible to the other processors.
As far as synchronization is concerned, when a producer intends to generate a message, it locally checks an integer semaphore which contains the number of free messages in the queue.
If enough space is available, it decrements the semaphore and stores the message in its scratch-pad. Completion of the write transaction and availability of the message are signaled to the consumer by incrementing a semaphore located in the consumer's scratch-pad; this single write operation goes through the NoC interconnect. Semaphores are therefore distributed among the processing elements, resulting in two advantages: the read/write traffic to the semaphores is distributed, and the producer (consumer) can locally poll whether space (a message) is available, thereby reducing interconnect traffic.
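A minimal producer-side sketch of this protocol is given below. The queue geometry, the placeholder addresses, and the assumption that the consumer-side hardware semaphore increments atomically on a single write are all illustrative, not part of the actual MPSIM register map:

```c
#include <stdint.h>
#include <string.h>

#define MSG_SIZE  64
#define QUEUE_LEN 4

/* Message queue and "free slots" counter, both in the producer's scratch-pad. */
typedef struct {
    volatile uint32_t free_slots;
    uint8_t slot[QUEUE_LEN][MSG_SIZE];
} out_queue_t;

static out_queue_t out_q = { QUEUE_LEN };   /* placed in scratch-pad by the linker */
static unsigned head;

/* Consumer's "available messages" semaphore, a slave port reached via the NoC.
 * Assumption: the hardware semaphore increments atomically on a write. */
static volatile uint32_t *const consumer_avail =
    (volatile uint32_t *)0x30001000u;       /* placeholder address */

int send_message(const void *msg)
{
    if (out_q.free_slots == 0)              /* local poll: no NoC traffic */
        return -1;                          /* queue full, retry later */

    out_q.free_slots--;                     /* reserve a slot locally */
    memcpy(out_q.slot[head], msg, MSG_SIZE);
    head = (head + 1) % QUEUE_LEN;

    *consumer_avail = 1;                    /* single posted write over the NoC */
    return 0;
}
```

Symmetrically, once the consumer has drained a slot it increments free_slots in the producer's scratch-pad, again with a single remote write.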
A DMA engine has been attached to each core, as presented in , allowing efficient data transfers between the local scratch-pad and non-local memories reachable through the NoC interconnect. The DMA control logic supports multichannel programming, while the DMA transfer engine has a dedicated connection to the scratch-pad memory allowing fast data transfers from or to it.
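On the consumer side, pulling a message from the producer's scratch-pad could then look like the sketch below; the channel register layout (source, destination, length, control) and the offsets are hypothetical, introduced only to illustrate multichannel DMA programming:

```c
#include <stdint.h>

#define DMA_BASE      0x22000000u                              /* placeholder */
#define DMA_SRC(ch)   (*(volatile uint32_t *)(DMA_BASE + (ch) * 0x10 + 0x0))
#define DMA_DST(ch)   (*(volatile uint32_t *)(DMA_BASE + (ch) * 0x10 + 0x4))
#define DMA_LEN(ch)   (*(volatile uint32_t *)(DMA_BASE + (ch) * 0x10 + 0x8))
#define DMA_CTRL(ch)  (*(volatile uint32_t *)(DMA_BASE + (ch) * 0x10 + 0xC))
#define DMA_START     0x1u
#define DMA_DONE      0x2u

/* Program one channel to copy 'bytes' from a remote scratch-pad (seen as a
 * slave through the NoC) into the local scratch-pad, then wait for completion. */
void dma_pull(unsigned ch, uint32_t remote_src, uint32_t local_dst, uint32_t bytes)
{
    DMA_SRC(ch)  = remote_src;
    DMA_DST(ch)  = local_dst;
    DMA_LEN(ch)  = bytes;
    DMA_CTRL(ch) = DMA_START;               /* kick off the transfer */

    while (!(DMA_CTRL(ch) & DMA_DONE))
        ;                                   /* the core could compute instead of polling */
}
```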
The architectural template of the Cell Processor, developed by Sony, IBM and Toshiba, shares many similarities with this distributed memory architecture. The Cell Processor features eight vector processors, each equipped with local storage and connected through a data-ring-based system interconnect. The individual processing elements can use this ring NoC to communicate with each other, including transferring data between units acting as peers on the network.
Abstraction Implications on Interconnect Architectures
The analysis of the various platform embodiments described above reveals some interesting inter-dependencies between the memory abstraction support and the interconnect architecture.
More specifically, non-cache-coherent and distributed memory architectures impose weak coupling constraints on the interconnect architecture: these architectures are well-matched to both a shared-bus interconnect and a complex multi-hop network-on-chip. In contrast, supporting cache coherency on an NoC-based interconnect appears to be a very challenging task.
In more detail, non-cacheable shared memory communication is relatively easy to support, from the functional viewpoint, on a multi-hop NoC interconnect. The shared memory bank can be connected to the NoC as a target, and the NoC will route all the shared-memory reads and writes to the corresponding end-node.
Synchronization and atomicity are much
more challenging. In MPSIM, for instance, synchronization is supported
by a special-purpose slave featuring atomic read-modify operations.
Every master willing to get atomic access to a shared memory region
must first acquire a lock via a read to the special-purpose slave.
In terms of NoC transactions, shared memory locking requires one or more read transactions (multiple memory reads are required in case of access contention) to the synchronization slave. Clearly, this paradigm is neither very efficient nor scalable in an NoC context, because it implies a lot of inefficient reads (remember that a read transaction incurs the NoC latency twice) and destination congestion, as all locked accesses must go through lock acquisition on the synchronization slave.
Efficiency can be improved if shared memory targets and the corresponding synchronization targets are not unique. In this way, destination congestion can be alleviated, but even in the best case, locked writing to shared memory has a latency cost which is at least three times the NoC latency (a read-modify to the synchronization slave followed by a posted write to the shared memory target).
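For a rough illustration (the figures are assumptions, not measurements): with a one-way NoC traversal of 20 cycles, the read-modify transaction to the synchronization slave costs about 2 x 20 = 40 cycles (request plus response), and the subsequent posted write to the shared memory target adds another 20 cycles, so a single uncontended locked write already occupies roughly 60 cycles of interconnect latency.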
Supporting the cache-coherent memory abstraction across an NoC interconnect is even more challenging. In principle, this problem has been solved by directory-based cache-coherency schemes, which have been proposed for large-scale multi-computers with multi-hop interconnects.
These schemes have a significant hardware overhead (the directory and the directory-management logic) and they trigger a number of network transactions that are completely invisible to the programmer. Directory-based cache-coherent memory hierarchies have never been implemented in an NoC platform, hence their cost and efficiency have not been assessed. On the other hand, several MPSoCs with snoop-based cache coherency have been developed [15, 16].
This scheme requires a shared-bus interconnect, and it is the one supported in our MPSIM platform template. Clearly, it inherits the same scalability problems as any bus-based architecture, and it is not easily generalized to a scalable NoC fabric. The development of efficient cache-coherency schemes for NoC targets is still an open research topic.
Finally, the message-passing
architecture is clearly well-suited to an NoC interconnect. Distributed
communication FIFOs and synchronization eliminate
destination-contention bottlenecks (unless they are present at the
application level), and most of the communication can be done through
posted writes which can be farmed off to DMA engines that run in
parallel with the processors in every computational tile.
It is then quite clear, from the architectural viewpoint, that the message-passing architecture is more scalable and better matched to an NoC interconnect. However, adopting a pure message-passing architecture has severe implications on the flexibility of the programming model.
For instance, applications where parallelism comes from having many workers operating in parallel on a very large common data structure (e.g., a high-definition TV frame, where each processor works on a frame window) are not easily coded using strict message-passing semantics. These issues will be dealt with in more detail in Section 7.3.2.
In the following sections, we will use
the MPSIM template to quantitatively compare various programming models
and the corresponding architectural support. To allow a fair
comparison, we will assume that the interconnect is a shared bus.
This choice reduces to some degree the
competitive advantage of message-oriented architectures in terms of
interconnect scalability, but it allows a more precise assessment of
cost and benefits of programming models, without distortions caused by
different interconnect fabrics.
Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is professor and director of the Integrated Systems Center at EPF in Lausanne, Switzerland.
References

1. F. Boekhorst, "Ambient Intelligence, the Next Paradigm for Consumer Electronics: How will it Affect Silicon?,'' International Solid-State Circuits Conference, Vol. 1, 2002, pp. 28–31.
2. G. Declerck, "A Look into the Future of Nanoelectronics,'' IEEE Symposium on VLSI Technology, 2005, pp. 6–10.
3. W. Weber, J. Rabaey and E. Aarts (Eds.), Ambient Intelligence, Springer, Berlin, Germany, 2005.
4. S. Borkar, et al., "Platform 2015: Intel Processor and Platform Evolution for the Next Decade,'' Intel White Paper, 2005.
5. D. Culler and J. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, 1999.
6. J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann Publishers, 2003.
7. Philips Semiconductor, Philips.
8. M. Rutten, et al., "Eclipse: Heterogeneous Multiprocessor Architecture for Flexible Media Processing,'' International Conference on Parallel and Distributed Processing, 2002, pp. 39–50.
9. ARM Ltd, MPCore Multiprocessors Family.
10. B. Ackland, et al., "A Single Chip, 1.6 Billion, 16-b MAC/s Multiprocessor DSP,'' IEEE Journal of Solid State Circuits, Vol. 35, No. 3, 2000, pp. 412–424.
11. G. Strano, S. Tiralongo and C. Pistritto, "OCP/STBUS Plug-in Methodology,'' GSPX Conference, 2004.
12. M. Loghi, F. Angiolini, D. Bertozzi, L. Benini and R. Zafalon, "Analyzing On-Chip Communication in a MPSoC Environment,'' Design, Automation and Test in Europe Conference (DATE), 2004, pp. 752–757.
13. F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor and J. M. Mendias, "An Integrated Hardware/Software Approach for Run-Time Scratch-pad Management,'' Design Automation Conference, Vol. 2, 2004, pp. 238–243.
14. D. Pham, et al., "The Design and Implementation of a First-generation CELL Processor,'' IEEE International Solid-State Circuits Conference, Vol. 1, 2005, pp. 184–592.
15. L. Hammond, et al., "The Stanford Hydra CMP,'' IEEE Micro, Vol. 20, No. 2, 2000, pp. 71–84.
16. L. Barroso, et al., "Piranha: A Scalable Architecture Based on Single-chip Multiprocessing,'' International Symposium on Computer Architecture, 2000, pp. 282–293.
17. Intel Semiconductor, IXP2850 Network Processor.
18. STMicroelectronics, Nomadik Platform.
19. Texas Instruments, OMAP5910 Platform.
20. M. Banikazemi, R. Govindaraju, R. Blackmore and D. Panda, "MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems,'' IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 10, 2001, pp. 1081–1093.
21. W. Lee, W. Dally, S. Keckler, N. Carter and A. Chang, "An Efficient Protected Message Interface,'' IEEE Computer, Vol. 31, No. 11, 1998, pp. 68–75.
22. U. Ramachandran, M. Solomon and M. Vernon, "Hardware Support for Interprocess Communication,'' IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 3, 1990, pp. 318–329.
23. H. Arakida, et al., "A 160 mW, 80 nA Standby, MPEG-4 Audiovisual LSI with 16 Mb Embedded DRAM and a 5 GOPS Adaptive Post Filter,'' IEEE International Solid-State Circuits Conference, 2003, pp. 62–63.
24. F. Gilbert, M. Thul and N. Wehn, "Communication Centric Architectures for Turbo-decoding on Embedded Multiprocessors,'' Design, Automation and Test in Europe Conference, 2003, pp. 356–361.
25. S. Han, A. Baghdadi, M. Bonaciu, S. Chae and A. Jerraya, "An Efficient Scalable and Flexible Data Transfer Architecture for Multiprocessor SoC with Massive Distributed Memory,'' Design Automation Conference, 2004, pp. 250–255.
26. P. Paulin, C. Pilkington and E. Bensoudane, "StepNP: A System-level Exploration Platform for Network Processors,'' IEEE Design and Test of Computers, Vol. 19, No. 6, 2002, pp. 17–26.
27. P. Stenström, "A Survey of Cache Coherence Schemes for Multiprocessors,'' IEEE Computer, Vol. 23, No. 6, 1990, pp. 12–24.
28. M. Tomasevic and V. Milutinovic, "Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors,'' IEEE Micro, Vol. 14, No. 5–6, 1994, pp. 52–59.
29. I. Tartalja and V. Milutinovic, "Classifying Software-Based Cache Coherence Solutions,'' IEEE Software, Vol. 14, No. 3, 1997, pp. 90–101.
30. P. Kongetira, K. Aingaran and K. Olukotun, "Niagara: A 32-way Multithreaded Sparc Processor,'' IEEE Micro, Vol. 25, No. 2, 2005, pp. 21–29.
31. H. Sutter and J. Larus, "Software and the Concurrency Revolution,'' ACM Queue, Vol. 3, No. 7, 2005, pp. 54–62.
32. R. Stephens, "A Survey of Stream Processing,'' Acta Informatica, Vol. 34, No. 7, 1997, pp. 491–541.
33. E. Lee and D. Messerschmitt, "Pipeline Interleaved Programmable DSP's: Synchronous Data Flow Programming,'' IEEE Transactions on Signal Processing, Vol. 35, No. 9, 1987, pp. 1334–1345.
34. W. Thies, et al., "Language and Compiler Design for Streaming Applications,'' IEEE International Parallel and Distributed Processing Symposium, 2004, p. 201.
35. M. Bekooij, Constraint Driven Operation Assignment for Retargetable VLIW Compilers, Ph.D. Dissertation, 2004.
36. R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Morgan Kaufmann, 2001.
37. P. Faraboschi, J. Fisher and C. Young, "Instruction Scheduling for Instruction Level Parallel Processors,'' Proceedings of the IEEE, Vol. 89, No. 11, 2001, pp. 1638–1659.
38. M. Taylor, W. Lee, S. Amarasinghe and A. Agarwal, "Scalar Operand Networks,'' IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 2, 2005, pp. 145–162.
39. P. Mattson, et al., "Communication Scheduling,'' ACM Conference on Architectural Support for Programming Languages and Operating Systems, 2000, pp. 82–92.
40. M. Sato, "OpenMP: Parallel Programming API for Shared Memory Multiprocessors and On-chip Multiprocessors,'' IEEE International Symposium on System Synthesis, 2002, pp. 109–111.
41. N. Genko, D. Atienza, G. De Micheli, J. Mendias, R. Hermida and F. Catthoor, "A Complete Network-on-chip Emulation Framework,'' Design, Automation and Test in Europe, Vol. 1, 2005, pp. 246–251.
42. C. Shin, et al., "Fast Exploration of Parameterized Bus Architecture for Communication-centric SoC Design,'' Design, Automation and Test in Europe, Vol. 1, 2004, pp. 352–357.
43. M. Michael, "Scalable, Lock-free Dynamic Memory Allocation,'' ACM Conference on Programming Languages Design and Implementation, 2004, pp. 110–122.