The challenges of next-gen multicore networks-on-chip systems: Part 3

There is a large literature on architectures for local- and wide-area networks and, more specifically, for single-chip multiprocessors [1, 5, 6, 13, 20, 33]. These architectures can be classified by their topology, structure and parameters. The most common on-chip communication architecture is the shared-medium architecture, as exemplified by the shared bus.

Unfortunately, bus performance and energy consumption are penalized when the number of end nodes scales up. Point-to-point architectures such as mesh, torus and hypercube have been shown to scale up despite a higher complexity in their design. There are a few ad hoc NoC architectures, such as Octagon [20], which have been designed for silicon implementation and have no counterpart in macroscopic networks.

Some properties of NoC architectures depend on the topology, while others depend on specific choices of protocols. For example, deadlock, livelock and starvation depend also on the forwarding scheme and routing algorithms chosen for the specific architecture. NoC protocols are typically organized in layers, in a fashion that resembles the OSI protocol stack (Fig. 1.10).

Figure 1.10. Micro-network stack.

Physical Layer
Global wires are the physical implementation of the communication channels. Physical layer signaling techniques for lossy transmission lines have been studied for a long time by high-speed board designers and microwave engineers [3, 12]. Traditional rail-to-rail voltage signaling with capacitive termination, as used today for on-chip communication, is definitely not well suited for high-speed, low-energy communication on future global interconnect [12].

Reduced-swing, current-mode transmission, as used in some processor-memory systems, can significantly reduce communication power dissipation while preserving the speed of data communication.

Nevertheless, as technology trends lead us to use smaller voltage swings and capacitances, error probabilities will increase. Thus, the trend toward faster and lower-power communication may decrease reliability as an unfortunate side effect. Reliability bounds as voltages scale can be derived from theoretical (entropic) considerations [17] and can also be measured by experiments on real circuits.
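The link between swing and reliability can be illustrated with a simple noise model. The sketch below is not the entropic bound of [17]; it assumes a binary link whose two levels are separated by `v_swing`, additive Gaussian noise of standard deviation `sigma`, and a receiver threshold at the midpoint. The function name and all numeric values are illustrative assumptions.

```python
import math


def bit_error_probability(v_swing: float, sigma: float) -> float:
    """Bit error probability Q(v_swing / (2 * sigma)) for a midpoint threshold.

    Assumed model: additive Gaussian noise with standard deviation sigma.
    """
    # The noise must exceed half the swing to flip a bit.
    x = v_swing / (2.0 * sigma)
    # Q(x) = 0.5 * erfc(x / sqrt(2)) is the Gaussian tail probability.
    return 0.5 * math.erfc(x / math.sqrt(2.0))


if __name__ == "__main__":
    sigma = 0.05  # assumed noise standard deviation, in volts
    for v_swing in (1.0, 0.6, 0.4, 0.2):
        ber = bit_error_probability(v_swing, sigma)
        print(f"swing = {v_swing:.1f} V  ->  bit error probability = {ber:.3e}")
```

Under this model, the error probability rises steeply as the swing shrinks toward the noise level, which is the side effect described above.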

NoCs support a design paradigm shift. Former SoC design styles consider wiring-related effects as undesirable parasitics, and try to reduce or cancel them by specific and detailed physical design techniques. Recent and future NoC design will de-emphasize physical design, by allowing circuits to produce errors.

Errors will be contained, detected and corrected at the data-link level, by using, for example, error correcting codes (ECCs). This is possible in NoCs due to the layering of protocols. Thus, emphasis on physical layer design will be mainly on signal drivers and receivers, as well as design technologies for restoring and pipelining signals.

Data-Link Layer
The data-link layer abstracts the physical layer as an unreliable digital link, where the probability of bit errors is small but not negligible (and increasing as technology scales down). Furthermore, reliability can be traded off for energy [17]. The main purpose of data-link protocols is to increase the reliability of the link up to a minimum required level, under the assumption that the physical layer by itself is not sufficiently reliable.

An additional source of errors is contention in shared-medium networks. Contention resolution is fundamentally a non-deterministic process because it requires synchronization of a distributed system, and for this reason it can be seen as an additional noise source. In general, non-determinism can be virtually eliminated at the price of some performance penalty. For instance, centralized bus arbitration in a synchronous bus eliminates contention-induced errors, at the price of a substantial performance penalty caused by the slow bus clock and by bus request/release cycles.
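As a rough illustration of centralized arbitration, the sketch below models an arbiter that grants the bus to at most one requester per cycle, so two transmitters never drive the medium at once. The round-robin policy and the class name `RoundRobinArbiter` are assumptions chosen for the example; the text does not prescribe a particular arbitration policy.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Grants a shared bus to one requesting master per cycle (assumed policy)."""

    def __init__(self, n_masters: int) -> None:
        self.n_masters = n_masters
        # Start so that master 0 has the highest priority on the first cycle.
        self.last_granted = n_masters - 1

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the index of the granted master, or None if nobody requests."""
        # Scan masters starting just after the one granted most recently.
        for offset in range(1, self.n_masters + 1):
            candidate = (self.last_granted + offset) % self.n_masters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    for cycle in range(4):
        # Masters 0 and 1 request the bus on every cycle.
        print(cycle, arbiter.grant([True, True, False]))
```

Every master that loses arbitration waits at least one extra bus cycle, which is one source of the performance penalty mentioned above.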

Future high-performance shared-medium on-chip micro-networks may evolve in the same direction as high-speed local-area networks, where contention for a shared communication channel can cause errors because two or more transmitters are allowed to concurrently send data on a shared medium. In this case, provisions must be made for dealing with contention-induced errors.

An effective way to deal with errors in communication is to packetize data. If data is sent on an unreliable channel in packets, error containment and recovery are easier, because the effect of errors is contained by packet boundaries, and error recovery can be carried out on a packet-by-packet basis. At the data-link layer, error correction can be achieved by using standard ECCs that add redundancy to the transferred information.
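To make the idea of added redundancy concrete, here is a minimal sketch of one standard ECC, the Hamming(7,4) code, which protects 4 data bits with 3 parity bits and corrects any single-bit error. The choice of this particular code and the function names are assumptions for illustration; the text does not name a specific code.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit (0 means no error).
    error_position = s1 + 2 * s2 + 4 * s3
    if error_position:
        c[error_position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    sent = hamming74_encode(word)
    received = list(sent)
    received[4] ^= 1  # inject a single-bit error on the link
    print(hamming74_decode(received) == word)
```

The cost of the correction capability is three extra wires or bit slots for every four data bits, which is the redundancy the paragraph refers to.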

Error correction can be complemented by several packet-based error detection and recovery protocols. Several parameters in these protocols (e.g., packet size, number of outstanding packets, etc.) can be adjusted to achieve maximum performance at a specified residual error probability and/or within given energy consumption bounds.
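A packet-based detection and recovery protocol can be sketched as follows. This example assumes a CRC-32 checksum (from Python's standard `zlib` module) appended to each packet and a simple stop-and-wait retransmission policy with a single outstanding packet. The helper names, the bit-flipping channel model and the retry limit are hypothetical.

```python
import random
import zlib
from typing import Tuple


def make_packet(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_packet(packet: bytes) -> bool:
    """Return True if the stored CRC matches the received payload."""
    payload, stored_crc = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored_crc


def noisy_link(packet: bytes, bit_error_rate: float, rng: random.Random) -> bytes:
    """Assumed channel model: each bit flips independently."""
    out = bytearray(packet)
    for i in range(len(out) * 8):
        if rng.random() < bit_error_rate:
            out[i // 8] ^= 1 << (i % 8)
    return bytes(out)


def send_with_retransmission(payload: bytes, bit_error_rate: float,
                             rng: random.Random,
                             max_tries: int = 50) -> Tuple[bytes, int]:
    """Resend the packet until its CRC checks; return (payload, attempts)."""
    packet = make_packet(payload)
    for attempt in range(1, max_tries + 1):
        received = noisy_link(packet, bit_error_rate, rng)
        if check_packet(received):
            return received[:-4], attempt
    raise RuntimeError("retry limit reached: link too noisy")


if __name__ == "__main__":
    rng = random.Random(1)
    data, attempts = send_with_retransmission(b"flitdata", 0.01, rng)
    print(data, attempts)
```

In this model a packet of n bits arrives clean with probability (1 - p)^n for bit error rate p, so larger packets amortize the checksum better but are retransmitted more often. This is one way packet size trades performance against residual error probability and energy.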

Network and Transport Layers
At the network layer, packetized data transmission can be customized by the choice of switching and routing algorithms. The former establishes the type of connection while the latter determines the path followed by a message through the network to its final destination. Popular packet switching techniques include store-and-forward (SAF), virtual cut-through (VCT) and wormhole. SAF forwarding inspects each packet's content before forwarding it to the next stage.

While SAF enables more elaborate routing algorithms (e.g., content-aware packet routing), it introduces extra packet delay at every router stage. Furthermore, SAF also requires a substantial amount of buffer space because the switches need to store multiple complete packets at the same time.

The VCT scheme can forward a packet to the next stage before the current switch has received it in its entirety. Therefore, VCT switching reduces the delay, as compared to SAF, but it still requires storage: when the next-stage switch is not available, the entire packet still needs to be stored in the buffers.

With wormhole switching, each packet is further segmented into flow control units (flits). The header flit reserves the routing channel of each switch, the body flits then follow the reserved channel, and the tail flit later releases the channel reservation.
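The segmentation of a packet into header, body and tail flits can be sketched as below. The flit width of 4 bytes, the `Flit` type and the convention that the header flit carries only the destination address are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    kind: str       # "head", "body" or "tail"
    payload: bytes  # destination address for the head, data otherwise


def packet_to_flits(dest: int, data: bytes, flit_bytes: int = 4) -> List[Flit]:
    """Split one packet into a head flit followed by body flits and a tail flit."""
    # Assumed format: the head flit carries the routing information.
    head = Flit("head", dest.to_bytes(flit_bytes, "big"))
    chunks = [data[i:i + flit_bytes] for i in range(0, len(data), flit_bytes)]
    if not chunks:
        chunks = [b""]  # an empty packet still needs a tail to release the channel
    flits = [head] + [Flit("body", chunk) for chunk in chunks]
    # The last flit releases the channel reservation made by the head.
    flits[-1].kind = "tail"
    return flits


if __name__ == "__main__":
    for flit in packet_to_flits(dest=5, data=b"abcdefghij"):
        print(flit.kind, flit.payload)
```

Only the head flit needs to be examined by the routing logic of a switch; the remaining flits simply follow the channel it reserved.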

One major advantage of wormhole switching is that it does not require the complete packet to be stored in the switch while waiting for the header flit to route to the next stages. Thus, wormhole switching not only reduces the SAF delay at each switch, but it also requires much smaller buffer space. On the other hand, with wormhole, one packet may occupy several intermediate switches at the same time.
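The delay and buffering differences among the three schemes can be compared with a first-order model. The sketch below assumes an unloaded network (no contention), links that transfer one flit per cycle, and a fixed per-hop router delay; the function names and the wormhole buffer depth of 2 flits are illustrative assumptions.

```python
def saf_latency(hops: int, packet_flits: int, router_delay: int = 1) -> int:
    """Store-and-forward: every hop waits for the whole packet before forwarding."""
    return hops * (router_delay + packet_flits)


def cut_through_latency(hops: int, packet_flits: int, router_delay: int = 1) -> int:
    """VCT and wormhole (unloaded): only the header pays the per-hop delay,
    and the rest of the packet is pipelined behind it."""
    return hops * router_delay + packet_flits


def min_buffer_flits(scheme: str, packet_flits: int, wormhole_depth: int = 2) -> int:
    """Buffer space a switch port needs to hold one blocked packet."""
    if scheme in ("saf", "vct"):
        return packet_flits    # a blocked packet is stored in full
    if scheme == "wormhole":
        return wormhole_depth  # only a few flits; the rest stay upstream
    raise ValueError(f"unknown switching scheme: {scheme}")


if __name__ == "__main__":
    hops, packet_flits = 4, 16
    print("SAF latency        :", saf_latency(hops, packet_flits))
    print("cut-through latency:", cut_through_latency(hops, packet_flits))
    for scheme in ("saf", "vct", "wormhole"):
        print(scheme, "buffer flits:", min_buffer_flits(scheme, packet_flits))
```

With 4 hops and 16-flit packets, this model gives 68 cycles for SAF against 20 cycles for the cut-through schemes. It deliberately ignores blocking, which is where wormhole switching pays for its small buffers.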

Thus, it may block the transmission of other packets. Deadlock and livelock are the potential problems in wormhole schemes [8, 13]. A number of recently proposed NoC prototype implementations are indeed based on wormhole packet switching [1, 7, 10, 16].

Switching is tightly coupled to routing. Routing algorithms establish the path followed by a message through the network to its final destination. The classification, evaluation and comparison of on-chip routing schemes [13] involve the analysis of several trade-offs, such as predictability versus average performance, router complexity and speed versus achievable channel utilization, and robustness versus aggressiveness. A coarse distinction can be made between deterministic and adaptive routing algorithms. Deterministic approaches always supply the same path between a given source-destination pair, and they are the best choice for uniform or regular traffic patterns. In contrast, adaptive approaches use information on network traffic and channel conditions to avoid congested regions of the network. They are preferable in the presence of irregular traffic or in networks with unreliable nodes and links. Among other routing schemes, probabilistic broadcast algorithms [14] have been proposed for NoCs.
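The distinction between deterministic and adaptive routing can be illustrated on a 2D mesh. The sketch below uses dimension-order (XY) routing as the deterministic example and a minimal adaptive rule that picks the less loaded of the productive neighbours. Both choices, the `load` map and the function names are assumptions for illustration rather than schemes prescribed by the text.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (x, y) coordinates of a switch in the mesh


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Deterministic routing: travel along X first, then along Y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def adaptive_next_hop(cur: Node, dst: Node, load: Dict[Node, int]) -> Node:
    """Minimal adaptive routing: among the neighbours that bring the packet
    closer to dst, choose the one with the lowest reported load."""
    candidates: List[Node] = []
    if cur[0] != dst[0]:
        candidates.append((cur[0] + (1 if dst[0] > cur[0] else -1), cur[1]))
    if cur[1] != dst[1]:
        candidates.append((cur[0], cur[1] + (1 if dst[1] > cur[1] else -1)))
    if not candidates:
        return cur  # already at the destination
    return min(candidates, key=lambda node: load.get(node, 0))


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    # Node (1, 0) is congested, so the adaptive rule steps along Y first.
    print(adaptive_next_hop((0, 0), (2, 1), {(1, 0): 5}))
```

XY routing always returns the same path for a given source-destination pair, which makes it predictable. The adaptive rule needs congestion information at every switch and, in a real design, additional measures to remain deadlock-free.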

At the transport layer, algorithms deal with the decomposition of messages into packets at the source and their assembly at destination. Packetization granularity is a critical design decision because the behavior of most network control algorithms is very sensitive to packet size. Packet size can be application specific in SoCs, as opposed to general networks. In general, flow control and negotiation can be based on either deterministic or statistical procedures. Deterministic approaches ensure that traffic meets specifications, and provide hard bounds on delays or message losses. The main disadvantage of deterministic techniques is that they are based on worst cases, and they generally lead to significant under-utilization of network resources. Statistical techniques are more efficient in terms of utilization, but they cannot provide worst-case guarantees. Similarly, from an energy viewpoint, we expect deterministic schemes to be more inefficient than statistical schemes because of their implicit worst-case assumptions.
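The decomposition and reassembly step can be sketched as follows. The packet format used here, a tuple of sequence number, total packet count and payload, is a hypothetical choice for the example; `payload_size` plays the role of the packetization granularity discussed above.

```python
from typing import List, Tuple

# Assumed packet format: (sequence number, total packets, payload)
Packet = Tuple[int, int, bytes]


def packetize(message: bytes, payload_size: int) -> List[Packet]:
    """Split a message into numbered packets of at most payload_size bytes."""
    chunks = [message[i:i + payload_size]
              for i in range(0, len(message), payload_size)] or [b""]
    return [(seq, len(chunks), chunk) for seq, chunk in enumerate(chunks)]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message at the destination, whatever the arrival order."""
    total = packets[0][1]
    by_seq = {seq: payload for seq, _, payload in packets}
    if len(by_seq) != total:
        raise ValueError("missing packets: cannot reassemble the message")
    return b"".join(by_seq[seq] for seq in range(total))


if __name__ == "__main__":
    message = b"network-on-chip"
    packets = packetize(message, payload_size=4)
    # Packets may arrive out of order, for example with adaptive routing.
    print(reassemble(list(reversed(packets))) == message)
```

A smaller `payload_size` produces more packets and therefore more header overhead, while a larger one increases the cost of losing or retransmitting a single packet.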

Software Layers
Current and future SoCs will be highly programmable, and therefore their power consumption will critically depend on software aspects. Software layers comprise system and application software. The system software provides us with an abstraction of the underlying hardware platform, which can be leveraged by the application developer to safely and effectively exploit the hardware's capabilities. The hardware abstraction layer (HAL) is tightly coupled to the design of wrappers for processor cores, which act as network interfaces between processing cores and the NoC.

Current SoC software development platforms are mostly geared toward architectures with a single microcontroller and multiple coprocessors. Most of the system software runs on the control processor, which orchestrates the system activity and farms off computationally intensive tasks to domain-specific coprocessors.

Microcontroller-coprocessor communication is usually not data-intensive (e.g., synchronization and reconfiguration information), and most high-bandwidth data communication (e.g., coprocessor-coprocessor and coprocessor-IO) is performed via shared memories and direct memory access (DMA) transfers.

The orchestration activities in the microcontroller are performed via run-time services provided by single-processor RTOSes (e.g., VxWorks, Micro-OS, Embedded Linuxes, etc.), which differ from standard operating systems in their enhanced modularity, reduced memory footprint, and support for real-time scheduling and bounded interrupt service times.

Application programming is mostly based on manual partitioning and distribution of the most computationally intensive kernels to data coprocessors (e.g., VLIW multimedia engines, DSPs, etc.). After partitioning, different code generation and optimization tool chains are used for each target coprocessor and the control processor.

Hand-optimization at the assembly level is still quite common for highly irregular signal processors, while advanced optimizing compilers are often used for VLIW engines and fine-grained reconfigurable fabrics. Explicit communication via shared memory is usually supported via storage class declarations (e.g., non-cacheable memory pages), and DMA transfers from and to shared memories are usually set up via specialized system calls which access the memory-mapped control registers of the DMA engines.

Even though the standard single-processor software design flows have been adapted to deal with architectures with some degree of parallelism, NoCs are communication-dominated architectures. Thus, they require much more fundamental work on software abstraction and computer-aided software development frameworks.

On one hand, more aggressive and effective techniques for automatic parallelism discovery and exploitation are needed; on the other hand, programming languages and environments should enable explicit description of parallel computation without obfuscating functional specification behind the complexity of low-level parallelism management tasks. In our view, software issues are among the most critical and least understood in NoCs.

We believe that the full potential of on-chip networks can be effectively exploited only if adequate software abstractions and programming aids are developed to support them.

NoC Design Tools and Design
Designing NoCs requires specialized environments and tools. On one hand, analysis tools are important to evaluate the performance of an NoC of interest, as well as to trade off alternative implementations and tune parameters. On the other hand, synthesis of NoCs is important, in view of the possibility of using ad hoc architectures and of customizing NoCs for given applications.

Synthesis tools aim at taking abstract, high-level views of network topologies and protocols, and at generating an implementation instance. This instance can be a high-level functional implementation of an NoC; yet it may be used as input to other, lower level and more standard, synthesis tools.

The relation between the functional view and the physical view is important in both analysis and synthesis. Indeed, the SoC floorplan determines wiring lengths and delays that represent constraints for the NoC realization.

NoCs were conceived with the goal of boosting system performance by taking a system view of on-chip communication and by rationalizing communication resources. At the same time, NoCs promise to solve some of the problems of deep submicron (DSM) technology, by providing a means to deal with signal delay variability and with unreliability of the physical interconnect.

NoCs are also well poised to realize the communication infrastructure for SoCs with tens or hundreds of computational cores, where parallel processing creates significant traffic that needs to be managed with run-time techniques. NoCs can be thought of as the evolution of high-performance busses, which today are available in multi-layer, multilevel instantiations. Yet NoCs require a new vision of SoC design, which incorporates fault tolerance and layering.

A few advanced chips use NoCs as communication substrate, and it is conjectured that NoCs will be the backbone of all SoCs of significant complexity designed with the 65 nm technology node and below. NoCs will be an integral part of SoC platforms dedicated to specific application domains, and programming platforms with NoCs will be simpler due to regularity and predictability.

Moreover, FPGAs represent a large and ever-increasing sector of the semiconductor market, and they are another embodiment of NoCs. Within advanced FPGA architectures, both hardware computing elements and their interconnects are programmable, thus achieving an unprecedented flexibility.

Some design and implementation issues are still in search of efficient solutions. Problems get harder as we move up from the physical layer because most hardware design problems are well understood, while system software issues, from handling massive parallelism to generating predictable code, are still being investigated. It is the purpose of this book to shed some light on this important and timely technology, and to gather together results and experiences of several researchers.

Next in Part 4: NoC programming issues and approaches.
To read Part 2, go to “SoC objectives and NoC needs.”
To read Part 1, go to “Why on-chip networking?”

Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on material from “Networks On Chips: Technology and Tools,” by Luca Benini and Giovanni De Micheli.

Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is professor and director of the Integrated Systems Center at EPF in Lausanne, Switzerland.

[1] A. Adriahantenaina, H. Charlery, A. Greiner, L. Mortiez and C. Zeferino, “SPIN: A Scalable, Packet Switched, On-Chip Micro-network,” DATE - Design, Automation and Test in Europe Conference and Exhibition, 2003, pp. 70–73.
[2] A.H. Ajami, K. Banerjee and M. Pedram, “Modeling and Analysis of Nonuniform Substrate Temperature Effects on Global ULSI Interconnects,” IEEE Transactions on CAD, Vol. 24, No. 6, June 2005, pp. 849–861.
[3] H. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, Upper Saddle River, NJ, 1990.
[4] L. Benini, A. Bogliolo and G. De Micheli, “A Survey of Design Techniques for System-Level Dynamic Power Management,” IEEE Transactions on Very Large-Scale Integration Systems, Vol. 8, No. 3, June 2000, pp. 299–316.
[5] W.O. Cesario, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, L. Gauthier, M. Diaz-Nava and A.A. Jerraya, “Multiprocessor SoC Platforms: A Component-Based Design Approach,” IEEE Design and Test of Computers, Vol. 19, No. 6, November-December 2002, pp. 52–63.
[6] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, San Francisco, CA, 2004.
[7] W. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” Proceedings of the 38th Design Automation Conference, 2001.
[8] W.J. Dally and H. Aoki, “Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels,” IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 4, April 1993, pp. 466–475.
[9] W. Dally and C. Seitz, “The Torus Routing Chip,” Distributed Processing, Vol. 1, 1996, pp. 187–196.
[10] M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi and L. Benini, “Xpipes: A Latency Insensitive Parameterized Network-on-Chip Architecture for Multiprocessor SoCs,” International Conference on Computer Design, 2003, pp. 536–539.
[11] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N.S. Kim and K. Flautner, “Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation,” IEEE Micro, Vol. 24, No. 6, November-December 2004, pp. 10–20.
[12] W. Dally and J. Poulton, Digital Systems Engineering, Cambridge University Press, Cambridge, MA, 1998.
[13] J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, San Francisco, CA, 2003.
[14] T. Dumitra, S. Kerner and R. Marculescu, “Towards On-Chip Fault-Tolerant Communication,” ASPDAC – Proceedings of the Asian-South Pacific Design Automation Conference, 2003, pp. 225–232.
[15] S. Goel, K. Chiu, E. Marinissen, T. Nguyen and S. Oostdijk, “Test Infrastructure Design for the Nexperia Home Platform PNX8550 System Chip,” DATE – Proceedings of the Design Automation and Test Europe Conference, 2004.
[16] K. Goossens, J. van Meerbergen, A. Peeters and P. Wielage, “
