The challenges of next-gen multicore networks-on-chip systems: Part 2 - Embedded.com

There are several hardware types of SoC designs that can be defined according to the required functionality and market. In general, SoCs can be classified in terms of their versatility (i.e., support for programming) and application domains. A simple taxonomy is described next:

General-purpose on-chip multiprocessors are high-performance chips that benefit from spatial locality to achieve high performance. They are designed to support various applications, and thus the processor core usage and traffic patterns may vary widely. They are the evolution of on-board multiprocessors, and they are typified by having a homogeneous set of processing and storage arrays.

For these reasons, on-chip network design can benefit from the experience gained with the many architectures and techniques developed for on-board multiprocessors, with the appropriate adjustments to operate on a silicon substrate.

Figure 1.6. Razor [11] is another realization of self-calibrating circuits, where a processor's supply voltage is lowered until errors occur. Correct operation of the processor is preserved by an error detection and pipeline adjustment technique. As a result, the processor settles on-line to an operating voltage that minimizes energy consumption even in the presence of variation of technological parameters.

Application-specific SoCs are hardware chips dedicated to an application. In some cases, as for all mobile applications, energy consumption is a major concern. Most application-specific SoCs are programmable, but their application domain is limited and the software characteristics are known a priori.

Thus, some knowledge of the traffic pattern is available when the NoC is designed. In many cases, these systems contain fairly heterogeneous computing elements, such as processors, controllers, digital signal processors (DSPs) and a number of domain-specific hardware accelerators. This heterogeneity may lead to specific traffic patterns and requirements, thus requiring NoCs with specialized architectures and protocols.

SoC platforms are application-specific SoCs dedicated to a family of applications in a specific domain. Examples are SoCs for GSM telephony support and platforms for automotive control. A platform is more versatile in nature, as it can be used in different (embedded) systems by different manufacturers.

Figure 1.7. T-error is a timing methodology for NoCs where data is pipelined through double latches: the former uses an aggressive clock period and the latter a safe one. For most patterns, T-error forwards data from the first latch. When the slowest patterns, which miss the deadline at the first latch, are transmitted, correct but slower operation is performed by the second latch [30].

Thus, versatility and programmability are preferred to customization, yielding SoCs that can be produced in high volumes and thus offset the non-recurring engineering (NRE) costs. Whereas the processing and storage units may differ in nature and performance, the traffic patterns are harder to guess a priori, as the application software may vary widely.

Field-programmable gate arrays (FPGAs) are hardware systems where the functionality is determined after manufacturing by connecting and configuring components. Components vary in size and in functionality and are connected by reprogrammable networks.

These networks are simple and provide bit-level connectivity with little or no control. Nevertheless, we expect FPGAs to grow substantially over the coming years and to require effective NoC communication.

Some Design Examples

One of the first multiprocessors designed around an NoC is the RAW architecture [32]. This is a fully programmable SoC consisting of an array of identical computational tiles with local storage. Full programmability means that the compiler can program both the function of each tile and the interconnections among them. The name RAW stems from the fact that the “raw” hardware is fully exposed to the compiler.

To accomplish programmable communication, each tile has a router. The compiler programs the routers on all tiles to issue a sequence of commands that determine exactly which set of wires connect at every cycle. Moreover, the compiler pipelines the long wires to support high clock frequency.
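The idea of a compiler-generated, cycle-by-cycle switch schedule can be illustrated with a small sketch. This is not the RAW toolchain or its instruction format: the port names, the layout of `schedule`, and the `step` function are all hypothetical, chosen only to show how a fixed sequence of commands can determine which wires connect in each cycle.

```python
# Hypothetical model of a statically scheduled tile router (not RAW's real format).
PORTS = ("N", "E", "S", "W", "P")  # four neighbours plus the local processor


def step(schedule, cycle, inputs):
    """Apply the switch setting for one cycle.

    schedule: list of dicts mapping output port -> input port,
              as a compiler might emit for one tile.
    inputs:   dict mapping input port -> word present in this cycle.
    Returns a dict mapping output port -> forwarded word.
    """
    setting = schedule[cycle % len(schedule)]  # the schedule repeats
    outputs = {}
    for out_port, in_port in setting.items():
        if out_port not in PORTS or in_port not in PORTS:
            raise ValueError(f"unknown port in {out_port}<-{in_port}")
        if in_port in inputs:
            outputs[out_port] = inputs[in_port]
    return outputs


# Cycle 0: forward west to east and send local data north.
# Cycle 1: deliver data arriving from the south to the processor.
schedule = [
    {"E": "W", "N": "P"},
    {"P": "S"},
]
out0 = step(schedule, 0, {"W": 0xA, "P": 0xB})
out1 = step(schedule, 1, {"S": 0xC, "W": 0xD})
```

In this sketch the word arriving from the west in cycle 1 is simply not forwarded, because the schedule for that cycle connects no output to the west input. The router makes no run-time decision; everything is fixed ahead of time.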

The Cell processor [26] was developed by Sony, Toshiba and IBM as a general-purpose processor for a computer, even though it is primarily targeted at Sony's PlayStation 3. Its architecture resembles multiprocessor vector supercomputers, targeting high-performance distributed computing.

The architecture comprises one 64-bit Power processor element (PPE), eight synergistic processor elements (SPEs), memory and interconnection. The PPE is a dual-issue, dual-threaded, in-order RISC processor with a 512 KB cache. Each SPE is a self-contained in-order vector processor which acts as an independent processor.

Each contains a 128 × 128-bit register file, four (single-precision) floating-point units and four integer units. The element interconnection bus (EIB) connects the PPE, the eight SPEs and the memory interface controller (Figure 1.8, below). The EIB has independent networks for commands (requests for data from other sources) and for the data being moved.

Commands are filtered through address concentrators, which handle collision detection and prevention and ensure that all units have equal access to the command bus. There are multiple address concentrators, all of which forward data to a single serial command reflection point. Data transfer is elaborate.

There are four “rings,” each of which is a chain connecting all data ports. Data can move down a ring in only one direction. For instance, a connection that allows data to move from the PPE to SPE1 cannot be used to move data from SPE1 back to the PPE.

Two rings go clockwise and two counterclockwise, and all four rings have the components attached in the same order. Each ring can move 16 bytes at a time from any position on the ring to any other position. In fact, each ring can carry up to three concurrent transfers, provided those transfers do not overlap.

Figure 1.8. The Element Interconnection Bus (EIB) in Cell.
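The ring rules described above (one direction per ring, at most three concurrent transfers, no overlap) can be captured in a short sketch. This is not the actual EIB arbiter. The `Ring` class, the numbering of ring positions, and the reading of "overlap" as "sharing a link between two adjacent positions" are assumptions made for illustration.

```python
MAX_TRANSFERS_PER_RING = 3  # from the text: three concurrent transfers per ring


def segments(src, dst, n, clockwise):
    """Return the set of directed links used when moving data from src to dst."""
    if src == dst:
        raise ValueError("source and destination must differ")
    step = 1 if clockwise else -1
    links = set()
    pos = src
    while pos != dst:
        nxt = (pos + step) % n
        links.add((pos, nxt))
        pos = nxt
    return links


class Ring:
    """Toy unidirectional ring with n attachment points (hypothetical model)."""

    def __init__(self, n, clockwise):
        self.n = n
        self.clockwise = clockwise
        self.active = []  # one set of links per transfer in flight

    def try_grant(self, src, dst):
        """Grant the transfer only if a slot is free and no link is shared."""
        if len(self.active) >= MAX_TRANSFERS_PER_RING:
            return False
        wanted = segments(src, dst, self.n, self.clockwise)
        if any(wanted & used for used in self.active):
            return False
        self.active.append(wanted)
        return True


ring = Ring(10, clockwise=True)
grants = [
    ring.try_grant(0, 2),  # uses links 0->1 and 1->2
    ring.try_grant(1, 3),  # needs 1->2 as well, so it overlaps
    ring.try_grant(2, 5),  # disjoint from the first transfer
    ring.try_grant(5, 6),  # third transfer in flight
    ring.try_grant(7, 8),  # disjoint, but the ring is already at its limit
]
```

A fuller model would hold two clockwise and two counterclockwise `Ring` objects and offer a rejected request to another ring; that selection policy is not described in the text and is left out here.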

The Nexperia architecture, developed by Philips (NXP Semiconductors), is a platform for handling digital video and audio in consumer electronics (Figure 1.9, below). It uses one or more 32-bit MIPS CPUs for control processing, and one or more 32-bit TriMedia processors for streaming data. Moreover, the platform can house a flexible range of programmable modules, such as an MPEG decoder, a UART, etc.

To connect the CPUs and other modules with each other and with the main external memory, a high-speed memory access network and two device control and status (DCS) networks are used. These DCS networks enable each processor to control and observe on chip the status of the other modules.

One of the advantages of the platform is the variable number of CPUs used, which makes Nexperia a good fit for various applications. A specific implementation of Nexperia, the PNX8550 system chip, houses 10 million gates in 62 cores, of which five are hard (including the MIPS and TriMedia CPUs) and the others are soft cores [15].

Figure 1.9. The Nexperia architecture.

The Xilinx Spartan-II FPGA chips are rectangular arrays of configurable logic blocks (CLBs). Each block can be programmed to perform a specific logic function. CLBs are connected via a hierarchy of routing channels. A more complex and interesting family of products is the Xilinx Virtex-II and Virtex-II Pro. These FPGAs have various complex elements, such as CLBs, RAMs, processor cores, multipliers and clock managers.

Programmable interconnection is achieved by routing switches. Each programmable element is connected to a switch matrix, allowing multiple connections to the general routing matrix. All programmable elements, including the routing resources, are controlled by values stored in static memory cells. Thus, Virtex-II can also be seen as an NoC over a heterogeneous fabric of components.
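To make the role of the static memory cells concrete, the following sketch models a tiny switch matrix whose connections are set entirely by stored configuration bits. It is a toy crossbar, not the Virtex-II routing architecture or bitstream format; the `SwitchMatrix` class, its methods and the row-by-row bit layout are hypothetical.

```python
class SwitchMatrix:
    """Toy crossbar whose connections are set by stored configuration bits."""

    def __init__(self, n_inputs, n_outputs):
        self.n_inputs = n_inputs
        self.n_outputs = n_outputs
        # One "memory cell" per crosspoint; all switches start open.
        self.cells = [[0] * n_inputs for _ in range(n_outputs)]

    def load_bitstream(self, bits):
        """Program the cells from a flat list of 0/1 values, one row per output."""
        if len(bits) != self.n_inputs * self.n_outputs:
            raise ValueError("bitstream length does not match matrix size")
        rows = [
            list(bits[out * self.n_inputs:(out + 1) * self.n_inputs])
            for out in range(self.n_outputs)
        ]
        for out, row in enumerate(rows):
            if sum(row) > 1:
                raise ValueError(f"output {out} would have several drivers")
        self.cells = rows

    def propagate(self, inputs):
        """Return the value seen on each output (None if unconnected)."""
        result = []
        for row in self.cells:
            result.append(inputs[row.index(1)] if 1 in row else None)
        return result


sm = SwitchMatrix(n_inputs=3, n_outputs=2)
sm.load_bitstream([0, 0, 1,   # output 0 is driven by input 2
                   1, 0, 0])  # output 1 is driven by input 0
values = sm.propagate(["a", "b", "c"])
```

Reloading a different list of bits changes the connectivity without touching the hardware model, which is the property that makes the fabric reprogrammable.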

The complexity of the chip designs described above has prompted the development of infrastructure to support communication. For example, STMicroelectronics has developed the STBus kit that can provide various functions including full (and partial) crossbar connection. A similar framework is provided by the advanced microcontroller bus architecture (AMBA) multi-layer bus system.

Distinguishing Characteristics of NoCs

SoCs differ from wide-area networks because of local proximity and because they exhibit much less non-determinism. Indeed, despite the undesirable variability features of DSM CMOS technologies, it is still possible to predict many physical and electrical parameters with reasonable accuracy.

On the other hand, on-chip networks have a few distinctive characteristics, namely low communication latency, energy consumption constraints and design-time specialization. Latency of communication on chip needs to be small, that is, on the order of a few clock periods.

The shortest latencies can be achieved by fully hard-wired implementations, which defeat the flexibility required by on-chip networks. Clearly, smart protocols for communication may add to the latency of the signals. Thus, to be competitive in performance, NoCs require streamlined protocols.

Energy consumption in NoCs is often a major concern, because whereas computation and storage energy benefit greatly from device scaling (smaller gates and smaller memory cells), the energy for global communication does not scale down.

On the contrary, projections based on current delay optimization techniques for global wires [18, 29, 31] show that global communication on chip will require increasingly higher energy consumption. Hence, communication-energy minimization will be a growing concern in future technologies.

Furthermore, network traffic control and monitoring can help in better managing the power consumed by networked computational resources. For instance, clock speed and voltage of end nodes can be varied according to available network bandwidth.
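A minimal sketch of this kind of policy follows, assuming a node that has a fixed table of clock/voltage operating points and consumes a fixed number of bytes per cycle. The table values, the function names and the selection rule are hypothetical. The power estimate uses the usual first-order CMOS relation, in which dynamic power is proportional to frequency times the square of the supply voltage.

```python
# Hypothetical operating points: (clock in MHz, supply in volts), slowest first.
OPERATING_POINTS = [(200, 0.8), (400, 0.9), (600, 1.0), (800, 1.1)]


def pick_operating_point(available_mbytes_per_s, bytes_per_cycle):
    """Choose the slowest point that can still consume the offered bandwidth.

    available_mbytes_per_s: bandwidth the network can currently deliver.
    bytes_per_cycle:        data the node consumes per clock cycle.
    """
    if bytes_per_cycle <= 0:
        raise ValueError("bytes_per_cycle must be positive")
    # MB/s divided by bytes/cycle gives millions of cycles per second (MHz).
    needed_mhz = available_mbytes_per_s / bytes_per_cycle
    for mhz, volts in OPERATING_POINTS:
        if mhz >= needed_mhz:
            return mhz, volts
    return OPERATING_POINTS[-1]  # the network outpaces the node: run flat out


def relative_dynamic_power(mhz, volts):
    """Dynamic power relative to the fastest point, using P ~ f * V^2."""
    top_mhz, top_volts = OPERATING_POINTS[-1]
    return (mhz * volts ** 2) / (top_mhz * top_volts ** 2)


# The network can deliver 1200 MB/s to a node that consumes 4 bytes per cycle,
# so 300 MHz would suffice and the 400 MHz point is selected.
point = pick_operating_point(1200, 4)
saving = 1.0 - relative_dynamic_power(*point)
```

With these made-up numbers the node runs at half its top clock and at a lower voltage, so its estimated dynamic power drops to roughly a third of the maximum while it still keeps up with the data the network can supply.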

Design-time specialization is another facet of NoC design, and it is relevant to application-specific and platform SoCs. Whereas macroscopic networks emphasize general-purpose communication and modularity, in NoCs these constraints are less restrictive because most on-chip solutions are proprietary.

Thus, an NoC implementation may separate data from control, and use arbitrary bus widths and control flow schemes. Such flexibility needs to be mitigated at the NoC boundary, that is, where the communication infrastructure connects to end nodes (e.g., processors).

Existing standards like the Open Core Protocol (OCP) are extremely useful in defining the interface between processor/storage arrays and NoCs. Interestingly enough, the flexibility in tailoring the NoC to the specific application can be used effectively to design low-energy communication schemes.

To read Part 1, go to “Why on-chip networking?”
Next in Part 3: Once over lightly – a survey of the issues related to NoC design.

Used with the permission of the publisher, Newnes/Elsevier, this series of six articles is based on material from “Networks On Chips: Technology and Tools,” by Luca Benini and Giovanni De Micheli.

Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is professor and director of the Integrated Systems Center at EPF in Lausanne, Switzerland.

References
[1] A. Adriahantenaina, H. Charlery, A. Greiner, L. Mortiez and C. Zeferino, “SPIN: A Scalable, Packet Switched, On-Chip Micro-network,” DATE – Design, Automation and Test in Europe Conference and Exhibition, 2003, pp. 70–73.
[2] A.H. Ajami, K. Banerjee and M. Pedram, “Modeling and Analysis of Nonuniform Substrate Temperature Effects on Global ULSI Interconnects,” IEEE Transactions on CAD, Vol. 24, No. 6, June 2005, pp. 849–861.
[3] H. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, Upper Saddle River, NJ, 1990.
[4] L. Benini, A. Bogliolo and G. De Micheli, “A Survey of Design Techniques for System-Level Dynamic Power Management,” IEEE Transactions on Very Large-Scale Integration Systems, Vol. 8, No. 3, June 2000, pp. 299–316.
[5] W.O. Cesario, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, L. Gauthier, M. Diaz-Nava and A.A. Jerraya, “Multiprocessor SoC Platforms: A Component-Based Design Approach,” IEEE Design and Test of Computers, Vol. 19, No. 6, November–December 2002, pp. 52–63.
[6] W. Dally and B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann, San Francisco, CA, 2004.
[7] W. Dally and B. Towles, “Route Packets, Not Wires: On-Chip Interconnection Networks,” Proceedings of the 38th Design Automation Conference, 2001.
[8] W.J. Dally and H. Aoki, “Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels,” IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 4, April 1993, pp. 466–475.
[9] W. Dally and C. Seitz, “The Torus Routing Chip,” Distributed Processing, Vol. 1, 1996, pp. 187–196.
[10] M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi and L. Benini, “Xpipes: A Latency Insensitive Parameterized Network-on-Chip Architecture for Multiprocessor SoCs,” International Conference on Computer Design, 2003, pp. 536–539.
[11] D. Ernst, S. Das, S. Lee, D. Blaauw, T. Austin, T. Mudge, N. S. Kim and K. Flautner, “Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation,” IEEE Micro, Vol. 24, No. 6, November–December 2004, pp. 10–20.
[12] W. Dally and J. Poulton, Digital Systems Engineering, Cambridge University Press, Cambridge, MA, 1998.
[13] J. Duato, S. Yalamanchili and L. Ni, Interconnection Networks: An Engineering Approach, Morgan Kaufmann, San Francisco, CA, 2003.
[14] T. Dumitra, S. Kerner and R. Marculescu, “Towards On-Chip Fault-Tolerant Communication,” ASPDAC – Proceedings of the Asian-South Pacific Design Automation Conference, 2003, pp. 225–232.
[15] S. Goel, K. Chiu, E. Marinissen, T. Nguyen and S. Oostdijk, “Test Infrastructure Design for the Nexperia Home Platform PNX8550 System Chip,” DATE – Proceedings of the Design Automation and Test Europe Conference, 2004.
[16] K. Goossens, J. van Meerbergen, A. Peeters and P. Wielage, “Networks on Silicon: Combining Best Efforts and Guaranteed Services,” Design Automation and Test in Europe Conference, 2002, pp. 423–427.
[17] R. Hegde and N. Shanbhag, “Toward Achieving Energy Efficiency in Presence of Deep Submicron Noise,” IEEE Transactions on VLSI Systems, Vol. 8, No. 4, August 2000, pp. 379–391.
[18] R. Ho, K. Mai and M. Horowitz, “The Future of Wires,” Proceedings of the IEEE, January 2001.
[19] J. Hu and R. Marculescu, “Energy-Aware Mapping for Tile-Based NOC Architectures Under Performance Constraints,” Asian-Pacific Design Automation Conference, 2003.
[20] F. Karim, A. Nguyen and S. Dey, “On-Chip Communication Architecture for OC-768 Network Processors,” Proceedings of the 38th Design Automation Conference, 2001.
[21] B. Khailany, et al., “Imagine: Media Processing with Streams,” IEEE Micro, Vol. 21, No. 2, 2001, pp. 35–46.
[22] S. Kumar, et al., “A Network on Chip Architecture and Design Methodology,” VLSI on Annual Symposium, IEEE Computer Society ISVLSI 2002.
[23] D. Lackey, P. Zuchowski, T. Bednar, D. Stout, S. Gould and J. Cohn, “Managing Power and Performance for Systems on Chip Design Using Voltage Islands,” ICCAD – International Conference on Computer Aided Design, 2002, pp. 195–202.
[24] P. Lieverse, P. van der Wolf, K. Vissers and E. Deprettere, “A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems,” Journal of VLSI Signal Processing for Signal, Image and Video Technology, Vol. 29, No. 3, 2001, pp. 197–207.
[25] M. Oka and M. Suzuoki, “Designing and Programming the Emotion Engine,” IEEE Micro, Vol. 19, No. 6, November–December 1999, pp. 20–28.
[26] D. Pham, et al., “Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor,” IEEE Journal of Solid-State Circuits, Vol. 41, No. 1, January 2006, pp. 179–196.
[27] A. Pinto, L. Carloni and A. Sangiovanni-Vincentelli, “Constraint-Driven Communication Synthesis,” Design Automation Conference, 2002, pp. 195–202.
[28] K. Skadron, et al., “Temperature-Aware Computer Systems: Opportunities and Challenges,” IEEE Micro, Vol. 23, No. 6, November–December 2003, pp. 52–61.
[29] D. Sylvester and K. Keutzer, “A Global Wiring Paradigm for Deep Submicron Design,” IEEE Transactions on CAD/ICAS, Vol. 19, No. 2, February 2000, pp. 242–252.
[30] R. Tamhankar, S. Murali and G. De Micheli, “Performance Driven Reliable Link for Networks on Chip,” ASPDAC – Proceedings of the Asian Pacific Conference on Design Automation, Shanghai, 2005, pp. 749–754.
[31] T. Theis, “The Future of Interconnection Technology,” IBM Journal of Research and Development, Vol. 44, No. 3, May 2000, pp. 379–390.
[32] E. Waingold, et al., “Baring It All to Software: Raw Machines,” IEEE Computer, Vol. 30, No. 9, September 1997, pp. 86–93.
[33] J. Walrand and P. Varaiya, High-Performance Communication Networks, Morgan Kaufmann, San Francisco, CA, 2000.
[34] M. Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, Upper Saddle River, NJ, 1995.
[35] F. Worm, P. Ienne, P. Thiran and G. De Micheli, “An Adaptive Low-Power Transmission Scheme for On-Chip Networks,” ISSS, Proceedings of the International Symposium on System Synthesis, Kyoto, October 2002, pp. 92–100.
[36] H. Zhang, V. George and J. Rabaey, “Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness,” IEEE Transactions on VLSI Systems, Vol. 8, No. 3, June 2000, pp. 264–272.
[37] International Technology Roadmap for Semiconductors (http://public.itrs.net/)

