The challenges of next-gen multicore networks-on-chip systems: Part 7

Parallel programming critically requires tool support because of its unfamiliarity and intrinsic difficulty. Programming tools are not limited to compilers for efficient executable generation. They should help in systematically finding defects and performance bottlenecks, as well as debugging and testing programs.

Without these tools, parallelism is likely to become an obstacle that reduces developer and tester productivity and makes development more expensive and of lower quality.

Compilation
Compilers for high-performance processors have traditionally focused on ILP extraction. This approach has also been followed in embedded computing for signal and media processing. These application domains are characterized by vast amounts of ILP and word-level parallelism, as well as by regular data access patterns.

For this reason, processor architectures for signal and media processing usually feature a large number of parallel execution units and follow architectural templates reminiscent of general-purpose VLIW and vector processors [35]. These processors heavily rely on compiler support to keep execution units busy and achieve massive speedups over single-instruction pipelined execution.

Compilers for data-parallel processors have evolved very rapidly, and they feature aggressive allocation and scheduling algorithms that achieve utilizations comparable to hand-optimized code [36].

A survey of this interesting field is outside the scope of the chapter, and the reader is referred to one of the many surveys and textbooks available [37, 38].

One interesting point is worth mentioning, however. Even within a single-processor engine, architectures are becoming increasingly interconnect-dominated. More specifically, the communication between execution units and the register file is the communication bottleneck that limits scalability in the number of functional units that can be successfully operated in parallel.

For this reason, processor architectures are evolving toward clustered organizations, with multiple register files and complex interconnects.

These in-processor interconnects, also called scalar operand networks [39], have strong similarities with chip-level NoCs, in that they need to be scalable and to provide huge bandwidth.

However, since their main purpose is to deliver instruction operands to functional units and results to register files, their latency budget is extremely tight (e.g., one or two clock cycles). Hence, computation and communication cannot be decoupled via a network interface and its connection-oriented services, which would impose unacceptable latency penalties.

Consequently, scalar operand networks are tightly coupled with the instruction set architecture, and they are viewed as explicit resources during compilation. Communication links are allocated and scheduled exactly like functional units and registers [40].

From the topological viewpoint, scalar operand networks are usually full or partial crossbars, even though multi-hop topologies will most likely appear in the future. As the focus of this book is on chip-level NoCs, we refer the interested reader to the many interesting references in the literature that deal with scalar operand network architectures and scaling trends [39].

Coarse-grain parallelism extraction
While compilers for ILP extraction are now mature and successful in industrial applications, the massive amount of research work on TLP discovery and extraction has not enjoyed a similar degree of success.

The most common approaches in industrial practice rely on explicit directives, inserted by the programmer to drive the compiler toward good task-level parallelizations. This pragmatic semi-automated approach can be very successful, even though its implementation is quite challenging.

One of the main issues in semi-automated TLP is the difficulty programmers face in understanding and using program annotations for parallelism while specifying computation in a sequential fashion. In other words, if the programming language has an implicit sequential semantics, the programmer is required to manage continuous shifts between a sequential and a parallel programming model, which ultimately leads to inefficient or incorrect code.

As a result, the learning curve for the programmer is quite steep and parallelization is done as an afterthought, by modifying sequential code. This process is inefficient and often leads to sub-optimal results.

An example of a semi-automated approach is OpenMP [41]. OpenMP provides a set of pragmas that aim at giving the compilation system directives on how to automatically generate a multi-threaded execution flow.

The programmer inserts pragmas, such as #pragma omp parallel, immediately before loops that should be parallelized, and the compilation system decides how to split the target loop into parallel threads that are then distributed among the available processors.
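
As a minimal sketch (not from the book), the fragment below shows this style of annotation in C; compiled with an OpenMP-enabled compiler (e.g., with GCC's -fopenmp flag), the runtime splits the marked loop across the available threads:

```c
#include <stdio.h>
#include <omp.h>

#define N 1024

int main(void)
{
    static float a[N], b[N], c[N];

    /* Sequential initialization of the input vectors. */
    for (int i = 0; i < N; i++) {
        a[i] = (float)i;
        b[i] = (float)(N - i);
    }

    /* The pragma tells the compiler that the iterations of the next loop
       are independent; the runtime distributes them among a team of
       threads and inserts the final barrier automatically. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i];

    printf("up to %d threads, c[42] = %f\n", omp_get_max_threads(), c[42]);
    return 0;
}
```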

Thus, the programmer is relieved of the burden of managing data sharing and synchronization, while the compiler can focus its efforts where the programmer expects the greatest advantage from parallelization. OpenMP is gaining acceptance in high-performance chip multi-processors, such as IBM's Cell.

Example. Cell uses OpenMP pragmas to guide parallelization decisions. It then uses the pre-existing parallelization infrastructure of the proprietary IBM XL compiler. In the first pass of the compiler, machine-independent optimizations are applied to the functions outlined by the programmer. In the second pass of optimization, the target functions are cloned, and optimized code is generated for both the master processor (the PPE) and the slave data-parallel processors (the SPEs).

The master thread runs on the PPE processor and manages work sharing and synchronization. Note that cloning of the call sub-graph of the target functions occurs in the second pass of optimization, after interprocedural analysis. Thus, the compiler can link the appropriate versions of all the library function calls for the SPE and the PPE.

An important issue is the limited-size local memory of the SPEs, which must accommodate both code and data. Hence, there is always a possibility that a single SPE-allocated block will be too large to fit. Since OpenMP is generally used to drive parallelization of tight loop nests, code spilling is not very common. Data spilling, instead, is a much more critical problem.

Fortunately, data is managed explicitly, and the optimizing compiler can in principle insert all the necessary data manipulation functions (e.g., DMA-accelerated data transfers) in an automated fashion. This is, however, easier said than done, and in reality sub-optimal data organization choices cannot be fully remedied by background DMA-based memory transfers.
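
As an illustration only (dma_get() and dma_wait() below are hypothetical placeholders standing in for whatever DMA primitives the target provides, not a real Cell API), a compiler or programmer can hide transfer latency by double-buffering: the next block is fetched in the background while the current one is being processed:

```c
#define BLOCK 256          /* elements per local-store block (assumed size) */

/* Hypothetical DMA helpers: start an asynchronous copy from shared
   memory into local storage, and wait for a tagged transfer to finish. */
extern void dma_get(float *local, const float *remote, int n, int tag);
extern void dma_wait(int tag);

/* The actual computation on one locally resident block. */
extern void process(float *block, int n);

static float buf[2][BLOCK];      /* two local buffers for double-buffering */

void consume(const float *remote, int nblocks)
{
    int cur = 0;
    dma_get(buf[cur], remote, BLOCK, cur);              /* prefetch block 0 */

    for (int b = 0; b < nblocks; b++) {
        int next = cur ^ 1;
        if (b + 1 < nblocks)                            /* start next transfer early */
            dma_get(buf[next], remote + (b + 1) * BLOCK, BLOCK, next);

        dma_wait(cur);                                  /* block b is now local */
        process(buf[cur], BLOCK);                       /* compute while next block moves */
        cur = next;
    }
}
```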

The critical challenge for the success of OpenMP and similar approaches lies in the efficient implementation of the parallelization step. In an NoC architecture, communication latencies and network topology should be accounted for when deciding the degree of parallelization of a loop (declared by the programmer as "parallelizable").

In other words, it may be more efficient to parallelize the loop to a limited degree if the resulting parallel threads can then be mapped onto a set of topologically close computational units.

Aggressive parallelization may be inefficient if it requires long-range data communication across many network switches. Clearly, a significant amount of research and development work is required to develop NoC-aware (or, more generally, communication-aware) software parallelization tools that can account for NoC communication latencies and, at the same time, interact with NoC mapping to influence the way parallel threads are allocated onto the network topology.
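
To make the trade-off concrete, here is a deliberately simplified, hypothetical cost model (not from the book): parallel run time is approximated as the sequential work divided by the thread count, plus a communication term that grows with the average hop distance between the cores the threads land on. All constants and figures are illustrative assumptions.

```c
#include <stdio.h>

/* Estimated execution time (in cycles) of a parallelizable loop when it is
   split over 'threads' cores whose average NoC distance is 'avg_hops'.
   All constants are illustrative assumptions, not measured values. */
static double est_time(double seq_cycles, int threads,
                       double avg_hops, double words_exchanged)
{
    const double cycles_per_hop = 4.0;     /* assumed per-hop link+switch latency */
    double compute = seq_cycles / threads;
    double comm    = words_exchanged * avg_hops * cycles_per_hop;
    return compute + comm;
}

int main(void)
{
    /* Two candidate mappings: 4 nearby cores vs. 16 scattered cores.
       The wider mapping moves more data over longer paths and loses. */
    printf("4 close cores : %.0f cycles\n", est_time(1e6,  4, 1.0, 2e4));
    printf("16 far cores  : %.0f cycles\n", est_time(1e6, 16, 4.0, 8e4));
    return 0;
}
```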

At the same time, OpenMP appears somewhat limited in its expressiveness with respect to the important issue of data allocation and partitioning: in other words, it is based on a uniform memory access abstraction which does not fit well with NoC-based distributed architectures.

Testing and debugging
Program testing and debugging is of the utmost importance. In fact, application parallelism introduces new types of programming errors, in addition to the usual ones found in sequential code. Incorrect synchronization can cause data races and livelocks which are extremely difficult to find and to understand, since they appear non-deterministic and hard to reproduce.

Conventional debugging, based on breakpointing and re-execution, does not work in this context, because error manifestation depends on subtle timing relationships between parallel tasks. The simple insertion of a breakpoint or of instrumentation code changes the timing of a program and may completely hide many serious bugs. Even if the problem can be reproduced, its diagnosis can be extremely difficult, since the state of the system is distributed and extremely difficult to analyze.
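
A minimal sketch of such a defect: two threads increment a shared counter without synchronization, so the final value depends on how the read-modify-write sequences interleave and typically changes from run to run. Inserting a breakpoint or a print statement in the loop can perturb the timing enough to hide the bug.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared, unprotected state */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                     /* non-atomic read-modify-write: data race */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Expected 2000000, but lost updates make the result vary between runs. */
    printf("counter = %ld\n", counter);
    return 0;
}
```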

Language support for the development of correct-by-construction programs is therefore as critical as language support for efficient compilation. Low-level hardware-related features for communication and synchronization should not be exposed to the programmer, not only because of code portability issues, but above all to avoid the creation of intricate program logic that becomes impossible to debug.

Clearly, even though well-designed languages can greatly help in writing correct code, systematic defect detection tools are extremely valuable. One promising approach is to use static program analysis to systematically explore all possible execution traces of a program. In this way it becomes possible to find errors that are hidden very deep in the program logic and are almost impossible to reproduce by standard testing.

Similar techniques are used with an increasing degree of success in hardware verification, and are known as "formal verification" techniques. Hardware is highly parallel and complex, and it is notoriously hard to verify, but unfortunately, software verification is even more difficult.

One of the main challenges is the size of the state space (which relates to the number of possible execution traces) of programs. Its size is formidable, especially if we consider parallel programs (remember that the size of the state space of two programs running in parallel is the product of the two components' state spaces). Hence, its complete exploration is hopeless, and in many cases we must limit ourselves to formally checking only some program properties [31].

Since formal software verification is not a panacea, developers critically require traditional debuggers to explore and understand the complex behavior of their parallel programs. Two main approaches are used in this area.

First, tracing and logging tools aim at keeping track of, and tagging, messages and shared data accesses by the various threads, allowing the developer to directly observe a parallel program's partially ordered execution. Important features that require further development are the ability to follow causality trails across threads (such as chains of accesses to distributed objects), to replay messages in queues and reorder them if necessary, and to step through asynchronous call patterns (including callbacks).
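
As a hedged sketch of this first approach, each logged event can be tagged with a thread id and a Lamport-style logical timestamp, so that a post-processing tool can reconstruct the partial order of sends, receives and shared-data accesses. The log_event() sink and the fixed thread count below are assumptions for illustration only.

```c
#include <stdio.h>

#define MAX_THREADS 16

/* One Lamport clock per thread; each entry is written only by its owner. */
static long lclock[MAX_THREADS];

/* Hypothetical trace sink; a real tool would write to a per-core buffer. */
static void log_event(int tid, const char *what, long stamp)
{
    fprintf(stderr, "T%d %s @ %ld\n", tid, what, stamp);
}

/* Tag a local event (e.g., a shared-data access or a message send) and
   return the timestamp to piggy-back on outgoing messages. */
long trace_local(int tid, const char *what)
{
    long stamp = ++lclock[tid];
    log_event(tid, what, stamp);
    return stamp;
}

/* On message receipt, merge the sender's timestamp so that causally
   related events keep their order in the reconstructed trace. */
long trace_receive(int tid, long sender_stamp)
{
    if (lclock[tid] < sender_stamp)
        lclock[tid] = sender_stamp;
    return trace_local(tid, "recv");
}
```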

The second approach is reverse execution, which permits a programmer to back up in a program's execution history and re-execute some code. This technique is expensive and complex, but it is becoming viable thanks to the increased capabilities of execution cores, which have a large amount of shadow storage and are increasingly used as virtual processors, emulating well-known instruction sets via microcode.

Software testing will also have to undergo substantial extensions. Parallel programs have non-deterministic behaviors and are therefore much more difficult to test. Basic code coverage metrics (e.g., branch or statement coverage) will have to be extended to take into account multiple concurrently executing threads.

Single-thread coverage is very optimistic, as it completely neglects the notion of global state required by multiple execution threads. Thus, stress tests will have to be augmented by more systematic semi-formal techniques (e.g., model-checking-like assertions).
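
A hedged sketch of what such an assertion might look like for a shared bounded buffer: the invariant relates fields updated by different threads, which is exactly the global state that single-thread coverage never exercises; a stress test or a model-checking tool would evaluate it after every interleaved step. The data structure and helper below are illustrative assumptions, not a specific tool's API.

```c
#include <assert.h>

#define CAPACITY 8

/* Shared bounded-buffer state, updated by producer and consumer threads.
   Indices are assumed to be kept in the range [0, CAPACITY); the
   synchronization itself is omitted, only the checked invariant is shown. */
struct ring {
    int head;     /* next slot the producer writes */
    int tail;     /* next slot the consumer reads  */
    int count;    /* items currently stored        */
};

/* Global-state invariant: it constrains values produced by *different*
   threads, so only interleaved executions can violate it. */
static void check_invariant(const struct ring *r)
{
    assert(r->count >= 0 && r->count <= CAPACITY);
    assert((r->tail + r->count) % CAPACITY == r->head);
}
```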

In embedded systems, system simulation and emulation techniques are extremely useful and relevant for debugging. Simulation relies on the concept of virtual platforms [43], as exemplified in Part 4, which are run on powerful host machines.

When a parallel application is run on a virtual platform, its execution can be very carefully monitored, stopped and rolled back in ways that are not possible when debugging on the target hardware, and the developer has much more control over the fine-grain details of the execution.

The main shortcoming of virtual platforms is that they execute at orders-of-magnitude slower rates than actual hardware, and reproducing complex incorrect behaviors with billion-instruction traces may require a huge amount of time.

Emulation platforms greatly accelerate simulation speed, but they are extremely expensive and traditionally they have been used only for hardware debugging. However, new approaches to affordable MPSoC platform emulation are emerging that show promise [42].

Summary
Software abstractions and software development support for NoCs are still immature, as the corresponding hardware architectures are just now emerging from the laboratories. It should be clear, however, that NoCs will be successful in practice only if they can be efficiently programmed. For this reason we believe that the topics surveyed here will witness significant growth in research interest and results in the next few years.

To read Part 1, go to Why on-chip networking?
To read Part 2, go to SoC objectives and NoC needs.
To read Part 3, go to Once over lightly.
To read Part 4, go to Networks-on-chip programming issues.
To read Part 5, go to Task-Level Parallel Programming.
To read Part 6, go to Communications-exposed Programming.

Used with the permission of the publisher, Newnes/Elsevier, this series of seven articles is based on material from "Networks on Chips: Technology and Tools," by Luca Benini and Giovanni De Micheli.

Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna, Italy. Giovanni De Micheli is professor and director of the Integrated Systems Center at EPFL in Lausanne, Switzerland.

For additional resources about multi-core NoC and SoC hardware and software design on Embedded.com, go to More on Multicores and Multiprocessors.

References
[1] F. Boekhorst, "Ambient Intelligence, the Next Paradigm for Consumer Electronics: How will it Affect Silicon?," International Solid-State Circuits Conference, Vol. 1, 2002, pp. 28-31.

[2] G. Declerck, "A Look into the Future of Nanoelectronics," IEEE Symposium on VLSI Technology, 2005, pp. 6-10.

[3] W. Weber, J. Rabaey and E. Aarts (Eds.), Ambient Intelligence, Springer, Berlin, Germany, 2005.

[4] S. Borkar, et al., "Platform 2015: Intel Processor and Platform Evolution for the Next Decade," Intel White Paper, 2005 (PDF).

[5] D. Culler and J. Singh, Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Publishers, 1999.

[6] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 3rd edition, Morgan Kaufmann Publishers, 2003.

[7] Philips Semiconductor, Philips Nexperia Platform.

[8] M. Rutten, et al., "Eclipse: Heterogeneous Multiprocessor Architecture for Flexible Media Processing," International Conference on Parallel and Distributed Processing, 2002, pp. 39-50.

[9] ARM Ltd, MPCore Multiprocessors Family.

[10] B. Ackland, et al., "A Single Chip, 1.6 Billion, 16-b MAC/s Multiprocessor DSP," IEEE Journal of Solid State Circuits, Vol. 35, No. 3, 2000, pp. 412-424.

[11] G. Strano, S. Tiralongo and C. Pistritto, "OCP/STBUS Plug-in Methodology," GSPx Conference, 2004.

[12] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini and R. Zafalon, "Analyzing On-Chip Communication in an MPSoC Environment," Design and Test in Europe Conference (DATE), 2004, pp. 752-757.

[13] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor and J. M. Mendias, "An Integrated Hardware/Software Approach for Run-Time Scratch-pad Management," Design Automation Conference, Vol. 2, 2004, pp. 238-243.

[14] D. Pham, et al., "The Design and Implementation of a First-generation CELL Processor," IEEE International Solid-State Circuits Conference, Vol. 1, 2005, pp. 184-592.

[15] L. Hammond, et al., "The Stanford Hydra CMP," IEEE Micro, Vol. 20, No. 2, 2000, pp. 71-84.

[16] L. Barroso, et al., "Piranha: A Scalable Architecture Based on Single-chip Multiprocessing," International Symposium on Computer Architecture, 2000, pp. 282-293.

[17] Intel Semiconductor, IXP2850 Network Processor.

[18] STMicroelectronics, Nomadik Platform.

[19] Texas Instruments, OMAP5910 Platform.

[20] M. Banikazemi, R. Govindaraju, R. Blackmore and D. Panda, "MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems," IEEE Transactions on Parallel and Distributed Systems, Vol. 12, No. 10, 2001, pp. 1081-1093.

[21] W. Lee, W. Dally, S. Keckler, N. Carter and A. Chang, "An Efficient Protected Message Interface," IEEE Computer, Vol. 31, No. 11, 1998, pp. 68-75.

[22] U. Ramachandran, M. Solomon and M. Vernon, "Hardware Support for Interprocess Communication," IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 3, 1990, pp. 318-329.

[23] H. Arakida, et al., "A 160 mW, 80 nA Standby, MPEG-4 Audiovisual LSI with 16 Mb Embedded DRAM and a 5 GOPS Adaptive Post Filter," IEEE International Solid-State Circuits Conference, 2003, pp. 62-63.

[24] F. Gilbert, M. Thul and N. Wehn, "Communication Centric Architectures for Turbo-decoding on Embedded Multiprocessors," Design and Test in Europe Conference, 2003, pp. 356-361.

[25] S. Hand, A. Baghdadi, M. Bonacio, S. Chae and A. Jerraya, "An Efficient Scalable and Flexible Data Transfer Architecture for Multiprocessor SoC with Massive Distributed Memory," Design Automation Conference, 2004, pp. 250-255.

[26] P. Paulin, C. Pilkington and E. Bensoudane, "StepNP: A System-level Exploration Platform for Network Processors," IEEE Design and Test of Computers, Vol. 19, No. 6, 2002, pp. 17-26.

[27] P. Stenström, "A Survey of Cache Coherence Schemes for Multiprocessors," IEEE Computer, Vol. 23, No. 6, 1990, pp. 12-24.

[28] M. Tomasevic and V. M. Milutinovic, "Hardware Approaches to Cache Coherence in Shared-Memory Multiprocessors," IEEE Micro, Vol. 14, No. 5-6, 1994, pp. 52-59.

[29] I. Tartalja and V. M. Milutinovic, "Classifying Software-Based Cache Coherence Solutions," IEEE Software, Vol. 14, No. 3, 1997, pp. 90-101.

[30] P. Kongetira, K. Aingaran and K. Olukotun, "Niagara: A 32-way Multithreaded Sparc Processor," IEEE Micro, Vol. 25, No. 2, 2005, pp. 21-29.

[31] H. Sutter and J. Larus, "Software and the Concurrency Revolution," ACM Queue, Vol. 3, No. 7, 2005, pp. 54-62.

[32] R. Stephens, "A Survey of Stream Processing," Acta Informatica, Vol. 34, No. 7, 1997, pp. 491-541.

[33] E. Lee and D. Messerschmitt, "Pipeline Interleaved Programmable DSP's: Synchronous Data Flow Programming," IEEE Transactions on Signal Processing, Vol. 35, No. 9, 1987, pp. 1334-1345.

[34] W. Thies, et al., "Language and Compiler Design for Streaming Applications," IEEE International Parallel and Distributed Processing Symposium, 2004, p. 201.

[35] Silicon Hive, AVISPA-CH.

[36] M. Bekooij, Constraint Driven Operation Assignment for Retargetable VLIW Compilers, Ph.D. Dissertation, 2004.

[37] R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Morgan Kaufmann, 2001.

[38] P. Faraboschi, J. Fisher and C. Young, "Instruction Scheduling for Instruction Level Parallel Processors," Proceedings of the IEEE, Vol. 89, No. 11, 2001, pp. 1638-1659.

[39] M. Taylor, W. Lee, S. Amarasinghe and A. Agarwal, "Scalar Operand Networks," IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 2, 2005, pp. 145-162.

[40] P. Mattson, et al., "Communication Scheduling," ACM Conference on Architectural Support for Programming Languages and Operating Systems, 2000, pp. 82-92.

[41] M. Sato, "OpenMP: Parallel Programming API for Shared Memory Multiprocessors and On-chip Multiprocessors," IEEE International Symposium on System Synthesis, 2002, pp. 109-111.

[42] N. Genko, D. Atienza, G. De Micheli, J. Mendias, R. Hermida and F. Catthoor, "A Complete Network-on-chip Emulation Framework," Design, Automation and Test in Europe, Vol. 1, 2005, pp. 246-251.

[43] C. Shin, et al., "Fast Exploration of Parameterized Bus Architecture for Communication-centric SoC Design," Design, Automation and Test in Europe, Vol. 1, 2004, pp. 352-357.

[44] M. Michael, "Scalable Lock-free Dynamic Memory Allocation," ACM Conference on Programming Languages Design and Implementation, 2004, pp. 110-122.
