Techniques for Designing Energy-Aware MPSoCs: Part 4

Network architecture heavily influences communication energy. As hinted in the previous section, shared-medium networks (busses) are currently the most common choice, but it is intuitively clear that busses are not energy-efficient as network size scales up, as it must for MPSoCs.

In bus-based communication, data are always broadcast from one transmitter to all possible receivers, whereas in most cases messages are destined to reach only one receiver, or a small group. Bus contention, with the related arbitration overhead, further contributes to the energy overhead.

Studies on energy-efficient on-chip communication indicate that hierarchical and heterogeneous architectures are much more energy-efficient than busses. Zhang et al. [65] developed a hierarchical generalized mesh whereby network nodes with a high communication bandwidth requirement are clustered and connected through a programmable generalized mesh consisting of several short communication channels joined by programmable switches.

Clusters are then connected through a generalized mesh of global long communication channels. Clearly such an architecture is heterogeneous because the energy cost of intra-cluster communication is much smaller than that of inter-cluster communication.

Although the work of Zhang et al. [65] demonstrates that power can be saved by optimizing network architecture, many network design issues are still open, and we need tools and algorithms to explore the design space and to tailor network architecture to specific applications or classes of applications.

Network architecture is only one facet of network layer design, the other major facet being network control. A critical issue in this area is the choice of a switching scheme for indirect network architectures.

From the energy viewpoint, the tradeoff is between the cost of setting up a circuit-switched connection once and for all, and the overhead of switching packets throughout the entire communication time on a packet-based connection. In the former case the network control overhead is “lumped” and incurred once, whereas in the latter case, it is distributed over many small contributions, one for each packet.

When communication flow between network nodes is extremely persistent and stationary, circuit-switched schemes are likely to be preferable, whereas packet-switched schemes should be more energy-efficient for irregular and non-stationary communication patterns. Needless to say, circuit switching and packet switching are just two extremes of a spectrum, with many hybrid solutions in between.
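The lumped-versus-distributed tradeoff can be made concrete with a minimal sketch. It assumes a deliberately simple linear model that is not taken from the cited studies: a one-time setup energy for the circuit, a fixed control energy per packet for packet switching, and the same data-transfer energy per packet in both schemes. All function and parameter names are hypothetical, and the units are arbitrary.

```python
def circuit_switched_energy(setup_energy, data_energy_per_packet, num_packets):
    """Control overhead is lumped: paid once when the circuit is set up."""
    return setup_energy + data_energy_per_packet * num_packets


def packet_switched_energy(control_energy_per_packet, data_energy_per_packet, num_packets):
    """Control overhead is distributed: a small contribution for every packet."""
    return (control_energy_per_packet + data_energy_per_packet) * num_packets


def break_even_packets(setup_energy, control_energy_per_packet):
    """Flow length (in packets) above which the lumped setup cost is amortized."""
    return setup_energy / control_energy_per_packet


# Hypothetical numbers: a short, irregular flow versus a long, persistent one.
short_flow, long_flow = 10, 100
print(circuit_switched_energy(100.0, 1.0, short_flow),
      packet_switched_energy(4.0, 1.0, short_flow))
print(circuit_switched_energy(100.0, 1.0, long_flow),
      packet_switched_energy(4.0, 1.0, long_flow))
```

Under this model, flows shorter than the break-even length favor packet switching and longer ones favor circuit switching, which matches the qualitative argument above. Real networks add contention, buffering, and utilization effects that the sketch ignores.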

Packetization and Energy Efficiency
The current trend in on-chip network design is toward packet-switched architectures. This choice is mostly driven by the need to utilize limited global wiring resources as efficiently as possible. (It is well known that circuit-switched networks achieve lower utilization than packet-switched networks.)

Ye et al. [67] have studied the impact of packet size on energy and performance for a regular on-chip network (a mesh) connecting homogeneous processing elements, in a cache-coherent MPSoC. Packetized data flow on the MPSoC network is affected by: (1) the number of packets in the network, (2) the energy consumed by each packet on one hop, and (3) the number of hops each packet travels.

Different packetization schemes will have different impacts on these factors and, consequently, affect the network power consumption. Several findings reported in Ye et al. [67] are summarized below.

Larger packet size will increase the energy consumed per packet, because there are more bits in the payload. Furthermore, larger packets will occupy the intermediate node switches for a longer time and cause other packets to be re-routed to non-shortest datapaths.

This leads to more contention that will increase the total number of hops needed for packets traveling from source to destination. As a result, as packet size increases, energy consumption on the interconnect network will increase.

Although an increase in packet size will increase the energy dissipated by the network, it will decrease the energy consumption in cache and memory. Because larger packet sizes will decrease the cache miss rate, both cache energy consumption and memory energy consumption will be reduced.

The total energy dissipated by the MPSoC comes from non-cache instructions (instructions that do not involve cache access) executed by each processor, the caches, and the shared memories, as well as the interconnect network. In order to assess the impact of packet size on the total system energy consumption, all MPSoC energy contributors should be considered together.

Ye et al. [67] showed that the overall MPSoC energy initially decreases as packet size increases. However, when the packets are too large, the total MPSoC energy starts increasing again. This is because the growth in interconnect network energy outweighs the energy savings in cache, memory, and non-cache instructions.

As a result, there is an optimal packet size, which, for the architecture analyzed by Ye et al., is around 128 bytes. It is important to stress, however, that different processor architectures and system interconnects may have different optimal packet sizes.
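The existence of an optimum can be illustrated with a toy sweep. The two component curves below are purely hypothetical stand-ins (network energy growing linearly with packet size, cache and memory energy falling inversely with it); they are not the models or measurements of Ye et al. [67], and the coefficients were chosen only so that the curve has an interior minimum, not to reproduce the 128-byte result.

```python
def total_energy(packet_bytes, network_energy, memory_energy):
    """Sum the energy contributors that depend on packet size."""
    return network_energy(packet_bytes) + memory_energy(packet_bytes)


def best_packet_size(candidates, network_energy, memory_energy):
    """Return the candidate packet size with the lowest total energy."""
    return min(candidates,
               key=lambda size: total_energy(size, network_energy, memory_energy))


# Hypothetical component models in arbitrary energy units (assumptions only).
def toy_network_energy(packet_bytes):
    return 1.0 * packet_bytes        # grows with payload size and contention


def toy_memory_energy(packet_bytes):
    return 4096.0 / packet_bytes     # shrinks as the cache miss rate drops


sizes = [16, 32, 64, 128, 256]
for size in sizes:
    print(size, total_energy(size, toy_network_energy, toy_memory_energy))
print("best:", best_packet_size(sizes, toy_network_energy, toy_memory_energy))
```

In practice the component functions would come from simulation or measurement of the specific processors, caches, and interconnect, which is why the optimum differs from one architecture to another.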

Energy and Reliability in Packet-Switched NoCs
As MPSoC communication architectures evolve from shared busses to packet-based micro-networks, packetization issues have to be considered when assessing the energy efficiency of error recovery techniques. In general, packets are often broken into message flow control units or flits.

In the presence of channel width constraints, multiple physical channel cycles may be used to transfer a single flit. A phit is the unit of information that can be transferred across a physical channel in a single cycle.

Flits represent logical units of information, as opposed to phits, which correspond to physical quantities, i.e., the number of bits that can be transferred in a single cycle. In many implementations, a flit is set to be equal to a phit.
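The relationship between packets, flits, and phits reduces to two ceiling divisions. The following sketch uses hypothetical helper names and example widths chosen only for illustration.

```python
def ceil_div(numerator, denominator):
    """Integer ceiling division without floating-point rounding."""
    return -(-numerator // denominator)


def flits_per_packet(packet_bits, flit_bits):
    """Number of flow control units needed to carry one packet."""
    return ceil_div(packet_bits, flit_bits)


def cycles_per_flit(flit_bits, phit_bits):
    """Physical channel cycles needed to move one flit (one phit per cycle)."""
    return ceil_div(flit_bits, phit_bits)


def cycles_per_packet(packet_bits, flit_bits, phit_bits):
    """Channel cycles needed to move a whole packet across one link."""
    return flits_per_packet(packet_bits, flit_bits) * cycles_per_flit(flit_bits, phit_bits)


# Example: a 1024-bit packet, 64-bit flits, and a 32-bit-wide physical channel.
print(flits_per_packet(1024, 64), cycles_per_flit(64, 32), cycles_per_packet(1024, 64, 32))
```

When the flit is set equal to the phit, `cycles_per_flit` returns 1 and the number of channel cycles per packet equals the number of flits.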

Communication reliability can be guaranteed at different levels of granularity. We might apply control bits (for example, a checksum) to an entire packet, thus minimizing control-bit overhead, although this would prevent us from stopping the propagation of corrupted flits, because routing decisions would be taken before data integrity is checked. In fact, control bits would be transmitted as the last flit of the packet.

In this scenario, the cost for redundancy would be paid in the time domain. The alternative solution is to provide reliability at the flit level, thus refining control granularity but paying for redundancy in the space domain (additional wiring resources for check bits). The considerations that follow will refer to this latter scenario, wherein two different solutions are viable:

1) The error recovery strategy can be distributed over the network. Each communication switch is equipped with error detecting/correcting circuitry, so that error propagation can be immediately stopped. This is the only way to avoid routing errors: should the header get corrupted, its correct bit configuration could be immediately restored, preventing the packet from being forwarded across the wrong path to the wrong destination.

Unfortunately, retransmission-oriented schemes need power-hungry buffering resources at each switch, so their advantage in terms of higher detection capability has to be paid for with circuit complexity and power dissipation.

2) Alternatively, an end-to-end approach to error recovery is feasible: only end-nodes are able to perform error detection/correction. In this case, retransmission may not be convenient at all, especially when source and destination are far apart from each other, and retransmitting corrupted packets would stimulate a large number of transitions, in addition to causing large delays.

For this scenario, error correction is the most efficient solution, even though the proper course of action has to be taken to handle incorrectly routed packets (retransmission time-outs at the source node, deadlock avoidance, and so on).

Another consideration regards the way retransmissions are carried out in an NoC. Traditional shared bus architectures can be modified to perform retransmissions in a “stop and wait” fashion: the master drives the data bus and waits for the slave to carry out sampling on one of the following clock edges.

If the slave detects corrupted data, feedback has to be given to the master, scheduling a retransmission. For this purpose, an additional feedback line can be used, or built-in mechanisms of the bus protocol can be exploited.

In packet-switched networks, data packets transmitted by the master can be seen as a continuous flow, so the retransmission mechanism must be either “go-back-N” or “selective repeat.”

In both cases, each packet has to be acknowledged (ACK), and the difference lies in the receiver (switch or network interface) complexity. In a “go-back-N” scheme, the receiver sends a negative acknowledgment (NACK) to the sender for an incorrectly received packet. The sender reacts by retransmitting the corrupted packet as well as all the packets that follow it in the data flow.

This relieves the receiver of the burden of storing packets received out of order and reconstructing the original sequence. On the other hand, when this capability is available at the receiver side (at the cost of further complexity), retransmissions can be carried out by selectively requesting the corrupted packet, without the need to retransmit the successive packets as well. The tradeoff here is between switch and network interface complexity and the number of transitions on the link lines.
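The difference in link activity between the two schemes can be estimated by counting packet transmissions. The sketch below is a simplified model under stated assumptions, not a protocol implementation: packets are sent in sequence order, a packet listed in `corrupted_first_try` is corrupted only on its first transmission, and `in_flight` is the hypothetical number of later packets the sender has already pushed onto the link when the NACK arrives.

```python
def selective_repeat_transmissions(num_packets, corrupted_first_try):
    """Only the corrupted packets are sent a second time."""
    return num_packets + len(set(corrupted_first_try))


def go_back_n_transmissions(num_packets, corrupted_first_try, in_flight):
    """The corrupted packet and every packet sent after it are sent again."""
    pending = set(corrupted_first_try)   # packets whose first attempt will fail
    sent = 0
    next_seq = 0
    while next_seq < num_packets:
        if next_seq in pending:
            pending.discard(next_seq)
            # Followers already on the link are discarded by the receiver.
            followers = min(in_flight, num_packets - 1 - next_seq)
            sent += 1 + followers
            # Their first attempt is used up, so a retry is assumed to succeed.
            for seq in range(next_seq + 1, next_seq + 1 + followers):
                pending.discard(seq)
            # next_seq is unchanged: the sender goes back and resends from here.
        else:
            sent += 1
            next_seq += 1
    return sent


# Ten packets, packet 3 corrupted, four later packets in flight at NACK time.
print(go_back_n_transmissions(10, {3}, in_flight=4))
print(selective_repeat_transmissions(10, {3}))
```

The gap between the two counts is the extra link switching activity that go-back-N pays in exchange for a receiver that needs no reordering buffer.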

Energy-Aware Software
Although employing energy-efficient circuit/architecture techniques is a must for energy-aware MPSoCs, there are also important roles for the software to play. Perhaps one of the most important of these is to parallelize a given application across on-chip processors.

It is to be noted that, in this context, it may not be acceptable to use a conventional application parallelization strategy from the high-end computing domain since such strategies focus only on performance and do not account for the energy behavior of the optimized code.

More specifically, such high-end computing-oriented techniques can be very extravagant in their use of processors (or machine resources in general), i.e., they tend to use a large number of parallel processors in executing applications even though most of these processors bring only marginal performance benefits.

However, a compiler designed for energy-aware MPSoCs may not have this luxury. In other words, it should strive to strike an acceptable balance between increased parallelism (reduced execution cycles) and increased energy consumption as the number of processors is increased.

We therefore believe that it is very important to determine the ideal number of processors to use in executing a given MPSoC application. In addition, different parts of the same application may demand different numbers of processors to generate the best behavior at run-time.

Consequently, it may not be sufficient to fix the number of processors at the same value throughout the execution. During execution of a piece of code, unused processors can be placed in a low-power state to save energy [68].

Most published work on parallelism for high-end machines is static, that is, the number of processors that execute the code is fixed for the entire execution. For example, if the number of processors that execute a code is fixed at eight, all parts of the code (for example, all loop nests) are executed using the same eight processors.

However, this may not be a very good strategy from the viewpoint of energy consumption. Instead, one can consume much less energy by using the minimum number of processors for each loop nest (and shutting down the unused processors).

In adaptive parallelization, the number of processors is tailored to the specific needs of each code section (e.g., a nested loop in array-intensive applications). In other words, the number of processors that are active at a given period of time changes dynamically as the program executes.

For instance, an adaptive parallelization strategy can use four, six, two, and eight processors to execute the first four nests in a given code. There are two important issues that need to be addressed for energy-aware parallelization (assuming that interprocessor communication occurs through shared memory):

1) Determining the ideal number of processors to use. There are at least two ways to determine the number of processors needed per nest. The first option is to adopt a fully dynamic strategy whereby the number of processors (for each nest) is decided on in the course of execution (at run-time).

Kandemir et al. [71] present such a dynamic strategy. Although this approach is expected to generate better results once the number of processors has been decided (as it can take run-time code behavior and dynamic resource constraints into account), it may also incur some performance overhead during the process of determining the number of processors.

This overhead can, in some cases, offset the potential benefits of adaptive parallelism. In the second option, the number of processors for each nest is decided statically at compile-time. This approach has a compile-time overhead, but it does not lead to any run-time penalty.

It should be emphasized, however, that although this approach determines the number of processors statically at compile-time, the activation/deactivation of processors and their caches occurs dynamically at run-time. Examples of this strategy can be found in Kadayif et al. [72].

2) Selection criterion for the number of processors. An optimizing compiler can target different objective functions such as minimizing the execution time of the compiled code, reducing executable size, improving power or energy behavior of the generated code, and so on.

Figure 2-11. Evolving design tradeoffs: T, time; R, reliability; A, area; P, power; PT, product of power and time.

In addition, it is also possible to adopt complex objective functions and compilation criteria such as minimizing the sum of the cache and main memory energy consumptions while keeping the total execution cycles bounded by M. For example, the framework described in Kandemir et al. [71] accepts as objective function and constraints any linear combination of energy consumptions and execution cycles of individual hardware components.
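A constrained selection of this kind can be sketched as a small search problem: pick one processor count per loop nest so that total energy is minimized while total execution cycles stay within the bound M. The sketch below uses plain exhaustive search over hypothetical per-nest estimates; it is not the run-time strategy of Kandemir et al. [71] nor the integer linear programming formulation of Kadayif et al. [72], and every number in it is an illustrative assumption.

```python
from itertools import product


def choose_processor_counts(nests, max_cycles):
    """Pick one processor count per nest, minimizing energy under a cycle bound.

    nests: list of dicts mapping processor count -> (cycles, energy),
           where the estimates are assumed to be supplied by a profiler or model.
    Returns (counts, total_energy, total_cycles), or None if nothing fits.
    """
    best = None
    for counts in product(*(nest.keys() for nest in nests)):
        cycles = sum(nest[p][0] for nest, p in zip(nests, counts))
        energy = sum(nest[p][1] for nest, p in zip(nests, counts))
        if cycles <= max_cycles and (best is None or energy < best[1]):
            best = (counts, energy, cycles)
    return best


# Hypothetical estimates for two loop nests: {processors: (cycles, energy)}.
nests = [
    {1: (800, 10.0), 2: (420, 11.0), 4: (230, 14.0)},   # scales well
    {1: (400, 5.0), 2: (300, 9.5), 4: (280, 13.0)},     # scales poorly
]
print(choose_processor_counts(nests, max_cycles=800))
```

With these made-up numbers the search gives the well-scaling nest four processors and the poorly scaling nest only one, which is the kind of per-nest adaptation described above. Exhaustive search grows exponentially with the number of nests, so it is suitable only for small examples.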

This series of articles has surveyed a number of energy-aware design techniques for controlling both active and standby power consumption that are applicable to MPSoCs. In particular, it covered techniques used to design energy-aware MPSoC processor cores, the on-chip memory hierarchy, the on-chip communication system, and MPSoC energy-aware software techniques.

The vast majority of the techniques are derived from existing uniprocessor energy-aware systems that take on new dimensions in the MPSoC space. As energy-aware MPSoC design is a relatively new area, many open questions remain.

Other issues to be considered are the possible promise of globally asynchronous, locally synchronous (GALS) clocking for MPSoCs, as well as the inextricable link between energy-aware design techniques and their impact on system reliability.

As shown in Figure 2-11 above, design tradeoffs have changed over the years from a focus on area, to a focus on performance and area, to a focus on performance and power. In the future, the metrics of performance, power, and reliability and their tradeoffs will be of increasing importance.

To read Part 1, go to “Energy-aware processor design.”
To read Part 2, go to “Energy-aware memory design.”
To read Part 3, go to “Energy-aware on-chip communication system design.”

This series of articles is based on copyrighted material submitted by Mary Jane Irwin, Luca Benini, N. Vijaykrishnan and Mahmut Kandemir to “Multiprocessor Systems-On-Chips,” edited by Wayne Wolf and Ahmed Amine Jerraya. It is used with the permission of the publisher, Morgan Kaufmann, an imprint of Elsevier. The book can be purchased on-line.

Mary Jane Irwin is the A. Robert Noll Chair in Engineering in the Department of Computer Science and Engineering at Pennsylvania State University. Luca Benini is professor at the Department of Electrical Engineering and Computer Science at the University of Bologna in Italy. N. Vijaykrishnan is an associate professor, and Mahmut Kandemir is an assistant professor in the Computer Science and Engineering Department at Pennsylvania State University.

Ahmed Jerraya is research director with CNRS and is currently managing research on multiprocessor system-on-chips at TIMA Laboratory in France. Wayne Wolf is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa “Ray” S. Farmer, Jr. Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories.

[65] Zhang, H., et al., “Interconnection architecture exploration for low energy configurable single-chip DSPs,” IEEE Computer Society Workshop on VLSI, 1999.
[67] Ye, T., et al., “Packetized on-chip interconnect communications analysis for MPSoC,” Design Automation and Test in Europe Conference, 2003.
[71] Kandemir, M., et al., “Runtime code parallelization for on-chip processors,” Proceedings of the Sixth Design Automation and Test in Europe Conference, 2003.
[72] Kadayif, I., et al., “An integer linear programming-based approach for parallelizing applications in on-chip multiprocessors,” Proceedings of the Design Automation Conference, 2002.
