Reducing Active Energy
Many techniques have been proposed to reduce cache energy consumption. Among these are partitioning large caches into smaller structures to reduce dynamic energy, and using a memory hierarchy that attempts to capture most accesses in the smallest memory.
By accessing the tag and data array in series,
In Kin et al., a small filter cache is placed before the L1 cache to reduce energy consumption. Dynamic zero compression employs single-bit access for zero-valued bytes in the cache to reduce energy consumption.
Many of these techniques, applied in the context of single processors, are also applicable to the design of caches associated with the individual processors of an MPSoC system. In addition to hardware design techniques, software optimizations that reduce the number of memory accesses through code and data transformations can be very useful in reducing energy consumption.
Reducing Standby Energy
However, most of the above techniques do little to alleviate the leakage energy problem, as the memory cells in all partitions and all levels of the hierarchy continue to consume leakage power as long as the power supply is maintained to them, irrespective of whether they are used or not.
Various circuit technologies have been designed specifically to reduce leakage power when the component is not in use. Some of these techniques focus on reducing leakage during idle cycles of the component by turning off the supply voltage.
One such scheme, gated-Vdd, was integrated into the architecture of caches [24] to shut down portions of the cache dynamically. This technique was applied at a cache block granularity in Kaxiras et al. and used in conjunction with software to remove dead objects in Chen et al.
However, all these techniques assume that the state (contents) of the supply-gated cache memory is lost. Although totally eliminating the supply voltage results in the state of the cache memory being lost, it is possible to apply a state-preserving leakage optimization technique if a small supply voltage is maintained to the memory cell.
Many alternative implementations have recently been proposed at the circuit level to achieve such a state-preserving leakage control mechanism. Abstracting away from the individual techniques, the choice between state-preserving and state-destroying techniques depends on the relative overhead of the additional leakage required to maintain the state, as opposed to the cost of restoring the lost state from other levels of the memory hierarchy.
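The trade-off described above can be made concrete with a back-of-the-envelope comparison. The following is a minimal sketch, not a model taken from any of the cited papers: the function names, the per-cycle leakage figures, and the refetch energy are all hypothetical parameters chosen only for illustration.

```python
def idle_energy(idle_cycles, drowsy_leak, gated_leak, refetch_energy,
                reuse_probability=1.0):
    """Energy spent on one cache line over an idle interval (arbitrary units).

    All parameters are illustrative assumptions:
      drowsy_leak     - leakage per cycle at the reduced, state-preserving voltage
      gated_leak      - residual leakage per cycle when supply-gated
      refetch_energy  - cost of restoring the line from the next memory level
      reuse_probability - chance that the lost line is needed again
    """
    preserve = drowsy_leak * idle_cycles
    destroy = gated_leak * idle_cycles + reuse_probability * refetch_energy
    return preserve, destroy


def preferred_mode(idle_cycles, drowsy_leak, gated_leak, refetch_energy,
                   reuse_probability=1.0):
    """Pick whichever mode costs less energy over the idle interval."""
    preserve, destroy = idle_energy(idle_cycles, drowsy_leak, gated_leak,
                                    refetch_energy, reuse_probability)
    return "state-preserving" if preserve <= destroy else "state-destroying"


def break_even_idle_cycles(drowsy_leak, gated_leak, refetch_energy,
                           reuse_probability=1.0):
    """Idle length beyond which losing the state becomes the cheaper option."""
    return reuse_probability * refetch_energy / (drowsy_leak - gated_leak)
```

Under these assumptions, short idle periods favor keeping the state, while long idle periods amortize the refetch cost and favor supply gating.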
An important requirement for reducing leakage energy using either a state-preserving or a state-destroying leakage control mechanism is the ability to identify unused resources (or the data contained in them). In Yang et al., the cache size is reduced (or increased) dynamically to optimize the utilization of the cache. In Kaxiras et al., a cache block is supply-gated if it has not been accessed for a period of time.
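The time-based gating of an idle cache block can be sketched as a per-line idle counter. This is an illustrative model only: the class name `DecayingLine`, the `decay_interval` parameter, and the one-tick-per-cycle granularity are assumptions, not details taken from Kaxiras et al.

```python
class DecayingLine:
    """One cache line that is supply-gated after `decay_interval` idle ticks.

    Hypothetical model: gating is state-destroying, so the next access
    to a gated line is a miss and the line must be refilled.
    """

    def __init__(self, decay_interval):
        self.decay_interval = decay_interval
        self.idle = 0
        self.powered = True
        self.valid = False

    def access(self):
        """Access the line; returns True on a hit, False on a miss."""
        hit = self.powered and self.valid
        # A miss refills the line, which also powers it back up.
        self.powered = True
        self.valid = True
        self.idle = 0
        return hit

    def tick(self):
        """Advance time by one tick; gate the line once it has idled too long."""
        if not self.powered:
            return
        self.idle += 1
        if self.idle >= self.decay_interval:
            self.powered = False   # supply removed, leakage stops
            self.valid = False     # contents are lost
```

A line that is touched more often than once per `decay_interval` ticks never loses its contents, while a line that stays idle is switched off and takes a miss on its next use.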
In Zhou et al., hardware tracks the hypothetical miss rate and the real miss rate by keeping the tag line active when deactivating a cache line. The turn-off interval can then be dynamically adjusted based on this information. In Flautner et al. and Kim et al., dynamic supply voltage scaling is used to reduce leakage in the unused portions of the memory.
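One way to picture the interval adjustment is as a simple feedback rule that compares the two miss counts. The sketch below is an assumption-laden illustration, not the controller from Zhou et al.: the function name, the `tolerance` threshold, the doubling and halving steps, and the interval bounds are all hypothetical.

```python
def adjust_turn_off_interval(interval, real_misses, hypothetical_misses,
                             tolerance=0.1, min_interval=1024,
                             max_interval=1 << 20):
    """Return the turn-off interval to use for the next sampling window.

    real_misses         - misses observed with lines being turned off
    hypothetical_misses - misses that would have occurred with every line on
                          (measurable because the tags stay active)
    tolerance           - hypothetical budget for extra misses, as a fraction
                          of the hypothetical miss count
    """
    extra = real_misses - hypothetical_misses
    budget = tolerance * max(hypothetical_misses, 1)
    if extra > budget:
        # Turning lines off too eagerly: wait longer before deactivating.
        return min(interval * 2, max_interval)
    if extra < budget / 2:
        # Plenty of slack: deactivate lines sooner to save more leakage.
        return max(interval // 2, min_interval)
    return interval
```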
In contrast to the other schemes, their drowsy cache scheme also preserves data when a cache line is in low-leakage mode. The usefulness and practicality of such state-preserving voltage scaling schemes for embedded power-optimized memories are demonstrated in Qin and Rabaey. The focus in Heo et al. is on reducing bitline leakage power using leakage-biased bitlines.
The technique turns off the precharging transistors of unused subbanks to reduce bitline leakage, and actual bitline precharging is delayed until the subbank is accessed. All these techniques for cache power optimization can also be applied to reduce leakage energy in the MPSoC.
Influence of Cache Architecture on Energy Consumption
In addition to applying these techniques for active and standby energy reduction, the choice of cache architecture plays an important role in the amount of utilized memory storage space and, consequently, the leakage energy. There are two popular alternatives for building a cache architecture for on-chip multiprocessors.
The first one is to adopt a single multi-ported cache architecture that is shared by the multiple processors. There are two major advantages of this alternative: (1) constructive interference can reduce overall miss rates, and (2) inter-processor communication is easy to implement. However, a large multi-ported cache can consume significant energy.
Additionally, this is not a scalable option. The second alternative is to allow each processor to have its own private cache. The main benefits of the private cache option are low power per access, low access latency, and good scalability. Its main drawback is duplication of data and instructions in different caches. In addition, a complex cache coherence protocol is needed to maintain consistency.
Here we explore a cache configuration, referred to as the crossbar-connected cache (CCC), that tries to combine the advantages of the two options discussed above without their drawbacks. Specifically, the shared cache is divided into multiple banks, and an N × M crossbar connects the N processors to the M cache banks. In this way, the duplication problem is eliminated (as logically there is a single cache), a consistency mechanism is no longer needed, and the solution is scalable. These advantages are useful in reducing energy consumption.
|Figure 2-7. Percentage sharing of instructions and data in an 8-processor MPSoC.|
Figure 2-7 above illustrates the percentage of sharing for instructions and data when SPLASH-2 benchmarks are executed using eight processors, each with private L1 caches. Percentage sharing is the percentage of data or instructions shared by at least two processors when considering the entire execution of the application. The first observation is that instruction sharing is extremely high.
This is a direct result of the data parallelism exploited in these benchmarks; that is, different processors typically work on (mostly) different sets of data using the same set of instructions. Instruction sharing is above 90% for all applications except raytrace.
These results clearly imply a large amount of instruction duplication across private caches. Consequently, a strategy that eliminates this duplication can bring large benefits: the effective cache capacity increases because more space becomes available. Eliminating duplication also eliminates the extra leakage energy expended in powering duplicated cache lines.
When one looks at the data sharing in Figure 2-7, the picture changes. The percentage sharing for data is low, averaging around 11%. However, employing CCC can still be desirable from the cache coherence viewpoint (since complex coherence protocols associated with private caches can be a major energy consumer). Based on these observations, using CCC might be highly beneficial in practice.
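The percentage-sharing metric reported in Figure 2-7 can be computed from an access trace along the following lines. This is a minimal sketch under stated assumptions: the trace format, a list of (processor id, block address) pairs, and the choice to count distinct blocks rather than individual accesses are both hypothetical, since the text does not specify how the measurement was taken.

```python
from collections import defaultdict


def percentage_sharing(trace):
    """Percentage of distinct blocks touched by at least two processors.

    trace: iterable of (processor_id, block_address) pairs covering the
    entire execution. Both the format and the per-block counting are
    illustrative assumptions.
    """
    sharers = defaultdict(set)
    for cpu, block in trace:
        sharers[block].add(cpu)
    if not sharers:
        return 0.0
    shared = sum(1 for cpus in sharers.values() if len(cpus) >= 2)
    return 100.0 * shared / len(sharers)
```

Running it separately on the instruction trace and on the data trace would give the two values per benchmark.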
|Figure 2-8. Different cache organizations: multi-ported shared cache, private caches, and CCC.|
Figure 2-8 above illustrates the three different cache organizations evaluated here: a multi-ported shared cache, private caches, and CCC. A common characteristic of these configurations is that they all have a shared L2 cache. For performance reasons, a CCC architecture would need a crossbar interconnect between the processors and L1 cache banks.
Compared with a bus-based architecture, the crossbar-based architecture provides simultaneous accesses to multiple cache banks (which, in turn, helps to improve performance). However, due to the presence of the crossbar, the pipeline of the CCC processors needs to be restructured.
For example, if the pipeline stages of the private cache (and multi-ported cache) architecture are instruction fetch (IF), instruction decode (ID), execution (EX), and write-back (WB), the pipeline stages in the CCC system are instruction crossbar transfer (IX), IF, ID, data crossbar transfer (DX), EX, and WB. In the IX stage, the processor transfers its request to an instruction cache bank through the crossbar.
If more than one request targets the same bank, only one request can continue with the next pipeline stage (IF), and all other processors experience a pipeline stall. In the DX stage, if the instruction is a load or store, another transfer request through the crossbar is issued.
|Figure 2-9. Address formats used by the different cache organizations.|
Figure 2-9 above shows the address formats used by the private cache system (and also the shared multi-ported cache system) and the CCC architecture. In the latter, a new field, called the bank number, is added. This field is used to identify the bank to be accessed. Typically, consecutive cache blocks are distributed across different banks in order to reduce bank conflicts.
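The role of the bank number field can be illustrated with a small address decoder. The field order used here (tag, index, bank, block offset, from most to least significant bits) and the field widths are assumptions made for the sketch; placing the bank bits directly above the block offset is simply one layout that sends consecutive blocks to different banks.

```python
def split_address(addr, block_bits=5, bank_bits=2, index_bits=7):
    """Split an address into (tag, index, bank, offset).

    Hypothetical layout: | tag | index | bank | block offset |
    Default widths (32-byte blocks, 4 banks, 128 sets per bank) are
    illustrative only.
    """
    offset = addr & ((1 << block_bits) - 1)
    bank = (addr >> block_bits) & ((1 << bank_bits) - 1)
    index = (addr >> (block_bits + bank_bits)) & ((1 << index_bits) - 1)
    tag = addr >> (block_bits + bank_bits + index_bits)
    return tag, index, bank, offset


# Consecutive 32-byte blocks rotate through the four banks.
banks = [split_address(i * 32)[2] for i in range(5)]
```

With this layout, a processor streaming through sequential blocks touches each bank in turn rather than repeatedly hitting the same one.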
Although CCC is expected to be preferable to the other two architectures, it can cause processor stalls when concurrent accesses to the same bank occur. To alleviate this, there are two possible optimizations. The first optimization is to use more cache banks than the number of processors. For instance, for a target MPSoC with four processors, utilize 4, 8, 16, or 32 banks.
The rationale behind this is to reduce conflicts for a given bank. The second optimization deals with references to the same block. When two such references occur, instead of stalling one of the processors, the block is read once and forwarded to both processors. This optimization should be more successful when there is a high degree of inter-processor block sharing.
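Both optimizations can be seen in a toy model of one cycle of crossbar arbitration. This is a sketch under assumptions that the text does not fix: banks are selected by block number modulo the bank count, ties are broken by lowest processor id, and the function and parameter names are hypothetical.

```python
def arbitrate(requests, num_banks, block_bits=5, forward_same_block=True):
    """Resolve one cycle of bank requests.

    requests: dict mapping processor id -> byte address.
    Returns (served, stalled) as two sets of processor ids.
    Assumed policy: one winner per bank (lowest id); with
    forward_same_block, other requesters of the winner's block are
    served from the same read instead of stalling.
    """
    by_bank = {}
    for cpu in sorted(requests):
        block = requests[cpu] >> block_bits
        by_bank.setdefault(block % num_banks, []).append((cpu, block))

    served, stalled = set(), set()
    for reqs in by_bank.values():
        winner_cpu, winner_block = reqs[0]
        served.add(winner_cpu)
        for cpu, block in reqs[1:]:
            if forward_same_block and block == winner_block:
                served.add(cpu)      # block read once, forwarded to both
            else:
                stalled.add(cpu)     # bank conflict: pipeline stall
    return served, stalled
```

In this model, processors 0 and 1 requesting the same block are both served when forwarding is enabled, and a third processor that conflicts on that bank with a different block stops stalling once the bank count is raised from 4 to 8.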
To compare the cache energy consumption of the three architectures depicted in Figure 2-8, we ran a set of simulations assuming 70-nm technology (in which standby energy is expected to be comparable to active energy). We used the cache leakage control strategy proposed in Flautner et al. in both the L1 and L2 caches, whereby a cache line that has not been used for some time is placed into a leakage control mode.
The experimental results, obtained through cycle-accurate simulations of the SPLASH-2 benchmark suite, indicate that the energy benefits of CCC range from 9% to 26% with respect to the private cache option. These savings come at the expense of a small degradation in performance. These results show that the choice of cache architecture is critical in designing energy-aware MPSoCs.
Reducing Snoop Energy
Since cache snoop activity primarily involves tag comparisons and mostly incurs tag misses, tag and data array accesses are separated, unlike normal caches in which both are accessed simultaneously to improve performance. Thus, energy optimizations include the use of dedicated tag arrays for snoops and the serialization of tag and data array accesses. For example, servers based on both Intel Xeon 2 and Alpha 21164 fetch tag and data serially.
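The benefit of serializing tag and data accesses for snoops follows from simple arithmetic: the data array is read only when the tag matches. The sketch below uses made-up energy values and a hypothetical function name; it is not drawn from measurements in the cited work.

```python
def snoop_energy(num_snoops, hit_rate, tag_energy, data_energy, serial=True):
    """Total array energy for a stream of snoops (arbitrary units).

    Parallel access reads tag and data arrays on every snoop; serial
    access reads the data array only on a tag hit. All energy values
    are illustrative assumptions.
    """
    if serial:
        per_snoop = tag_energy + hit_rate * data_energy
    else:
        per_snoop = tag_energy + data_energy
    return num_snoops * per_snoop


parallel = snoop_energy(1000, 0.1, 1.0, 4.0, serial=False)
serial = snoop_energy(1000, 0.1, 1.0, 4.0, serial=True)
```

Because most snoops miss, the lower the snoop hit rate, the closer the serial cost comes to the tag energy alone.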
An optimization that goes beyond reducing the data array energy of these tag misses is presented in Moshovos et al. In this work, a small, cache-like structure, called JETTY, is introduced between the bus and the backside of each processor's cache to filter the vast majority of snoops that would not find a locally cached copy.
Energy is reduced because accesses are mostly confined to JETTY, and thus accesses to the energy-consuming cache tag arrays are decreased. Their work demonstrates that JETTY filters 74% (average) of all snoop-induced tag accesses that would miss, resulting in an average energy reduction of 29% in the cache.
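The filtering idea can be sketched as a small table of blocks recently found to be absent from the local cache. This is a deliberately simplified illustration and not the JETTY design itself: the class name, the LRU replacement, the capacity, and the use of a Python set to stand in for the cache tag array are all assumptions.

```python
from collections import OrderedDict


class ExcludeFilter:
    """Small table of block addresses known to be absent from the local cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.known_absent = OrderedDict()
        self.tag_lookups = 0   # energy-consuming cache tag accesses
        self.filtered = 0      # snoops answered by the filter alone

    def snoop(self, block, cache_blocks):
        """Return True if the local cache holds `block`."""
        if block in self.known_absent:
            self.known_absent.move_to_end(block)
            self.filtered += 1
            return False                      # no tag array access needed
        self.tag_lookups += 1
        if block in cache_blocks:
            return True
        # Remember the miss so later snoops to this block are filtered.
        self.known_absent[block] = True
        if len(self.known_absent) > self.capacity:
            self.known_absent.popitem(last=False)   # evict least recently used
        return False

    def local_fill(self, block):
        """Must be called when the local cache allocates `block`."""
        self.known_absent.pop(block, None)
```

The `local_fill` hook matters for correctness: without it, the filter could wrongly report a block as absent after the processor has brought it into its cache.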
Another approach to reducing snoop energy is proposed in Saldhana and Lipasti. In this work, the authors suggest serial snooping of the different caches, as opposed to a parallel snoop of all the caches attached to the bus.
Whenever a request is generated on the bus, the snooping begins with the node closest to the requester and then propagates to the next successive node in the path only if the previous nodes have not satisfied the request. The tradeoff in this approach is increased access latency, caused by the sequential snooping, in exchange for reduced energy.
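The serial scheme can be modeled as a walk over the other nodes that stops at the first one holding the block. In this sketch, "closest" is approximated by ring order starting after the requester, and each cache is represented by a set of block addresses; both are assumptions made for illustration, as is the function name.

```python
def serial_snoop(requester, block, caches):
    """Snoop the other caches one at a time, nearest first.

    caches: list of sets of block addresses, indexed by node id.
    Returns (supplier, lookups): the node that supplied the block
    (or None) and the number of caches that had to be searched.
    A parallel snoop would always search len(caches) - 1 caches.
    """
    n = len(caches)
    lookups = 0
    for hop in range(1, n):
        node = (requester + hop) % n
        lookups += 1
        if block in caches[node]:
            return node, lookups
    return None, lookups
```

The number of lookups is both the energy proxy and the latency proxy here: a nearby copy costs one lookup, while a request that no cache can satisfy still visits every other node.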
In Ekman et al., an evaluation of JETTY and the serial snooping schemes is performed in the context of on-chip multiprocessors. The authors concluded that serial snooping is not very effective, as all the caches need to be searched when none of the caches has the requested data.
This performance penalty also translates to an energy penalty. Furthermore, they found the JETTY approach less attractive in the case of MPSoCs because the private cache associated with each processor in an MPSoC is comparable in size to the JETTY structure.
Another important impact of the snooping activity is the traffic generated on the interconnects. The memory accesses and organization have a significant impact on the interconnect activity and power consumption. The next section explores on-chip communication energy issues.
Next in Part 3: Energy-aware on-chip communication system design
To read Part 1, go to “Energy-aware processor design.”
This series of articles is based on copyrighted material submitted by Mary Jane Irwin, Luca Benini, N. Vijaykrishnan and Mahmut Kandemir to “
Mary Jane Irwin is the A. Robert Noll Chair in Engineering in the Department of Computer Science and Engineering at Pennsylvania State University.
Ahmed Jerraya is research director with CNRS and is currently managing research on multiprocessor system-on-chips at TIMA Laboratory in France.
Inoue, K., et al., “Way-predicting set-associative cache for high performance and low energy consumption,” Proceedings of the 1999 International Symposium on Low Power Electronics and Design.
Powell, M. D., “Reducing set-associative cache energy via way-prediction and selective direct mapping,” Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 2002.
Albonesi, D. H., et al., “Selective cache ways: on-demand cache resource allocation,” Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 1999.
Kin, J., et al., “The filter cache: an energy efficient memory structure,” Proceedings of the 34th Annual ACM/IEEE International Symposium on Microarchitecture, 1997.
Yang, S., et al., “An integrated circuit/architecture approach to reducing leakage in deep-submicron high performance I-caches,” Proceedings of the Seventh International Symposium on High Performance Computer Architecture, 2001.
Kaxiras, S., et al., “Cache decay: exploiting generational behavior to reduce cache leakage power,” Proceedings of the 28th Annual International Symposium on Computer Architecture, 2001.
Chen, G., et al., “Tuning garbage collection in an embedded Java environment,” Proceedings of the Eighth International Symposium on High Performance Computer Architecture, 2002.
Zhou, H., et al., “Adaptive mode control: a static-power-efficient design,” Tenth International Conference on Parallel Architectures and Compilation Techniques, Sept. 2001.
Flautner, K., et al., “Drowsy caches: simple techniques for reducing leakage power,” Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002.
Kim, N. S., et al., “Drowsy instruction caches: leakage power reduction using dynamic voltage scaling,” Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture, 2002.
Qin, H. and J. Rabaey, “Leakage suppression of embedded memories,” Gigascale Silicon Research Center Annual Review, 2002.
Heo, S., et al., “Dynamic fine-grain leakage reduction using leakage-biased bitlines,” Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002.
Moshovos, A., et al., “JETTY: filtering snoops for reduced energy consumption in SMP servers,” Proceedings of the HPCA, January 2001.
Saldhana, C., et al., “Power efficient cache coherence,” Proceedings of the Workshop on Memory Performance Issues, ISCA 2001.
Ekman, M., et al., “Evaluation of snoop-energy reduction techniques for chip-multiprocessors,” Proceedings of the Workshop on Memory Performance Issues, ISCA 2001.