Optimizing embedded software for power efficiency: Part 3 – Optimizing data flow and memory - Embedded.com

Editor's note: In this third installment of a series on managing the power requirements of an embedded software design, the authors discuss how attention to the flow of data through the processor and its memory can be used to manage power consumption. Excerpted from Software engineering for embedded systems.

Because clocks in an embedded system design have to be activated not only in the core components but also in buses and memory cells, memory-related functionality can be quite power-hungry. Fortunately, memory access and data paths can also be optimized to reduce power.

This third in a series of articles covers methods to optimize power consumption with regard to access to DDR and SRAM memories by utilizing knowledge of the hardware design of these memory types. Then we will cover ways to take advantage of other specific memory set-ups at the SoC level.

Common practice is to optimize memory in order to maximize the locality of critical or heavily used data and code by placing as much in cache as possible. Cache misses incur not only core stall penalties, but also power penalties as more bus activity is needed, and higher-level memories (internal device SRAM, or external device DDR) are activated and consume power. As a rule, access to higher-level memory such as DDR is not as common as internal memory accesses, so high-level memory accesses are easier to plan, and thus optimize.

DDR overview
The highest level of memory we will discuss here is external DDR memory. To optimize DDR accesses in software, first we need to understand the hardware that the memory consists of. DDR SDRAM, as the DDR (double data rate) name implies, takes advantage of both edges of the DDR clock source in order to send data, thus doubling the effective data rate at which data reads and writes may occur. DDR provides a number of different types of features which may affect total power utilization, such as EDC (error detection), ECC (error correction), different types of bursting, programmable data refresh rates, programmable memory configuration allowing physical bank interleaving, page management across multiple chip selects, and DDR-specific sleep modes.

Key DDR vocabulary to be discussed
Chip Select (also known as Physical Bank): selects a set of memory chips (specified as a “rank”) connected to the memory controller for accesses.

Rank: specifies a set of chips on a DIMM to be accessed at once. A Double Rank DIMM, for example, would have two sets of chips — differentiated by chip select. When accessed together, each rank allows for a data access width of 64 bits (or 72 with ECC).

Rows are address bits enabling access to a set of data, known as a “page” — so row and page may be used interchangeably.

Logical banks, like row bits, enable access to a certain segment of memory. By standard practice, the row bits are the MSB address bits of DDR, followed by the bits to select a logical bank, finally followed by column bits.

Column bits are the bits used to select and access a specific address for reading or writing.
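As a rough illustration of the row / logical bank / column ordering described above, the C sketch below splits an address into those three fields. The bit widths (10 column bits, 3 bank bits, 15 row bits) and the names `ddr_addr_t` and `ddr_decode` are assumptions chosen for the example, not the layout of any particular controller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths, chosen only for illustration. */
#define COL_BITS  10u   /* column select: lowest-order bits */
#define BANK_BITS 3u    /* eight logical banks */
#define ROW_BITS  15u   /* row (page) select: highest-order bits */

typedef struct {
    uint32_t row;
    uint32_t bank;
    uint32_t col;
} ddr_addr_t;

/* Split an address into row | bank | column, with the row bits as MSBs. */
ddr_addr_t ddr_decode(uint32_t addr)
{
    ddr_addr_t f;
    assert(addr < (1u << (COL_BITS + BANK_BITS + ROW_BITS)));
    f.col  = addr & ((1u << COL_BITS) - 1u);
    f.bank = (addr >> COL_BITS) & ((1u << BANK_BITS) - 1u);
    f.row  = addr >> (COL_BITS + BANK_BITS);
    return f;
}
```

With this ordering, consecutive addresses walk through the columns of one row first, then move to the next logical bank, and only then to a new row.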

On a typical embedded processor, like a DSP, the DDR SDRAM controller is connected to either discrete memory chips or a DIMM (dual inline memory module), which contains multiple memory components (chips). Each discrete component/chip contains multiple logical banks, rows, and columns which provide access for reads and writes to memory. The basic layout of a discrete DDR3 memory chip is shown in Figure 13.9.

Figure 13.9: Basic drawing of a discrete DDR3 memory chip’s rows/columns.

Standard DDR3 discrete chips are commonly made up of eight logical banks, which provide addressability as shown above. These banks are essentially tables of rows and columns. The action to select a row effectively opens that row (page) for the logical bank being addressed. So different rows can be simultaneously open in different logical banks, as illustrated by the active or open rows highlighted in the picture. A column selection gives access to a portion of the row in the appropriate bank.

When considering sets of memory chips, the concept of chip select is added to the equation. Using chip selects, also known as “physical banks”, enables the controller to access a certain set of memory modules (up to 1 GB for the MSC8156 and 2 GB for the MSC8157 DSPs from Freescale, for example) at a time. Once a chip select is enabled, the memory modules selected by that chip select are accessed using page selection (rows), banks, and columns. The connection of two chip selects is shown in Figure 13.10.

Figure 13.10: Simplified view: DDR controller to memory connection: two chip selects

In Figure 13.10 we have our DSP device which is intended to access DDR memory. There are a total of 16 chips connected to two chip selects: chip select 0 on the left in red, and 1 on the right in orange. The 16 discrete chips are paired such that a pair of chips shares all the same signals (Address, bank, data, etc.), except for the chip select pin. (Interesting note: This is basically how a dual rank DDR is organized, except each “pair of chips” exists within a single chip.) There are 64 data bits. So for a single chip select, when we access DDR and write 64 contiguous bits of data to DDR memory space in our application, the DDR controller does the following:

  • Selects the chip select based on the address (0, for example).
  • Opens the same page (row) for each bank on all eight chips using the DDR address bits during the row-access phase.
  • New rows are opened via the ACTIVATE command, which copies data from the row to a “row buffer” for fast access.
  • Rows that are already open do not require an ACTIVATE command and can skip this step.
  • During the next phase, the column-access phase, the DDR controller selects the same column on all eight chips.
  • Finally, the DDR controller writes the 64 bits to the now-open row buffers of the eight separate DDR chips, each of which takes eight bits.

As there is a command to open rows, there is also one to close rows, called PRECHARGE, which tells the DDR modules to store the data from the row buffers back to the actual DDR memory in the chip, thus freeing up the row buffer. So when switching from one row to the next in a single DDR bank, we have to PRECHARGE the open row to close it, and then ACTIVATE the row we wish to start accessing.
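The PRECHARGE/ACTIVATE sequence described above can be modelled in a few lines of C to count how many commands an access pattern costs. This is a simplified sketch: the names `ddr_model_t`, `ddr_model_init`, and `ddr_touch` are hypothetical, eight logical banks are assumed, and refresh and timing are ignored.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS 8u
#define NO_ROW    UINT32_MAX   /* marker: bank has no open row */

typedef struct {
    uint32_t open_row[NUM_BANKS]; /* row currently held in each bank's row buffer */
    unsigned activates;           /* ACTIVATE commands issued so far */
    unsigned precharges;          /* PRECHARGE commands issued so far */
} ddr_model_t;

/* Start with every bank precharged (no row open) and zeroed counters. */
void ddr_model_init(ddr_model_t *m)
{
    for (unsigned b = 0; b < NUM_BANKS; b++)
        m->open_row[b] = NO_ROW;
    m->activates = 0;
    m->precharges = 0;
}

/* Account for one read or write to (bank, row). */
void ddr_touch(ddr_model_t *m, unsigned bank, uint32_t row)
{
    assert(bank < NUM_BANKS && row != NO_ROW);
    if (m->open_row[bank] == row)
        return;                      /* row already open: no extra command */
    if (m->open_row[bank] != NO_ROW)
        m->precharges++;             /* close the row that is open now */
    m->activates++;                  /* open the requested row */
    m->open_row[bank] = row;
}
```

In this model, alternating between two rows of the same bank costs a PRECHARGE/ACTIVATE pair on nearly every access, whereas the same two rows placed in different banks cost only the two initial ACTIVATE commands.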

A side effect of an ACTIVATE command is that the memory is automatically read and written — thus REFRESHing it. If a row in DDR is PRECHARGED, then it must be periodically refreshed (read/re-written with the same data) to keep data valid. DDR controllers have an autorefresh mechanism that does this for the programmer.

DDR data flow optimization for power
Now that the basics of DDR accesses have been covered, we can cover how DDR accesses can be optimized for minimal power consumption. As is often the case, optimizing for minimal power consumption is beneficial for performance as well.

DDR consumes power in all states, even when the CKE (clock enable — enabling the DDR to perform any operations) is disabled, though this is minimal. One technique to minimize DDR power consumption is made available by some DDR controllers which have a power saving mode that de-asserts the CKE pin — greatly reducing power. In some cases, this is called Dynamic Power Management Mode, which can be enabled via the DDR_SDRAM_CFG[DYN_PWR] register. This feature will de-assert CKE when no memory refreshes or accesses are scheduled. If the DDR memory has self-refresh capabilities, then this power-saving mode can be prolonged as refreshes are not required from the DDR controller.

This power-saving mode does impact performance to some extent, as enabling CKE when a new access is scheduled adds a latency delay.

Tools such as Micron’s DDR power calculator can be used to estimate power consumption for DDR. If we choose 1 GB x8 DDR chips with the -125 speed grade, we can see estimates for the main power-consuming actions on DDR. Power consumption for non-idle operations is additive, so total power is the idle power plus non-idle operations.

  • Idle with no rows open and CKE low is shown as: 4.3 mW (IDD2p)
  • Idle with no rows open and CKE high is shown as: 24.6 mW (IDD2n)
  • Idle with rows open and CKE low is shown as: 9.9 mW (IDD3p)
  • Idle with rows open and CKE high is shown as: 57.3 mW (IDD3n)
  • ACTIVATE and PRECHARGE is shown as consuming 231.9 mW
  • REFRESH is shown as 3.9 mW
  • WRITE is shown as 46.8 mW
  • READ is shown as 70.9 mW

We can see that using the Dynamic Power Management mode saves up to 32 mW of power, which is quite substantial in the context of DDR usage.

Also, it is clear that the software engineer must do whatever possible to minimize contributions to power from the main power contributors: ACTIVATE, PRECHARGE, READ, and WRITE operations.

The power consumption from row activation/precharge is expected as DDR needs to consume a considerable amount of power in decoding the actual ACTIVATE instruction and address followed by transferring data from the memory array into the row buffer. Likewise, the PRECHARGE command also consumes a significant amount of power in writing data back to the memory array from row buffers.

Optimizing power by timing
One can minimize the maximum “average power” consumed by ACTIVATE commands over time by altering the timing between row activate commands, tRC (a setting the programmer can set at start up for the DDR controller). By extending the time required between DDR row activates, the maximum power spike of activates is spread, so the amount of power pulled by the DDR in a given period of time is lessened, though the total power for a certain number of accesses will remain the same. The important thing to note here is that this can help with limiting the maximum (worst-case) power seen by the device, which can be helpful when having to work within the confines of a certain hardware limitation (power supply, limited decoupling capacitance to DDR supplies on the board, etc.).

Optimizing with interleaving
Now that we understand that our main enemy in power consumption on DDR is the activate/precharge commands (for both power and performance), we can devise plans to minimize the need for such commands. There are a number of things to look at here, the first being address interleaving, which will reduce ACTIVATE/PRECHARGE command pairs via interleaving chip selects (physical banks) and additionally by interleaving logical banks.

In setting up the address space for the DDR controller, the row bits and chip select/bank select bits may be swapped to enable DDR interleaving, whereby changing the higher-order address enables the DDR controller to stay on the same page while changing chip selects (physical banks) and then changing logical banks before changing rows. The software programmer can enable this by register configuration in most cases.

Optimizing memory software data organization
We also need to consider the layout of our memory structures within DDR. If using large ping-pong buffers, for example, the buffers may be organized so that each buffer is in its own logical bank. This way, if DDR is not interleaved, we still can avoid unnecessary ACTIVATE/PRECHARGE pairs if a pair of buffers is larger than a single row (page).

Optimizing general DDR configuration
There are other features available to the programmer which can positively or negatively affect power, including “open/closed” page mode. Closed page mode is a feature available in some controllers which will perform an auto-precharge on a row after each read or write access. This of course unnecessarily increases the power consumption in DDR as a programmer may need to access the same row 10 times, for example; closed page mode would yield at least 9 unneeded PRECHARGE/ACTIVATE command pairs. In the example DDR layout discussed above, this could consume an extra 231.9 mW × 9 = 2087.1 mW.

As you may expect, this has an equally negative effect on performance due to the stall incurred during memory PRECHARGE and ACTIVATE.

Optimizing DDR burst accesses
DDR technology has become more restrictive with each generation: DDR2 allows 4-beat bursts and 8-beat bursts, whereas DDR3 only allows 8. This means that DDR3 will treat all burst lengths as 8-beat (bursts of 8 accesses long). So for the 8-byte (64 bit) wide DDR accesses we have been discussing here, accesses are expected to be 8 beats of 8 bytes, or 64 bytes long.

If accesses are not 64 bytes wide, there will be stalls due to the hardware design. This means that if the DDR memory is accessed only for reading (or writing) 32 bytes of data at a time, DDR will only be running at 50% efficiency, as the hardware will still perform reads/writes for the full 8-beat burst, though only 32 bytes will be used. Because DDR3 operates this way, the same amount of power is consumed whether doing 32-byte or 64-byte-long bursts to our memory here. So for the same amount of data, if doing 4-beat (32 byte) bursts, the DDR3 would consume approximately twice the power.
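The 50% figure can be reproduced with a short calculation. The C sketch below assumes the 64-byte burst (8 beats of 8 bytes) used in this section and that every request is served by whole bursts; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BURST_BYTES 64u   /* 8 beats x 8 bytes, as in the text */

/* Bytes actually moved on the bus when every request of `req_bytes`
 * is served by whole 8-beat bursts. */
size_t ddr3_bytes_transferred(size_t req_bytes, size_t num_requests)
{
    assert(req_bytes > 0);
    size_t bursts_per_req = (req_bytes + BURST_BYTES - 1u) / BURST_BYTES;
    return bursts_per_req * BURST_BYTES * num_requests;
}

/* Percentage of transferred bytes that the application actually uses. */
unsigned ddr3_efficiency_percent(size_t req_bytes, size_t num_requests)
{
    size_t useful = req_bytes * num_requests;
    size_t moved = ddr3_bytes_transferred(req_bytes, num_requests);
    return moved ? (unsigned)(100u * useful / moved) : 0u;
}
```

Under these assumptions, a 32-byte request moves as many bytes over the bus as a 64-byte request, which is why half of the transfer is wasted.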

The recommendation here then is to make all accesses to DDR full 8-beat bursts in order to maximize power efficiency. To do this, the programmer must be sure to pack data in the DDR so that accesses to the DDR are in at least 64-byte-wide chunks. Packing data so it is 64-byte-aligned or any other alignment can be done through the use of pragmas.
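Alignment pragmas are compiler-specific, so as a portable stand-in the sketch below uses the standard C11 `alignas` specifier instead. The type and variable names are hypothetical, and placing the buffer in DDR would additionally need a linker or section directive that is not shown here.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define DDR_BURST_BYTES 64

/* One record sized and aligned to exactly one 64-byte burst. */
typedef struct {
    alignas(DDR_BURST_BYTES) int16_t samples[32]; /* 32 x 2 bytes = 64 bytes */
} burst_block_t;

/* A buffer of such blocks; each element starts on a 64-byte boundary. */
burst_block_t ddr_buffer[16];

/* Compile-time checks that the layout matches the assumption. */
static_assert(sizeof(burst_block_t) == DDR_BURST_BYTES,
              "block must fill one burst");
static_assert(alignof(burst_block_t) == DDR_BURST_BYTES,
              "block must be burst-aligned");
```

The compile-time checks make the build fail if a later change to the structure breaks the one-block-per-burst assumption.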

The concept of data packing can be used to reduce the amount of used memory as well. For example, packing eight single bit variables into a single character reduces memory footprint and increases the amount of usable data the core or cache can read in with a single burst.
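A minimal sketch of the eight-flags-in-one-byte idea, using plain bit masks. The names `flags8_t`, `flag_write`, and `flag_read` are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint8_t flags8_t;   /* eight one-bit variables in one byte */

/* Return `flags` with flag number `bit` (0..7) set or cleared. */
flags8_t flag_write(flags8_t flags, unsigned bit, bool value)
{
    assert(bit < 8u);
    if (value)
        return (flags8_t)(flags | (1u << bit));
    return (flags8_t)(flags & ~(1u << bit));
}

/* Read flag number `bit` (0..7). */
bool flag_read(flags8_t flags, unsigned bit)
{
    assert(bit < 8u);
    return ((flags >> bit) & 1u) != 0u;
}
```

The trade-off is a few extra mask and shift operations per access in exchange for one byte of storage instead of eight.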

In addition to data packing, accesses need to be 8-byte-aligned (or aligned to the burst length). If an access is not aligned to the burst length, for example, let’s assume an 8-byte access starts with a 4-byte offset, both the first and second access will effectively become 4-beat bursts, reducing bandwidth utilization to 50% (instead of aligning to the 64-byte boundary and reading data in with one single burst).
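To check whether a given access stays inside one burst, the small C helper below counts how many 64-byte bursts a byte range touches. It assumes the 64-byte (8 beats of 8 bytes) burst used in this section, and `bursts_touched` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BURST_BYTES 64u   /* 8 beats x 8 bytes, as in the text */

/* Number of 64-byte bursts covered by `len` bytes starting at `addr`. */
size_t bursts_touched(uintptr_t addr, size_t len)
{
    assert(len > 0);
    uintptr_t first = addr / BURST_BYTES;
    uintptr_t last = (addr + len - 1u) / BURST_BYTES;
    return (size_t)(last - first + 1u);
}
```

A 64-byte access that starts on a 64-byte boundary touches one burst, while the same access starting 32 bytes later straddles two.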

SRAM and cache data flow optimization for power
Another optimization related to the usage of off-chip DDR is avoidance: avoiding using external off-chip memory and maximizing accesses to internal on-chip memory saves the additive power draw that occurs when activating not only internal device buses and clocks, but also off-chip buses, memory arrays, etc.

High-speed memory close to the DSP processor core is typically SRAM memory, whether it functions in the form of cache or as a local on-chip memory. SRAM differs from SDRAM in a number of ways (such as no ACTIVATE/PRECHARGE, and no concept of REFRESH), but some of the principles of saving power still apply, such as pipelining accesses to memory via data packing and memory alignment.

The general rule for SRAM access optimization is that accesses should be optimized for higher performance. The fewer clock cycles the device spends doing a memory operation, the less time that memory, buses, and core are all activated for said memory operation.

SRAM (all memory) and code size
As programmers, we can affect this in both program and data organization. Programs may be optimized for minimal code size (by a compiler, or by hand), in order to consume a minimal amount of space. Smaller programs require less memory to be activated to read the program. This applies not only to SRAM, but also to DDR and any type of memory: less memory having to be accessed implies a lesser amount of power drawn.

Aside from optimizing code using the compiler tools, other techniques such as instruction packing, which are available in some embedded core architectures, enable fitting maximum code into a minimum set of space. The VLES (variable-length execution set) instruction architecture allows the program to pack multiple instructions of varying sizes into a single execution set. As execution sets are not required to be 128-bit-aligned, instructions can be packed tightly, and the prefetch, fetch, and instruction dispatch hardware will handle reading the instructions and identifying the start and end of each instruction set.

Additionally, size can be saved in code by creating functions for common tasks. If tasks are similar, consider using the same function with parameters passed to determine the variation to run instead of duplicating the code in software multiple times.

Be sure to make use of combined functions where available in the hardware. For example, in the Freescale StarCore architecture, using a multiply accumulate (MAC) instruction, which takes one pipelined cycle, saves space and improves performance, in addition to saving power, compared with using separate multiply and add instructions.

Some hardware provides code compression at compile time and decompression on the fly, so this may be an option depending on the hardware the user is dealing with. The problem with this strategy is related to the size of compression blocks. If data is compressed into small blocks, then not as much compression optimization is possible, but this is still more desirable than the alternative. During decompression, if code contains many branches or jumps, the processor will end up wasting bandwidth, cycles, and power decompressing larger blocks that are hardly used.

The problem with the general strategy of minimizing code size is the inherent conflict between optimizing for performance and space. Optimizing for performance does not always yield the smallest program, so determining ideal code size vs. cycle performance in order to minimize power consumption requires some balancing and profiling. The general advice here is to use what tricks are available to minimize code size without hurting the performance of a program that meets real-time requirements.

The 80/20 rule of applying performance optimization to the 20% of code that performs 80% of the work, while optimizing the remaining 80% of code for size, is a good practice to follow.

SRAM power consumption and parallelization
It is also advisable to optimize data accesses in order to reduce the cycles in which SRAM is activated, pipelining accesses to memory, and organizing data so that it may be accessed consecutively. In systems like the MSC8156, the core/L1 caches connect to the M2 memory via a 128-bit wide bus. If data is organized properly, this means that 128-bit data accesses from M2 SRAM could be performed in one clock cycle each, which would obviously be beneficial when compared to doing 16 independent 8-bit accesses to M2 in terms of performance and power consumption.

An example showing how one may use move instructions to write 128 bits of data back to memory in a single instruction set (VLES) is provided below:

   MOVERH.4F d0:d1:d2:d3,(r4)+n0
   MOVERL.4F d4:d5:d6:d7,(r5)+n0

We can parallelize memory accesses in a single instruction (as with the above where both of the moves are performed in parallel) and, even if the accesses are to separate memories or memory banks, the single-cycle access still consumes less power than doing two independent instructions in two cycles.

Another note: as with DDR, SRAM accesses need to be aligned to the bus width in order to make full use of the bus.

Data transitions and power consumption
SRAM power consumption may also be affected by the type of data used in an application. Power consumption is affected by the number of data transitions (from 0's to 1's) in memory as well. This power effect also trickles down to the DSP core processing elements, as found by Kojima et al. Processing mathematical instructions using constants consumes less power at the core than with dynamic variables. In many devices, because pre-charging memory to reference voltage is common practice in SRAM memories, power consumption is also proportional to the number of zeros as the memory is pre-charged to a high state.

Using this knowledge, it goes without saying that re-use of constants where possible and avoiding zeroing out memory unnecessarily will, in general, save the programmer some power.
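One rough way to compare two data sets on this criterion is to count how many bits toggle between consecutive words. The C sketch below does that with an XOR and a bit count. It is only a proxy metric for illustration, not a power model, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of set bits in x (portable; no compiler builtins assumed). */
unsigned popcount32(uint32_t x)
{
    unsigned n = 0;
    while (x != 0u) {
        x &= x - 1u;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Total bit toggles seen when the words are written one after another
 * to the same location or driven over the same bus lines. */
unsigned long count_bit_toggles(const uint32_t *words, size_t count)
{
    assert(words != NULL || count == 0);
    unsigned long toggles = 0;
    for (size_t i = 1; i < count; i++)
        toggles += popcount32(words[i - 1] ^ words[i]);
    return toggles;
}
```

A stream that repeats the same constant produces zero toggles, while a stream that alternates between all zeros and all ones toggles every bit on every word.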

Cache utilization and SoC memory layout
Cache usage can be thought of in the opposite manner to DDR usage when designing a program. An interesting detail about cache is that both dynamic and static power increase with increasing cache sizes; however, the increase in dynamic power is small. The increase in static power is significant, and becomes increasingly relevant for smaller feature sizes. As software programmers, we have no impact on the actual cache size available on a device, but when it is provided, based on the above, it is our duty to use as much of it as possible!

For SoC-level memory configuration and layout, optimizing the most heavily used routines and placing them in the closest cache to the core processors will offer not only the best performance, but also better power consumption.

Explanation of locality
The reason the above is true is thanks to the way caches work. There are a number of different cache architectures, but they all take advantage of the principle of locality. The principle of locality basically states that if one memory address is accessed, the probability of an address nearby being accessed soon is relatively high. Based on this, when a cache miss occurs (when the core tries to access memory that has not been brought into the cache), the cache will read the requested data in from higher-level memory one line at a time. This means that if the core tries to read a 1-byte character from cache, and the data is not in the cache, then there is a miss at this address. When the cache goes to higher-level memory (whether it be on-chip memory or external DDR, etc.), it will not read in an 8-bit character, but rather a full cache line. If our cache uses a cache line size of 256 bytes, then a miss will read in our 1-byte character, along with 255 more bytes that happen to be on the same line in memory.

This is very effective in reducing power if used in the right way. If we are reading an array of characters aligned to the cache line size, once we get a miss on the first element, although we pay a penalty in power and performance for cache to read in the first line of data, the remaining 255 bytes of this array will be in cache. When handling image or video samples, a single frame would typically be stored this way, in a large array of data. When performing compression or decompression on the frame, the entire frame will be accessed in a short period of time, thus it is spatially and temporally local.

Again, let’s use the example of the six-core MSC8156 DSP SoC. In this SoC, there are two levels of cache for each of the six DSP processor cores: L1 cache (which consists of 32 KB of instruction and 32 KB of data cache), and a 512 KB L2 memory which can be configured as L2 cache or M2 memory. At the SoC level, there is a 1 MB memory shared by all cores called M3. L1 cache runs at the core processor speed (1 GHz), L2 cache effectively manages data at the same speed (double the bus width, half the frequency), and M3 runs at up to 400 MHz. The easiest way to make use of the memory hierarchy is to enable L2 as cache and make use of data locality. As discussed above, this works when data is stored with high locality. Another option is to DMA data into L2 memory (configured in non-cache mode). We will discuss DMA in a later section.

When we have a large chunk of data stored in M3 or in DDR, the MSC8156 can draw this data in through the caches simultaneously. L1 and L2 caches are linked, so a miss from L1 will pull 256 bytes of data in from L2, and a miss from L2 will pull data in at 64 bytes at a time (64 B line size) from the requested higher-level memory (M3 or DDR). Using L2 cache has two advantages over going directly to M3 or DDR. First, it is running at effectively the same speed as L1 (though there is a slight stall latency here, it is negligible), and second, in addition to being local and fast, it can be up to 16 times larger than L1 cache, allowing us to keep much more data in local memory than just L1 alone would.

Explanation of set-associativity
All caches in the MSC8156 are eight-way set-associative. This means that the caches are split into eight different sections (“ways”). Each section is used to access higher-level memory, meaning that a single address in M3 could be stored in one of eight different sections (ways) of L2 cache, for example. The easiest way to think of this is that the section (way) of cache can be overlaid onto the higher-level memory x times. So if L2 is set up as all cache, the following equation calculates how many times each set of L2 memory is overlaid onto M3:
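Using the sizes quoted in this article (a 1 MB M3 and a 512 KB L2 split into eight ways), the overlay count can be worked out as below. This is a reconstruction from those figures, and the symbol name is chosen here for illustration.

```latex
\[
N_{\text{overlay}}
  = \frac{\text{M3 size}}{\text{L2 size} / \text{number of ways}}
  = \frac{1\,\text{MB}}{512\,\text{KB} / 8}
  = \frac{1024\,\text{KB}}{64\,\text{KB}}
  = 16
\]
```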

In the MSC8156, a single way of L2 cache is 64 KB in size, so addresses are from 0x0000_0000 to 0x0001_0000 hexadecimal. If we consider each way of cache individually, we can explain how a single way of L2 is mapped to M3 memory. M3 addresses start at 0xC000_0000. So M3 addresses 0xC000_0000, 0xC001_0000, 0xC002_0000, 0xC003_0000, 0xC004_0000, etc. (up to 16 times for the 1 MB M3) all map to the same line of a way of cache. So if way #1 of L2 cache has valid data for M3’s 0xC000_0000, and the core processor wants to next access 0xC001_0000, what is going to happen?

If the cache were direct-mapped (one-way set-associative), then the line of cache containing 0xC000_0000 would have to be flushed back to M3 and re-used in order to cache 0xC001_0000. In an eight-way set-associative cache, however, we can take advantage of the other 7 × 64 KB sections (ways) of cache. So we can potentially have 0xC000_0000 stored in way #1, and the other seven ways of cache have their first line of cache as empty. In this case, we can store our new memory access to 0xC001_0000 in way #2.

So, what happens when there is an access to 0xC000_0040? (0x40 == 64 B). The answer here is that we have to look at the second cache line in each way of L2 to see if it is empty, as we were only considering the first line of cache in our example above. So here we now have eight more potential places to store a line of data (or program).
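The mapping in this example can be written down directly. The C sketch below computes which line (set) within a way an address falls on, using the 64 KB way size and 64 B line size quoted above; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define L2_WAY_BYTES  (64u * 1024u)  /* one way of L2, per the text */
#define L2_LINE_BYTES 64u            /* L2 line size, per the text */

static_assert((L2_WAY_BYTES % L2_LINE_BYTES) == 0,
              "a way must hold a whole number of lines");

/* Index of the cache line (set) within a way that `addr` maps to. */
uint32_t l2_set_index(uint32_t addr)
{
    return (addr % L2_WAY_BYTES) / L2_LINE_BYTES;
}

/* Two addresses compete for the same set when their indices match. */
int l2_same_set(uint32_t a, uint32_t b)
{
    return l2_set_index(a) == l2_set_index(b);
}
```

With these sizes, 0xC000_0000 and 0xC001_0000 both land on set 0 and compete for the same eight lines, while 0xC000_0040 lands on set 1.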

Figure 13.11 shows a four-way set-associative cache connecting to M3. In this figure, we can see that every line of M3 maps to four possible lines of the cache (one for each way). So line 0xC000_0040 maps to the second line (second “set”) of each way in the cache. So when the core wants to read 0xC000_0040, but the first way has 0xC000_0100 in it, the cache can load the core’s request into any of the other three ways if their second lines are empty (invalid).

Figure 13.11: Set-associativity by cache line: four-way set-associative cache.

The reason for discussing set-associativity of caches is that it does have some effect on power consumption (as one might imagine). The goal for optimizing power consumption (and performance) when using cache is to maximize the hit rate in order to minimize accesses to external buses and hardware caused by misses. Set-associativity is normally already determined by hardware, but, if the programmer can change set-associativity, set-associative caches maintain a higher hit rate than direct-mapped caches, and thus draw lower power.

Memory layout for cache
While having an eight-way set-associative architecture is statistically beneficial in improving hit ratio and power consumption, the software programmer may also directly improve hit ratio in the cache, and thus lower power by avoiding conflicts in cache. Conflicts in cache occur when the core needs data that will replace cache lines with currently valid data that will be needed again.

We can organize memory in order to avoid these conflicts in a few different ways. For memory segments we need simultaneously, it is important to pay attention to the size of ways in the cache. In our eight-way L2 cache, each way is 64 KB. As we discussed before, we can simultaneously load eight cache lines with the same lower 16 bits of address (0x0000_xxxx).

Another example is if we are working with nine arrays with 64 KB of data simultaneously. If we organize each array contiguously, data will be constantly thrashed as all arrays share the same 64 KB offset. If the same indices of each array are being accessed simultaneously, we can offset the start of some of the arrays by inserting a buffer, so that each array does not map to the same offset (set) within a cache way.
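The effect of such padding can be checked with a small calculation. The C sketch below assumes the 64 KB way and 64 B line of the L2 example and lays the arrays out back to back with an optional pad after each one; the names and the choice of a one-line pad are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define WAY_BYTES   (64u * 1024u)  /* size of one cache way, per the text */
#define LINE_BYTES  64u            /* cache line size assumed for L2 */
#define ARRAY_BYTES (64u * 1024u)  /* each array holds 64 KB of data */

/* Start address of array `k` when arrays are laid out back to back,
 * each followed by `pad_bytes` of unused padding. */
uint32_t array_base(uint32_t first_base, unsigned k, uint32_t pad_bytes)
{
    return first_base + k * (ARRAY_BYTES + pad_bytes);
}

/* Set that byte offset `offset` of array `k` maps to. */
uint32_t array_set(uint32_t first_base, unsigned k, uint32_t pad_bytes,
                   uint32_t offset)
{
    assert(offset < ARRAY_BYTES);
    uint32_t addr = array_base(first_base, k, pad_bytes) + offset;
    return (addr % WAY_BYTES) / LINE_BYTES;
}
```

Without padding, element 0 of all nine arrays maps to the same set, one more than an eight-way cache can hold. With a 64-byte pad after each array, the same elements land on nine different sets.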

When data sizes are larger than a single way, the next step is to consider reducing the amount of data that is pulled into the cache at once by processing it in smaller chunks.

Write-back vs. write-through caches
Some caches are designed as either “write-back” or “write-through” caches, and others, such as the MSC815x series DSPs, are configurable as either. Write-back and write-through buffering differ in how data from the core is managed by the cache in the case of writes.

Write-back is a cache writing scheme in which data is written only to the cache. The main memory is updated when the data in the cache is replaced. In the write-through cache write scheme, data is written simultaneously to the cache and to memory. When setting up cache in software, we have to weigh the benefits of each of these. In a multicore system, coherency is of some concern, but so are performance and power. Coherency refers to how up-to-date data in main memory is compared to the caches. The greatest level of multicore coherency between internal core caches and system level memory is attained by using write-through caching, as every write to cache will immediately be written back to system memory, keeping it up to date. There are a number of down sides to write-through caching including:

  • core stalls during writes to higher-level memory;
  • increased bus traffic on the system buses (higher chance of contention and system-level stalls);
  • increased power consumption as the higher-level memories and buses are activated for every single memory write.

The write-back cache scheme, on the other hand, will avoid all of the above disadvantages at the cost of system-level coherency. For optimal power consumption, a common approach is to use the cache in write-back mode, and strategically flush cache lines/segments when the system needs to be updated with new data.
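The difference in bus activity between the two policies can be seen in a toy model of a single cache line. The C sketch below only counts writes that reach higher-level memory; the type and function names are hypothetical and no real cache controller API is implied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } write_policy_t;

typedef struct {
    write_policy_t policy;
    bool dirty;              /* line holds data newer than memory */
    unsigned memory_writes;  /* writes that reached higher-level memory */
} line_model_t;

/* Core writes to a line that is already in the cache. */
void line_write(line_model_t *l)
{
    assert(l != NULL);
    if (l->policy == WRITE_THROUGH)
        l->memory_writes++;  /* every write also goes out on the bus */
    else
        l->dirty = true;     /* only the cached copy is updated */
}

/* Explicit flush: push the line out if it holds newer data. */
void line_flush(line_model_t *l)
{
    assert(l != NULL);
    if (l->dirty) {
        l->memory_writes++;
        l->dirty = false;
    }
}
```

In this model, a hundred writes to the same line cost a hundred memory writes under write-through, but only one under write-back, at the moment the line is flushed.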

Cache coherency functions
In addition to write-back and write-through schemes, specific cache commands should also be considered. Commands include:

  • invalidation sweep: invalidating a line of data by clearing valid and dirty bits (effectively just re-labeling a line of cache as “empty”);
  • synchronization sweep: writing any new data back to higher-level memory and removing the dirty label;
  • flush sweep: writing any new data back to higher-level memory and invalidating the line;
  • fetch: fetch data into the cache.
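The commands above can be summarized as transitions on a line's valid and dirty bits. The C sketch below is a simplified model with hypothetical names, in which "writing new data back" is counted as a write to higher-level memory; it does not represent the command set of any particular device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool valid;           /* line holds usable data */
    bool dirty;           /* line holds data newer than memory */
    unsigned writebacks;  /* writes of this line to higher-level memory */
} cache_line_t;

/* Invalidate: drop the line without writing anything back. */
void cl_invalidate(cache_line_t *l)
{
    assert(l != NULL);
    l->valid = false;
    l->dirty = false;
}

/* Synchronize: write new data back, keep the line in the cache. */
void cl_synchronize(cache_line_t *l)
{
    assert(l != NULL);
    if (l->valid && l->dirty) {
        l->writebacks++;
        l->dirty = false;
    }
}

/* Flush: write new data back, then invalidate the line. */
void cl_flush(cache_line_t *l)
{
    cl_synchronize(l);
    l->valid = false;
}

/* Fetch: bring a clean copy of the data into the line. */
void cl_fetch(cache_line_t *l)
{
    assert(l != NULL);
    l->valid = true;
    l->dirty = false;
}
```

Note that invalidating a dirty line discards the new data in this model, which is why a synchronize or flush is needed first when the rest of the system must see it.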

Generally these operations can be performed by either cache line, a segment of the cache, or as a global operation. When it is possible to predict that a large chunk of data will be needed in the cache in the near future, performing cache sweep functions on larger segments will make better use of the full bus bandwidths and lead to fewer stalls by the core. As memory accesses all require some initial memory access set-up time, but after set-up bursts will flow at full bandwidth, making use of large prefetches will save power when compared to reading in the same amount of data line by line, so long as this is done strategically so as to avoid the data we want being thrashed before the core actually gets to use it.

When using any of these instructions, we have to be careful about the effect it has on the rest of the cache. For instance, performing a fetch from higher-level memory into cache may require replacing contents currently in the cache. This could result in thrashing data in the cache and invalidating cache in order to make space for the data being fetched.

Compiler cache optimizations
In order to assist with the above, compilers may be used to optimize cache power consumption by reorganizing memory or memory accesses for us. Two main techniques available are array merging and loop interchanging, explained below.

Array merging organizes memory so that arrays accessed simultaneously will be at different offsets (different “sets”) from the start of a way. Consider the two array declarations below:

   int array1[ array_size ];
   int array2[ array_size ];

The compiler can merge these two arrays as shown below:

   struct merged_arrays {
     int array1;
     int array2;
   } new_array[ array_size ];

In order to re-order the way that high-level memory is read into cache, reading in smaller chunks to reduce the chance of thrashing, loop interchanging can be used. Consider the code below:

   for (i = 0; i < 100; i = i + 1)
    for (j = 0; j < 200; j = j + 1)
     for (k = 0; k < 10000; k = k + 1)
      z[ k ][ j ] = 10 * z[ k ][ j ];

By interchanging the second and third nested loops, the compiler can produce the following code, decreasing the likelihood of unnecessary thrashing during the innermost loop.

   for (i = 0; i < 100; i = i + 1)
    for (k = 0; k < 10000; k = k + 1)
     for (j = 0; j < 200; j = j + 1)
      z[ k ][ j ] = 10 * z[ k ][ j ];

Part 1: Measuring power
Part 2: Minimizing hardware power use
Part 4: Peripheral and algorithmic optimization

Rob Oshana has 30 years of experience in the software industry, primarily focused on embedded and real-time systems for the defense and semiconductor industries. He has BSEE, MSEE, MSCS, and MBA degrees and is a Senior Member of IEEE. Rob is a member of several Advisory Boards including the Embedded Systems group, where he is also an international speaker. He has over 200 presentations and publications in various technology fields and has written several books on embedded software technology. He is an adjunct professor at Southern Methodist University where he teaches graduate software engineering courses. He is a Distinguished Member of Technical Staff and Director of Global Software R&D for Digital Networking at Freescale Semiconductor.

Mark Kraeling is Product Manager at GE Transportation in Melbourne, Florida, where he is involved with advanced product development in real-time controls, wireless, and communications. He’s developed embedded software for the automotive and transportation industries since the early 1990s. Mark has a BSEE from Rose-Hulman, an MBA from Johns Hopkins, and an MSE from Arizona State.

Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012, this article was excerpted from Software engineering for embedded systems, by Robert Oshana and Mark Kraeling.
