The basics of programming embedded processors: Part 8 -

The basics of programming embedded processors: Part 8


Power consumption is a particularly important design metric for battery-powered systems because the battery has a very limited lifetime. However, power consumption is increasingly important in systems that run off the power grid. Fast chips run hot, and controlling power consumption is an important element of increasing reliability and reducing system cost.

How much control do we have over power consumption? Ultimately, we must consume the energy required to perform necessary computations. However, there are opportunities for saving power. Examples include the following:

* We may be able to replace the algorithms with others that do things in clever ways that consume less power.
* Memory accesses are a major component of power consumption in many applications. By optimizing memory accesses we may be able to significantly reduce power.
* We may be able to turn off parts of the system – such as subsystems of the CPU, chips in the system, and so on – when we don't need them in order to save power.

The first step in optimizing a program's energy consumption is knowing how much energy the program consumes. It is possible to measure power consumption for an instruction or a small code fragment [Tiw94]. The technique, illustrated in Figure 5-25 below, executes the code under test over and over in a loop.

By measuring the current flowing into the CPU, we are measuring the power consumption of the complete loop, including both the body and other code. By separately measuring the power consumption of a loop with no body (making sure, of course, that the compiler hasn't optimized away the empty loop), we can calculate the power consumption of the loop body code as the difference between the full loop and the bare loop energy cost of an instruction.

Figure 5-25. Measuring energy consumption for a piece of code

Program Energy Consumption and Optimization
Several factors contribute to the energy consumption of the program:

* Energy consumption varies somewhat from instruction to instruction.
* The sequence of instructions has some influence.
* The opcode and the locations of the operands also matter.

Choosing which instructions to use can make some difference in a program's energy consumption, but concentrating on the instruction opcodes has limited payoffs in most CPUs. The program has to do a certain amount of computation to perform its function.

While there may be some clever ways to perform that computation, the energy cost of the basic computation will change only a fairly small amount compared to the total system energy consumption, and usually only after a great deal of effort. We are further hampered in our ability to optimize instruction-level energy consumption because most manufacturers do not provide detailed, instruction-level energy consumption figures for their processors.Memory effects. In many applications, the biggest payoff in energy reduction for a given amount of designer effort comes from concentrating on the memory system. As Figure 5-26 below shows [Cat98], memory transfers are by far the most expensive type of operation performed by a CPU – a memory transfer takes 33 times more energy than does an addition.

As a result, the biggest payoffs in energy optimization come from properly organizing instructions and data in memory. Accesses to registers are the most energy efficient; cache accesses are more energy efficient than main memory accesses.

Figure 5-26. Relative energy consumption of various operations [Cat98]

Caches are an important factor in energy consumption. On the one hand, a cache hit saves a costly main memory access, and on the other, the cache itself is relatively power hungry because it is built from SRAM, not DRAM. If we can control the size of the cache, we want to choose the smallest cache that provides us with the necessary performance.

Li and Henkel [Li98] measured the influence of caches on energy consumption in detail. Figure 5-27 below breaks down the energy consumption of a computer running MPEG (a video encoder) into several components: software running on the CPU, main memory, data cache, and instruction cache.

As the instruction cache size increases, the energy cost of the software on the CPU declines, but the instruction cache comes to dominate the energy consumption. Experiments like this on several benchmarks show that many programs have sweet spots in energy consumption.

If the cache is too small, the program runs slowly and the system consumes a lot of power due to the high cost of main memory accesses. If the cache is too large, the power consumption is high without a corresponding payoff in performance.

At intermediate values, the execution time and power consumption are both good. How can we optimize a program for low power consumption? The best overall advice is that high performance = low power. Generally speaking, making the program run faster also reduces energy consumption.

Figure 5-27. Energy and execution time versus instruction data cache size for a benchmark program[Liu98]

Clearly, the biggest factor that can be reasonably well controlled by the programmer is the memory access patterns. If the program can be modified to reduce instruction or data cache conflicts, for example, the energy required by the memory system can be significantly reduced.

The effectiveness of changes such as reordering instructions or selecting different instructions depends on the processor involved, but they are generally less effective than cache optimizations. A few optimizations mentioned previously for performance are also often useful for improving energy consumption:

1) Try to use registers efficiently. Group accesses to a value together so that the value can be brought into a register and kept there.

2) Analyze cache behavior to find major cache conflicts. Restructure the code to eliminate as many of these as you can:

— For instruction conflicts, if the offending code segment is small, try to rewrite the segment to make it as small as possible so that it better fits into the cache. Writing in assembly language may be necessary. For conflicts across larger spans of code, try moving the instructions or padding with NOPs.

— For scalar data conflicts, move the data values to different locations to reduce conflicts.

— For array data conflicts, consider either moving the arrays or changing your array access patterns to reduce conflicts.

3) Make use of page mode accesses in the memory system whenever possible. Page mode reads and writes eliminate one step in the memory access, saving a considerable amount of power.

Metha et al. [Met97] present some additional observations about energy optimization as follows:

1) Moderate loop unrolling eliminates some loop control overhead. However, when the loop is unrolled too much, power increases due to the lower hit rates of straight-line code.
2) Software pipelining reduces pipeline stalls, thereby reducing the average energy per instruction.
3) Eliminating recursive procedure calls where possible saves power by getting rid of function call overhead. Tail recursion can often be eliminated; some compilers do this automatically.Analysis and Optimization of Program Size
The memory footprint of a program is determined by the size of its data and instructions. Both must be considered to minimize program size.

Data provide an excellent opportunity for minimizing size because the data are most highly dependent on programming style. Because inefficient programs often keep several copies of data, identifying and eliminating duplications can lead to significant memory savings usually with little performance penalty.

Buffers should be sized carefully – rather than defining a data array to a large size that the program will never attain, determine the actual maximum amount of data held in the buffer and allocate the array accordingly. Data can sometimes be packed, such as by storing several flags in a single word and extracting them by using bit-level operations.

A very low-level technique for minimizing data is to reuse values. For instance, if several constants happen to have the same value, they can be mapped to the same location. Data buffers can often be reused at several different points in the program. This technique must be used with extreme caution, however, since subsequent versions of the program may not use the same values for the constants.

A more generally applicable technique is to generate data on the fly rather than store it. Of course, the code required to generate the data takes up space in the program, but when complex data structures are involved there may be some net space savings from using code to generate data.

Minimizing the size of the instruction text of a program requires a mix of high-level program transformations and careful instruction selection. Encapsulating functions in subroutines can reduce program size when done carefully.

Because subroutines have overhead for parameter passing that is not obvious from the high-level language code, there is a minimum-size function body for which a subroutine makes sense. Architectures that have variable-size instruction lengths are particularly good candidates for careful coding to minimize program size, which may require assembly language coding of key program segments.

There may also be cases in which one or a sequence of instructions is much smaller than alternative implementations – for example, a multiply-accumulate instruction may be both smaller and faster than separate arithmetic operations.

When reducing the number of instructions in a program, one important technique is the proper use of subroutines. If the program performs identical operations repeatedly, these operations are natural candidates for subroutines.

Even if the operations vary somewhat, you may be able to construct a properly parameterized subroutine that saves space. Of course, when considering the code size savings, the subroutine linkage code must be counted into the equation.

There is extra code not only in the subroutine body but in each call to the subroutine that handles parameters. In some cases, proper instruction selection may reduce code size; this is particularly true in CPUs that use variable-length instructions.

Some microprocessor architectures support dense instruction sets, specially designed instruction sets that use shorter instruction formats to encode the instructions. The ARM Thumb instruction set and the MIPS-16 instruction set for the MIPS architecture are two examples of this type of instruction set.In many cases, a microprocessor that supports the dense instruction set also supports the normal instruction set, although it is possible to build a microprocessor that executes only the dense instruction set. Special compilation modes produce the program in terms of the dense instruction set.

Program size of course varies with the type of program, but programs using the dense instruction set are often 70 to 80% of the size of the standard instruction set equivalents.

Another possibility is to use statistical compression algorithms to compress the program after assembly and then decompress it on the fly. Statistical compression views the data to be compressed as a stream of symbols.

It relies on the fact that not all symbols or sequences of symbols are equally probable, so that more probable sections of the data are encoded in a small number of bits while less probable sections are encoded with more bits. File compression programs such as gzip can often reduce a file to 50% of its uncompressed size.

However, file compression algorithms cannot be directly applied to dynamic code decompression. First, the algorithm must not force us to decompress the entire program before executing – that would defeat the purpose of code compression since the entire program would reside in memory. We must be able to decompress small sections of the program during execution.

This creates particular problems for branches, since the address in a branch refers to an uncompressed location, not the location of the branch target in the compressed program. Second, we must be able to decompress the block of code quickly to avoid slowing down the CPU excessively.

There are several techniques for code compression, but a common one [Wol92A] is illustrated in Figure 5-28 below. In this scheme, a block of compressed code is fetched from main memory and decompressed into the cache to satisfy a cache miss. Because decompression occurs only on cache misses, it usually occurs relatively infrequently, and the cache is already organized to handle blocks of code.

The code is compressed at the end of compilation after the file has been linked to form an executable; the compression process yields tables that program the decompression unit. Various compression algorithms can be used to generate the compressed code that can also yield fast hardware decompression logic. Application Example 5-1 below describes the code compression technique used in PowerPC.

Figure 5-28. Cache-based instruction decompression

Application Example 5-1: Code Compression for PowerPC
IBM has developed a code decompression module for PowerPC processors [Kem98]. Their decompression unit decompresses instructions after they have been fetched from main memory.

They found that they could achieve much higher compression ratios by breaking each 32-bit instruction into two 16-bit halves. IBM uses Huffman compression to compress each half, compressing instructions in blocks of 64 bytes.

An index table is used to translate uncompressed addresses, such as addresses found in branch instructions, to the new location of the instruction in the compressed code. They also add the K bit to the TLB to indicate whether a page in memory holds compressed instructions. This method compresses benchmark programs to 55 to 65% of their original size.

Next in Part 9: Program validationand testing.
To read Part 1, go to “Programdesign and analysis.
To read Part 2 , go to ” Modelsof program, assemblers and linkers.”
To read Part 3, go to “BasicCompilation Techniques
To read Part 4, go to “The creation ofprocedures
To read Part 5, go to  “Registerallocation and scheduling
To read Part 6, go to “Analysis andoptimization of execution time
To read Part 7, go to “Trace-DrivenPerformance Analysis

Used with the permission of thepublisher, Newnes/Elsevier, this series of six articles is based oncopyrighted material from “Computersas Components: Principles of Embedded Computer System Design” byWayne Wolf. The book can be purchased on line.

Wayne Wolf  is currently the Georgia Research Alliance EminentScholar holding the Rhesa “Ray” S. Farmer, Jr., Distinguished Chair inEmbedded Computer Systems at Georgia Tech's School of Electrical andComputer Engineering (ECE). Previously a professor of electricalengineering at Princeton University, he worked at AT&T BellLaboratories. He has served as editor in chief of the ACM Transactionson Embedded Computing and of DesignAutomation for Embedded Systems.

[Cat98] Francky Catthoor, “Custom Memory Management Methodology: Exploration of MemoryOrganization for Embedded Multimedia Systems Design.” Kluwer AcademicPublishers, 1998.
[Kem98] T.M. Kemp et. al. ” Adecompression core for PowerPC “IBM Journal of Research andDevelopment, November. 1998.
[Li98] Yanbing Li and JoergHenkel, “A frame work for estimating and minimizing energy dissipationof embedded HW/SW systems.” Proceedings, DAC 1998.
[Tiw94] Vivek Tiwari, Sharad Maalik and Andrfew Wolfe, “Poweranalysis of embedded software: A first stept toward software powerminimization,” IEEE Transations on VLSI Systems, December, 1994
[Wol92A]Andrew Wolfe and Alex Chanim, “Ececuting compression programs inan embedded RISC architecture, in Proceedins, 25th Annual InternationalSimposium on Microarchitecture, 1992.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.