Optimizing embedded software for power efficiency: Part 1 – measuring power

Editor's note: First in a series on managing your embedded software design’s power requirements. The authors provide tips on the power measurements needed before applying optimization techniques at the hardware, algorithmic, dataflow, and memory level. Excerpted from Software engineering for embedded systems.

One of the most important considerations in the product lifecycle of an embedded project is to understand and optimize the power consumption of the device. Power consumption is highly visible for hand-held devices which require battery power to be able to guarantee certain minimum usage/idle times between recharging. Other embedded applications, such as medical equipment, test, measurement, media, and wireless base stations, are very sensitive to power as well — due to the need to manage the heat dissipation of increasingly powerful processors, power supply cost, and energy consumption cost — so the fact is that power consumption cannot be overlooked.

The responsibility for setting and keeping power requirements often falls on the shoulders of hardware designers, but the software programmer has the ability to provide a large contribution to power optimization. Often, the impact that the software engineer has on influencing the power consumption of a device is overlooked or underestimated.

The goal of this series of articles is to discuss how software can be used to optimize power consumption, starting with the basics of what power consumption consists of, how to properly measure power consumption, and then moving on to techniques for minimizing power consumption in software at the algorithmic level, hardware level, and data-flow level. This will include demonstrations of the various techniques and explanations of both how and why certain methods are effective at reducing power so the reader can take and apply this work to their application right away.

Basics of power consumption
In general, when power consumption is discussed, the four main factors considered for a device are the application, the frequency, the voltage, and the process technology, so we need to understand why exactly these factors are so important.

The application is highly important, so much so that the power profile for two hand-held devices could differ to the point of making power optimization strategies the complete opposite. While we will be explaining more about power optimization strategies later on, the basic idea is clear enough to introduce in this section.

Take for example a portable media player vs. a cellular phone. The portable media player needs to be able to run at 100% usage for a long period of time to display video (full-length movies), audio, etc. We will discuss this later, but the general power-consumption profile for this sort of device would have to focus on algorithmic and data flow power optimization more than on efficient usage of low-power modes.

Compare this to the cellular phone, which spends most of its time in an idle state, and during call time the user only talks for a relatively small percentage of the time. For this small percentage of time, the processor may be heavily loaded, performing encode/decode of voice and transmit/receive data. For the remainder of the call time, the phone is not so heavily tasked, performing procedures such as sending heartbeat packets to the cellular network and providing “comfort noise” to the user to let the user know the phone is still connected during silence. For this sort of profile, power optimization would be focused first on maximizing processor sleep states to save as much power as possible, and then on data flow/algorithmic approaches.
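The contrast between the two profiles can be made concrete with a small duty-cycle calculation. The sketch below is illustrative only: the time fractions and power figures are hypothetical values chosen for the example, not measurements of any real device, and the helper names are our own.

```python
# Illustrative duty-cycle model. All numbers are hypothetical assumptions,
# chosen only to show how the usage profile changes which optimization matters.

def average_power_mw(profile):
    """profile: list of (fraction_of_time, power_mw) pairs whose fractions sum to 1."""
    total_fraction = sum(fraction for fraction, _ in profile)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(fraction * power for fraction, power in profile)

def scaled(profile, state_index, factor):
    """Return a copy of profile with one state's power multiplied by factor."""
    return [(frac, power * factor if i == state_index else power)
            for i, (frac, power) in enumerate(profile)]

def saving_percent(before_mw, after_mw):
    """Relative reduction in average power, in percent."""
    return 100.0 * (before_mw - after_mw) / before_mw

# State 0 is "active", state 1 is "idle/sleep" (assumed figures).
media_player = [(0.95, 400.0), (0.05, 5.0)]  # busy almost all the time
phone = [(0.02, 400.0), (0.98, 5.0)]         # idle almost all the time

if __name__ == "__main__":
    for name, profile in (("media player", media_player), ("phone", phone)):
        base = average_power_mw(profile)
        active_cut = average_power_mw(scaled(profile, 0, 0.9))  # 10% less active power
        idle_cut = average_power_mw(scaled(profile, 1, 0.5))    # idle power halved
        print(name,
              "active-path saving %.2f%%" % saving_percent(base, active_cut),
              "sleep-state saving %.2f%%" % saving_percent(base, idle_cut))
```

With these assumed figures, trimming active power dominates for the media player, while improving the sleep state dominates for the phone, which mirrors the priorities described above.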

In the case of process technology, the current cutting-edge embedded cores are based on 45 nm and, in the near future, 28 nm technology, a decrease in size from their predecessor, the 65 nm technology. What this smaller process technology provides is a smaller transistor. Smaller transistors consume less power and produce less heat, so they are clearly advantageous compared with their predecessors.

Smaller process technology also generally enables higher clock frequencies, which is clearly a plus, providing more processing capability, but higher frequency, along with higher voltage, comes at the cost of higher power draw. Voltage is the most obvious of these: as we learned in physics (and EE101), power is the product of voltage and current. So if a device requires a large supply voltage, increased power consumption is a fact of life.

While staying on the subject of P = V*I, the frequency is also directly part of this equation because current is a direct result of the clock rate. Another thing we learned in physics and EE101: when voltage is applied across a capacitor, current will flow from the voltage source to the capacitor until the capacitor has reached an equivalent potential.

While this is an over-simplification, we can imagine that the clock network in a core consumes power in such a fashion. Thus at every clock edge, when the potential changes, current flows through the device until it reaches the next steady state. The faster the clock is switching, the more current is flowing, therefore faster clocking implies more power consumed by the embedded processor. Depending on the device, the clock circuit is responsible for consuming between 50% and 90% of dynamic device power, so controlling clocks is a theme that will be covered very heavily here.
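The capacitor picture above is commonly summarized by the first-order CMOS switching model, P = a * C * V^2 * f. This formula is a standard textbook approximation rather than something stated in the text, and the capacitance, voltage, and frequency values below are hypothetical. A minimal sketch:

```python
# First-order CMOS switching model: P = activity * C * V^2 * f.
# This is a textbook approximation, not a device-specific formula;
# all example values below are assumptions for illustration.

def dynamic_power_w(switched_capacitance_f, voltage_v, clock_hz, activity=1.0):
    """Estimate switching power in watts.

    switched_capacitance_f: effective capacitance switched per cycle (farads)
    voltage_v: supply voltage (volts)
    clock_hz: clock frequency (hertz)
    activity: fraction of that capacitance actually switching each cycle (0..1)
    """
    return activity * switched_capacitance_f * voltage_v ** 2 * clock_hz

if __name__ == "__main__":
    nominal = dynamic_power_w(1e-9, 1.0, 1e9)    # 1 nF, 1.0 V, 1 GHz
    reduced = dynamic_power_w(1e-9, 0.9, 0.8e9)  # lower voltage and frequency
    print("nominal: %.3f W, reduced: %.3f W" % (nominal, reduced))
```

Because voltage enters as a square, lowering voltage and frequency together reduces power more than lowering frequency alone.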

Types of power consumption
Total power consumption consists of two types of power: dynamic and static (also known as static leakage) consumption, so total device power is calculated as:

P_total = P_dynamic + P_static

As we have just discussed, clock transitions are a large portion of the dynamic consumption, but what is this “dynamic consumption”? Basically, in software we have control over dynamic consumption, but we do not have control over static consumption.

Static power consumption. Leakage consumption is the power that a device consumes independent of any activity or task the core is running, because even in a steady state there is a low “leakage” current path (via transistor tunneling current, reverse diode leakage, etc.) from the device’s Vin to ground. The only factors that affect the leakage consumption are supply voltage, temperature, and process.

We have already discussed voltage and process in the introduction. In terms of temperature, leakage current rises as the device heats up: higher temperature increases the thermal energy of the charge carriers and lowers transistor threshold voltages, so more current leaks through nominally off transistors, causing greater static power consumption. As the focus of this chapter is software, this will be the end of static power consumption theory.

Dynamic power consumption. The dynamic consumption of the embedded processor includes the power consumed by the device actively using the cores, core subsystems, peripherals such as DMA, I/O (radio, Ethernet, PCIe, CMOS camera), memories, and PLLs and clocks. At the low level, this can be translated to say that dynamic power is the power consumed by switching transistors which are charging and discharging capacitances.

Dynamic power increases as we use more elements of the system: more cores, more arithmetic units, more memories, higher clock rates, or anything else that could increase the number of transistors switching or the speed at which they are switching. To a first approximation, the dynamic consumption is independent of temperature, but it still depends on supply voltage levels.

Maximum, average, worst-case and typical. When measuring power, or determining power usage for a system, there are four main types of power that need to be considered: maximum power, average power, worst-case power consumption, and typical power consumption.

Maximum and average power are general terms, used to describe the power measurement itself more than the effect of software or other variables on a device’s power consumption.

Simply stated, maximum power is the highest instantaneous power reading measured over a period of time. This sort of measurement is useful to show the amount of decoupling capacitance required by a device to maintain a decent level of signal integrity (required for reliable operation).

Average power is intuitive at this point: technically, it is the amount of energy consumed in a time period, divided by that time (power readings averaged over time). Engineers do this by calculating the average current consumed over time and using that to find power. Average power readings are what we are focusing on optimizing, as this is the determining factor for how much power a battery or power supply must be able to provide for a processor to perform an application over time, and this is also used to understand the heat profile of the device.
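As a rough sketch of the two definitions, the snippet below computes maximum and average power from a list of current readings. It assumes the samples are evenly spaced in time and that the supply voltage is constant; the function name and the sample values are hypothetical.

```python
# Maximum vs. average power from evenly spaced current samples.
# Assumes a constant supply voltage; sample values are made up for illustration.

def power_stats(current_samples_a, supply_v):
    """Return maximum and average power (watts) for evenly spaced current samples."""
    if not current_samples_a:
        raise ValueError("need at least one sample")
    powers = [supply_v * current for current in current_samples_a]
    return {
        "max_w": max(powers),                # highest instantaneous reading
        "avg_w": sum(powers) / len(powers),  # readings averaged over time
    }

if __name__ == "__main__":
    samples = [0.1, 0.2, 0.6, 0.1]  # amps, hypothetical readings
    stats = power_stats(samples, supply_v=1.0)
    print("max %.2f W, average %.2f W" % (stats["max_w"], stats["avg_w"]))
```

The short spike sets the maximum, while the average reflects what the battery or supply must deliver over time.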

Both worst-case and typical power numbers are based on average power measurement. Worst-case power, or the worst-case power profile, describes the amount of average power a device will consume at 100% usage over a given period. One hundred percent usage refers to the processor utilizing the maximum number of available processing units (data and address generation blocks in the core, accelerators, bit masking, etc.), memories, and peripherals simultaneously. This may be simulated by putting the cores in an infinite loop of performing six or more instructions per cycle (depending on the available processing units in the core) while having multiple DMA channels continuously reading from and writing to memory, and peripherals constantly sending and receiving data. Worst-case power numbers are used by the system architect or board designer in order to provide adequate power supply to guarantee functionality under all worst-case conditions.

In a real system, a device will rarely if ever draw the worst-case power, as applications do not use all the processing elements, memory, and I/O for long periods of time, if at all. In general, a device provides many different I/O peripherals, though only a portion of them are needed, and the device cores may only need to perform heavy computation for small portions of time, accessing just a portion of memory. Typical power consumption then may be based on the assumed “general use case” example application that may use anywhere from 50% to 70% of the processor’s available hardware components at a time. This is a major aspect of software applications that we are going to be taking advantage of in order to optimize power consumption.

Measuring power consumption
Measuring power is hardware dependent: some embedded processors provide internal measurement capabilities; processor manufacturers may also provide “power calculators” which give some power information; there are a number of power supply controller ICs which provide different forms of power measurement capabilities; some power supply controllers called VRMs (voltage regulator modules) have these capabilities internal to them to be read over peripheral interfaces; and finally, there is the old-fashioned method of connecting an ammeter in series to the core power supply.

Measuring power using an ammeter. The “old-fashioned” method is to measure power via the use of an external power supply connected in series to the positive terminal of an ammeter, which connects via the negative connector to the DSP device power input, as shown in Figure 13.1.

Figure 13.1: Measuring power via ammeters.

Note that there are three different set-ups shown in Figure 13.1, which are all for a single processor. This is due to the fact that processor power input is isolated, generally between cores (possibly multiple supplies), peripherals, and memories. This is done by design in hardware as different components of a device have different voltage requirements, and this is useful to isolate (and eventually optimize) the power profile of individual components.

In order to properly measure power consumption, the power to each component must be properly isolated, which in some cases may require board modification, specific jumper settings, etc. The most ideal situation is to be able to connect the external supply/ammeter combination as close as possible to the processor power input pins.

Alternatively, one may measure the voltage drop across a (shunt) resistor which is in series with the power supply and the processor power pins. By measuring the voltage drop across the resistor, current is found simply by calculating I = V/R.
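The shunt calculation is simple enough to sketch directly. The resistor value, supply voltage, and measured drop below are hypothetical; the only relations used are I = V/R from the text and P = V*I at the processor pins, where the pin voltage is the supply voltage minus the drop across the shunt.

```python
# Shunt-resistor current and power calculation (example values are assumptions).

def shunt_current_a(v_drop_v, r_shunt_ohm):
    """Current through the shunt from the measured voltage drop: I = V / R."""
    if r_shunt_ohm <= 0:
        raise ValueError("shunt resistance must be positive")
    return v_drop_v / r_shunt_ohm

def load_power_w(v_supply_v, v_drop_v, r_shunt_ohm):
    """Power delivered to the processor pins, excluding the loss in the shunt."""
    current = shunt_current_a(v_drop_v, r_shunt_ohm)
    return (v_supply_v - v_drop_v) * current

if __name__ == "__main__":
    # Hypothetical: 1.0 V core supply, 10 milliohm shunt, 5 mV measured drop.
    print("current: %.3f A" % shunt_current_a(0.005, 0.010))
    print("power:   %.4f W" % load_power_w(1.0, 0.005, 0.010))
```

A small shunt value keeps the drop, and therefore the disturbance to the supply rail, small, at the cost of needing a more sensitive voltage measurement.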

Measuring power with a Hall sensor IC. In order to simplify efficient power measurement, many embedded vendors are building boards that use a Hall-effect-based sensor. When a Hall sensor is placed on a board in the current path to the device’s power supply, it generates a voltage equal to the current times some coefficient, plus an offset.

In the case of Freescale’s MSC8144 DSP Application Development System board, an Allegro ACS0704 Hall sensor is provided on the board, which enables such measurement. With this board, the user can simply connect a scope to the board, view the voltage signal over time, and use this to calculate average power using Allegro’s current-to-voltage graph, shown in Figure 13.2.

Figure 13.2:  Hall effect IC voltage-to-current graph (www.allegromicro.com/en/Products/Part.. ./0704/ 0704-015.pdf).

Using Figure 13.2, we can calculate input current to a device based on measuring potential across Vout as:

I = (Vout - 2.5) * 10 A
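The conversion above can be applied to a set of scope samples to estimate average power. In this sketch, the 2.5 V offset and the factor of 10 A per volt come from the formula in the text; the core supply voltage and the sample values are hypothetical, and the function names are our own.

```python
# Convert Hall-sensor output voltage samples to current and average power.
# Offset and gain follow the formula in the text; supply voltage and samples
# are assumptions for illustration.

HALL_OFFSET_V = 2.5        # sensor output at zero current (from the formula)
HALL_GAIN_A_PER_V = 10.0   # amps per volt above the offset (from the formula)

def hall_current_a(vout_v):
    """Input current implied by one sensor output reading."""
    return (vout_v - HALL_OFFSET_V) * HALL_GAIN_A_PER_V

def average_power_w(vout_samples_v, core_supply_v):
    """Average power over evenly spaced sensor readings at a constant supply."""
    if not vout_samples_v:
        raise ValueError("need at least one sample")
    currents = [hall_current_a(v) for v in vout_samples_v]
    return core_supply_v * sum(currents) / len(currents)

if __name__ == "__main__":
    scope_samples = [2.55, 2.65]  # volts, hypothetical scope readings
    print("average power: %.3f W" % average_power_w(scope_samples, core_supply_v=1.0))
```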

Using VRM ICs. Finally, some voltage regulator module power supply ICs (VRMs), which are used to split a large input voltage into a number of smaller ones to supply individual sources at varying potentials, also measure current/power consumption and store the values in registers to be read by the user. Measuring current via the VRM requires no extra equipment, but this sometimes comes at the cost of accuracy and real-time measurement.

For example, the PowerOne ZM7100 series VRM (also used on the MSC8144ADS) provides current readings for each supply, but the current readings are updated only once every 0.5 to 1 seconds, and the reading accuracy is on the order of 20%, so instantaneous reading for maximum power is not possible, and fine tuning and optimization may not be possible using such devices.

In addition to deciding on a specific method for measuring power in general, different methods exist to measure dynamic power versus static leakage consumption. The static leakage consumption data is useful in order to have a floor for our low-power expectations, and to understand how much power the actual application is pulling versus what the device will pull in idle. We can then subtract that from the total power consumption we measure in order to determine the dynamic consumption the processor is pulling, and work to minimize that. There are various tools available in the industry to help in this area.

Static power measurement. Leakage consumption on a processor can usually be measured while the device is placed in a low-power mode, assuming that the mode shuts down clocks to all of the core subsystems and peripherals. If the clocks are not shut down in low-power mode, the PLLs should be bypassed, and then the input clock should be shut down, thus shutting down all clocks and eliminating clock and PLL power consumption from the static leakage measurement.

Additionally, static leakage should be measured at varying temperatures since leakage varies based on temperature. Creating a set of static measurements based on temperature (and voltage) provides valuable reference points for determining how much dynamic power an application is actually consuming at these temperature/voltage points.

Dynamic power measurement. The power measurements should separate the contribution of each major module in the device to give the engineer information about what effect a specific configuration will have on a system’s power consumption. As noted above, dynamic power is found simply by measuring the total power (at a given temperature) and then subtracting the leakage consumption for that given temperature using the initial static measurements from above.
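The subtraction described above can be sketched as a lookup into a table of static measurements. The table values here are hypothetical, and the linear interpolation between measured temperatures is our own simplifying assumption (leakage generally does not vary linearly with temperature, so closely spaced reference points are preferable).

```python
# Dynamic power = total power - leakage at the same temperature.
# Table values are hypothetical; linear interpolation between measured
# points is a simplifying assumption for this sketch.
from bisect import bisect_left

def leakage_at(temp_c, table):
    """table: list of (temp_c, leakage_w) measurements sorted by temperature."""
    temps = [t for t, _ in table]
    if not temps[0] <= temp_c <= temps[-1]:
        raise ValueError("temperature outside the measured range")
    idx = bisect_left(temps, temp_c)
    if temps[idx] == temp_c:
        return table[idx][1]
    (t0, p0), (t1, p1) = table[idx - 1], table[idx]
    return p0 + (p1 - p0) * (temp_c - t0) / (t1 - t0)

def dynamic_power_from_total(total_w, temp_c, table):
    """Subtract the static floor from a total power measurement."""
    return total_w - leakage_at(temp_c, table)

if __name__ == "__main__":
    static_table = [(25.0, 0.10), (50.0, 0.16), (85.0, 0.30)]  # assumed readings
    print("dynamic: %.3f W" % dynamic_power_from_total(0.93, 37.5, static_table))
```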

Initial dynamic measurement tests include running sleep-state tests, debug-state tests, and a NOP test. Sleep-state and debug-state tests will give the user insight into the cost of enabling certain clocks in the system. A NOP test, as in a loop of NOP commands, will provide a baseline dynamic reading for your core’s consumption when mainly using the fetch unit of the device, but no arithmetic units, address generation, bit mask, memory management, etc.

When comparing specific software power optimization techniques, we compare the before and after power consumption numbers of each technique in order to determine the effect of that technique.

Profiling your application’s power consumption
Before optimizing an application for power, the programmer should get a baseline power reading of the section of code being optimized. This provides a reference point for measuring optimizations, and also ensures that the alterations to code do in fact decrease total power, and not the opposite. In order to do this, the programmer needs to generate a sample power test which acts as a snapshot of the code segment being tested.

This power test-case generation can be done by profiling code performance using a high-end profiler to gain some base understanding of the percentage of processing elements and memory used. We can demonstrate this by creating a new project in a standard tools IDE (there are many available) with the profiler enabled, then compiling, and running the project. The application will run from start to finish, at which point the user may select a profiler view and get any number of statistics.

Using relevant data such as the percentage of ALUs used, AGUs used, code hot-spots, and knowledge of memories being accessed, we can get a general idea of where our code will spend the most time (and consume the most power). We can use this to generate a basic performance test which runs in an infinite loop, enabling us to profile the average “typical” power of an important code segment.

As an example, consider an application with two main functions: func1 and func2. Profiling the example code, we can see from Figure 13.3 that the vast majority of cycles are consumed by the func1 routine.

Figure 13.3: Profiling for hot spots

This routine is located in M2 memory and reads data from cacheable M3 memory (meaning it may cause write-back accesses to the L2 and L1 caches). By using the profiler (as per Figure 13.4), information regarding the percentage ALU and percentage AGU utilization can be extracted.

Figure 13.4: Core component (% ALU, % AGU) utilization

We can effectively simulate this by turning the code into an infinite loop, adjusting the I/O, compiling at the same optimization level, and verifying that we see the same performance breakdown. Another option would be to write a sample test in assembly code to force certain ALU/AGU usage models to match our profile, though this is not as precise and makes testing of individual optimizations more difficult.

We can then set a break point, re-run our application, and confirm that the device usage profile is in line with our original code. If not, we can adjust the compiler optimization level or our code until it matches the original application.

This method is quick and effective for measuring core power consumption for various loads and, if we mirrored the original application by properly using the profiler, this should account for stalls and other pipeline issues as the profiler provides information on total cycle count as well as instruction and VLES utilization. By having the infinite loop, testing is much easier as we are simply comparing steady-state current readings of optimized and non-optimized code in the hope of getting lower numbers. We can use this to measure numerous metrics such as average power over time, average power per instruction, average power per cycle, and energy (power * time) in joules for some time t. For measuring specific algorithms and power-saving techniques, we will form small routines using similar methods and then optimize and measure the power savings over time.
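The metrics listed above follow from a steady-state power reading combined with the profiler's cycle and instruction counts. The sketch below is one possible way to compute them; it interprets the per-instruction and per-cycle figures as energy per instruction and per cycle, which is our own reading of the text, and all input numbers are hypothetical.

```python
# Deriving comparison metrics from a steady-state reading and profiler counts.
# Input values are hypothetical; per-cycle and per-instruction figures are
# expressed as energy (joules), which is an interpretation, not the text's wording.

def power_metrics(avg_power_w, duration_s, cycles, instructions):
    """Return energy and per-cycle / per-instruction energy for one test run."""
    energy_j = avg_power_w * duration_s  # energy = power * time
    return {
        "energy_j": energy_j,
        "energy_per_cycle_j": energy_j / cycles,
        "energy_per_instruction_j": energy_j / instructions,
    }

def percent_saving(before_w, after_w):
    """Relative reduction between baseline and optimized steady-state readings."""
    return 100.0 * (before_w - after_w) / before_w

if __name__ == "__main__":
    baseline = power_metrics(avg_power_w=0.5, duration_s=2.0,
                             cycles=2_000_000_000, instructions=4_000_000_000)
    print("energy: %.2f J" % baseline["energy_j"])
    print("saving: %.1f%%" % percent_saving(0.5, 0.45))
```

Comparing energy rather than power alone matters when an optimization changes run time: a routine that draws slightly more power but finishes much sooner can still consume less energy.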

Using these tools will enable effectively measuring and confirming the knowledge shared in the next section of this text, which covers the software techniques for optimizing power consumption.

Part 2: Minimizing hardware power use
Part 3: Optimizing data flow and memory
Part 4: Peripheral and algorithmic optimization

Rob Oshana has 30 years of experience in the software industry, primarily focused on embedded and real-time systems for the defense and semiconductor industries. He has BSEE, MSEE, MSCS, and MBA degrees and is a Senior Member of IEEE. Rob is a member of several Advisory Boards including the Embedded Systems group, where he is also an international speaker. He has over 200 presentations and publications in various technology fields and has written several books on embedded software technology. He is an adjunct professor at Southern Methodist University where he teaches graduate software engineering courses. He is a Distinguished Member of Technical Staff and Director of Global Software R&D for Digital Networking at Freescale Semiconductor.

Mark Kraeling is Product Manager at GE Transportation in Melbourne, Florida, where he is involved with advanced product development in real-time controls, wireless, and communications. He’s developed embedded software for the automotive and transportation industries since the early 1990s. Mark has a BSEE from Rose-Hulman, an MBA from Johns Hopkins, and an MSE from Arizona State.

Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012, this article was excerpted from Software engineering for embedded systems, by Robert Oshana and Mark Kraeling.

