Designing Embedded Software for Lower Power

If you think designing for low power is only a hardware effort, think again. Embedded software developers also need to design with power in mind.

The hottest feature in embedded systems today is portability. More and more, people are taking their entertainment, communications, and work with them wherever they go. But portability means batteries, and that means a design needs to use as little power as possible. Keeping power demand down seems like it's only a hardware problem, but software can have a significant impact. The keys to embedded software for low power are the right software architecture and the right code optimizations.

Most embedded developers are familiar with the concept of structuring their software to boost performance and lower memory requirements. Designing software to minimize a system's power consumption, however, is a relatively unknown tactic. One place to begin learning about the impact of software on power is to look at processor utilization.

Many embedded systems can be characterized as event-driven. The processor spends much of its time waiting for something to happen. The processor may be doing useful work during this wait, such as checking status or maintaining a real-time clock, but it is not doing the design's primary task.

The processor engages in the design's primary task only in response to a triggering event, such as a user key-press or the arrival of a data packet. The triggered task typically must be completed within a critical time window, with the result that the processor is fully engaged while completing the task. When the task is completed, however, the processor typically returns to waiting.

The alternation between high activity and relative idleness suggests a power-reduction approach. Rather than have the processor running background tasks while waiting, have it “sleep” to conserve power. As Figure 1 shows, power consumption in a processor's sleep mode can be as little as 5% of that in the fully active mode.

Figure 1: The power consumption of CPUs in their sleep or standby mode can be as little as 5% of their normal operating power (Source: Green Hills Software)

CPU Sleep Gets Tricky

Many designs already use a similar approach to reduce the power demands of peripheral devices. Hardware on the peripherals shuts them down when not in active use. But to shut down a CPU when idle is a little trickier. The software architecture must ensure that background tasks still have an opportunity to run. To do so, the software must be designed to keep track of three things: idle time, sleep time, and scheduled tasks.

Keeping track of idle time tells the software if it's time to shut down the processor. One approach to tracking the processor's degree of idleness is to establish a counter that the program clears when it responds to an event. By then incrementing the counter during each loop through the background task or by using a “heartbeat” clock signal, the software can determine how long it has been relatively idle. When the idle count reaches a set threshold, the software enters a sleep mode and waits for an event interrupt to wake it.
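As a minimal sketch in C, assuming hypothetical enter_sleep_mode() and service_background_tasks() hooks (the names are illustrative, not from any particular vendor API), the idle-tracking loop might look like this:

    #include <stdint.h>

    #define IDLE_THRESHOLD 1000u  /* tuning value: idle loops before sleeping */

    extern void enter_sleep_mode(void);         /* hypothetical: halt CPU until an interrupt */
    extern void service_background_tasks(void); /* hypothetical: status checks, housekeeping */

    static volatile uint32_t idle_count;        /* cleared by event ISRs */

    void event_isr(void)
    {
        idle_count = 0;   /* any real event resets the idleness measure */
        /* ... handle the event ... */
    }

    void main_loop(void)
    {
        for (;;) {
            service_background_tasks();
            if (++idle_count >= IDLE_THRESHOLD) {
                enter_sleep_mode();   /* wakes on the next unmasked interrupt */
                idle_count = 0;       /* start measuring idleness again */
            }
        }
    }

The threshold is the tuning knob discussed above: large enough that background work still gets its cycles, small enough that idle periods actually turn into sleep time.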

While the approach seems simple, it does require careful thinking. The wait time before entering the sleep state needs to match the design's activity profile. It must be long enough that background tasks get enough cycles to perform their function, but be as short as practical to maximize the power savings.

In addition, software designers must consider what events will awaken the system. Signals like “heartbeat” clocks, for instance, should be masked to keep the processor from waking just to update a real-time clock. Instead, the system can activate a hardware counter to track the heartbeats and use the count to update the real-time clock when a significant event rouses the processor.

Keeping track of how long the processor was asleep before it awakened is important for more than the real-time clock; background tasks can also benefit. Many such tasks have a time constraint for completion or a minimum frequency of occurrence. A system status check, for instance, may be required to find and report errors within some deadline. When putting the processor to sleep halts such a task, knowing the duration of its sleep allows the software to determine if the task needs immediate attention in order to meet the deadline.
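Continuing in the same sketch style (the counter-reading and clock-update functions are again hypothetical names), a wake-up handler can catch the real-time clock up in one step and check whether a deadline-bound background task is now overdue:

    #include <stdint.h>

    extern uint32_t read_heartbeat_counter(void);       /* hypothetical: ticks accumulated in hardware during sleep */
    extern void     update_real_time_clock(uint32_t ticks);
    extern void     run_status_check(void);

    #define STATUS_CHECK_DEADLINE_TICKS 500u            /* illustrative deadline */

    static uint32_t ticks_since_status_check;

    void on_wake(void)
    {
        uint32_t slept = read_heartbeat_counter();      /* how long were we asleep? */

        update_real_time_clock(slept);                  /* fold all missed ticks in at once */

        ticks_since_status_check += slept;
        if (ticks_since_status_check >= STATUS_CHECK_DEADLINE_TICKS) {
            run_status_check();                         /* deadline at risk: run it immediately */
            ticks_since_status_check = 0;
        }
    }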

Knowing the sleep time also allows the software to provide accurate timing for pending tasks. Pending tasks arise when the design's operating system allows task execution to be scheduled in time. Tasks that are to occur at regular intervals, or at some fixed delay after an event, are examples of such scheduled tasks.

Sleeping On Schedule

When pending tasks exist, the software needs to account for a task's schedule before entering a sleep mode. Depending on the schedule, the software has two responses. If the task isn't due soon, the software can go ahead and enter the sleep state. It will, however, need to provide a way of activating the task on schedule, even though the processor is in a sleep mode.

Because the software needs to mask “heartbeat” clock interrupts to be able to stay in the sleep mode, some other mechanism has to be in place for measuring time intervals while the CPU is inactive. One approach is to establish a counter that generates an event when the task's scheduled time has arrived. The system routes the “heartbeat” clock to the counter and sets the countdown value before entering the sleep state. The counter will generate a one-time event when the scheduled task needs to be performed.
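In sketch form, assuming a hypothetical one-shot timer interface clocked by the heartbeat signal:

    #include <stdint.h>

    extern void timer_set_countdown(uint32_t ticks);  /* hypothetical: one-shot, raises a wake interrupt at zero */
    extern void enter_sleep_mode(void);

    void sleep_until(uint32_t ticks_until_task)
    {
        timer_set_countdown(ticks_until_task);  /* schedule the wake-up event */
        enter_sleep_mode();                     /* timer expiry (or another event) wakes the CPU */
    }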

The counter ensures that scheduled tasks get completed regardless of how long the software might sleep. If the scheduled task is due relatively soon after the software plans to enter sleep mode, however, the software should hold off sleeping until the task is completed. Rousing the processor from sleep typically takes an extra burst of power, so sleeping saves power only if the sleep period is long enough to compensate for that burst. It may therefore be more power efficient to complete a soon-due scheduled task before entering sleep mode than to shut down and be re-activated almost immediately.
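A simple guard captures that trade-off (the break-even constant here is illustrative; the real value has to be measured for the target hardware):

    #include <stdbool.h>
    #include <stdint.h>

    #define SLEEP_BREAK_EVEN_TICKS 20u   /* illustrative: minimum sleep that repays the wake-up burst */

    bool worth_sleeping(uint32_t ticks_until_next_task)
    {
        /* Sleep only if the idle window outlasts the break-even point;
           otherwise stay awake, finish the pending task, then reconsider. */
        return ticks_until_next_task > SLEEP_BREAK_EVEN_TICKS;
    }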

Designing the software to allow the system to spend idle time in a low-power mode is not the only step software developers can take. They can optimize the software so it can accomplish its main task more quickly, increasing the time the system can stay in low power. This goal of minimizing execution time is one most real-time developers are already experienced in tackling. There are some variations, however, when it comes to power-aware optimizing.

The DSP Connection

Many of the power-saving techniques in the main text apply to both conventional processors and digital signal processors (DSPs). However, DSPs have some unique features that need consideration. Some can help developers when designing software for low power and some represent an additional challenge.

The added power benefit that DSPs bring to the party is parallelism. A typical DSP is able to carry out more than one operation at a time, greatly reducing code execution time. The DSP's parallelism is particularly useful when the DSP and a conventional processor are working in tandem. Simple scalar loops in the conventional processor, for instance, can often be structured so that the DSP's parallel execution paths will handle multiple loop iterations at the same time. The key to exploiting such opportunities is having a good DSP compiler that recognizes and exploits the architecture's parallelism.
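For example, a scalar loop written with no loop-carried dependences and simple sequential access gives a vectorizing DSP compiler the freedom to map several iterations onto parallel units. This is a generic C sketch, not tied to any particular DSP toolchain:

    #include <stdint.h>

    /* Each iteration is independent of the others, so a parallelizing
       compiler can schedule several of them at once across a DSP's
       multiple execution units. */
    void scale_samples(int16_t *restrict out, const int16_t *restrict in,
                       int n, int16_t gain)
    {
        for (int i = 0; i < n; i++) {
            out[i] = (int16_t)((in[i] * gain) >> 8);
        }
    }

The restrict qualifiers matter: they tell the compiler the input and output buffers do not overlap, which is often what licenses the parallel schedule.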

One thing to watch for in DSPs is the power drain of memory access. As with conventional processors, bus and memory access drains more power than operating out of cache, so structuring code to maximize cache hits is important. Many DSPs use a Harvard architecture, however, with separate paths for instruction and data memory. That makes memory-aware coding twice as important, and more challenging: both the code and the data structures must be accounted for, along with their interactions.

Of the two, the data cache is probably the more important to keep warm. For instance, if two algorithms operate on the same block of data but have a lot of intervening code, the algorithms are not likely to both be in cache at the same time. Ensuring that the data remains in cache, however, may still save power in the long run. Whether it does depends on the size of the data block and on the likelihood that keeping it resident forces a data cache miss for some other algorithm.

Power Optimizations Are Different

One variation is the amount of optimization that is beneficial. If execution time is the only consideration, developers can stop optimization once the software is fast enough. If a task is allocated 10 msec for completion, including safety margins and headroom, and the software can do it in 9 msec, it is generally not worth extra development effort to further improve performance. All that is gained is the nebulous value of added headroom.

When power is a concern, however, all speed gains show up as power reduction. Further, speed gains provide an opportunity to lower system clock-speed. If a task is allocated 10 msec and the software can do it in 5 msec, for instance, the system clock can be cut in half and still meet performance specifications. Because the power consumption of a CMOS circuit is directly proportional to frequency, cutting clock speed saves system power.

Of course, running something twice as long at half the power still burns up the same amount of energy, so simply lowering the clock rate may not provide much direct benefit. There is a hidden benefit, however. Many high-performance processors operate over a range of supply voltages, with higher voltage needed for faster clocks. Lowering the system clock, then, would allow a reduction in the processor's supply voltage. Because power varies with the square of the voltage, the longer operating period doesn't cancel out the power savings.
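A quick back-of-the-envelope calculation shows why. Dynamic CMOS power scales with V-squared times frequency, while the task's run time scales with 1/frequency, so the energy per task scales with V-squared alone. The supply voltages below are illustrative, not taken from any datasheet:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative supply voltages: full clock vs. half clock. */
        double v_full = 1.8, v_half = 1.2;

        /* Energy per task ~ V^2 (the frequency terms cancel out). */
        double ratio = (v_half * v_half) / (v_full * v_full);
        printf("Half-clock, low-voltage energy: %.0f%% of full-clock energy\n",
               ratio * 100.0);   /* prints about 44% */
        return 0;
    }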

Optimizing Blindly Can Cost

From a power standpoint, then, it appears that the more optimization the better. Optimizing by hand, however, is tedious and error-prone. Designers need a good optimizing compiler, which can more than double software performance. The compiler needs to have more refined controls than a simple run-time switch for invoking speed optimizations, because blindly applied compiler optimizations may actually increase the system's power utilization even though they shorten the average execution time.

The source of the increase is outside of the processor, in the memory and system bus. Cellular telephone developers report that the power demands of bus and memory can be as much as three times the power consumed by the processor itself. Optimizations that speed software execution at the expense of increased bus traffic, then, can be counter to the goal of power reduction.

Developers need to apply performance optimizations with bus transactions in mind. By analyzing code size, examining cache misses, and using a cycle-accurate simulator to track memory usage, software developers can determine if an optimization effort triggers power-consuming bus activity.

Along with optimizing code to minimize bus utilization, developers should look at data movement. Sometimes a simple re-ordering of an algorithm's data access patterns can have a major impact on memory traffic. For example, an image-processing task that moves through an image row-by-row, followed by a task that goes column-by-column, will end up reading the image array from main memory twice unless the whole array fits in cache. If the second task can be restructured to also work row-by-row, however, the two tasks can operate on the same row in succession before moving on to the next row. The result is that the array is read from main memory only once, with the cache needing to hold just one row at a time.
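A sketch of the restructured version in C, with hypothetical per-pixel helpers standing in for the two tasks:

    #include <stdint.h>

    #define ROWS 480
    #define COLS 640

    extern uint8_t task_a_pixel(uint8_t p);   /* hypothetical first-pass operation */
    extern uint8_t task_b_pixel(uint8_t p);   /* hypothetical second-pass operation */

    void process_image(uint8_t img[ROWS][COLS])
    {
        for (int r = 0; r < ROWS; r++) {
            /* First task's pass over this row... */
            for (int c = 0; c < COLS; c++)
                img[r][c] = task_a_pixel(img[r][c]);

            /* ...then the second task's pass while the row is still in cache. */
            for (int c = 0; c < COLS; c++)
                img[r][c] = task_b_pixel(img[r][c]);
        }
    }

Each row makes exactly one round trip between main memory and the cache, instead of the whole image making two.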

Co-Design Complicates Optimization

Optimizing program code and data movement is easiest when the hardware architecture is fixed. When the hardware and software are in co-development, as with a system-on-a-chip (SoC) design, however, the speed, size, and configuration of memory are open to negotiation. Minimizing total system power then becomes a matter of balancing the power savings of software optimizations with the power costs of additional hardware.

The complexity of striking the right balance becomes apparent in the example shown in Figure 2. The standard compiled code for a simple FOR-NEXT barrel shift loop might come out looking like the first segment, occupying 6 bytes and requiring 33 cycles to complete. The second segment uses a loop-unrolling optimization to greatly boost performance, achieving the same 8-bit shift in only 8 cycles. However, it uses 8 bytes, or 33% more code memory. The third segment uses a partial unrolling of the loop, taking the middle ground at 7 bytes and 21 cycles.

Segment 1: Original assembly code (6 bytes; task completes in 33 cycles)

    Instruction       Size (bytes)   Cycles
    LDA A=8                2           2
    Loop: ROL B            1           1
          DEC A            1           1
          BNE Loop         2           2 (1 if no branch)

Segment 2: Loop unrolled (8 bytes; task completes in 8 cycles)

    Instruction       Size (bytes)   Cycles
    ROL B                  1           1
    ROL B                  1           1
    ROL B                  1           1
    ROL B                  1           1
    ROL B                  1           1
    ROL B                  1           1
    ROL B                  1           1
    ROL B                  1           1

Segment 3: Partial unroll (7 bytes; task completes in 21 cycles)

    Instruction       Size (bytes)   Cycles
    LDA A=4                2           2
    Loop: ROL B            1           1
          ROL B            1           1
          DEC A            1           1
          BNE Loop         2           2 (1 if no branch)

Figure 2: Different compiler optimizations produce different trade-offs between code size (instruction bytes) and execution time (instruction cycles)

As simple as it is, this example reflects the range of results possible with various degrees of optimization. Finding the lowest power approach for a design then becomes an exercise in calculating the power used in storing, loading, and executing the code for all the possible hardware and software configurations. Performance optimization to speed execution and keep the CPU in sleep mode longer can save CPU power, but comes at the expense of additional memory needs. The right degree of optimization will depend on the relative power usage of the two hardware segments.

In addition to trading off CPU utilization and memory, SoC developers have the option of adding hardware accelerators to speed code execution and gain the corresponding benefits. Again, the right decision from a power standpoint will depend on the relative power usage of the software and hardware approaches.

Unfortunately, there are few tools available to automate this decision-making process. Optimizing compilers exist, but their optimizations are geared toward performance and code size; they do not automatically optimize for low power. Furthermore, no commercial tools are available that will calculate the power impact of software decisions. Developers are on their own when evaluating optimization approaches.

That situation may be changing. Considerable research is underway in academic and government programs to develop tools and techniques for power-aware computing. Penn State University, for instance, has developed a prototype tool (SimplePower) that will estimate the impact of software changes on system power. The Defense Advanced Research Projects Agency (DARPA) has even broader goals. Its Information Processing Technology Office has established the Power Aware Computing/Communications (PACC) program to fund the development of a suite of tools and techniques for lowering the power needs of hardware and software in portable systems.

The level of effort going into developing power-aware software techniques shows the increasingly important role software has in meeting a portable system's power goals. Whether it is through algorithm restructuring, code optimization, or hardware/software partitioning, software developers should strive for lower power. Hardware may be where the power is used, but ultimately it is the software that controls how and when the hardware uses that power.
