Editor's note: In the final part in a series on how to manage your embedded software design’s power requirements, the authors discuss the how operations of the system’s peripheral functions can contribute to power savings, concluding with some techniques for algorithmic optimization.
When considering the impact that reading and writing data has on an embedded system's power utilization, we cannot think only about memory access: we also need to pull data into and out of the device. Here we will look at how to minimize power consumption in commonly used embedded processor I/O peripherals. Later we will discuss algorithmic techniques for power management.
On the first topic, things to consider include the peripheral’s burst size, speed grade, transfer width, and general communication modes. The main standard forms of peripheral communication for embedded processors include DMA (direct memory access), SRIO (serial rapid I/O), Ethernet, PCI Express, and RF antenna interfaces. I2C and UART are also commonly used, though mostly for initialization and debug purposes.
The fact that communication interfaces usually require their own PLLs/clocks increases the individual power consumption impact. The higher-clocked peripherals that we need to consider as the main power consumers are the DMA, SRIO, Ethernet, and PCI Express. Clock gating and peripheral low-power modes for these peripherals were discussed in the low-power modes section of this chapter, so this section will talk about how to optimize actual usage.
Although each protocol is different for the I/O peripherals and the internal DMA, they all share the fact that they are used to read/write data. As such, one basic goal is to maximize the throughput while the peripheral is active in order to maximize efficiency and the time the peripheral/device can be in a low-power state, thus minimizing the active clock times.
The most basic way to do this is to increase the transfer/burst size. For DMA, the programmer has control over burst size and transfer size in addition to the start/end address (and can follow the alignment and memory-access rules we discussed in earlier subsections on data path optimization). Using the DMA, the programmer can decide not only the alignment, but also the transfer "shape", for lack of a better word. That is, the programmer can transfer blocks in the form of two-dimensional, three-dimensional, and four-dimensional data chunks, moving application-specific data types on the alignment chosen by the programmer without spending cycles transferring unnecessary data. Figure 13.12 demonstrates the data structure of a three-dimensional DMA.
The user programs the start address for the data, the length of the first dimension, the offset for the second dimension and its number of transfers, followed by the offset for the third dimension and its number of transfers. At the end of all transfers, the programmer may also program the DMA to interrupt the core to signal data transfer completion. Having the DMA intelligently move the data in the format and organization needed by the user's application helps optimize data flow and core processing by avoiding the need for the core to reorganize data or alter algorithms that are optimized for data in a specific format. It also simplifies maintaining certain alignments, as the programmer can decide where each dimension of the data structure starts.
Other high-speed peripherals will generally also use a DMA, whether the system DMA or the peripheral's own private DMA, for data passing. In the case of the MSC8156 discussed earlier, the SRIO, PCI Express, and Ethernet controllers all have their own DMAs separate from the system DMA for data transfers. The basics still apply here: we want data transfers to be long (long bursts), we want bus accesses to be aligned, and we want optimal access to the system bus. We will discuss system bus optimization later in this section.
DMA of data vs. CPU
While on the topic of DMA, we need to consider whether the core should move data from internal core memory or whether a DMA should be utilized in order to save power. As the DMA hardware is optimized solely for moving data, it moves data while consuming less power than the core (which generally runs at much higher frequencies than the DMA). Because the core runs at such a high frequency and is not intended solely for data movement, it uses more dynamic power and incurs heavy stall penalties when accessing external memory.
As some external memory access and stalls are incurred for writing to peripheral registers when setting up the DMA, there is a point where data accesses are too small or infrequent to justify DMA usage. In general, when moving larger chunks of data, or data in a predictable manner, for the purpose of optimizing power consumption (and core performance) DMA should be utilized for maximum power savings and application efficiency.
For transactions and I/O that are not large enough to justify DMA we can consider caching, as this assists with stalls and requires virtually no core intervention. Generally speaking, using the cache is much simpler than DMA, so this is the generally accepted solution for unpredictable data I/O, while DMA should be used for larger memory transfers. Due to the overhead of programming the DMA, and the unique properties of data per application, the trade-off between power savings, performance, and program complexity for DMA versus cache has to be evaluated on a case-by-case basis. Peripherals with their own DMA generally require the programmer to use that DMA for peripheral interaction, which is a good habit to force the programmer into, as we have just discussed.
Just as the DMA peripheral is optimized for data movement and can move data more efficiently, with less power consumption, than the high-frequency embedded core, other peripherals acting as coprocessors can perform special functions more efficiently than the core. In the case of the MSC8156, an on-board baseband coprocessor (accelerator) includes hardware for fast Fourier transforms, discrete Fourier transforms, and turbo/Viterbi decoding. When a chain of transforms can be offloaded onto the accelerator then, depending on the cost of transferring data and the transform sizes, the system can save power and cycles by letting the coprocessor do this work: the coprocessor runs at a much lower frequency than the core, has fewer processing elements aimed at a single function, and has automatic lower-power modes that engage when it is not processing transforms.
System bus configuration
System bus stalls due to lack of priority on the bus can cause a peripheral to actively wait unnecessarily for extra cycles when not set up properly. These extra active wait cycles mean more wasted power. Generally embedded processors will have system bus configuration registers which allow the programmer to configure the priority and arbitration per bus initiator port. In the case of the MSC8156, the system bus (called the CLASS) has 11 initiator ports and eight target ports (memory and register space). When the programmer understands the I/O needs of the application, it is possible to set up priority appropriately for the initiators that need the extra bandwidth on the bus so they can access memory and register space with minimal stalls.
There is not much of a trick to this; simply set priority based on I/O usage. Some devices such as the MSC815x series DSPs provide bus profiling tools which enable the programmer to count the number of accesses per initiator port to each target. This allows the programmer to see where congestion and bottlenecks are occurring in order to appropriately configure and tune the bus. Profiling tools also allow the programmer to see how many “priority upgrades” are needed per port. This means that the programmer can temporarily assign test priorities to each port, and if some ports constantly require priority upgrades the programmer can decide to set the starting priority of these ports one level up and re-profile.
Peripheral speed grades and bus width
As with system bus access, the peripheral's external interface should be set up according to the actual needs of the system. The catch-22 of I/O peripherals is that some peripherals must remain powered on all the time (so minimal to no use of low-power modes is available). If a communication port such as SRIO is dedicated to receiving incoming blocks of data to process, then gating its clocks and using low-power modes is not an option even when no data is coming in. As such, there is a balancing game to be played here.
In testing software and power consumption on the MSC8156, we found that running four lanes of SRIO at 3.125 GHz with 40% utilization (~4 Gbps of data) consumes a comparable amount of power to, or even less than, running four lanes of SRIO at 2.5 GHz with 50% utilization (the same data throughput). So the user needs to test various cases, or make use of the device manufacturer's power calculator, in order to make an informed decision. In a case like this, peripherals which have an auto-idle feature should make use of the higher-speed bus in order to maximize sleep times.
SRIO, PCI Express, Ethernet over SGMII, and some antenna interfaces make use of the same serial I/O hardware, so similar care should be taken here. All could be required to be held in an active mode as a form of “device wake” or signaling to the DSP core, meaning they may be restricted from going into sleep modes. In the case of antenna signaling, this is especially detrimental as the active antenna’s RF interface has to constantly consume power to emit signal. If possible, it is ideal to use an alternative method for waking the core in order to enable idle and sleep modes on the antenna.
Peripheral to core communication
When considering device waking, and general peripheral to core I/O, we have to consider how the peripheral interacts with the core processors. How does the core know that data is available? How often is the core notified that data is available? How does the core know when to send data over the peripheral? There are three main methods for managing this: polling, time-based processing, and interrupt processing.
Polling

This is by far the least efficient method of core-peripheral interaction, as it has the core constantly awake and burning through high-frequency clock cycles (consuming active current) just to see if data is ready. The only advantage of this method applies when the programmer is not concerned about power consumption: polling enables the core to avoid the context switches that occur during interrupt processing, thus saving some cycles in order to access data faster. Generally this is only used for testing maximum peripheral bandwidth, as opposed to being used in a real application.
Time-based processing

This works on the assumption that data will always be available at a certain interval. For example, if a processor is processing a GSM voice codec (AMR, EFR, HR, etc.), the core knows that samples will be arriving every 20 ms, so it can look for the new audio samples on this time basis instead of polling. This allows the core to sleep, use a timer for wake functionality, and then perform data processing. The downside is the complexity and inflexibility of this model: set-up and synchronization require a lot of effort on the programmer's side, and the same effect can be reached using simple interrupt processing.
Interrupt processing

The final peripheral-to-core communication mechanism is also the most commonly used one, as it provides the benefits of time-based processing without the complicated software architecture. We also briefly discussed interrupt processing in the low-power modes section as a method for waking the core from sleep states: when new data samples and packets come in for processing, the core is interrupted by the peripheral (and can be woken if in a sleep state) to start processing new data. The peripheral can also interrupt the core when it is ready to transfer new data, so that the core does not have to constantly poll a heavily loaded peripheral to see when data is ready to send.
Power consumption results for polling vs. interrupt processing were shown in Figure 13.8 when comparing the baseline MJPEG vs. using WAIT for PD and using STOP for PD modes. When not using WAIT and STOP modes, the application would constantly check for new buffers without taking advantage of massive idle times in the application.
Algorithmic optimization

Of the three main areas of power optimization discussed here, algorithmic optimization requires the most work for a given amount of power savings. Algorithmic optimization includes optimization at the core application level: code structuring, data structuring (in some cases, this could be considered data path optimization), data manipulation, and optimizing instruction selection.
Compiler optimization levels
In the data path section, we briefly discussed how the compiler can be used to optimize code for minimal size. The compiler may also be used to optimize code for maximum performance (utilizing the maximum number of processing units per cycle and minimizing the amount of time code is running). A key question is whether optimizing for performance will reduce power consumption. In general, increasing the number of active processing units increases the power consumed per cycle, but the total energy to perform a function is reduced because the number of cycles needed to perform it is reduced. The question of when to optimize for performance versus code size generally still fits the 80/20 rule (80% of cycle time is spent in 20% of the code), so, as mentioned in the data path section, the general rule is to optimize the cycle-hungry (20%) portion of code for performance, while focusing on minimizing code size for the rest. Fine-tuning this is the job of the programmer and will require power measurement (as discussed earlier). The rest of this section will cover specific algorithmic optimizations, some of which may be performed by the performance optimizer in the compiler.
Instruction packing was included in the data path optimization section above, but may also be listed as an algorithmic optimization as it involves not only how memory is accessed, but also how code is organized.
Loop unrolling revisited
We briefly discussed altering loops in code in order to optimize cache utilization before. As we discussed earlier, another method for optimizing both performance and power in embedded processors is loop unrolling. This method effectively partially unravels a loop, as shown in the code snippets below:
for (i = 0; i < 100; i = i + 1)
    a[i] = 10 * b[i];

Loop unrolled by 4x:

for (i = 0; i < 100; i = i + 4)
{
    a[i] = 10 * b[i];
    a[i + 1] = 10 * b[i + 1];
    a[i + 2] = 10 * b[i + 2];
    a[i + 3] = 10 * b[i + 3];
}
Unrolling code in this manner enables the compiler to make use of four MACs (multiply-accumulates) in each loop iteration instead of just one, thus increasing processing parallelization and code efficiency (more processing per cycle means more idle cycles available for sleep and low-power modes). In the above case, we increase the parallelization of the loop by four times, so we perform the same number of MACs in 1/4 of the cycle time; the effective active clock time needed for this code is thus reduced by 3/4. Measuring the power savings using the MSC8156, we find that the above example optimization (cutting cycle time to 25% by utilizing four MACs per cycle instead of one) enables the core to achieve a ~48% total power saving during the time this routine is executed. Completely unrolling loops is not advisable, as it is counterproductive to the code size minimization efforts we discussed in the data path section, and would lead to extra memory accesses and the possibility of increased cache miss penalties.
Software pipelining

Another technique common to both embedded processor performance optimization and embedded processor power optimization is software pipelining. Software pipelining is a technique where the programmer splits up a set of interdependent instructions that would normally have to be performed one at a time so that the DSP core can begin processing multiple instructions in each cycle. Rather than explaining in words, the easiest way to follow this technique is to see an example.
Say we have the following code segment:
for (i = 0; i < 100; i = i + 1)
{
    a[i] = 10 * b[i];
    b[i] = 10 * c[i];
    c[i] = 10 * d[i];
}
Right now, although we have three instructions per loop iteration, the compiler will see that the first instruction depends on the second, and thus cannot be pipelined with it, nor can the second instruction be pipelined with the third, due to interdependence: a[i] cannot be set from b[i] while b[i] is simultaneously being set from c[i], and so on. So the DSP processor has to execute the above loop 100 times, with each iteration performing its three instructions one at a time (not very efficient), for a total of 300 cycles (best case) of MACs in the core of the loop. With software pipelining, we can optimize this in the following way:
First we see where we can parallelize the above code by unrolling the loop to some extent:
a[i] = 10 * b[i];
b[i] = 10 * c[i];
c[i] = 10 * d[i];
a[i + 1] = 10 * b[i + 1];
b[i + 1] = 10 * c[i + 1];
c[i + 1] = 10 * d[i + 1];
a[i + 2] = 10 * b[i + 2];
b[i + 2] = 10 * c[i + 2];
c[i + 2] = 10 * d[i + 2];
a[i + 3] = 10 * b[i + 3];
b[i + 3] = 10 * c[i + 3];
c[i + 3] = 10 * d[i + 3];
Using the above, we can see that certain instructions are not interdependent. The first assignment of array "a" relies on the original array "b", meaning we can potentially assign "a" entirely before doing any other instructions. If we do this, array "b" would be entirely free of dependence and could be completely assigned from the original array "c". We can apply the same reasoning to "c" as well.
We can use this idea to break the code apart and add parallelism by placing together instructions that can run in parallel when doing some assignment in advance:
First, we have to perform our first instruction (no parallelism):
a[i] = 10 * b[i];
Then we can have two instructions performed in one cycle:
b[i] = 10 * c[i];
a[i + 1] = 10 * b[i + 1];
Here we see that the first and second lines do not depend on each other, so there is no problem with running the above in parallel as one execution set.
Finally, we reach the point where three instructions in our loop are all being performed in one cycle:
c[i] = 10 * d[i];
b[i + 1] = 10 * c[i + 1];
a[i + 2] = 10 * b[i + 2];
Now that we see how to parallelize the loop and pipeline it, note that the final software-pipelined version will first have some "set-up", also known as loading the pipeline. This consists of the first sets of instructions we performed above. After this comes the pipelined loop itself, whose kernel performs all three assignments in each cycle.
By pipelining the loop, we enabled the compiler to reduce the number of cycles for MACs from 300 to:
- 3 MACs that can be performed in 2 cycles for pipeline loading
- 100 cycles (3 MACs each) in the core of our loop
- 3 MACs that can be performed in 2 cycles for pipeline unloading (draining)
for a total of 104 cycles, or roughly 1/3 of the execution time, thus reducing the amount of time the core clocks must be active by roughly 2/3 for the same functionality! Similar to the loop unrolling case, pipelining has enabled us to save substantially: ~43% total power over the time this routine is executed.
Eliminating recursion

An interesting technique is to eliminate recursive procedure calls in order to reduce function call overhead.
Recursive procedure calls require the function's general context, etc., to be pushed onto the stack with each call. So in the classic case of the factorial example (n!), this can be calculated using recursion with a function as follows:
If this recursive factorial function is called with n = 100, there would be ~100 function calls, entailing 100 branches to subroutines (change-of-flow operations that affect the program counter and software stack). Each change-of-flow instruction takes longer to execute because not only is the core pipeline disrupted during execution, but every branch adds at least a return address to the call stack. Additionally, if multiple variables are being passed, these must also be pushed onto the stack.
This means that this recursive subroutine requires 100+ individual writes to physical memory, with related stalls (as these writes/reads to memory will not be pipelined), and 100+ pipeline stalls due to change of flow.
We can optimize this by moving to a simple loop:
int res = 1;
for (int i = 1; i <= n; i++)
    res *= i;
This function requires no actual writes to the stack/physical memory as there are no function calls/jumps. As this function only involves a single multiply, it qualifies as a "short loop" on certain devices, whereby the loop is entirely handled in hardware.
Thanks to this feature, there are no change-of-flow penalties and no loop overhead either, so this effectively acts like a completely unrolled loop of multiplies (minus the memory cost).
Compared to the recursive routine, using the loop for 100 factorial saves approximately:
- 100 changes of flow (pipeline cycle penalties)
- 100+ pushes to the stack (100+ memory accesses).
For the above example, the savings from avoiding recursion can be estimated as follows.
The loop method's change-of-flow savings from avoiding pipeline disruptions depend on the pipeline length and on whether branch prediction is available in the core hardware. In the case of a 12-stage pipeline, refilling it could cost up to 12 cycles per branch. As branch target prediction is available on some processors, this penalty may be reduced significantly, but not eliminated. We can multiply the estimated stall penalty by the number of recursive calls (here, 100), which indicates the additional active core clock cycles, and thus active power consumption, incurred by recursion.
The cost of the recursion's 100+ individual stack accesses is also high, as even internal device memory can carry an initial access delay penalty. Because these stack accesses are not pipelined, the initial memory delay is multiplied by the number of recursive calls. Even if we assume the stack is stored in low-latency internal memory running at the core speed, initial latency on the order of 8 to 20 cycles could still be seen. A 10-cycle latency for initial access would not be a problem if subsequent accesses were pipelined (100 reads would incur a total core stall of only 10 cycles), but with recursion the accesses are not pipelined, giving 10 × 100 stalls, or 1000 additional core cycles of active clock-consuming power.
In the above example, removing recursion and moving to a loop reduces the total energy (power over time) consumed by the processor to complete the factorial function to less than half.
Reducing accuracy

Often programmers will over-calculate mathematical functions, using too much accuracy (too much precision), which can lead to more complicated programs requiring more functional units and more cycles.
If the signal-processing application can tolerate more noise, 16-bit integers could be used; using 32-bit integers instead can cost additional cycles for even a basic multiply. A 16-bit by 16-bit multiply can be completed in one cycle on most architectures, but a 32-bit by 32-bit multiply may require more. Such is the case for the SC3400 DSP core, which requires two cycles instead of one, so the programmer needlessly doubles the cycle time for the operation (inefficient processing and additional clock cycles during which the core consumes active dynamic power).
Low-power code sequences and data patterns
Another suggestion is to look at the specific instructions used for an operation or algorithm. The programmer may be able to perform exactly the same function with different instructions while saving power, though the analysis and work to do this is very time-consuming and detail-oriented.
Different instructions activate different functional units, and thus have different power requirements. Using this accurately requires the programmer to profile equivalent instruction sequences to understand the power trade-offs.
An obvious example is using a MAC when only the multiply functionality is needed. Less obvious comparisons, such as the power consumption difference between using a subtraction to clear a register versus the dedicated clear instruction, require the programmer to profile power consumption for each instruction, as we may not know how the hardware internally goes about clearing a register.
Rob Oshana has 30 years of experience in the software industry, primarily focused on embedded and real-time systems for the defense and semiconductor industries. He has BSEE, MSEE, MSCS, and MBA degrees and is a Senior Member of IEEE. Rob is a member of several Advisory Boards including the Embedded Systems group, where he is also an international speaker. He has over 200 presentations and publications in various technology fields and has written several books on embedded software technology. He is an adjunct professor at Southern Methodist University where he teaches graduate software engineering courses. He is a Distinguished Member of Technical Staff and Director of Global Software R&D for Digital Networking at Freescale Semiconductor.
Mark Kraeling is Product Manager at GE Transportation in Melbourne, Florida, where he is involved with advanced product development in real-time controls, wireless, and communications. He's developed embedded software for the automotive and transportation industries since the early 1990s. Mark has a BSEE from Rose-Hulman, an MBA from Johns Hopkins, and an MSE from Arizona State.
Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012, this article was excerpted from Software Engineering for Embedded Systems, by Robert Oshana and Mark Kraeling.