Optimizing embedded software for power efficiency: Part 4 – Peripheral and algorithmic optimization
Editor's note: In the final part in a series on how to manage your embedded software design’s power requirements, the authors discuss the how operations of the system’s peripheral functions can contribute to power savings, concluding with some techniques for algorithmic optimization.
When considering the impact of reading and writing of data has on an embedded system’s power utilization we cannot just think about memory access: we need to pull data into and out of the device as well. Here we will we will look at how to minimize power consumption in commonly used embedded processor (I/O) peripherals. Later we will talk about the various algorithmic techniques for power management.
On the first topic, things to consider include the peripheral’s burst size, speed grade, transfer width, and general communication modes. The main standard forms of peripheral communication for embedded processors include DMA (direct memory access), SRIO (serial rapid I/O), Ethernet, PCI Express, and RF antenna interfaces. I2C and UART are also commonly used, though mostly for initialization and debug purposes.
The fact that communication interfaces usually require their own PLLs/clocks increases the individual power consumption impact. The higher-clocked peripherals that we need to consider as the main power consumers are the DMA, SRIO, Ethernet, and PCI Express. Clock gating and peripheral low-power modes for these peripherals were discussed in the low-power modes section of this chapter, so this section will talk about how to optimize actual usage.
Although each protocol is different for the I/O peripherals and the internal DMA, they all share the fact that they are used to read/write data. As such, one basic goal is to maximize the throughput while the peripheral is active in order to maximize efficiency and the time the peripheral/device can be in a low-power state, thus minimizing the active clock times.
The most basic way to do this is to increase transfer/burst size. For DMA, the programmer has control over burst size and transfer size in addition to the start/end address (and can follow the alignment and memory accessing rules we discussed in earlier subsections of data path optimization). Using the DMA, the programmer can decide not only the alignment, but also the transfer “shape”, for lack of a better word. What this means is that using the DMA, the programmer can transfer blocks in the form of two-dimensional, three- dimensional, and four-dimensional data chunks, thus transferring data types specific to specific applications on the alignment chosen by the programmer without spending cycles transferring unnecessary data. Figure 13.12 demonstrates the data structure of a three- dimensional DMA.
The user programs the start address for data, the length for the first dimension, the offset for the second dimension, the number of transfers, followed by the offset for the third dimension and the number of transfers. At the end of the all transfers, the programmer may also program the DMA to interrupt the core to signal data transfer completion. Having the DMA intelligently moving the data in the format and organization needed by the user’s application helps optimize data flow and core processing by avoiding the need for the core to reorganize data or alter algorithms that are optimized for data in a specific format. This also simplifies the maintaining of certain alignments as the programmer can decide where each dimension of the data structure starts.
Other high-speed peripherals will generally also use a DMA, whether it be the system DMA or the peripheral’s own private DMA for data passing. In the case of the MSC8156 discussed earlier, the SRIO, PCI Express, and Ethernet controllers all have their own DMAs separate from the system DMA for data transfers. The basics still apply here: we want data transfers to be long (long bursts), we want bus accesses to be aligned, and, additionally, one more thing we want is optimal access to the system bus! We will discuss system bus optimization later in this section.
DMA of data vs. CPU
While on the topic of DMA, we need to consider whether the core should move data from internal core memory or whether a DMA should be utilized in order to save power. As the DMA hardware is optimized solely for the purpose of moving data, it will move data while consuming less power than the core (which is generally running at much higher frequencies than the DMA). As the core runs at such a higher frequency, not intended solely for data movement, etc., the core utilizes more dynamic power while incurring heavy stall penalties when accessing external memory.
As some external memory access and stalls are incurred for writing to peripheral registers when setting up the DMA, there is a point where data accesses are too small or infrequent to justify DMA usage. In general, when moving larger chunks of data, or data in a predictable manner, for the purpose of optimizing power consumption (and core performance) DMA should be utilized for maximum power savings and application efficiency.
For transactions and I/O that are not large enough to justify DMA we can consider caching, as this assists with stalls and requires virtually no core intervention. Generally speaking, using the cache is much simpler than DMA, so this is the generally accepted solution for unpredictable data I/O, while DMA should be used for larger memory transfers. Due to the overhead of programming the DMA, and the unique properties of data per application, the trade-off between power savings, performance, and program complexity from DMA to cache has to be done on a case-by-case basis. Peripherals with their own DMA generally require the programmer to use that DMA for peripheral interaction, which is a good habit to force the programmer into, as we have just discussed.
Just as the DMA peripheral is optimized for data movement and can do so more efficiently with less power consumption than the high-frequency embedded core, so are other peripherals acting as coprocessors able to perform special functions more efficiently than the core. In the case of the MSC8156, an on-board baseband coprocessor (accelerator) includes hardware for fast Fourier transforms, discrete Fourier transforms, and turbo Viterbi. When a chain of transforms can be offloaded onto the accelerator, depending on the cost of transferring data and transform sizes, the system is able to save power and cycles by offloading the core and having the coprocessor do this work as the coprocessor is running at a much lower frequency than the core, has fewer processing elements aimed at a single function, and also has automatic lower-power modes that are used when the coprocessor is not processing transforms, etc.
System bus configuration
System bus stalls due to lack of priority on the bus can cause a peripheral to actively wait unnecessarily for extra cycles when not set up properly. These extra active wait cycles mean more wasted power. Generally embedded processors will have system bus configuration registers which allow the programmer to configure the priority and arbitration per bus initiator port. In the case of the MSC8156, the system bus (called the CLASS) has 11 initiator ports and eight target ports (memory and register space). When the programmer understands the I/O needs of the application, it is possible to set up priority appropriately for the initiators that need the extra bandwidth on the bus so they can access memory and register space with minimal stalls.
There is not much of a trick to this; simply set priority based on I/O usage. Some devices such as the MSC815x series DSPs provide bus profiling tools which enable the programmer to count the number of accesses per initiator port to each target. This allows the programmer to see where congestion and bottlenecks are occurring in order to appropriately configure and tune the bus. Profiling tools also allow the programmer to see how many “priority upgrades” are needed per port. This means that the programmer can temporarily assign test priorities to each port, and if some ports constantly require priority upgrades the programmer can decide to set the starting priority of these ports one level up and re-profile.
Peripheral speed grades and bus width
Like with the system bus access, the peripheral’s external interface should be set up according to the actual needs of the system. The catch-22 of I/O peripherals is that some peripherals require being powered on all the time (so minimal to no use of low-power modes is available). If a communication port such as SRIO is dedicated to receiving incoming blocks of data to process, when no data is coming in clocks and low-power modes for the SRIO port are not an option. As such, there is a balancing game to be played here.
In testing software and power consumption, we found on the MSC8156 that running four lanes of SRIO at 3.125 GHz with 40% utilization (B4 Gbps of data) consumes a comparable amount of power, or even less, to running four lanes of SRIO at 2.5 GHz with 50% utilization (the same data throughput). So the user needs to test various cases or make use of the device manufacturer’s power calculator in order to make an informed decision. In a case like this, peripherals which have an auto-idle feature should make use of the higher- speed bus in order to maximize sleep times.
SRIO, PCI Express, Ethernet over SGMII, and some antenna interfaces make use of the same serial I/O hardware, so similar care should be taken here. All could be required to be held in an active mode as a form of “device wake” or signaling to the DSP core, meaning they may be restricted from going into sleep modes. In the case of antenna signaling, this is especially detrimental as the active antenna’s RF interface has to constantly consume power to emit signal. If possible, it is ideal to use an alternative method for waking the core in order to enable idle and sleep modes on the antenna.
Peripheral to core communication
When considering device waking, and general peripheral to core I/O, we have to consider how the peripheral interacts with the core processors. How does the core know that data is available? How often is the core notified that data is available? How does the core know when to send data over the peripheral? There are three main methods for managing this: polling, time-based processing, and interrupt processing.
This is by far the least efficient method of core-peripheral interaction as it has the core constantly awake and burning through high-frequency clock cycles (consuming active current) just to see if data is ready. The only advantage of using this method happens when the programmer is not concerned about power consumption. In this case, polling enables the core to avoid the context switches that occur during interrupt processing, thus saving some cycles in order to access data faster. Generally this is only used for testing maximum peripheral bandwidth as opposed to being used in a real application.
This works on the assumption that data will always be available at a certain interval. For example, if a processor is processing a GSM voice codec (AMR, EFR, HR, etc.) the core will know that samples will be arriving every 20 ms, so the core can look for the new audio samples on this time basis and not poll. This process allows the core to sleep, and use a timer for wake functionality, followed by performing data processing. The down side of this is the complexity and inflexibility of this model: set-up and synchronization require a lot of effort on the programmer’s side, and the same effect can be reached using simple interrupt processing.
The final core-to-peripheral communication mechanism is also the most commonly used one as it allows the benefits of the time-based processing without the complicated software architecture. We also briefly discussed using interrupt processing in the low-power modes section as a method for waking the core from sleep states: when new data samples and packets come in for processing, the core is interrupted by the peripheral (and can be woken if in sleep state) to start processing new data. The peripheral can also be used to interrupt the core when the peripheral is ready to transfer new data, so that the core does not have to constantly poll a heavily loaded peripheral to see when data is ready to send.
Power consumption results for polling vs. interrupt processing were shown in Figure 13.8 when comparing the baseline MJPEG vs. using WAIT for PD and using STOP for PD modes. When not using WAIT and STOP modes, the application would constantly check for new buffers without taking advantage of massive idle times in the application.
Of the three main areas of power optimization discussed here, algorithmic optimization requires the most work for a given amount of power savings. Algorithmic optimization includes optimization at the core application level, code structuring, data structuring (in some cases, this could be considered as data path optimization), data manipulation, and optimizing instruction selection.
Compiler optimization levels
In the data path section, we briefly discussed how the compiler can be used to optimize code for minimal size. The compiler may also be used to optimize code for maximum performance (utilizing the maximum number of processing units per cycle and minimizing the amount of time code is running). A key question is whether optimizing for performance will reduce power consumption. In general increasing the number of processing units will increase the power consumed per cycle, but the total power to perform a function over time will reduce as the number of cycles to perform the function is reduced. The question of when to optimize for performance versus code size generally still fits with the 80/20 rule (80% of cycle time is spent in 20% of the code), so, as mentioned in the data path section, the general rule here is to optimize the cycle-hungry (20%) portion of code for
performance, while focusing on minimizing code size for the rest. Fine tuning this is the job of the programmer and will require power measurement (as discussed earlier). The rest of this section will cover specific algorithmic optimizations, some of which may be performed by the performance optimizer in the compiler.
Instruction packing was included in the data path optimization section above, but may also be listed as an algorithmic optimization as it involves not only how memory is accessed, but also how code is organized.
Loop unrolling revisited
We briefly discussed using altering loops in code in order to optimize cache utilization before. As we discussed earlier, another method for optimizing both performance and power in embedded processors is via loop unrolling. This method effectively partially unravels a loop, as shown in the code snippets below:
for (i = 0; i < 100; i = i + 1)
for (k = 0; k < 10000; k = k + 1)
a[i] = 10 * b[k];
Loop unrolled by 4x:
for (i = 0; i < 100; i = i + 4)
for (k = 0; k < 10000; k = k + 4)
a[i] = 10 * b[k];
a[i + 1] = 10 * b[k + 1];
a[i + 2] = 10 * b[k + 2];
a[i + 3] = 10 * b[k + 3];
Unrolling code in this manner enables the compiler to make use of four MACs (multiply- accumulates) in each loop iteration instead of just one, thus increasing processing parallelization and code efficiency (more processing per cycle means more idle cycles available for sleep and low-power modes). In the above case, we increase the parallelization of the loop by four times, so we perform the same amount of MACs in 1 4 the cycle time, thus the effective active clock time needed for this code is reduced by 43. Measuring the power savings using the MSC8156, we find that the above example optimization (saving 25% cycle time by utilizing four MACs per cycle instead of one) enables the core to have a B48% total power saving during the time this routine is executed. Completely unrolling loops is not advisable as it is counterproductive to the code size minimization efforts we discussed in the data path section, which would lead to extra memory accesses and the possibility of increased cache miss penalties.