Optimizing embedded software for power efficiency: Part 2 – Minimizing hardware power - Embedded.com

Editor's note: In the second in a series on managing your embedded software design’s power requirements, the authors provide some tips on the various hardware features incorporated into a microprocessor or DSP and how to take advantage of them. Excerpted from Software engineering for embedded systems.

Data flow optimization focuses on working to minimize the power cost of utilizing different memories, buses, and peripherals where data can be stored or transmitted by taking advantage of relevant features and concepts. Algorithmic optimization refers to making changes in code to affect how the cores process data, such as how instructions or loops are handled.

Hardware optimization, as discussed here, focuses more on how to optimize clock control and power features provided in the microprocessor or peripheral circuits.

Hardware support
Low power modes. DSP applications normally work on tasks in packets, frames, or chunks. For example, in a media player, frames of video data may come in at 60 frames per second to be decoded, while the actual decoding work may take the processor orders of magnitude less than 1/60th of a second, giving us a chance to utilize sleep modes, shut down peripherals, and organize memory, all to reduce power consumption and maximize efficiency.

We must also keep in mind that the power-consumption profile varies based on application. For instance, two differing hand-held devices, an MP3 player and a cellular phone, will have two very different power profiles.

The cellular phone spends most of its time in an idle state, and when in a call is still not working at full capacity during the entire call duration as speech will commonly contain pauses which are long in terms of the processor’s clock cycles.

For both of these power profiles, software-enabled low-power modes (modes/features/controls) are used to save power, and the question for the programmer is how to use them efficiently. A quick note to the reader: different device documents may refer to the features discussed in this section, such as gating and scaling, in various ways, for example as low-power modes, power saving modes, or power controls. The most common modes available consist of power gating, clock gating, voltage scaling, and clock scaling.

Power gating. This uses a current switch to cut off a circuit from its power supply rails during standby mode, to eliminate static leakage when the circuit is not in use. Using power gating leads to a loss of state and data for a circuit, meaning that using this requires storing necessary context/state data in active memory. As embedded processors are moving more and more towards being full SoC solutions with many peripherals, some peripherals may be unnecessary for certain applications. Power gating may be available to completely shut off such unused peripherals in a system, and the power savings attained from power gating depend on the specific peripheral on the specific device in question.
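Because a power-gated block loses its state, software typically saves whatever configuration it needs before cutting power and reprograms the block afterwards. The following is a minimal host-runnable sketch of that pattern. The peripheral registers are simulated by a plain struct, and every name in it (`periph_regs_t`, `periph_power_down`, `periph_power_up`) is hypothetical rather than taken from any device header.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical peripheral register block, simulated in ordinary memory. */
typedef struct {
    uint32_t ctrl;
    uint32_t baud;
    uint32_t irq_mask;
} periph_regs_t;

/* Context copy kept in memory that stays powered. */
static periph_regs_t saved_ctx;
static bool ctx_valid = false;

/* Save state, then model the loss of state caused by power gating. */
void periph_power_down(periph_regs_t *regs)
{
    saved_ctx = *regs;              /* store context in active memory */
    ctx_valid = true;
    memset(regs, 0, sizeof *regs);  /* gated block returns in reset state */
}

/* Restore the saved state after power is reapplied. Returns false if
 * nothing was saved, in which case the block stays in its reset state. */
bool periph_power_up(periph_regs_t *regs)
{
    if (!ctx_valid) {
        return false;
    }
    *regs = saved_ctx;
    ctx_valid = false;
    return true;
}
```

On real hardware the save and restore would go through the peripheral's own registers, and the cost of both steps counts against the power saved by gating.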

It is important to note that in some cases, documentation will refer to powering down a peripheral via clock gating, which is different from power gating. It may be possible to gate a peripheral by connecting the power supply of a certain block to ground, depending on device requirements and interdependence on a power supply line. This is possible in software in certain situations, such as when board/system-level power is controlled by an on-board IC, which can be programmed and updated via an I2C bus interface. As an example, the MSC8156 DSP (Figure 13.5) has this option for the MAPLE DSP baseband accelerator peripheral and a portion of M3 memory.

Figure 13.5: 8156 six-core DSP processor

Clock gating. As the name implies, this shuts down clocks to a circuit or portion of a clock tree in a device. As dynamic power is consumed during state change triggered by clock toggling (as we discussed in the introductory portion of this chapter), clock gating enables the programmer to cut dynamic power with a single instruction (or a few). Clocking of a processor core like a DSP is generally separated into trees stemming from a main clock PLL into various clock domains as required by design for core, memories, and peripherals, and DSPs generally enable levels of clock gating in order to customize a power-saving solution.

Examples of low-power modes
The Freescale MSC815x. These DSPs provide various levels of clock gating in the core subsystem and peripheral areas. Gating clocks to a core may be done in the form of STOP and WAIT instructions. STOP mode gates clocks to the DSP core and the entire core subsystem (L1 and L2 caches, M2 memory, memory management, debug and profile unit) aside from internal logic used for waking from the STOP state.

In order to safely enter STOP mode, as one may imagine, care must be taken to ensure accesses to memory and cache are all complete, and no fetches/prefetches are under way.

The recommended process is:

  • Terminate any open L2 prefetch activity.
  • Stop all internal and external accesses to M2/L2 memory.
  • Close the subsystem slave port window (peripheral access path to M2 memory) by writing to the core subsystem slave port general configuration register.
  • Verify slave port is closed by reading the register, and also testing access to the slave port (at this point, any access to the core’s slave port will generate an interrupt). Ensure STOP ACK bit is asserted in General Status Register to show subsystem is in STOP state.
  • Enter STOP mode.

STOP state can be exited by initiating an interrupt. There are other ways to exit from STOP state, including a reset or debug assertion from external signals.
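The recommended entry sequence amounts to a series of preconditions that must all hold before the STOP instruction is issued. A minimal sketch of that check, run against a simulated subsystem, might look as follows. The struct fields and function names are illustrative assumptions and do not correspond to real MSC815x register or API names.

```c
#include <stdbool.h>

/* Simulated core-subsystem state (hypothetical fields). */
typedef struct {
    bool l2_prefetch_active;   /* an L2 prefetch is still open */
    bool m2_l2_access_active;  /* internal/external M2 or L2 access pending */
    bool slave_port_open;      /* slave port window not yet closed */
    bool stop_ack;             /* STOP ACK bit seen in the status register */
    bool in_stop;              /* set once STOP has been entered */
} subsystem_t;

typedef enum {
    STOP_READY = 0,
    STOP_BLOCKED_PREFETCH,
    STOP_BLOCKED_MEM_ACCESS,
    STOP_BLOCKED_SLAVE_PORT,
    STOP_BLOCKED_NO_ACK
} stop_check_t;

/* Check the steps in the recommended order; report the first one that
 * has not been completed yet. */
stop_check_t stop_precheck(const subsystem_t *s)
{
    if (s->l2_prefetch_active)  return STOP_BLOCKED_PREFETCH;
    if (s->m2_l2_access_active) return STOP_BLOCKED_MEM_ACCESS;
    if (s->slave_port_open)     return STOP_BLOCKED_SLAVE_PORT;
    if (!s->stop_ack)           return STOP_BLOCKED_NO_ACK;
    return STOP_READY;
}

/* Enter STOP only when every precondition holds. */
bool try_enter_stop(subsystem_t *s)
{
    if (stop_precheck(s) != STOP_READY) {
        return false;
    }
    s->in_stop = true;  /* on hardware: execute the STOP instruction here */
    return true;
}
```

Returning the first unmet step, rather than a plain pass/fail, makes it easier to see during bring-up which part of the sequence is holding the core awake.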

The WAIT state gates clocks to the core and some of the core subsystem aside from the interrupt controller, debug and profile unit, timer, and M2 memory, which enables faster entering and exiting from WAIT state, but at the cost of greater power consumption. To enter WAIT state, the programmer may simply use the WAIT instruction for a core. Exiting WAIT, like STOP, may also be done via an interrupt.

A particularly nice feature of these low-power states is that both STOP and WAIT mode can be exited via either an enabled or a disabled interrupt. Wake-up via an enabled interrupt follows the standard interrupt handling procedure: the core takes the interrupt, does a full context switch, and then the program counter jumps to the interrupt service routine before returning to the instruction following the segment of code that executed the WAIT (or STOP) instruction.

This requires a comparatively large cycle overhead, which is where disabled interrupt waking becomes quite convenient. When using a disabled interrupt to exit from either WAIT or STOP state, the interrupt signals the core using an interrupt priority that is not “enabled” in terms of the core’s global interrupt priority level (IPL), and when the core wakes it resumes execution where it left off without executing a context switch or any ISR. An example using a disabled interrupt for waking the MSC8156 is provided at the end of this section.
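The difference between the two wake-up paths can be modelled on a host in a few lines. This is a simplified sketch, not the device's actual interrupt logic: the `core_t` type, the function names, and the rule that a priority must be strictly greater than the IPL to count as enabled are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Simplified model of a core (hypothetical). */
typedef struct {
    int  ipl;       /* priorities <= ipl are treated as "disabled" */
    bool asleep;    /* true while in WAIT or STOP */
    int  isr_runs;  /* how many times an ISR was dispatched */
} core_t;

/* Model executing the WAIT or STOP instruction. */
void core_sleep(core_t *c)
{
    c->asleep = true;
}

/* Deliver an interrupt. Any interrupt wakes a sleeping core; only one
 * whose priority exceeds the IPL also pays for a context switch and an
 * ISR. Pending state for masked interrupts is not modelled.
 * Returns true if an ISR ran. */
bool core_interrupt(core_t *c, int priority)
{
    c->asleep = false;
    if (priority > c->ipl) {
        c->isr_runs++;
        return true;
    }
    return false;
}
```

In this model a priority-3 interrupt arriving while the IPL is 5 wakes the core and lets it continue with the next instruction, whereas a priority-7 interrupt wakes it and runs an ISR first.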

Clock gating to peripherals is also enabled, where the user may gate specific peripherals individually as needed. This is available for the MSC8156’s serial interface, Ethernet controller (QE), DSP accelerators (MAPLE), and DDR. As with STOP mode, when gating clocks to any of these interfaces, the programmer must ensure that all accesses are completed beforehand. Then, via the System Clock Control register, clocks to each of these peripherals may be gated. In order to come out of the clock gated modes, a Power on Reset is required, so this is not something that can be done and undone on the fly in a function, but rather a setting that is decided at system configuration time.
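As a rough illustration of gating peripheral clocks only after their accesses have drained, the sketch below operates on a simulated register block. The bit positions, the `busy_mask` field, and the function name are hypothetical; the real layout of the System Clock Control register must be taken from the device reference manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical gate bits, one per peripheral. */
#define CLK_GATE_SERIAL (1u << 0)
#define CLK_GATE_QE     (1u << 1)
#define CLK_GATE_MAPLE  (1u << 2)
#define CLK_GATE_DDR    (1u << 3)

/* Simulated system registers (stand-ins, not real addresses). */
typedef struct {
    uint32_t clock_ctrl;  /* stand-in for the clock control register */
    uint32_t busy_mask;   /* bit set while that block has accesses in flight */
} sys_regs_t;

/* Gate the clocks selected by mask, but only if none of those blocks
 * still has outstanding accesses. Returns true when the gate bits were
 * written. */
bool gate_peripheral_clocks(sys_regs_t *regs, uint32_t mask)
{
    if ((regs->busy_mask & mask) != 0u) {
        return false;
    }
    regs->clock_ctrl |= mask;
    return true;
}
```

There is deliberately no matching "ungate" function: as noted above, leaving these clock-gated modes requires a power-on reset, so the call belongs in system configuration code. On a real target the register would be accessed through a `volatile` pointer.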

Additionally, partial clock gating is possible on the high-speed serial interface components (SERDES, OCN DMA, SRIO, RMU, PCI Express) and DDR so that they may be temporarily put in a “doze state” in order to save power, but still maintain the functionality of providing an acknowledge to accesses (in order to prevent internal or external bus lockup when accessed by external logic).

Texas Instruments C6000. Another popular DSP family on the market is the C6000 series DSP from Texas Instruments (TI). They provide a few levels of clock gating, depending on the generation of C6000. For example, the previous generation C67x floating-point DSP has low-power modes called “power-down modes”. These modes include PD1, PD2, PD3, and “peripheral power down”, each of which gates clocking to various components in the silicon.

For example, PD1 mode gates clocks to the C67x CPU (processor core, data registers, control registers, and everything else within the core aside from the interrupt controller). The C67x can wake up from PD1 via an interrupt into the core. Entering power-down mode PD1 (or PD2 / PD3) for the C67x is done via a register write (to CSR). The cost of entering PD1 state is approximately 9 clock cycles plus the cost of accessing the CSR register. As this power-down state only affects the core (and not cache memories), it is not comparable to Freescale’s STOP or WAIT state.

The two deeper levels of power down, PD2 and PD3, effectively gate clocks to the entire device (all blocks which use an internal clock: internal peripherals, the CPU, cache, etc.). The only way to wake up from PD2 and PD3 clock gating is via a reset, so PD2 and PD3 would not be very convenient or efficient to use mid-application.

Clock and voltage control
Some devices have the ability to scale voltage or clock, which may help optimize the power scheme of a device/application. Voltage scaling, as the name implies, is the process of lowering or raising the supply voltage. Earlier, in Part 1 of this series, VRMs were introduced as one method. The main purpose of a VRM (voltage regulator module) is to control the power/voltage supply to a device. Using a VRM, voltage scaling may be done through monitoring and updating voltage ID (VID) parameters.

In general, as voltage is lowered, frequency/processor speed is sacrificed, so generally voltage would be lowered when the demand from a DSP core or a certain peripheral is reduced.

The TI C6000 devices provide a flavor of voltage scaling called SmartReflex, which enables automatic voltage scaling through a pin interface which provides VID to a VRM. As the pin interface is internally managed, the software engineer does not have much influence over this, so we will not cover any programming examples for this.

Clock control is available in many processors, which allows the changing of the values of various PLLs in runtime. In some cases, updating the internal PLLs requires relocking the PLLs, where some clocks in the system may be stopped, and this must be followed by a soft reset (reset of the internal cores). Because of this inherent latency, clock scaling is not very feasible during normal heavy operation, but may be considered if a processor’s requirements over a long period of time are reduced (such as during times of low call volume during the night for processors on a wireless base station).

When considering clock scaling, we must keep the following in mind: during normal operation, running at a lower clock allows for lower dynamic power consumption, assuming clock and power gating are never used. In practice, running a processor at a higher frequency allows for more “free” cycles, which, as previously noted, can be used to hold the device in a low-power/sleep mode — thus offsetting the benefits of such clock scaling.

Additionally, updating the clock for custom cases is time-intensive, and for some processors, not an option at all — meaning clock frequency has to be decided at device reset/power-on time, so the general rule of thumb is to enable enough clock cycles with some additional headroom for the real-time application being run, and to utilize other power optimization techniques. Determining the amount of headroom varies from processor to processor and application to application — at which point it makes sense to profile your application in order to understand the number of cycles required for a packet/frame, and the core utilization during this time period.

Once this is understood, measuring the power consumption for such a profile can be done, as demonstrated earlier in this chapter in the profiling power section. Measure the average power consumption at your main frequency options (for example, 800 MHz and 1 GHz), and then average in idle power over the headroom slots in order to get a head-to-head comparison of the best-case power consumption.
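The head-to-head comparison is a time-weighted average of active and idle power. A small sketch of the arithmetic, assuming the core is either fully active or in a low-power idle state during each frame period (the function names are illustrative):

```c
/* Fraction of time the core is busy for a workload that needs a fixed
 * number of cycles per second at the given clock frequency. */
double utilization_at(double cycles_per_second, double clock_hz)
{
    return cycles_per_second / clock_hz;
}

/* Average power: active for `utilization` of the time, idle in a
 * low-power mode for the remaining headroom. */
double average_power_mw(double active_mw, double idle_mw, double utilization)
{
    return active_mw * utilization + idle_mw * (1.0 - utilization);
}
```

With purely invented figures (a workload of 600 million cycles per second, 800 mW active at 800 MHz, 1000 mW active at 1 GHz, and 100 mW idle in both cases), utilization is 75% and 60% respectively, and the averages come to about 625 mW and 640 mW. Real measurements may favor either option, which is why both should be profiled.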

In summary, when using these techniques consider available block functionality when in low-power mode:

  • When in low-power modes, we have to remember that certain peripherals will not be available to external peripherals, and peripheral buses may also be affected. As noted earlier in this section, devices may take care of this, but this is not always the case. If power gating a block, special care must be taken regarding shared external buses, clocks, and pins.
  • Additionally, memory states and validity of data must be considered. We will cover this when discussing cache and DDR in the next section.

Consider the overhead of entering and exiting low-power modes:

  • When entering and exiting low-power modes, in addition to overall power savings, the programmer must ensure the cycle overhead of actually entering and exiting the low power mode does not break real time constraints.
  • Cycle overhead may also be affected by the potential difference in initiating a low power mode by register access as opposed to by direct core instructions.
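One way to apply the overhead consideration is to pick the deepest low-power mode whose combined entry and exit cost still fits in the idle gap before the next deadline. The sketch below assumes that the overhead of each mode has already been measured in core cycles and that STOP costs more than WAIT; the names and the selection rule are illustrative, not taken from any vendor API.

```c
#include <stdint.h>

typedef enum { MODE_RUN, MODE_WAIT, MODE_STOP } power_mode_t;

/* Choose the deepest mode whose enter-plus-exit overhead (in cycles)
 * is smaller than the idle gap. Staying in MODE_RUN means the gap is
 * too short for either mode without risking the real-time deadline. */
power_mode_t choose_mode(uint64_t idle_cycles,
                         uint64_t wait_overhead_cycles,
                         uint64_t stop_overhead_cycles)
{
    if (idle_cycles > stop_overhead_cycles) {
        return MODE_STOP;
    }
    if (idle_cycles > wait_overhead_cycles) {
        return MODE_WAIT;
    }
    return MODE_RUN;
}
```

In practice some safety margin would be added to each overhead figure, and the figures themselves depend on whether the mode is entered by a core instruction or by a register access.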

Low-power example
To demonstrate low power usage, we will refer to a Motion JPEG (MJPEG) application. In this application, raw image frames are sent from a PC to an embedded DSP over Ethernet. Each Ethernet packet contains 1 block of an image frame. A full raw QVGA image uses approximately 396 blocks plus a header. The DSP encodes the image in real time (adjustable from 1 to 301 frames per second), and sends the encoded Motion JPEG video back over Ethernet to be played on a demo GUI in the PC. The flow and a screenshot of this GUI are shown in Figure 13.6.

Figure 13.6: DSP operating system Motion JPEG application

The GUI will display not only the encoded JPEG image, but also the core utilization (as a percentage of the maximum core cycles available).

For this application, we need to understand how many cycles encoding a frame of JPEG consumes. Using this we can determine the maximum frame rate we can use and, in parallel, also determine the maximum down time we have for low-power mode usage.

If we are close to the maximum core utilization for the real-time application, then using low-power modes may not make sense (may break real-time constraints).

As noted in previous chapters, we could simply profile the application to see how many cycles are actually spent per image frame, but this is already handled in the MJPEG demo’s code using the core cycle counters in the OCE (on-chip emulator). The OCE is a hardware block on the DSP that the profiler utilizes to get core cycle counts for use in code profiling.

The MJPEG code in this case counts the number of cycles a core spends doing actual work (handling an incoming Ethernet interrupt, dequeueing data, encoding a block of data into JPEG format, enqueueing/sending data back over Ethernet).

The number of core cycles required to process a single block encode of data (and supporting background data movement) is measured to be of the order of 13,000 cycles. For a full JPEG image (approximately 396 image blocks and Ethernet packets), this is approximately 5 million cycles.

So 1 JPEG frame a second would work out to be 0.5% of a core’s potential processing power, assuming a 1 GHz core that is handling all Ethernet I/O, interrupt context switches, etc.
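The arithmetic behind these figures is simple enough to capture in two helper functions. This is only a sketch of the calculation, with illustrative function names; the block count and cycle count are the approximate values quoted above.

```c
#include <stdint.h>

/* Total core cycles needed to encode one image frame. */
uint64_t cycles_per_frame(uint32_t blocks_per_frame, uint32_t cycles_per_block)
{
    return (uint64_t)blocks_per_frame * cycles_per_block;
}

/* Fraction of a core consumed at a given frame rate and clock. */
double core_utilization(uint64_t frame_cycles, double frames_per_second,
                        double clock_hz)
{
    return (double)frame_cycles * frames_per_second / clock_hz;
}
```

With 396 blocks at 13,000 cycles each, this gives 5,148,000 cycles per frame, or about 0.5% of a 1 GHz core at one frame per second. At an assumed 30 frames per second, the same calculation gives roughly 15% utilization.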

In this example the DSP has up to six cores, and only one core would have to manage Ethernet I/O; in a full multicore system, utilization per core drops to a range of 3 to 7%. A master core acts as the manager of the system, managing Ethernet I/O, intercore communication, and JPEG encoding, while the other slave cores are programmed to solely focus on encoding JPEG frames. Because of this intercore communication and management, the drop in cycle consumption from one core to four or six is not linear.

Based on cycle counts from the OCE, we can run a single core, which is put in a sleep state for 85% of the time, or a multicore system which uses sleep state up to 95% of the time.

This application also uses only a portion of the SoC peripherals (Ethernet, JTAG, a single DDR, and M3 memory). So we can save power by gating the full HSSI System (Serial Rapid IO, PCI Express), the MAPLE accelerator, and the second DDR controller. Additionally, for our GUI demo, we are only showing four cores, so we can gate cores 4 and 5 without affecting this demo as well.

Based on the above, and what we have discussed in this section, here is the plan we want to follow:

At application start up:

  • Clock gate the unused MAPLE accelerator block (MAPLE is described later in this chapter). NOTES:
      • MAPLE power pins share a power supply with core voltage. If the power supply to MAPLE were not shared, we could completely gate power. Due to shared pins on the development board, the most effective choice we have is to gate the MAPLE clock.
      • MAPLE automatically goes into a doze state, which gates part of the clocks to the block when it is not in use. Because of this, power savings from entirely gating MAPLE may not be massive.
  • Clock gate the unused HSSI (high-speed serial interface). NOTES:
      • We could also put the HSSI into a doze state, but this gates only part of the clocks. Since we will not be using any portion of these peripherals, complete clock gating is more power efficient.
  • Clock gate the unused second DDR controller. NOTES:
      • When using VTB, the OS places buffer space for VTB in the second DDR memory, so we need to be sure that this is not needed.

During application runtime:

  • At runtime, QE (Ethernet Controller), DDR, interconnect, and cores 1 to 4 will be active. Things we must consider for these components include:
      • The Ethernet Controller cannot be shut down or put into a low power state, as this is the block that receives new packets (JPEG blocks) to encode. Interrupts from the Ethernet controller can be used to wake our master core from low-power mode.
      • Active core low-power modes:
          • WAIT mode enables core power savings, while allowing the core to be woken up in just a few cycles by using a disabled interrupt to signal exit from WAIT.
          • STOP mode enables greater core savings by shutting down more of the subsystem than WAIT (including M2), but requires slightly more time to wake due to more hardware being re-enabled. If data is coming in at high rates, and the wake time is too long, we could get an overflow condition, where packets are lost. This is unlikely here due to the required data rate of the application.
      • The first DDR contains sections of program code and data, including parts of the Ethernet handling code. (This can be quickly checked and verified by looking at the program’s .map file.) Because the Ethernet controller will be waking the master core from WAIT state, and the first thing the core will need to do out of this state is to run the Ethernet handler, we will not put DDR0 to sleep.

We can use the main background routine for the application to apply these changes without interfering with the RTOS. This code segment is shown in Figure 13.7 with power-down-related code.

Note that the clock gating must be done by only one core as these registers are system level and access is shared by all cores.

This code example demonstrates how a programmer using the OS can make use of the interrupt APIs in order to recover from STOP or WAIT state without actually requiring a context switch. In the MJPEG player, as noted above, raw image blocks are received via Ethernet (with interrupts), and then shared via shared queues (with interrupts). The master core will have to use context switching to read new Ethernet frames here, but slave cores only need to wake up and go to the MessageHandler function.

Figure 13.7: Code segment with power-down-related code

We take advantage of this fact by enabling only higher-priority interrupts before going to sleep:

osHwiSwiftDisable(); osHwiEnable(OS_HWI_PRIORITY10);

Then when a slave core is asleep, if a new queue message arrives on an interrupt, the core will be woken up (with no context switch), and standard interrupt priority levels will be restored. The core will then go and manage the new message without context switch overhead by calling the MessageHandler() function.

In order to verify our power savings, we will take a baseline power reading before optimizing across the relevant power supplies, and then measure the incremental power savings of each step.

The processor board has power for cores, accelerators, HSSI, and M3 memory connected to the same power supply, simplifying data collection. Since these supplies and DDR are the only blocks we are optimizing, we shall measure improvement based on these supplies alone.

Figure 13.8 provides a visual on the relative power consumed by the relevant power supplies (1V: core, M3, HSSI, MAPLE accelerators, and DDR) across the power-down steps used above. Note that actual power numbers are not provided to avoid any potential non-disclosure issues.

Figure 13.8: Power consumption savings in PD modes

The first two bars provide reference points, indicating the power consumption for these supplies using a standard FIR filter in a loop and the power consumption when the cores are held in debug state (not performing any instructions, but not in a low-power mode). We can see that there was nearly a 50% reduction in power consumption across the relevant supplies for the Motion JPEG demo with the steps laid out above, with each step providing approximately a 5% reduction in power, with the exception of the STOP and WAIT power modes, which are closer to 15 to 20% savings.

One thing to keep in mind is that, while the MJPEG demo is the perfect example to demonstrate low-power modes, it is not highly core-intensive, so as we progress through different optimization techniques, we will be using other examples as appropriate.

Part 1: Measuring power
Part 3: Optimizing data flow and memory
Part 4: Peripheral and algorithmic optimization

Rob Oshana has 30 years of experience in the software industry, primarily focused on embedded and real-time systems for the defense and semiconductor industries. He has BSEE, MSEE, MSCS, and MBA degrees and is a Senior Member of IEEE. Rob is a member of several Advisory Boards including the Embedded Systems group, where he is also an international speaker. He has over 200 presentations and publications in various technology fields and has written several books on embedded software technology. He is an adjunct professor at Southern Methodist University where he teaches graduate software engineering courses. He is a Distinguished Member of Technical Staff and Director of Global Software R&D for Digital Networking at Freescale Semiconductor.

Mark Kraeling is Product Manager at GE Transportation in Melbourne, Florida, where he is involved with advanced product development in real-time controls, wireless, and communications. He’s developed embedded software for the automotive and transportation industries since the early 1990s. Mark has a BSEE from Rose-Hulman, an MBA from Johns Hopkins, and an MSE from Arizona State.

Used with permission from Morgan Kaufmann, a division of Elsevier, Copyright 2012, this article was excerpted from Software engineering for embedded systems, by Robert Oshana and Mark Kraeling.
