The rule of thumb in embedded system design has been that adding hardware increases power demands. The careful use of hardware accelerators, however, inverts the rule: adding hardware can reduce power.
By analyzing algorithms and implementing appropriate accelerators in programmable logic, developers can increase a design's performance while reducing power consumption in an embedded computing system.
Test results show that accelerators extend tradeoff options from as much as a 200-fold performance improvement at the same power to the same performance at a 90% power reduction.
Programmable logic has, somewhat undeservedly, maintained a reputation from its early history as a power-hungry approach to logic design. The rules of thumb have been that power consumption in an integrated circuit is roughly proportional to the chip's area for a given process technology, and that a design implemented in programmable logic tends to be larger than one implemented in hard-wired logic. But these two factors, although suggestive, are misleading.
Far more significant than the area-related power dependency is the frequency-related power dependency of an integrated circuit. Because CMOS circuits draw most of their current when transistors switch states, the frequency at which a circuit operates has a much greater impact on power consumption than simple chip size.
The higher the frequency, the greater the power demand. This opens the possibility that designers can reduce chip power consumption by adding circuitry, if the result of adding hardware is a significant reduction in clock speed.
For years, embedded processors have relied on custom hardware functions to accelerate common algorithms such as graphics or signal processing, accomplishing more work per clock cycle. While this approach increases system performance, it does not reduce the system clock or dynamic power consumption. If hardware can be applied to accelerate software algorithms AND reduce the clock frequency, power can be saved while still meeting system performance requirements.
Not all functions are equally well suited to trading circuits for frequency, however. Sequential processes, where one step must be completed before the next begins, typically see little benefit from added circuitry.
Functions that can operate in parallel, on the other hand, can run much faster when hardware is available to execute several steps simultaneously. This translates into greater performance for a given clock speed, but also into a lower clock speed for a given performance level. Thus, the addition of hardware to a chip design can lower power demands while maintaining performance.
To demonstrate the types of power savings that designers can achieve, a low-cost FPGA-based design example was developed using a 50 MHz Altera EP3C25F324 with 25K logic elements (LEs), 66 M9K memory blocks (0.6 Mbits), 16 18×18 multiplier blocks, and four PLLs. The design executed the Mandelbrot algorithm for calculating fractals, using as its baseline the Nios II embedded processor.
Despite the relatively small size of the FPGA used, the microprocessor occupied only a portion of the FPGA's resources. This left room for implementing additional hardware to accelerate the algorithm's execution (Figure 1, below).
Testing evaluated the processor alone and the processor with as many as five hardware accelerators. Larger members of the Cyclone III and Stratix III product families, which have many times the test device's capacity, would provide even more extensive tradeoff opportunities.
|Figure 1. Typical Block Diagram of a Processor System|
Baseline tests showed that the Nios II processor operating alone required 435 million clock cycles to complete the calculations for one Mandelbrot frame. Adding a single hardware accelerator brought the execution requirement down to 4.9 million clock cycles – nearly a 90-fold improvement in performance – without a measurable increase in power demand.
Adding four more hardware accelerators yielded incremental improvements of as much as 435 times the performance of the processor alone. The design with the additional accelerators consumed only 90% more power than the CPU alone (Figure 2, below).
|Figure 2. Effect of Adding Hardware Accelerators on System Performance (left) and Power Consumption (right)|
Reducing System Clock Frequency
A 435x performance increase creates abundant computational headroom that can now be traded for lower power. One way to approach this reduction would be to slow the clock for the entire design.
The results for the example show that even with a single accelerator, the entire design can be run at 1 MHz and still achieve greater performance than the CPU alone running at 80 MHz (Figure 3, below).
Meanwhile, the power savings were significant. The design with the CPU and one accelerator running at 1 MHz used only 12 mW, compared to 132 mW for the CPU alone running at 80 MHz, while still achieving nearly twice the performance. With the five-accelerator design, power dropped to less than one-fifth that of the CPU alone while performance increased more than five-fold.
|Figure 3. Effects of Reducing System Clock Frequency|
Reducing Accelerator Clock Frequency
In many applications, however, acceleration hardware is only effective for speeding up part of the algorithm. In such cases, slowing the clock everywhere in the design might adversely impact performance in other functions. A more likely scenario is that the application software requires the processor to run at the higher clock frequency. In this case, power reduction may still be achieved by reducing the accelerator clock frequency.
Developers can evaluate the effect of different clock speeds for various hardware blocks on performance and power. The FPGAs used in this design example allow multiple clock domains, so the CPU and its acceleration hardware can each operate at their optimum speed. By adjusting domain clock speeds independently, developers can readily determine the minimum power required while still achieving the desired performance.
|Figure 4. Effects of Reducing Accelerator Clock Frequency|
Consider the case where an embedded designer wants the processor to execute code at 80 MHz while off-loading the heavy computational algorithms to hardware running at a lower clock frequency. In the test case, the embedded processor running application code at 80 MHz with five hardware accelerators running at 1 MHz increased system performance by six times while still reducing system power by 55% (Figure 4, above).
Developing Hardware Accelerators
The first step is to examine subroutines for computational or state-machine algorithms. These are the types most likely to benefit from acceleration.
Once developers have identified candidate subroutines, standard software profiling tools will provide the details needed to identify acceleration opportunities. Loops that take a long time to execute are good candidates.
Creating hardware accelerators, while a familiar task for hardware developers, may pose a daunting challenge for software developers modeling algorithms in C. The ability to translate ANSI C algorithms into the corresponding logic is a key element in creating effective acceleration hardware (Figure 5, below).
|Figure 5. Hardware Accelerator Design Flow|
The application example was enabled using an advanced tool suite, including the C-to-Hardware Acceleration Compiler (C2H) for the embedded processor and the SOPC Builder system development tool. The tools automated the conversion of the embedded processor's C source code to a hardware accelerator implemented in HDL.
The tools also integrated the accelerator with the processor system. Once the HDL is generated for the system, evaluation can take place using HDL simulation tools, or the design can be run directly in FPGA hardware. Either way, the results are evaluated and the code modified and iterated until the desired performance and power for the system are achieved.
Taking Advantage of Parallelism
In addition to creating accelerator hardware designs for key algorithm steps, developers should consider using multiple copies of such accelerators. Multiple copies are relatively easy to implement in a design once the HDL code is available, and the time required to make the evaluation is often worthwhile. When the algorithm is highly parallel, as in the Mandelbrot example, parallel accelerators can amplify the opportunities for power savings, as the example demonstrated.
By working in this manner, designers can put to rest the rule of thumb that more circuitry means more power. This approach also opens the possibility of wide-ranging design exploration of power and performance trade-offs, and frees designers to apply supposedly power-hungry FPGAs in a host of new applications where small form factor and battery-powered operation are essential.
As an Embedded Systems Specialist, Rod Frazer supports customers using Altera's IP-based embedded processor solutions, including the Nios II processor and the C-to-Hardware Acceleration Compiler (C2H), as well as Altera's embedded system integration tool, SOPC Builder. Mr. Frazer joined Altera in March 2001. He has over 20 years of hardware and software design experience.