Optimizing compilers for embedded DSP software - Embedded.com

Optimizing compilers for embedded DSP software


Many of today's DSP applications are subject to real-time constraints. Many applications will eventually grow to a point where they are stressing the available CPU and memory resources. Understanding the workings of the DSP architecture, compiler, and the application algorithms can speed up applications, sometimes by an order of magnitude. This article will summarize some of the techniques that can improve the performance of your code in terms of cycle count, memory use, and power consumption.

What Is Optimization?
Optimization is a procedure that seeks to maximize or minimize one or more performance indices. These indices include:

  • Throughput (execution speed)
  • Memory usage
  • I/O bandwidth
  • Power dissipation

Since many DSP systems are real-time systems, at least one (and probably more) of these indices must be optimized. It is difficult (and usually impossible) to optimize all these performance indices at the same time. For example, making the application faster may require more memory and vice versa. The designer must weigh each of these indices and make the best tradeoff.

Determining which index or set of indices are important to optimize depends on the goals of the application developer. For example, optimizing for performance means that the developer can use a slow or less expensive DSP to do the same amount of work. In some embedded systems, cost savings like this can have a significant impact on the success of the product. The developer can alternatively choose to optimize the application to allow the addition of more functionality.

This may be very important if the additional functionality improves the overall performance of the system, or if the developer can add more capability to the system such as an additional channel of a base station system.

Optimizing for memory use can also lead to overall system cost reduction. Reducing the application size leads to a lower demand for memory which reduces overall system cost. And finally, optimizing for power means that the application can run longer on the same amount of power. This is important for battery powered applications. This type of optimization also reduces the overall system cost with respect to power supply requirements and other cooling functionality required.

The tricky part to optimizing DSP applications is to understand the tradeoff between the various performance indices. For example, optimizing an application for speed often means a corresponding decrease in power consumption but an increase in memory usage.

Optimizing for memory may also result in a decrease in power consumption due to fewer memory accesses but an offsetting decrease in code performance. The various tradeoffs and system goals must be understood and considered before attempting any form of application optimization.

Make The Common Case Fast
The fundamental rule in computer design as well as programming real-time DSP-based systems is “make the common case fast, and favor the frequent case.” This is really just Amdahl's Law that says the performance improvement to be gained using some faster mode of execution is limited by how often you use that faster mode of execution. So don't spend time trying to optimize a piece of code that will hardly ever run. You won't get much out of it, no matter how innovative you are. Instead, if you can eliminate just one cycle from a loop that executes thousands of times, you will see a bigger impact on the bottom line.

1. Make the Common Case Fast – DSP Architectures
DSP architectures are designed to make the common case fast. Considering many DSP applications are composed from a standard set of DSP building blocks such as filters, Fourier Transforms, and convolutions. These algorithms all share a common characteristic; they perform multiplies and adds over and over again (Figure 1 ).

This is generally referred to as the Sum of Products (SOP). DSP chip designers have developed hardware architectures that allow the efficient execution of algorithms with SOPs. This is done using specialized instructions such as single cycle multiple and accumulate (MAC), architectures that all multiple memory accessed in a single cycle (Harvard achitectures, Figure 2 ) and special hardware that handles loop counting with very little overhead.

1. DSP algorithms are composed of iterations of multiplies and adds, as shown in the Discrete Fourier Transform formula…

…and the Filter algorithm

2. Harvard Architecture. The separation of program and data provides increased performance for DSP applications

(Editor's note : This paper was presented at ESC Boston 2006. For more papers from this conference, go to Embedded.com.)

To read more, go to “Algorithms and compilers

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.