Back to the Basics: Tested and effective methods for speeding up DSP algorithms

Many electronics and digital products use digital signal processing (DSP) algorithms. Before any algorithm goes into a product, an optimization process is applied to it. This process ensures that the algorithm uses the fewest possible system resources while delivering the performance the design requires. Optimization is performed to save memory space, clock cycles, and power.
Reducing the required memory means shrinking the algorithm's footprint, and reducing cycle consumption means speeding it up. In most cases, power savings follow from just these two kinds of optimization.
Efficiently optimizing an algorithm, however, requires an understanding of the algorithm itself, the processor architecture, the compiler, and good 'C' and assembly language practice. Since the early days, assembly coding has been used to achieve these goals.
But with the availability of powerful 'C' compilers, it is now possible for most embedded programmers to successfully perform both source- and algorithm-level optimization if they follow some basic, tested and effective guidelines.
This article describes the basics of the optimization process and the practical 'C'-level skills needed to accelerate an algorithm. Processor- and compiler-specific options are excluded, as they are typically found in the respective vendor documents.

Understand the process before optimizing
First, understand that optimization is a step-by-step process based on a detailed understanding of the algorithm, the CPU, the memory structure and the compiler. Careful observation and some patience are necessary preconditions. Here are a few preparatory steps before undertaking the task of speeding up a DSP algorithm:
Step #1: Perform a detailed code walk-through. This helps in understanding the algorithm and code quality, and identifies the possible areas of major, minor and potential optimization. An experienced programmer also uses this step to estimate the time the optimization will require and to forecast the likely level of performance improvement.
Step #2: Pick a set of test vectors that fully exercises the algorithm's behavior. Then generate a reference output vector set from the algorithm before starting to optimize it.
Step #3: Decide on the verification method for the optimized code. It can be either bit-exactness or a combination of root-mean-square (RMS) error and signal-to-noise ratio (SNR) values. Choose the verification method(s) wisely, working out the formulae needed and setting up a workbench accordingly.
Step #4: Automate the verification procedure to reduce manual effort and time. For instance, write a script that, once run, takes the input vectors one by one and generates the output vectors without the programmer's attention. Or create an Excel worksheet with the RMS/SNR formulae already entered, to save manual effort and computation time.
Step #5: Measure the complexity of the critical modules and of the overall algorithm. Use a single notation such as MCPS (millions of cycles per second). (See the method to calculate MCPS.)
Step #6: Develop an observation matrix that collects the analysis of each optimization stage. Record all trials in the matrix along with their outcomes, and note down the initial observations from the code walk-through and the numbers from the MCPS calculation.
Step #7: Choose audio tools that help in evaluating the quality of the output vectors and in debugging the intermediate optimization stages. Tools capable of representing audio/speech signals in both the time and frequency domains are useful.
Step #8: Use version control to keep the versions of the optimized code handy. If something goes wrong while experimenting, just go back to the previous version and start afresh.
Generally the speed optimization techniques discussed in the rest of this article fall into two categories:
1) Platform-specific (compiler and processor architecture) optimization, with:
a) The optimization levels/switches offered by the compiler, each employing a different set of techniques. The highest level generates the fastest code by intelligently combining all the techniques.
b) Intrinsics/DSP extensions, which replace common operations such as add, multiply and saturate, as well as special algorithms such as the FFT.
c) Pragmas, which are frequently used for data alignment, data allocation and hardware loops.
d) Assembly coding, which is used to write the critical code.
2) Source and algorithmic level optimization, the main focus of this article, which are generic to 32-bit processors and could be applied to a wide range of algorithms.
These two methodologies may look totally unrelated at first glance, but they are actually closely coupled. Ultimately, all the techniques help the compiler generate efficient code, which either minimizes the algorithm's resource requirements or maximizes the concurrent use of available processor resources.
'C' source and algorithm level optimization
To optimize your DSP application at the C-code and algorithm level there are a number of techniques available including 1) reducing memory accesses, 2) proper handling of the data, 3) reducing the number of math operations, 4) tightly controlling live variables, 5) using pre-calculated stored values, 6) making the logic more precise, and 7) effective use of loop handling.
Before using any of these techniques, here are some tips on how to minimize error and to improve optimization:
* Group all "like data" declarations together, listing 32-bit data first. This allows contiguous data memory allocation and ensures smooth memory operations.
* Declare local variables near where they are used. This helps the compiler understand their lifecycle and allocate registers better.
* Assign the smallest selector values to the most commonly used paths. This is most useful in control code and switch statements. For if-then-else chains, place the most common path at the start of the chain.
* Replace small, frequently used functions with macros. Basic functions for saturation, normalization, division, etc. may fall into this category.
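As a sketch of that last tip, a hot helper such as 16-bit saturation can be expressed as a macro so critical loops avoid call overhead. The macro name `SAT16` is an assumption for illustration:

```c
/* Clamp a 32-bit intermediate result to the signed 16-bit range.
 * As a macro, it expands inline with no function-call overhead. */
#define SAT16(x) ((x) > 32767 ? 32767 : ((x) < -32768 ? -32768 : (x)))

/* Equivalent function form, shown only for comparison; in a tight
 * loop the macro form is what the tip above recommends. */
static int sat16_fn(int x)
{
    return SAT16(x);
}
```

The usual macro caveats apply: parenthesize arguments (as above) and avoid passing expressions with side effects, since `x` is evaluated more than once.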
Technique #1: Reduce memory accesses
Memory operations such as reads and writes are the most expensive in terms of cycles. Even with today's advanced processors, each external-memory load or store can effectively take multiple cycles. So, to speed up any algorithm, it is essential to keep a tab on memory accesses. A few important techniques for achieving this goal include:
* Fetch once, use maximally. Write/re-arrange the code so that once a set of variables is fetched from memory into registers, it is operated on immediately and as much as possible before moving on to another set of variables. This prevents the compiler from pushing the variables back to memory when other variable sets need the registers. Take care to avoid pipeline stalls caused by consecutive operations on the same variables.
* Fetch and store 8- or 16-bit data as 32-bit. This can reduce the memory accesses by up to 75% for a particular piece of code. Additional instructions are needed to unpack the 32-bit fetched data to 8 or 16 bits and vice versa, but this extra code, consisting of a few shifts, turns out to be cheaper than the memory accesses, because the barrel shifter in most DSPs provides the required shifts at no cost.
A good example of this is the fast Fourier transform (FFT) butterfly, similar to that shown in Code Snippets 1.1 and 1.2 below. The first snippet reads the 16-bit coefficients from the table one at a time, whereas the second is optimized to fetch a pair of 16-bit coefficients as a single 32-bit word, cutting the memory accesses in half.
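The packed-fetch idea can be sketched as follows. This is an illustrative reconstruction, not the article's actual Snippet 1.2; the table layout (cosine in the low half, sine in the high half) and the names are assumptions:

```c
#include <stdint.h>

/* Twiddle table storing each cos/sin pair packed into one 32-bit word,
 * so a single load fetches both 16-bit coefficients. The two entries
 * below hold arbitrary values chosen only for illustration. */
static const uint32_t twiddle[2] = {
    ((uint32_t)1000 << 16) | (2000u & 0xFFFFu),            /* sin=1000, cos=2000 */
    ((uint32_t)(uint16_t)-500 << 16) | ((uint16_t)3000 & 0xFFFFu), /* sin=-500, cos=3000 */
};

static void fetch_pair(int k, int16_t *c, int16_t *s)
{
    uint32_t w = twiddle[k];       /* one 32-bit load instead of two 16-bit loads */
    *c = (int16_t)(w & 0xFFFFu);   /* unpack low half  (cosine) */
    *s = (int16_t)(w >> 16);       /* unpack high half (sine)   */
}
```

The two unpacking shifts are exactly the "extra code" the text describes; on a DSP with a barrel shifter they cost nothing compared to the memory access they replace.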
* Use #define for constant data wherever possible. This avoids loading the constants at runtime and saves significant cycles, especially with constant arrays.
Technique #2: Handle data properly
It is advisable to declare local data with the processor's native width. This avoids the formatting/sign extension the compiler performs every time it accesses a narrower variable from memory.
Global data also requires careful handling. Globals occupy data memory, and if that memory is external, operations on them become very costly. Some recommendations for optimizing operations on globals include:
* Avoid accessing/operating on globals directly; access them via local pointers instead. This is useful in critical code and loops. Also, keep the pointer size the same as the processor's data width.
* Work on a local copy of the global data and update the global after the operations are completed. Global variables sometimes never get allocated to internal registers, which results in extensive memory accesses.
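The local-copy recommendation can be sketched as below. The global name `g_energy` and the energy-accumulation task are assumptions for illustration:

```c
/* Global accumulator: may live in (slow, possibly external) data memory. */
static long g_energy;

static void accumulate_energy(const short *x, int n)
{
    long acc = g_energy;           /* fetch the global once into a local   */
    for (int i = 0; i < n; i++)
        acc += (long)x[i] * x[i];  /* updates stay register-resident       */
    g_energy = acc;                /* write back once, after the loop ends */
}
```

Without the local copy, the compiler may be forced to re-read and re-write `g_energy` on every iteration, turning one memory round-trip into thousands.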
Technique #3: Reduce math operations
Math operations such as multiplication and division must be used minimally. Though compilers are intelligent enough to replace multiplies and divides with shifts wherever possible, it is better for the programmer to specify shifts explicitly to help the compiler produce the best-optimized code.
For example, a multiplication or division can be replaced with a single shift whenever the multiplier or divisor is a power of two. It is also possible to replace multiplication/division by an arbitrary constant with a combination of shifts and adds/subtracts.
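A minimal sketch of both substitutions, assuming the constants 10 and 8 purely for illustration:

```c
/* Multiply by 10 using shifts and an add:
 * 10*x = 8*x + 2*x = (x << 3) + (x << 1).
 * Shown for non-negative inputs; left-shifting negative values is
 * undefined behavior in standard C. */
static unsigned mul10(unsigned x)
{
    return (x << 3) + (x << 1);
}

/* Divide by a power of two with a right shift (non-negative values). */
static unsigned div8(unsigned x)
{
    return x >> 3;   /* x / 8 */
}
```

Modern compilers often perform these strength reductions automatically, but writing them explicitly, as the text suggests, guarantees the shift form even at low optimization levels.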
Technique #4: Control live variables
Every processor has a limited number of user registers. The compiler makes its best attempt to accommodate all the variables in the execution context in those registers, but if there are more variables than registers available, it has no option but to place the extra variables on the stack.
When variables do not find space in registers and are accessed repeatedly from memory, additional burdens are placed on the DSP engine. To avoid this, restrict the number of live variables so that they fit in the internal registers, avoiding repetitive data accesses from memory. This can be achieved through good programming techniques such as:
* Reusing the temporary variables. The same variable can be used to temporarily hold more than one intermediate result in the code.
* Packing more than one live variable into a single variable. This reduces the number of variables; the aim should be to use every bit of the data efficiently.
* Splitting a complex module with many variables. This limits the number of live variables in the particular execution context and allows efficient register allocation. It may even compensate for the extra function overhead and improve the overall speed.
* Analyzing the disassembly and re-arranging the instructions. This helps reduce memory accesses and limit the live variables.
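The packing idea in the list above can be sketched as follows, assuming (purely for illustration) two 16-bit live values, such as a real/imaginary pair, sharing one 32-bit variable so that one register holds both:

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word and unpack them.
 * The macro names are assumptions, not from the article. */
#define PACK(hi, lo)  (((uint32_t)(uint16_t)(hi) << 16) | (uint16_t)(lo))
#define HI16(w)       ((int16_t)((w) >> 16))
#define LO16(w)       ((int16_t)((w) & 0xFFFFu))
```

The pack/unpack shifts are cheap on a DSP with a barrel shifter, while the register freed up can keep another live variable out of memory.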
Technique #5: Use precalculated values rather than on-the-fly computations
There are a number of situations where a piece of code or a formula repeated in many places can be removed and pre-calculated stored values used instead. While this requires additional memory from the system, it significantly reduces the calculation overhead at run time and therefore speeds up the algorithm. Also:
* Trigonometric and transcendental functions such as sine, cosine and log should be avoided; instead, pre-calculate the values and store them in a look-up table.
* In the case of FFT speed-ups, use pre-generated indexes rather than computing the scrambler logic in the code.
Technique #6: Change the logic - precision is always better
If a piece of code can be re-written in a better way, go ahead and make the logic more precise. Make an effort to rethink and re-do frequently used code. Filter kernels, basic mathematical operations and nested loops are a few of the places where efficient coding is required.
For example, the code in Snippets 3.1 and 3.2 below shows the first two stages of the macro that calculates a 13-bit division. Significant cycles are saved by avoiding some explicit shifts between the two stages; those shifts are now combined with an ADD_32 operation, which the DSP's barrel shifter executes in parallel with the CPU.
Technique #7: Loop handling
Optimizations related to loop handling have a positive effect most of the time. But they should be practiced as the last optimization stage, because methods such as loop unrolling and splitting dramatically increase the code size and make the code unreadable. The most effective loop-related optimizations include:
* Count-to-zero loops. These always save at least a cycle per iteration because comparison with zero is free.
* Function calls and control statements. These are very expensive and are to be avoided in critical loops. Code Snippet 4.1 below is a piece of code from the power calculation of an error signal, with the optimized version following it in Code Snippet 4.2.
* Merging. If two or more consecutive loops iterate over the same range, they can be merged to save an entire compare-and-branch phase.
* Splitting. If the loop has many live variables, sometimes it is better to split the tasks in two pieces. This helps efficient register allocation of live variables and avoids costly memory operations.
* Unrolling. This reduces the loop overhead and is suggested as the last resort. Be aware that unrolling makes the loop unreadable and sometimes prevents the compiler from performing other optimizations.
* Denoting loop counts with constants. This helps the compiler decide whether a loop is worth optimizing by looking at the count.
* Nested loops. There are many opportunities for optimization here. For example, replacing the innermost loop with a do-while saves at least one cycle per iteration.
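Two of the techniques in the list above, merging and count-to-zero loops, can be sketched together. The gain-plus-energy task and Q15 scaling are assumptions for illustration:

```c
/* Two separate passes over the same data (apply a Q15 gain, then
 * accumulate signal energy) are merged into a single loop, and the
 * merged loop counts down so the decrement doubles as the exit test. */
static void scale_and_energy(short *x, int n, short gain, long *energy)
{
    long acc = 0;
    for (int i = n; i != 0; i--) {        /* count-to-zero loop        */
        *x = (short)((*x * gain) >> 15);  /* former pass 1: apply gain */
        acc += (long)*x * *x;             /* former pass 2: energy     */
        x++;
    }
    *energy = acc;
}
```

Merging removes one full compare-and-branch sequence per element, and comparing against zero is typically free on DSP cores, as the list notes.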
Code optimization is a requirement-driven, systematic and thoughtful process. Attentive observation and an experimental approach help in achieving impressive numbers. Note that more than one technique can often be applied successfully to the same piece of code.
Speed optimization not only helps the algorithm run faster but also adds value to the system for which the algorithm is developed, such as allowing the same application to run at a lower clock frequency. You can then choose either to add more features with the cycles freed by the optimization or simply to cut the core frequency and reduce the system's power consumption.
However, some speed optimizations may adversely affect the system's memory and power consumption. The practitioner has to evaluate such effects and balance these trade-offs. Note that best-in-class performance is achieved by applying a suitable combination of processor-, compiler- and source-level optimization techniques.
Nitin Jain works with MindTree's Research and DSP groups in Bangalore on end-to-end product realizations based on either new or optimized versions of existing speech/audio algorithms. He holds an engineering degree in electronics and communications and can be reached at email@example.com.
 To read about the intricacies involved in working with speech algorithms in embedded designs, go to "Integrating and evaluating speech algorithms."
 For additional information about DSP, go to "More about designing with Embedded DSPs."