Back to the Basics: Tested and effective methods for speeding up DSP algorithms - Embedded.com

Back to the Basics: Tested and effective methods for speeding up DSP algorithms

Many electronics and digital products use digital signal processing (DSP)algorithms in them. Before any algorithm gets into theproduct, an optimization process is appliedto it. The optimizationprocess ensures that the algorithm uses the leastpossible resources from the system, while delivering the performancethe design requires. The optimization is performed to save memoryspace, clock cycles, and power levels.

Reducing the required memory means shrinking the algorithmfoot-print, and reducing the cycle consumption means speeding thealgorithm. In most cases power savings have resulted with just thesetwo kinds of optimizations.

But understanding algorithm, processor architecture, compiler and fine 'C' and assembly language practice is required to efficiently optimize an algorithm.Since early days, assembly coding has been used to achieve these goals.

But with the availability of powerful 'C' compliers it is nowpossible for most embedded programmers to successfully perform bothsource and algorithm level optimization if they follow some basictested and effective guidelines.

This article describes the optimization process basics and variouspractical 'C' level skills needed to accelerate an algorithm. Processorand complier specific options are excluded as they are typically foundin the respective vendor documents.

Understand the process beforeoptimizing
At first it is to be understood that the optimization is a step-by-stepprocess, based on a detailed understanding of the algorithm, CPU,memory structure and complier. So careful observations and somepatience are necessarypreconditions before starting. Here are a few preparatory steps beforeundertaking the task of speeding up the DSP algorithm:

Step #1: Perform a detailedcode walk-through. This helps in understanding the algorithm,code quality and identifies the possible areas of major, potential andminor optimizations. An experienced programmer also uses this step toestimate the time required in the optimization and to forecast thepossible level of performance improvement.

Step #2: Pick the various testvectors which entirely assesses the algorithm behavior. Thengenerate an output vector set from the algorithm before startoptimizing it.

Step #3: Decide on the verification method for theoptimized code. It can be either bit-exactness, orcombination of root-mean-square (RMS), and signal-to-noise ratio (SNR)values. The engineer has to choose the verification method(s) wisely,working out the formulae needed and setting up a work benchaccordingly.

Step #4: Automate theverification procedure to reduce the physical efforts and time. For instance, generate a script that once it has run would take theinput vectors one by one and generate the output vectors subsequentlywithout the programmer's attention. Or, create an Excel work-sheetwhere the RMS/SNR formulae have already been fed to save manual effortsand computation time.

Step #5: Measure the complexityof the critical modules and overall algorithm. Use one notationsuch as MCPS. (See the method to calculate MCPS[1]) .

Step #6: Develop an observationmatrix which will collect the analysis of each optimizationstage. Record all the trials on the matrix along with their outcome.Note down all the initial observations from the code walk-through andnumbers from MCPS calculation on the observation matrix.

Step #7: Chose audio tools whichwill help in evaluating the output vectors quality and debugging in theintermediate optimization stages. Tools which are capable inrepresenting audio/speech signal in both time and frequency domains areuseful.

Step #8: Use version control , tokeep the versions of optimized code handy. If something goes bad whileexperimenting, just go back to the previous version and start afresh.

Generally the speed optimization techniques discussed in the rest ofthis article fall into two categories:

1) Platform (complier andprocessor architectural) specific optimization , with:
    a) Different optimization levels/switches offered by compiler, employingvarious sets of techniques. The highest level intelligently generatesthe fastest code by using combination all the techniques.
    b) Intrinsics/DSP extensions, which replace the common operations such asadd and mult, as well as saturate /special algorithms such as FFT.
    c) Pragmas,which are related to data alignment, data allocation and hardware loopsare used frequently.
    d) Assemblycoding, which is used to write the critical code.

2) Source and algorithmiclevel optimization , the main focus of this article, which aregeneric to 32-bit processors and could be applied to a wide range ofalgorithms.

These two methodologies may look totally unrelated and separate atfirst glance. But they are actually closely coupled. Eventually all thetechniques help the complier generate efficient code, which eitherminimizes the resource requirement need for the algorithm or maximizesthe concurrent use of available processor resources.

'C' source and algorithm leveloptimization
To optimize your DSP application at the C-code and algorithm levelthere are a number of techniques available including 1) reducing memory accesses, 2) proper handling of the data, 3) reducing the number of mathoperations, 4) tightlycontrolling live variables, 5) using pre-calculated stored values, 6) making the logic more precise, and 7) effective use of loop handling.

Before using any of these techniques, here are some tips on how tominimize error and to improve optimization:

* Group all”like data” declarations together, listing 32 bit data first. This allows the contiguous data memory allocation and ensures smoothmemory operations.

* Declare localvariables near to the usage-area. Thist helps complier inunderstanding their lifecycle and better register allocation.

* Assign thesmallest selection value to most commonly used path. This ismost useful for operations related to control code and switchstatements. For if-then-else statements, place the most common path atthe start of the chain.

* Replace thesmall and frequently used functions in the macros. Basicfunctions to saturate, normalize, division, etc. may fall under thiscategory.

Technique #1: Reduce memoryaccesses.
The memory related operations such as read and write are the mostexpensive in terms of cycles. Even with today's advanced processors,each external memory “load and store” can effectively take multiplecycles. So to speed up any algorithm it is essential to keep a tab onthe memory accesses. A few important techniques for achieving this goalinclude:

* Fetch once,use max. Write/re-arrange the code such that once a set ofvariables are fetched from memory to registers they are operated onimmediately and maximally before jump to operation on another other setof variables. This will avoid the variables being pushed back to memoryby complier due to the register requirement by other variable sets.Precaution has to be taken to avoid any possible stalls due toconsecutive operations on the variables.

* Fetch andstore the 8 or 16-bit data as 32-bit. This reduces the memoryaccesses up to 75% for a particular piece of code. But there has to beadditional instructions that will unpack the 32-bit fetched data to 8or 16-bit and vice versa. This extra code containing some shifts turnto be cheaper when compared to memory accesses because the barrelshifter in the DSPs provide the required shifts at no cost.

A good example of this a Fast fourier transform (FFT) butterflysimilar to that shown in the CodeSnippets 1.1 and 1.2 below. Thefirst code snippet reads the 16 bit coefficients from the table whereasthe second snippet is optimized to fetch a 16-bit coefficients pair assingle 32-bit data thereby reducing the memory accesses by half.

* Use #definefor constant data wherever possible. This avoids loading theconstants at runtime and saves significant cycles, especially withconstant arrays.

Technique #2: Handle data properly
It is advisable to declare the local data-width as that of theprocessor. This will prevent the formatting/sign extension done bycomplier every time it accesses the smaller data-width variable frommemory.

Global data also requires careful handling. Globals find space ondata memory and if the memory is externally located, operations becomevery costlier. Some recommendations to optimize the operations relatingto globals include:

* Avoidaccessing/operating globals directly. Point them via local pointers.This is useful in critical code and loops. Also, keep the pointer sizethe same as that of processor data-width.

* Work on alocal copy of the global data and updating this later after theoperations are completed. Sometimes global variable never happen to getallocated on the internal registers, which results in extensive memoryaccesses.

Technique #3: Reduce math operations
Math operations such as multiplication and division must be usedminimally. Though compliers are intelligent enough to multiply anddivide with shifts wherever possible, it is better that programmerspecifies shifts to help compiler producing the best optimized code.

For example, the equation below shows that some cycles in theequations can be saved if any of the conditions is true:

It is also possible to replace multiplication/division by any numberwith the shift and add/minus combination.

Technique #4: Control livevariables
Every processor has limited user registers. The compiler does makes itsbest attempt at accommodating all the variables in the executioncontext to those registers. But if there are more variables than thereare registers available, the compiler has no other option than tolocate the variable onto the stack.

In situations where the variables do not find space on registers andaccesses repeatedly from the memory, additional burdens are placed onthe DSP engine. To avoid this restrict the use of live variables sothat they fit on the internal registers and avoid repetitive dataaccess from memory. This can be achieved through good programmingtechniques such as:

* Reusing thetemporary variables. The same variable can be used totemporarily hold more than one intermediate result in the code.

* Packing morethan one live variable in to a single variable . This helps inreducing the number of variables The aim should be to efficiently useeach bit of the data.

* Splitting acomplex module with many variables. This limits the number oflive variables in the particular execution context and allows efficientregister allocation. It may even compensate for the extra functionoverhead and improve the overall speed.

* Analyzing thedisassembly and re-arrange the instructions. This helps inreducing the memory accesses and limiting the live variables.

Technique #5: Use precalculatedrather than on-the-fly computations
There are number of situations where it will be possible to remove apiece of code or a formula in many places and instead usepre-calculated stored values. While this requires additional memoryfrom the system, it significantly reduces the calculation overhead atrun time and therefore speeds the algorithm. Also:

* Sometrigonometric functions such as sin, cosine and log should be avoidedwith values instead pre-calculated and stored in a look-up table.

* Inthe case of FFT speed-ups, use the generated indexes rather than usingscrambler logic in the code.

Technique #6: Change the logic -precision is always better
If a piece of code can be re-written in a better way, go ahead and makethe logic more precise. Make an effort to rethink and re-do frequentlyused code. Filter kernels, basic mathematical operations and nestedloops are some of the few places where efficient coding is required.

For example, the code in Snippets3.1 and 3.2 below show first two stages of the macro tocalculate the 13-bit Division. Here significant cycles are saved byavoiding some explicit shifts between two stages. Those shifts are nowclubbed with an ADD_32 operation. The DSP's barrel shifter will executethem in parallel with CPU.

Technique #7: Loop handling
Optimizations related to loop handling result in positive effects mostof the times. But they must be practiced as the last optimization stagedue to some methods such as loop-unrolling and splitting whichdramatically increase the code size and make the code unreadable. Themost effective loop related optimizations include:

* Count-to-zeroloops. These always save at least a cycle per iteration becausecomparison with zero is free.

* Functioncalls and control statements. Very expensive, these are tobe  avoided in the critical loops. Code Snippet 4.1 below is a piece ofcode from the power calculation of an error signal, with the optimizedversion following it in Code snippet4.2.


* Merging. If there are two or more consecutive loops doing same iterations, theycan be merged together to save an entire comparing and branching phase.

* Splitting. If the loop has many live variables, sometimes it is better to splitthe tasks in two pieces. This helps efficient register allocation oflive variables and avoids costly memory operations.

* Unrolling. This reduces the loop overhead and is suggested as the last practice.Beware of the fact that loops unrolling makes the loop unreadable andsometimes prevent the compiler in performing other optimizations.

* Denoting theloop counts via constants. This helps the complier decidewhether it is worth optimizing the loop by looking the count.

* Nesting loops. There are many opportunities for optimization here. Replacing theinner-most loop with the do-while saves at least one cycle periteration.

Conclusion
Code optimization is a requirement-driven, systematic and thoughtfulprocess. Attentive observation and an experimental approach helps inachieving the impressive numbers. Note that more than one technique canbe applied successfully on the same piece of code.

Speed optimization not only helps algorithm run faster but it alsoadds value to the system for which the algorithm is developed, such asrunning the same application at a lower frequency. You can then chooseeither to go for more features with the frequency resulting from theoptimization or simply cut down the core frequency to reduce the systempower consumption.

However, some speed optimizations may adversely affect the systemmemory and power consumption. The practitioner has to evaluate sucheffects and try to balance these trade-offs. Note that thebest-in-class performance can be achieved by applying a suitablecombination of processor, complier, and source level optimizationtechniques.

Nitin Jain works with MindTree's Research and DSP groupsin Bangalore on end-to-end product realizations based on either new oroptimized versions of existing speech/audio algorithms. He holds anengineering degree in electronics and communications and can be reachedat nitin_jain@mindtree.com.

References
[1] To read about theintricacies involved in working with speech algorithms in embeddeddesigns, go to “Integratingand evaluating speech algorithms.”

[2] For additionalinformation about DSP, go to “Moreabout designing with Embedded DSPs.”

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.