 How to code fast, accurate math functions on DSP parallel devices - Embedded.com

# How to code fast, accurate math functions on DSP parallel devices

When programming for parallel engines, the requirements for efficient algorithms change drastically. Many mathematical algorithms, such as root solvers, are fast on single-engine processors but aren't suitable for parallel engines because the algorithm's steps must be performed in sequence and not simultaneously. This article presents some general techniques for efficient coding of mathematical functions on parallel reduced instruction set computer (RISC) devices. A fixed-point square root function implemented using these techniques compares favorably with an industry benchmark (square root function from Texas Instruments' C24x math library).

Communications digital signal processors (DSPs) are going parallel. In cellular baseband processing, for instance, newer devices have a high degree of parallelism to perform matrix and vector processing on high-rate data. When you're programming mathematical functions (such as square root or inverse), the requirements for efficient algorithms are drastically different from conventional DSP programming. Many mathematical algorithms (such as root solvers) are fast on single-engine processors but aren't suitable for parallel engines because the algorithm steps have sequential dependence and can't be performed simultaneously.

This article presents some general techniques for efficient coding of mathematical functions on single instruction multiple data (SIMD) or multiple instruction multiple data (MIMD) DSP engines. These techniques are particularly appropriate for reduced instruction set computer (RISC) devices with an efficient MAC (multiply-accumulate) operation.

The alternatives
Several possible approaches to DSP coding of mathematical functions include:

• Lookup table (LUT) or interpolated LUT: Function values ƒn º ƒ(xn ) for a closely-spaced set {xn } of function arguments are stored in a table.
• Polynomial approximation: ƒ(x) is approximated by a single polynomial over the entire domain.
• Recursive methods: For instance, the function ƒ(x) = 1/2x can be estimated recursively using the relation:
yn = (1 – yn-1 ·x )·yn-1 ·2

Then yn converges to 1/2x as n ®

• Spline approximation (piecewise polynomial approximation): The domain of ƒ(x) is divided into sub-intervals, and the function is approximated by a low-order (often cubic) polynomial on each sub-interval.

Compared with the other three alternatives, the polynomial approximation approach is most amenable to parallelization. For instance, the computation of a 7th order polynomial:

P º a7 · ×7 +a6 · ×6 +a5 · ×5 +a4 · ×4 +a3 · ×3 +a2 · ×2 +a1 · ×+a0 · ×

can be accomplished in five cycles on a four-element SIMD machine as shown in Table 1 (assuming the appropriate I/O connections). The blank cells indicate that the operation in that cell is not needed.

 Step # Element 1 Element 2 Element 3 Element 4 1 b3 = a7 ·x+a6 b2 = a5 ·x+a4 b1 = a3 ·x+a2 b0 = a1 ·x+a0 2 y=x2 y=x2 y=x2 y=x2 3 c3 = b3 ·y+b2 _____ c1 = b1 ·y+b0 _____ 4 z=y2 z=y2 z=y2 z=y2 5 ans = c3 ·z+c1 _____ _____ _____

Table 1: Computing a 7th order polynomial in five cycles with SIMD machine

Alternatively, the computation can be done in three cycles on a five-element MIMD machine as shown in Table 2.

 Step # Element 1 Element 2 Element 3 Element 4 Element 5 1 b3 = a7 ·x+a6 b2 = a5 ·x+a4 b1 = a3 ·x+a2 b0 = a1 ·x+a0 y=x2 2 c3 = b3 ·y+b2 _____ c1 = b1 ·y+b0 _____ z=y2 3 Ans= c3 ·z+c1

Table 2: Computing a 7th order polynomial in three cycles with MIMD machine

Besides supporting parallelization, the polynomial approach has other significant advantages:

Low memory requirements: Only the polynomial coefficients need to be stored, and usually on the order of 10 (or fewer) coefficients are required for very accurate (16 bit) polynomial approximations of mathematical functions on the fixed point range [-1,1). This gives polynomial approximation a huge advantage over LUT implementation: a pure LUT implementation with 16-bit accuracy requires storing roughly 220 bits of information.

Universal applicability: Polynomial approximation may be validly applied to any continuous function. This is a consequence of the Weierstrass approximation theorem, which is discussed in Atkinson's book, “An Introduction to Numerical Analysis.” The Weierstrass approximation theorem may be stated as follows:

Let ƒ :[a,b ]® R be a continuous function defined on the closed interval [a, b ]. Then for any ε>0, there exists a polynomial function pε (x ) such that |ƒ(x) pε (x )| < ε for all x in [a,b ].

Recursive methods are only applicable to special functions for which easily computable recursion relations can be derived and cannot be used for a general continuous function.

Linear (nonbranching) code flow: No branching instructions are required for polynomial computation. The same instructions are executed, regardless of the input to the function. For this reason, you can compute multiple function values for multiple inputs straightforwardly on a highly parallelized device. In contrast, spline approach has less inherent parallelism, since different polynomials must be computed for different function inputs.

Designer polynomials
If we focus on the polynomial approach, a key element of efficient DSP coding of mathematical functions is finding the best possible polynomial approximation. The sense of “best” here must be defined more precisely. On one hand, the polynomial should have as small an order as possible: the smaller the order, the fewer the terms, and the fewer steps in the computation. On the other hand, the polynomial should be as “accurate” as possible for the given order (the precise meaning of “accurate” will be discussed below). In summary, we're looking for a polynomial with minimum order of maximum accuracy. In practice, this usually means finding the polynomial of minimum order that meets a given accuracy criterion.

As we've indicated, “accurate” must be more precisely defined. Several different accuracy criteria are reasonable, including:

(1) Minimize maximum absolute error
(2) Minimize mean absolute error
(3) Minimize mean squared error

In the case of practical coding, the programmer usually designs for the “worst case” scenario, which here corresponds to Criterion 1. Accordingly, this is the criterion we'll use in the following discussion.

The polynomial of a given order D which minimizes the maximum absolute error from a given continuous function ƒ(x) is called the minimax polynomial for ƒ(x) of order D . Figure 1 shows the advantage of minimax polynomial approximation over Taylor series approximation, in the case of ƒ(x) = sqrt(x ) for x in the interval [1/2, 1]. The Taylor series approximation is very good near the expansion point (in this case the expansion point is x = 3/4), but errors increase rapidly near the edge of the interval. On the other hand, the minimax polynomial error is fairly uniform over the entire range. Figure 1: Approximation error for ƒ(x) = sqrt(x) on [0.5 1], for minimax and Taylor polynomial approximation

The Remez (or Remes) algorithm (which is also discussed in Atkinson's book  ) gives a constructive procedure for finding minimax polynomials. Some readers may be familiar with the Remez algorithm as applied filter design. There is a slight difference between the two applications, because filtering is concerned with minimax approximation by trigonometric polynomials [polynomials in cos(x ) or sin(x )], while in our application we're interested in polynomials in the variable x .

Computational packages that perform the Remez algorithm may be obtained from the web. One that I highly recommend is the COCA (Complex Linear Chebyshev Approximation) package, authored by Bernd Fischer and Jan Modersitzki.  The package is written in MATLAB.

Approximation method
Using the minimax polynomial is only a part (albeit a key part) of an overall polynomial approximation method. An overall method for polynomial approximation of mathematical functions using fixed-point DSPs is described as follows:

Minimize function domain
The smaller the function domain, the more accurate the minimax polynomial approximation for a given polynomial order. Hence, it's beneficial to choose the smallest possible function domain so that inputs can conveniently be transformed to lie in the domain.

Consider for instance the function ƒ(x) = 1/x . A suitable domain for ƒ(x) is [0.5 1), which may be seen as follows. Any fixed-point input may be scaled to this domain by taking the absolute value and then left shifting by n , where n +1 is the number of leading zeros in a signed binary fractional representation of |x | (many DSPs have a special instruction that counts the number of leading zeros). Mathematically the rescaled argument is given by:

xrescaled = 2n ·sign(x)·x.

Once xrescaled -1 is found, then x-1 may be recovered by:

x-1 = sign(x)·2n · xrescaled -1 .

In the TI C24x math library the domain [0.5 1) is also used for the function ƒ(x) = sqrt(x ).  However, it's not convenient to recover sqrt(x ) from sqrt(x rescaled ), if x is rescaled to [0.5 1) as in the previous example. A more convenient scaling is as follows:

xrescaled = 22 n /2 ·x ,

where x rescaled now lies in [0.25 1) if x lies in positive fixed-point range [0 1). This rescaling amounts to two left shifts by n /2 . Once sqrt(x rescaled ) has been computed, then sqrt(x ) may be recovered by:

sqrt(x ) = 2- n /2 ·sqrt(x rescaled ),

which is a simple right shift.by n /2 .

Maximally stretch domain
Second, stretching the domain (via linear transformation) to as large an interval as possible will reduce the size of the polynomial coefficients, which may be necessary to put them within fixed-point range. For instance, in the case of ƒ(x) = sqrt(x ) on [0.25 1), the minimax polynomial coefficients are:

[0.17 1.72 -2.17 2.35 -1.45 0.37]

which cannot be realized directly in fixed point. However, if we stretch the domain by first applying the following linear transformation to x:

y = 2·(x – 0.5).

then ƒ(x) is equivalent to ƒ stretch (y ), where;

ƒ stretch (y ) = sqrt( 0.5·y + 0.5) for y in [-0.5 1).

the coefficients of the 5th order minimax polynomial for ƒ stretch are:

[ 0.71 0.35 -0.09 0.05 -0.03 0.01]

which all lie in fixed-point range.

Of course, domain stretching is not an operation that can be parallelized and thus will cost an additional cycle. Hence it may not always be advisable.

An alternative way to deal with coefficient overflow is to implement a scaled-down version of the function. For instance, the TI C24x math library computes 0.5·sqrt(x ) rather than sqrt(x ), because then the polynomial coefficients are all less than one. However, the downscaling always leads to some loss of accuracy (in this case, one bit).

We already described the minimax polynomial and how it may be computed using the COCA package. Of course, the actual coefficients must be fixed-point approximations to these theoretical best coefficients. It's not always true that simple rounding of the floating-point coefficients yields the best fixed-point coefficients. The fixed-point arithmetic may introduce further biases that add to the error. Figure 2 shows that considerable additional accuracy may be obtained by adjusting fixed-point coefficients. Figure 2: Errors for rounded fixed-point minimax and adjusted version, compared to floating-point minimax approximation error

Once the algorithm is coded, the coefficients should be fine-tuned to optimize overall algorithm performance.

Poor man's double precision
Once you've computed a single-precision function value using polynomial methods, it's easy to obtain double precision using recursion relations. For instance, a single iteration of the recursion relation for 1/2x shown above:

yn = (1 – yn-1 ·xyn-1 ·2

is sufficient to obtain a double-precision approximation yn from a single-precision value yn-1 . In the polynomial computations, it's sufficient to use a single-precision version of x : the double-precision value is only required for the final recursion relation.

Performance
Figure 3 shows the accuracy of a single-precision square root algorithm on (0 1), which was implemented according to the techniques described in the previous section. Figure 4 shows the accuracy of the corresponding algorithm in the TI C24x library. Both algorithms use 16 bits for the coefficients; both use a 5th order polynomial approximation (TI on [1/2 1), ours on [1/4 1) for reasons described above). Both have 16-bit multiply and 32-bit accumulation register. The maximum error for our algorithm is only about 1/3 as large as TI's; while the mean error for our algorithm is about 1.9, which is roughly 1/4 that of TI's. Figure 3: Errors for fixed-point polynomial approximation of ƒ(x) =sqrt(x), implemented using the techniques described in this paper Figure 4: Errors for fixed-point polynomial approximation of ƒ(x) =sqrt(x), from TI C24x Fixed Point Library

Chris Thron has been at Freescale (formerly Motorola) since 1998, working on various aspects of baseband receiver processing for 3G base stations for the cdma2000, UMTS, and TD-SCDMA standards. He has two patents granted and three patents pending. Thron has a BS in mathematics from Princeton University, a PhD in mathematics (probability) from the University of Wisconsin-Madison, and a PhD in physics from the University of Kentucky. You can reach him at .

This site uses Akismet to reduce spam. Learn how your comment data is processed.