How to code fast, accurate math functions on DSP parallel devices
When programming for parallel engines, the requirements for efficient algorithms change drastically. Many mathematical algorithms, such as root solvers, are fast on single-engine processors but aren't suitable for parallel engines because the algorithm's steps must be performed in sequence and not simultaneously. This article presents some general techniques for efficient coding of mathematical functions on parallel reduced instruction set computer (RISC) devices. A fixed-point square root function implemented using these techniques compares favorably with an industry benchmark (the square root function from Texas Instruments' C24x math library).
Communications digital signal processors (DSPs) are going parallel. In cellular baseband processing, for instance, newer devices have a high degree of parallelism to perform matrix and vector processing on high-rate data. When you're programming mathematical functions (such as square root or inverse), the requirements for efficient algorithms are drastically different from conventional DSP programming. Many mathematical algorithms (such as root solvers) are fast on single-engine processors but aren't suitable for parallel engines because the algorithm steps have sequential dependence and can't be performed simultaneously.
This article presents some general techniques for efficient coding of mathematical functions on single instruction multiple data (SIMD) or multiple instruction multiple data (MIMD) DSP engines. These techniques are particularly appropriate for reduced instruction set computer (RISC) devices with an efficient MAC (multiply-accumulate) operation.
The alternatives
Several possible approaches to DSP coding of mathematical functions include:
• Lookup table (LUT) or interpolated LUT: Function values ƒ_n ≡ ƒ(x_n) for a closely spaced set {x_n} of function arguments are stored in a table.
• Polynomial approximation: ƒ(x) is approximated by a single polynomial over the entire domain.
• Recursive methods: For instance, the function ƒ(x) = 1/(2x) can be estimated recursively using the relation:

y_n = (1 − y_{n−1}·x)·y_{n−1}·2

Then y_n converges to 1/(2x) as n → ∞.
• Spline approximation (piecewise polynomial approximation): The domain of ƒ(x) is divided into subintervals, and the function is approximated by a low-order (often cubic) polynomial on each subinterval.
Compared with the other three alternatives, the polynomial approximation approach is the most amenable to parallelization. For instance, the computation of a 7th-order polynomial:

P ≡ a_7·x^7 + a_6·x^6 + a_5·x^5 + a_4·x^4 + a_3·x^3 + a_2·x^2 + a_1·x + a_0

can be accomplished in five cycles on a four-element SIMD machine, as shown in Table 1 (assuming the appropriate I/O connections). Blank cells indicate that the operation in that cell is not needed.
Step # | Element 1       | Element 2       | Element 3       | Element 4
1      | b_3 = a_7·x+a_6 | b_2 = a_5·x+a_4 | b_1 = a_3·x+a_2 | b_0 = a_1·x+a_0
2      | y = x^2         | y = x^2         | y = x^2         | y = x^2
3      | c_3 = b_3·y+b_2 |                 | c_1 = b_1·y+b_0 |
4      | z = y^2         | z = y^2         | z = y^2         | z = y^2
5      | ans = c_3·z+c_1 |                 |                 |

Table 1: Computing a 7th-order polynomial in five cycles with a SIMD machine
Alternatively, the computation can be done in three cycles on a five-element MIMD machine as shown in Table 2.
Step # | Element 1       | Element 2       | Element 3       | Element 4       | Element 5
1      | b_3 = a_7·x+a_6 | b_2 = a_5·x+a_4 | b_1 = a_3·x+a_2 | b_0 = a_1·x+a_0 | y = x^2
2      | c_3 = b_3·y+b_2 |                 | c_1 = b_1·y+b_0 |                 | z = y^2
3      | ans = c_3·z+c_1 |                 |                 |                 |

Table 2: Computing a 7th-order polynomial in three cycles with a MIMD machine
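The scheduling in Tables 1 and 2 is an Estrin-style splitting of the polynomial: pair the coefficients, then combine the pairs using x^2 and x^4. The following is a minimal scalar C sketch of that same sequence of operations (the function name and sequential form are mine; on the actual device the four b-terms, and later the two c-terms, run in parallel):

```c
/* Scalar sketch of the Estrin-style 7th-order polynomial evaluation
 * scheduled in Tables 1 and 2. Coefficients are passed as a[0..7],
 * where a[i] multiplies x^i. */
static double poly7(const double a[8], double x)
{
    double b3 = a[7]*x + a[6];   /* step 1: four independent MACs */
    double b2 = a[5]*x + a[4];
    double b1 = a[3]*x + a[2];
    double b0 = a[1]*x + a[0];
    double y  = x * x;           /* step 2: y = x^2 */
    double c3 = b3*y + b2;       /* step 3: two independent MACs */
    double c1 = b1*y + b0;
    double z  = y * y;           /* step 4: z = x^4 */
    return c3*z + c1;            /* step 5: final MAC */
}
```

Each line in a given "step" depends only on results from earlier steps, which is what lets a SIMD or MIMD engine execute the lines of a step simultaneously.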
Besides supporting parallelization, the polynomial approach has other significant advantages:
Low memory requirements: Only the polynomial coefficients need to be stored, and usually on the order of 10 (or fewer) coefficients are required for very accurate (16-bit) polynomial approximations of mathematical functions on the fixed-point range [−1, 1). This gives polynomial approximation a huge advantage over LUT implementation: a pure LUT implementation with 16-bit accuracy requires storing roughly 2^20 bits of information.
Universal applicability: Polynomial approximation may be validly applied to any continuous function. This is a consequence of the Weierstrass approximation theorem, which is discussed in Atkinson's book, “An Introduction to Numerical Analysis.”^{[1]} The Weierstrass approximation theorem may be stated as follows:
Let ƒ : [a, b] → R be a continuous function defined on the closed interval [a, b]. Then for any ε > 0, there exists a polynomial function p_ε(x) such that |ƒ(x) − p_ε(x)| < ε for all x in [a, b].
Recursive methods are only applicable to special functions for which easily computable recursion relations can be derived and cannot be used for a general continuous function.
Linear (non-branching) code flow: No branching instructions are required for polynomial computation. The same instructions are executed regardless of the input to the function. For this reason, you can straightforwardly compute multiple function values for multiple inputs on a highly parallelized device. In contrast, the spline approach has less inherent parallelism, since different polynomials must be computed for different function inputs.
Designer polynomials
If we focus on the polynomial approach, a key element of efficient DSP coding of mathematical functions is finding the best possible polynomial approximation. The sense of "best" here must be defined more precisely. On one hand, the polynomial should have as small an order as possible: the smaller the order, the fewer the terms, and the fewer the steps in the computation. On the other hand, the polynomial should be as "accurate" as possible for the given order (the precise meaning of "accurate" is discussed below). In short, we're looking for a polynomial that combines minimum order with maximum accuracy. In practice, this usually means finding the polynomial of minimum order that meets a given accuracy criterion.
As we've indicated, “accurate” must be more precisely defined. Several different accuracy criteria are reasonable, including:
(1) Minimize maximum absolute error
(2) Minimize mean absolute error
(3) Minimize mean squared error
In the case of practical coding, the programmer usually designs for the “worst case” scenario, which here corresponds to Criterion 1. Accordingly, this is the criterion we'll use in the following discussion.
The polynomial of a given order D which minimizes the maximum absolute error from a given continuous function ƒ(x) is called the minimax polynomial for ƒ(x) of order D . Figure 1 shows the advantage of minimax polynomial approximation over Taylor series approximation, in the case of ƒ(x) = sqrt(x ) for x in the interval [1/2, 1]. The Taylor series approximation is very good near the expansion point (in this case the expansion point is x = 3/4), but errors increase rapidly near the edge of the interval. On the other hand, the minimax polynomial error is fairly uniform over the entire range.
Figure 1: Approximation error for ƒ(x) = sqrt(x) on [0.5, 1], for minimax and Taylor polynomial approximation
The Remez (or Remes) algorithm (also discussed in Atkinson's book^{[1]}) gives a constructive procedure for finding minimax polynomials. Some readers may be familiar with the Remez algorithm as applied to filter design. There is a slight difference between the two applications: filtering is concerned with minimax approximation by trigonometric polynomials [polynomials in cos(x) or sin(x)], while in our application we're interested in polynomials in the variable x.
Computational packages that perform the Remez algorithm may be obtained from the web. One that I highly recommend is the COCA (Complex Linear Chebyshev Approximation) package, authored by Bernd Fischer and Jan Modersitzki. ^{[2]} The package is written in MATLAB.
Approximation method
Using the minimax polynomial is only a part (albeit a key part) of an overall polynomial approximation method. An overall method for polynomial approximation of mathematical functions using fixed-point DSPs is described as follows:
Minimize function domain
The smaller the function domain, the more accurate the minimax polynomial approximation for a given polynomial order. Hence, it's beneficial to choose the smallest possible function domain into which inputs can conveniently be transformed.
Consider for instance the function ƒ(x) = 1/x. A suitable domain for ƒ(x) is [0.5, 1), which may be seen as follows. Any fixed-point input may be scaled into this domain by taking the absolute value and then left-shifting by n, where n+1 is the number of leading zeros in the signed binary fractional representation of x (many DSPs have a special instruction that counts the number of leading zeros). Mathematically, the rescaled argument is given by:

x_rescaled = 2^n·sign(x)·x.

Once x_rescaled^−1 is found, then x^−1 may be recovered by:

x^−1 = sign(x)·2^n·x_rescaled^−1.
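As a sanity check of this bookkeeping, here is a minimal C sketch. The normalization loop and `ldexp` stand in for the DSP's leading-zero-count instruction and shifts, and the exact division `1.0/xr` stands in for the polynomial approximation on [0.5, 1), so only the scaling logic is exercised:

```c
#include <math.h>

/* Sketch of the argument reduction for f(x) = 1/x. Requires x != 0.
 * On a real device, n would come from a count-leading-zeros
 * instruction and the shifts would be integer shifts. */
static double reciprocal_via_rescale(double x)
{
    int    s  = (x < 0) ? -1 : 1;
    double ax = fabs(x);

    /* find n such that x_rescaled = 2^n * |x| lies in [0.5, 1) */
    int    n  = 0;
    double xr = ax;
    while (xr <  0.5) { xr *= 2.0; n++; }
    while (xr >= 1.0) { xr /= 2.0; n--; }

    /* a polynomial would approximate 1/x_rescaled on [0.5, 1);
     * the exact value is used here to verify the scaling only */
    double inv_xr = 1.0 / xr;

    /* x^-1 = sign(x) * 2^n * x_rescaled^-1 */
    return s * ldexp(inv_xr, n);
}
```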
In the TI C24x math library, the domain [0.5, 1) is also used for the function ƒ(x) = sqrt(x).[3] However, it's not convenient to recover sqrt(x) from sqrt(x_rescaled) if x is rescaled to [0.5, 1) as in the previous example, because the required scale factor 2^{−n/2} is not a whole-bit shift when n is odd. A more convenient scaling is as follows:

x_rescaled = 2^n·x,

where n is now an even shift count chosen so that x_rescaled lies in [0.25, 1) for x in the positive fixed-point range (0, 1). This rescaling amounts to n/2 left shifts of two bits each. Once sqrt(x_rescaled) has been computed, then sqrt(x) may be recovered by:

sqrt(x) = 2^{−n/2}·sqrt(x_rescaled),

which is a simple right shift by n/2.
Maximally stretch domain
Second, stretching the domain (via a linear transformation) to as large an interval as possible will reduce the size of the polynomial coefficients, which may be necessary to bring them within fixed-point range. For instance, in the case of ƒ(x) = sqrt(x) on [0.25, 1), the minimax polynomial coefficients are:
[0.17 1.72 2.17 2.35 1.45 0.37]
which cannot be realized directly in fixed point. However, if we stretch the domain by first applying the following linear transformation to x:

y = 2·(x − 0.5),

then ƒ(x) is equivalent to ƒ_stretch(y), where:

ƒ_stretch(y) = sqrt(0.5·y + 0.5) for y in [−0.5, 1).
The coefficients of the 5th-order minimax polynomial for ƒ_stretch are:
[ 0.71 0.35 0.09 0.05 0.03 0.01]
which all lie in fixedpoint range.
Of course, domain stretching is not an operation that can be parallelized and thus will cost an additional cycle. Hence it may not always be advisable.
An alternative way to deal with coefficient overflow is to implement a scaled-down version of the function. For instance, the TI C24x math library computes 0.5·sqrt(x) rather than sqrt(x), because then the polynomial coefficients are all less than one. However, the downscaling always leads to some loss of accuracy (in this case, one bit).
Adjust fixed-point coefficients
We've already described the minimax polynomial and how it may be computed using the COCA package. Of course, the actual coefficients must be fixed-point approximations to these theoretical best coefficients. It's not always true that simple rounding of the floating-point coefficients yields the best fixed-point coefficients: the fixed-point arithmetic may introduce further biases that add to the error. Figure 2 shows that considerable additional accuracy may be obtained by adjusting the fixed-point coefficients.
Figure 2: Errors for rounded fixed-point minimax and adjusted version, compared to floating-point minimax approximation error
Once the algorithm is coded, the coefficients should be finetuned to optimize overall algorithm performance.
Poor man's double precision
Once you've computed a single-precision function value using polynomial methods, it's easy to obtain double precision using recursion relations. For instance, a single iteration of the recursion relation for 1/(2x) shown above:

y_n = (1 − y_{n−1}·x)·y_{n−1}·2

is sufficient to obtain a double-precision approximation y_n from a single-precision value y_{n−1}. In the polynomial computations, it's sufficient to use a single-precision version of x: the double-precision value is only required for the final recursion relation.
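The refinement step can be written directly from the recursion (the helper name is mine). If y carries a relative error e, one step leaves a relative error of −e², which is why a single iteration roughly doubles the precision:

```c
/* One Newton-style refinement of an estimate y of 1/(2x):
 * y_n = (1 - y_{n-1}*x) * y_{n-1} * 2.
 * A relative error e in y becomes -e^2 after one step. */
static double refine_half_recip(double x, double y)
{
    return (1.0 - y * x) * y * 2.0;
}
```

For example, with x = 0.5 the exact value is 1/(2x) = 1; starting from the low-precision estimate y = 0.99 (error 10^-2), one step lands within about 10^-4 of the answer.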
Performance
Figure 3 shows the accuracy of a single-precision square root algorithm on (0, 1), implemented according to the techniques described in the previous section. Figure 4 shows the accuracy of the corresponding algorithm in the TI C24x library. Both algorithms use 16 bits for the coefficients; both use a 5th-order polynomial approximation (TI on [1/2, 1), ours on [1/4, 1), for the reasons described above). Both assume a 16-bit multiply and a 32-bit accumulation register. The maximum error for our algorithm is only about one-third as large as TI's, while the mean error for our algorithm is about 1.9, roughly one-fourth that of TI's.
Figure 3: Errors for fixed-point polynomial approximation of ƒ(x) = sqrt(x), implemented using the techniques described in this paper
Figure 4: Errors for fixed-point polynomial approximation of ƒ(x) = sqrt(x), from the TI C24x Fixed-Point Library
Chris Thron has been at Freescale (formerly Motorola) since 1998, working on various aspects of baseband receiver processing for 3G base stations for the cdma2000, UMTS, and TD-SCDMA standards. He has two patents granted and three patents pending. Thron has a BS in mathematics from Princeton University, a PhD in mathematics (probability) from the University of Wisconsin-Madison, and a PhD in physics from the University of Kentucky. You can reach him at .
ENDNOTES
[1] K.A. Atkinson, An Introduction to Numerical Analysis . John Wiley and Sons: New York, 1978.
[2] Bernd Fischer and Jan Modersitzki, "COCA User's Guide: Description of a collection of MATLAB function files for the computation of linear Chebyshev approximations in the complex plane," http://www.math.mu-luebeck.de/workers/modersitzki/COCA/coca5.html
[3] "C24x Fixed-Point Math Library," Texas Instruments documentation, SPRC068, May 2002, http://focus.ti.com/docs/tool/toolfolder.jhtml?PartNumber=SPRC068.