# Approximating reality

In the early embedded days Real Programmers wrote their code in assembly language. They knew exactly what the program did and how many clock cycles it took to run.

Then customers wanted systems that worked with real-world data. Floating point reared its very ugly head. Anyone can do integer math in assembly, but floats are tough indeed.

Possibly the first solution was written by one C. E. Ohme. He contributed the source for a floating-point library to Intel's Insite library, a free service that far predated the open-source movement. This was long before the Internet, so Intel sent hard-copy listings to developers. You could also order the source on punched paper tape!

Ohme's package did the four basic bits of arithmetic on 32-bit floats, astonishingly using only 768 bytes of ROM. That's remarkable considering the 8008's limited instruction set, which had a hardware stack only seven levels deep.

Then customers wanted systems that did real-world computations. Roots and trig, logs and exponentiation were suddenly part of many embedded systems.

Today we hardly think about these constructs. Need to take the cosine of a number? Include **math.h** and use **double cos(double)**. Nothing to it. But that's not so easy in assembly language unless the processor has built-in floating-point instructions, a rare thing in the embedded space even now. So developers crafted a variety of packages that computed these more complex functions. To this day these sorts of routines are used in compiler runtime libraries. They're a great resource that gives us a needed level of abstraction.

But the packages unfortunately also isolate us from the grimy computational details, and that isolation can hide significant performance penalties. A fast desktop with built-in floating-point hardware can return a cosine in a few tens of nanoseconds. The same function often takes many milliseconds on an embedded system.

Or even longer. Everyone recognizes that to perform a double-precision trig function (as required by the C standard) would take about a week on an 8051, so most of those compilers cheat and return shorter floats.

Compiler vendors are in the unenviable position of trying to provide a runtime library that satisfies most users. Yet in the world of embedded systems we often have peculiar requirements that don't align with those of the rest of the world. What if you need a really speedy cosine but don't require a lot of precision? A look-up table is fast, but when accurate to more than a couple of decimal digits grows in size faster than congressional pork.

**Approximations**

Most of the complicated math functions we use compute what is inherently not computable. There's no precise solution to most trig functions, square roots, and the like, so libraries employ approximations of various kinds. We, too, can employ various approximations to meet peculiar embedded requirements.

College taught us that any differentiable function can be expressed by a Taylor series, such as the familiar one for cosine:

cos(x) = 1 - x^2/2! + x^4/4! - x^6/6! + ...

Like all approximations used in runtime packages, this is a polynomial that's clear, simple, and easy to implement. Unhappily, it takes dozens of terms to achieve even single-precision float accuracy, which is an enormous amount of computation. Plenty of other series expansions exist, many of which converge on an accurate answer much faster than the Taylor series. Entire books have been written on approximations and one could spend years mastering the subject. But we're engineers, not mathematicians. We need cookbook solutions now. Happily, computer scientists have come up with many.
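A quick experiment makes the point. Each Taylor term of cosine is a fixed multiple of the one before it, so we can count how many terms the series needs before they shrink below single-float resolution; `taylor_cos_terms` is my own throwaway helper, not anything from a runtime library:

```c
#include <math.h>

/* Count the terms of cos(x) = 1 - x^2/2! + x^4/4! - ... needed before
   the next term falls below single-precision noise (~1e-7). */
static int taylor_cos_terms(double x, double tol)
{
    double term = 1.0;    /* the n = 0 term */
    int n = 0;
    while (fabs(term) > tol && n < 1000) {
        n++;
        /* each term is the previous one times -x^2 / ((2n-1)(2n)) */
        term *= -x * x / ((2.0 * n - 1) * (2.0 * n));
    }
    return n + 1;         /* total terms summed, including the leading 1 */
}
```

Near zero the series converges in a handful of terms, but by x = 10 it needs a couple dozen, and the intermediate terms grow into the thousands before shrinking again, which also chews up precision.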

The best reference on the subject is the suitably titled *Computer Approximations* (second edition published in 1978 by John Hart et al.).^{1} It contains thousands of approximations for all of the normal functions like roots, exponentials, and trig, as well as approaches for esoteric relations like Bessel functions and elliptic integrals. It's a must-have for those of us working on real-time resource-constrained systems.

But it's out of print. You can pry my copy out of my cold, dead fingers when my wetware finally crashes for good. As I write this there are six copies available on Amazon Marketplace for about $250 each. Long ago I should have invested in Google, Microsoft, and *Computer Approximations*!^{2}

Be warned: it's written by eggheads for eggheads. The math behind the derivation of the polynomials goes over my head.

Tables of polynomial coefficients form the useful meat of the book. Before each class of function (trig, roots, and so forth) there's a handful of pages with suggestions about using the coefficients in a program. But one important point about actually using the coefficients lies enigmatically buried in the text, like an ancient clue in a Dan Brown thriller. So here's how to interpret the tables.

**How to interpret tables**

An “index” vectors you to the right set of tables. An index for square roots looks like this:

Which means: the set of coefficients labeled "0033" is for a polynomial of degree 4, accurate to 2.56 decimal digits for inputs between 1/4 and 1. Outside that range all bets are off. The polynomial is of the form *P(x)*. (Several other forms are used, like *P(x)/Q(x)*, so that bit of what seems like pedantry is indeed important.)

This is the corresponding table of coefficients:

```
P00  (+ 0)  +0.14743 837
P01  (+ 1)  +0.19400 802
P02  (+ 1)  -0.26795 117
P03  (+ 1)  +0.25423 691
P04  (+ 0)  -0.95312 89
```

The number in parentheses is the power of 10 to apply to each coefficient. Entry *P*02, prefixed by (+1), is really -2.6795117. It took me a long time to figure that out. Knowing this, the polynomial for the square root of *x* over that limited range, accurate to 2.56 digits, is:

sqrt(x) ≈ 0.14743837 + 1.9400802x - 2.6795117x^2 + 2.5423691x^3 - 0.9531289x^4
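In code, that polynomial is best evaluated with Horner's rule: four multiplies and four adds, no explicit powers. A sketch, with the coefficients transcribed from the table above (the name `sqrt_0033` is mine):

```c
#include <math.h>

/* Hart index 0033: sqrt(x) for x in [0.25, 1], about 2.56 decimal digits. */
static double sqrt_0033(double x)
{
    return 0.14743837 + x * ( 1.9400802
                      + x * (-2.6795117
                      + x * ( 2.5423691
                      + x * (-0.9531289))));
}
```

Nesting the multiplies this way also tends to be kinder to rounding error than summing explicit powers.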

Other indexes in the square root chapter point to coefficient sets that carry sqrt(x) out to 24.78 decimal digits of accuracy. All of them, though, are valid only over a limited range, so we need range reduction. The trick rests on the identity:

sqrt(x) = (2^k) × sqrt(x × 2^(-2k))

Pick a *k* till *x* × 2^(-2k) is in the required range. Then use the polynomial to take the square root, and adjust the answer by 2^k. Listing 1 implements the algorithm.

**Listing 1 Code to reduce the range of a square root**

```c
/*****************************************************
 * reduce_sqrt - The square root routines require an input argument in
 * the range [0.25, 1]. This routine reduces the argument to that range.
 *
 * Return values:
 *  - "reduced_arg", which is the input arg/2^(2k)
 *  - "scale", which is the sqrt of the scaling factor, or 2^k
 *
 * Assumptions made:
 *  - The input argument is > zero
 *  - The input argument cannot exceed +2^31 or be under 2^-31.
 *
 * Possible improvements:
 *  - To compute arg * 2^(-2k) we do the division (arg/(2^2k)). It's
 *    much more efficient to decrement the characteristic of "arg"
 *    (i.e., monkey with the floating-point representation of "arg")
 *    the appropriate number of times. But that's far less portable
 *    than the less efficient approach shown here.
 *
 * How it works:
 *  The algorithm depends on the following relation:
 *      sqrt(arg) = (2^k) * sqrt(arg * 2^(-2k))
 *  We pick a "k" such that 2^(-2k) * arg is less than 1 and greater
 *  than 0.25.
 *
 *  The ugly "for" loop shifts a one through two_to_2k while shifting a
 *  long version of the input argument right two bits at a time,
 *  repeating till the long version is all zeroes. Thus, two_to_2k and
 *  the input argument have this relationship:
 *
 *      two_to_2k    input argument    sqrt(two_to_2k)
 *          4            1 to 4               2
 *         16            5 to 16              4
 *         64           17 to 64              8
 *        256           65 to 256            16
 *
 *  Note that when we're done and divide the input arg by two_to_2k,
 *  the result is in the desired [0.25, 1] range.
 *
 *  We also must return "scale", which is sqrt(two_to_2k), but we
 *  prefer not to do the computationally expensive square root.
 *  Instead note (as shown in the table above) that scale simply
 *  doubles with each iteration.
 *
 *  There's a special case for the input argument being less than the
 *  [0.25, 1] range. In this case we multiply by a huge number, one
 *  sure to bring virtually every argument up to 0.25, and then adjust
 *  "scale" by the square root of that huge number.
 *****************************************************/
void reduce_sqrt(double arg, double *reduced_arg, double *scale)
{
  long two_to_2k;                       // divisor we're looking for: 2^(2k)
  long l_arg;                           // long (32-bit) version of input argument
  const double huge_num = 1073741824;   // 2**30
  const double sqrt_huge_num = 32768;   // sqrt(2**30)

  if (arg >= 0.25) {
    // shift arg to zero while computing two_to_2k as described above
    l_arg = (long) arg;                 // l_arg is long version of input arg
    for (two_to_2k = 1, *scale = 1.0; l_arg != 0;
         l_arg >>= 2, two_to_2k <<= 2, *scale *= 2.0)
      ;                                 // null body; the work is in the loop header
    *reduced_arg = arg / (double) two_to_2k;   // normalize argument to [0.25, 1]
  } else {                              // for small arguments:
    arg = arg * huge_num;               // make the number big
    l_arg = (long) arg;                 // l_arg is long version of input arg
    for (two_to_2k = 1, *scale = 1.0; l_arg != 0;
         l_arg >>= 2, two_to_2k <<= 2, *scale *= 2.0)
      ;                                 // null body
    *scale = *scale / sqrt_huge_num;
    *reduced_arg = arg / (double) two_to_2k;   // normalize argument to [0.25, 1]
  }
}
```
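To show the pieces working together, here's a sketch that chains a compact restatement of Listing 1's reduction with the degree-4 polynomial from the table above; the glue function `sqrt_approx` is my own, not Hart's:

```c
#include <math.h>

/* Degree-4 polynomial (Hart index 0033): sqrt(x), x in [0.25, 1], ~2.56 digits. */
static double sqrt_poly(double x)
{
    return 0.14743837 + x * (1.9400802 + x * (-2.6795117
                      + x * (2.5423691 + x * (-0.9531289))));
}

/* Compact version of Listing 1's range reduction. */
static void reduce_sqrt(double arg, double *reduced_arg, double *scale)
{
    const double huge_num = 1073741824.0;   /* 2**30 */
    const double sqrt_huge_num = 32768.0;   /* sqrt(2**30) */
    long two_to_2k, l_arg;

    if (arg < 0.25) {                       /* small arguments: pre-scale up */
        arg *= huge_num;
        l_arg = (long) arg;
        for (two_to_2k = 1, *scale = 1.0; l_arg != 0;
             l_arg >>= 2, two_to_2k <<= 2, *scale *= 2.0)
            ;
        *scale /= sqrt_huge_num;
    } else {
        l_arg = (long) arg;
        for (two_to_2k = 1, *scale = 1.0; l_arg != 0;
             l_arg >>= 2, two_to_2k <<= 2, *scale *= 2.0)
            ;
    }
    *reduced_arg = arg / (double) two_to_2k;   /* now in [0.25, 1] */
}

/* Approximate square root for positive arguments, good to ~2.5 decimal digits. */
static double sqrt_approx(double x)
{
    double reduced, scale;
    reduce_sqrt(x, &reduced, &scale);
    return scale * sqrt_poly(reduced);
}
```

Accuracy tracks the polynomial, about two and a half decimal digits, for any positive argument the reduction can reach.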

Here's a sampling of interesting polynomials for square roots. Note that they can all be used with the range-reduction scheme I've described.

With just two terms, using the form *P(x)* over the range [1/100, 1], this one is accurate to 0.56 decimal digits and is very fast:

```
P00  (+0)  +0.11544 2
P01  (+1)  +0.11544 2
```

A different form, *P(x)/Q(x)*, gives 3.66 digits over 1/4 to 1:

```
P00  (- 1)  +0.85805 2283
P01  (+ 1)  +0.10713 00909
P02  (+ 0)  +0.34321 97895
Q00  (+ 0)  +0.50000 08387
Q01  (+ 1)  +0.1
```
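Rational forms cost one division but deliver more digits per coefficient. A sketch of evaluating this one, with the powers of ten from the table already applied (the name `sqrt_pq` is mine):

```c
#include <math.h>

/* sqrt(x) for x in [0.25, 1] via the P(x)/Q(x) pair above: ~3.66 digits. */
static double sqrt_pq(double x)
{
    double p = 0.0858052283 + x * (1.071300909 + x * 0.3432197895);
    double q = 0.5000008387 + x;   /* Q01 is exactly 1.0 */
    return p / q;
}
```

Five coefficients and one divide buy a full decimal digit more than the five-coefficient pure polynomial above, a typical trade when division isn't much slower than multiplication.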

The following yields 8.95 digits of accuracy using the form *P(x)/Q(x)*, but works over the narrower 1/2 to 1 range. Use the usual range-reduction algorithm to narrow *x* to 0.25 to 1; if the reduced argument is under 1/2, double it and then divide the final result by the square root of two:

```
P00  (+ 0)  +0.29730 27887 4025
P01  (+ 1)  +0.89403 07620 6457
P02  (+ 2)  +0.21125 22405 69754
P03  (+ 1)  +0.59304 94459 1466
Q00  (+ 1)  +0.24934 71825 3158
Q01  (+ 2)  +0.17764 13382 80541
Q02  (+ 2)  +0.15035 72331 29921
Q03  (+ 1)  +0.1
```

Hart's book also contains coefficients for cube roots. Reduce the argument's range to 0.5 to 1 (generally useful with his cube root approximations) using the relation (see Listing 2 for details):

cbrt(x) = (2^k) × cbrt(x × 2^(-3k))

**Listing 2 Code to reduce the range of a cube root**

```c
/*****************************************************
 * reduce_cbrt - The cube root routines require an input argument in
 * the range [0.5, 1]. This routine reduces the argument to that range.
 *
 * Return values:
 *  - "reduced_arg", which is the input arg/2^(3k)
 *  - "scale", which is the cbrt of the scaling factor, or 2^k
 *
 * Assumptions made:
 *  - The input argument cannot exceed +/-2^31 or be under +/-2^-31.
 *
 * Possible improvements:
 *  - To compute arg * 2^(-3k) we do the division (arg/(2^3k)). It's
 *    much more efficient to decrement the characteristic of "arg"
 *    (i.e., monkey with the floating-point representation of "arg")
 *    the appropriate number of times. But that's far less portable
 *    than the less efficient approach shown here.
 *  - The corrections made when the result is still under 1/2 scale it
 *    by two; it's more efficient to increment the characteristic.
 *
 * How it works:
 *  The algorithm depends on the following relation:
 *      cbrt(arg) = (2^k) * cbrt(arg * 2^(-3k))
 *  We pick a "k" such that 2^(-3k) * arg is less than 1 and greater
 *  than 0.5.
 *
 *  The ugly "for" loop shifts a one through two_to_3k while shifting a
 *  long version of the input argument right three bits at a time,
 *  repeating till the long version is all zeroes. Thus, two_to_3k and
 *  the input argument have this relationship:
 *
 *      two_to_3k    input argument    cbrt(two_to_3k)
 *          8            1 to 7               2
 *         64            8 to 63              4
 *        512           64 to 511             8
 *       4096          512 to 4095           16
 *
 *  Note that when we're done and divide the input arg by two_to_3k,
 *  the result is between [1/8, 1]. The following algorithm reduces it
 *  to [1/2, 1]:
 *
 *      if (reduced_arg is between [1/8, 1/2])
 *          multiply it by two and correct the scale by the cube root of two.
 *      if (reduced_arg is between [1/4, 1/2])
 *          multiply it by two and correct the scale by the cube root of two.
 *
 *  Note that if the argument was between [1/8, 1/4], both of those
 *  "if"s will execute.
 *
 *  We also must return "scale", which is cbrt(two_to_3k), but we
 *  prefer not to do the computationally expensive cube root. Instead
 *  note (as shown in the table above) that scale simply doubles with
 *  each iteration.
 *
 *  There's a special case for the input argument being less than the
 *  [0.5, 1] range. In this case we multiply by a huge number, one
 *  sure to bring virtually every argument up to 0.5, and then adjust
 *  "scale" by the cube root of that huge number.
 *
 *  The code takes care of the special case that the cube root of a
 *  negative number is negative by passing the absolute value of the
 *  number to the approximating polynomial, and then making "scale"
 *  negative.
 *****************************************************/
void reduce_cbrt(double arg, double *reduced_arg, double *scale)
{
  long two_to_3k;                          // divisor we're looking for: 2^(3k)
  long l_arg;                              // long (32-bit) version of input argument
  const double huge_num = 1073741824;      // 2**30
  const double cbrt_huge_num = 1024;       // cbrt(2**30)
  const double cbrt_2 = 1.25992104989487;  // cbrt(2)

  *scale = 1.0;
  if (arg < 0) {          // if negative arg, take abs(arg) and set scale to -1
    arg = fabs(arg);      // so the polynomial routine will give a negative result
    *scale = -1.0;
  }
  if (arg >= 0.5) {
    // shift arg to zero while computing two_to_3k as described above
    l_arg = (long) arg;   // l_arg is long version of input arg
    for (two_to_3k = 1; l_arg != 0;
         l_arg >>= 3, two_to_3k <<= 3, *scale *= 2.0)
      ;                   // null body; the work is in the loop header
    *reduced_arg = arg / (double) two_to_3k;   // normalize argument to [1/8, 1]
    if (*reduced_arg < 0.5) {   // if in the range [1/8, 1/2] correct it
      *reduced_arg *= 2;
      *scale /= cbrt_2;
    }
    if (*reduced_arg < 0.5) {   // if in the range [1/4, 1/2] correct it
      *reduced_arg *= 2;
      *scale /= cbrt_2;
    }
  } else {                // for small arguments:
    arg = arg * huge_num; // make the number big
    l_arg = (long) arg;   // l_arg is long version of input arg
    for (two_to_3k = 1; l_arg != 0;
         l_arg >>= 3, two_to_3k <<= 3, *scale *= 2.0)
      ;                   // null body
    *scale = *scale / cbrt_huge_num;
    *reduced_arg = arg / (double) two_to_3k;   // normalize argument to [1/8, 1]
    if (*reduced_arg < 0.5) {   // if in the range [1/8, 1/2] correct it
      *reduced_arg *= 2;
      *scale /= cbrt_2;
    }
    if (*reduced_arg < 0.5) {   // if in the range [1/4, 1/2] correct it
      *reduced_arg *= 2;
      *scale /= cbrt_2;
    }
  }
}
```

With just two terms in the form *P* (*x* ) these coefficients give a cube root accurate to 1.24 digits over the range 1/8 to 1:

```
P00  (+ 0)  +0.45316 35
P01  (+ 0)  +0.60421 81
```

Adding a single term improves accuracy to 3.20 digits over 1/2 to 1:

```
P00  (+ 0)  +0.49329 5663
P01  (+ 0)  +0.69757 0456
P02  (+ 0)  -0.19150 216
```
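Those three terms pair naturally with Listing 2's range reduction, which delivers arguments in exactly the [0.5, 1] range this polynomial wants. A self-contained sketch combining them (a compact rewrite with my own function names, not Hart's code):

```c
#include <math.h>

/* 3-term polynomial: cbrt(x) for x in [0.5, 1], ~3.20 decimal digits. */
static double cbrt_poly(double x)
{
    return 0.493295663 + x * (0.697570456 + x * (-0.19150216));
}

/* Compact version of Listing 2's range reduction. */
static void reduce_cbrt(double arg, double *reduced_arg, double *scale)
{
    const double huge_num = 1073741824.0;     /* 2**30 */
    const double cbrt_huge_num = 1024.0;      /* cbrt(2**30) */
    const double cbrt_2 = 1.25992104989487;   /* cbrt(2) */
    long two_to_3k, l_arg;

    *scale = 1.0;
    if (arg < 0) {                  /* cbrt(-x) = -cbrt(x) */
        arg = -arg;
        *scale = -1.0;
    }
    if (arg < 0.5) {                /* small arguments: pre-scale up */
        arg *= huge_num;
        *scale /= cbrt_huge_num;
    }
    l_arg = (long) arg;
    for (two_to_3k = 1; l_arg != 0;
         l_arg >>= 3, two_to_3k <<= 3, *scale *= 2.0)
        ;
    *reduced_arg = arg / (double) two_to_3k;   /* now in [1/8, 1] */
    while (*reduced_arg < 0.5) {               /* pull up into [1/2, 1] */
        *reduced_arg *= 2.0;
        *scale /= cbrt_2;
    }
}

/* Approximate cube root, good to ~3 decimal digits, negative args included. */
static double cbrt_approx(double x)
{
    double reduced, scale;
    reduce_cbrt(x, &reduced, &scale);
    return scale * cbrt_poly(reduced);
}
```

The `while` loop replaces the pair of `if`s in Listing 2; since the reduced argument is never below 1/8 it runs at most twice.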

Kick accuracy up by more than a notch, to 11.75 digits over 1/2 to 1, using the form *P(x)/Q(x)*:

```
P00  (+ 0)  +0.22272 47174 61818
P01  (+ 1)  +0.82923 28023 86013 7
P02  (+ 2)  +0.35357 64193 29784 39
P03  (+ 2)  +0.29095 75176 33080 76
P04  (+ 1)  +0.37035 12298 99201 9
Q00  (+ 1)  +0.10392 63150 11930 2
Q01  (+ 2)  +0.16329 43963 24801 67
Q02  (+ 2)  +0.39687 61066 62995 25
Q03  (+ 2)  +0.18615 64528 78368 42
Q04  (+ 1)  +0.1
```

I've coded a number of square and cube root approximations in C, with test routines to check their accuracy. Find them at www.embedded.com/code. Although all of the routines use doubles to ensure fair comparisons, the lower-precision ones will work fine with floats.

There's another very valuable resource that discusses these sorts of approximations and more (integer versions, for instance): Jack Crenshaw's wonderful *Math Toolkit for Real-Time Programming*. It's a much easier read than Hart's book and covers other aspects of these issues.

More next month!

**Jack G. Ganssle** is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at .

**Endnotes:**

1. Hart, John F., E. W. Cheney, Charles L. Lawson, Hans J. Maehly, Charles K. Mesztenyi, John R. Rice, Henry G. Thacher, and Christoph Witzgall. *Computer Approximations. 2nd edition.* Malabar, FL: Robert Krieger Publishing Company, 1978.

2. The original 1968 version, published by Wiley, is listed as $25 on the Amazon Marketplace. The second edition from 1978 is listed at $263.