Improve your root-mean calculations -

Improve your root-mean calculations


Real-time digital systems often require the calculation of a root-mean, such as a root-mean square (RMS) level or average magnitude of a complex signal. While averaging can be efficiently implemented by most microprocessors, the square root may not be–especially with low-cost hardware. If the processor doesn't implement a fast square root function, it must be implemented in software; although this yields accurate results, it may not be efficient.

One common method for computing the square root is Newton's method, which iteratively converges on a solution using an initial estimate. Since we're computing the square root of a slowly varying average value, the previous root-mean value makes a good estimate. Furthermore, we can combine the iterative Newton's method with a first-order recursive averager, resulting in a super-efficient method for computing the root-mean of a signal.

In this article, I'll develop and present three efficient recursive algorithms for computing the root-mean, illustrating each method with signal flow diagrams and example code. To some degree, each of these methods trades hardware complexity for error. I'll compare the computational performance and error of each method and suggest suitable hardware for each implementation.

The root-mean is computed as the square root of the average over time of its input. This average may be recursive or non-recursive, and I'll briefly review the general case for both.

Non-recursive average
The non-recursive average, or moving average , is the weighted sum of N inputs: the current input and N -1 previous inputs. In digital filtering terminology, this is called a finite impulse response , or FIR filter:


The most common use of the moving average typically sets the weights such that an =1/N . If we were to plot these weights versus time, we would see the “window” of the input signal that is averaged at a given point in time. This 1/N window is called a rectangular window because its shape is an N -by-1/N rectangle.

There is a trick for computing the 1/N average so that all N samples need not be weighted and summed with each output calculation. Since the weights don't change, you can simply add the newest weighted input and subtract the N th weighted input from the previous sum:


While this technique is computationally efficient, it requires storage and circular-buffer management of N samples.

Of course, many other window shapes are commonly used. Typically, these window shapes resemble, or are a variation of, a raised cosine between –π/2 and π/2. These windows weight the samples in the center more than the samples near the edges. Generally speaking, you should only use one of these windows when there is a specific need to, such as applying a specific filter to the signal. The disadvantage of these windows is that computational complexity and storage requirements increase with N .

Recursive average
The recursive average is the weighted sum of the input, N previous inputs, and M previous outputs:


The simplest of these in terms of computational complexity and storage (while still being useful) is the first-order recursive average. In this case, the average is computed as the weighted sum of the current input and the previous output. The first-order recursive average also lends itself to an optimization when combined with the computation of the square root, which we'll discuss shortly.

In contrast to the non-recursive average, the first-order recursive average's window is a decaying exponential (Figure 1). Technically, the recursive average has an infinite window, since it never decays all the way to zero. In digital filtering terminology, this is known as an infinite impulse response, or IIR filter.

View the full-size image

From Figure 1, we see that earlier samples are weighted more than later samples, allowing us to somewhat arbitrarily define an averaging time for the recursive average. For the first-order case, we define the averaging time as the time at which the impulse response has decayed to a factor of 1/e , or approximately 37%, of its initial value. An equivalent definition is the time at which the step response reaches 1–(1/e ), or approximately 63%, of its final value. Other definitions are possible but will not be covered here. The weighting of the sum determines this averaging time; to ensure unity gain, the sum of the weights must equal one. As a consequence, only one coefficient needs to be specified to describe the averaging time.

Therefore, for first-order recursive averaging, we compute the mean level as:


where x (n ) is the input, m (n ) is the mean value, and a is the averaging coefficient. The averaging coefficient is defined as:


where t is the averaging time, and fS is the sampling frequency. The root-mean may then be calculated by taking the square root of Equation 4:


where y (n ) is the root-mean.

Efficient computation methods
Googling “fast square root” will get you a plethora of information and code snippets on implementing fast square-root algorithms. While these methods may work just fine, they don't take into account the application in which the square root is required. Oftentimes, you may not need exact precision to the last bit, or the algorithm itself can be manipulated to optimize the computation of the square root. I present a few basic approaches here.

Only calculate it when you need it
Probably the simplest optimization is to only calculate the square root when you absolutely need it. Although this may seem obvious, it can be easily overlooked when computing the root-mean on every input sample. When you don't need an output value for every input sample, it makes more sense to compute the square root only when you read the output value.

One example of an application when this technique can be used is RMS metering of a signal. A meter value that is displayed visually may only require an update every 50 to 100ms, which may be far less often than the input signal is sampled. Keep in mind, however, that recursive averaging should still be computed at the Nyquist rate.

Recall that:


If you'll be computing the logarithm of a square root, it's far less computationally expensive to simply halve the result instead. A common example of this optimization is the calculation of an RMS level in dB, which may be simplified as follows:


Newton's Method
Newton's Method (also called the Newton-Rapson Method) is a well known iterative method for estimating the root of an equation.1 Newton's Method can be quite efficient when you have a reasonable estimate of the result. Furthermore, if accuracy to the last bit is not required, the number of iterations can be fixed to keep the algorithm deterministic.

We may approximate the root of f (x ) by iteratively calculating:


If we wish to find , then we need to find the root to the equation f (y )=y 2m . Substituting f (y ) into Equation 9, we get:


Rearranging Equation 9, we get:


where y (n ) is the approximation of the square root of m (n ).

Equation 11 requires a divide operation, which may be inconvenient for some processors. As an alternative, we can calculate and multiply the result by m to get . Again using Newton's Method, we find that we may iteratively calculate the reciprocal square root as:


and calculate the square root as:


Although Newton's Method for the reciprocal square root eliminates the divide operation, it can be problematic for fixed-point processors. Assuming that m(n) is a positive integer greater than 1, yr(n) will be a positive number less than one–beyond the range of representation for integer numbers. Implementation must be accomplished using floating-point or mixed integer/fractional number representation.

Root-mean using Newton's Method
A subtle difference between Equations 10 and 11 is that m becomes m (n ), meaning that we're attempting to find the square root of a moving target. However, since m (n ) is a mean value, or slowly varying, it can be viewed as nearly constant between iterations. Since y (n ) will also be slowly varying, y (n -1) will be a good approximation to y (n ) and require fewer iterations–one, we hope–to achieve a good estimate.

To calculate the root-mean, one may simply apply Newton's Method for calculating the square root to the mean value. As long as the averaging time is long compared to the sample period (t &62;&62; 1/fS ), one iteration of the square root calculation should suffice for reasonable accuracy. This seems simple enough, but we can actually improve the computational efficiency, which will be discussed in one of the following sections.

Using reciprocal square root
Unlike the iterative square-root method, however, the iterative reciprocal square-root requires no divide. This implementation is best suited for floating-point processing, which can efficiently handle numbers both greater and less than one. We present this implementation as a signal flow diagram in Figure 2. The averaging coefficient, a , is defined by Equation 5, and z –1 represents a unit sample delay.

View the full-size image

A code listing for a C++ class that implements the computation in Figure 2 is presented in Listing 1. In this example class, initialization is performed in the class constructor, and each call to CalcRootMean() performs one iteration of averaging and square-root computation.

Listing 1. C++ class that computes the root-mean using Newton's Method for the reciprocal square root

static const double Fs = 48000.0;   // sample ratestatic double AvgTime = 0.1;       // averaging time  
class RecipRootMean{   public:  
   double Mean;   double RecipRootMean;   double AvgCoeff;  
   RecipRootMean()   {      AvgCoeff = 1.0 - exp( -1.0 / (Fs * AvgTime) );      Mean = 0.0;      RecipRootMean = 1.0e-10;      // 1 > initial RecipRootMean > 0   }   ~RecipRootMean() {}  
   double CalcRootMean(double x)   {      Mean += AvgCoeff * (x - Mean);      RecipRootMean *= 0.5 * ( 3.0 - (RecipRootMean * RecipRootMean * Mean) );      return RecipRootMean * Mean;   }};  

Using direct square root
Let's go back and take a closer look at Equation 11. Newton's method converges on the solution as quickly as possible without oscillating around it, but if we slow this rate of convergence, the iterative equation will converge on the square root of the average of its inputs. Adding the averaging coefficient results in the following root-mean equation:


where a is defined by Equation 5. Now y (n ) converges to the square root of the average of x (n ). An equivalent signal-flow representation of Equation 14 is presented in Figure 3. Here, an additional y (n –1) term is summed so that only one averaging coefficient is required. Note that x (n ) and y (n –1) must be greater than zero.

A code listing for a C++ class that implements the computation shown in Figure 3 is presented in Listing 2. As in the previous example, initialization is performed in the class constructor, and each call to CalcRootMean() performs one iteration of averaging/square-root computation.

Listing 2. C++ class that implements the floating-point version of Figure 3

static const double Fs = 48000.0;   // sample ratestatic double AvgTime = 0.1;        // averaging time  
class NewtonRootMean{   public:  
   double RootMean;   double AvgCoeff;  
   NewtonRootMean()   {      RootMean = 1.0;               // > 0 or divide will fail      AvgCoeff = 0.5 * ( 1.0 - exp( -1.0 / (Fs * AvgTime) ) );   }   ~NewtonRootMean() {}  
   double CalcRootMean(double x)   {      RootMean += AvgCoeff * ( ( x / RootMean ) - RootMean );      return RootMean;   }};  

With some care, Figure 3 may also be implemented in fixed-point arithmetic as shown in Listing 3. In this example, scaling is implemented to ensure valid results. When sufficient word size is present, x is scaled by nAvgCoeff prior to division to maximize the precision of the result.

Listing 3. C++ class that implements the fixed-point version of Figure 3

static const double Fs = 48000.0;                 // sample ratestatic double AvgTime = 0.1;                        // averaging timestatic const unsigned int sknNumIntBits = 32;  // # bits in intstatic const unsigned int sknPrecisionBits = sknNumIntBits / 2;static const double skScaleFactor = pow(2.0, (double)sknPrecisionBits);static const unsigned int sknRoundOffset = (unsigned int)floor( 0.5 * skScaleFactor );  
class IntNewtonRootMean{   public:  
   unsigned int nRootMean;   unsigned int nScaledRootMean;   unsigned int nAvgCoeff;   unsigned int nMaxVal;  
   IntNewtonRootMean()   {      nRootMean = 1;    // >0 or divide will fail      nScaledRootMean = 0;      double AvgCoeff = 0.5 * ( 1.0 - exp( -1.0 / (Fs * AvgTime) ) );      nAvgCoeff = (unsigned int)floor( ( skScaleFactor * AvgCoeff ) + 0.5 );      nMaxVal = (unsigned int)floor( ( skScaleFactor / AvgCoeff ) + 0.5 );   }   ~IntNewtonRootMean() {}  
   unsigned int CalcRootMean(unsigned int x)   {      if ( x < nmaxval="" )="" {="" nscaledrootmean="" +="(" (="" navgcoeff="" *="" x="" )="" nrootmean="" )="" -="" (="" navgcoeff="" *="" nrootmean="" );="" }="" else="" {="" nscaledrootmean="" +="nAvgCoeff" *="" (="" (="" x="" nrootmean="" )="" -="" nrootmean="" );="" }="">  
      nRootMean = ( nScaledRootMean + sknRoundOffset ) >> sknPrecisionBits;      return nRootMean;   }};  

Divide-free RMS using normalization
Now we'll look at the special case of computing an RMS value on fixed-point hardware that does not have a fast divide operation, which is typical for low-cost embedded processors. Although many of these processors can perform division, they do so one bit at a time, requiring at least one cycle for each bit of word length. Furthermore, care must be taken to insure that the RMS calculation is implemented with sufficient numerical precision. With fixed-point hardware, the square of a value requires twice the number of bits to retain the original data's precision.

With this in mind, we manipulate Equation 14 into the following:


Although the expression x (n )2y (n –1)2 must be calculated with double precision, this implementation lends itself to a significant optimization. Note that a /2y (n –1) acts like a level-dependent averaging coefficient. If a slight time-dependent variance in the averaging time can be tolerated–which is often the case–1/y (n –1) can be grossly approximated. On a floating-point processor, shifting the averaging coefficient to the left by the negative of the exponent approximates the divide operation. This process is commonly referred to as normalization . Some fixed-point DSPs can perform normalization by counting the leading bits of the accumulator and shifting the accumulator by that number of bits.2 In both cases, the averaging coefficient will be truncated to the nearest power of two, so the coefficient must be multiplied by 3/2 to round the result. This implementation is shown in Equation 16.


Figure 4 is the signal-flow diagram that represents Equation 16. Just as in Figure 3, x (n ) and y (n –1) must be greater than zero.

View the full-size image

A sample code listing that implements Figure 4 is shown in Listing 4. This assembly-language implementation is for the Freescale (formerly Motorola) DSP563xx 24-bit fixed-point processor.

Listing 4. Freescale DSP563xx assembly implementation of divide-free RMS using normalization

RMS; r4:      address of output bits 24-47 [y_msw(n)]; r4+1:    address of output bits 0-23 [y_lsw(n)]; x0:      input [x(n)]  
FS          equ   48000.0                   ;sampling rate in HzAVG_TIME    equ   0.1                       ;averaging time in secondsAVG_COEFF   equ   @XPN(-1.0/(FS*AVG_TIME))  ;calculate avg_coeff  
   move     #>AVG_COEFF,x1                  ;load avg_coeff   move     y:(r4)+,a                       ;get y_msw(n-1)   move     y:(r4),a0                       ;get y_lsw(n-1)   clb      a,b                             ;b=number of leading bits in y(n-1)   mpy      x0,x0,a a,x0                    ;a=x(n)^2                                            ;x0=y_msw(n-1)   mac      -x0,x0,a x0,y1                  ;a=x(n)^2-y_msw(n-1)^2                                            ;y1=y_msw(n-1)   normf    b1,a                            ;normalize x(n)^2-y_msw(n-1)^2 by y_msw(n-1)   move     a,x0                            ;x0=[x(n)^2-y_msw(n-1)^2]norm(y_msw(n-1))   mpy      x1,x0,a y:(r4),y0               ;a= AVG_COEFF                                            ;*[x(n)^2-y_msw(n-1)^2]norm(y_msw(n-1)),                                            ;y0=y_lsw(n-1)   add      y,a                             ;a=y(n-1)+avg_coeff                                            ;*[x(n)^2-y_msw(n-1)^2]norm(y_msw(n-1))}   move     a0,y:(r4)-                      ;save y_lsw(n)   move     a,y:(r4)                        ;save y_msw(n)   rts  

Of course, this method can be implemented even without fast normalization. You can implement a loop to shift x (n )2y (n –1)2 to the left for each leading bit in y (n –1). This will be slower but can be implemented with even the simplest of processors.

Higher Order Averaging
Higher order recursive averaging may be accomplished by inserting additional averaging filters before the iterative square root. These filters may simply be one or more cascaded first-order recursive sections. First-order sections have the advantage of producing no overshoot in the step response. In addition, there is only one coefficient to adjust and quantization effects (primarily of concern for fixed-point implementation) are far less than that of higher-order filters.

The implementer should be aware that cascading first-order sections changes the definition of averaging time. A simple but gross approximation that maintains the earlier definition of step response is to simply divide the averaging time of each first-order section by the total number of sections. However, it is the implementer's responsibility to verify that this approximation is suitable for the application.

Second-order sections may also be used, if you want (for example) a Bessel-Thomson filter response. If second-order sections are used, it's best to choose an odd-order composite response since the averaging square-root filter approximates the final first-order filter with Q =0.5. Care must be taken to minimize the overshoot of this averaging filter. Adjusting the averaging time of this filter in real time will prove more difficult, since there are a number of coefficients that must be adjusted in unison to ensure stability.

Three methods of calculating the RMS level are compared in Figure 5. The averaging time is set to 100ms, and the input is one second of 1/f noise with a 48kHz sampling frequency. The first trace is the true RMS value calculated using Equation 6. The second trace is the RMS calculation using Equation 14. The third trace is the no-divide calculation of Equation 16. The fourth trace is the RMS value using the reciprocal square-root method of Equation 13.

View the full-size image

For the most part, the four traces line up nicely. All four approximations appear to converge at the same rate as the true RMS value. As expected, the largest deviation from the true RMS value is the approximation of Equation 16. This approximation will have the greatest error during large changes in the level of the input signal, although this error is temporary: the optimized approximation will converge upon the true RMS value when the level of the input signal is constant.

The errors between the three approximations and the true RMS value are shown in Figure 6. The error of the RMS approximation using Equation 14 slowly decreases until it is below 1E–7, which is sufficient for 24-bit accuracy. The optimized approximation of Equation 16 is substantially worse, at about 1E–4, but still good enough for many applications. The approximation that uses the reciprocal square root is “in the noise”–less than 1E–9. For highly critical floating-point applications, this is the efficient method of choice.

View the full-size image

As you would expect, the errors discussed above will be worse with shorter averaging times and better with longer averaging times. Table 1 summarizes the approximate error versus averaging time of these three methods, along with suitable hardware architecture requirements.

View the full-size image

Suitable for average reader
By combining recursive averaging with Newton's method for calculating the square root, you'll gain a very efficient method for computing the root-mean. Although the three methods I presented here are developed for different hardware and each, to some degree, trades off hardware capabilities for error, most of you should find one of these methods suitable for your application.

Brian Neunaber is currently digital systems architect of software and firmware at QSC Audio Products. He has designed real-time digital audio algorithms and systems for QSC and St. Louis Music and has an MSEE from Southern Illinois University. You may contact him at .

1. D. G. Zill. Calculus with Analytic Geometry, 2nd ed. , PWS-Kent, Boston, pp. 170-176, 1988.

2. Motorola. DSP56300 Family Manual, Rev. 3, Motorola Literature Distribution, Denver, 2000.

Reader Response

In 1996, an article in Dr Dobbs journal presented a square root algorithm that uses only shifts and adds–no multiplies and certainly no divides. It can be found here: (requires registration).

I've been using this in every job and every project ever since–it's by far the fastest integer square root around, and I'm amazed that it's still so little known.

Try it out–you'll never look back!

– Paul Hills
Landis+Gyr Ltd
United Kingdom

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.