# Can you give me an estimate?

About 18 months ago, I wrote a column on the least squares fit (“Why all the math,” www.embedded.com/4027019) where I emphasized the technique to fit a curve to noisy data. It's not the only way to do that, of course. If you've ever worked with any embedded system that had a sensor measuring a real-world parameter, you know all about dealing with noise. The usual way is to pass the signal through a low-pass filter, which can be anything from a simple first-order lag to a finite impulse response (FIR) or infinite-impulse response (IIR) filter with scores, if not hundreds, of terms.

So what's the difference between a least squares fit and a low-pass filter? They both extract a signal from the noise, right?

*Jack Crenshaw's Estimation Series:* Part 1: Why all the math? | Part 2: Can you give me an estimate? | Part 3: Estimation interruptus | Part 4: The normal distribution

One clue to the difference lies in the word, *signal* . As soon as you say the word, you invoke an image of something like a voltage that's changing with time. When you're processing a signal, you don't have the luxury of waiting until you've got a bunch of data points. You have to accept and process the data as it's coming in. Most days, we call such processing *real time* .

By contrast, the term *least squares* invokes an image of *batch processing* , where you operate on a whole set of data items after they've been collected. The notion of time is not explicitly involved. Indeed, when you're applying the least squares fit, you can shuffle the data indiscriminately, and the technique will still give you the same solution.

As far as we know, the least squares fit was invented by Carl Friedrich Gauss in 1795, when he was at the ripe old age of 18. He used it in 1801 to predict the motion of the minor planet Ceres. I think it's safe to say that Gauss's analysis was anything *but* real time. Unless, of course, you accept a clock rate measured in days or weeks.

The distinction between a filter and a least squares fit gets a lot more fuzzy if we set up the least squares fit so that we can process the data *sequentially* , as it comes in. That concept, sequential processing, is going to be our main focus in this and future columns.

But there's another distinction between filtering and fitting that's much more fundamental and profound than the way you process the data. That distinction lies in what you know—or think you know—about the process that generates the data.

When I'm developing a system to filter noise out of a signal, I don't have any idea what's happening in the real world to create that signal. At the level of the analog-to-digital (A/D) conversion, it's just a voltage that comes from somewhere, corrupted by noise. My job is only to extract the signal from the noise.

The least squares fit is different. We apply it when we think we know something about the process that generated the data. If I'm applying a linear regression to a data set, it's because I think that one element of the set's data pair depends on the other. And not just depends, but has a linear relationship, which I can graph as a straight line. The purpose of the least squares fit is not just to filter the noise, but to discover the coefficients of that straight line relationship.

Of course, the regression doesn't have to be linear. Using least squares, I can fit a quadratic, or a polynomial of any order. I can fit a logarithmic relationship, or a sum of logarithmic terms. I can even fit a sum of sine waves (in which case I've done a Fourier analysis). It really doesn't matter what we think the relationship is; it only matters that we think that there is one.

The point is, when I'm applying a least squares fit I'm doing more than just filtering noise. I have a mental *model* of what's going on in the system that's generating the data. Presumably, that model includes coefficients whose values are unknown. The job of the least squares fit is to give me a best *estimate* of those coefficients. For obvious reasons, this process is often called *state estimation*, and it's this discipline that will occupy our interest in the rest of this column, and several more. If things work out right, the series will culminate in that dream of all estimators, the justly famous *Kalman Filter*. I'll leave it to you to ponder why the pinnacle technique of state estimation is called, not an estimator, but a filter.

**Your score was merely average**

In this new series, I expect to cover a lot of ground, with math that's going to get heavier and heavier as we go along. The last thing I want to do is to leave some of you behind in the starting blocks. So in the spirit of No Reader Left Behind, I'm going to begin with the most ridiculously simple example I can think of. If things go as planned, I'll continue to escalate in such small steps that, like the proverbial frog simmering in the proverbial stew pot, you'll be cooking along with state estimation without really noticing how you got there. I'll start by defining a set of 10 numbers:

Your challenge is to compute their average value. Hey, you know how to do this. First, add all the numbers:

Then divide by the number of terms:

There. That wasn't so hard, was it?

Note the bar over the ***y***, which is the symbol usually used to indicate an average value. It's not the only symbol for the average. Another one is 〈*y*〉, which is a little harder to miss. I actually prefer 〈*y*〉, precisely *because* it's harder to miss. But we'll stick with the “bar” notation here. Note also that, while **y** is a set of values, *ȳ* is only a single scalar number.

We can generalize the process with a little math notation (stay with me, now):

$$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_{10}}{10}$$

or, more generally, for a set of *N* values:

$$\bar{y} = \frac{y_1 + y_2 + y_3 + \cdots + y_N}{N} \tag{5}$$

Even better, we can use the shorthand notation of the summation symbol:

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i \tag{6}$$

Careful, now. I can see some of your eyes starting to glaze over, and your pupils shrinking down to pinpoints. This isn't rocket science (yet). Equation 6 says nothing at all that isn't said by Equation 5. It's just a shorter way of saying it. Because mathematicians and physicists are a lazy lot, we tend to go for the shorter way when given the chance. Stated in words, the summation sign of Equation 6 says “let the index *i* take on all the integer values from 1 through *N*. For each *i*, add the term *y_i* to the sum.”

If the summation sign makes you nervous, just remember that you can always expand it back out into the explicit sum of Equation 5. It just takes a little longer to write, that's all.

One last point: The average value of any set is often called its *mean* . Even more specifically, its *arithmetic mean* (because there are other kinds of means).

Why introduce another term? Can it be that we're so lazy we'd rather write four letters than seven? I guess it must be so. Deal with it.

**Some do it sequentially**

The problem with taking an average using Equation 6 (or 5) is that the computation is basically a batch process. To add up all those numbers, we have to have them all available. In an embedded computer, this means that we must keep the potentially huge set of terms stored in memory.

Is there a better way? Of course. To see how, let's look at how the average value changes as new data comes in. Let *m_n* be the mean of the first *n* values of the set. As the data comes in, we'll have:

$$m_1 = y_1\,,\qquad m_2 = \frac{y_1+y_2}{2}\,,\qquad m_3 = \frac{y_1+y_2+y_3}{3}\,,\qquad \ldots$$

But each of the sums in these equations is merely the sum in the previous mean, plus one new term. We can write:

$$m_2 = \frac{m_1+y_2}{2}\,,\qquad m_3 = \frac{2\,m_2+y_3}{3}\,,\qquad m_4 = \frac{3\,m_3+y_4}{4}\,,\qquad \ldots$$

And, in general:

$$m_n = \frac{(n-1)\,m_{n-1}+y_n}{n} \tag{9}$$

As you can see, we don't need to keep any of the old elements of **y** around. In fact, we don't need to keep any of the old elements of **m** around, either. We only need the latest value of *m*, plus the newest element of *y*. The new value of *m* can overwrite the old value. In a software implementation, we only need two persistent scalar variables: the past value of *m*, plus the integer counter, *n*.

Better yet, let's not keep the old value of *m*, but the running sum. This lets us avoid the multiplication by *n*. Let:

$$s_n = \sum_{i=1}^{n} y_i = s_{n-1} + y_n \tag{10}$$

Then:

$$m_n = \frac{s_n}{n}$$

Writing the software is almost easier than describing what it must do. Listing 1 shows a snippet of code that works in either C or C++. It's not perfect, and it's not production quality. Because it has static variables, it won't work if you're asking it to find more than one average per program. If there were ever a case for a C++ class, this is it. That task, I'm leaving “as an exercise for the student.” But the code I've shown should give you the idea.
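The listing itself appears as an image in the original article, but a minimal C sketch along the lines described (static variables, so only one average per program) might look like this:

```c
#include <stddef.h>

/* A sketch in the spirit of Listing 1 (the original listing isn't
   reproduced here): sequential averaging via the running sum of
   Equation 10. The static variables are exactly why this can
   compute only one average per program. */
double running_average(double y)
{
    static double sum = 0.0;   /* running sum, s_n */
    static size_t n   = 0;     /* sample count */

    sum += y;                  /* s_n = s_(n-1) + y_n */
    ++n;
    return sum / (double)n;    /* m_n = s_n / n */
}
```

Each call folds one new measurement into the state and returns the updated mean; no history is kept.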


**We need a variance**

That task—of computing an average—wasn't too hard, was it? I hope it didn't tax your brains too much. Here's your next challenge: For the same data set, compute the variance and standard deviation.

Whoa! You want me to . . . what? Did we cover that in class?

Not yet, but we'll do it now. In general, the data in our sample data set **y** tells us more than just its mean, or average, value. It also tells us something about the reliability of each data element. In other words, it gives us insight into how much noise hides inside the data. It gives us the *statistics* .

In **Figure 1**, I've plotted the data set and also the mean value. As you can see, the data items themselves bounce around quite a bit. As often happens, none of them are actually equal to the mean value. For each value of *i*, the value *y_i* in the set differs from the mean by an amount called the *residual*:

$$e_i = y_i - \bar{y} \tag{12}$$

It would be nice if we had some single scalar measure—we might even call it a *variance* —of the quality of the data. Clearly, this measure would have to involve all the members of the data set. We could try adding all the residuals together, or computing their average value, but that wouldn't work. Because the residuals can be positive or negative, they could cancel each other, leaving us with a false impression. If, for example, the data values alternated around the mean, then the sum of any pair—or all the pairs—would be zero, and so could be the average residual. That would tell us nothing about the real quality of the data set.

A measure that does work, though, is to take the sum of the *squares* of the residuals. This, of course, is the same measure that's used in linear regression and similar curve fits. Can we say “least *squares*”? Duh. More precisely, let's define *variance* as the mean of the squared residuals:

$$V = \frac{1}{N}\sum_{i=1}^{N} e_i^2 \tag{13}$$

Once we have the variance, the *standard deviation* is easy. It's just the square root of the variance:

$$\sigma = \sqrt{V}$$

This quantity, standard deviation, is supposed to be a measure of the amount of noise in our signal. Take a look at **Figure 2**. This is the same graph as Figure 1, but I've added two horizontal, dashed lines a distance σ above and below the average value. You get the impression that new measurements of *y* are usually going to lie in the band between these two limits. Are the limits absolute? Certainly not. As you can see, five of the 10 points lie outside the band. Even so, these lines do seem to say something about the “scatter” in the data, don't they? From the defining equations, it's clear that the size of σ is going to depend on this scatter. If all the values *y_i* are nearly equal, then σ will be small, and the band will be narrower. In the limit, when there is no noise, or scatter, at all, all the measurements will be equal, the value of σ will go to zero, and the band will have zero width.

You'll note that I haven't said anything, so far, about probabilities, or distribution functions, or any of those terms that relate to statistics. And I won't, in this column. We'll have plenty of time for that, later. For now, you only need to get the concept that the standard deviation is a measure of the amount of scatter, or noise, in the data.

**A sequential approach**

We started this discussion by considering the batch processing of the average, assuming that we had the entire data set as embodied in **y**. If we look at the problem of computing the variance from the same perspective, it clearly isn't much harder. First, sum all the elements of **y**, and divide by *N* to get the mean *ȳ*. Then use that in Equation 12 to calculate the residuals term by term. Square them, add them, and divide by *N* to get the variance.
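That two-pass batch procedure is short enough to sketch directly (the function and variable names here are mine, not from the column):

```c
/* Two-pass batch variance: pass 1 computes the mean, pass 2 averages
   the squared residuals. The standard deviation is sqrt() of the result. */
double batch_variance(const double *y, int n)
{
    double sum = 0.0, vsum = 0.0;

    for (int i = 0; i < n; ++i)      /* pass 1: sum for the mean */
        sum += y[i];
    double mean = sum / (double)n;

    for (int i = 0; i < n; ++i) {    /* pass 2: squared residuals */
        double e = y[i] - mean;      /* residual, e_i = y_i - mean */
        vsum += e * e;
    }
    return vsum / (double)n;         /* mean of the squared residuals */
}
```

Note that it needs the whole array in memory, and it walks the data twice; that's the batch-mode cost the next paragraphs set out to remove.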

The question is, can we do the same thing for variance that we did for the mean? Can we come up with a sequential process for the variance?

At first glance, it may seem unlikely. The problem is that *all* the residuals, from *e_1* through *e_n*, depend on the mean, which changes each time we add even one more data element. From Equation 13, it sure looks like we must recalculate the entire variance from scratch.

But looks can be deceiving. You can see the process with more clarity if we expand the polynomial in Equation 13. Write:

$$V = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2 = \frac{1}{N}\sum_{i=1}^{N}\left(y_i^2 - 2\bar{y}\,y_i + \bar{y}^2\right)$$

Now let's split up the three terms so that we sum them separately:

$$V = \frac{1}{N}\left[\sum_{i=1}^{N} y_i^2 - \sum_{i=1}^{N} 2\bar{y}\,y_i + \sum_{i=1}^{N} \bar{y}^2\right] \tag{16}$$

Because we know that *ȳ* itself contains a summation, it's tempting to substitute that sum here. But that would be a mistake; it leads to more complexity, while we're looking for simplicity.

The real trick is to recognize that, regardless of how we calculate *ȳ*, it's just a number by the time it shows up in Equation 16. It's not an indexed number, and it isn't changing as we generate the sums. As far as the summations are concerned, it's a mere constant, and we can boost it out of the summation process.

Doing that, we can write:

$$V = \frac{1}{N}\left[\sum_{i=1}^{N} y_i^2 - 2\bar{y}\sum_{i=1}^{N} y_i + \bar{y}^2\sum_{i=1}^{N} 1\right] \tag{17}$$

Check that last sum. Is that cool, or what? It simply says that we should add 1 to itself *N* times. Guess what result we'll get. That's right, it's *N*. The second sum is interesting, also. Does that sum:

$$\sum_{i=1}^{N} y_i$$

look familiar? It's the same sum that's used to generate the mean in the first place. See Equations 6 and 10. From them, we can see that:

$$\sum_{i=1}^{N} y_i = N\,\bar{y}$$

Substitute this into Equation 17 to get:

$$V = \frac{1}{N}\left[\sum_{i=1}^{N} y_i^2 - 2\bar{y}\left(N\bar{y}\right) + N\bar{y}^2\right]$$

Or:

$$V = \frac{1}{N}\sum_{i=1}^{N} y_i^2 - 2\bar{y}^2 + \bar{y}^2$$

Or simply:

$$V = \frac{1}{N}\sum_{i=1}^{N} y_i^2 - \bar{y}^2 \tag{20}$$

We're now in a position to calculate the variance, and therefore the standard deviation, sequentially. We'll have to maintain another sum; call it:

$$u_n = \sum_{i=1}^{n} y_i^2 = u_{n-1} + y_n^2 \tag{21}$$

Each time we get a new value of *y_i*, we can now compute the running sum *s_n* as before, plus the running sum of squares, *u_n*. From *s_n*, compute the latest value of the mean, *ȳ*. Plug that into Equation 20 to get a running value of *V*, and therefore of σ.

As before, it's almost easier to show you the code than it is to describe the process. See **Listing 2** .


Next, look at **Figure 3**, a graph like Figure 2, except that now you see the sequential values of the average and the error band, and the way they evolve as new data comes in.

Perhaps you've seen graphs with error bars showing the expected deviations from each measurement. That's what I'm trying to display here, only Microsoft Excel won't let me show error bars. So the large black dashes will have to do.

**Caveat calculator**

I need to mention one or two practical problems. Take a look at the running sums in Equations 10 and 21. See any potential problems?


That's right. The two sums can grow without limit. If you process the input signal at a high enough rate, and for long enough, sooner or later both sums are going to overflow. Presumably, the sum of the squares is going to overflow first.

We can do certain things to alleviate the problem. First, note carefully that the measurements *y _{i} * can have units. If you're measuring the distance to Alpha Centauri and using units of millimeters, you're going to have overflow problems a lot sooner. One thing you can do is to choose your units more carefully and normalize them to some standard value—Astronomical Unit (AU), for example.

Or you might split up the computation into stages. At some point, where the sums are starting to get out of hand, you might just store away all the values to date and start a new set of computations.

The easiest thing to do is to normalize all the numbers by the number of samples. That is, maintain running averages rather than running sums. We can do that easily, as shown in Equation 9.
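A minimal sketch of that idea (the function name is mine): keep the running average itself, so no sum ever grows beyond the scale of the data. Equation 9's update, rearranged, becomes a one-line increment:

```c
#include <stddef.h>

/* Running average kept directly, per Equation 9:
   m_n = ((n-1)*m_(n-1) + y_n) / n, rearranged algebraically into
   m += (y - m) / n. Nothing here grows without bound. */
double running_mean_no_sum(double y)
{
    static double m = 0.0;   /* current mean, m_n */
    static size_t n = 0;     /* sample count */

    ++n;
    m += (y - m) / (double)n;
    return m;
}
```

The stored value stays the same size as the measurements themselves, which is the whole point.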

Interestingly enough, one approach to the problem is to ignore it. This potential for overflow is an excellent argument for computing everything in floating point, even in small, integer-only computers. Better yet, compute in double precision, where the maximum value is a staggering 10^{308} .

How big is that? Imagine that you're sampling a signal at the rather high (for embedded systems) rate of 1000 Hz. Imagine also that you've chosen your units wisely, so that all the measurements are of order 1. How long will it take for your filter to overflow?

It's 10^{308} ÷ 1000 = 10^{305} sec ≈ 3×10^{297} yr.

That's 10^{287} times the current age of the universe!

This is a far cry from, say, the Y2K problem. The range of a double-precision number is so staggeringly large, I think it's safe to say we won't see the sums overflow in our lifetimes.

So one acceptable solution to the overflow problem is simply to ignore it. As long as you're using floating point—or even better, double precision—arithmetic, the overflow is simply not going to happen.

Another problem in the calculation is more challenging. It lies in Equation 20. Suppose that the average value of the data set **y** is reasonably large, but the noise is very small. The residuals in Equation 12 are all going to be small, as is their sum in Equation 13.

But we aren't summing residuals in Equation 20—we're summing the squares of the measurements themselves. This sum can be a whole lot larger. For that reason, we can end up, in Equation 20, subtracting nearly equal numbers and thereby introducing a loss of precision.

There is no quick cure for this problem. If you need the benefits of sequential processing, you'll have to accept that your end result may not be as accurate as you might like.

You can take quite a number of steps to fix the problem, but none are straightforward, and all are very much problem dependent. You might choose, for example, to reference the measurement to some expected value, other than zero. In that Alpha Centauri example, a baseline of four light-years might help.

Like most problems, this one can be solved, but don't expect your solution to one problem to work in all other possible problems. In the real world of processing real-world signals, you can expect to tailor the solution to the problem. How surprising is *that*?

**Coming soon**

This month, I've shown you how to calculate the mean, variance, and standard deviation of a set of measurements. I showed you how to do it the classical way, processing all the data in batch mode, and I've shown you the tricks to do the same calculation sequentially, as the data comes in. I studiously avoided talking about statistics, so we clearly have more things to talk about. I'll do that next month. See you then.

*Jack Crenshaw is a systems engineer and the author of* Math Toolkit for Real-Time Programming. *He holds a PhD in physics from Auburn University. E-mail him at jcrens@earthlink.net.*
