You need math to estimate a curve. Here's part two in Jack's series on state estimation.
| Jack Crenshaw's Estimation Series
Part 1: Why all the math?
Part 2: Can you give me an estimate?
Part 3: Estimation interruptus
Part 4: The normal distribution
Back in March, I wrote the first column (“Can You Give Me An Estimate,” www.embedded.com/4213575) in a new series on the math of parameter and state estimation. Because so much time has elapsed between that column and this one, I should give you a brief summary of what we did there. But first, I'd like to spend a little more time giving you the background and motivation for what we're doing.
It's all in the graph
At the heart of all estimation theory lies the concept of the least squares fit. Even when this technique isn't formally used, it's lurking in the background. And the best way to present the concept is in the context of graphs and curve fits. In this view, it's all about drawing pretty graphs, even when the data is noisy.
Picture this: You've instrumented some real-world system and collected a bunch of data, which you've organized into a set of ordered pairs:
You choose to present the data in a graph.
Note carefully that in making this choice–indeed, by organizing the data into pairs in the first place–you've assumed that the x and y values are somehow related. There's a functional relationship between them, which we can formalize as:
Traditionally, we call x the independent variable and plot it on the horizontal axis. We call y the dependent variable and plot it on the vertical axis. The distinction between the two can get a little fuzzy, but in one respect they're fundamentally different. The function f (x ) must be single-valued . That is, for a given value of x , there must be one and only one value of y . Otherwise our math doesn't work.
I should mention that our graph need by no means be restricted to two dimensions (2D). We've all seen 3D graphs of the form:
Mathematically, we can conceive of any number of independent variables, although unless you're Einstein, you're going to have a hard time visualizing the graph. More concisely, we can write:
where x is now a vector of independent variables. The only dependent variable is still y , and it must always have a scalar value.
Ok, let's get back to your graph. You carefully plot the raw data, but when you're finished, you see that the points are scattered all over the page, as in Figure 1 .
If you were very, very naïve, you might think, “Wow, this system is a lot more complicated than I thought.” Then you might start blindly connecting the data points with straight line segments or, worse yet, a set of spline curves passing precisely through each point.
But you're not so naïve. You suspect that the system is changing in a smooth manner, but the measurements are corrupted by noise. You'd like to show what the graph would look like if the noise weren't there.
To do this, you draw a smooth curve through the data, such as the blue curve in Figure 2 .
The obvious next question is, how do we get this smooth curve? Is it merely a matter of aesthetics, or could there be some math involved?
Back in the day, I used to draw such curves freehand, armed with nothing more than a sharp eye, a sharp pencil, and a steady hand.
The problem with such an ad hoc approach is that someone else would surely come up with a different curve. And since the method is purely ad hoc, who's to say which curve is better? Before we start running a beauty contest for curves, we had better come up with something more than mere aesthetics to judge them. That's where the method of least squares comes in.
The least squares fit
I've addressed the method of least squares before, in “Why All the Math” (June 2009, www.embedded.com/4027019).
The basic idea is this: Assume a form for the function f (x ). For every value xi of your data set, this function defines a predicted value:
Note that these values don't appear anywhere in your data set. They're fictitious values, representing the value y would take on at xi , if the noise weren't there.
The difference between yi and ŷ i represents the error between f (x ) and the measured data, at the point xi . It's the residual :
Ideally, of course, we'd like all the residuals to be zero, but that's not possible with the data we're given. The best we can hope for is that the entire set of residuals is minimized in some sense.
Now, when we're talking about minimizing something, it almost always involves minimizing a single scalar value, called the penalty function . Clearly, this penalty function must involve all the residuals, giving no preference for one residual over the other. Ideally, we'd like this penalty function to be really, really simple.
We could choose to simply add all the residuals, but that doesn't work.
Some of the residuals are positive, some negative, and we don't want them cancelling each other. Each non-zero residual should increase the value of the penalty function. We might choose a sum of the absolute values, but that doesn't work either. An absolute value is non-linear, so it messes up any math we might use to do our minimization.
The least squares fit is based on the very next simplest function we can think of, which is the sum of the squares of the residuals. Define the variance :
Then the best curve fit is the one that minimizes V .
Who defines f(x)?
Until now, we haven't said anything about the nature of f (x ) except that it must be single-valued. Now let's talk about how we get a reasonable function. There are really only two possibilities:
- If I'm doing an ad hoc curve fit, as I did in Figure 2, it's because I don't know, or don't care, anything about what's really going on in the system under study. I only want the graph to look pretty. I can elect to draw the curve freehand, as I did in Figure 2. In this case, the function f (x ) still exists, I just don't know what it is.
On the other hand, if I want to use the least squares method to improve the curve fit, I must assume a function f (x , even if it makes no sense in the context of the system being measured. I'm not at all limited, however, as to the nature of f (x ). Any formula will do, as long as the result remains single-valued.
I could, for example, choose a straight line equation of the form:
To a mathematician, this is a first-order polynomial in x . To business folk and Excel users, it's a trendline, and its solution is called a linear regression .
Of course, the function needn't be limited to straight lines. It could be a polynomial of any order. Or not a polynomial at all. I might choose, for example, a sum of exponentials, or a sum of sine waves, otherwise known as a Fourier series. I could even use a Bezier curve or a set of cubic splines.
See, a spline fit normally gives a set of cubic functions that pass precisely through each of the data points. As I mentioned earlier, this would be a terrible choice for the data of Figure 1. Try it, and you will find that the curve oscillates wildly. But you can still use a spline fit, if you relax the requirement that the curves have to pass through the data points. If you assume a (hopefully small) number of mesh points that don't necessarily correspond to data points, you can move these mesh points around to minimize the variance.
The same is true with a Bezier curve. Every such curve has a set of pivot points, or guide points. We can conceive of optimizing the curve fit by moving these points around.
Although we can use any function f (x ), no matter how bizarre or complicated, there's a reason why most curve fits involve polynomials: If we use a polynomial, the math for minimizing the variance has an analytical solution.
- On the other hand, I might know something, or a lot, about the math associated with the system under study. I once worked with a gadget that measured gas concentration in the air. In that case, the math was defined by Beer's law, which leads to an exponential function.
The math describing our system might be a set of algebraic equations. That was the case with my gas sniffer. Many times, though, the math can be described only in a set of differential equations. If, for example, the system obeys Newton's laws of motion, you can be very sure that it's described by a set of second order, ordinary differential equations (ODEs).
Again, it really doesn't matter what the math is, except in the sense that some equations are easier to work with than others.
In this second case, where we know, or think we know, the math that describes the system, we say that we have a model of the system. You're going to be hearing a lot about models in the months to come.
Change what , exactly?
I've been talking a lot about optimizing a curve fit, by adjusting the function f (x ) to minimize the variance. But the process of minimizing anything involves adjusting something to get a better fit. But what is it that we're adjusting?
Equation 8 should give you the clue. The function f (x ) doesn't depend only on x . It also depends on the two coefficients, a and b . It's these coefficients that we must change to alter the variance. When we're dealing with polynomials, we usually think of the coefficients as constants, but in this case, they're the variables. Consider: We can't change the xi or y i . The only things we can change are the ŷi , and those only by changing the values of the coefficients.
To formalize this dependence on the coefficients, we should really write:
Or, more generally:
where c is now a vector of the coefficients. When we're optimizing the goodness of the curve fit, we're doing so by adjusting the values of the elements of c . We'll see how in a moment.
Back to the average, part 2
In my previous column, I told you that there are two ways to process data. The first way is the one we've used in our example: Collect all the data first, then process it all at once. This is called batch processing . There's usually no notion of real-time processing involved. The batch processing is something you do “off line,” after all the data's been collected. It's the traditional way statistics is done.
The notion of batch processing has some important ramifications, not the least of which is that all the data must be collected and stored somewhere before it's processed. This can take a lot of memory. The process is far more amenable to running on a large mainframe computer, in non-real time, than to a real-time system.
The alternative is to process each new data point as it comes in. This approach is called sequential processing . As you might imagine, this kind of processing is a lot more attractive to us real-time folk.
In that first column, I was mostly focused on the distinction between batch and sequential processing, and I sought to show you how a simple batch processing algorithm can be warped into its sequential counterpart. To do that, I tried to think of the simplest possible application, which was finding the mean and standard deviation of a measured data set.
In retrospect, I think I might have gone too far in seeking the simplest example, because it's something of an outlier in the more general context of least squares fits. This month, I'd like to revisit what we did, but do it in that more general context.
From that perspective, we can think of the average as merely a special case of the polynomial regression, where the polynomial happens to be the zeroth-order one:
Let's see how our definitions of residuals and variance work out in this case, and lead to the solution that minimizes that variance. From Equations 5 and 6:
Expanding the polynomial gives:
Summing these squares gives the variance. In the process, note that because a is a constant, we can boost it out of the summations. We get:
The last term is rather cute. Because the a 2 factor has been boosted out, there's nothing left to sum except an implied coefficient of 1. Adding 1 to itself N times, where N is the number of data points gives, of course, N . So our final form for the variance is:
The next step requires a little calculus, but very little. When a function is at a minimum (or maximum), its slope–which is also its derivative–is zero. Taking the derivative of Equation 16 gives:
Setting this equal to zero gives:
Eureka! We've just discovered that the average of a set of numbers is equal to their sum, divided by the size of the set. Who knew? But the important thing is that we now see why this definition holds. Remember that for this special case, the “function” f (x ) is simply a , so the optimal value of the curve fit is given by:
Now we see the relationship between a simple average and the least squares fit. The average is simply one kind of a least squares fit, in which the function f (x ) happens to be the degenerate function f (x ) = a .
In my last column I showed you how we can convert the batch process of averaging into a sequential one, so there's no point pursuing this case further.
Moving on up
Now that we see the relationship between the average and the least squares fit, it's time to generalize to higher-order fits. This time, we'll assume the linear polynomial of Equation 8. Repeating the steps in Equations 12 through 19 above, this time we get:
Expanding the polynomial gives:
The variance then, is shown in Equation 24.
Click on image to enlarge.
Ok, I admit it, this looks pretty messy. But don't forget, all those sums are based on the input data and are simple constants in this process. Also, we'll be taking derivatives in our next step, which makes the complexity go away. Those derivatives are:
Again, remember what our goal is: We're trying to find the values of a and b that minimize the variance. To do that, we have to adjust them together, and the minimum is given where the values of both derivatives vanish. Setting both derivatives to zero gives us two equations in the two unknowns, a and b , which we must solve simultaneously. We get:
Even if you're still in freshman algebra, you should be able to solve these two equations–remembering, of course, that all the sums are mere constants. But there's a much more elegant way to solve the equations, and it's one that we can use for polynomials of any order. To see how it works, define the vector:
Now we can see that the two equations can be cast into the matrix form:
From here on, it's a simple matter of linear algebra, otherwise known as vector-matrix math, to get our optimal solution. We'll need to invert the matrix Q , to get:
Some important points
In this example, I've used the term “linear” in two ways, and the distinction can be confusing. So let me clarify.
My first use of the term had to do with the function I've chosen to study: the first-order polynomial of Equation 8. The term “linear function” is another way of saying that x appears in the function only as its first power. Contrast this with quadratic function, or cubic functions, which involve polynomials of orders 2 and 3, respectively.
Because of this terminology, you might get the impression that linear algebra can only be performed, and therefore the solution can only be written, in the form of Equation 33. Because the sought-for coefficients, a and b , only appear as their first orders. It's the order of the coefficients, not the values of x and y , that matter.
So, even if I use a very complicated function like:
the unknown coefficients still only appear as multiplicative constants. The equation is highly nonlinear in x , but linear in a , b , and c . Because of this, the squares of the residuals can only appear as, at most, squares of these coefficients, or cross products like ab , ac , and bc . Take the derivatives of each of these terms, and you're back to first order terms, so linear algebra works. I hope this clears up any confusion.
A second point: Note that the sizes of Q and U are only 2×2 and 1×2, respectively. We can't escape the need to invert Q , but its order is low and the inversion will go fast. In fact, there's a closed-form solution for the inverse of a 2×2 matrix, so we could use this in a real-time implementation. But even if the number of unknowns is greater than two, it's probably not much greater. So the matrix inversion will usually go relatively fast.
I once worked on a Kalman filter with 18 states to solve, and therefore an 18×18 matrix to invert. Doing that one took some time, especially using software floating point in a 16-bit processor. I've heard of filters using as many as 40 states, though one has to wonder just how good a solution you're getting. But even in that extreme case, the number of equivalent multiplications is on the order of 403 , or 64,000 multiplies. That sounds like a lot, but not to a modern CPU with hardware floating point.
Also note the forms for Q and U . Equations 31 and 32 only apply if the function f (x ) is a polynomial in x , but that's a very important special case. As you can see, Q is symmetric, with elements containing only sums of powers (including power zero) of xi . Furthermore, the powers are equal along each minor diagonal.
The fact that the terms in Q depend only on the x 's leads to an important special case. There are times when we know in advance what values we will use xi . In real-time systems, for example, we often take samples at fixed time ticks. In these cases, if we also know how many samples we're going to take, we can write down the value of Q in advance. Therefore we can also write down the value of Q –1 .
Finally, note that the terms in U always contain sums involving only the first power of yi . It's only the powers of xi that are increasing down the array. This is true for any order of the fitting polynomial.
In this column I had hoped to take our method one more step, to include the case where f (x ) is a quadratic. I'd also hoped to cast both cases into the form of sequential processing, and show concrete examples as I did for the case of averaging. But I'm running out of room, and anyhow Microsoft Excel 2010 is not cooperating (grrrrr), so those things will have to wait until next time.
See you then.
This article provided courtesy of Embedded.com and Embedded Systems Design magazine.
See more articles like this one on Embedded.com.
This material was first printed in Embedded Systems Design magazine.
Sign up for subscriptions and newsletters.
Copyright © 2011
UBM–All rights reserved.