Understand how to cut down on the noise in your system, using the math behind the Kalman filter. If you know something about the dynamics of the system, you can make a better estimate of what it's doing now and what it's going to do next.
The universe is a noisy place. More so in Manhattan, Detroit, or the galactic core; less so in the Swiss Alps or interstellar space, but noisy everywhere. No matter how hard you try, you can't escape the noise.
Few people understand this better than scientists and engineers, especially those of us who work with embedded real-time systems. Chances are, the first time you hooked up some measuring sensor to an analog-to-digital converter, you learned this lesson well.
Most likely, you decided to insert a low-pass filter into the signal path. That will remove a lot of the noise, all right, but also the fine structure of the system behavior. And it necessarily adds a time delay to the system–a delay that will affect the stability of a control system.
Filters have their place, for sure. But in the end, the use of a low-pass filter is a tacit admission that we don't have a clue what the system is really doing. We filter it because it's the only recourse we have. We treat the input signal as pseudo-static, and hope that our sample rate is fast enough to make the signal look like a constant.
On the other hand, there can be times when we do know–or think we know–how the system should behave. A real physical system obeys the laws of physics. A reactor in a chemical plant is going to follow the laws governing its particular chemical reaction. An airplane in flight is going to obey Newton's laws of motion.
In such cases, we can do a little better than simply relying on brute force low-pass filters. If we know something about the dynamics of the system, we can make a better estimate of what it's doing now and what it's going to do next. That concept has been the focus of my last few columns and will remain our focus for the near future. This is the third installment of the series.
Wouldn't you like to fly?
Now, there's one place in the universe that's blissfully noise-free: the pristine world of mathematics. Add 2 and 3, and you always get 5. Exactly. For a given value of x , the function f (x ) returns the same value. Every time.The same is true of applied mathematics. If I simulate, say, the motion of an airplane, I know it must obey Newton's laws of motion. And when I run the simulation, the airplane follows the same course. Exactly.
The real world is not so accommodating. A real airplane obeys the same laws of motion, but it's also subject to disturbances that aren't modeled in my simulation: wind gusts and updrafts; headwinds; changes in air temperature and density; pilot inputs. Even people moving around in the cabin.
Perhaps more importantly, every sensor measuring a flight-related parameter has errors of its own: electronic noise, scale factor errors, quantization errors, drift, temperature sensitivities, etc. All of these things introduce uncertainties into my perfect world of simulation.
But wait; there's more. Even if I could simulate all those effects, the simulated airplane is still not going to behave like the real thing, because there are system parameters whose values I don't know exactly. I may know the mass and inertia properties of the dry airplane, but not with an unknown number of passengers of unknown weights, or the mass and distribution of fuel in the tanks. I may have precise data sheets that give me the theoretical thrust of the engines, but not the actual thrust they're producing at any given time.
So there are three ways my simulation can never be perfect: I don't know the precise values of the system parameters, I don't have precise measurements of the system state, and my math model doesn't include all the possible effects. In the real world, our challenge is to make our best guess as to all these things, based on the noisy measurement data we have available.
Because the system is subject to unmodeled effects, it's not enough to just measure the state once or twice, and assume the system will obey its laws of motion from there on. I have to continue to take measurements, and constantly improve my best guess as to both the system state and the parameters that I thought I knew, but got wrong.
Parameter estimation is the determination of those system parameters. State estimation is the determination of the state of our system. A good data processing system can do both, and nothing does it better than the fabled Kalman filter.
A short review
In this series, I deliberately started slowly, and moved forward in baby steps, so as not to leave anyone behind. I first chose the absolutely simplest example I could think of, which was to take the average of a set of numbers:
The mean , or average is, of course, given:
The variance and standard deviation are measures of the level of noise in the data. To compute them, we defined the measurement error, or residual , to be:
The variance, then, is the sum of squares of the residuals:
The standard deviation is simply:
To illustrate the method, I generated an example that produced a graph like the one in Figure 1 . The two lines denoted “Upper” and “Lower” delimit an error band, given by +σ and -σ. These limits are not hard and fast, of course. As you can see, many of the data points lie outside this “one-sigma” error band. Even so, it's a good measure of the average amplitude of the noise.
Click on image to enlarge.
To get the result of Figure 1, we processed all the elements of y at the same time, just as they're shown in Equations 2 through 5. That's called batch mode processing . It's obviously a simple way to go, but it's not well suited to real-time systems because the time it takes to recalculate the mean is proportional to the number of elements in y . In real-time systems, we usually need the results quickly, and we're often limited in data storage space. For such cases, a better approach uses sequential processing , where we process only the one new data point each time it comes in.
To support sequential processing, we defined two running sums. For a given value of n ≦ N , they are:
Each time a new measurement, yn , comes in, we must update the sums according to:
Then the updated mean and variance are given by:
Using this algorithm, we got a figure like Figure 2 .
Click on image to enlarge.
A picture is worth…
As utterly simple as this first example is, Figures 1 and 2 tell us some important things. The first is that I chose to draw x –y graphs at all. After all, no variable x has been mentioned before. So why did I introduce it?The short answer is: for clarity. You can think of x as a time parameter if you like, and remember that, like time, “it keeps everything from happening at once.”
The values of x don't enter into the computations; in fact we could shuffle the data points around, and it wouldn't change the results. For the graph, I simply defined x to be the set of ordinal numbers:
Together, the two vectors define a set of ordered pairs:
Now look at Figure 2. In this case, the order of the elements of y very much does matter. That's because, at each value of xn , we're only considering the values yn and older. And, as the figure clearly shows, the estimates of mean and standard deviation are constantly updated each time a new measurement comes in.
I'm sure you can see that this kind of processing is precisely what we need in a real-time system. At any point in time, the resulting values of and V represent our best guesses–the optimal estimates –based on the data collected so far. When a new measurement comes in, we must update those results. The algorithm described in Equations 6 through 9 does this using only the old values, plus one new measurement. The new best estimate is also optimal.
This is exactly the kind of behavior we'd like from a Kalman filter. In fact, the algorithm that generated Figure 2 is the Kalman filter for this simple problem.
By the way, look closely at the final values of the statistical parameters in the two figures. You'll see that they're the same, as they should be.
It's all about the graph
Graphs like Figures 1 and 2 are certainly handy ways to display our results but, as I pointed out in my last column, the graph-oriented mindset goes much further than mere convenience. It's also the best way to explain and understand the math involved.When we plot any x –y graph, we're making the tacit assumption that the two variables are somehow related; that there's a functional relationship between them, of the form:
When we plot data that has scatter in it, such as red curves in the figures, we don't imagine for a moment that the scatter is truly part of f (x ). More likely, f (x ) itself generates a relatively smooth curve, but our measurements of it are corrupted by noise. Our challenge is to discover the true nature of f (x ) and draw the curve it would follow if the noise weren't there. The method of least squares gives us a formal, mathematical way to do that. This method defines the optimal solution to be the one that minimizes the variance, as it's defined in Equation 4.
We place no restrictions on the nature of f (x ), except that it be single-valued . That is, for a given value of x , there must be one and only one value of y . Although the function is by no means restricted to polynomials, people often do assume a polynomial, simply because the math gives a closed-form solution.We could, for example, assume the first-order polynomial:
In which case, we'd be performing linear regression . Our example problem of finding the average of a set of numbers is equivalent to assuming f (x ) to be the degenerate, zeroth-order polynomial:
To first order and beyond
Now take a look at Figure 3 , in which I've applied the same batch averaging process to a new set of data.
Click on image to enlarge.
Did the method work? Absolutely. Did I get the mean and standard deviation? Yep. Are the results meaningful? Hardly. That's because, when we average a set of numbers, we're making the implicit assumption that f (x ) is the constant, degenerate polynomial of Equation 14. One look at Figure 3 tells you that this is a Bad Idea. If the figure is trying to tell us anything at all, it's that even the best math algorithms can't protect you from terminal stupidity.
The problem, of course, is that my assumption concerning f (x ) was not a good one. From the figure, it certainly seems that a first-order polynomial might give a better fit.
Now, throughout this series of columns, I've been saying that we seek to discover the nature of f (x ), and the method of least squares can help us do that. But come on, the method can't do everything for us. We have to make an educated guess as to the nature of f (x ). The goodness of fit depends on our making a reasonable that guess.
In other words, we need a model . In the averaging algorithm, that model is Equation 14. In linear regression, it's Equation 13. For the example data of Figure 3, the linear assumption seems better, but the method of least squares can't tell us that (at least, not directly). It can't make the decision for us. We have to suggest the model. The method can only give you the best values for the coefficients. In other words, it does only the parameter estimation.
There's another hugely important message implied by Figure 3: You need to draw that figure. We could have blithely applied the method of averaging, and we'd have gotten a solution. Without that figure, we might very well have accepted the results, and moved along, blissfully unaware that our assumption was an awful one, and the solution was dross. The old adage, “A picture is worth a thousand words,” was never more true than in the case of curve fitting. You have to see a graph of the raw data before you can make an educated guess as to the nature of f (x ). My advice: Never skip this step. Always plot the raw data before trying to fit it.
Parameters? What parameters?
Throughout this series of articles, you've heard me say that we seek the best fit for f (x ). The method of least squares gives us the optimal curve fit, in the sense that it minimizes the variance V .
But I just said that the method can't specify the model; We have to give it one. So if I have to specify the model, what's left for the method to optimize? The answer, of course, is the unknown coefficients in Equation 13. To clarify this point, we should probably make the dependence explicit by rewriting the equation in the form:
In my last column, I derived the equations for this case. We found that we could put the solution into a nice matrix form:
Applying this method to the data of Figure 3 gives the linear curve fit in Figure 4 .
Click on image to enlarge.
Is that beautiful, or what? Notice how much smaller the error band is, compared with Figure 3. I said earlier that the method of least squares can't choose the model for you, but it can certainly help. By trying both the constant model and the linear model, we can clearly see that the linear one has a better fit, since the error band is smaller.
You have to be careful when you do this, however. I can make the standard deviation zero simply by using a 31st order polynomial. That would give me an exact polynomial fit to the data, but it would be a horrible mistake. The fitted polynomial would pass exactly through every data point, oscillating wildly between them. That would violate our suspicion that f (x ) is a smoothly varying function. So perhaps I should modify my dogmatic statement to read: “The method of least squares can help you choose a model, within reason .” As usual, the method is no defense against stupidity.
The nicest thing about polynomial curve fits is that the math is easily extendible to higher orders. The only thing that changes is the size of the matrices, and the powers of xi in the sums. In the general case for a polynomial fit of order k , their formats are:
Note carefully that these patterns only apply for the case when f (x ) is a polynomial. For other cases, we can still use linear algebra, and the form of Equation 16, as long as the coefficients appear in f (x ) only as multiplicative constants. Remember that the things we're varying in the method of least squares are the coefficients, not the terms involving x . Thus the function:
Is still linear in the coefficients a , b , c , and d . So linear algebra still works. It's only the individual elements of Q and U that would change.
The sequential solution
OK, we've seen how to use linear algebra to produce a least squares fit, for the case of batch processing, where we processing the entire data set at once, after it's all collected. But can we still transform the algorithm to support sequential processing? Actually, yes we can, and the conversion isn't difficult. Take another look at Equations 19 and 20, and you'll see that every element of both matrices is a sum (noting that ). So all we need to do is to maintain these elements as running sums, and we can still produce a solution using only the past values of those sums, plus the one new data point.
Unfortunately, we still have to assemble Q and U each time. There is no algorithm to calculate the matrices Q n and U n , given Q n -1 and U n -1 . For that reason, we can't escape having to invert Q each time. The upside is, the order of Q is relatively small. It's only the order of the number of coefficients in f (x )–2 for a linear fit, 3 for a quadratic fit. While the computer time to generate the inverse used to be a cause for concern, it's not so much anymore, thanks to faster processors and floating point hardware.
As the cherry on top of the sundae, I've generated Figure 5 , which shows the sequential algorithm applied to a second-order curve fit. Pretty slick, huh? Notice that the error band is still pretty narrow, considering the amount of noise involved.
Click on image to enlarge.
As a matter of interest, for this example I used the function:
The least squares algorithm guessed 4.667, 0.359, and -0.009 for the coefficients. Perhaps they're not as exact as we might like, but still not bad, considering the amplitude of the noise and the small data set.
Why are these estimated values not equal to the original ones? Simply because they're better , in the sense that they produce a smaller variance based on the data given. To get closer to the true coefficients, we'd have to take more measurements.
The past is prologue
At this point I think we've pretty much exhausted the topic of least squares fits. Now it's time to move on to more challenging topics in estimation theory. When you think about it, it's really quite remarkable that we've been able to get as far as we have, without discussing the topics of probability theory and probability distributions at all. We even managed to define the standard deviation of noise, without mentioning its relationship to the normal distribution.
That's going to change. While it's possible to delve even further into the theory without mentioning probability distributions, to do so gets more and more awkward as we go. In my next installment, we're going to set aside curve-fit algorithms for a time, and focus on probability theory.
You might want to tighten your seat belts; it could be a bumpy ride.