Most of you readers know that my main focus in this column has been on the use of fairly high-powered math to solve practical problems. Over the years, we've covered topics ranging from fundamental functions to vector and matrix calculus to Fourier transforms to numerical calculus to control systems to digital filters, and lots of stuff in between.
Every once in awhile I get e-mail asking me, “Why are you focusing on these things, and what do they have to do with embedded systems?” Another perennial favorite is, “Why are you writing software for that, when it's already available in
| Jack Crenshaw's Estimation Series
Part 1: Why all the math?
Part 2: Can you give me an estimate?
Part 3: Estimation interruptus
Part 4: The normal distribution
To answer the second question first, this column is, after all, in a magazine called Embedded Systems Design . It's supposed to be about implementing embedded systems, and it's not likely that you're ever going to build an embedded system that has Matlab or another proprietary product built in. I'm not going to say it can't be done; it can, given a fast enough processor and enough money to pay the royalty fees. But the company who can get the same algorithms working in their own software, royalty-free, are going to be able to sell their product at a lower cost, and thereby eat your lunch.
As for the first question, it's true that embedded systems aren't what they used to be. The original embedded systems were missile guidance systems and were all about steering those missiles to targets, based on math models of their dynamics. Today, such embedded systems are in the minority. These days, embedded software is in cellphones, PDAs, set-top boxes–even microwave ovens and simple thermostats. Not exactly things requiring heavy-duty math.
Fine. If that's the kind of embedded system you're involved with, I can't help you. My focus has been, and always will be, on those embedded systems that do need the high-powered math.
Can it be possible that all the math algorithms I've covered in lo these seventeen years, have really been used in embedded systems. Yup, it can. I've not shown you one single algorithm that I've not applied myself, in one embedded system or another.
The least squares fit
This month, I want to talk about yet another math technique, called the Least Squares Fit . This won't be my first time to talk about it. I first covered the topic in 1993 (“The Least Squares Fit,” Embedded Systems Programming , April 1993, p. 36). In that article, I used a very different, and more academic, approach. I began with the most general and powerful derivation, then specialized to more practical cases. Such an approach feels very comfortable for an old college professor, but perhaps not so comfortable for the student, who has no idea where the prof is going.
This time, I'll still talk about the elegant math and the more general cases, but I'll begin with a very simple, practical, and useful special case, postponing the fancy math for later.
At its core, the least squares fit is all about drawing graphs. In particular, the two-dimensional, x –y graph sometimes called a scatter plot. I have a process in which I know or suspect that one parameter–the dependent variable –seems to depend on the value of another one, the independent variable . We can, of course, assign any names we like to these variables, but in this study I'll follow the usual convention and call the independent variable x and the dependent variable y .
Mathematically, we hint at the functional dependence by writing:
where f (x ) is some function to be either given later or discovered.
While we can use x –y graphs for a number of purposes, it's in the processing of experimental data that the least squares fit has its greatest value. For exanple, someone once told me that the period between the chirps of a cricket depends on the ambient temperature. If I'm sufficiently motivated, I might perform an experiment to find out. Or, if I were Galileo, I might measure the motion of a ball rolling down an inclined plane and seek to determine the nature of gravity.
If I plot the measured data, I might get a graph something like Figure 1 .
From the figure, I can see two things immediately. First, there does indeed seem to be some correlation between the x – and y -components of the data. The y -value seems to decrease as x increases. But I can also see that the data doesn't at all follow the nice, smooth curve that I might have expected.
There are two ways I might react to the data I've just plotted. If I haven't yet had my morning coffee, and my brain is not yet in gear, I might be forgiven for thinking, “Wow, I guess the functional relationship between x and y is a lot more complicated than I thought.” In that case, I might try to fit a curve that passes directly through every single data point, as in the red curve of Figure 2 .
Hopefully, though, I'll be a little more astute and recognize the data for what it really is: a simple functional relationship, corrupted by lots of noise. Given that level of noise, the best guess of the relationship is the yellow line in Figure 2.
But how did I get the yellow line? For me, it was pretty easy: in Excel, I clicked a button that said, “Add Trendline.” Because the data is so noisy, a straight line is about as good a fit as we can expect to get. Business types call this process “linear regression.”
And what, you ask, did Excel do to get the trend line? Why, it performed a least squares fit. Now I'm going to show you how you can do it yourself, in your own software.
If there is truly a linear relationship between x and y , it's described by some linear equation:
where a and b are constants with values we don't yet know. Our challenge is to determine those values that give the “best fit” to the experimental data. But to do that, we must first define what we mean, exactly, by the term “best fit.” And to do that , we need to establish a few definitions.
Remember, we already have experimental data, consisting of measured values of x and y , collected together as data points (x 1 , y 1 ), (x 2 , y 2 ), and so forth. Let the coordinates of the i th point be x i and y i . respectively.
Never forget that these coordinates are not, repeat not , variables. They aren't going to change. They're data, with specific values recorded in our data book.
This concept may take a little getting used to. Years of algebra have conditioned us to think of parameters with names like x and y to be unknowns, and parameters like a and b to be constants. That isn't the case here. While a and b certainly look like constants in Equation 2, they are in fact the things whose values we seek. They're the unknowns.
Now, for every value x i there are actually two values of y . First, there's the measured value, y i . That, remember, is a constant. Then there's the value y would have, if it were generated by Equation 2. In general, these two values will be different, which is another way of saying that the measured data has an error, given by:
For our straight-line function:
We need one last concept to complete the formulation. Remember that we typically have anywhere from a handful to thousands of data points, but we are looking for the values of only two parameters, a and b . It should be pretty obvious that no data point is any more important than any other (although we can adjust the weighting of them if we choose to). That is, there's not usually anything that requires that the line pass through the first point, or the last, or any other. Whatever criterion we're going to use for “best fit,” it's probably going to be a bulk property that involves all the data points.
To that end, we're going to define a penalty function based on the individual errors. That penalty function is the sum of the squares of all the errors:
where n is the total number of data points. We'll assert that the best-fit values of a and b are those values that minimize M .
I can show you the general solution in all its elegance (and it is indeed elegant). But first I'd like to walk you through a simple example, involving only four data points. The data is shown in the first two columns of Table 1 . In the remaining columns, I'm showing the values of f (x i ) and e i .
The value of M is the sum of the squares of the terms in the last column.
Expanding and collecting terms gives:
This is the function we seek to minimize. Does it even have a minimum? A little reflection will show that yes, it does. If I make either a or b huge positive or huge negative numbers, M itself will be huge, Since it's a sum of squares, it's always positive. So somewhere between huge, zero, and huge again, there must always be a minimum. This is, of course, hardly a rigorous proof, but it works for me.
To find the minimum takes a little calculus, but only a little. If I take the partial derivatives of M with respect to a and b , I get:
These partials represent the slopes of M in the a and b directions. When M is at its minimum value, both slopes must be zero. So we end up with two equations to solve simultaneously:
At this point, I'll spare you the linear algebra, and give you the answer, which you can check by direct substitution:
Figure 3 shows the fitted data. From this graph, it appears that a straight line is not a very good fit to the data–we might have been better off with a quadratic. But for the straight line that we assumed, this certainly looks like the best fit we could hope for.
Now that we've seen how to do the math for a simple 4-point example, let's try to generalize it. To do so, we're going to have to lean on summation notation, but if anything, it comes out much cleaner than the math of our example. Take another look at Equation 5:
Substituting from Equation 4 gives:
Expanding the polynomial gives:
As we expand into individual sums, remember that a and b are the same for all values of i , so we can extract them from the sum operators. Only the terms involving x i and y i actually get summed. To simplify the notation, I'm going to drop the summation range info, except for the first sum. We get:
The value of that first sum is simply n , the number of data points.
As in the example, we seek the minimum of M by taking its derivatives with respect to a and b . We get:
Setting both of these to zero (and dividing both by two) gives our two linear equations in a and b :
We have one more bit of elegance to introduce, and that's to note that, since the equations are linear in a and b , a matrix formulation is appropriate. Let:
Then we have the lovely solution:
After all the math we needed to define the “best fit” criterion, the final result is satisfyingly simple. Only four of the summations are unique. Yes, we have to invert the matrix Q , but even that inversion is easy, because no matter how many data points we have, the matrix is always only a 2×2 matrix. The 2×2 matrix is unique in that, unlike its larger cousins, we can write down the inverse without any numerical tricks. First, the determinant is:
Then the inverse is given by swapping the two diagonal terms, negating the other two, and dividing by D .
It's important to note that Q doesn't depend on the measured y -values at all; only on the x 's. In the general case, this fact doesn't matter much. But it has huge implications if I know the values of the x 's in advance.
You'll note that in my simple example, I used values of x i that were evenly spaced, starting with x 1 = 0. This is not at all required. Equations 16 through 18 work just as well if the x 's are unevenly spaced. They don't even need to be in any kind of order. But in many cases, particularly in embedded work, I do know that the data is going to be measured at regular intervals. What's more, with a little judicious scaling and base-shifting, I can always make the x -values start with [0, 1, 2, 3, …]. In this important special case, Q never changes. It's a constant we know in advance. Therefore we also know its inverse in advance, and the need to invert the matrix goes away completely.
Suppose, for example, that I'll always be applying the algorithm on the last five measurements; sort of a souped-up moving average. Then the x -values will always be (again, shifting and scaling as needed):
The values of the sums are:
The determinant is:
and the inverse is:
Of course, it's a constant matrix, so we can encode it as such, using whatever scaling we find to be appropriate.
Now let's look at the terms in U . These, you'll recall, do involve the measured values of the y 's, so we can hardly compute them in advance. On the other hand, u 1 depends only on the y 's, not the x 's. We can compute this one recursively, using the equivalent of a moving average (or, more properly, sum). Just add the one new data point and drop the oldest one.
Let me make this last point crystal clear. Suppose the last measurement I've processed was y k . After I've processed that data, the value of u 1 will be:
Now suppose that I get one new data point, y k +1. The new value of u 1 is:
Here I'm using the '+' superscript to indicate the updated version of u 1 . Looking at the two values, it's clear that:
You can implement this algorithm very efficiently by keeping the y 's in a circular buffer. Subtract the oldest value, replace it with the newest one, and add that one to the sum.
Computing u 2 , the second term in U , is trickier because of our need to deal with the new offset in the x -values. Before the update, u 2 will be:
After processing the new data point, it's going to be:
Subtracting the “before” and “after” forms tells us what we must do to update the summation. We get:
Now look at the sum inside the parentheses in Equation 32. It looks suspiciously like the sum in u 1 + . In fact, we can write:
Therefore, I can write:
So, as it turns out, we can get the “harder” sum in U without doing a summation at all. Just take the old value, add five times the newest y -axis data value, and subtract the new value of u 1 . Piece of cake.
Now that we have expressions for both Q and U for this special case, we can write down the scalar expressions for a and b . They are:
This is the way I implemented a real-time version of the least squares fit, some years ago. Trust me, you will not find a simpler implementation anywhere on the planet.
Of course, you may choose to use a different number of data points in your implementation. But the process is the same; only the values of the summations change.
The general case
All the analysis you've seen here so far has been for the special case where we assume a linear polynomial for f (x ). You might think that if we want to fit higher-order polynomials, we'll have to abandon the linear algebra used so far. But we don't. The order that matters is not the order of the polynomial, but the order used in the error penalty function. That's always going to be order 2 (hence, least squares ). If we try to fit higher-order polynomials, we'll have more unknown constants to solve for, so we'll need larger matrices. But we still get to use linear algebra. For the general case, the matrices Q and U look like this:
where k is the order of the polynomial being fitted.
You can see the pattern. The elements of Q are equal down the minor diagonals, and are the sums of various powers of x i . The elements of U have sums involving y i times some power of x i . All the relationships we've seen so far, including the fact that Q depends only on the x 's, and can sometimes be computed in advance, still hold. The only difference is that, for polynomial orders greater than one, the matrix inversion becomes nontrivial.
Remember again, though, that if the data is measured at a fixed rate, as is often the case with embedded systems, the value of Q , and therefore its inverse, doesn't change.
Can we use the concept of least squares fits with functions other than polynomials? You bet we can. Whenever you can express the fitting function as a linear combination of primitive functions, you can still apply the least squares criterion. That is, if your trial function has the form:
You can apply the least squares method of Equations 11 to 16 as usual. You'll still get equations in which the derivatives are linear in the constants, so you'll still get a linear algebra solution like Equation 19. It's just that the summations will not be so easy to evaluate and will require calls to the primitive functions.
Even more generally, we aren't absolutely required to restrict the unknown constants to the role of linear coefficients. Any function f (x , a , b , …) will do. In that case, however, the partial derivatives are liable to be messy, and it's likely that you won't be able to solve the resulting simultaneous equations for the minimum.
On the other hand, with the computing power available to us these days, and powerful numerical optimization methods, we can still hope to determine the optimal values of the unknown constants using some numerical optimization method.
And yes, now that you ask, I've implemented this kind of least squares fit in embedded real-time software, too.
Why least squares?
After all is said and done, you might be wondering if the definition of the penalty function of Equation 11 is the best one. How did we come to decide that the least squares criterion is the right one for a “best fit,” anyhow?
The answer is both simple and incredibly profound. It involves the secret to Life, the Universe, and Everything.
Albert Einstein is widely quoted as saying, “Simplify as much as possible, but no more.” Ockham's razor says that, given two competing theories, the simpler one is probably the correct one. And indeed, the universe does seem to be a simple place, in the sense that most of the equations describing it are themselves simple, usually involving linear relationships. Acceleration is proportional to force. Voltage is proportional to the rate of change of current.
If a linear relationship doesn't do, a quadratic or inverse-square relationship usually does. The electrostatic force is inversely proportional to the square of the distance. Einstein himself was struck by the apparent simplicities in nature and noted that we had no right to expect such a result. But we have it just the same.
So as we look for a definition of “best fit,” we can feel justified in seeking a simple definition. There are candidate criteria even simpler than the sum-of-squares criterion. For example, we could try minimizing the sum of the errors instead of their squares. Or we could minimize the sum of their absolute values.
Neither of these criteria work. If I define M as the sum of errors and then move the fitting line up and down by changing a , I simply add na to M . There is no minimum value in M . On the other hand, if I use absolute values, moving the line upward increases errors below the line, and decreases the ones above it. Again, there's no direct relationship between a and M that can lead to a minimum.
There is one possible candidate that could work. This is the so-called minimax solution, in which we seek to minimize the magnitude of the largest error(s). Such an approach usually gives a solution in which two or more of the largest errors are equal in magnitude.
| Jack Crenshaw's Estimation Series
Part 1: Why all the math?
Part 2: Can you give me an estimate?
Part 3: Estimation interruptus
Part 4: The normal distribution
The problem with the minimax approach is that it's not continuous. We got good results from the use of the partial derivatives, by assuming M was a smooth function with continuous derivatives in the vicinity of the minimum. That won't be true for the minimax case. As we move the fitting line around, the points with the largest errors will tend to jump around also, and in somewhat unpredictable ways. A computerized search could still produce a minimum, but a calculus-based minimization won't work.
So here's the big secret behind the least squares fit, and it's also the secret behind Life, the Universe, and Everything:
It's the simplest definition that works.
Simplify as much as possible, but no more. Words to live by.
Jack Crenshaw is a systems engineer and the author of Math Toolkit for Real-Time Programming. He holds a PhD in physics from Auburn University. E-mail him at firstname.lastname@example.org. For more information about Jack click here