that the guy's only doing it for some doll —
Stubby Kaye and Johnny Silver, Guys and Dolls, 1955
This column is the fourth in a series on parameter estimation, leading up to the justly famous Kalman filter. The discipline is based on the fact that our knowledge of the state of any real-world system is limited to the measurements we can take — measurements that are inevitably corrupted by noise.
Our challenge, then, is to determine the true state of the system, based on these imperfect measurements.
In previous columns, I've discussed parameter estimation from the context of curve fitting, taking a graphical approach to arrive at the method of least squares. The general idea is to take more measurements — usually many more — than the minimum needed to determine the system state. Then you crank the data through an algorithm that mitigates the effect of noisy data.
Jack Crenshaw's Estimation Series
Part 1: Why all the math?
Part 2: Can you give me an estimate?
Part 3: Estimation interruptus
Part 4: The normal distribution
The method of least squares is inherently a batch processing sort of method, where you operate on the whole set of data items after they've all been collected. But I showed you how to convert the algorithm to a sequential process that's far more suitable for real-time processing.
Of course, the whole point of the method of least squares is to smooth out noisy measurements. But we've never addressed the nature of the noise itself. We even estimated statistical parameters like mean, variance, and standard deviation, without ever defining these terms.
That has to change. In this column, we're going to look noise in the eye, and deal with its nature. We'll discuss the behavior of random processes, introducing notions like probability and probability distributions. For reasons that will become clear, we'll focus like a laser on a thing variously called the bell curve, Gaussian distribution, or normal distribution.
Now, I've been dealing with problems involving the normal distribution for many decades. But to my recollection, no one ever derived it for me. They just sort of plunked it down with little or no explanation.
This would usually be the place where I'd start deriving it for you, but I'm not going to do that either. The reason is the same one my professors had: The classical derivation is pretty horrible, involving power series of binomial coefficients.
Instead, I'm going to take a different approach here. I'm going to wave my arms a lot, and give you enough examples to convince you that the normal distribution is not only correct, but inevitable.
The nature of noise
Whenever I measure any real-world parameter, there is one thing I can be sure of: the value I read off my meter is almost certainly not the actual value of that parameter. Some of the error may be due to flaws in the instrument — scale factor and bias errors, linearization errors, quantization errors, temperature sensitivities, etc.
Given a thorough enough calibration process (and enough patience), I can compensate for those kinds of errors. But there's one kind of error I have no chance of correcting for: the random noise that's always going to be there, in any real-world sensor. Whenever I try to measure some state variable x, I don't get its true value; I get the measured value
$$\tilde{x} = x + n \tag{1}$$
where n is the noise, changing randomly with time.
If I take only a single measurement, I can say nothing at all about the true value of x, because I have no idea what the instantaneous value of n is. But there is a way we can make an educated guess as to the value of x: measure it more than once. That's the whole idea behind the method of least squares.
In my previous columns about the method of least squares, we were focused on determining the (possibly changing) value of x. We didn't care about the nature of n, except that we wished it would go away. But, as you may recall, we did calculate parameters describing the nature of the noise: parameters like the mean, variance, and standard deviation. Why would we calculate these parameters, if we're trying to make the noise go away?
The answer is, we have this nagging suspicion that, by understanding the nature of the noise, we may be able to make a better estimate of x . At the very least, we can use these parameters to calculate error bounds for the measurements.
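To make the idea concrete, here is a minimal sketch of measuring the same quantity many times and summarizing the noise. The true value, the noise level, and the choice of Gaussian noise from Python's `random.gauss` are all assumptions made purely for illustration; the helper name `measure` is hypothetical.

```python
import random
import statistics

def measure(true_x, noise_sigma, rng):
    """Return one noisy measurement: the true value plus random noise n."""
    return true_x + rng.gauss(0.0, noise_sigma)

rng = random.Random(42)   # fixed seed so the run is repeatable
TRUE_X = 5.0              # hypothetical true state (unknown in practice)
NOISE_SIGMA = 0.5         # hypothetical noise level

samples = [measure(TRUE_X, NOISE_SIGMA, rng) for _ in range(1000)]

mean = statistics.fmean(samples)
variance = statistics.variance(samples)    # sample variance (n - 1 divisor)
std_dev = statistics.stdev(samples)
std_error = std_dev / len(samples) ** 0.5  # rough error bound on the mean

print(f"estimate of x: {mean:.3f} +/- {std_error:.3f}")
print(f"noise variance: {variance:.3f}, standard deviation: {std_dev:.3f}")
```

A single sample can land anywhere within the noise, but the mean of many samples settles close to the true value, and the standard deviation gives a usable error bound.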
If there's any one thing that sets the Kalman filter apart from all other approaches, it's the fact that it doesn't just maintain a running estimate of the noise parameters; it uses them to get a better estimate — an optimal one, in fact — of the state variables.
That being the case, we find ourselves highly motivated to learn more about noise.
Where does noise come from, anyhow? For our purposes, we'll say that it comes from some random process , which runs independently of the dynamic system.
But what does that term mean, exactly?
That question has an easy answer. A random process is one whose output is unpredictable. In fact, “unpredictable” is very nearly a synonym for “random.”
So in our imagination, at least, we can envision some random process (a physical machine, if you will) that's free-running in our otherwise pristine, state-of-the-art embedded system. The random process has no apparent purpose except to muck up that system.
But since that process is, by definition, unpredictable, how can we get a handle on its behavior? One thing's for sure: Although you might get some insight using a spectrum analyzer, you're not likely to learn much by watching white noise marching across the face of your oscilloscope.
If we're going to learn anything at all about the noise generated by a random process, we need to be able to study its outputs, one by one. We need the equivalent of a software break point, that will let us freeze the process in place. We need a single-step “randomizer” that gives us one, and only one, new value every time we push its “go” button.
Fortunately, suitable candidates are all around us.
As a kid, I played board games like Monopoly, Risk, or the Game of Life. Such games depend on the use of “randomizers” like coins, dice, or spinning arrows.
Or, come to think of it, spinning bottles. The unpredictability adds to the enjoyment of the game.
These primitive game randomizers have three things in common.
- When “activated” by a flip, roll, or spin, they must eventually come to rest in a stable state, so we can read the results.
- There must be more than one possible end state. Multiple values are sort of the point of the whole thing.
- The results must be unpredictable.
I should note that, because they are all mechanical devices, none of these gadgets are truly random. Being mechanical, they all obey Newton's laws of motion. If we could know, in advance, not only the air density, temperature, etc., but also the impulse given by the player's toss, roll, or spin, we could theoretically predict exactly what the end state would be.
The thing that makes the gadgets useful as randomizers is that we can't and don't know these things. We trust the devices to be fair because we assume that no ordinary human is so skilled that he can flip a coin, and make it come up heads every time.
Don't bet the farm on that last assumption. People can do remarkable things, especially if there's money to be made. Even so, we can analyze the nature of the devices by assuming that the results are both fair and random.
What are the odds?
A good definition for the word “probability” is hard to find. The ones I've found use synonyms like likelihood or chance, so the definition is circular. Fortunately, most of us have an innate understanding of the concept. The probability that the Sun will come up tomorrow is pretty high. The probability that I'll hit the Powerball jackpot, be asked for a date by Kate Upton, and be hit on the head by a meteorite — all in the same day — is vanishingly small.
If you really want to learn about probability, you don't need to go to Yale or Harvard. You only need to study under a professional gambler, like those fictitious, Runyonesque characters in Guys and Dolls . Nobody can compute probabilities in his head like a gambler. Only he calls them “the odds.”
Ask one of these guys what the odds are for a tossed coin coming up heads, and he'll immediately say “50/50.” That's his way of saying that there is no preference for one result over the other. Tossed many times, the coin will come up heads, on average, half the time. We'd say the probability is 1/2. That's 50% to the gambler.
Faced with the same question, a mathematician might define a probability function:
$$P(\text{heads}) = \tfrac{1}{2}, \qquad P(\text{tails}) = \tfrac{1}{2} \tag{2}$$
Since the two probabilities are equal, we say that the distribution is uniform. The gambler would say that the coin is fair.
The roll of a single die is also fair. Except for the number of dots, all the faces are made just alike, so there's no reason to suppose that one of them will come up more often than any other. The gambler would say that the odds of a tossed die showing, say, six, are 1 in 6. The mathematician would write
$$P(n) = \tfrac{1}{6}, \qquad n = 1, 2, \ldots, 6 \tag{3}$$
The sets of probabilities in Equations 2 and 3 are called probability distribution functions. For these two cases, the functions are discrete, having values only at the integer mesh points. Try as you might, you're never going to roll some dice and get a value of 3.185295. For all non-integer results, the probability is 0.
The probabilities of my sun vs. Kate examples are:
From these few and sometimes silly examples, we can get an idea as to what a probability really is. It must be a single scalar number that represents the likelihood that some event will happen. What's more, the value must be constrained to the range
$$0 \le P \le 1$$
because no event can happen less frequently than never, or more frequently than always.
Note carefully that the probabilities in Equations 2 and 3 add up to 1. When you flip a coin, you must get some result, and the result can only be heads or tails. Landing on its edge is not allowed. Similarly, when you roll a single die, getting a value from 1 through 6 is certain.
On the basis of this sometimes arm-waving argument, we can now give a rigorous definition of a probability. It's:
(6)
On a roll
Let's perform a little thought experiment. I'm going to roll a single die six times, and count how many times each face shows up. The result is shown in Figure 1 , in the form of a histogram .
What's that you say? You were expecting to see each face appear once and only once? Well, that's what you'd get if the results were predictable, but they're not. It's a random process, remember? The chances of getting one and only one occurrence of each face are about:
$$\frac{6!}{6^6} = \frac{720}{46656} \approx 0.015$$
If we roll the dice a lot more times, we should get a histogram more like what we expect. Figure 2 shows the result of 6,000 rolls.
Even with so many trials, we still see slight differences between the ideal and actual results. But at this point it probably won't take much to convince you that, on average, the numbers of occurrences are equal. The die is indeed fair, and the probability of rolling any given value is 1/6.
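The thought experiment is easy to repeat in software. The sketch below uses Python's `random` module as a stand-in for a fair die (an assumption; the figures in this column came from a different tool), and the seed is arbitrary, so your counts will differ from the figures. The helper name `roll_histogram` is hypothetical.

```python
import math
import random
from collections import Counter

def roll_histogram(num_rolls, rng):
    """Roll one fair die num_rolls times; return the count for each face 1..6."""
    counts = Counter(rng.randint(1, 6) for _ in range(num_rolls))
    return [counts[face] for face in range(1, 7)]

rng = random.Random(1)  # arbitrary seed

few = roll_histogram(6, rng)
many = roll_histogram(6000, rng)
print("6 rolls:   ", few)
print("6000 rolls:", many)
print("relative frequencies:", [round(c / 6000, 3) for c in many])

# Chance that six rolls show each face exactly once: 6! orderings out of 6**6.
p_all_different = math.factorial(6) / 6**6
print(f"P(each face once in six rolls) = {p_all_different:.4f}")
```

With only six rolls the histogram is usually lumpy. With 6,000 rolls each relative frequency sits close to 1/6.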
More dice, please
So far, the graphs I've shown are rather dull. Six numbers, all equal, are not exactly likely to get your blood pumping. But things get a lot more interesting if you add more dice. Figure 3 shows the histogram for two dice.
Now we're getting somewhere! At last, the histogram has some character.
Why are there more occurrences of a seven than a 2 or 12? The answer goes right to the heart of probability theory. The unstated rule for a roll with two dice is that we add the values of the two dice. When we add them, there can only be one way to get a result of 2: each die has to show a value of 1. Ditto for a sum of 12. But there are six possible ways to get a sum of 7. You can have:
1 + 6, 2 + 5, 3 + 4, 4 + 3, 5 + 2, 6 + 1
All six ways must be counted, and the order matters: 1 + 6 and 6 + 1 count as two different ways, not just one. As we can see from the histogram, a result of 7 is six times as likely as a result of 2 or 12.
If you add the heights of all the bars, you'll get a total of 36. That makes sense; you can arrange the first die in six possible ways. For each of those ways, you can arrange the second in six ways. The total number of ways must be:
$$6 \times 6 = 36$$
Our gambler friends would say the odds of a 7 are 6 out of 36, or 1 out of 6.
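The counting argument can be checked by brute force. This sketch simply enumerates all ordered pairs of faces and tallies their sums. It is an illustration of the counting, not the method used to draw the figure.

```python
from collections import Counter
from itertools import product

# Enumerate every ordered pair (first die, second die) and tally the sums.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in range(2, 13):
    print(f"{total:2d}: {'#' * ways[total]} ({ways[total]})")

total_ways = sum(ways.values())
print("total ways:", total_ways)
print("P(7) =", ways[7], "/", total_ways)
```

The printed bars form the same triangle as the two-dice histogram, with a single way each for 2 and 12 and six ways for 7.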
Well, if two dice make for a more interesting histogram, maybe we should try three or more. Figures 4 through 6 show the results for three, four, and five dice, respectively.
What are we looking at here? Can we say “bell curve”?
Let's review the bidding. We started this thought experiment with the simple statistics for a single die — statistics which happen to describe a uniform distribution , in which all outcomes are equally likely.
From that simplest of beginnings, we added more dice, always following the rule that the result is the numerical sum of the faces shown on each die. We watched the shapes of the histograms morph from the uniform distribution through the triangular shape of Figure 3 into the familiar bell-curve shape. It's really quite remarkable that we not only got a histogram of this shape from such primitive beginnings, but that the bell-curve shape begins to appear with so few (three to five) dice.
But if you think that's remarkable, wait till you hear this: We would have gotten the same shape for any starting distribution! All we need is some device that produces at least two numbers at random, and the rule that we get the score by adding the individual results.
This truly remarkable result follows from the central limit theorem .
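As a hedged illustration of that claim, the sketch below starts from a deliberately lopsided two-outcome device (the probabilities 0.7 and 0.3 and the count of 50 are invented for the example) and builds the exact distribution of the sum by repeated convolution. It then compares the result with a bell curve of the same mean and variance, borrowing the general formula that appears later in this column.

```python
import math

def convolve(a, b):
    """Distribution of the sum of two independent discrete variables."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            out[i + j] += pa * pb
    return out

# Hypothetical lopsided device: scores 0 with probability 0.7, 1 with 0.3.
single = [0.7, 0.3]
N = 50  # number of independent devices whose scores are added

dist = [1.0]
for _ in range(N):
    dist = convolve(dist, single)

# Mean and variance of the sum, computed from the distribution itself.
mean = sum(k * p for k, p in enumerate(dist))
var = sum((k - mean) ** 2 * p for k, p in enumerate(dist))

def bell(x):
    """Bell curve with the same mean and variance as the sum."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

max_gap = max(abs(p - bell(k)) for k, p in enumerate(dist))
print(f"mean={mean:.2f} variance={var:.2f} largest gap={max_gap:.4f}")
```

Even though the starting distribution is far from uniform, the largest gap between the exact probabilities and the bell curve is small compared with the peak height of roughly 0.12.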
No doubt you've already figured out that the shape that these histograms seem to be trending to is the shape of the normal distribution. Now you can see why the normal distribution is so ubiquitous in nature. It's because you almost never see (except in board games) a single source of the noise. Usually the noise is generated by many random processes, all running independently of each other. As long as the outputs of the many sources are added together (as they would be in, say, an electronic circuit), the normal distribution is the inevitable result.
Computing the odds
It's time for a little math. Let N be the number of dice for a given experiment. The smallest value we can get from the throw comes when all the dice are showing 1's, so the score must be N. Likewise, the largest value must be 6N.
Recall that, in Figures 3 through 6, the height of each bar is the number of ways a throw can generate a given result. Let the height for bar n be w_n. Then, borrowing the form of Equation 6, we can write:
Where, don't forget, W is the total number of ways we can arrange N dice. It's the sum of all the w's.
I should say a word about the range. We get the smallest non-zero result when all the dice are showing 1's. Likewise, we get the largest result when they're all showing 6's. You can always make the range wider if you like (even ±∞), since the w's are all zeroes outside the range shown. From now on, I won't show the range explicitly.
Since W is also the total number of ways we can arrange N dice, we know that it must be:
$$W = 6^N$$
The first few values are:
You can verify the smaller numbers for yourself — just count the number of 1-unit “bricks” in each column. I have to admit that I let the computer verify the larger numbers.
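If you'd rather let the computer do all of the counting, here is one possible sketch. It builds the number of ways for each score one die at a time, then checks that the total is 6 to the power N. The function name `ways_for_dice` is hypothetical, and this is not necessarily how the histograms in the figures were produced.

```python
def ways_for_dice(num_dice):
    """Return {score: number of ways} for the sum of num_dice fair dice."""
    ways = {0: 1}  # zero dice: exactly one way to score 0
    for _ in range(num_dice):
        new_ways = {}
        for score, count in ways.items():
            for face in range(1, 7):
                new_ways[score + face] = new_ways.get(score + face, 0) + count
        ways = new_ways
    return ways

for n in range(1, 6):
    w = ways_for_dice(n)
    total = sum(w.values())
    assert total == 6 ** n  # the bar heights add up to 6 to the power N
    print(f"N={n}: scores {min(w)}..{max(w)}, W={total}, peak={max(w.values())}")
```

Dividing each count by W turns the histogram for N dice into a set of probabilities.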
We need a graph
Armed with Equation 10, it's easy enough to convert Figures 3 through 6 into probability charts. I won't bore you by showing them here; they'll look just like the histograms, except for the vertical scale. I'll only note that the sum of all the probability bars must be:
$$\sum_n P(n) = 1$$
as must be the case for any decent probability function (did this surprise you?).
Instead of repeating the bar graphs, I'd like to corral all the probability distribution functions into a single graph. To do that I have to switch from Excel's bar-chart format to the x-y (scatter) plot format. Figure 7 shows the result.
Well, we did manage to get all the curves on the same graph, and they look very nice, don't they? Let me remind you, though, that the curves don't really exist at all — they're still bar graphs in disguise. When you look at the figure, your brain naturally sees continuous curves. Remember, though, that we're still talking discrete values here. The only data items occur at the marker points on the graph. The lines are there only to show which marker points are in which group.
I suppose I could have emphasized the integer-only nature of the data by leaving off the connecting lines, but trust me: that graph looks even more confusing.
Normalizing the x-axis
Looking at the “curves” in Figure 7 , the transition from straight lines to swoopy curves is hard to miss. But the transition would look even more convincing if we could get the curves on the same horizontal scale. We can certainly do that, by scaling the x -values to a specified range. But the scaling is a little tricky, mainly because there are no mathematical rules as to how to do it. I just made some arbitrary decisions, which were:
1. The peak should be centered at x = 0, as any good error curve should be
2. The horizontal scale should range from -1 to +1
3. Each curve should start with one and only one zero (not the four, say, of the five-dice case)
Let's see how the scaling works out. Table 1 shows the essential x-axis range information for each “curve.” The Min and Max columns include the single zero values.
From the table, we can write some equations. If N is the number of dice, then:
So to scale to the range -1..1, we need the scale factor:
After scaling, we can translate the x -values left by subtracting 1. The scaled results are shown in Figure 8 .
Did we get it right?
What's that you say? Something looks wrong? You were expecting curves enclosing equal areas?
Ah, there's the rub. You're still looking at the graph and seeing continuous curves — a mistake that's even easier to justify, considering that the new x -values are no longer integers. But, as before, Figure 8 is really still a bar chart in disguise. The data still only exist at the marker points.
You're expecting curves enclosing equal areas because you're jumping ahead of me. You know that all the probabilities for a given “curve” should add up to 1, and you're thinking “integral under the curve.” But, of course, there is no integral under the curve (yet), because there's still no curve — only the lone data points where the markers are. Or, equivalently, the heights of the bars in the bar charts.
What if I want areas?
As we've discussed, the sum of the probabilities for all possible results must add up to unity. And indeed they do, in Figure 8 . If, for a given “curve,” you add up all the y -axis values at the marker points, you will absolutely get 1.0.
But what if we actually prefer the “curves” to be shown such that they enclose equal areas? In that case, we have one more transformation to perform.
Figure 6 gives us a hint. In that graph, the width of each bar (call it Δx) is exactly equal to 1. So the area of the bar is the same as its height. Mathematically, we can say:
In this special case, adding up all the occurrences is the same thing as computing the total area, which had better come out to be 7,776.
The situation is the same in Figure 8 . Now the graph is showing probabilities, but the width of the (not very apparent) bar is still unity, so the area of the bar is:
Adding them all up, we should get:
Why would I want to include this new parameter, Δx , if its value is unity anyhow? For two reasons. First, the variable x may have units, like meters, volts, or pomegranates. The parameter Δx might have the value 1, but it will still have the same units as x . Mixing parameters with and without units is not allowed.
More importantly, I just got through scaling the curves to force them onto the same horizontal range. In doing so, I multiplied by the scale factor given in Equation 16 . Now I see that this scale factor is in fact the very same thing as Δx . In the figure, you can see that the marker points are getting closer together as we add dice to the experiment. As in Equation 19 , the total area is no longer unity, but Δx .
To force the curves to have areas of unity, I have to divide the y -values by Δx again. Since these values are no longer probabilities, I'll just call them y . Figure 9 shows the results.
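The two transformations can be sketched in a few lines. The exact padding is an assumption on my part: I take the range for N dice to run from N - 1 to 6N + 1 (one zero-probability point on each side), which gives a point spacing of 2 divided by that span. The function names are hypothetical.

```python
def dice_probabilities(num_dice):
    """Return {score: probability} for the sum of num_dice fair dice."""
    ways = {0: 1}
    for _ in range(num_dice):
        new_ways = {}
        for score, count in ways.items():
            for face in range(1, 7):
                new_ways[score + face] = new_ways.get(score + face, 0) + count
        ways = new_ways
    total = 6 ** num_dice
    return {score: count / total for score, count in ways.items()}

def scaled_curve(num_dice):
    """Map scores onto -1..+1 and rescale heights so the area is 1.

    Assumption: the range is padded with one zero-probability point on
    each side, so it runs from N - 1 to 6N + 1.
    """
    probs = dice_probabilities(num_dice)
    lo, hi = num_dice - 1, 6 * num_dice + 1
    dx = 2.0 / (hi - lo)                  # spacing between scaled points
    points = []
    for score in range(lo, hi + 1):
        x = (score - lo) * dx - 1.0       # scale, then shift left by 1
        y = probs.get(score, 0.0) / dx    # divide by dx so areas, not sums, are 1
        points.append((x, y))
    return dx, points

for n in range(2, 6):
    dx, points = scaled_curve(n)
    area = sum(y * dx for _, y in points)
    peak = max(y for _, y in points)
    print(f"N={n}: dx={dx:.4f}, area={area:.6f}, peak height={peak:.3f}")
```

Under these assumptions every curve has unit area, and the peak height grows as dice are added while the sides pinch in.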
Now you see it …
Now here's a graph we can learn to love. Now that we have equal areas under each curve, we can see more clearly how they morph to look more like continuous curves. Not only do the (apparent) curves get smoother as we add dice, but the peak also gets higher, while the sides pinch in to maintain the equal area requirement.
But hang on … is that a fifth curve I spy? According to the legend, the dotted black line is something called “Normal .” Unlike the other “curves,” it's a truly continuous curve.
That, my friends, is the normal distribution function . It's taken us awhile to get to it, but the evidence of Figure 9 is overwhelming. If, seeing Figure 9 , you still aren't convinced that the sum of separate random processes trends to the bell curve of the normal distribution, there's no hope for you.
Sum vs. integral
Before we go forward, I want to call your attention to a very important aspect of Figure 9 . As you know, the two-dice through five-dice “curves” are not really curves at all, but discrete functions, with y -values that only exist at the marker points. But the curve labeled “normal” is very much a continuous curve.
It's not often that you get to see both discrete and continuous functions on the same graph. How did we do this?
The answer becomes clear when you compare the area under the curves. When I scaled the y -axis values to force the areas for the discrete curves to be unity, I required:
For the continuous curve, I require:
See how the two formulas complement each other? For the discrete version, we're measuring the area of a bar whose height is P (n ), and whose width is Δx . Similarly, for the continuous function p (x ) we get the area by integrating it over all real numbers. So what is this new function p (x )?
Well, it's a probability all right, but it's not just a probability that a measurement is exactly the same as the x-axis value. Since x can range over all numbers, the probability that the result is exactly equal to x is zero.
Instead, p(x) dx is the probability that the measurement falls into an infinitesimally narrow range, between x and x + dx. The function p(x) itself is a probability density.
The math of it all
Now that you've seen the curve, I still must show you the math behind it. Here again, I'm given the opportunity to derive the math from first principles. But I'm going to duck it again. As I mentioned earlier, the classical derivation is pretty horrible. If you'd like to see it done the easy way, see the exquisite paper
The Normal Distribution: A derivation from basic principles , Dan Teague, North Carolina School of Science and Mathematics
To learn all there is to know about the normal distribution (including its origin, inspired by a gambler), see the exhaustive study by Saul Stahl:
“Evolution of the Normal Distribution,” Saul Stahl, Mathematics Magazine , Vol. 79, No. 2, April 2006, pp. 96-113
As for my “derivation,” I'm going to follow the example set by J. Willard Gibbs, the father of statistical mechanics, circa 1900. He said (and I paraphrase), “We use this form because it's the simplest one we can think of, that works.” Now, that's my kind of physicist!
Take another look at the shapes in Figure 9 . There are a lot of things we can say about them, without knowing anything about the mathematical formula underlying them. Indeed, if we'd been clever enough, we could have said these things from the outset. These things are:
• The most probable value of x (the peak of the distribution) should be zero
• The distribution should decrease monotonically as x moves away from zero
• The function should be symmetric around zero
• It should tail off to zero at the extremes (which are ±∞)
As soon as you hear the words, “tail off to zero,” you should be thinking of an exponential function. One function that does this is:
$$f(x) = e^{-x} \tag{22}$$
But that one's no good, because it's not symmetric. In fact, it grows to infinity as x goes more and more negative.
So what's the next simplest function we can think of? Why, it's the one that doesn't care if x is positive or negative:
$$f(x) = e^{-x^2} \tag{23}$$
This is the function Sir Willard used, and if it's good enough for him, it's good enough for me. Figure 10 shows the function in all its glory.
That's definitely the shape we want. We still have to add some bric-a-brac to make it functional, but the shape is perfect.
By now we should be very comfortable with the fact that any probability distribution curve must enclose an area equal to unity. Does this one? Let's find out. The area under the curve of Figure 10 is:
$$A = \int_{-\infty}^{\infty} e^{-x^2}\,dx \tag{24}$$
Did you see that I had to integrate from -∞ to +∞, which is of course the full range of real numbers? The function in Figure 10 sure looks as though there's little or no area out past x =±4 , but since the function never quite gets to zero, we still have to include those tiny slivers of area out in the suburbs.
Now, what's the value of the integral? We can find it in a number of ways. If you're feeling adventurous and like to do things from first principles (as I usually do), you can derive the integral yourself. It's fairly easy, but not at all obvious. See how here:
If you still have your book called Tables of Integrals , you can simply look up the answer. Your book is probably not the same as mine — mine was Pierce, printed in 1939.
Or, you can do as I did: Ask Mathcad, who says:
$$\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi} \tag{25}$$
Noting very astutely that $\sqrt{\pi}$ is not the same thing as 1, I see that I must modify Equation 23 to read:
$$p(x) = \frac{1}{\sqrt{\pi}}\,e^{-x^2} \tag{26}$$
In this form, the function has an integral of 1, so it's earned the right to be called a probability distribution function (hence the name change). Note that the central peak of p(x) occurs at x = 0, where its height is clearly:
$$p(0) = \frac{1}{\sqrt{\pi}} \tag{27}$$
On the home stretch
As you'll recall, in building Figure 9 I had to shrink and stretch the N-dice “curves” to force them onto the same x-axis interval (±1) and keep their areas equal. We need to be able to do something similar for p(x). The new multiplying constant takes care of the area constraint, but we still need to be able to scale the x-axis width. I think it's safe to say that we won't always want the width of the central peak to be about ±2 or so. Even if we did, we still need a scale factor on x, because remember, x can (and often does) have units. I'm pretty sure that I don't know how to raise e to the power 1.618 pomegranates².
To take care of this, let's make the change of variables:
I'm sure you must be wondering where that factor of 2 came from. It seems like an unnecessary complexity, added for no good reason. Actually, there is a good reason (even a very good reason), but it won't be apparent until later. For now, just trust me, OK?
Note carefully that it's not enough to just substitute for x in Equation 26 . If we try to just stretch or shrink the horizontal scale, the function will still have the same height, so the area will change. We really need to go back to Equation 24 and evaluate the integral again. Differentiating the last of Equation 28 gives:
Substituting for both x and dx in Equation 24 gives the new integral:
Since we're integrating over the range ±∞, the changes to the exponent don't affect the limits: a finite constant times infinity is still infinity. So the integral still evaluates to $\sqrt{\pi}$, which makes the new area:
And our function now takes the form:
There is one last little tweak to p(x). Sometimes, people need to translate the x-component so that the central peak no longer occurs at x = 0. This isn't so much a problem for us, because when you're dealing with noise, its most likely value will always be zero. But for the sake of completeness, here is the normal distribution function in its most general form.
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{33}$$
As you can see, we now have two parameters we can adjust to match the situation. The constant μ is an additive factor to shift the peak left and right, while σ allows for scaling (and possibly removing the units of) x .
These two parameters have names, and those names — which come from the science of statistics — should be familiar to you. μ is the mean , and σ is the standard deviation . As my last trick for this column, I'll prove to you that these names fit the statistical definitions of these parameters.
Because we had to scale x, we now have a factor of σ in the multiplicative constant. This means, of course, that the height of the central peak will change as we vary σ.
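For readers without Mathcad or a table of integrals, a rough numerical check is possible. The sketch below uses a plain trapezoid rule over a wide but finite interval (an approximation; the true limits are ±∞). The function names and the sample values of μ and σ are mine, chosen only for illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """General normal distribution function with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, steps=20000):
    """Composite trapezoid rule on the interval [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# The unscaled shape exp(-x**2) has area sqrt(pi), not 1.
raw_area = integrate(lambda x: math.exp(-x * x), -10.0, 10.0)
print(f"area under exp(-x^2): {raw_area:.6f}  sqrt(pi): {math.sqrt(math.pi):.6f}")

# The scaled function keeps unit area for any sigma, while the peak height changes.
mu = 3.0  # arbitrary shift of the peak
for sigma in (0.5, 1.0, 2.0):
    area = integrate(lambda x: normal_pdf(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma)
    print(f"sigma={sigma}: area={area:.6f}, peak={normal_pdf(mu, mu, sigma):.4f}")
```

The area stays at 1 while the peak height falls as σ grows, which is the trade-off described above.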
The expectation value
Let's look back for a moment, to the things we were doing with dice. For any number of dice, I showed you the histograms, which can be easily turned into probability distributions using Equation 10. Until now, we've only concerned ourselves with the probabilities of having a certain result, like 2, 12, or 7. But what if the thing we're interested in is not the result itself, but something that depends on it? To stick with the dice-game theme, what if you get, say, $10 every time you roll two dice and get a 5, but only $2 if you roll a 9 (which, you may recall, has the same probability: 1/9)? In that case, it's not enough just to know the probability of getting a certain result from a dice roll; you also need to know what happens when you get that roll. In other words, you need the rules of the game.
To take another example, suppose I buy a $1 lottery ticket, for a pot that's currently worth $300,000,000. What can I expect to get out of the deal? Well, one thing's for sure: It's not the 300 mill, because my likelihood of winning is very, very low.
There's a mathematical term for this concept, and it's the same one the gamblers use. The only difference is that the gamblers were using it several thousand years earlier. The term is called expectation value .
Mathematically, if P is the probability of winning, and v the payout value, then the expectation value of my lottery ticket is:
$$\langle v \rangle = E(v) = P\,v \tag{34}$$
Here I've shown two popular notations for the expectation value. I tend to prefer the angle-bracket notation ⟨..⟩, because it's completely unambiguous. But the E(..) notation seems more popular lately.
The same principle works for games like the dice game, only then we need to compute the average of all possible outcomes. If there are n possible outcomes from a given dice roll, then the expectation value becomes:
$$\langle v \rangle = \sum_{i=1}^{n} P_i\,v_i \tag{35}$$
Now that we see the concept, it's easy enough to extend it to the case of continuous functions. If f(x) represents some function of x (the rules of the game, if you will), then its expectation value is:
$$\langle f(x) \rangle = \int_{-\infty}^{\infty} f(x)\,p(x)\,dx \tag{36}$$
This important integral embodies the central idea of how to deal with random processes.
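Both forms are easy to try out. In the sketch below the payout table is hypothetical (it is not the game described earlier), and the continuous case uses a uniform density on the interval 0 to 1 simply because its answer is easy to check by hand. The integral is approximated with a trapezoid rule.

```python
from itertools import product

# Probability of each two-dice total, from counting ordered pairs.
prob = {}
for a, b in product(range(1, 7), repeat=2):
    prob[a + b] = prob.get(a + b, 0.0) + 1.0 / 36.0

# Hypothetical rules of the game: payout in dollars for selected totals.
payout = {9: 2.0, 12: 36.0}

# Discrete expectation: sum of probability times payout over all outcomes.
expected = sum(p * payout.get(total, 0.0) for total, p in prob.items())
print(f"expected payout per roll: ${expected:.4f}")

def expectation(f, p, a, b, steps=20000):
    """Approximate the integral of f(x) * p(x) over [a, b] (trapezoid rule)."""
    h = (b - a) / steps
    total = 0.5 * (f(a) * p(a) + f(b) * p(b))
    for i in range(1, steps):
        x = a + i * h
        total += f(x) * p(x)
    return total * h

def uniform(x):
    """Illustrative density: uniform on [0, 1]."""
    return 1.0

mean_square = expectation(lambda x: x * x, uniform, 0.0, 1.0)
print(f"<x^2> for uniform [0, 1]: {mean_square:.6f}")
```

The discrete result is 2 × 4/36 + 36 × 1/36, and the continuous one should come out near 1/3.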
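For everything we'll be doing from now on, we'll be using the normal distribution, so we might as well insert it into Equation 36 explicitly, to get:
$$\langle f(x) \rangle = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{\infty} f(x)\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx \tag{37}$$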
Just to emphasize: This definition works for any function f(x), or at least any “well-behaved” function, meaning that it doesn't have any internal infinities. Of course, there's no guarantee that we'll be able to get a closed-form solution; we might have to resort to a numerical method such as Simpson's rule.
Working the statistics
For my last trick, I'd like to show you how to use Equation 37 to compute statistical parameters like mean, variance, and standard deviation. For each of these parameters, we'll be evaluating integrals like Equation 37. Each time, we're going to need fundamental definite integrals like these:
How do we know that these are the right results? Well, you can ask Mathcad, as I did, or consult a table of integrals, as I also did. But if you're curious to see how the derivations play out, you can find the secrets here:
The trick is to square the integral, convert to polar coordinates, and to integrate over r and θ . Slick.
By the way, while you're at the MIT site, you might want to browse their other videos, which are myriad.
Armed with the primitive integrals of Equation 38, we can compute the mean values when f(x) is any power of x. For each case, we'll use the change of variables:
The integral of the distribution function: f(x) = 1
This has to be the simplest function worth looking at. For this case, Equation 37 becomes:
From Equation 38, this is:
This is, of course, the same result that I wrote down in Equation 25. We chose the multiplying constant to force the integral of the distribution function to be unity, so it's hardly surprising that we get the result we were demanding. Still, it's sort of comforting to see that mathematics still works.
The mean: f(x) = x
If your memory is good enough to remember my first column on this subject, you'll recall that we calculated the average, or mean, of a bunch of numbers from the familiar formula:
By analogy, the mean of the function f(x) = x is given by the formula:
Making the usual substitution, we get:
We can split this one into two integrals:
We've already established the value of the first integral in square brackets, in the f(x) = 1 case above. What's more, according to Equation 38, the second integral is equal to zero. So the mean of x is simply:
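The whole chain, from the normal-distribution expectation integral to the result, can be sketched in LaTeX as follows. The substitution y = (x − μ)/σ is an assumption made for this sketch; the column's own change of variables may be scaled differently, though the result is the same.

```latex
\begin{align}
\langle x \rangle
  &= \frac{1}{\sigma\sqrt{2\pi}}
     \int_{-\infty}^{\infty} x\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx \\
  &= \frac{1}{\sqrt{2\pi}}
     \int_{-\infty}^{\infty} (\mu + \sigma y)\, e^{-y^2/2}\, dy
     \qquad (x = \mu + \sigma y,\; dx = \sigma\, dy) \\
  &= \frac{1}{\sqrt{2\pi}} \left[
       \mu \int_{-\infty}^{\infty} e^{-y^2/2}\, dy
       + \sigma \int_{-\infty}^{\infty} y\, e^{-y^2/2}\, dy \right] \\
  &= \frac{1}{\sqrt{2\pi}} \left[ \mu\sqrt{2\pi} + \sigma \cdot 0 \right]
   = \mu
\end{align}
```

The second integral vanishes because its integrand is an odd function integrated over a symmetric interval.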
The variance: f(x) = x²
Finally, recall that the variance of a bunch of numbers was defined to be:
This time, our f(x) is equal to x², and the defining equation becomes:
Note carefully that, as in the variance of a discrete set of measurements, we are calculating the expectation value of the square of x as measured from μ, not from x = 0. This makes sense. We're looking for the variation from the central peak, wherever that is. Making the usual substitutions, we get:
Using the identity in Equation 38, we get, finally:
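As a LaTeX sketch, again assuming the substitution y = (x − μ)/σ (the column's own scaling may differ) and the standard identity that the integral of y² e^{-y²/2} over the whole real line is √(2π):

```latex
\begin{align}
V = \langle (x - \mu)^2 \rangle
  &= \frac{1}{\sigma\sqrt{2\pi}}
     \int_{-\infty}^{\infty} (x - \mu)^2\,
     e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx \\
  &= \frac{\sigma^2}{\sqrt{2\pi}}
     \int_{-\infty}^{\infty} y^2\, e^{-y^2/2}\, dy
     \qquad (x = \mu + \sigma y,\; dx = \sigma\, dy) \\
  &= \frac{\sigma^2}{\sqrt{2\pi}} \cdot \sqrt{2\pi} = \sigma^2
\end{align}
```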
Now we have the results for all three incarnations of f(x). We found that:
So what is the standard deviation? Why, it's what it's always been: the square root of V. The parameter σ, which we introduced just to give us a way of scaling the width of the central peak of the distribution, is in fact the standard deviation.
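These results can also be checked numerically. The following minimal Python sketch evaluates the expectation integral with composite Simpson's rule, the numerical fallback mentioned earlier. It assumes the normalized distribution with the ½ in the exponent. The function names, the test values μ = 3 and σ = 0.5, and the ±10σ integration limits are all assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def simpson(g, a, b, n=2000):
    """Composite Simpson's rule on [a, b] using n subintervals (n is made even)."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = g(a) + g(b)
    for i in range(1, n):
        # Interior points alternate weights 4, 2, 4, ...
        total += (4 if i % 2 else 2) * g(a + i * h)
    return total * h / 3.0

def expect(f, mu, sigma, width=10.0):
    """Approximate <f(x)> by integrating over mu +/- width*sigma (assumed limits)."""
    a, b = mu - width * sigma, mu + width * sigma
    return simpson(lambda x: f(x) * normal_pdf(x, mu, sigma), a, b)

mu, sigma = 3.0, 0.5                                    # arbitrary test values
area = expect(lambda x: 1.0, mu, sigma)                 # should be close to 1
mean = expect(lambda x: x, mu, sigma)                   # should be close to mu
variance = expect(lambda x: (x - mu) ** 2, mu, sigma)   # should be close to sigma**2
std_dev = math.sqrt(variance)

print(area, mean, variance, std_dev)
```

Truncating the infinite range at ±10σ is harmless here because the distribution is vanishingly small that far from the peak.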
Now, at last, you understand the reason for that seemingly unnecessary step of including a factor of ½ in the exponent of the distribution function. If we hadn't put it there, we wouldn't have ended up with σ as the standard deviation.
A view backwards
Well, it's been a long, slow slog, but we've made great strides in defining and understanding the normal distribution. Let's just briefly review what we've done.
I began this column by pointing out that noise is always going to be present in embedded systems, so we need to understand its nature so we can better deal with it. As a way of dipping our toes into the water, I suggested that we look at the most primitive kinds of random processes, which are physical “randomizers” like coin flips, pointer spins, and dice throws. In their most primitive forms, all three kinds of devices have uniform probability distributions, meaning that any one outcome is as likely as any other.
But when we began to look at the statistics of thrown dice, we found that the distributions were no longer uniform, but trended towards a continuous, bell-shaped curve. The only requirement is the usual rule for multiple dice, that the final result is the sum of all the values showing on the various dice.
After some judicious scaling, I developed Figure 9, which shows pretty convincingly that not only do the curves trend toward a limiting case of a continuous curve, but that the curve is, in fact, the normal distribution.
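That trend is easy to reproduce. The Python sketch below builds the exact distribution for the sum of several fair dice and compares it with a normal distribution of matching mean and variance. The function names and the choice of ten dice are assumptions made for illustration; this is not necessarily the scaling used for Figure 9.

```python
import math

def dice_sum_distribution(num_dice, sides=6):
    """Exact probabilities for the sum of num_dice fair dice, as {total: probability}."""
    dist = {0: 1.0}
    for _ in range(num_dice):
        new = {}
        for total, p in dist.items():
            for face in range(1, sides + 1):
                # Each face is equally likely, so spread p evenly over the new totals.
                new[total + face] = new.get(total + face, 0.0) + p / sides
        dist = new
    return dist

def normal_pdf(x, mu, sigma):
    """Normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

num_dice = 10
dist = dice_sum_distribution(num_dice)
mu = 3.5 * num_dice                          # one fair die has mean 3.5
sigma = math.sqrt(num_dice * 35.0 / 12.0)    # one fair die has variance 35/12

# Totals are spaced one unit apart, so each probability can be compared
# directly with the value of the continuous curve at that total.
max_gap = max(abs(p - normal_pdf(t, mu, sigma)) for t, p in dist.items())
print(f"largest gap between dice and bell curve: {max_gap:.5f}")
```

Raising `num_dice` shrinks the gap further, which is the limiting behavior described above.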
Without actually proving it, I suggested that the mathematical form for the distribution should be the one proposed by J. Willard Gibbs: “It's the simplest one we can think of.” From that simple conjecture, I tweaked the distribution with a multiplying constant and a couple of constant parameters, μ and σ.
Finally, I introduced the concept of an expectation value of some function f(x). Then I specialized f(x) to be the first three powers of x: x⁰ = 1, x¹ = x, and x². We found that:
So the parameters introduced into the normal distribution for purposes of scaling turn out to be the mean and standard deviation.
Now that we've looked the normal distribution in the eye, are we done with it? Not by a long shot. For starters, you've probably heard terms like one-sigma, three-sigma, and even six-sigma. From the terms, you can probably guess that they relate to deviations of σ, 3σ, and 6σ from the mean. The implications of these results are profound, because they relate to the reliability of any process affected by random noise.
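As a small preview, the probability that a normally distributed value falls within k standard deviations of the mean is erf(k/√2), a standard result. The Python sketch below tabulates it for k = 1, 3, and 6. The function name `prob_within` is hypothetical, and the numbers are only the raw two-sided probabilities for an ideal normal distribution.

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

def prob_outside(k):
    """Complementary probability, computed with erfc to avoid round-off for large k."""
    return math.erfc(k / math.sqrt(2.0))

for k in (1, 3, 6):
    print(f"{k}-sigma: inside = {prob_within(k):.10f}, outside = {prob_outside(k):.3e}")
```

Using `math.erfc` for the tail avoids subtracting two nearly equal numbers, which matters at six sigma, where the outside probability is on the order of a few parts per billion.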
Lastly, we've only looked at scalar functions f(x). But in most real-world problems, the state of a system is described by multiple scalar values, which we can lump together into a state vector. For such cases, we need to extend the normal distribution, and in particular the variance, to vector/matrix form. In the vector form of the normal distribution, the variance V becomes a matrix, famously known as the covariance matrix. This matrix plays an utterly critical role in the Kalman filter, which is the whole point of this study.
I'll be talking about all these extensions of the normal distribution in my next column.
See you then.