It turns out that Brent was wrong when he combined bisection with parabolic interpolation. There's a better way to converge on a minimum.
The lyrics of Joni Mitchell's song, “Both Sides Now” go, “I've looked at clouds from both sides now … [but] it's clouds' illusions I recall; I really don't know clouds at all.”
Sometimes that's the way I feel about function minimization. Looking at my file directory, I see that we're currently in month 17 of this series; surely some kind of record, and one I'm quite sure that my editors, as well as you readers, are hoping I don't try to extend any further. Though I have talked about a lot of other things, from Good Guys/Bad Guys to Heathkits, minimization has remained the main focus. We've probably spent more time on this thing than Brent did when he developed Brent's method in the first place.
And yet, each month, I seem to discover great truths about the algorithm that I didn't see before. Each month, I think I understand what's going on pretty well. And between columns, I often discover, along with Diamond, that my insights were mostly illusion, that “I really don't know minimization at all.”
This month offered another such breakthrough-perhaps the last one we need. Certain things have been nagging at me for some time, and I've voiced my thoughts about them as they came up. But now, at last, I see not only why they have been nagging at me, but why they are inevitable with the algorithm upon which Brent's method is based. Even further, now that I see the clouds more clearly, I also see what needs to be done to clear the fog. I have some brand new insights as to the behavior of both Brent's method and my own, insights that I don't believe anyone has voiced before, and that have got me pretty excited. I hope you'll catch the excitement, too. I also hope you'll agree, when you see the finished product, that it was well worth the wait.
At least one other person has caught the fever. James M. Moe, of Sohnen-Moe Associates, has been doing some experimenting on his own, and sharing his ideas with me via e-mail. Was it James' thoughts that triggered my own? Could be; doing something like this is always easier when you have other folks to talk to about it. In any case, I'm gratified to see that the same nagging problems that bothered me were bothering James as well. Though our solutions have not always coincided, I want to thank James right now for his insights, and am perfectly comfortable with giving him full credit for any breakthroughs we discover together.
Lately I find myself a heavy user of both Matlab and its companion, Simulink. I like to think I know a few things about software math algorithms; I'm the original Jack of all trades, master of some. But I can't even think about being master of them all. No one can. But a company like Mathworks can hire different masters for each field of expertise, and that's what they've done. So I've had to force myself to swallow my pride, excise the NIH factor, and use Matlab and Simulink without worrying about what's going on inside. The more I use these tools, the better I like the notion of not having to worry.
Wanna see an example? Well, if you wanted to find the minimum of a function, you could wait for this incredibly long series to conclude, and use the minimizer that finally comes out of it. Or, you could write the two lines of Matlab code:
x0 = 0;
x = fminsearch(@f, x0);
Does this code find the minimum of f(x)? Yes, it does, for virtually any f(x) you care to throw at it. Is it fast, accurate, robust, and stable? Yes, it is. How does Matlab do it? Do we care? Do we need to know? Not really. I've come to trust Mathworks to use the best algorithms the world has to offer. Just out of curiosity, I'd like to know what method they use, but I don't really need to. All I need to know is that I get the answer I'm seeking. No muss, no fuss.
All about stepping
Let's get back to the business of minimizing. As you regular readers know, my minimization algorithm starts out seeking the general location of the minimum of a function by stepping along it. Lately I've drawn a lot of flak from readers who think this is an inefficient process. I touched on this subject last month, but I didn't have time to treat it properly, so this month, let me try again, this time with an analogy.
Suppose I have a function:
And I want to find a root of the function (not the minimum). That is, I seek the value of x such that:
A typical function is shown in Figure 1. The root is the point where the curve crosses the x-axis.
How would we go about finding the root, x, given only the description of the function, and nothing about its slope (derivative)? I suppose we could start at a single point, say x0, and just poke around until we find a direction in which f(x) is decreasing. If we go far enough, we might be able to isolate the root. Unfortunately, one can conceive of all kinds of functions, and not very pathological ones at that, where such an approach won't work. Suppose the curve only just missed the x-axis, then started up again, to plunge to a root somewhere else. We might spend a lot of time poking around in an area that's nowhere near the true root.
For this reason, all root-finding algorithms I know of (except those based upon Newton's method, which we'll discuss in another column), begin with two points, x0 and x1, that are known to straddle the root. Given that:
(or vice versa) and certain assurances about the continuity of the function, we can be 100% certain that at least one root lies in the range x0..x1. A good root finder will find it to the limit of machine accuracy-guaranteed. The only burden we impose on the user is to pick values of x0 and x1 such that:
- Inequality 3 is true, and
- Only one root lies between them.
The root-finder will do the rest.
One simple and common method for finding the root is the secant method. In this method, we assume that the curve is a straight line between the points at x0 and x1. The intersection of this line with the x-axis gives us our first estimate of the location of the root. Depending on whether f(x) is positive or negative, we throw out one of the two points, and repeat the process, as shown in Figure 2.
The first straight line, between the points at x0 and x1, yields an estimated root at x2. Evaluating f(x2), we find that this point is too low. The point at x2 is still below the x-axis, so we discard x1, draw another line between x0 and x2, and repeat the process. I think the figure shows two things clearly:
- The process will converge.
- It may take awhile, because the curve is so far from being a straight line.
A little more reflection will show something else: the point at x0 never gets replaced. Every line passes through it. In the end game, the convergence factor will be a ratio that depends on the slope of the line between x0 and x, compared to the true slope of the curve there. If, as I've shown here, the curve has a lot of curvature near x0, the ratio will be relatively low. We will never be able to jump directly to the point at x, but can only hope, at best, to sneak up on it by that constant ratio at each step. In short, the convergence is linear.
This behavior is characteristic of the secant method. Like the Golden Ratio method for finding a minimum, it's sure, but slow.
The point of concern is the need to specify two initial points, x0 and x1, that not only define the initial range of search, but also guarantee the existence of a solution. In the case of a root-finder, we don't consider this starting condition overly burdensome. If the curve of f(x) is fairly simple, like the one shown in the figures, we are justified in choosing a very large range for x0..x1, one that will assure that one and only one root is there, somewhere in between.
With respect to a minimizer, the situation is quite different. The function I've been using as a test case is shown in Figure 3:
Here, even though the function is well-behaved, we have a problem in defining the starting conditions. Remember that to trap a minimum, we need not two points as in the case of a root-finder, but three. These three points must satisfy the now-infamous triplet condition:
It is not enough to specify just the range of search; we need that interior point, now called x1, in order to guarantee a solution.
We, as developers of the minimization tool, could take the coward's way out, and simply dump that burden off onto the user of the algorithm. Indeed, many minimizers (including Brent's) do exactly that. Such an approach is also consistent with the notion that anyone using a minimizer should know the function he wants to minimize, and be able to estimate where the minimum is. That's one approach. Later on, however, we're going to use this minimizer of a function of one variable in larger, multi-dimensional applications, where the mimimizer will be used iteratively inside a larger function. For that kind of use, the caller is not a human being, but another function, and asking it to be intimately familiar with the function being minimized at the moment is a bit much. The only reasonable recourse, it seems to me, is to start with three points in the triplet configuration of Equation 5.
Suppose you are given the function of Figure 3, and are asked to provide that critical, central point. In this case, the problem is relatively easy since every point between x0 and x2 has a lower y-value than they do. Pick one.
Remember, however, that I've only shown a certain range for the function. It does, in fact, extend to infinity in both directions. It's symmetric about x = 0, so there's another, equally valid minimum at x = –0.7937. For that matter, there's an infinity of other minima, all of equal value, further out along the x-axis in both directions. By showing you the function only over the range 0..1, I've already greatly simplified the problem, and given you a huge hint as to what x0 and x2 should be. From here any interior point will do for x1. A good choice (but not without its problems) might be the midpoint between x0 and x2.
Not all functions, however, are so accommodating, and they needn't be pathological cases to do so. Change the test function to:
and you get the curve of Figure 4.
Suddenly, finding that elusive middle point doesn't look so easy. Certainly, a bisection won't produce it. Neither will a golden section bisection, from either direction. So how do you propose to find x1?
See the problem? Trust me, I've thought about this issue for years, perhaps decades. At one point, I thought perhaps I'd just keep bisecting every subsection, recursing down into ever-deeper levels, until a suitable x1 was found. But I ran into problems, not only of logical complexity, but issues of what to do when I find not one suitable x1, but several. How do I decide which one to use? The lowest? The leftmost?
For me, and for most others doing this kind of thing, the only reasonable answer seems to be to search by stepping along the function until the first minimum is trapped. This is, admittedly, not a perfect solution, though it's the one almost always used by other tool-providers. As many have pointed out, it takes time. It also takes a decision as to how big to make the step size (or the number of divisions to create). Make the step size too small, and we waste a lot of time searching for the minimum. Make it too large, and we might step right over it.
When you think about it, if we're going to step along the function, we really don't even need x2, just x0. Begin at x0 and step until the needed condition is found. For that matter, we could do the same with the root-finder of Figure 2. The only function x2 serves (x1 in Figure 2) is to tell us when to stop. If we ever get to the right-most limit without finding a suitable geometry, we must give it up and admit defeat.
You may think that stepping along a function like that of Figure 1 would be crazy; knowing that we have a root trapped between x0 and x1, there is seemingly no point in the stepping procedure. But you'd be wrong. In Figure 1, I've shown a function that has a pretty severe curvature and ends up fairly close to the x-axis. But I could have shown you reasonable, non-pathological cases where the knee in the curve is much more severe, the final slope much less, and the right-hand part of the curve much closer to the x-axis. For such cases, a secant method using the point at x0 would be interminably slow. Stepping to find a good starting point is a much better-even faster-approach.
One final thought: I think many of the critics of the stepping process have missed the point that the number of steps to take is entirely at the discretion of the user. In my test driver, I set the number to 10, but that's a number I pulled out of the air, just to give a reasonable test. Because the minimum of the test function just happens (by design) to be far to the right, the algorithm takes 10 of the 11 possible steps before it encounters a triplet.
I never dreamed, however, that so many people would see this as an indictment of the method. We are by no means tied to 10 segments; you are free to use any number you like. As I've said, in the case of the function of Figure 1, any interior point at all will generate the triplet right away. Set “steps” to 2, and watch the thing start iterating after three function evaluations-the absolute minimum number needed.
Before we leave this root-finder analogy, let's imagine that we not only have the function, f(x), but its derivative. For the function f(x) of interest, the derivative is:
The curve is shown in Figure 5.
Earlier, I pointed out that when it comes to root-finders, if we can ever give a range x0..x1 such that a sign change is known to occur, we have methods that will find that root down to machine precision, guaranteed. The secant method is among the crudest and slowest of such methods, but it is virtually bulletproof. Personally, I use a second-order method that is lightning fast and every bit as bulletproof. I've shared it in these pages before (“Roots of a Function,” August 1994, p. 13). After seven years, we're probably ready for a reprise, and I promise to give you the algorithm again soon.
As you no doubt recall from calculus class, a function has its minimum (or maximum) when the derivative of that function is zero. Note the significance of this statement. It means that a root-finder, operating on the derivative, can find our minimum for us. What's more, it can find it to much better accuracy than a minimizer. That's because, as we've learned the hard way, the minimum gets more and more fuzzy and indistinct as we narrow the range of our search. As I've said many times, every curve with a smooth minimum looks like a parabola near that minimum, and the curve gets flatter and flatter as we narrow down the range. That's why we cannot expect as much accuracy in xmin as we are used to getting from numerical computations. But the curve of the derivative must surely have its maximum slope as it passes through the x-axis, so we have every right to expect to be able to nail down the root accurately.
So why don't we just differentiate the function, find its root, and move on? Answer: if we have an analytical expression for the derivative, that's exactly what we should do.
Press et al, in Numerical Recipes, have this to say: “Yes, we are fuddy-duddies when it comes to making flamboyant use of derivative information in one-dimensional minimization. But we have had a bellyful of functions whose computed 'derivatives' don't integrate up to the function value and don't accurately point the way to the minimum, usually because of roundoff errors, sometimes because of truncation errors in the method of derivative evaluation.”
I have to tell you, I have a problem with this statement. Though they have a point in an earlier statement that the root of the derivative is zero at both maxima and minima, I have trouble envisioning a case where the derivative function says the root is one place, but the minimum is somewhere else. How can this happen?
Understand, I'm no novice to the notions of roundoff errors. They can always occur when we have expressions involving the difference of nearly equal numbers. A good example, from orbit theory, is:
where d < r.="" i="" understand="" that="" the="" results="" can="" lose="" a="" lot="" of="" resolution.="" but="" such="" things="" are="" not="" the="" fault="" of="" the="" minimizer="" or="" root-finder;="" they're="" the="" fault="" of="" the="" analyst="" who="" failed="" to="" recognize="" that="" the="" problem="" is="" ill-conditioned.="" somewhere="" around="" my="" first="" year="" doing="" analysis="" using="" equation="" 8,="" i="" learned="" that="" one="" should="" never="" use="" it="" in="" the="" form="" shown,="" but="" expand="" it="" into="" a="" power="" series="" such="" that="" the="" two="" equal="" terms="" vanish.="" the="" result="" is="" an="" expression="" that="" is="" not="" only="" accurate="" when="" d="" -=""> 0, but is most accurate there.
I also don't understand the comment concerning truncation error. Truncation error involves the use of, say, a polynomial in place of an infinite series. In short, it's a consequence of approximating an expression.
If the authors of Numerical Recipes are saying that one shouldn't seek the minimum of a function by trying to approximate its derivative, then seeking the root of that function, I couldn't agree more heartily. But I can't imagine trying to do such a thing, and I suggest you don't try it either. If, on the other hand, you have a perfectly good, exact, analytical expression for the derivative, such as we have in Equation 7, by all means use it. Just remember that, as Press et al. warn, every root doesn't imply a minimum. You could be converging on a maximum instead.
Where do we stand?
This month, I don't have more code for you. Instead, I've spent the month studying the problem analytically, using Mathcad. As a result of that analysis, I have a stunning new insight that I think will blow your socks off. I think it safe to say that the end result will also blow Brent's method into the weeds. Before sharing it, though, let's review the bidding, and remind ourselves of the history of this study. For those readers just now joining the discussion, this should serve as a welcome and much-needed review.
We are trying to develop a robust algorithm that will find the minimum of a function like f(x) of Equation 4. That is, we are looking for the value xmin such that f(x), for any value other than xmin, will yield a larger value for y = f(x). From Equation 4, we can solve analytically for the true minimum. It turns out to be:
We began by trying the most simple-minded method I could think of, which was simply to evaluate f(x) at several points along the function (steps = 100,000 or more) and returning the value that leads to the smallest f(x). Though crude beyond belief, this approach is always an option in a pinch.
Next, we tried methods associated with bisection. The general idea is that given the triplet condition of Equation 5, we bisect one of the two intervals. If the resulting f(x) is smaller than the previous midpoint, we have a new triplet with reduced range. If it's larger, we can replace one of the bounding points and, again, reduce the range. Either way, the range is reduced, with the minimum trapped in between. The process ends when the range x0..x2 is sufficiently small.
We then looked at the so-called Golden Section method, which is really a bisection method, with a specific ratio to define the next value of x. Its advantage over equal-sized bisection is a fractionally faster convergence rate.
The bisection methods are, like the tortoise, slow but sure. We can be sure of finding the minimum eventually, but it may take awhile. One advantage of these methods is that the criterion for convergence is simple: when the interval x0..x2 gets sufficiently small, we declare victory, and return the middle point of the triplet.
All bisection methods are linear, in the sense that the error at each step decreases by a constant factor. In true bisection, the factor is 1.5. With the Golden Section, it's 1.618. To get, say, six digits of accuracy, we will need:
iterations with even bisection, or:
Not much difference between the two, but every little bit helps.
To get faster convergence, for both root-finding and minimizing, we must abandon linear methods and turn to higher-order, “superlinear” methods. Usually, this means second order. In its essence, such a method involves fitting a quadratic curve through three points, and using that to estimate the root/minimum. Such methods converge much faster, typically doubling the number of significant digits at each step.
This is where things began to get sticky. The industry-standard, superlinear method seems to be Brent's method. When the time came to give you the superlinear method, I intended to give you a version of Brent's method, which Numerical Recipes describes. However, when I began looking at the method, I found many things I didn't like about it, not the least of which was that it was a C function copied, essentially verbatim, from a Fortran II subroutine, which was in turn copied from Brent's original Algol code. Explanations as to how the method worked ranged from vague to nonexistent. This seems to be one of those cases where one is supposed to abandon curiosity and have faith. However, since part of my job is to not only give you algorithms, but explain how they work and why we use them, I was stuck.
While searching for explanations, I proceeded to experiment with the method, and I came across some things I didn't like very much. These ranged from the convergence criterion to the rules concerning when to switch methods.
Brent's method combines both a superlinear method (parabolic fit) with the Golden Section approach. It switches back and forth between them according to a fairly complex set of rules. I didn't like some of the behavior I saw, so somewhere along the line I abandoned the notion of copying Brent's method, and chose instead to offer my own variation with somewhat different rules. For starters, Brent's method uses an absolute error in x as the main convergence criterion, whereas I'm convinced that a relative error, based upon the initial x0..x2 range, is more appropriate. I'm further convinced that, to ensure against false indications of convergence, we should be monitoring changes in y as well as x.
I set out to provide code to do what I thought should be done, trying to adhere, as best I could, to the coding style and structure of the original Brent algorithm. After quite a bit of floundering, I realized that I was ending up with spaghetti code in the best Fortran II tradition. I eventually realized that part of the problem was that Brent's code was designed to be fast, not understandable. While that criterion may indeed be the correct one for production use, it's not appropriate for this column, where understandability is important. After a bit of trial and error, I arrived at a state-machine approach with which I'm very pleased. I also decided to retain not just the last three data points, but the last five. This particular decision may turn out to be a severe case of overkill, but it does give us some advantages otherwise unattainable, such as being able to handle an absolute-value sort of function, with a sharp minimum.
My last code offering, which is available on the ESP Web site (www.embedded.com/code.htm), implements the state machine, along with probably more minimum-estimation algorithms than we'll ever use. So far, it's still light on logic; I expect to add more criteria for switching methods, to protect against ill-conditioned situations. I also have not yet implemented the production algorithm for convergence; the one there now is a simple test on the range x0..x2. This test works fine for bisection methods, but is rotten for quadratic fits, as we shall see.
When I tested the algorithm, I was a bit surprised to see that it didn't converge very rapidly, using 36 function evaluations (26 iterations, plus the stepping procedure). In fact, I found that forcing an alternation between parabolic and bisection steps actually reduced the number of iterations, from 26 to 16. This was still far more iterations than I'd hoped for.
At first, I thought I must have made a mistake in the code, that the parabolic interpolation wasn't working. This turned out not to be a problem. I did have an error, thanks to a last-minute “innovation,” in the estimated y-value, but since I didn't use it anyway, that was not a problem either.
Enter James Roe, who pointed out the real problem, which was nothing more than my simple-minded, temporary convergence criterion. As James pointed out, the iteration itself had converged long ago; it's just that the software hadn't realized it. The x0..x2 range is simply not a good measure of convergence when using parabolic interpolation.
This brings us to a key problem with Brent's method. It's one that I'd noticed earlier and tried to fix by forcing bisection steps. Brent, I believe, noticed the same thing before us, and tried the same fix: it is the tendency of this algorithm to hang onto outlying end points.
Take another look at the secant method of Figure 2 and notice again that the left endpoint, x0, never moves. We always converge from one direction, moving only the right point inwards until we converge. This behavior is characteristic of the method. Regardless of the shape and complexity of the function, in the end game the method will always settle on one or the other end point and use it for a pivot until it converges.
Parabolic interpolation works the same way. Though we always fit the data with a parabola, the actual curve tends to be flattened somewhat in one direction or the other. In the case of our test function in Figure 2, the slope is steeper on one side of the minimum than the other, and a parabolic fit will always estimate the mimimum to the left of its true value. The algorithm tends to hang onto the x2 points, and the triplet becomes more and more unbalanced.
Let's not lose sight of the algorithm: we are going to take three points, defined by x0, x1, and x2, and fit a parabola through them. We are going to use this parabola to estimate the location of the minimum. If this were a perfect world, and we had floating-point numbers with infinite precision, any three points satisfying the triplet condition would do.
However, this is not, you may have noticed, a perfect world. The equations in the computation for the minimum have terms involving slopes, like:
When x1 is almost exactly equal to x0, we lose too much precision, and our algorithm turns to trash. And, as it turns out, this condition is unavoidable.
Why? Well, think about it. Suppose you had a function that was, in fact, exactly a parabola. Beginning with three nicely separated points as a triplet, you compute the minimum, and nail it on the first try. Now you use this new point to fit a second parabola. Would you be surprised to learn that the two parabolas are identical, and predict the same minimum? Hopefully, at this point both you and your algorithm are smart enough to say, “Minimum found, convergence complete,” because you have no way of taking the next step. Since both steps predict the same point, which one are you going to replace for a third iteration?
Now, suppose that the curve really isn't a parabola, but is shaped almost exactly like one. Would you be surprised to learn that two iterations predict almost exactly the same minimum? Hardly. The surprise would be if they didn't.
Here's where it gets interesting: the two values are not really quite equal, so one produces a y-value just a bit higher than the other. According to the rules of parabolic interpolation, we must replace one of the bounding points and use the two nearly equal ones. Bingo! We're suddenly at Equation 12, with nearly equal points and, therefore, rotten accuracy. As the iteration progresses, our triplet tends to get more and more unbalanced, with two points very close together and the third way out in left field.
I noticed this tendency, and reported it to you early on. James Roe has noticed it, also, and it bothers him as much as me. Brent no doubt noticed it too, and that's why he forced bisection steps to try to bring more balance to the triplet. However, I'm here to tell you that this approach simply will not work.
Here's what Numerical Recipes has to say: “A typical ending configuration for Brent's method is that a and b [our x0 and x2] are 2*x*tol apart [tol is the convergence tolerance], with x (the best abscissa) at the midpoint of a and b, and therefore fractionally accurate to tol.”
At this point, I'm prepared to stick my neck out and claim that this statement is patently false. I can't r