It turns out that Brent was wrong when he
combined bisection with parabolic interpolation. There's a better way to converge on a minimum.
The lyrics of Joni Mitchell's song, "Both Sides Now" go, "I've looked at clouds from both sides now ... [but] it's clouds' illusions I recall; I really don't know clouds at all."
Sometimes that's the way I feel about function minimization. Looking at my file directory, I see that we're currently in month 17 of this series; surely some kind of record, and one I'm quite sure that my editors, as well as you readers, are hoping I don't try to extend any further. Though I have talked about a lot of other things, from Good Guys/Bad Guys to Heathkits, minimization has remained the main focus. We've probably spent more time on this thing than Brent did when he developed Brent's method in the first place.
And yet, each month, I seem to discover great truths about the algorithm that I didn't see before. Each month, I think I understand what's going on pretty well. And between columns, I often discover, along with Diamond, that my insights were mostly illusion, that "I really don't know minimization at all."
This month offered another such breakthrough-perhaps the last one we need. Certain things have been nagging at me for some time, and I've voiced my thoughts about them as they came up. But now, at last, I see not only why they have been nagging at me, but why they are inevitable with the algorithm upon which Brent's method is based. Even further, now that I see the clouds more clearly, I also see what needs to be done to clear the fog. I have some brand new insights as to the behavior of both Brent's method and my own, insights that I don't believe anyone has voiced before, and that have got me pretty excited. I hope you'll catch the excitement, too. I also hope you'll agree, when you see the finished product, that it was well worth the wait.
At least one other person has caught the fever. James M. Moe, of Sohnen-Moe Associates, has been doing some experimenting on his own, and sharing his ideas with me via e-mail. Was it James' thoughts that triggered my own? Could be; doing something like this is always easier when you have other folks to talk to about it. In any case, I'm gratified to see that the same nagging problems that bothered me were bothering James as well. Though our solutions have not always coincided, I want to thank James right now for his insights, and am perfectly comfortable with giving him full credit for any breakthroughs we discover together.
Mathworks
Lately I find myself a heavy user of both Matlab and its companion, Simulink. I like to think I know a few things about software math algorithms; I'm the original Jack of all trades, master of some. But I can't even think about being master of them all. No one can. But a company like Mathworks can hire different masters for each field of expertise, and that's what they've done. So I've had to force myself to swallow my pride, excise the NIH factor, and use Matlab and Simulink without worrying about what's going on inside. The more I use these tools, the better I like the notion of not having to worry.
Wanna see an example? Well, if you wanted to find the minimum of a function, you could wait for this incredibly long series to conclude, and use the minimizer that finally comes out of it. Or, you could write the two lines of Matlab code:
x0 = 0;
x = fminsearch(@f, x0);
Does this code find the minimum of f(x)? Yes, it does, for virtually any f(x) you care to throw at it. Is it fast, accurate, robust, and stable? Yes, it is. How does Matlab do it? Do we care? Do we need to know? Not really. I've come to trust Mathworks to use the best algorithms the world has to offer. Just out of curiosity, I'd like to know what method they use, but I don't really need to. All I need to know is that I get the answer I'm seeking. No muss, no fuss.
All about stepping
Let's get back to the business of minimizing. As you regular readers know, my minimization algorithm starts out seeking the general location of the minimum of a function by stepping along it. Lately I've drawn a lot of flak from readers who think this is an inefficient process. I touched on this subject last month, but I didn't have time to treat it properly, so this month, let me try again, this time with an analogy.
Suppose I have a function:
(1)
And I want to find a root of the function (not the minimum). That is, I seek the value of x such that:
(2)
A typical function is shown in Figure 1. The root is the point where the curve crosses the x-axis.
How would we go about finding the root, x, given only the description of the function, and nothing about its slope (derivative)? I suppose we could start at a single point, say x0, and just poke around until we find a direction in which f(x) is decreasing. If we go far enough, we might be able to isolate the root. Unfortunately, one can conceive of all kinds of functions, and not very pathological ones at that, where such an approach won't work. Suppose the curve only just missed the x-axis, then started up again, to plunge to a root somewhere else. We might spend a lot of time poking around in an area that's nowhere near the true root.
For this reason, all root-finding algorithms I know of (except those based upon Newton's method, which we'll discuss in another column), begin with two points, x0 and x1, that are known to straddle the root. Given that:
(3)
(or vice versa) and certain assurances about the continuity of the function, we can be 100% certain that at least one root lies in the range x0..x1. A good root finder will find it to the limit of machine accuracy-guaranteed. The only burden we impose on the user is to pick values of x0 and x1 such that:
- Inequality 3 is true, and
- Only one root lies between them.
The root-finder will do the rest.
One simple and common method for finding the root is the secant method. In this method, we assume that the curve is a straight line between the points at x0 and x1. The intersection of this line with the x-axis gives us our first estimate of the location of the root. Depending on whether f(x) is positive or negative, we throw out one of the two points, and repeat the process, as shown in Figure 2.
The first straight line, between the points at x0 and x1, yields an estimated root at x2. Evaluating f(x2), we find that this point is too low. The point at x2 is still below the x-axis, so we discard x1, draw another line between x0 and x2, and repeat the process. I think the figure shows two things clearly:
- The process will converge.
- It may take awhile, because the curve is so far from being a straight line.
A little more reflection will show something else: the point at x0 never gets replaced. Every line passes through it. In the end game, the convergence factor will be a ratio that depends on the slope of the line between x0 and x, compared to the true slope of the curve there. If, as I've shown here, the curve has a lot of curvature near x0, the ratio will be relatively low. We will never be able to jump directly to the point at x, but can only hope, at best, to sneak up on it by that constant ratio at each step. In short, the convergence is linear.
This behavior is characteristic of the secant method. Like the Golden Ratio method for finding a minimum, it's sure, but slow.
The point of concern is the need to specify two initial points, x0 and x1, that not only define the initial range of search, but also guarantee the existence of a solution. In the case of a root-finder, we don't consider this starting condition overly burdensome. If the curve of f(x) is fairly simple, like the one shown in the figures, we are justified in choosing a very large range for x0..x1, one that will assure that one and only one root is there, somewhere in between.
With respect to a minimizer, the situation is quite different. The function I've been using as a test case is shown in Figure 3:
(4)
Here, even though the function is well-behaved, we have a problem in defining the starting conditions. Remember that to trap a minimum, we need not two points as in the case of a root-finder, but three. These three points must satisfy the now-infamous triplet condition:
(5)
It is not enough to specify just the range of search; we need that interior point, now called x1, in order to guarantee a solution.
We, as developers of the minimization tool, could take the coward's way out, and simply dump that burden off onto the user of the algorithm. Indeed, many minimizers (including Brent's) do exactly that. Such an approach is also consistent with the notion that anyone using a minimizer should know the function he wants to minimize, and be able to estimate where the minimum is. That's one approach. Later on, however, we're going to use this minimizer of a function of one variable in larger, multi-dimensional applications, where the mimimizer will be used iteratively inside a larger function. For that kind of use, the caller is not a human being, but another function, and asking it to be intimately familiar with the function being minimized at the moment is a bit much. The only reasonable recourse, it seems to me, is to start with three points in the triplet configuration of Equation 5.
Suppose you are given the function of Figure 3, and are asked to provide that critical, central point. In this case, the problem is relatively easy since every point between x0 and x2 has a lower y-value than they do. Pick one.
Remember, however, that I've only shown a certain range for the function. It does, in fact, extend to infinity in both directions. It's symmetric about x = 0, so there's another, equally valid minimum at x = --0.7937. For that matter, there's an infinity of other minima, all of equal value, further out along the x-axis in both directions. By showing you the function only over the range 0..1, I've already greatly simplified the problem, and given you a huge hint as to what x0 and x2 should be. From here any interior point will do for x1. A good choice (but not without its problems) might be the midpoint between x0 and x2.
Not all functions, however, are so accommodating, and they needn't be pathological cases to do so. Change the test function to:
(6)
and you get the curve of Figure 4.
Suddenly, finding that elusive middle point doesn't look so easy. Certainly, a bisection won't produce it. Neither will a golden section bisection, from either direction. So how do you propose to find x1?
See the problem? Trust me, I've thought about this issue for years, perhaps decades. At one point, I thought perhaps I'd just keep bisecting every subsection, recursing down into ever-deeper levels, until a suitable x1 was found. But I ran into problems, not only of logical complexity, but issues of what to do when I find not one suitable x1, but several. How do I decide which one to use? The lowest? The leftmost?
For me, and for most others doing this kind of thing, the only reasonable answer seems to be to search by stepping along the function until the first minimum is trapped. This is, admittedly, not a perfect solution, though it's the one almost always used by other tool-providers. As many have pointed out, it takes time. It also takes a decision as to how big to make the step size (or the number of divisions to create). Make the step size too small, and we waste a lot of time searching for the minimum. Make it too large, and we might step right over it.
When you think about it, if we're going to step along the function, we really don't even need x2, just x0. Begin at x0 and step until the needed condition is found. For that matter, we could do the same with the root-finder of Figure 2. The only function x2 serves (x1 in Figure 2) is to tell us when to stop. If we ever get to the right-most limit without finding a suitable geometry, we must give it up and admit defeat.
You may think that stepping along a function like that of Figure 1 would be crazy; knowing that we have a root trapped between x0 and x1, there is seemingly no point in the stepping procedure. But you'd be wrong. In Figure 1, I've shown a function that has a pretty severe curvature and ends up fairly close to the x-axis. But I could have shown you reasonable, non-pathological cases where the knee in the curve is much more severe, the final slope much less, and the right-hand part of the curve much closer to the x-axis. For such cases, a secant method using the point at x0 would be interminably slow. Stepping to find a good starting point is a much better-even faster-approach.
One final thought: I think many of the critics of the stepping process have missed the point that the number of steps to take is entirely at the discretion of the user. In my test driver, I set the number to 10, but that's a number I pulled out of the air, just to give a reasonable test. Because the minimum of the test function just happens (by design) to be far to the right, the algorithm takes 10 of the 11 possible steps before it encounters a triplet.
I never dreamed, however, that so many people would see this as an indictment of the method. We are by no means tied to 10 segments; you are free to use any number you like. As I've said, in the case of the function of Figure 1, any interior point at all will generate the triplet right away. Set "steps" to 2, and watch the thing start iterating after three function evaluations-the absolute minimum number needed.
Differentiate?
Before we leave this root-finder analogy, let's imagine that we not only have the function, f(x), but its derivative. For the function f(x) of interest, the derivative is:
(7)
The curve is shown in Figure 5.
Earlier, I pointed out that when it comes to root-finders, if we can ever give a range x0..x1 such that a sign change is known to occur, we have methods that will find that root down to machine precision, guaranteed. The secant method is among the crudest and slowest of such methods, but it is virtually bulletproof. Personally, I use a second-order method that is lightning fast and every bit as bulletproof. I've shared it in these pages before ("Roots of a Function," August 1994, p. 13). After seven years, we're probably ready for a reprise, and I promise to give you the algorithm again soon.
As you no doubt recall from calculus class, a function has its minimum (or maximum) when the derivative of that function is zero. Note the significance of this statement. It means that a root-finder, operating on the derivative, can find our minimum for us. What's more, it can find it to much better accuracy than a minimizer. That's because, as we've learned the hard way, the minimum gets more and more fuzzy and indistinct as we narrow the range of our search. As I've said many times, every curve with a smooth minimum looks like a parabola near that minimum, and the curve gets flatter and flatter as we narrow down the range. That's why we cannot expect as much accuracy in xmin as we are used to getting from numerical computations. But the curve of the derivative must surely have its maximum slope as it passes through the x-axis, so we have every right to expect to be able to nail down the root accurately.
So why don't we just differentiate the function, find its root, and move on? Answer: if we have an analytical expression for the derivative, that's exactly what we should do.
Press et al, in Numerical Recipes, have this to say: "Yes, we are fuddy-duddies when it comes to making flamboyant use of derivative information in one-dimensional minimization. But we have had a bellyful of functions whose computed 'derivatives' don't integrate up to the function value and don't accurately point the way to the minimum, usually because of roundoff errors, sometimes because of truncation errors in the method of derivative evaluation."
I have to tell you, I have a problem with this statement. Though they have a point in an earlier statement that the root of the derivative is zero at both maxima and minima, I have trouble envisioning a case where the derivative function says the root is one place, but the minimum is somewhere else. How can this happen?
Understand, I'm no novice to the notions of roundoff errors. They can always occur when we have expressions involving the difference of nearly equal numbers. A good example, from orbit theory, is:
(8)
where d << r. I understand that the results can lose a lot of resolution. But such things are not the fault of the minimizer or root-finder; they're the fault of the analyst who failed to recognize that the problem is ill-conditioned. Somewhere around my first year doing analysis using Equation 8, I learned that one should never use it in the form shown, but expand it into a power series such that the two equal terms vanish. The result is an expression that is not only accurate when d -> 0, but is most accurate there.
I also don't understand the comment concerning truncation error. Truncation error involves the use of, say, a polynomial in place of an infinite series. In short, it's a consequence of approximating an expression.
If the authors of Numerical Recipes are saying that one shouldn't seek the minimum of a function by trying to approximate its derivative, then seeking the root of that function, I couldn't agree more heartily. But I can't imagine trying to do such a thing, and I suggest you don't try it either. If, on the other hand, you have a perfectly good, exact, analytical expression for the derivative, such as we have in Equation 7, by all means use it. Just remember that, as Press et al. warn, every root doesn't imply a minimum. You could be converging on a maximum instead.
Where do we stand?
This month, I don't have more code for you. Instead, I've spent the month studying the problem analytically, using Mathcad. As a result of that analysis, I have a stunning new insight that I think will blow your socks off. I think it safe to say that the end result will also blow Brent's method into the weeds. Before sharing it, though, let's review the bidding, and remind ourselves of the history of this study. For those readers just now joining the discussion, this should serve as a welcome and much-needed review.
We are trying to develop a robust algorithm that will find the minimum of a function like f(x) of Equation 4. That is, we are looking for the value xmin such that f(x), for any value other than xmin, will yield a larger value for y = f(x). From Equation 4, we can solve analytically for the true minimum. It turns out to be:
(9)
We began by trying the most simple-minded method I could think of, which was simply to evaluate f(x) at several points along the function (steps = 100,000 or more) and returning the value that leads to the smallest f(x). Though crude beyond belief, this approach is always an option in a pinch.
Next, we tried methods associated with bisection. The general idea is that given the triplet condition of Equation 5, we bisect one of the two intervals. If the resulting f(x) is smaller than the previous midpoint, we have a new triplet with reduced range. If it's larger, we can replace one of the bounding points and, again, reduce the range. Either way, the range is reduced, with the minimum trapped in between. The process ends when the range x0..x2 is sufficiently small.
We then looked at the so-called Golden Section method, which is really a bisection method, with a specific ratio to define the next value of x. Its advantage over equal-sized bisection is a fractionally faster convergence rate.
The bisection methods are, like the tortoise, slow but sure. We can be sure of finding the minimum eventually, but it may take awhile. One advantage of these methods is that the criterion for convergence is simple: when the interval x0..x2 gets sufficiently small, we declare victory, and return the middle point of the triplet.
All bisection methods are linear, in the sense that the error at each step decreases by a constant factor. In true bisection, the factor is 1.5. With the Golden Section, it's 1.618. To get, say, six digits of accuracy, we will need:
(10)
iterations with even bisection, or:
(11)
Not much difference between the two, but every little bit helps.
To get faster convergence, for both root-finding and minimizing, we must abandon linear methods and turn to higher-order, "superlinear" methods. Usually, this means second order. In its essence, such a method involves fitting a quadratic curve through three points, and using that to estimate the root/minimum. Such methods converge much faster, typically doubling the number of significant digits at each step.
This is where things began to get sticky. The industry-standard, superlinear method seems to be Brent's method. When the time came to give you the superlinear method, I intended to give you a version of Brent's method, which Numerical Recipes describes. However, when I began looking at the method, I found many things I didn't like about it, not the least of which was that it was a C function copied, essentially verbatim, from a Fortran II subroutine, which was in turn copied from Brent's original Algol code. Explanations as to how the method worked ranged from vague to nonexistent. This seems to be one of those cases where one is supposed to abandon curiosity and have faith. However, since part of my job is to not only give you algorithms, but explain how they work and why we use them, I was stuck.
While searching for explanations, I proceeded to experiment with the method, and I came across some things I didn't like very much. These ranged from the convergence criterion to the rules concerning when to switch methods.
Brent's method combines both a superlinear method (parabolic fit) with the Golden Section approach. It switches back and forth between them according to a fairly complex set of rules. I didn't like some of the behavior I saw, so somewhere along the line I abandoned the notion of copying Brent's method, and chose instead to offer my own variation with somewhat different rules. For starters, Brent's method uses an absolute error in x as the main convergence criterion, whereas I'm convinced that a relative error, based upon the initial x0..x2 range, is more appropriate. I'm further convinced that, to ensure against false indications of convergence, we should be monitoring changes in y as well as x.
I set out to provide code to do what I thought should be done, trying to adhere, as best I could, to the coding style and structure of the original Brent algorithm. After quite a bit of floundering, I realized that I was ending up with spaghetti code in the best Fortran II tradition. I eventually realized that part of the problem was that Brent's code was designed to be fast, not understandable. While that criterion may indeed be the correct one for production use, it's not appropriate for this column, where understandability is important. After a bit of trial and error, I arrived at a state-machine approach with which I'm very pleased. I also decided to retain not just the last three data points, but the last five. This particular decision may turn out to be a severe case of overkill, but it does give us some advantages otherwise unattainable, such as being able to handle an absolute-value sort of function, with a sharp minimum.
My last code offering, which is available on the ESP Web site (www.embedded.com/code.htm), implements the state machine, along with probably more minimum-estimation algorithms than we'll ever use. So far, it's still light on logic; I expect to add more criteria for switching methods, to protect against ill-conditioned situations. I also have not yet implemented the production algorithm for convergence; the one there now is a simple test on the range x0..x2. This test works fine for bisection methods, but is rotten for quadratic fits, as we shall see.
When I tested the algorithm, I was a bit surprised to see that it didn't converge very rapidly, using 36 function evaluations (26 iterations, plus the stepping procedure). In fact, I found that forcing an alternation between parabolic and bisection steps actually reduced the number of iterations, from 26 to 16. This was still far more iterations than I'd hoped for.
New insights
At first, I thought I must have made a mistake in the code, that the parabolic interpolation wasn't working. This turned out not to be a problem. I did have an error, thanks to a last-minute "innovation," in the estimated y-value, but since I didn't use it anyway, that was not a problem either.
Enter James Roe, who pointed out the real problem, which was nothing more than my simple-minded, temporary convergence criterion. As James pointed out, the iteration itself had converged long ago; it's just that the software hadn't realized it. The x0..x2 range is simply not a good measure of convergence when using parabolic interpolation.
This brings us to a key problem with Brent's method. It's one that I'd noticed earlier and tried to fix by forcing bisection steps. Brent, I believe, noticed the same thing before us, and tried the same fix: it is the tendency of this algorithm to hang onto outlying end points.
Take another look at the secant method of Figure 2 and notice again that the left endpoint, x0, never moves. We always converge from one direction, moving only the right point inwards until we converge. This behavior is characteristic of the method. Regardless of the shape and complexity of the function, in the end game the method will always settle on one or the other end point and use it for a pivot until it converges.
Parabolic interpolation works the same way. Though we always fit the data with a parabola, the actual curve tends to be flattened somewhat in one direction or the other. In the case of our test function in Figure 2, the slope is steeper on one side of the minimum than the other, and a parabolic fit will always estimate the mimimum to the left of its true value. The algorithm tends to hang onto the x2 points, and the triplet becomes more and more unbalanced.
Let's not lose sight of the algorithm: we are going to take three points, defined by x0, x1, and x2, and fit a parabola through them. We are going to use this parabola to estimate the location of the minimum. If this were a perfect world, and we had floating-point numbers with infinite precision, any three points satisfying the triplet condition would do.
However, this is not, you may have noticed, a perfect world. The equations in the computation for the minimum have terms involving slopes, like:
(12)
When x1 is almost exactly equal to x0, we lose too much precision, and our algorithm turns to trash. And, as it turns out, this condition is unavoidable.
Why? Well, think about it. Suppose you had a function that was, in fact, exactly a parabola. Beginning with three nicely separated points as a triplet, you compute the minimum, and nail it on the first try. Now you use this new point to fit a second parabola. Would you be surprised to learn that the two parabolas are identical, and predict the same minimum? Hopefully, at this point both you and your algorithm are smart enough to say, "Minimum found, convergence complete," because you have no way of taking the next step. Since both steps predict the same point, which one are you going to replace for a third iteration?
Now, suppose that the curve really isn't a parabola, but is shaped almost exactly like one. Would you be surprised to learn that two iterations predict almost exactly the same minimum? Hardly. The surprise would be if they didn't.
Here's where it gets interesting: the two values are not really quite equal, so one produces a y-value just a bit higher than the other. According to the rules of parabolic interpolation, we must replace one of the bounding points and use the two nearly equal ones. Bingo! We're suddenly at Equation 12, with nearly equal points and, therefore, rotten accuracy. As the iteration progresses, our triplet tends to get more and more unbalanced, with two points very close together and the third way out in left field.
I noticed this tendency, and reported it to you early on. James Roe has noticed it, also, and it bothers him as much as me. Brent no doubt noticed it too, and that's why he forced bisection steps to try to bring more balance to the triplet. However, I'm here to tell you that this approach simply will not work.
Here's what Numerical Recipes has to say: "A typical ending configuration for Brent's method is that a and b [our x0 and x2] are 2*x*tol apart [tol is the convergence tolerance], with x (the best abscissa) at the midpoint of a and b, and therefore fractionally accurate to tol."
At this point, I'm prepared to stick my neck out and claim that this statement is patently false. I can't reproduce the condition they describe, no matter how hard I try, and I challenge anyone else to do so. I think I can show, quite easily, that it's not possible.
Consider: because parabolic interpolation is superlinear, it's going to home in on the true solution very quickly. Bisection, whether by golden section or otherwise, is a linear method. One method reduces the range by a factor an. The other converges on a solution at the rate an2. Guess which one wins? No matter how hard we try, the golden ratio bisections cannot reduce the range fast enough to keep our triplets from becoming more and more unbalanced.
James Roe suggests that, though he feels as uncomfortable as I about using unbalanced triplets, that we accept them anyway and see what happens. I did that in a Mathcad analysis, and got convergence to a very high degree of accuracy in only seven iterations. Can we say "seven is fewer than sixteen"?
The telltale problem, however, comes when one plots the x0..x2 range and the solution error on the same graph. I've done that in Figure 6.
Look at what happens to the two curves. The error in the iteration goes plunging from 0.01 to an astonishing 10-11-far more accurate than we have any right to hope for, considering the indefinite bottom of any smooth minimum. But the range remains stuck at 10-3. No wonder my simple-minded error criterion wasn't convinced of convergence.
In Figure 7, I've plotted the relative distance (relative to current range) between the middle point, x1, and the two end points. After some dancing around, the rightmost interval goes to essentially 1 and stays there, while the leftmost one drops to 10-5. In other words, the distance between x0 and x1 is 0.00001 of the total range. Needless to say, this does not bode well for numerical accuracy. The surprise is that the algorithm converges as well as it does.
Brent worried about this effect also, and forced a bisection whenever the solution points get too close together. But, as we've seen, this approach does not, and cannot, work. The bisection can't reduce the interval fast enough to do any good. I tried the same thing, though I used a much more stringent criterion to force bisection: 3% of total range, rather than one step of machine precision. The method does not work; once two solution points get together, they remain stuck for the remainder of the solution.
And why shouldn't they? Remember that one of the bounding points for the triplet (the left-most one, x0, for our f(x)) is the best estimate of the minimum, from the previous step. These points are supposed to get closer and closer-that's what the term "convergence" means. The challenge is to find a way to maintain a well-conditioned triplet as they do so. To do that, it should be obvious by now that we must find a way to move the other range point inwards.
A solution
Considering the popularity of Brent's method, the ultimate conclusion about its efficacy is a shocking one: the combination of bisection and parabolic interpolation does not work.
Well, okay, that may be a bit of hyperbole. It obviously works to some extent, since researchers all over the world are using Brent's method. Maybe I should soften the comment to "it doesn't work well." Only one fix is possible: don't use bisection. It is simply not able to keep up with parabolic interpolation.
Unfortunately, Roe's solution, to use parabolic interpolation exclusively, also doesn't work, and he agrees that it's a risky move. This tendency for two of the triplet points to coalesce and the numerical inaccuracy that is sure to follow create too serious a problem to brush aside. We must find a way to reduce the imbalance.
The secret, I think, is to be found in comparing the predicted and actual values of f(x) after an interpolation step. Because the curve fit produces the equation for a parabola, we can not only compute the x-value at the minimum, but also its corresponding y-value. I do that in the code I've already given, though it's incorrect thanks to a dumb error.
Some months ago, again while tinkering in Mathcad, I came across a case where the algorithm predicted an improvement in the minimum (as it always should) but the value of f(x) got larger instead. The dx step called for by the algorithm made things worse, not better.
What this means, really, is that the slope of the function near that point was opposite what the parabolic fit expected. To recover, I tried what seemed to be a simple fix: stepping in the other direction. I reported this approach to you a while back, and included it in my code. The action, for those following the code, is called "flip."
That code was in the algorithm that I tested last month, but I was a little surprised to find that it didn't help. The condition triggering the flip never happened.
Looking at the process again, in Mathcad, I can see why: even though the actual improvement in y is not as predicted by the parabolic interpolation, it's in the right direction. In other words, the slope has the sign we expected.
This prompted me to think of another approach: if we step in the opposite direction when the slope seems wrong, why not step further in the same direction when it's right? If the current point is very close to the true minimum, as we hope it is, another step in the same direction should overshoot, and give us a new triplet with a very small span, and a much more balanced configuration.
I tried this approach in Mathcad. Figures 8 and 9 show the results, and they are, in a word, stunning. In Figure 8, first notice that we are now converging in only six iterations. And, please note, these are safe iterations, in the sense that we no longer have to worry about points coalescing.
You can see that the tendency to hang onto outlying end points is gone. Instead of getting stuck at one value as happened before, the x0..x2 range follows the true error right down, decreasing by seven orders of magnitude over the number of iterations used. (Why didn't I use more iterations? Because the next step converged to a true machine zero. That point, unfortunately, can't be shown on a log plot.)
Furthermore, the tendency of two of the points to coalesce is gone. In Figure 9, you can see that the relative spacing now alternates from one side to the other, never going much smaller than about 10% of the total range. With such behavior, we can perform our interpolations confidently, with no worries about machine roundoff error.
Does this excite you as much as it does me? I think we have finally found an algorithm that really, really works-and works beautifully. Why did it take so long? Because, in my not-so-humble opinion, we spent an inordinate amount of time hanging on, for far too long, to the parabolic/bisection paradigm used by Brent. Once freed of that mindset, we are able to use other methods to accelerate convergence much faster.
Think of it this way: the whole approach to minimization, regardless of the algorithm, is to probe the function at various points, and try to find the probing point that is right at the true minimum. The trick is to probe judiciously, so as to minimize the number of function evaluations. It really doesn't matter how we choose those probing points-voodoo or ESP would be acceptable if we could make them reliable.
What we have learned over the past months is a couple of lessons that were hard to accept, given the popularity of Brent's method:
- Parabolic interpolation alone is too risky; two of the triplet points tend to coalesce.
- Bisection cannot and does not help, because it can't reduce the range quickly enough.
Despite the claims, the combination of the two methods used in Brent's algorithm don't do a very good job of helping us choose the next probing point. This new algorithm does. End of story.
Our challenge, over the next month or two, is to convert this approach, proven out using Mathcad, into executable code. I have no problems in using my state machine architecture to do that; I'm satisfied that it's up to the challenge. It also gives us a platform that's easy to modify, so we can alter and adapt the rules as we see fit. Please join me as we bring this new algorithm home. I think you'll find that it was well worth the wait.
Jack W. Crenshaw is a senior principal design engineer at Alliant Tech Systems in Clearwater, FL. He did much of his early work in the space program and has developed numerous analysis and real-time programs. He holds a PhD in physics from Auburn University. Jack enjoys contact and can be reached via e-mail at jcrens@earthlink.net.
Figures
Return to July 2001 Table of Contents