Bingo! Months of agonizing produce an intellectual breakthrough. Can a quality minimizer be far behind?
Boy, life does deal some crazy surprises, doesn't it? As I type this, I'm sitting here in Manatee County, Florida, one of the few counties that didn't have a controversy stemming from the presidential election, and wondering who our next president is going to be. When I scheduled my vacation from my day job for November 9, who could have known that I'd spend most of it glued to CNN? It was not exactly what I had in mind.
To paraphrase Jay Leno, it seems that the only kind of news we get these days is the breaking kind, so much so that our entire news system must be broken.
Because of the delays in the publication process, I can only presume that by the time you read this, you will know who the 43rd President of the United States is. If you don't, then don't bother reading this column; you have more important things to worry about. I will not comment on the politics that are unfolding before our collective eyes, except to note that such events as this election tend to give one a sense of perspective and sharpen one's priorities. At a time when the fate of the free world hangs in the balance, trying to write the world's best minimization algorithm seems rather insignificant.
Nevertheless, I have some news-breaking, of course-on that front. I have been tinkering around with new algorithms for parabolic interpolation, and I think I'm on the trail of an algorithm that blows Brent's method away. What's more, one of its features is dear to my heart, and the very characteristic I'd hoped to achieve all along: simplicity.
Before diving in, let's review the bidding for the benefit of newcomers. Quite some months ago, I began a long-term study into methods for finding the minimum of a function:
We seek the value xmin such that f(xmin) is smaller than f(x) for any other value of x, at least within some vicinity of xmin (Equation 1). This algorithm is key, because in addition to the problems it allows us to solve directly, it figures into most methods for multivariate optimization.
You may recall that I breezed through all the quick and dirty approaches, such as simple mesh-scanning and bisection methods, including the golden section method, and then went on to seek solutions that depend on second-order curve fitting; these are the so-called superlinear methods. In words of few syllables, we attempt to fit a second-order polynomial (parabola) through three data points, and use the fitted curve to estimate the location of the minimum. When I began this process, I thought I had an ace in the hole: an existing technique known as Brent's method, developed, oddly enough, by a fellow named Brent.  This method seems to be the de facto standard function minimizer. It's a method given, along with code to implement it, in Numerical Recipes. 
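For newcomers who missed the earlier installments, the golden section method mentioned above is worth a quick sketch. This is an illustrative version of my own (the name `golden_section_min` and its interface are mine, not code from this series): each iteration discards the subinterval that cannot contain the minimum, shrinking the bracket by a constant factor of about 0.618 per function evaluation, which is exactly why its convergence is only linear.

```python
import math

# Ratio by which the bracket shrinks each iteration (~0.618034)
GOLDEN = (math.sqrt(5.0) - 1.0) / 2.0

def golden_section_min(f, x_lo, x_hi, tol=1e-8):
    """Locate a minimum of f bracketed by [x_lo, x_hi] by golden-section
    search.  Keeps two interior probe points placed so that one of them
    can be reused after each interval reduction."""
    a, b = x_lo, x_hi
    x1 = b - GOLDEN * (b - a)          # left interior point
    x2 = a + GOLDEN * (b - a)          # right interior point
    f1, f2 = f(x1), f(x2)
    while (b - a) > tol:
        if f1 < f2:                    # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - GOLDEN * (b - a)
            f1 = f(x1)
        else:                          # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + GOLDEN * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)
```

Note the design point: because GOLDEN² = 1 - GOLDEN, one old interior point always lands exactly where a new one is needed, so each iteration costs only a single fresh function evaluation.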
Unfortunately for me, I don't give algorithms in this column unless I'm able to explain them. When I set out to understand Brent's method, I discovered that I couldn't, and Numerical Recipes was not much help. What followed next was a long series of vamping, as I tried to learn more about the method. I sought explanations from the Internet, from literature searches, and from any source I could find. It culminated in my receiving, from a sympathetic and most generous reader, a copy of Brent's book itself. Imagine my surprise to learn that even Brent doesn't explain Brent's method.
Brent developed his method as part of his PhD dissertation, and the book is an outcome of that study. Apparently, all these folks who have been using the method all these years were content to take Brent's word for it and use the method because it worked. I like to think that surely someone besides me must have wondered why Brent took some of the approaches he did, and worked out the reasons. If anyone did, I haven't been able to find them. Unlike yours truly, most users of the method don't have to write columns on it; they only need to let it work for them. And it does seem to do that.
Things could have ended there. I could have simply given you Brent's method, said, “I have no idea how it works, but here it is anyhow,” and gone on. Except that-aside from the obviously unsatisfying taste such an approach imparts-when I tested the method, I saw some things I didn't like.
More on perspective
Let me try to put things in the proper perspective, that commodity that seems so rare in these turbulent times. When we write an equation like Equation 1, we are clearly thinking of a function f(x) that's continuous; it has a value for every x over some predefined range (which may be infinite). (Clearly a continuous function has other properties as well, but this first one is essential.) In a solution of this kind, however, we don't ever get to see f(x) in all its glory; we only get samples of it when we probe its values at selected points. It's as though we were walking around on a snow-covered hillside, trying to deduce the shape of the hill by poking a stick through the snow.
The operative phrase is “selected points.” We don't have an endless amount of time in which to map the function. Suppose a mountain climber is buried under that snow, and we have only so long to find him. The unspoken imperative is to locate the minimum with as few probings as possible.
We must assume that this function is not trivial. If it were trivial, we could apply calculus to it and find the minimum analytically. Therefore, we have to assume that the software that calculates f(x) is fairly complex, and calling it costs CPU clock cycles. Although the total CPU time required for a solution depends on both the cost of evaluating the function and the overhead associated with computing the next point to probe, we will not go far wrong if we assume that the cost is directly related to the number of times we evaluate f(x). The goal of any advanced method for minimization, then, is to choose the probing points with great care, using as much information as we have so far. Fitting a parabola to points near the current best minimum is one strategy we can use to choose the best place to probe next.
As we sample the function at new points, we must also decide what to do with existing points. We want to use all the knowledge available, but if we have several points tightly clustered around the minimum, it does us no good to hang on to points much farther away. They no longer give us any useful data. Therefore, a successful algorithm has two features:
- It chooses its probing points wisely
- It manages the existing data to narrow the field of focus
I should mention one third characteristic, also usually unstated but critical: the algorithm must know when to quit. Taken too far, any method will quickly vanish into a haze of numerical errors. Nowhere is this more true than in minimization, where the shape of the curve tends to flatten out as we get closer and closer to the solution. Imagine a deep Moon crater as seen from space. From a distance, it might appear extremely deep and well defined, like a soup bowl, with an obvious bottom at some point midway between its walls. As you look closer though, the walls seem farther away, and the bottom of the bowl not nearly so well defined. If you were walking along the crater floor in your moon suit, it might be difficult to even tell that you're in a crater, much less where the lowest point is. If the crater is large enough, the curvature of the Moon could even place the walls over the horizon. At this level of detail, boulders and mini-craters (representing numerical debris, also known as round-off error) can mask the minimum completely. It may well be that the most difficult part of the whole process is deciding when we've squeezed as much out of the function as we're going to, and any further attempts to narrow down the minimum will cause us to lose it completely.
Trouble on the lunar horizon
For several months, I struggled to put this issue of second-order minimization to bed. Each month, I hoped to either understand Brent's method at long last, or give you another one that worked as well. Each time I thought I was close, I found some things that seemed wrong enough to make me take a second look.
Things finally began to break my way last month, when I realized two facts of life:
- I was not going to generate a complete, working algorithm in a single month
- Software for algorithms like this tends to be written with very tight and, therefore, inscrutable code. It is there to solve the problem, not illuminate the method
The only way I was going to get loose from this tar baby was to take things slowly and study the problem at greater length. I also realized I needed to make the software very understandable, and worry about efficiency another day. That process began last month. Since last month's column, I've been diligently poking and prodding into minimization, defining and using test functions, and trying approaches that seemed promising. One thing you got last month was a driver that makes an initial scan of the function, and picks three points, P0, P1, and P2, such that:
y0 > y1 < y2, where x0 < x1 < x2 (Equation 2)

In words, P1 is lower (has a smaller y-value) than its two neighbors. This means that the true minimum must be somewhere in the range x0..x2. We have it bracketed. One of the jobs of any search algorithm is to keep it that way, by maintaining the condition in Equation 2.
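The stepping scan that establishes this bracketed configuration can be sketched as follows. This is a minimal illustrative version, not last month's driver; the name `bracket_minimum` and the even-step interface are my own assumptions.

```python
def bracket_minimum(f, x_start, x_end, n_steps=20):
    """Step along f in even increments until three consecutive samples
    satisfy Equation 2: y0 > y1 < y2.  Returns the bracketing triplet
    (x0, x1, x2), or None if no interior minimum was seen."""
    h = (x_end - x_start) / n_steps
    x_prev, y_prev = x_start, f(x_start)
    x_mid, y_mid = x_start + h, f(x_start + h)
    for i in range(2, n_steps + 1):
        x_next = x_start + i * h
        y_next = f(x_next)
        if y_prev > y_mid < y_next:    # Equation 2 holds: bracketed
            return (x_prev, x_mid, x_next)
        x_prev, y_prev = x_mid, y_mid
        x_mid, y_mid = x_next, y_next
    return None
```

For example, scanning f(x) = (x - 2)² over [0, 5] in ten steps yields the triplet (1.5, 2.0, 2.5), which brackets the minimum at x = 2.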
The rest of the solution-the hard part-is to probe f(x) with new points in a way that maintains this condition, narrows the range x0..x2, and locates our best guess as to the true minimum. I've been working on that since we last talked. I now know a lot of approaches that don't work, but I've found at least one that does, and very well indeed.
So do I now have the ultimate solution? Nope. Can I give you executable code that solves the problem once and for all? Not yet, but I think I see the light of dawn on the horizon. I not only understand the problems that gave me concern, I think I see how to fix them. This month, I'll share with you what I've learned.
Am I going to drop the other shoe and tell you about my new breakthrough? Again, no-not yet. I had to suffer to get to where I am now. It's only fair that you suffer too. More importantly, by seeing the problem Brent's method leads us into, you'll see the solution more clearly. Trust me, it will be worth the wait.
Things that didn't work
Least squares fit. Believe me, I tried lots of approaches; some were so cockamamie no one in their right mind would give them a second thought. One of my first, and worst, ideas was to use a least-squares fit to the data, rather than a polynomial fit. I thought that in doing so, I'd somehow squeeze more information from the data, rather than relying only on the last few points. That method didn't work at all well. The problem is that outlying data corrupts rather than sharpens the solution. If we could truly deduce the nature of the function-if, for example, we could learn its true mathematical form-then a least-squares approach might work. But in practice we don't know-we can't know-the nature of the function. Even if we did, almost all least-squares methods assume a polynomial as the fitting function. If f(x) is not really a polynomial, we can never truly fit it, no matter how hard we try. The least-squares method has one other slight flaw. It crashes just when you need it most. Its solution depends on inverting a matrix. When you get close to the solution, the matrix goes singular. End of game, end of story.
The two-line gambit. You might recall me saying last month that I thought perhaps we should be retaining the best five, rather than the best three, data points. In Numerical Recipes, Press et al. point out that Brent retains as many as six past data points, so I thought using five would be good. I'll talk more about this later, but one neat feature of having five points is that you can fit straight lines to the two extreme points beside the central one. The math is simple enough. Let the points be P0, P1..P4. A line passing through P0 and P1 can be described by the form:

yL(x) = y1 + mL(x - x1), where mL = (y1 - y0)/(x1 - x0)

and, similarly, a line through P3 and P4 by:

yR(x) = y3 + mR(x - x3), where mR = (y4 - y3)/(x4 - x3)

Because y1 must be less than y0, and y3 less than y4, we can be assured that neither mL nor mR is zero, and that mL is negative and mR is positive. The intersection of the two "curves" occurs when their x- and y-values are equal:

x = (y3 - y1 + mL x1 - mR x3)/(mL - mR)

In short, we're guessing at the minimum by looking at where the two bounding lines cross. I thought that this point might serve to validate the estimates that come out of a parabolic interpolation. It may do so yet, but I didn't find it helpful for my studies. I'm holding this approach in reserve for a very important set of special cases, cases in which the function f(x) is continuous, but its derivative is not. The simplest example is:
which is one of the test functions I gave you last month. Any method that depends on a function that looks like a parabola near the minimum is going to go crazy with a(x). The two-line fit, on the other hand, will nail the function in one iteration. It's a handy method to keep in reserve for some strange functions.
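The two-line estimate is easy to put into code. Here is a sketch under my own naming conventions (the function name and the (x, y) tuple interface are assumptions, not code from the column): it computes the two slopes and solves for the crossing point.

```python
def two_line_estimate(p0, p1, p3, p4):
    """Estimate the minimum as the intersection of the descending line
    through p0-p1 and the ascending line through p3-p4.  Each point is
    an (x, y) tuple."""
    (x0, y0), (x1, y1) = p0, p1
    (x3, y3), (x4, y4) = p3, p4
    mL = (y1 - y0) / (x1 - x0)   # negative: left side descending
    mR = (y4 - y3) / (x4 - x3)   # positive: right side ascending
    # Solve y1 + mL*(x - x1) = y3 + mR*(x - x3) for x
    x = (y3 - y1 + mL * x1 - mR * x3) / (mL - mR)
    y = y1 + mL * (x - x1)
    return x, y
```

As a check on the claim that it nails a V-shaped function in one iteration: for f(x) = |x - 1| sampled at x = 0, 0.5, 1.5, and 2, the two lines cross exactly at (1, 0), the true minimum.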
A detailed example
When I first started studying minimization methods, I did a lot of work in Mathcad, trying different approaches and plotting the results. (Explanation for those with good memories: yes, I know, I know, I swore I'd never use Mathcad again. I was going to use Matlab and Maple. Unfortunately, I haven't had time to learn them. So for many problems, I'm still muddling along with Mathcad. I have been able to make it workable by reverting to what is, in my opinion, its last stable version: v. 6.0.) I've wanted to do more detailed studies with the test functions I've defined in recent months, but time never seems to permit. In preparation for this month's column, I did another very detailed study, which I'll outline shortly. The study just happens to point up, with crystal clarity, the main problem with all methods such as Brent's, including my own. Fortunately, it also hints at a cure. I'll begin with my usual test function, the by-now boringly familiar function:
When I first chose this function, I didn't have any grand design in mind; it just seemed like a good example that wasn't so symmetric as to hide potential problems. It also happens to be transcendental, not a polynomial, so any solutions that expect a polynomial won't have an edge. In recent months, I've come to appreciate the fact that this function makes an excellent test case. While it's continuous and analytic, it's also perverse enough that any flaws in the method quickly rise to one's attention.
To begin the iteration, we need to get points on the function that satisfy the condition of Equation 2. To do that, I assumed that we step along the curve in even steps-as the driver function I gave last month does-until the condition is satisfied. That process gives us the initial starting configuration to begin the iteration, as shown in Table 1.
Because I wanted to make sure I had enough points to try different things, I wanted to begin with at least five points around the current minimum. To get the extra two points, I began with two golden section steps. This resulted in the configuration shown in Table 2.
We've talked about the method of parabolic interpolation before, as well as how to use it. The general idea is to fit a parabola through three selected points and differentiate it to find its minimum value. I won't bore you with the math again-if I do it every month, we will never be done with this thing. Suffice it to say that a formula exists for predicting not only the location of the minimum along the x-axis, but also its y-value. Note that this value is not, in general, the same as f(x) at that point. In fact, the difference between the two can serve as a measure for our estimate's accuracy. This is one area where my methods differ from Brent's. He doesn't consider the y-values of the function in his halting criteria; I do. In past months, I've talked about using three of four points to get two estimates of the minimum. This time I chose to use three of five points twice. The points I chose are P0, P2, and P4, and P1, P2, and P3.
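For readers who want the formula in executable form, here is one standard way to write it (a sketch in my own notation, using Newton's divided differences; the name `parabolic_min` is mine). It returns both the x-location of the fitted parabola's vertex and the predicted y-value there, which is the quantity I compare against the actual f(x) in my halting criteria.

```python
def parabolic_min(x0, y0, x1, y1, x2, y2):
    """Fit a parabola through three points using Newton divided
    differences and return (x_min, y_predicted): the vertex location
    and the parabola's predicted function value there."""
    d1 = (y1 - y0) / (x1 - x0)
    d2 = (y2 - y1) / (x2 - x1)
    a = (d2 - d1) / (x2 - x0)          # second-difference coefficient
    if a <= 0.0:
        raise ValueError("points do not bracket a minimum (a <= 0)")
    xm = 0.5 * (x0 + x1) - d1 / (2.0 * a)
    ym = y0 + d1 * (xm - x0) + a * (xm - x0) * (xm - x1)
    return xm, ym
```

Sanity check: for y = (x - 2)² + 1 sampled at x = 0, 1, and 3, the fit recovers the exact vertex (2, 1), as it must for any true parabola.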
Examination of the data shows us that the point at xm2 is our new minimum, so we throw out the two outliers. Table 4 shows the resultant data set.
Sharp-eyed readers will notice one thing immediately. After only one parabolic step (which requires two function evaluations), we have pulled in both end points. When I did this same study using Brent's method, the value x4 remained stuck at 1.0 for several iterations. Here we have already pulled in both end points, and reduced the search interval from a width of 0.2 to:
0.861803 – 0.738197 = 0.123606
This is definitely a good sign, and gave me great hope for a good finish. The moral is simple: while it may seem frivolous and wasteful to do two parabolic fits in one step, we get good data from both fits, which we can use to our advantage. Table 5 shows another iteration. As before, we take the five points straddling the minimum to get the data set shown in Table 6.
Though it may not be obvious, storm clouds are brewing on the horizon. This time, we didn't get the luxury of being able to pull x4 in. We had to cull the two lowest x-values instead. If you plot the locations of the points, you will find a big gap between x4 and the other four points. I saw this tendency of the points to bunch up in my early studies of Brent's method. So did Brent. He includes a test to make sure a parabolic step doesn't give a result too close to an existing point. If it does, he forces a step of golden section. So did I.
However, Brent used what I consider a risky criterion for the test: he flagged the problem only if the two points were near an epsilon distance (related to machine precision) apart. My criterion was much more conservative. I decided to force bisection if the points were within 5% of the total range. My logic for the bisection is also different. I forced a bisection after accepting the new points, not rejecting them. The results after the bisection step are shown in Table 7. Do you see the problem yet? There's that pesky 0.8, still stuck in the list.
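My more conservative closeness test is simple enough to show in a couple of lines. This is a sketch of the idea, not production code; the name `too_close` and the argument list are my own.

```python
def too_close(x_new, xs, x_lo, x_hi, frac=0.05):
    """Return True if the proposed probe point x_new lies within frac
    (5% by default) of the total search range of any existing point.
    That is the cue to fall back on a bisection step rather than
    accept the parabolic estimate."""
    span = x_hi - x_lo
    return any(abs(x_new - x) < frac * span for x in xs)
```

For instance, with existing points at 0.7, 0.8, and 0.9 over the range [0.7, 0.9], a proposed probe at 0.805 trips the guard (it sits within 5% of the range from 0.8), while a probe at 0.75 does not.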
Sadistic I may be, but I won't put you through the torment of following the analysis through to its ultimate conclusion. Yes, I can see that one can get a solution to converge, but it is neither easy nor fun. To get those end points in, I had to continue executing bisection steps. What's more, because of my 5% rule, I found myself alternating between parabolic and bisection steps each time. Table 8 shows the last data set I obtained before throwing up my hands. Yes, the method is converging, but the convergence is starting to bog down because I'm having to bisect after every parabolic iteration. In short, my convergence is now linear, not quadratic.
What's going on?
Hindsight is a wonderful thing. Now, at last, after all these months, I finally see the problem that has plagued my attempts at a solution. More exciting, I also see how to fix it.
It would be nice if we could guarantee that as we iterate and converge on a solution, we keep the five (or three, or seven) data points nicely spaced out (that is, balanced) around the current best guess for the minimum. Unfortunately, the inexorable laws of mathematics are working against us. The problem is that the convergence in Brent's method, or any parabolic fit, is superlinear; it converges quadratically, so that every estimate doubles the number of significant digits in the result. However, the convergence in bisection methods is merely linear, which means that the gaps don't shrink nearly as fast as they do for the parabolic fit.
The implication to our dream of balanced data points is clear. We can't have them. The points estimated by the parabolic steps are going to get close together much faster than those of bisections, so we can expect the imbalance to get worse, not better, as we approach the solution. As a matter of fact, we can expect the ratio of largest to smallest intervals to grow as we near the solution. This fact of life has two immediate and profound results for our method.
First, recall the golden section method, in which we used the distance between the extreme points in x as our halt criterion. You might also recall my lament that the same criterion is no help in Brent's method, because the end points tend not to pull in. Again, we can now see why. The best we can hope for from bisection is a halving of the interval at each step. If Brent's method is going to converge faster than bisection, it must do so on some criterion other than the width of the total interval. And indeed it does, though I wasn't all that happy with the criterion Brent chose.
Second, because two of the three points we use for the curve fit are bound to be much closer than the third, the solution is ill-conditioned. I am not sure that this matters much; if we have a point very close to the true minimum, an error in computing another point even closer is not fatal. But it's not satisfying either.
You might think that use of parabolic steps would make the problem go away. Sometimes the estimate would hit on the left side of the true minimum, sometimes on the right. So on the average, we should be pulling both end points in towards the middle.
Unfortunately, it doesn't work that way. Think back to your experiences trying to find the root of a function by extrapolating its slope (Newton's method). The way you approach the root depends entirely on the shape of the function, and that approach is always monotonic after the first couple of steps. If the function curves upward, your estimate will always be to the left of the true root. If it curves downward, it will always be to the right. No matter what your initial guess is, you will find yourself, in the end game, sneaking up on the solution from one side or the other.
The same is true for the minimum of f(x). I told you it was a good choice for a test function, and this is one of the main reasons. Because of the x³ term in Equation 9, the function never looks quite like a parabola. It's always a bit flattened on the right side, and all estimates tend to end up to the left of the true minimum. Because of this, our data points for this particular test function tend to bunch up on the left side, but have very large gaps on the right. As we've seen, bisection doesn't help, because the gap doesn't shrink fast enough.
I should comment here on Numerical Recipes, which states, in reference to Brent's method, that the end condition is usually one in which the triplet is balanced, with x0 and x2 a distance epsilon (defined in the code) left and right of the solution. I don't believe this. I haven't seen it happen in any of my tests, and from the foregoing description, I don't see how it can happen. Barring bouncing around of the solution because of numerical error, I see the range x0..x2 remaining large compared to epsilon, and the end state as one of imbalance.
Help is on the way
As you might expect, I would not be telling you all this if I wasn't going somewhere with it. Again, once you see the problem, the solution seems almost obvious. If a combination of bisection and parabolic fit is not going to give us balanced data points, we need another mechanism to do so. It's that simple. What is the mechanism? I'll tell you in a moment, but it doesn't really matter what that mechanism is. All we're required to do is probe the function, like that ski patrol guy poking holes in the snow. We don't have to justify our choice of where to probe. Anywhere we choose is "legal." Our only motive in seeking an algorithm is the hope that it will help us probe wisely.
My epiphany came when I went back and redid the analysis I just outlined, with a couple of differences. First, I decided not to use two parabolic fits at each step. This might surprise you, since the method seemed so promising early on. It's not so promising later.
Recall, I fit the two parabolas by using the two data sets, P0, P2, and P4, and P1, P2, and P3.
This approach worked just fine when the data was spread evenly over the search interval. Unfortunately, it doesn't stay that way. P1 and P2 tend to bunch up together, while P0 and P4 may not. In my tests, the end result was that the parabola fitted through P0, P2, and P4 tended to pass through P2 with a slope so close to zero as not to matter. That meant that its estimated minimum only echoed P2.
Second, I dropped the initial bisection step to get my five points. Since I'm now back to using only three points, and iterating a fourth, the starting configuration is fine to get me going. A nice side effect, for those function-call counters, is that we eliminate two function calls in the opening move. By the way, I also reverted from golden section to true bisection. I did not find that the golden section gives any advantage to make up for its added complexity.
My starting points were the same three as before: x = 0.7, 0.8, and 0.9. The first interpolation gave:
xmin = 0.78398
yielding the data set shown in Table 9. Clearly, x2 is our choice for the new minimum. But look how close this point is to x1. Already, we have an imbalance in which the distance between x1 and x2 is only 13% of the total range. While this is well outside even my own pessimistic 5%, it's also on the very first step. So it doesn't bode well for the future. The secret key to fixing the problem comes when you look at the predicted vs. the actual value of f(x) at the interpolated point. They are:
y_predicted = -1.013983
y_actual = -0.993507
Our interpolation predicted that the minimum would be less than the value at x2. Actually, it was greater. What does this tell you, or at least hint at? It tells me that the slope at P2 isn't what I expected it to be. I expected the curve to be moving upward, with a positive slope at P2. In actuality, it's still going downward.
This means that the true minimum is to the right of P2, not the left. We're looking in the wrong place! This inspired me to probe again, on the right of P2. Where should I probe? Well, the two values x1 and x2 are a distance:
Dx = x2 – x1
apart. For this case, the distance is 0.01602. It makes reasonable sense to probe the same distance on the opposite side, at:
x = 0.81602.
The value of f(x) there is:
y = -0.963086
This new point doesn't give us a better estimate of the minimum; that value is still found at x2. What it does give us, though, in spades, is a reduced range between data points. After eliminating the outlier, we have the triplet shown in Table 10.
In one step, our range has been reduced from 0.2 to 0.03, a factor of 6. Not bad. This approach only gets more effective as the process continues. I'd like to be able to tell you that it always works. It doesn't. Sometimes the predicted value of y isn't in the wrong direction, which means the slope isn't wrong and we can't use the delta-approach. Other times, we can use it, but it still doesn't yield a triplet satisfying Equation 2. Even so, my test shows that the method converges dramatically better, and we can maintain a balanced data set all the way down to convergence.
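The logic of the delta-step can be condensed into a few lines. This is a sketch under my own naming (the function name and argument order are assumptions): x_best is the current best point, x_near is the freshly interpolated point just beside it, and the step fires only when the parabolic prediction came out in the wrong direction.

```python
def delta_probe(f, x_near, x_best, y_predicted, y_actual):
    """The 'delta step': if the parabolic fit predicted a lower value
    at x_near than we actually got (y_predicted < y_actual), the slope
    near the minimum is not what we assumed, and the true minimum
    likely lies on the far side of x_best.  Probe there, mirroring the
    spacing between the two points.  Returns (x_probe, f(x_probe)),
    or None when the prediction was consistent and no delta step is
    warranted."""
    if y_predicted >= y_actual:
        return None                    # prediction consistent; no delta step
    dx = x_best - x_near               # Dx = x2 - x1 in the text
    x_probe = x_best + dx              # same distance, opposite side
    return x_probe, f(x_probe)
```

Plugging in the numbers from the worked example above (x1 = 0.78398, x2 = 0.8, with the predicted y of -1.013983 against the actual -0.993507) reproduces the probe point x = 0.81602.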
To spare you further suspense, the complete set of triplets for the full process is listed in Table 11. To save space, I'm showing only the x-values. The y-values don't contribute much anyhow. Yes, you read that right. The method has converged in only four iterations! Compare the final value with the theoretical one:
The error is only -2.154707 × 10^-9, which is better than we have any right to expect.
Understand that the four iterations above don't tell the whole story; there's a lot going on behind the scenes. Sometimes I used the delta-step, which eats a function evaluation. Sometimes I used bisection, which eats another. If the delta-step fails, you must also bisect, so that's two function evaluations. Nevertheless, the total number of evaluations was only nine, not counting the initial three that we had going into the algorithm. That's a new record for my tests, which I find most encouraging.
Making an algorithm work when you're creating it on the fly is one thing. Making it work for all test cases is quite another. We still have a long way to go, and I realize that. Nevertheless, I think we're on the right track. The way to fast convergence is to keep the data set balanced. The trick is to recognize that a combination of bisection and parabolic fit will not, and cannot, do this. We needed that one extra strategy, which, for me, has turned out to be delta steps. No doubt even more strategies exist. Perhaps you folks can think of even better ones. The important thing, again, is to realize that a third strategy is needed. That was the breakthrough I've been searching for all these months. Now that I see it, I expect fast progress. Next month, I'll have a go at turning theory into code. See you then.
Jack W. Crenshaw is a senior principal design engineer at Alliant Tech Systems in Clearwater, FL. He did much early work in the space program and has developed numerous analysis and real-time programs. He is the author of Math Toolkit for Real-Time Programming from CMP Books. He holds a PhD in physics from Auburn University. Crenshaw enjoys contact and can be reached via e-mail.