# Where Do We Stand?

**The minimizer series is winding down. This month, Jack adds the features that make this addition to your toolbox safe for the road.**

Imagine that we're trying to build a race car, rather than a function minimizer. Since I introduced the notion of a state machine as the proper structure for the system, we have had a solid, reliable chassis. With it, our system will withstand whatever we choose to throw at it, in the way of handling or performance demands.

Until my column “Looking at Clouds” (July 2001), in which I realized that Brent's method had some fundamental flaws, we were copying our “engine” design from the competition. Since that article, we have moved on to a new and innovative design that promises to provide us with a fire-breathing, power-producing engine that no one else can match. I presented the strategy behind this new design in that column, and a prototype version in September's column (“Making It Work”).

What we have now is a car that handles well, and has more power than we ever dreamed of. But it has no brakes, no body, no seat belts, and no traction control. In short, it goes like gangbusters, but is unsafe at any speed. Our goal for this month is to make it safe by adding all the bells, whistles, and myriad doodads that distinguish a rolling chassis from a complete, usable racing car.

I expect the results of this month's efforts to produce a minimizer safe enough, in practiced hands, to be used without fear of crashing out through the guard rails. That should free us up to go look at new topics, for a change.

**How do you stop this thing?**

The first order of business for any routine that uses iteration is telling it how to know when to stop. For us, that's the convergence criterion. I addressed that issue last month, and offered some new (or, rather, updated) data structures to support the logic.

Back when we were searching for a minimum using either simple bisection or golden section, the criterion was simple: begin with some search interval *x_{0}..x_{2}*, let the bisection logic update the endpoints as needed, and stop when:

*x_{2} – x_{0} < eps* (1)

That simple criterion worked beautifully for bisectors. Each bisection inexorably pulls in the boundary points, until we have the minimum nicely trapped between them. One can then return, as a best guess:

*x_{min} = (x_{0} + x_{2})/2* (2)
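For readers who'd like to see this bisection-era criterion in action, here's a minimal golden-section search, using Equations 1 and 2 as the stop test and the returned guess. This is a generic sketch of the classical method, not the code from this series; the function name and argument list are my own.

```cpp
#include <cmath>

// Generic golden-section search sketch (not this series' minimizer).
// Shrink [x0, x2] until x2 - x0 < eps (Equation 1), then return the
// midpoint (Equation 2).
double golden_min(double (*f)(double), double x0, double x2, double eps)
{
    const double r = 0.381966011;          // 2 - phi
    double xa = x0 + r * (x2 - x0);        // left interior point
    double xb = x2 - r * (x2 - x0);        // right interior point
    double fa = f(xa), fb = f(xb);
    while ((x2 - x0) > eps) {
        if (fa < fb) {                     // minimum trapped in [x0, xb]
            x2 = xb; xb = xa; fb = fa;
            xa = x0 + r * (x2 - x0); fa = f(xa);
        } else {                           // minimum trapped in [xa, x2]
            x0 = xa; xa = xb; fa = fb;
            xb = x2 - r * (x2 - x0); fb = f(xb);
        }
    }
    return (x0 + x2) / 2.0;
}
```

Each pass shrinks the interval by the same fixed factor (about 0.618), which is exactly why the simple interval-width test is safe here.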

However, that criterion didn't work worth diddly when using parabolic interpolation, because of the frustrating tendency of that method to hang onto outlying endpoints.

Recall, however, that we've now fixed that problem. My new techniques of overstepping and backstepping (I call them “twostep” and “flip,” respectively) cause both end points to pull nicely together, just as bisection did. For that reason, we could conceivably use the criterion in Equation 1. Just to prove it, I modified the minimizer to use this simple criterion. I got convergence to one part in 10^{8} using a mere four parabolic interpolations, and seven function evaluations (after finding the triplet). Remember, this includes the hip-hop stepping that burns a few extra function evaluations.

The key question facing us now is this: since we now have such a simple criterion for convergence, shall we use that, and that alone, or should we continue with more aggressive criteria? I'm tempted to say, the heck with it; given that *x_{0}* and *x_{2}* have been forced to almost coalesce, how can the minimum *not* be trapped between them? What more can we possibly hope for?

On the other hand, we've also come up with additional criteria that are way cool, and saved our collective buns before I hit on the hip-hop. I sort of hate to throw them away after working so hard to arrive at them. So I have no problem, at least for now, with adopting a criterion that combines more than one test. In past issues, we've seen cases where estimations of the minimum led to false indications that we'd converged (as, for example, when the parabolic fit suggests no change in *x* ). Because one can come up with so many pathological test cases, I have no problem using all the criteria I can dream up. Think of it as a case of belt and suspenders. In the code that follows, I think I've made reasonable choices as to how best to combine multiple convergence criteria.

The great race car designer Harry Miller built cars of such elegance and beauty that they not only won races, one made it to New York's Museum of Modern Art. His advice to would-be designers is a classic: “Simplify, and add lightness.” I love that. You don't remove weight. You add lightness.

If this minimizer were a racing car, it would most definitely be the heaviest, safest, and most bulletproof car ever built. At every turn, I've opted for safety rather than lightness. To make sure we had enough data to dodge false solutions, I chose to use five adjacent data points, instead of three. I'm using multiple halt criteria, instead of one simple one. I'm using estimates of *y = f(x)* , and not just of *x* . And I've developed estimation methods that we've never even used yet, in anger. They're still sitting in my source code, waiting to be invoked.

It may be fast, thanks to parabolic estimation, but as a race car, this minimizer is more reminiscent of an American LaFrance fire engine, than a Formula 1 car. Harry Miller would not be smiling.

But I'm content with what we've got. Fire engines almost never break down, and if they crash, it's more a matter of flattening whatever gets in their way. After I've got lots more miles on this thing, and I'm thoroughly convinced that it's safe, I may well decide to pare it down. That would be very much a case of removing weight, not adding lightness. But in this case, that's okay with me. Better safe and heavy, than light and broken.

**Concerning relative errors**

One last point of explanation: having established that we're going to use some criterion of the form:

*x_{2} – x_{0} < tol* (3)

we have to ask ourselves whether the measure should be absolute or relative in nature. Brent used a relative criterion, of the form:

*(x_{2} – x_{0})/|x| < eps* (4)

Frankly, I think that's crazy. If the function to be minimized happens to have its minimum at *x = 0* , Equation 4 blows up. Brent fixed this with a fudge factor added to *x* , but that solution is, well, a fudge factor. I much prefer to use a measure for the denominator of Equation 4 that can't possibly vanish.

Fortunately, we have an ideal candidate for such a measure. It's the initial search range, *x _{2} – x_{0} * . If we start with a search range of some value, be it 1.0 (as in my test case) or one million, we should be satisfied when the search process has narrowed down the minimum to within some fixed fraction of that range.
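To make the difference concrete, here's a toy comparison of the two styles of test. The function names, and the explicit fudge parameter, are my own inventions for illustration; Brent's actual tolerance expression differs in detail.

```cpp
#include <cmath>

// Toy comparison of convergence tests (names are illustrative only).
// brent_style scales the tolerance by |x|, so it needs a fudge term
// to survive a minimum at x = 0. range_style scales by the initial
// search range, which can never vanish.
bool brent_style(double step, double x, double eps, double fudge)
{
    return std::fabs(step) < eps * (std::fabs(x) + fudge);
}

bool range_style(double step, double range, double eps)
{
    return std::fabs(step) < eps * range;
}
```

With the minimum at *x* = 0 and no fudge term, `brent_style` can never report convergence, no matter how small the step gets; `range_style` has no such failure mode.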

The above prose serves to explain my choice of code, of which I show the pertinent fragments below:

```cpp
// global value for eps
static double eps;
// tolerance in x
static double x_eps;
// tolerance in y
static double y_eps;
```

in minimize():

```cpp
eps = Eps;                        // local to global
x_eps = (x1 - x0) * sqrt(eps);
```

This code converts an expected maximum relative error in *y* to an equivalent absolute error in *x*. Why did I use Eps as a y-error rather than an x-error? Because it feels right. We're used to thinking of errors in terms of what the computer can do in its floating-point representation. This error may vary from the order of 10^{-6} to 10^{-15}, depending on the accuracy of the floating-point arithmetic used. And we can't expect extreme accuracy for this problem, since the function curve is usually parabolic near convergence. This means it will have a shape given by:

*y ≈ y_{min} + c(x – x_{min})^{2}* (5)

in the vicinity of *x_{min}*. Therefore, the best relative accuracy we have a right to expect in *x* is the square root of that in *y*. I've found that I'm more comfortable thinking in terms of the expected error in *y*, and letting the computer do the mental conversion to the equivalent error in *x*.
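The square-root relationship is easy to check numerically. This little sketch (mine, not the column's) probes a parabola with its minimum at *x* = 2: an x-error of 10^{-6} disturbs *y* by only about 10^{-12}.

```cpp
#include <cmath>

// Demonstrate why x-accuracy is limited to the square root of
// y-accuracy near a parabolic minimum. (Illustrative sketch only.)
double parabola(double x)          // minimum value 1.0 at x = 2.0
{
    return 1.0 + (x - 2.0) * (x - 2.0);
}

double y_shift(double x_err)       // change in y for an error in x
{
    return parabola(2.0 + x_err) - parabola(2.0);
}
```

If the arithmetic can only resolve *y* to one part in 10^{12}, it literally cannot see an x-error smaller than about 10^{-6}; that's the conversion the `sqrt(eps)` factor performs.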

**What about y?**

The relative error in *y* is a bit harder to compute, because we don't have such a well-defined range given to us. The range *f(x_{2})* to *f(x_{0})* is meaningless, since the two may very well be equal, as they are in my test case. (Gosh, I love that test case!) Even the range over the entire initial triplet might be misleading, since we may later find y-values much smaller than any of the initial three. This suggests that we should adjust the *y*-range and, therefore, the absolute error limit dynamically as new *y*-values turn up. That's the reason for the rest of my error-controlling code:

```cpp
static double y_max = -1.0e20;   // defining values for range
static double y_min = 1.0e20;
static double y_eps;             // tolerance in y

// Adjust y convergence values
double test_y(double (*f)(double), double x)
{
   double Y = f(x);
   if (Y < y_min)
      y_min = Y;
   if (Y > y_max)
      y_max = Y;
   y_eps = (y_max - y_min) * eps;
   return Y;
}
```

We have one more useful secret in our bag of convergence tricks: the predicted value of *y*. Three of the methods (two parabolic interpolations, and one that simply fits two straight lines) not only produce an estimate of the value of *x_{min}*, but also the expected value of *y(x_{min})*. Brent didn't compute or use that value. I have always thought we should.

We can define:

*dy = y – y_{guess}* (6)

If the predicted and actual values of *y* agree, I can start to feel secure that we have a true minimum.

**The convergence criteria**

To summarize, here are some candidates for convergence criteria:

1. The range *x_{0}..x_{2}* has shrunk below *x_{eps}*

2. *x_{min}* has stopped changing

3. *y_{min}* is close to *f(x_{min})*

In my usual belt-and-suspenders fashion, I propose that we use all three. But how best to combine them?

As I hinted earlier, I think Criterion 1 trumps all the others. If the two end points have pulled together within *x_{eps}*, game over. We have the minimum trapped, and can simply declare victory. I'm therefore going to put that test in the main loop of **minimize()**. The other criteria may possibly short-circuit this test, ending the process sooner.

A nice side effect of Criterion 1 is that it works even after bisection steps. It does not depend on an estimate of *y _{min} * .

Criterion 2 is definitely unsafe, taken alone. It's too easy to conceive of cases where the process produces two successive values of *x _{min} * that are nearly equal, but still wrong. The same is true of Criterion 3, taken alone. However, Criterion 3 appears stronger to me than 2, since it suggests that our understanding of the function's behavior is not far from the truth.

If both criteria are met for, say, two successive tests (three iterations), I'll feel comfortable that the process really has converged.

We have one more issue to cover, which is what we should do about multiple methods of estimating the minimum. I have three in mind, and already coded, but so far I've only used one of them: the parabolic interpolation/estimation. Because of this, I'm going to defer that last question for the moment, and press on with a convergence criterion that fits this last method. I'll define a new function, **has_converged()** , which is shown in Listing 1.

**Listing 1: Convergence test**

```cpp
int has_converged(double dx, double dy, double &old_dx, double &old_dy)
{
   int retval;
   // make the d's positive
   dx = abs(dx);
   dy = abs(dy);
   // evaluate the criterion
   retval = (max(dx, old_dx) < x_eps) && (max(dy, old_dy) < y_eps);
   // shift the deltas
   old_dx = dx;
   old_dy = dy;
   return retval;
}
```

Note that both the old and new **d** s are passed into the function, so the caller must provide storage for them. This allows the function to be called from more than one place.

In function **para_fit()** , the call is simple enough:

```cpp
if (has_converged(dx, Y-Yguess, old_dx, old_dy))
   return DONE;
```

The variables **old_dx** and **old_dy** must be declared static, and initialized to some huge values:

```cpp
static double old_dx = 1e20;
static double old_dy = 1e20;
```

**Avoiding ill conditioning**

You've seen me write ad infinitum about the tendency of Brent's method to produce triplets that are ill-conditioned, such that the middle point is very close to one of its neighbors. Brent himself recognized this problem, and used bisection to help keep it under control.

Last month, I also said that parabolic interpolation can and will predict no movement at all (that is, it will suggest *x_{min} = x_{1}*) if *f(x_{0}) = f(x_{2})*.

Either of these conditions can be fatal to the success of the method, but until now we've been running without brakes in this area, counting on probability to save us from disaster. It's now time to correct this problem and make the parabolic interpolation safe.

Fortunately, this is easy to do. For example, suppose the parabolic interpolation says we should use the existing *x_{1}* as the next iterant. We can simply refuse to do so, and instead force it to be some distance from *x_{1}*.

How far should it be? Brent required it to be only a distance *x_{eps}* away. That's too close for my taste. I'm going to be a lot more conservative, and require it to be a distance equal to 1% of the gap between *x_{0}* and *x_{1}*. Note that this always shifts the estimate to the left, which is reasonable, since our search process always favors minima near the left-hand side of the range.

Similarly, if the interpolation predicts a minimum too close to *x_{0}* or *x_{2}*, I'm going to force the points to be closer to *x_{1}* by 1%.

_{1}This approach has a downside. It gets us back into what is effectively a bisection. It just happens to be a very lopsided bisection. That means we might not be converging as fast as we could be.

Suppose the range *x_{0}..x_{2}* is 1.0, and the interpolation gives us a value:

*x_{min} = x_{1} – 1.0×10^{-8}* (7)

Using the new algorithms **flip()** and **twostep()**, we can reasonably expect that both endpoints will be pulled in, and we'll have a range of only 2.0×10^{-8} next time. That's superlinear convergence in its finest and purest form.

But suppose both endpoints are not pulled in. In that case, the intervals *x_{0}..x_{1}* and *x_{1}..x_{2}* can differ by a factor of 10^{8}. That's a terrible situation, and is ill-conditioning at its worst. My solution, for better or worse, is simply not to allow such a configuration, and if that slows down the convergence, so be it.

Readers who've stuck this series out for a while may remember that, early on, I worried about cases where the value of *f(x_{min})* might be equal to the current value of *y_{1}*. This could cause us to lose the triplet condition. To avoid that situation, I introduced the concept of “wiggling” the new value *x_{min}*, moving it towards *x_{0}* if necessary, until it no longer gave an equal value.

_{0}**Listing 2: Function to limit xmin value **

**double wiggle(double X){ #define R 0.01 // limit X to safe values double left_limit = R * x[index] + (1-R) * x[index-1]; double left_middle = (1-R) * x[index] + R * x[index-1]; double right_middle = (1+R)* x[index] – R * x[index-1]; double right_limit = (1-R) * x[index+1] + R * x[index];**

// force X inside limits

X = max(X, left_limit);

X = min(X, right_limit);

// and not too close to x1

if (X <=>

X = min(X, left_middle);

else

X = max(X, right_middle);

** return X;}**

We no longer need that trick; the logic involved in inserting new values automatically assures that no equal values get stored. However, we now have to wiggle *x_{min}* for a new reason, and that's to avoid ill-conditioned triplets. So I'm introducing a new function, **wiggle()**, which is shown in Listing 2. This function assures that the value of *x_{min}* used is not too close to the existing values of *x_{0}*, *x_{1}*, or *x_{2}*. Calls to **wiggle()** must be placed in every function that computes a new value of *x*. For example, in **para_fit()**, I use:

```cpp
X = (x[index-1] + x[index+1] - m/c) / 2.0;
X = wiggle(X);
```

For my usual test function, the modified routine converges the same as it did before. In fact, none of the parabolic interpolations hit the limits for this case, so function **wiggle()** really doesn't make a difference. However, the new code is safely free of possible ill-conditioning, so we can use it with some confidence.

**Five-point functions**

Recall that I made the decision, quite some time ago, to retain five adjacent points, rather than only three, in the *x[]* and *y[]* arrays. That was to afford me the possibility of using algorithms other than bisection and parabolic interpolation.

I've been carrying along two functions that operate on five points. The first is a second parabolic interpolation, only it uses wider-spaced points spanning the full five-point range.

I've called this function **para_wide()** , and it works exactly like **para_fit** except for the choice of which points to use.

The second function is called **two_line()** , and it works by fitting two straight lines to the outer pairs of points. It estimates a minimum at the intersection of those lines. This algorithm was specifically included to handle functions like **abs(x)** that don't have a parabolic shape near the minimum.

Both functions require a call to **wiggle()**. Note also that this call forces the estimated *x_{min}* to be between the three innermost points of *x[]*, not just somewhere in the range. However, in these cases there's more to the test than simply limiting *x_{min}*.

These two functions should be called only if we have what I call a “V,” in which all five points satisfy a “quintet” relation:

*y_{0} > y_{1} > y_{2} < y_{3} < y_{4}* (8)

Apologies for the seeming new numbering on the points. This means that the triplet condition is also satisfied for the three innermost points. Although we're using the full five points to estimate *x_{min}*, that estimate had better be in the range of the inner triplet. If it's not, the estimate is surely bogus, and we're wasting our time using the result. Therefore, before calling **wiggle()** for these algorithms, we first qualify them to make sure they're worth our while.

To test for the “V” condition, I've generated a new test function, which is in Listing 3. The new estimation functions are shown in Listing 4.

**Listing 3: Test for “V” condition**

```cpp
int have_V(void)
{
   return ((npoints == 5) && (index == 2) &&
           (y[0] > y[1]) && (y[1] > y[2]) &&
           (y[2] < y[3]) && (y[3] < y[4]));
}
```

**Listing 4: The five-point functions**

```cpp
// do a “wide” parabolic fit of the three points x0, x2, and x4
STATE para_wide(double (*f)(double), double &dx)
{
   assert(npoints == 5);
   assert(index == 2);
   double X, Y, Yguess;
   double dx0 = x[index] - x[index-2];
   double dx1 = x[index+2] - x[index];
   double span = dx0 + dx1;
   double m0 = (y[index] - y[index-2])/dx0;
   double m1 = (y[index+2] - y[index])/dx1;
   double m  = (y[index+2] - y[index-2])/span;
   double c  = (m1 - m0)/span;
   static double old_dx = 1e20;
   static double old_dy = 1e20;

   // compute the estimated minimum
   X = (x[index-2] + x[index+2] - m/c)/2.0;

   // test for reasonableness
   if ((X < x[index-1]) || (X > x[index+1]))
      return PARA_FIT;

   X = wiggle(X);
   dx = X - x[index];
   Yguess = y[index-2] - c*(X-x[index-2])*(X-x[index-2]);
   Y = test_y(f, X);
   insert(X, Y);

   // have we converged?
   if (has_converged(dx, Y-Yguess, old_dx, old_dy))
      return DONE;
   else if (have_V())
      return LINE_FIT;
   else
      return PARA_FIT;
}
```

```cpp
// Fit straight lines through the outer two pairs of points,
// and find the intersection. Return the results in X and Y.
STATE line_fit(double (*f)(double), double &dx)
{
   assert(npoints == 5);
   double X, Yguess, Y;
   static double old_dx = 1e20;
   static double old_dy = 1e20;
   double m0 = (y[1] - y[0])/(x[1] - x[0]);
   double m3 = (y[4] - y[3])/(x[4] - x[3]);
   X = (m3*x[3] - m0*x[0] - (y[3]-y[0]))/(m3-m0);

   // test for reasonableness
   if ((X < x[index-1]) || (X > x[index+1]))
      return PARA_FIT;

   X = wiggle(X);
   dx = X - x[index];
   Yguess = y[0] + m0*(X-x[0]);
   Y = test_y(f, X);
   insert(X, Y);

   // have we converged?
   if (has_converged(dx, Y-Yguess, old_dx, old_dy))
      return DONE;
   else
      return PARA_FIT;
}
```

I should say something here about the progression of states from one to the other. As things stand now, a call to **para_fit()** is followed by a call to either **flip()** or **twostep()** . At the end of each of these functions, I now test for the “V” condition. If it exists, I call **para_wide()** on the next iteration. If the “V” condition still holds after that call, I call **line_fit()** next. That function then calls **para_fit()** again.

So, assuming that the “V” condition persists, the order of calls will always be:

**para_fit → flip/twostep → para_wide → line_fit → para_fit → etc.**

Is this the optimal sequence? Almost certainly not. With the new functions in place, we now perform 14 function evaluations, where before we used only nine. Adding the new five-point functions doesn't save us any time; it makes things worse.

Recall, however, that my test function has the behavior of a parabolic function near its minimum. It is hardly surprising that pure parabolic interpolation works nicely for this case. Remember, the ultimate goal is to probe the function space using intelligent choices for *x* , trying to locate the minimum. How we arrive at these choices is not as important as whether they're useful.

My results using **para_wide()** and **line_fit()** seem to show that they are not of much value for my usual test case. This does not at all mean that they are of no value in general.

**Where do we stand now?**

Last month, we had a racing car that went very fast, but had no body, no brakes, and no seat belts. As of now, we have added the safety features, and have an algorithm that is quite bulletproof. It not only avoids bad situations like ill-conditioned triplets, it now uses all five saved data points to squeeze the maximum information out of the available data.

Does it work? Yes. Is it reliable? Yes. Does it use an intelligent set of convergent criteria? Yes. Does it converge for all test cases? Yes, for the seven test functions I've tried it on. Does it have all the bells and whistles that I originally intended to give it? Yes, it does.

Does it converge optimally fast? Alas, it does not. Think of this month's tests as the first race of a season. We still have some shakedown and tuning to do.

In particular, I don't think I'm currently making the best use of the five-point functions. I feel reasonably sure that they have something to offer us for the most ill-behaved functions, but I don't think I'm using them to greatest effect. I think the current state transitions, which pretty much call the functions in a fixed sequence, are almost certainly not optimal. Remember that the two-line function was specifically intended to help us nail functions that have sharp minima, like **abs(x)** or **sqrt(abs(x))** . For those cases, it makes no sense at all to alternate between the two-line and parabolic interpolations.

What's required, then, is a more intelligent set of rules as to how to decide whether a given function is doing us any good, and when to stop using it if it's not. I'll be working on that logic for next month.

**Jack W. Crenshaw** is a senior software engineer at Spectrum Astro in Gilbert, AZ, and a specialist in the application of advanced methods to the solution of real-world, embedded-systems problems. He is the author of *Math Toolkit for Real-Time Programming,* from CMP Books. He holds a Ph.D. in physics from Auburn University. Jack enjoys contact and can be reached via e-mail at .

Return to October 2001 Table of Contents