Where Do We Stand? - Embedded.com

Where Do We Stand?

The minimizer series is winding down. This month, Jack adds the features that make this addition to your toolbox safe for the road.

Imagine that we're trying to build a race car, rather than a function minimizer. Since I introduced the notion of a state machine as the proper structure for the system, we have had a solid, reliable chassis. With it, our system will withstand whatever we choose to throw at it, in the way of handling or performance demands.

Until my column “Looking at Clouds” (July 2001), in which I realized that Brent's method had some fundamental flaws, we were copying our “engine” design from the competition. Since that article, we have moved on to a new and innovative design that promises to provide us with a fire-breathing, power-producing engine that no one else can match. I presented the strategy behind this new design in that column, and a prototype version in September's (“Making it Work“).

What we have now is a car that handles well, and has more power than we ever dreamed of. But it has no brakes, no body, no seat belts, and no traction control. In short, it goes like gangbusters, but is unsafe at any speed. Our goal for this month is to make it safe by adding all the bells, whistles, and myriad doodads that distinguish a rolling chassis from a complete, usable racing car.

I expect the results of this month's efforts to produce a minimizer safe enough, in practiced hands, to be used without fear of crashing out through the guard rails. That should free us up to go look at new topics, for a change.

How do you stop this thing?

The first order of business for any routine that uses iteration, is telling it how to know when to stop. For us, that's the convergence criterion. I addressed that issue last month, and offered some new (or, rather, updated) data structures to support the logic.

Back when we were searching for a minimum using either simple bisection, or golden section, the criterion was simple: begin with some search interval x0 ..x2 , let the bisection logic update them as needed, and stop when:


(1)

That simple criterion worked beautifully for bisectors. Each bisection inexorably pulls in the boundary points, until we have the minimum nicely trapped between them. One can then return, as a best guess:


(2)

However, that criterion didn't work worth diddly when using parabolic interpolation, because of the frustrating tendency of that method to hang onto outlying endpoints.

Recall, however, that we've now fixed that problem. My new techniques of overstepping and backstepping (I call them “twostep” and “flip,” respectively) cause both end points to pull nicely together, just as bisection did. For that reason, we could conceivably use the criterion in Equation 1. Just to prove it, I modified the minimizer to use this simple criterion. I got convergence to one part in 108 using a mere four parabolic interpolations, and seven function evaluations (after finding the triplet). Remember, this includes the hip-hop stepping that burns a few extra function evaluations.

The key question facing us now is this: since we now have such a simple criterion for convergence, shall we use that, and that alone, or should we continue with more aggressive criteria? I'm tempted to say, the heck with it; given that x0 and x2 have been forced to almost coelesce, how can the minimum not be trapped between them? What more can we possibly hope for?

On the other hand, we've also come up with additional criteria that are way cool, and saved our collective buns before I hit on the hip-hop. I sort of hate to throw them away after working so hard to arrive at them. So I have no problem, at least for now, with adopting a criterion that combines more than one test. In past issues, we've seen cases where estimations of the minimum led to false indications that we'd converged (as, for example, when the parabolic fit suggests no change in x ). Because one can come up with so many pathological test cases, I have no problem using all the criteria I can dream up. Think of it as a case of belt and suspenders. In the code that follows, I think I've made reasonable choices as to how best to combine multiple convergence criteria.

The great race car designer Harry Miller built cars of such elegance and beauty that they not only won races, one made it to New York's Museum of Modern Art. His advice to would-be designers is a classic: “Simplify, and add lightness.” I love that. You don't remove weight. You add lightness.

If this minimizer were a racing car, it would most definitely be the heaviest, safest, and most bulletproof car ever built. At every turn, I've opted for safety rather than lightness. To make sure we had enough data to dodge false solutions, I chose to use five adjacent data points, instead of three. I'm using multiple halt criteria, instead of one simple one. I'm using estimates of y = f(x) , and not just of x . And I've developed estimation methods that we've never even used yet, in anger. They're still sitting in my source code, waiting to be invoked.

It may be fast, thanks to parabolic estimation, but as a race car, this minimizer is more reminiscent of an American LaFrance fire engine, than a Formula 1 car. Harry Miller would not be smiling.

But I'm content with what we've got. Fire engines almost never break down, and if they crash, it's more a matter of flattening whatever gets in their way. After I've got lots more miles on this thing, and I'm thoroughly convinced that it's safe, I may well decide to pare it down. That would be very much a case of removing weight, not adding lightness. But in this case, that's okay with me. Better safe and heavy, than light and broken.

Concerning relative errors

One last point of explanation: having established that we're going to use some criterion of the form:


(3)

We have to ask ourselves whether the measure should be absolute or relative in nature? Brent used a relative criterion, of the form:


(4)

Frankly, I think that's crazy. If the function to be minimized happens to have its minimum at x = 0 , Equation 4 blows up. Brent fixed this with a fudge factor added to x , but that solution is, well, a fudge factor. I much prefer to use a measure for the denominator of Equation 4 that can't possibly vanish.

Fortunately, we have an ideal candidate for such a measure. It's the initial search range, x2 – x0 . If we start with a search range of some value, be it 1.0 (as in my test case) or one million, we should be satisfied when the search process has narrowed down the minimum to within some fixed fraction of that range.

The above prose serves to explain my choice of code, of which I show the pertinent fragments below:

// global value for eps
static double eps;
// tolerance in x
static double xeps ;
// tolerance in y
static double yeps ;

in minimize():

eps = Eps;    // local to globalxeps = (x1 -x0 )*sqrt(eps);

This code converts an expected maximum relative error in y to an equivalent, absolute error in x . Why did I use Eps as a y-error rather than an x-error? Because it feels right. We're used to thinking of errors in terms of what the computer can do in its floating point representation. This error may vary from the order of 10-6 to 10-15 , depending on the accuracy of the floating-point arithmetic used. And we can't expect extreme accuracy for this problem, since the function curve is usually parabolic near convergence. This means it will have a shape given by:


(5)

in the vicinity of xmin . Therefore, the best relative accuracy we have a right to expect in x is the square root of that in y . I've found that I'm more comfortable thinking in terms of the expected error in y , and letting the computer do the mental conversion to the equivalent error in x .

What about y?

The relative error in y is a bit harder to compute, because we don't have such a well-defined range given to us. The range f(x2 ) to f(x0 ) is meaningless, since the two may very well be equal, as they are in my test case. (Gosh, I love that test case!) Even the range over the entire initial triplet might be misleading, since we may later find y-values much smaller than any of the initial three. This suggests that we should adjust the y -range and, therefore, the absolute error limit dynamically as new y -values turn up. That's the reason for the rest of my error-controlling code:

static double ymax = -1.0e20;// defining values for range
static double ymin = 1.0e20;
static double yeps ;
// tolerance in y

// Adjust y convergence values
double test_y(double (*f)(double), double x)
{
    double Y = f(x);
    if (Y <>min )
        ymin = Y;
    if (Y > ymax )
        ymax = Y;
    yeps = (ymax – ymin ) * eps;
    return Y;
}

We have one more useful secret in our bag of convergence tricks: the predicted value of y. Three of the methods (two parabolic interpolations, and one that simply fits two straight lines) not only produce an estimate of the value of xmin , but also the expected value of y(xmin ). Brent didn't compute or use that value. I have always thought we should.

We can define:


(6)

If the predicted and actual values of y agree, I can start to feel secure that we have a true minimum.

The convergence criteria

To summarize, here are some candidates for convergence criteria:

1. Range x0 ..x2 has shrunk below xeps

2. xmin has stopped changing

3. ymin is close to f(xmin )

In my usual belt-and-suspenders fashion, I propose that we use all three. But how best to combine them?

As I hinted earlier, I think Criterion 1 trumps all the others. If the two end points have pulled together within xeps , game over. We have the minimum trapped, and can simply declare victory. I'm therefore going to put that test in the main loop of minimize() . The other criteria may possibly short-circuit this test, ending the process sooner.

A nice side effect of Criterion 1 is that it works even after bisection steps. It does not depend on an estimate of ymin .

Criterion 2 is definitely unsafe, taken alone. It's too easy to conceive of cases where the process produces two successive values of xmin that are nearly equal, but still wrong. The same is true of Criterion 3, taken alone. However, Criterion 3 appears stronger to me than 2, since it suggests that our understanding of the function's behavior is not far from the truth.

If both criteria are met for, say, two successive tests (three iterations), I'll feel comfortable that the process really has converged.

We have one more issue to cover, which is what we should do about multiple methods of estimating the minimum. I have three in mind, and already coded, but so far I've only used one of them: the parabolic interpolation/estimation. Because of this, I'm going to defer that last question for the moment, and press on with a convergence criterion that fits this last method. I'll define a new function, has_converged() , which is shown in Listing 1.

Listing 1: Convergence test

int has_converged(double dx, double dy, double &old_dx, double &old_dy)
{
    int retval;

    // make the d's positive
    dx = abs(dx);
    dy = abs(dy);

    // evaluate the criterion    
    retval = (max(dx, old_dx) <>eps ) && max(dy, old_dy <>eps );

    // shift the deltas
    old_dx = dx;
    old_dy = dy;

    return retval;
}

Note that both the old and new d s are passed into the function, so the caller must provide storage for them. This allows the function to be called from more than one place.

In function para_fit() , the call is simple enough:

if (has_converged(dx, Y-Yguess,
        old_dx, old_dy))
    return DONE;

The variables old_dx and old_dy must be declared static, and initialized to some huge values:

static double old_dx = 1e20;
static double old_dy = 1e20;

Avoiding ill conditioning

You've seen me write ad infinitum about the tendency of Brent's method to produce triplets that are ill-conditioned, such that the middle point is very close to one of its neighbors. Brent himself recognized this problem, and used bisection to help keep it under control.

Last month, I also said that parabolic interpolation can and will predict no movement at all (that is, it will suggest xmin = x1 ), if f(x0 ) =f(x2 ) .

Either of these conditions can be fatal to the success of the method, but until now we've been running without brakes in this area, counting on probability to save us from disaster. It's now time to correct this problem and make the parabolic interpolation safe.

Fortunately, this is easy to do. For example, suppose the parabolic interpolation says we should use the existing x1 as the next iterant. We can simply refuse to do so, and instead force it to be some distance from x1 .

How far should it be? Brent required it to be only a distance xeps away. That's too close for my taste. I'm going to be a lot more conservative, and require it to be a distance equal to 1% of the gap between x0 and x1 . Note that this always shifts the estimate to the left, which is reasonable, since our search process always favors minima near the left-hand side of the range.

Similarly, if the interpolation predicts a minimum too close to x0 or x2 , I'm going to force the points to be closer to x1 by 1%.

This approach has a downside. It gets us back into what is effectively a bisection. It just happens to be a very lopsided bisection. That means we might not be converging as fast as we could be.

Suppose the range x0 ..x2 is 1.0, and the interpolation gives us a value:

xmin = x1 – 1.0*10-8
(7)

Using the new algorithms flip() and twostep() , we can reasonably expect that both endpoints will be pulled in, and we'll have a range of only 2.0*10-8 next time. That's superlinear convergence in its finest and purest form.

But suppose both endpoints are not pulled in. In that case, the intervals x0 ..x1 and x1 ..x2 can differ by a factor of 108 . That's a terrible situation, and is ill-conditioning at its worst. My solution, for better or worse, is simply not to allow such a configuration, and if that slows down the convergence, so be it.

Readers who've stuck this series out for a while may remember that, early on, I worried about cases where the value of f(xmin ) might be equal to the current value of y1 . This could cause us to lose the triplet condition. To avoid that situation, I introduced the concept of “wiggling” the new value xmin , moving it towards x0 if necessary, until it no longer gave an equal value.

Listing 2: Function to limit xmin value

double
wiggle(double X)
{
    #define R 0.01
    // limit X to safe values
    double left_limit = R * x[index] + (1-R) * x[index-1];
    double left_middle = (1-R) * x[index] + R * x[index-1];
    double right_middle = (1+R)* x[index] – R * x[index-1];
    double right_limit = (1-R) * x[index+1] + R * x[index];

    // force X inside limits
    X = max(X, left_limit);
    X = min(X, right_limit);

    // and not too close to x1
    if (X <=>
    X = min(X, left_middle);
    else
    X = max(X, right_middle);

    return X;
}

We no longer need that trick; the logic involved in inserting new values automatically assures that no equal values get stored. However, we now have to wiggle xmin for a new reason, and that's to avoid ill-conditioned triplets. So I'm introducing a new function, wiggle() , which is shown in Listing 2. This function assures that the value of xmin used is not too close to the existing values of x0 , x1 , or x2 . Calls to wiggle() must be placed in every function that computes a new value of x . For example, in para_fit() , I use:

X = (x[index-1] + x[index+1] – m/c)
    / 2.0;
X = wiggle(X);

For my usual test function, the modified routine converges the same as it did before. In fact, none of the parabolic interpolations hit the limits for this case, so function wiggle() really doesn't make a difference. However, the new code is safely free of possible ill-conditioning, so we can use it with some confidence.

Five-point functions

Recall that I made the decision, quite some time ago, to retain five adjacent points, rather than only three, in the x[] and y[] arrays. That was to afford me the possibility of using algorithms other than bisection and parabolic interpolation.

I've been carrying along two functions that operate on five points. The first is a second parabolic interpolation, only it uses wider-spaced points spanning the full five-point range.

I've called this function para_wide() , and it works exactly like para_fit except for the choice of which points to use.

The second function is called two_line() , and it works by fitting two straight lines to the outer pairs of points. It estimates a minimum at the intersection of those lines. This algorithm was specifically included to handle functions like abs(x) that don't have a parabolic shape near the minimum.

Both functions require a call to wiggle() . Note also that this call forces the estimated xmin to be between the three innermost points of x[ ] , not just somewhere in the range. However, in these cases there's more to the test than simply limiting xmin .

These two functions should be called only if we have what I call a “V,” in which all five points satisfy a “quintet” relation:


(8)

Apologies for the seeming new numbering on the points. This means that the triplet condition is also satisfied for the three innermost points. Although we're using the full five points to estimate xmin , that estimate had better be in the range of the inner triplet. If it's not, the estimate is surely bogus, and we're wasting our time using the result. Therefore, before calling wiggle() for these algorithms, we first qualify them to make sure they're worth our while.

To test for the “V” condition, I've generated a new test function, which is in Listing 3. The new estimation functions are shown in Listing 4.

Listing 3: Test for “V” condition

int
have_V(void)
{
    return ((npoints == 5)        &&
        (index = = 2)            &&
        (y[0] > y[1])            &&
        (y[1] > y[2])            &&
        (y[2] <>
        (y[3] <>
}

Listing 4: The five-point functions

// do a “wide” parabolic fit of the three points x0, x2, and x4
STATE para_wide(double (*f)(double), double &dx)
{
    assert(npoints == 5);
    assert(index == 2);
    double X, Y, Yguess;
    double dx0 = x[index] – x[index-2];
    double dx1 = x[index+2] – x[index];
    double span = dx0 + dx1;
    double m0 = (y[index] – y[index-2])/dx0;
    double m1 = (y[index+2] – y[index])/dx1;
    double m = (y[index+2] – y[index-1])/span;
    double c = (m1 – m0)/span;
    static double old_dx = 1e20;
    static double old_dy = 1e20;

    // compute the estimated minimum
    X = (x[index-2] + x[index+2] – m/c)/2.0;

    // test for reasonableness
    if ((X < x[index-1])="" &&="" (x=""> x[index+1]))
        return PARA_FIT;
    X = wiggle(X);dx = X – x[index];
    Yguess = y[index-2] – c*(X-x[index-2])*(X-x[index-2]);
    Y = test_y(f, X);
    insert(X, Y);

    // have we converged?
    if (has_converged(dx, Y-Yguess, old_dx, old_dy))
        return DONE;
    else if (have_V())
        return LINE_FIT;
    else
        return PARA_FIT;}

// Fit straight lines through the outer two pairs of points,
// and find the intersection. * return the results in X and Y

STATE line_fit(double (*f)(double), double &dx)
{
    assert(npoints == 5);
    double X, Yguess, Y;
    static double old_dx = 1e20;
    static double old_dy = 1e20;
    double m0 = (y[1] – y[0])/(x[1] – x[0]);
    double m3 = (y[4] – y[3])/(x[4] – x[3]);
    X = (m3*x[3] + m0*x[0] – (y[3]-y[0]))/(m3-m0);

    // test for reasonableness
    if ((X < x[index-1])="" &&="" (x=""> x[index+1]))
        return PARA_FIT;

    X = wiggle(X);
    dx = X – x[index];
    Yguess = y[0] + m0*(X-x[0]);
    Y = test_y(f, X);
    insert(X, Y);

    // have we converged?
    if (has_converged(dx, Y-Yguess, old_dx, old_dy))
        return DONE;
    else
        return PARA_FIT;
}

I should say something here about the progression of states from one to the other. As things stand now, a call to para_fit() is followed by a call to either flip() or twostep() . At the end of each of these functions, I now test for the “V” condition. If it exists, I call df on the next iteration. If the “V” condition still holds after that call, I call line_fit() next. That function then calls para_fit() again.

So, assuming that the “V” condition persists, the order of calls will always be:

para_fit
flip/twostep
para_wide
line_fit
para_fit
etc.

Is this the optimal sequence? Almost certainly not. With the new functions in place, we now perform 14 function evaluations, where before we used only nine. Adding the new, five-point functions doesn't save us any time, it makes things worse.

Recall, however, that my test function has the behavior of a parabolic function near its minimum. It is hardly surprising that pure parabolic interpolation works nicely for this case. Remember, the ultimate goal is to probe the function space using intelligent choices for x , trying to locate the minimum. How we arrive at these choices is not as important as whether they're useful.

My results using para_wide() and line_fit() seem to show that they are not of much value for my usual test case. This does not at all mean that they are of no value in general.

Where do we stand now?

Last month, we had a racing car that went very fast, but had no body, no brakes, and no seat belts. As of now, we have added the safety features, and have an algorithm that is quite bulletproof. It not only avoids bad situations like ill-conditioned triplets, it now uses all five saved data points to squeeze the maximum information out of the available data.

Does it work? Yes. Is it reliable? Yes. Does it use an intelligent set of convergent criteria? Yes. Does it converge for all test cases? Yes, for the seven test functions I've tried it on. Does it have all the bells and whistles that I originally intended to give it? Yes, it does.

Does it converge optimally fast? Alas, it does not. Think of this month's tests as the first race of a season. We still have some shakedown and tuning to do.

In particular, I don't think I'm currently making the best use of the five-point functions. I feel reasonably sure that they have something to offer us for the most ill-behaved functions, but I don't think I'm using them to greatest effect. I think the current state transitions, which pretty much call the functions in a fixed sequence, are almost certainly not optimal. Remember that the two-line function was specifically intended to help us nail functions that have sharp minima, like abs(x) or sqrt(abs(x)) . For those cases, it makes no sense at all to alternate between the two-line and parabolic interpolations.

What's required, then, is a more intelligent set of rules as to how to decide whether a given function is doing us any good, and when to stop using it if it's not. I'll be working on that logic for next month.

Jack W. Crenshaw is a senior software engineer at Spectrum-astro in the Gilbert, AZ, and a specialist in the application of advanced methods to the solution of real-world, embedded-systems problems. He is the author of Math Toolkit for Real-Time Programming, from CMP Books. He holds a Ph.D. in physics from Auburn University. Jack enjoys contact and can be reached via e-mail at .

Return to October 2001 Table of Contents

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.