Embedded Systems Programming Programmer's Toolbox
Oops! I Did It Again
Jack Crenshaw
Jack returns from holiday refreshed and ready to solve the problem of minimization once and for all.
If you're wondering where I was last month, I was taking a much-needed vacation. I used the time to go out to San Jose for the Embedded Systems Conference. Among other things, I did a book signing for my new book (shameless plug), Math Toolkit for Real-Time Programming.[1] I'll bet you can never guess where the material for it came from. A few readers dropped by to say hello and listened to Jean Labrosse and me try to talk them into buying our books. It was good to see you all.
I have a lot to say about the ESC, mainly that it was huge! If you've never attended one of these conferences, you really owe it to yourself to do so. Beg, plead, and cajole your boss to send you, or just pony up the money on your own. But get there! If you're like me, you'll find its atmosphere of excitement and change a cure for every caught-in-a-rut feeling you've ever had. Exciting things are going on.
I'll have a lot more to say about the conference and other things, such as my occasional Good Guys/Bad Guys list. But this month I want to dive right into the minimization problem. So I'll defer the rest, except for these two comments.
First, I had speculated in an earlier column that large companies nowadays practice “planned disgruntlement,” meaning that they optimize their tech support to save costs by intentionally alienating a certain percentage of customers. A reader (who I'll keep anonymous, along with the company he names, to avoid any repercussions) sent me the following e-mail, under the heading, “Suspicions confirmed”:
“Jack, I want to confirm your belief that companies do exactly what you think about QA on their products. During the '80s, I learned that none less than [Company X, a very large, very prestigious company, noted for its quality] formally used a way to reduce testing on new products even if it meant losing some customers. This was factored into their cost of doing business. This was documented in a paper in an obscure statistics proceedings conference paper in [deleted] during the mid '80s.
I didn't believe this when a [Company X] salesman told me. It took our tech library [Company Y] some time to find it.”
To this I will add an observation from a colleague, who noted that he seemed to be getting a lot of dead-on-arrival items from a large department store that has a reputation for replacing broken items, no questions asked. After some sleuthing, he concluded that the company had made this policy possible by skipping acceptance testing on their products. In effect, they're using their customers as their QA department. Interesting concept.
Second, a reader, Mike Mullen, was kind enough to send me his personal copy of Brent's book, Algorithms for Minimization Without Derivatives.[2] Was that generous or what? A cursory glance goes a long ways towards explaining why nobody ever seems to explain how Brent's method works. It's the same reason why everyone quotes, but very few people ever actually read, Kalman's paper on Kalman filtering, or Cooley's papers on Fast Fourier Transforms. Brent is a mathematician, and mathematicians, bless 'em, think in terms of theorems and lemmas. Such words tend to make my eyes cross in the same way that some of my stuff on matrices or quaternions does for you readers out there.
Even more to the point, and most astonishingly, even Brent doesn't explain Brent's algorithm! That's right, you read correctly. After going through all sorts of proofs of guaranteed superlinear performance and the effects of round-off, Brent just sort of plops his algorithm out there, with no explanation whatsoever. No wonder Press et al. didn't explain it![3] They had nothing from which to work.
This puts us all in a strange situation, it seems to me. The algorithm seems to have proven itself, and has been adopted by most of the scientific programming world as the algorithm of choice. Yet no one seems to be quite sure how it does what it does. The general notion of mixing golden section and quadratic fit is straightforward enough, but the details of such features as the halt criteria, or the logic to decide when to inject a golden section step, are retained as just so much witchcraft. I suppose it goes without saying that I don't consider such a situation acceptable.
Back to the grindstone
This problem seems to have no end to it. The deeper I get into it, the deeper I get into it, so to speak. I feel a little bit like Brer Rabbit, in the Uncle Remus story, fighting the Tarbaby. (If you have never heard of Brer Rabbit, the Tarbaby, or Uncle Remus, don't worry about it. The story comes from another generation, and it's no longer politically correct. It's still a great story, and those who missed it have been deprived of some precious and insightful lessons.)
Before each of the last few columns, I had hoped to wrap up the study in a blaze of glory, and present you with a totally finished, canned algorithm, à la Brent, that could solve any minimization problem we throw at it. This series must surely already be the longest running in a magazine like ESP. It's become more of a daytime soap than a collection of articles; it may never end. Each time I've thought myself close to an end, I've had to fall back and regroup as I learned more about the problem, and/or had new thoughts that caused me to rethink things I had previously thought were indisputable givens.
So this month I have some good news and some bad news. The bad news is that, no, you won't find the final solution in this month's column, either. In fact, that elusive, ambitious goal seems to be receding from me; I am not gaining on it-yet. The good news is that, as you might have imagined, I have been thinking about this problem quite a lot, and the more I think about it, the more good ideas I have. Each of them seems to take us further afield from Brent's method, though the general approach of mixing golden sections and parabolic fits remains. But I believe that this departure must be a good thing because the reason I began straying from Brent is that I had concerns about it. By not trying to emulate Brent, I've come up with a lot of new stuff I think you will find interesting and useful, and the end product should be something worth waiting for. Speaking for myself, I'm excited and ready to attack the problem with renewed vigor (vacations have a way of inspiring such feelings). For now, at least, I am not going to try to hurry the process, but rather concentrate on getting all of my ideas into some semblance of order.
Regarding efficiency
In my last column, Moving Right Along (October 2000, p. 17), I mentioned that software like Brent's is so difficult to follow because it's written with efficiency, rather than readability, uppermost in mind. These algorithms tend to get used in other algorithms, such as multivariate minimization problems, so they must be very, very fast. But at this point we don't want fast so much as understandable. So in that column and this one, I'm concentrating on writing an algorithm that's easily understandable. We can always optimize the implementation later, but the algorithm will not change.
Another point occurred to me as I was preparing for this article. I can hardly claim it is profound, but it's definitely worth mentioning: the art of software engineering has not stood still since Brent's day. Think about it. All the programs I've been able to find on minimization, or almost any other canned method for that matter, are written in the style of '70s Fortran programs. Some of them actually are old Fortran programs, translated into C or C++. One popular text, whose title I'll leave out this month (but you all know who I mean), even goes so far as to override the C indexing convention, so the authors can write their software with indices starting at 1 instead of 0. That approach must surely be close to the ultimate in lazy translation.
When I was writing the software for the last column, I collected the various parts of the algorithm into subroutines, with arguments passed in and out to control modularity. I also decided to let each subroutine inform me how successful it was, by returning a flag indicating success or failure. If, for example, a quadratic fit fails to give a reliable value, the function that performs the fit is in the best position to tell us whether it succeeded, and thereby trigger a golden ratio bisection. I ended up with functions that returned a Boolean variable to indicate success or failure. In the midst of this process, however, I found myself constrained a bit by having the function return only a Boolean. It would be a lot better if it could suggest the next step, so the driver program doesn't have to guess.
In preparation for this column, I asked myself, can one get a called function to suggest which way to go next? The answer is yes, using a state machine. This happens to be one of my favorite approaches for embedded software involving complex logic. The implementation of a state machine is the kind of thing that rarely gets done in Fortran, because the language-at least, early versions of it-didn't support the enumerated type that makes it all practical. But it's perfect for our uses. So, this month, in addition to seeing yet another attempt to develop a good, reliable, general-purpose function minimizer, you're also going to see a program architecture that solves a lot of problems in a lot of different areas. I've often used nested state machines in real-time systems; the switch statement that implements them doesn't cost much overhead, and neither does the single integer value returned. My experience has been that nested state machines can turn even the most hairy and convoluted exercise in logic and decision making into a clean, fast, and perfectly understandable solution. Here's hoping it will do the same in this case.
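To make the architecture concrete, here is a minimal C++ sketch (the names and the deliberately toy logic are mine, not the finished minimizer): each worker routine returns a suggested next action from an enumerated type, and the driver dispatches on that suggestion with a switch.

```cpp
#include <cassert>

// Possible "next step" suggestions a worker routine can return.
enum Action { TRY_PARABOLA, TRY_GOLDEN, CONVERGED };

// Toy stand-in for the quadratic-fit step: if the bracketing condition
// fails, it asks the driver for a golden-section step; otherwise it
// (unrealistically) declares victory.
Action quadraticStep(double y0, double y1, double y2) {
    bool bracketed = (y1 < y0) && (y1 < y2);
    return bracketed ? CONVERGED : TRY_GOLDEN;
}

// The driver is a state machine: it never guesses what to do next,
// it simply dispatches on the last routine's suggestion.
int driver(double y0, double y1, double y2) {
    int goldenCount = 0;
    Action next = TRY_PARABOLA;
    while (next != CONVERGED) {
        switch (next) {
        case TRY_PARABOLA:
            next = quadraticStep(y0, y1, y2);
            break;
        case TRY_GOLDEN:
            ++goldenCount;                      // a real driver would bisect here
            y1 = (y0 < y2 ? y0 : y2) - 1.0;     // pretend the bisection found a low point
            next = TRY_PARABOLA;
            break;
        default:
            break;
        }
    }
    return goldenCount;   // how many golden-section steps were "taken"
}
```

The single enumerated return value costs essentially nothing at run time, which is why the same pattern serves so well in real-time code.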
In general, another thing that makes most algorithms so inscrutable is the desire to minimize storage, especially global variables, and data motion. In this month's column, you'll see some radical departures in style. At this point, I'm interested in keeping the count of function evaluations low-that's the part that is potentially expensive. But I don't really care if I'm using one more memory location than I need to, or if I'm moving data around too much. The performance impact of such issues is typically small compared to the cost of a function evaluation, and I can save making the algorithm optimally efficient (and therefore suitably inscrutable) for another day.
Random thoughts
This month, I'm also going to concentrate on some nagging thoughts that have been haunting me for some time. My general feelings are:
We're not using the data from previous steps as effectively as we could
We need more robust criteria for deciding when to quit
Relative to both points, I've noted before that Brent's method compares only previous values of the predicted x-value of the minimum, not the y-value. We've also seen that two applications of the quadratic fit can sometimes predict exactly the same x-value, even though the y-values are radically different (and both values are nowhere close to the minimum). It seems to me that we are not squeezing enough information out of the available data. One way to do so would be to track the y-values as well as x-values.
Also, consider the process involved in two golden-section bisections. We start with three points, which define two intervals, and bisect (evenly or based on the golden section) one of the intervals. Then, based upon the results of the first bisection, we throw away one of those points, and bisect again. After all is said and done, we have two new points, giving us a total of five. Yet only three-the lowest point, and its two neighbors-are passed on to the quadratic fit.
Can we get more information by using all five points? I think so. In past issues, I've suggested that if we consider four points at a time instead of three, we can perform two parabolic fits, each using the two endpoints plus one of the two interior points. The golden section generates that fourth point by bisecting one of the two intervals. But if we retain the last five points close to the minimum, we can fit two parabolas, each of which passes through the current best point. That approach must surely be superior to the one previously suggested.
I also noted last time that Brent's method uses a relative, rather than absolute, error criterion in x. On this point, I'm absolutely adamant. I don't care how popular the method is, this choice is simply No Good, because x may be zero, and dividing by zero is not a good idea. Implementations of Brent's algorithm all use additive deltas to prevent divide-by-zero exceptions. In my not-so-humble opinion, this approach is a band-aid that hides a fundamental problem. I'll have more to say about this in a moment.
If you were paying careful attention, you may have noticed that I previously wrote that we begin the process using two golden section steps, bisecting once on the left and once on the right. But then the thought occurred to me: why do we do this first bisection when we begin the solution sequence with three points chosen to fit the condition:

y1 < y0 and y1 < y2, where x0 < x1 < x2   (1)
These points are perfectly good points for at least attempting a quadratic fit. So why bother bisecting? In retrospect, we shouldn't have to do so unless the solution given by the fit is somehow unsatisfactory (for example, the value xmin is identical to x1). I must admit to you that I'm still in the process of adjusting my attitude for this one. I've been assuming for so long that we begin with the golden section that I'm going to have to adapt to this new notion. Most of what I present later in this article still assumes the golden section as a starter. Nevertheless, I believe that in the long term we don't need to do it this way.
One last thought: in my last column, I said that I'd forego Brent's algorithm for a time, to give you guys time to try out the method I presented. But y'all let me down. Though I've gotten a lot of e-mail concerning my column (like Mike Mullen's incredible offer of a copy of Brent), not one of you seems to have actually tried the algorithm and tried to break it. Sigh. I guess I'm going it alone here, so I might as well get on with it.
Some parabolic math
It's been a while since we specifically discussed the math of parabolic fits. What's more, it can be quite tricky, and you will get slightly different results depending on the form you assume for f(x). To make matters worse, I've used different forms in past articles, so I think it's best if we summarize the process of fitting a parabola to three data points here. We start by assuming that we have three data points:
P0 = [x0, y0], P1 = [x1, y1], P2 = [x2, y2]
Next, we assume a quadratic equation using Newton's form:

f(x) = a + b(x - x0) + c(x - x0)(x - x2)   (2)
Note carefully the choice of x indices used. This is very important. Next, we force the function to fit at these three points. This allows us to solve for the three coefficients a, b, and c. The equations are:

y0 = a   (2a)
y1 = a + b(x1 - x0) + c(x1 - x0)(x1 - x2)   (2b)
y2 = a + b(x2 - x0)   (2c)

Simplifying gives:

a = y0   (3a)
y1 - y0 = b(x1 - x0) + c(x1 - x0)(x1 - x2)   (3b)
y2 - y0 = b(x2 - x0)   (3c)
The first equation clearly gives the value of a, directly. The third one gives us b:

b = (y2 - y0)/(x2 - x0)   (4)
Note that this parameter has the form of the slope, m, of a straight line. In this case, the line connects points P0 and P2, the two end points of the range. We are now left to solve the second equation for c. We have:

(y1 - y0)/(x1 - x0) = b + c(x1 - x2)   (5)
Again, the left side has the form of a slope, this time of the line between P0 and P1. We can properly define this as m0. Then we get:

c = (m0 - b)/(x1 - x2)   (6)
It is not at all obvious from the form, but this is the same thing as the “slope of slopes”:

c = (m1 - m0)/(x2 - x0)   (7)
where m1 is a slope just like m0, but from P1 to P2. The easiest way to prove the equivalence is to expand both expressions out and compare terms. Using the slopes gives us an easy way to compute b and c. Define:

m = (y2 - y0)/(x2 - x0)
m0 = (y1 - y0)/(x1 - x0)   (8)
m1 = (y2 - y1)/(x2 - x1)
Then b = m, and c is the slope of slopes given in Equation 7. Our functional form can then be written:

f(x) = y0 + m(x - x0) + c(x - x0)(x - x2)   (9)
Recall that at the minimum the slope of this function must be zero. Differentiating Equation 9 gives:

f'(x) = m + c[(x - x0) + (x - x2)] = 0   (9a)

or:

xmin = (x0 + x2)/2 - m/(2c)   (10)
It's worth noting that c cannot be zero, since it's the difference of two slopes that are necessarily non-zero and of opposite signs. That condition is a necessary consequence of Equation 1. The slope m, on the other hand, can be zero, and will be if y0 = y2. This is the case when we start out with a balanced geometry. In this case, xmin is simply the average of x0 and x2, and is independent of the values of y = f(x). Substituting the value for xmin into our function gives the predicted minimum value of f(x). It is:

ymin = y0 + m(xmin - x0) + c(xmin - x0)(xmin - x2)   (11)
Simplifying, with d = (x2 - x0)/2 for the half-width of the interval, gives:

ymin = y0 + m(d - m/(2c)) + c(m²/(4c²) - d²)   (11a)

or:

ymin = y0 + m d - c d² - m²/(4c)   (12)
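The whole fit reduces to a few lines of code. Here is a direct transcription of Equations 8, 10, and 12 into C++ (a self-contained sketch, not the production routine):

```cpp
#include <cmath>

struct Fit { double xmin, ymin; };

// Fit a parabola through (x0,y0), (x1,y1), (x2,y2) and predict its minimum.
Fit parabolicFit(double x0, double y0, double x1, double y1,
                 double x2, double y2) {
    double m  = (y2 - y0) / (x2 - x0);        // slope P0 -> P2 (Equation 8)
    double m0 = (y1 - y0) / (x1 - x0);        // slope P0 -> P1
    double m1 = (y2 - y1) / (x2 - x1);        // slope P1 -> P2
    double c  = (m1 - m0) / (x2 - x0);        // slope of slopes (Equation 7)
    double d  = 0.5 * (x2 - x0);              // half-width of the interval
    Fit r;
    r.xmin = 0.5 * (x0 + x2) - m / (2.0 * c);             // Equation 10
    r.ymin = y0 + m * d - c * d * d - m * m / (4.0 * c);  // Equation 12
    return r;
}
```

Note that the bracketing condition of Equation 1 guarantees c > 0, so the divisions by c are safe.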
About error criteria
A moment ago I said that I felt Brent's choice of a relative error criterion in x was unfortunate, because it may lead to division by zero. Only the addition of an arbitrary constant prevents numerical disaster.
A far better approach is to use an absolute criterion in the first place. Fortunately, we are in a perfect position to take this approach because the original search range is given going in. If we can reduce the error in x to, say, one part in a million of the original range, I'd have to say we've pinned down the minimum pretty doggone well. Remember, because the shape of the function near its minimum is almost always parabolic when looked at with enough “zoom” applied, the trough of the minimum gets more and more shallow as we get closer to it. Therefore, we can't expect this algorithm (or any other) to give the kind of precision we're used to seeing in double-precision computations. The best we can hope for is roughly the square root of the machine precision.
On the other hand, we can expect the y-value to be accurate to machine precision, and I'm intending to use both. Again, I intend to use an absolute measure (or, to be more precise, a measure that is relative, as measured by the full range as opposed to the change from step to step). The only problem is, the full range of values of x is known going in, but those of y = f(x) are not. Nevertheless, we can determine the full range simply by recording the largest and smallest values of y found so far. Based on these notions, I'm going to define the following error criteria for convergence. Let e be some input value giving the precision desired, as a precision relative to the maximum range. Then define:
x_range = x2 - x0   (13)
where the x2 and x0 referred to here are the original values, that is, they define the overall range within which we search for a minimum. Likewise, let:
y_range = big_y - little_y   (14)
where the values of big_y and little_y are dynamic. The parameter big_y will, of necessity, be either the original y0 or y2, depending on which is larger. The starting value for little_y will be the first value of y1, the result after we get into the configuration where Equation 1 holds. As the search proceeds, we will hopefully find points whose heights are lower than the current minimum; that is, we are sliding downhill to the true minimum. As this happens, little_y can only get smaller, and as a result, y_range will increase.
Given these two ranges, I can now define what I consider to be a robust criterion for convergence:

|Δx| ≤ e·x_range and |Δy| ≤ e·y_range   (15)
Here, I've been necessarily vague as to what Δx and Δy are. In general, they're the differences between two estimates of the minimum values, but I can't be more specific at this time. We might define Δx, for example, to be the difference between the values given by the two parabolic fits of a five-point fit. We also might define it to be the difference between the results of two successive passes or, as Brent does, we might define it to be the difference between estimates two passes apart. Those are details that remain to be worked out. Either way, Equation 15 defines the condition that we are ready to accept the estimate as a converged solution. Also, either way, both of the conditions of Equation 15 must be satisfied.

Recall that in the golden section method we stop when the points bracketing the minimum (x0 and x2, in our current notation) are shrunk together to the necessary degree of accuracy. I am not sure yet that this is a reasonable requirement on the parabolic solution. I have the feeling that we cannot reasonably achieve this condition without using golden sections. That is why both Brent and I lean towards comparing successive estimates, rather than the bracketing values.
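In code, the bookkeeping and the two-part test might look like this (the struct and member names are mine; what the deltas are differences of is deliberately left to the caller):

```cpp
#include <cmath>

// Tracks the fixed x range and the dynamic y range, and applies the
// two-part convergence criterion described above.
struct Ranges {
    double x_range;           // x2 - x0 of the ORIGINAL search range
    double big_y;             // largest f value seen so far
    double little_y;          // smallest f value seen so far

    void record(double y) {   // call after every function evaluation
        if (y > big_y)    big_y = y;
        if (y < little_y) little_y = y;
    }

    bool converged(double dx, double dy, double eps) const {
        double y_range = big_y - little_y;
        return std::fabs(dx) <= eps * x_range &&
               std::fabs(dy) <= eps * y_range;   // BOTH conditions must hold
    }
};
```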
The test functions
Until now, I've been mostly using a single function for my testing. This is the now familiar one:
On occasion, I've also used a variation:
I use this variation because g(x) has a maximum before its minimum. As we will see in a moment, it's important that our algorithm work even when the function increases for a time before decreasing-that's why g(x) has value for us.
It should go without saying that one or two functions are not enough to prove the algorithm. As I get ever closer to the completed algorithm, I want to be quite sure that it works for functions that might seem, in some way, to be “pathological.”
The loosest requirement on the function to be minimized is that it be continuous. If it is not-if its value jumps around between values of x that are arbitrarily close-all bets are off, and we have nothing left to work with but random probings. If we are to have any hope of searching for a solution in the usual sense of the term, we must require that the function be continuous. The functions f(x) and g(x), however, also have continuous first derivatives (in fact, all their derivatives are continuous). That is why their shape in the vicinity of the minimum must look like a parabola.
It should go without saying that an algorithm that assumes a function shaped like a parabola works very well when the function is a parabola. In fact, such a method should converge right away, and if it doesn't, we need to fall back and regroup. On the other hand, many functions in nature don't have continuous first derivatives, and therefore don't look like parabolas, no matter what power of zoom lens we apply. An obvious example is:

f(x) = |x|   (18)
Any decent general-purpose algorithm should find the minimum at x = 0. Not surprisingly, Brent's method, being based on successive quadratic fits, works well when the function looks like a parabola, and pretty badly with a function such as the one given in Equation 18. It ends up falling back on successive bisections, not the parabolic fit.
For the remainder of this series, I'm adopting an even more stringent function:

f(x) = √|x|   (19)
The square root function causes the shape to be the opposite of a parabola, in a sense. The closer one gets to the minimum, the steeper the slope of the function, and the more a sharp edge forms at the minimum.
Does this mean that we don't need the simple absolute value? Not at all. This function is unique in that it has segments that are straight lines on either side of the minimum. A smart minimizer should recognize this fact, and exploit it. Brent's method doesn't. Fortunately, we are in position to achieve fast convergence for this case.
Suppose, as I suggested earlier, we retain (at least) five successive data points, with the middle one being the best minimum to date. This means that we have two points on either side of the minimum. And that means we can fit straight lines through the two points on either side of the middle one. If the function is indeed an absolute value, we should find the solution immediately. Call the five points P0 through P4, with P2 the current best, and consider the two straight line segments:

y = y1 + m1(x - x1)
y = y3 + m2(x - x3)   (20)

where m1 and m2 are the usual slopes:

m1 = (y1 - y0)/(x1 - x0)
m2 = (y4 - y3)/(x4 - x3)   (21)
Where these two lines cross, their values are equal. Hence, at this point:

y1 + m1(x - x1) = y3 + m2(x - x3)   (22)

Solving for x gives:

x = (y3 - y1 + m1·x1 - m2·x3)/(m1 - m2)   (23)
This approach gives us yet another estimate of the minimum. That estimate is clearly exact for the case of the absolute value function. It is also exact for the true parabola when the points are evenly spaced, left and right of the actual minimum. Thus, the formula can serve as an additional way of locating the minimum, and it has a reasonable chance of giving us a value as good as the parabolic fit. It can't hurt to try, and it definitely needs to be there for functions of the absolute value kind. For the record, I've been experimenting with this approach, and I find that the values it suggests are not at all unreasonable, and sometimes quite good.
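Here is a sketch of that estimate in C++, using my own labeling of the five retained points as P0..P4 with P2 the current best: the left line runs through P0 and P1, the right line through P3 and P4, and the estimate is where they cross.

```cpp
#include <cmath>

struct Pt { double x, y; };

// Intersect the line through p0,p1 with the line through p3,p4 and
// return the x of the crossing -- the straight-line minimum estimate.
double lineCrossing(Pt p0, Pt p1, Pt p3, Pt p4) {
    double m1 = (p1.y - p0.y) / (p1.x - p0.x);   // left-hand slope
    double m2 = (p4.y - p3.y) / (p4.x - p3.x);   // right-hand slope
    // p1.y + m1*(x - p1.x) = p3.y + m2*(x - p3.x), solved for x:
    return (p3.y - p1.y + m1 * p1.x - m2 * p3.x) / (m1 - m2);
}
```

For f(x) = |x| the two slopes are -1 and +1, and the crossing lands exactly on the sharp minimum at x = 0.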
I should address the issue of efficiency here, because you might be wondering if all these extra computations don't slow things down tremendously. In general, we must assume they do not. That's because we have no idea what f(x) is really like, or how it's generated. However, we tend to assume the worst and suppose that evaluating this function is expensive. If that's true, the extra math associated with, say, the intersection of line segments, is trivial.
Of course, we might also have the opposite extreme, where the function f(x) is exceedingly simple (like abs(x)). In that case, extra computations clearly cost CPU cycles, but those cycles still may not amount to much when compared to the overhead of executing a function call. Bottom line: at this point I'm not very concerned with computations per cycle, only the number of function calls.
Have we exhausted the kinds of functions needed for thorough testing? Actually, no. Sometimes the simplest cases are the hardest to handle, and in this case a simple function might be a straight line, or any other curve, that increases or decreases monotonically from one end of the range to another. Do either of these “curves” have a minimum? You bet they do. If the function is increasing, the minimum occurs at the original x0. If decreasing, it's at x2. Our function should handle such cases. Brent's method doesn't.
A special case of both these functions is the even simpler case, f(x) = k, a constant. Here, no minimum exists at all. Nevertheless, the function should return a value that's reasonable. I'd consider any of the values x0, x2, or (x0+x2)/2 to be reasonable. At this point I've defined a fairly extensive set of test functions, all of which our final algorithm should handle. They are summarized in Table 1.
Table 1: The Test Functions
Getting started
One thing we haven't talked much about lately is how the minimizer gets started in the first place. In our tests, we always assume that we have already found three successive points x0, x1, and x2 in the vicinity of a minimum, such that the starting condition of Equation 1 is satisfied. If the function f(x) is continuous, this condition guarantees that at least one minimum exists in the interval x0..x2. But at the beginning, we typically don't know the value of x1 such that Equation 1 is satisfied. Usually, we only know (or suspect) that a minimum exists somewhere between x0 and x2. It's worth reminding ourselves that the perfect minimizer is going to need a way to get started.
Early in this study (which feels like it began somewhere just after the Civil War) I developed a little utility that found a minimum by walking its way through the range in fixed steps. The code is shown in Listing 1.
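Listing 1 did not survive the conversion to this page, so here is a minimal reconstruction in the spirit of that description (the names and the clamping at the ends are my own choices): walk the range in fixed steps, remember the lowest point, and return it with its two neighbors as the starting bracket.

```cpp
#include <cmath>

// Walk [a, b] in nSteps fixed steps and return a bracketing triple
// x0 < x1 < x2 around the smallest sampled value. If the function is
// monotone, the triple collapses against one end of the range, which is
// consistent with the endpoint-minimum cases discussed earlier.
template <typename F>
void bracketMin(F f, double a, double b, int nSteps,
                double& x0, double& x1, double& x2) {
    double h = (b - a) / nSteps;
    double bestX = a, bestY = f(a);
    for (int i = 1; i <= nSteps; ++i) {
        double x = a + i * h;
        double y = f(x);
        if (y < bestY) { bestY = y; bestX = x; }
    }
    x0 = (bestX - h < a) ? a : bestX - h;   // clamp neighbors to the range
    x1 = bestX;
    x2 = (bestX + h > b) ? b : bestX + h;
}
```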