This is the continuation of my article, “How I Test Software” (July/August, page 11). We have a lot to cover this time, so let's get right to it. I'll begin with some topics that don't fit neatly into the mainstream flow.
Test whose software?
As with my previous article, “How I Test Software”, I need to make clear that the chosen pronoun in the title is the right one. This is not a tutorial on software testing methodologies. I'm describing the way I test my software.
Does this mean that I only write software solo and for my own use? Not at all. I've worked on many large projects. But when I do, I try to carve out a piece that I can get my arms around and develop either alone or with one or two like-minded compadres. I've said before that I usually find myself working with folks that don't develop software the same way I do. I used to think I was all alone out here, but you readers disabused me of that notion. Even so, my chances of finding a whole project team who work the way I do are slim and none. Which is why I try to carve out a one-person-sized piece.
What's that you say? Your project is too complicated to be divided up? Have you never heard of modular programming? Information hiding? Refactoring? Show me a project whose software can't be divided up, and I'll show you a project that's doomed to failure.
The classical waterfall method has a box called “Code and Unit Test.” If it makes you feel better, you can put my kind of testing into that pigeonhole. But it's more than that, really. There's a good reason why I want to test my software, my way: my standards are higher than “theirs.”
Remember my discussion about quality? To some folks, the term “quality” has come to mean “just good enough to keep from getting sued.” I expect more than that from my software. I don't think it's enough that it works well enough for the company to get their fee. I think it should not only run, but give the right answer.
I'm funny that way. It doesn't always endear me to my managers, who sometimes just want to ship anything, whether it works or not. Sometimes I want to keep testing when they wish I wouldn't. But regardless of the quality of the rest of the software, I try very hard to make mine bulletproof. Call it pride if you like or even hubris. I'm still going to test thoroughly.
Oddly enough, the reason I spend so much time testing is, I'm lazy. I truly hate to debug; I hate to single-step; I hate to run hand checks. But there's something I hate even more, and that's having to come back again, later, and do it all over again. The reason I test so thoroughly is, I don't ever want to come back this way again.
Desk checking
All my old software engineering textbooks used to say that the most effective way to reduce software errors was desk checking. Personally, I've never found it very effective.
I assume you know the theory of desk checking. When a given programmer is working on a block of code, he may have been looking at the code so long, he doesn't see the obvious errors staring him in the face. Forest for the trees, and all that stuff. A fresh pair of eyes might see those errors that the programmer has become blind to.
Maybe it's so, but I find that the folks that I'd trust to desk check my code, and do it diligently, are much too much in demand to have time to do it. And what good is it to lean on a lesser light? If I have to spend all day explaining the code to him, what good is that?
But there's another, even more obvious reason to skip desk checking: it's become an anachronism. Desk checking may have been an effective strategy in one of those projects where computer time was so precious, no one was allowed to actually compile their code. But it makes no sense at all, today.
Face it, unless the checker is a savant who can solve complicated math equations in his head, he's not going to know whether the math is being done right or not. At best, he can only find syntax errors and typos. But today, there's a much more effective way to find those errors: let the compiler find them for you. When it does, you'll find yourself back in the editor with the cursor blinking at the location of the error. And it will do it really fast. With tools like that, who needs the fellow in the cubie next door?
Single-stepping
I have to admit it: I'm a single-stepping junkie. When I'm testing my software, I tend to spend a lot of time in the source-level debugger. I will single-step every single line of code. I will check the result of every single computation. If I have to come back later and test again, I will, many times, do the single-stepping thing all over again.
My pal Jim Adams says I'm silly to do that. If a certain statement did the arithmetic correctly the first time, he argues, it's not going to do it wrong later. He's willing to grant me the license to single step one time through the code, but never thereafter.
I suppose he's right, but I still like to do it anyway. It's sort of like compiling the null program. I know it's going to work (at least, it had better!), but I like to do it anyway, because it puts me in the frame of mind to expect success rather than failure. Not all issues associated with software development are cold, hard, rational facts. Some things revolve around one's mindset. When I single-step into the code, I find myself reassured that the software is still working, the CPU still remembers how to add two numbers, and the compiler hasn't suddenly started generating bad instructions.
Hey, remember: I've worked with embedded systems for a long time. I've seen CPUs that didn't execute their instructions right. I've seen some that did for a time, but then stopped doing it right. I've seen EPROM-based programs whose bits of 1's and 0's seemed to include a few ½'s.
I find that I work more effectively once I've verified that the universe didn't somehow become broken when I wasn't looking. If that takes a little extra time, so be it.
Testing with random inputs
This point is one dear to my heart. In the past, I've worked with folks who like to test their software by giving it random inputs. The idea is, you set the unit under test (UUT) up inside a giant loop and call it a few million times, with input numbers generated randomly (though perhaps limited to an acceptable range). If the software doesn't crash, they argue, it proves that it's correct.
No it doesn't. It only proves that if the software can run without crashing once, it can do it a million times. But did it get the right answer? How can you know? Unless you're going to try to tell me that your test driver also computes the expected output variables, and verifies them, you haven't proven anything. You've only wasted a lot of clock cycles.
Anyhow, why the randomness? Do you imagine that the correctness of your module on a given call depends on what inputs it had on the other 999,999 calls? If you want to test the UUT with a range of inputs, you can do that. But why not just do them in sequence? As in 0.001, 0.002, 0.003, etc.? Does anyone seriously believe that shuffling the order of the inputs is going to make the test more rigorous?
There's another reason it doesn't work. As most of us know, if a given software module is going to fail, it will fail at the boundaries. Some internal value is equal to zero, for example.
But zero is the one value you will never get from a random-number generator (RNG). Most such generators work by doing integer arithmetic that overflows, then capturing the lower-order bits. The power-residue method, for example, multiplies the last integer by some magic number and takes the lower-order bits of the result. Such an RNG will never produce an integer value of zero, because if the value ever did become zero, it would stay zero for the rest of the run.
Careful design of the RNG can eliminate this problem, but don't count on it. Try it yourself. Run the RNG in your favorite compiler and see how often you get a floating point value of 0.000000000000e000.
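Here's a minimal sketch of the power-residue scheme in Python, so you can see the zero-sticking behavior for yourself. The multiplier and modulus are the classic "minimal standard" values; they're my illustrative choices, not the constants from any particular compiler's library:

```python
# Minimal multiplicative congruential ("power residue") generator.
# The constants are the Lewis-Goodman-Miller "minimal standard"
# values, used here purely for illustration.
M = 2**31 - 1   # modulus (a prime)
A = 16807       # multiplier

def mcg(seed, n):
    """Yield n successive integer states of the generator."""
    x = seed
    for _ in range(n):
        x = (A * x) % M
        yield x

# Starting from a nonzero seed, the state never reaches zero...
print(0 in list(mcg(12345, 100000)))   # False

# ...and if it ever did, it would stay there for the rest of the run:
print(list(mcg(0, 5)))                 # [0, 0, 0, 0, 0]
```

Scaling those integers into the unit interval gives floating-point values that can get close to zero but never land on it exactly.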
End result? The one set of values most likely to cause trouble—that is, the values on the boundaries—is the one set you can never have.
If you're determined to test software over a range of inputs, there's a much better way to do that, which I'll show you in a moment.
On validation
All my career, I've had people tell me their software was “validated.” On rare occasions it actually was, but usually not. In my experience, the claim is often used to intimidate. It's like saying, “Who do you think you are, to question my software, when the entire U.S. Army has already signed off on it?”
Usually, the real situation is, the fellow showed his Army customer one of his outputs, and the customer said, “Yep, it looks good to me.” I've worked with many “validated” systems like that and found hundreds of egregious errors in them.
Even so, software absolutely can be validated and should be. But trust me, it's not easy.
A lot of my projects involve dynamic simulations. I'm currently developing a simulator for a trip to the Moon. How do I validate this program? I sure can't actually launch a spacecraft and verify that it goes where the program said it would.
There's only one obvious way to validate such software, and that's to compare its results with a different simulation, already known to be validated itself. Now, when I say compare the results, I mean really compare them. They should agree exactly, to within the limits of floating-point arithmetic. If they don't agree, we have to go digging to find out why not.
In the case of simulations, the reasons can be subtle. Like slightly different values for the mass of the Earth. Or its J2 and J4 gravity coefficients. Or the mass of its atmosphere. Or the mass ratio of the Earth/Moon system. You can't even begin to run your validation tests until you've made sure all those values are identical. But if your gold standard is someone else's proprietary simulation, good luck getting them to tell you the constants they use.
Even if all the constants are exactly equal and all the initial conditions are identical, it's still not always easy to tell that the simulations agree. It's not enough to just look at the end conditions. Two simulations only agree if their computed trajectory has the same state values, all the way from initial to final time.
Now, a dynamic simulation will typically output its state at selected time steps—selected by the numerical integrator way down inside, though, not by the user. If the two simulations are putting out their data at different time hacks, how can we be sure they agree?
Of course, anything is possible if you work at it hard enough, and this kind of validation can be done. But, as I said, it's not easy, and not for the faint-hearted.
Fortunately, there's an easier and more elegant way. The trick is to separate the numerical integration scheme from the rest of the simulation. You can test the numerical integrator on differential equations that have known solutions. If it reproduces those solutions, within the limits of floating-point arithmetic, it must be working properly.
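Here's the idea in miniature. The integrator below is a plain fixed-step fourth-order Runge-Kutta routine, standing in for whatever scheme your simulation really uses, and the test equation is dy/dt = −y, whose exact solution y(t) = e^−t we know in advance:

```python
import math

def rk4_step(f, t, y, h):
    """Advance y one step of size h with classical fourth-order
    Runge-Kutta."""
    k1 = f(t, y)
    k2 = f(t + h/2, y + h/2 * k1)
    k3 = f(t + h/2, y + h/2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h/6 * (k1 + 2*k2 + 2*k3 + k4)

f = lambda t, y: -y          # known equation, known solution e^(-t)
y, t, h = 1.0, 0.0, 0.01
while t < 1.0 - 1e-12:       # integrate from t = 0 to t = 1
    y = rk4_step(f, t, y, h)
    t += h

# If the integrator is healthy, y now matches e^(-1) to roughly
# the accuracy the method promises.
print(abs(y - math.exp(-1.0)) < 1e-9)   # True
```

If that check fails, the integrator is broken, and no amount of fiddling with the model will fix it.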
Now look at the model being simulated. Instead of comparing it to someone else's model, compare its physics with the real world. Make sure that, given a certain state of the system, you can calculate the derivatives of the states and verify that they're what the physics says they should be.
Now you're done, because if you've already verified the integrator, you can be sure that it will integrate the equations of motion properly and get you to the right end point.
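A miniature of that derivative check might look like this. The two-body gravity model and the value of Earth's gravitational parameter are my illustrative choices, not anything from a particular simulation:

```python
import math

MU = 398600.4418   # km^3/s^2, Earth's gravitational parameter

def derivatives(state):
    """Time derivative of the state (x, y, z, vx, vy, vz) under
    simple two-body gravity."""
    x, y, z, vx, vy, vz = state
    r = math.sqrt(x*x + y*y + z*z)
    a = -MU / r**3
    return (vx, vy, vz, a*x, a*y, a*z)

# Hand check against the physics: at radius r, the acceleration
# must be mu/r^2, directed back toward the center.
r0 = 7000.0                       # km
v0 = math.sqrt(MU / r0)           # circular-orbit speed
d = derivatives((r0, 0.0, 0.0, 0.0, v0, 0.0))
print(abs(d[3] + MU / r0**2) < 1e-12)   # True: ax = -mu/r^2
print(d[1] == v0)                       # True: position rate = velocity
```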
Is this true validation? I honestly don't know, but it works for me.
Hand checks
Now we get to the real meat of the issue, which is how to do hand checks. I've been telling you for some time now that I do a lot of testing, including testing every single line of code, usually as I write it. When I say testing, I mean give the code under test some inputs and verify that it gives the right outputs.
But where do those inputs come from? From a test case, of course.
Now, almost all of my software tends to be mathematical in nature—old-time readers know that. So my test cases tend to be numeric, and so are the expected results. But the nature of the software really doesn't matter. Regardless of the type of data, your code should still process it correctly.
Remember the Adams dictum for software testing:
Never ask the computer a question you don't already know the answer to.
Just picking inputs at random (that word again), looking at the results and saying “Yep, that looks right” won't get it done. You have to know it's right.
In the good old days, I used to generate the expected outputs by laborious hand calculation. Typically, the only aid I had was a pocket calculator or, worse yet, a Friden (www.oldcalculatormuseum.com/fridenstw.html). Sometimes, I even had to compute the expected results in octal, because I was dependent on core dumps to see the internal computations. It was not fun, but it had to be done.
Fortunately, today there's a much better approach. We can choose among math-oriented applications such as MATLAB, Mathcad, Mathematica, and Maple. I usually find myself using MATLAB or Mathcad, or both.
Long-time readers might remember my heated rants against Mathcad (now sold by PTC). I complained loudly and at length about how frustrating it was to use and how plagued it was with bugs.
I haven't changed my mind. But a funny thing happened. One day I realized that, despite my frustration with Mathcad, I use it almost daily. It's by far the most often-used application on my computer. The reason is simple: despite its faults and my complaints, it's still the best tool of its type, bar none. It's like the old joke: it's the only game in town.
Yes, I know, the other tools I mentioned tend to be much more powerful (and a whole lot more expensive) than Mathcad. But only Mathcad lets me both enter math equations in a WYSIWYG format and get the results back the same way. When I'm doing math analysis, either as a pure analysis or to generate test cases, I like to use the application to both do the calculations and document the process. With Mathcad, I can interleave the math equations, text, and results to get a worksheet that looks like a page out of someone's textbook.
And yes, I still curse Mathcad sometimes. It's a love-hate relationship. I hate the frustrations, but love the final results. Come to think of it, it's the same relationship I have with Microsoft Word or Windows.
Back to testing
So here's how it works. To test a given bunch of software, I'll start with a reasonable set of inputs. They may be “photo-realistic,” meaning that they're exactly the values I'd expect in the real world. But this isn't necessary; simpler inputs will do. If, for example, I use an initial radius of 6,000 km and the software operates on it properly, I feel reasonably satisfied that it will also work for 6,378.137 km. Arithmetic units are good at that sort of thing.
Anyhow, I'll start with those inputs and let Mathcad calculate the expected outputs. I can also make it show me the expected value of every single intermediate variable in the calculation. When I'm done with this step, I save the file, and now I have a detailed, permanent copy of the test case. I can revisit it whenever I like.
Now it's time to go to the UUT. I'll invoke it with the same inputs and do my single-stepping thing. Sitting in the source-level debugger, I can step down through the code, checking the result of every single computation against the value Mathcad got. When I'm done, I promise you, I'm just about 100% satisfied that my software is correct.
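If single-stepping isn't your style, the same checks can be automated. In this sketch the expected values play the role of the Mathcad worksheet: they're worked out independently for every intermediate, not just the final answer. The function and the numbers are made up for illustration:

```python
import math

def polar_to_cartesian(r, theta):
    """Convert polar coordinates to (x, y), returning every
    intermediate so each one can be checked."""
    c = math.cos(theta)
    s = math.sin(theta)
    x = r * c
    y = r * s
    return {"c": c, "s": s, "x": x, "y": y}

# Expected values for r = 6, theta = pi/6, worked out by hand:
# cos = sqrt(3)/2, sin = 1/2, x = 3*sqrt(3), y = 3.
expected = {"c": math.sqrt(3)/2, "s": 0.5,
            "x": 3*math.sqrt(3), "y": 3.0}

result = polar_to_cartesian(6.0, math.pi/6)
ok = all(abs(result[k] - expected[k]) < 1e-12 for k in expected)
print(ok)   # True: every intermediate agrees, not just the output
```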
Draw a graph
Well, make that 98%. How do I get the remaining 2%? By doing it lots more times, with lots of different sets of input values.
My next trick is one I only got into in the last few years, and I'm sharing it for the very first time.
This stunt is like the random-number thing, only done with more elegance and for a far more useful purpose. In general, a given bit of code won't have to perform for every input value from –∞ to +∞. There will be some range of input values we can expect. The random-number buffs got that part right.
Now suppose I give one of the parameters a value at or near one end of the range, and I get a result that agrees with the Mathcad model. Now let me give it a value near the other end of the range; I get a different result that also agrees with my test case. Then I can sprinkle some values in between and see how the outputs vary as I change the inputs. I can even graph them.
This is where the math aid earns its keep. Like its rivals, Mathcad will operate on a whole range of input values, as easily as it does for a single set. I can make the mesh as dense as I choose. And I can make it draw a graph of the way the outputs respond to changes in the inputs.
Now let me do the same thing with my UUT. I'll set up an input file (hardware folk would call it a test vector), and feed the entire set to the UUT. I can then store the resulting outputs in a file and plot them.
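Here's the whole sweep in miniature. The “UUT” is a truncated Taylor series for sin(x), standing in for real flight code; the “model” is the library sine, standing in for the Mathcad worksheet. Both choices are mine, for illustration:

```python
import math

def uut(x):
    # unit under test: a seventh-order Taylor series for sin(x)
    return x - x**3/6 + x**5/120 - x**7/5040

def model(x):
    # independently computed expected value
    return math.sin(x)

# Sweep the expected input range with a uniform mesh of test vectors.
lo, hi, n = 0.0, math.pi/2, 1001
mesh = [lo + (hi - lo)*i/(n - 1) for i in range(n)]
errors = [uut(x) - model(x) for x in mesh]

# The worst disagreement lands at the end of the range -- right at
# the boundary, which is exactly where random inputs never look.
worst = max(abs(e) for e in errors)
print(worst < 1e-3)                  # True
print(abs(errors[-1]) == worst)      # True
```

Plot `errors` against `mesh` and you get the graph I'm talking about: flat near zero, climbing toward the boundary.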
If the two sets of graphs agree, I'm happy. Now I'm 100% sure the code is good.
As you can tell, over the years I've learned some really neat things about how to test software. Now you know them too. Use them in good health.