Test is the last step in traditional software development. We gather requirements, do high-level design and detailed design, create code, do some unit testing, then integrate and, finally, start final test.
Since most projects run late, what do you think gets cut? Test, of course. The result is that we deliver bug-ridden products that infuriate our customers and drive them to competitive products.
Best-practice development includes code inspections. Yet inspections typically find only 70% of a system’s bugs, so a fabulous test regime is absolutely essential. Test is like a double-entry bookkeeping system that ensures mistakes don’t leak into the deployed product.
In every other kind of engineering, testing is considered fundamental. In the USA, every federally funded bridge must undergo extensive wind tunnel tests, for instance.
Mechanical engineers subject spacecraft to an almost bizarre series of evaluations. It’s quite a sight to see a 15-foot-high prototype being nearly torn to pieces on a shaker, which vibrates at a rate that puts a thousand-hertz tone into the air. The bridge prototype, like the shaken spacecraft prototype, is discarded at great expense, but in both cases that cost is recognized as a key ingredient of proper engineering practice.
Yet in the software world test is the ugly stepchild. No one likes to do it. Time spent writing tests feels wasted, despite the fact that test is a critical part of all engineering disciplines. Many segments of the embedded systems design community have thankfully embraced test as a core part of their processes, and advocate creating tests synchronously with writing the code, realizing that leaving such a critical step till the end of the project is folly.
Application versus embedded testing. Embedded systems software testing has much in common with application software testing. Thus, much of this two-part article is a summary of basic testing concepts and terminology. However, some important differences exist between application testing and embedded systems testing. Embedded developers often have access to hardware-based test tools that are generally not used in application development.
Also, embedded systems often have unique characteristics that should be reflected in the test plan. These differences tend to give embedded systems testing its own distinctive flavor. This article covers the basics of testing and test case development and points out details unique to embedded systems work along the way.
Before you begin designing tests, it’s important to have a clear understanding of why you are testing. This understanding influences which tests you stress and (more importantly) how early you begin testing. In general, you test for four reasons:
• To find bugs in software (testing is the only way to do this)
• To reduce risk to both users and the company
• To reduce development and maintenance costs
• To improve performance
To Find the Bugs. One of the earliest important results from theoretical computer science is a proof (known as the Halting Theorem) that it’s impossible to prove that an arbitrary program is correct.
To Reduce Costs. The classic argument for testing comes from Quality Wars by Jeremy Main. In 1990, HP sampled the cost of errors in software development during the year. The answer, $400 million, shocked HP into a completely new effort to eliminate mistakes in writing software.
The $400M waste, half of it spent in the labs on rework and half in the field to fix the mistakes that escaped from the labs, amounted to one-third of the company’s total R&D budget and could have increased earnings by almost 67%.
The earlier a bug is found, the less expensive it is to fix. The cost of finding errors and bugs in a released product is significantly higher than during unit testing, for example (Figure 2-1 below).
Figure 2-1: The Cost to Fix a Problem. Simplified graph showing the cost to fix a problem as a function of the time in the product life cycle when the defect is found. The costs associated with finding and fixing the Y2K problem in embedded systems are a close approximation to an infinite cost model.
To Improve Performance. Testing maximizes the performance of the system. Finding and eliminating dead code and inefficient code can help ensure that the software uses the full potential of the hardware and thus avoids the dreaded “hardware re-spin.”
When and How to Test?
It should be clear from Figure 2-1 that testing should begin as soon as feasible. Usually, the earliest tests are module or unit tests conducted by the original developer.
Unfortunately, few developers know enough about testing to build a thorough set of test cases. Because carefully developed test cases are usually not employed until integration testing, many bugs that could be found during unit testing are not discovered until integration testing.
For example, a major network equipment manufacturer in Silicon Valley did a study to figure out the key sources of its software integration problems. The manufacturer discovered that 70 percent of the bugs found during the integration phase of the project were generated by code that had never been exercised before that phase of the project.
Unit Testing. Individual developers test at the module level by writing stub code to substitute for the rest of the system hardware and software. At this point in the development cycle, the tests focus on the logical performance of the code.
Typically, developers test with some average values, some high or low values, and some out-of-range values (to exercise the code’s exception processing functionality). Unfortunately, these “black-box” derived test cases are seldom adequate to exercise more than a fraction of the total code in the module.
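As a minimal sketch of this kind of module-level test, the Python fragment below exercises a hypothetical routine, `scale_adc_to_celsius`, through a stub that stands in for the hardware read. The function names, the 10-bit ADC range, and the linear scaling are all assumptions made for illustration; Python is used only as a convenient host-side scripting language.

```python
# Hypothetical module under test: converts a raw 10-bit ADC count to Celsius.
# The scaling (0..1023 counts -> -40..+85 C) is an assumption for illustration.
ADC_MAX = 1023


def scale_adc_to_celsius(read_adc):
    """Read one sample via read_adc() and convert it to degrees Celsius."""
    raw = read_adc()
    if not 0 <= raw <= ADC_MAX:
        raise ValueError(f"ADC count out of range: {raw}")
    return -40.0 + raw * 125.0 / ADC_MAX


def make_adc_stub(value):
    """Stub standing in for the real hardware read; always returns `value`."""
    return lambda: value


def run_unit_tests():
    """Return the names of failed test cases (an empty list means all passed)."""
    failures = []
    # An average value, a low value, and a high value with expected results.
    in_range_cases = [("mid", 512, 22.56), ("low", 0, -40.0), ("high", ADC_MAX, 85.0)]
    for name, raw, expected in in_range_cases:
        actual = scale_adc_to_celsius(make_adc_stub(raw))
        if abs(actual - expected) > 0.01:
            failures.append(name)
    # Out-of-range values must exercise the exception-processing path.
    for name, raw in [("below", -1), ("above", ADC_MAX + 1)]:
        try:
            scale_adc_to_celsius(make_adc_stub(raw))
            failures.append(name)  # no exception was raised, so the test fails
        except ValueError:
            pass
    return failures
```

Passing the read function in as a parameter is one simple way to let a stub replace the hardware; a real project might instead link against a stub driver.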
Regression Testing. It isn’t enough to pass a test once. Every time the program is modified, it should be retested to assure that the changes didn’t unintentionally “break” some unrelated behavior.
Called regression testing, these tests are usually automated through a test script. For example, if you design a set of 100 input/output (I/O) tests, the regression test script would automatically execute the 100 tests and compare the output against a “gold standard” output suite. Every time a change is made to any part of the code, the full regression suite runs on the modified code base to ensure that something else wasn’t broken in the process.
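A regression script of this kind can be very small. The sketch below assumes a hypothetical routine, `saturate_u8`, and a gold standard of 100 input/output pairs recorded from a known-good version; the names and the deliberately broken "modified" version are illustrative only.

```python
def run_regression(function_under_test, gold_standard):
    """Run every recorded input and report the ones whose output changed.

    gold_standard maps each test input to the output captured from a
    known-good build (the "gold standard" suite).
    """
    mismatches = []
    for test_input, expected in gold_standard.items():
        actual = function_under_test(test_input)
        if actual != expected:
            mismatches.append((test_input, expected, actual))
    return mismatches


def saturate_u8(value):
    """Hypothetical known-good routine: clamp a value into the range 0..255."""
    return max(0, min(255, value))


def saturate_u8_modified(value):
    """The same routine after a 'minor change' that breaks the upper limit."""
    return max(0, min(256, value))


# Gold standard: 100 I/O pairs recorded from the known-good version.
GOLD = {x: saturate_u8(x) for x in range(-100, 400, 5)}

if __name__ == "__main__":
    for test_input, expected, actual in run_regression(saturate_u8_modified, GOLD):
        print(f"input {test_input}: expected {expected}, got {actual}")
```

Running the full suite after every change is what catches the off-by-one here: the modified routine still passes for every input at or below 255.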
(Memo from the trenches: I try to convince my students to apply regression testing to their course projects; however, because they are students, they never listen to me. I’ve had more than a few projects turned in that didn’t work because the student made a minor change at 4:00AM on the day it was due, and the project suddenly unraveled. But, hey, what do I know?)
Because no practical set of tests can prove a program correct, the key issue becomes what subset of tests has the highest probability of detecting the most errors, as noted in The Art of Software Testing by Glenford Myers. The problem of selecting appropriate test cases is known as test case design.
Although dozens of strategies exist for generating test cases, they tend to fall into two fundamentally different approaches: functional testing and coverage testing.
Functional testing (also known as black-box testing) selects tests that assess how well the implementation meets the requirements specification. Coverage testing (also known as white-box testing) selects cases that cause certain portions of the code to be executed. (These two strategies are discussed in more detail later.)
Both kinds of testing are necessary to rigorously test your embedded design. Of the two, coverage testing implies that your code is stable, so it is reserved for testing a completed or nearly completed product. Functional tests, on the other hand, can be written in parallel with the requirements documents.
In fact, by starting with the functional tests, you can minimize any duplication of efforts and rewriting of tests. Thus, in my opinion, functional tests come first.
Everyone agrees that functional tests can be written first, but Ross, for example, clearly believes they are most useful during system integration, not unit testing. The following is a simple algorithm for integrating your functional and coverage testing strategies:
1. Identify which of the functions have NOT been fully covered by the functional tests.
2. Identify which sections of each function have not been executed.
3. Identify which additional coverage tests are required.
4. Run new additional tests.
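Steps 1 and 2 are mechanical once coverage data from the functional test run is available. The sketch below assumes that data arrives as two dictionaries keyed by function name, one listing executable line numbers and one listing the lines actually executed; the function names and line numbers are invented for illustration, and a real coverage tool would supply them in its own format.

```python
def find_coverage_gaps(executable_lines, executed_lines):
    """Map each function that is not fully covered to its unexecuted lines."""
    gaps = {}
    for function, lines in executable_lines.items():
        missed = lines - executed_lines.get(function, set())
        if missed:
            gaps[function] = sorted(missed)
    return gaps


# Hypothetical coverage data collected while running the functional tests.
EXECUTABLE = {
    "parse_packet": {10, 11, 12, 13, 14, 15},
    "update_display": {20, 21, 22},
    "handle_timeout": {30, 31, 32},
}
EXECUTED = {
    "parse_packet": {10, 11, 12, 13},
    "update_display": {20, 21, 22},
    # handle_timeout was never called by any functional test
}

if __name__ == "__main__":
    for function, missed in find_coverage_gaps(EXECUTABLE, EXECUTED).items():
        print(f"{function}: lines {missed} need additional coverage tests")
```

Steps 3 and 4 remain engineering work: someone still has to decide which inputs will drive execution into the reported lines.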
Infamous Software Bugs. The first known computer bug came about in 1947 when a primitive computer used by the Navy to calculate the trajectories of artillery shells shut down when a moth got stuck in one of its computing elements, a mechanical relay. Hence, the name bug for a computer error.
In 1962, the Mariner 1 mission to Venus failed because the rocket went off course after launch and had to be destroyed at a project cost of $80 million. The problem was traced to a typographical error in the FORTRAN guidance code. The FORTRAN statement written by the programmer was
DO 10 I = 1.5
This was interpreted as an assignment statement, DO10I = 1.5.
The statement should have been
DO 10 I = 1,5
This statement is a DO loop: execute the statements through the line labeled 10 for values of I from one to five.
Perhaps the most sobering embedded systems software defect was the deadly Therac-25 disaster in 1987. Four cancer patients receiving radiation therapy died from radiation overdoses. The problem was traced to a failure in the software responsible for monitoring the patients’ safety.
When to Stop Testing?
The algorithm above has a lot in common with the instructions on the back of every shampoo bottle. Taken literally, you would be testing (and shampooing) forever. Obviously, you’ll need to have some predetermined criteria for when to stop testing and to release the product.
If you are designing your system for mission-critical applications, such as the navigational software in a commercial jetliner, the degree to which you must test your code is painstakingly spelled out in documents, such as the FAA’s DO-178B specification.
Unless you can certify and demonstrate that your code has met the requirements set forth in this document, you cannot deploy your product. For most others, the criteria are less fixed.
The most commonly used stop criteria (in order of reliability) are:
• When the boss says
• When a new iteration of the test cycle finds fewer than X new bugs
• When a certain coverage threshold has been met without uncovering any new bugs
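The last two criteria are easy to state as an explicit check, which at least forces X and the coverage threshold to be written down before testing starts. The sketch below is one possible encoding; the function name, parameter names, and example thresholds are assumptions, not values the text prescribes.

```python
def should_stop_testing(new_bugs_this_cycle, coverage, max_new_bugs, coverage_threshold):
    """Return the stop criterion that is met, or None to keep testing.

    coverage and coverage_threshold are fractions between 0.0 and 1.0;
    max_new_bugs plays the role of "X" in the list above.
    """
    if coverage >= coverage_threshold and new_bugs_this_cycle == 0:
        return "coverage threshold met with no new bugs"
    if new_bugs_this_cycle < max_new_bugs:
        return "fewer than X new bugs in this cycle"
    return None


if __name__ == "__main__":
    # Example: 95% coverage reached, but the last cycle still found 7 bugs.
    print(should_stop_testing(7, 0.95, max_new_bugs=3, coverage_threshold=0.90))
```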
Regardless of how thoroughly you test your program, you can never be certain you have found all the bugs. This brings up another interesting question: How many bugs can you tolerate?
Suppose that during extreme software stress testing you find that the system locks up about every 20 hours of testing. You examine the code but are unable to find the root cause of the error. Should you ship the product?
How much testing is “good enough”? I can’t tell you. It would be nice to have some time-tested rule: “if method Z estimates there are fewer than X bugs in Y lines of code, then your program is safe to release.” Perhaps some day such standards will exist. The programming industry is still relatively young and hasn’t yet reached the level of sophistication, for example, of the building industry.
Many thick volumes of building handbooks and codes have evolved over the years that provide the architect, civil engineer, and structural engineer with all the information they need to build a safe building on schedule and within budget. Occasionally, buildings still collapse, but that’s pretty rare. Until programming produces a comparable set of standards, it’s a judgment call.
Choosing Test Cases
In the ideal case, you want to test every possible behavior in your program. This implies testing every possible combination of inputs or every possible decision path at least once.
This is a noble, but utterly impractical, goal. For example, in The Art of Software Testing, Glenford Myers describes a small program with only five decisions that has 10^14 unique execution paths. He points out that if you could write, execute, and verify one test case every five minutes, it would take one billion years to exhaustively test this program.
Obviously, the ideal situation is beyond reach, so you must use approximations to this ideal. As you’ll see, a combination of functional testing and coverage testing provides a reasonable second-best alternative. The basic approach is to select the tests (some functional, some coverage) that have the highest probability of exposing an error.
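The billion-year figure is easy to check. The short calculation below takes the path count as 10^14 and assumes round-the-clock testing with a 365-day year; both the function name and those simplifications are ours.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # assumes testing around the clock


def years_to_test(paths, minutes_per_test):
    """Years needed to run one test per execution path, back to back."""
    return paths * minutes_per_test / MINUTES_PER_YEAR


if __name__ == "__main__":
    print(f"{years_to_test(10**14, 5):.2e} years")
```

The result is roughly 950 million years, which rounds to the one billion quoted.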
Functional Tests. Functional testing is often called black-box testing because the test cases for functional tests are devised without reference to the actual code—that is, without looking “inside the box.”
An embedded system has inputs and outputs and implements some algorithm between them. Black-box tests are based on what is known about which inputs should be acceptable and how they should relate to the outputs. Black-box tests know nothing about how the algorithm in between is implemented. Example black-box tests include:
• Stress tests: Tests that intentionally overload input channels, memory buffers, disk controllers, memory management systems, and so on.
• Boundary value tests: Inputs that represent “boundaries” within a particular range (for example, largest and smallest integers together with −1, 0, +1, for an integer input) and input values that should cause the output to transition across a similar boundary in the output range.
• Exception tests: Tests that should trigger a failure mode or exception mode.
• Error guessing: Tests based on prior experience with testing software or from testing similar programs.
• Random tests: Generally, the least productive form of testing but still widely used to evaluate the robustness of user-interface code.
• Performance tests: Because performance expectations are part of the product requirement, performance analysis falls within the sphere of functional testing.
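Boundary value tests lend themselves to a small generator. The sketch below produces the five inputs named in the list for a signed integer of a given width, then applies them to a hypothetical 16-bit absolute-value routine. `wrap_int16` emulates two’s-complement wraparound in Python; all names are illustrative.

```python
def boundary_values(bits=16):
    """Boundary test inputs for a signed two's-complement integer of `bits` bits."""
    smallest = -(2 ** (bits - 1))
    largest = 2 ** (bits - 1) - 1
    return [smallest, -1, 0, 1, largest]


def wrap_int16(value):
    """Emulate 16-bit two's-complement wraparound."""
    return (value + 0x8000) % 0x10000 - 0x8000


def abs16(value):
    """Hypothetical routine under test: absolute value computed in 16 bits."""
    return wrap_int16(-value) if value < 0 else value


def failing_boundaries():
    """Return the boundary inputs for which abs16 yields a negative result."""
    return [v for v in boundary_values(16) if abs16(v) < 0]


if __name__ == "__main__":
    print("abs16 is negative for:", failing_boundaries())
```

The only failing input is the smallest integer, whose magnitude cannot be represented in 16 bits. A test built from average values alone would be unlikely to hit it.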
Because black-box tests depend only on the program requirements and its I/O behavior, they can be developed as soon as the requirements are complete. This allows black-box test cases to be developed in parallel with the rest of the system design.
Like all testing, functional tests should be designed to be destructive, that is, to prove the program doesn’t work. This means overloading input channels, beating on the keyboard in random ways, purposely doing all the things that you, as a programmer, know will hurt your baby.
When I was an R&D product manager, this was one of my primary test methodologies. If 40 hours of abuse testing could be logged with no serious or critical defects logged against the product, the product could be released. If a significant defect was found, the clock started over again after the defect was fixed.
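The restart rule is simple enough to express as a running counter. The sketch below assumes each test session is logged as a pair of hours tested and whether a significant defect was found, and that a session which finds a defect contributes no clean hours. That logging format is an assumption, as are the names.

```python
def abuse_clock(sessions, required_hours=40):
    """Return (clean_hours, releasable) after a sequence of abuse-test sessions.

    sessions: list of (hours_tested, significant_defect_found) tuples in order.
    """
    clean_hours = 0
    for hours, defect_found in sessions:
        if defect_found:
            clean_hours = 0  # the clock starts over once the defect is fixed
        else:
            clean_hours += hours
    return clean_hours, clean_hours >= required_hours


if __name__ == "__main__":
    log = [(30, False), (5, True), (25, False), (15, False)]
    print(abuse_clock(log))
```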
The weakness of functional testing is that it rarely exercises all the code. Coverage tests attempt to avoid this weakness by (ideally) ensuring that each code statement, decision point, or decision path is exercised at least once. (Coverage testing also can show how much of your data space has been accessed.)
Also known as white-box tests or glass-box tests, coverage tests are devised with full knowledge of how the software is implemented, that is, with permission to “look inside the box.” White-box tests are designed with the source code handy.
They exploit the programmer’s knowledge of the program’s APIs, internal control structures, and exception handling capabilities. Because white-box tests depend on specific implementation decisions, they can’t be designed until after the code is written.
From an embedded systems point of view, coverage testing is the most important type of testing because the degree to which you can show how much of your code has been exercised is an excellent predictor of the risk of undetected bugs you’ll be facing later. Example white-box tests include:
• Statement coverage: Test cases selected because they execute every statement in the program at least once.
• Decision or branch coverage: Test cases chosen because they cause every branch (both the true and false path) to be executed at least once.
• Condition coverage: Test cases chosen to force each condition (term) in a decision to take on all possible logic values.
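The three criteria are not interchangeable, as a two-condition decision is enough to show. The sketch below uses a hypothetical function, `alarm_state`, and works out analytically (without instrumenting the code) which criteria a given set of test inputs satisfies. It ignores short-circuit evaluation, so treat it as an illustration of the definitions rather than a substitute for a coverage tool.

```python
def alarm_state(temp_high, sensor_ok):
    """Hypothetical code under test: one decision made of two conditions."""
    result = "idle"
    if temp_high and sensor_ok:
        result = "alarm"
    return result


def coverage_report(test_inputs):
    """Report which criteria a list of (temp_high, sensor_ok) inputs meets."""
    decisions = {temp_high and sensor_ok for temp_high, sensor_ok in test_inputs}
    temp_values = {temp_high for temp_high, _ in test_inputs}
    sensor_values = {sensor_ok for _, sensor_ok in test_inputs}
    both = {True, False}
    return {
        # Every statement runs only if the decision is true at least once.
        "statement": True in decisions,
        # Both the true and the false path of the decision are taken.
        "branch": decisions == both,
        # Each individual condition takes on both logic values.
        "condition": temp_values == both and sensor_values == both,
    }


if __name__ == "__main__":
    for tests in ([(True, True)],
                  [(True, True), (False, True)],
                  [(True, False), (False, True)]):
        print(tests, coverage_report(tests))
```

The third test set achieves condition coverage without ever taking the true branch, so it misses a statement entirely. That is why the criteria are usually combined rather than used in isolation.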
Theoretically, a white-box test can exploit or manipulate whatever it needs to conduct its test. Thus, a white-box test might use the JTAG interface to force a particular memory value as part of a test. More practically, white-box testing might analyze the execution path reported by a logic analyzer.
Because white-box tests can be intimately connected to the internals of the code, they can be more expensive to maintain than black-box tests. Whereas black-box tests remain valid as long as the requirements and the I/O relationships remain stable, white-box tests might need to be re-engineered every time the code is changed. Thus, the most cost-effective white-box tests generally are those that exploit knowledge of the implementation without being intimately tied to the coding details.
Tests that only know a little about the internals are sometimes called gray-box tests. Gray-box tests can be very effective when coupled with “error guessing.” If you know, or at least suspect, where the weak points are in the code, you can design tests that stress those weak points.
These tests are gray box because they cover specific portions of the code; they are error guessing because they are chosen based on a guess about what errors are likely.
This testing strategy is useful when you’re integrating new functionality with a stable base of legacy code. Because the code base is already well tested, it makes sense to focus your test efforts in the area where the new code and the old code come together.
To read Part 2, go to: Testing embedded software
Arnold Berger is a Senior Lecturer in the CSS Department of the University of Washington Bothell. He can be reached at ABerger@bothell.washington.edu.