The basics of embedded programming: Part 9 - Software Test and Validation
Program Validation and Testing
By Wayne Wolf
Embedded.com
(10/01/07, 12:05:00 AM EDT)
Complex systems need testing to ensure that they work as they are intended. But bugs can be subtle, particularly in embedded systems, where specialized hardware and real-time responsiveness make programming more challenging.

Fortunately, there are many available techniques for software testing that can help us generate a comprehensive set of tests to ensure that our system works properly. In this last part in this series, we examine the role of validation in the overall design methodology as well as the nuts-and-bolts techniques for creating a good set of tests for a given program.

The first question we must ask ourselves is how much testing is enough. Clearly, we cannot test the program for every possible combination of inputs. Because we cannot implement an infinite number of tests, we naturally ask ourselves what a reasonable standard of thoroughness is.

One of the major contributions of software testing is to provide us with standards of thoroughness that make sense. Following these standards does not guarantee that we will find all bugs. But by breaking the testing problem into subproblems and analyzing each subproblem, we can identify testing methods that provide reasonable amounts of testing while keeping the testing time within reasonable bounds.

The two major types of testing strategies follow:

In this concluding part in this series, we cover both types of tests, which complement each other by exercising programs in very different ways.

Clear-Box Testing
The control/data flow graph extracted from a program's source code is an important tool in developing clear-box tests for the program. To adequately test the program, we must exercise both its control and data operations.

In order to execute and evaluate these tests, we must be able to control variables in the program and observe the results of computations, much as in manufacturing testing. In general, we may need to modify the program to make it more testable. By adding new inputs and outputs, we can usually substantially reduce the effort required to find and execute the test.

Example 5-11 below illustrates the importance of observability and controllability in software testing.

No matter what we are testing, we must accomplish the following three things in a test:

Example 5-11 Controlling and Observing Programs
Let's first consider controllability by examining the following FIR filter with a limiter:

 firout = 0.0; /* initialize filter output */ 
/* compute buff*c in bottom part of circular buffer */
for (j = curr, k = 0; j < N; j++, k++)
firout += buff[j] * c[k];
/* compute buff*c in top part of circular buffer */
for (j = 0; j < curr; j++, k++)
firout += buff[j] * c[k];
/* limit output value */
if (firout 100.0) firout = 100.0;
if (firout < -100.0) firout = -100.0;

The above code computes the output of an FIR filter from a circular buffer of values and then limits the maximum filter output (much as an overloaded speaker will hit a range limit).

If we want to test whether the limiting code works, we must be able to generate two out-of-range values for firout: positive and negative. To do that, we must fill the FIR filter's circular buffer with N values in the proper range. Although there are many sets of values that will work, it will still take time for us to properly set up the filter output for each test.

This code also illustrates an observability problem. If we want to test the FIR filter itself, we look at the value of firout before the limiting code executes. We could use a debugger such as dbx to set breakpoints in the code, but this is an awkward way to perform a large number of tests. If we want to test the FIR code independent of the limiting code, we would have to add a mechanism for observing firout independently.

Being able to perform this process for a large number of tests entails some amount of drudgery, but that drudgery can be alleviated with good program design that simplifies controllability and observability.

The next task is to determine the set of tests to be performed. We need to perform many different types of tests to be confident that we have identified a large fraction of the existing bugs. Even if we thoroughly test the program using one criterion, that criterion ignores other aspects of the program. Over the next few pages we will describe several very different criteria for program testing.

Execution Paths
The most fundamental concept in clear-box testing is the path of execution through a program. Previously, we considered paths for performance analysis; we are now concerned with making sure that a path is covered and determining how to ensure that the path is in fact executed.

We want to test the program by forcing the program to execute along chosen paths. We force the execution of a path by giving it inputs that cause it to take the appropriate branches. Execution of a path exercises both the control and data aspects of the program. The control is exercised as we take branches; both the computations leading up to the branch decision and other computations performed along the path exercise the data aspects.

Is it possible to execute every complete path in an arbitrary program? The answer is no, since the program may contain a while loop that is not guaranteed to terminate. The same is true for any program that operates on a continuous stream of data, since we cannot arbitrarily define the beginning and end of the data stream.

If the program always terminates, then there are indeed a finite number of complete paths that can be enumerated from the path graph. This leads us to the next question: Does it make sense to exercise every path?

The answer to this question is no for most programs, since the number of paths, especially for any program with a loop, is extremely large. However, the choice of an appropriate subset of paths to test requires some thought. Example 5-12 below illustrates the consequences of two different choices of testing strategies.

Example 5-12 Choosing the Paths to Test
Two reasonable choices for a set of paths to test follow:

These conditions are equivalent for structured programming languages without gotos, but are not the same for unstructured code. Most assembly language is unstructured, and state machines may be coded in high-level languages with gotos.

To understand the difference between statement and branch coverage, consider the CDFG below. We can execute every statement at least once by executing the program along two distinct paths.

However, this leaves branch a out of the lower conditional uncovered. To ensure that we have executed along every edge in the CDFG, we must execute a third path through the program. This path does not test any new statements, but it does cause a to be exercised.

How do we choose a set of paths that adequately covers the program's behavior? Intuition tells us that a relatively small number of paths should be able to cover most practical programs. Graph theory helps us get a quantitative handle on the different paths required.

In an undirected graph, we can form any path through the graph from combinations of basis paths. (Unfortunately, this property does not strictly hold for directed graphs such as CDFGs, but this formulation still helps us understand the nature of selecting a set of covering paths through a program.) The term "basis set" comes from linear algebra.

Figure 5-29 below shows how to evaluate the basis set of a graph. The graph is represented as an incidence matrix. Each row and column represents a node; a 1 is entered for each node pair connected by an edge.

We can use standard linear algebra techniques to identify the basis set of the graph. Each vector in the basis set represents a primitive path. We can form new paths by adding the vectors modulo 2. Generally, there is more than one basis set for a graph.

The basis set property provides a metric for test coverage. If we cover all the basis paths, we can consider the control flow adequately covered. Although the basis set measure is not entirely accurate since the directed edges of the CDFG may make some combinations of paths infeasible, it does provide a reasonable and justifiable measure of test coverage.

Figure 5-29. The matrix representation of a graph and its basis set.

There is a simple measure, cyclomatic complexity [McC76], which allows us to measure the control complexity of a program. Cyclomatic complexity is an upper bound on the size of the basis set earlier. If e is the number of edges in the flow graph, n the number of nodes, and p the number of components in the graph, then the cyclomatic complexity is given by

M = e - n +2p. (EQ 5-3)

For a structured program, M can be computed by counting the number of binary decisions in the flow graph and adding 1. If the CDFG has higher-order branch nodes, add b ' 1 for each b-way branch. In the example of Figure 5-30 below, the cyclomatic complexity evaluates to 4. Because there are actually only three distinct paths in the graph, cyclomatic complexity in this case is an overly conservative bound.

Another way of looking at control flow"oriented testing is to analyze the conditions that control the conditional statements. Consider the following if statement:

if ((a == b) || (c =d)) { ... }

Figure 5-30. Cyclomatic complexity.

This complex condition can be exercised in several different ways. If we want to truly exercise the paths through this condition, it is prudent to exercise the conditional's elements in ways related to their own structure, not just the structure of the paths through them. A simple condition testing strategy is known as branch testing [Mye79]. This strategy requires the true and false branches of a conditional and every simple condition in the conditional's expression to be tested at least once. Example 5-13 below illustrates branch testing.

Example 5-13. Condition Testing with the Branch Testing Strategy
Assume that the code below is what we meant to write.

if (a || (b = c)) { printf("OK\n"); }

The code that we mistakenly wrote instead follows:

if (a && (b=c)) {printf("OK\n");}

If we apply branch testing to the code we wrote, one of the tests will use these values: a = 0, b = 3, c = 2 (making a false and b = c true). In this case, the code should print theOK term [0 || (3 = 2) is true] but instead doesn't print [0 && (3 = 2) evaluates to false]. That test picks up the error.

Let's consider another more subtle error that is nonetheless all too common in C. The code we meant to write follows:

if ((x == good_pointer) && (x-field1 == 3)) { printf("got the value\n"); }

Here is the bad code we actually wrote:

if ((x = good_pointer) && (x-field1 == 3)) { printf("got the value\n"); }

The problem here is that we typed = rather than ==, creating an assignment rather than a test. The code x = good_pointer first assigns the value good_pointer to x and then, because assignments are also expressions in C, returns good_pointer as the result of evaluating this expression.

If we apply the principles of branch testing, one of the tests we want to use will contain x != good_pointer and x-field1 == 3. Whether this test catches the error depends on the state of the record pointed to by good_pointer. If it is equal to 3 at the time of the test, the message will be printed erroneously.

Although this test is not guaranteed to uncover the bug, it has a reasonable chance of success. One of the reasons to use many different types of tests is to maximize the chance that supposedly unrelated elements will cooperate to reveal the error in a particular situation.

Another more sophisticated strategy for testing conditionals is known as domain testing [How82], illustrated in Figure 5-31 below. Domain testing concentrates on linear inequalities. In the figure, the inequality the program should use for the test is

j <= i + 1.

We test the inequality with three test points - two on the boundary of the valid region and a third outside the region but between the i values of the other two points. When we make some common mistakes in typing the inequality, these three tests are sufficient to uncover them, as shown in the figure.

A potential problem with path coverage is that the paths chosen to cover the CDFG may not have any important relationship to the program's function. Another testing strategy known as data flow testing makes use of def-use analysis (short for definition-use analysis). It selects paths that have some relationship to the program's function.

Figure 5-31. Domain testing for a pair of variables.

The terms def and use come from compilers, which use def-use analysis for optimization [Aho86]. A variable's value is  defined when an assignment is made to the variable; it is  used when it appears on the right side of an assignment (sometimes called a c-use for computation use) or in a conditional expression (sometimes called p-use for predicate use).

A def-use pair is a definition of a variable's value and a use of that value. Figure 5-32 below shows a code fragment and all the def-use pairs for the first assignment to a. Def-use analysis can be performed on a program using iterative algorithms. Data flow testing chooses tests that exercise chosen def-use pairs.

The test first causes a certain value to be assigned at the definition and then observes the result at the use point to be sure that the desired value arrived there. Frankl and Weyuker [Fra88] have defined criteria for choosing which def-use pairs to exercise to satisfy a well-behaved adequacy criterion.

Figure 5-32. Definitions and uses of variables.

Testing Loops
We can write some specialized tests for loops. Since loops are common and often perform important steps in the program, it is worth developing loopcentric testing methods. If the number of iterations is fixed, then testing is relatively simple. However, many loops have bounds that are executed at run time. Consider first the case of a single loop as follows:
for (i = 0; i < terminate(); i++)
proc(i,array);

It would be too expensive to evaluate the above loop for all possible termination conditions. However, there are several important cases that we should try at a minimum. These cases are summarized below.

  1. Skipping the loop entirely [if possible, such as when terminate() returns 0 on its first call].
  2. One loop iteration.
  3. Two loop iterations.
  4. If there is an upper bound n on the number of loop iterations (which may come from the maximum size of an array), a value that is significantly below that maximum number of iterations.
  5. Tests near the upper bound on the number of loop iterations, that is, n ' 1, n, and n + 1.

We can also have nested loops such as the following:

for (i = 0; i < terminate1(); i++)
for (j = 0; j < terminate2(); j++)
for (k = 0; k < terminate3(); k++)
proc(i,j,k,array);

There are many possible strategies for testing nested loops. One thing to keep in mind is which loops have fixed versus variable numbers of iterations. Beizer [Bei90] suggests an inside-out strategy for testing loops with multiple variable iteration bounds. First, concentrate on testing the innermost loop as above - the outer loops should be controlled to their minimum numbers of iterations. After the inner loop has been thoroughly tested, the next outer loop can be tested more thoroughly, with the inner loop executing a typical number of iterations. This strategy can be repeated until the entire loop nest has been tested. Clearly, nested loops can require a large number of tests. It may be worthwhile to insert testing code to allow greater control over the loop nest for testing.

Black-Box Testing
Black-box tests are generated without knowledge of the code being tested. When used alone, black-box tests have a low probability of finding all the bugs in a program. But when used in conjunction with clear-box tests they help provide a well-rounded test set, since black-box tests are likely to uncover errors that are unlikely to be found by tests extracted from the code structure.

Blackbox tests can really work. For instance, when asked to test an instrument whose front panel was run by a microcontroller, one acquaintance of the author used his hand to depress all the buttons simultaneously. The front panel immediately locked up. This situation could occur in practice if the instrument were placed face-down on a table, but discovery of this bug would be very unlikely via clear-box tests.

One important technique is to take tests directly from the specification for the code under design. The specification should state which outputs are expected for certain inputs. Tests should be created that provide specified outputs and evaluate whether the results also satisfy the inputs.

We can't test every possible input combination, but some rules of thumb help us select reasonable sets of inputs. When an input can range across a set of values, it is a very good idea to test at the ends of the range.

For example, if an input must be between 1 and 10, 0, 1, 10, and 11 are all important values to test. We should be sure to consider tests both within and outside the range, such as, testing values within the range and outside the range. We may want to consider tests well outside the valid range as well as boundary-condition tests.

Random tests form one category of black-box test. Random values are generated with a given distribution. The expected values are computed independently of the system, and then the test inputs are applied. A large number of tests must be applied for the results to be statistically significant, but the tests are easy to generate.

Another scenario is to test certain types of data values. For example, integer-valued inputs can be generated at interesting values such as 0, 1, and values near the maximum end of the data range. Illegal values can be tested as well.

Regression tests form an extremely important category of tests. When tests are created during earlier stages in the system design or for previous versions of the system, those tests should be saved to apply to the later versions of the system. Clearly, unless the system specification changed, the new system should be able to pass old tests.

In some cases old bugs can creep back into systems, such as when an old version of a software module is inadvertently installed. In other cases regression tests simply exercise the code in different ways than would be done for the current version of the code and therefore possibly exercise different bugs.

Some embedded systems, particularly digital signal processing systems, lend themselves to numerical analysis. Signal processing algorithms are frequently implemented with limited-range arithmetic to save hardware costs. Aggressive data sets can be generated to stress the numerical accuracy of the system. These tests can often be generated from the original formulas without reference to the source

Evaluating Function Tests
How much testing is enough? Horgan and Mathur [Hor96] evaluated the coverage of two well-known programs, TeX and awk. They used functional tests for these programs that had been developed over several years of extensive testing. Upon applying those functional tests to the programs, they obtained the code coverage statistics shown in Figure 5-33 below.

The columns refer to various types of test coverage: block refers to basic blocks, decision to conditionals, puse to a use of a variable in a predicate (decision), and c-use to variable use in a nonpredicate computation. These results are at least suggestive that functional testing does not fully exercise the code and that techniques that explicitly generate tests for various pieces of code are necessary to obtain adequate levels of code coverage.

Figure 5-33. Code coverage of functional tests for TeX and awk (after Horgan and Mathur [Hor96])

Methodological techniques are important for understanding the quality of your tests. For example, if you keep track of the number of bugs tested each day, the data you collect over time should show you some trends on the number of errors per page of code to expect on the average, how many bugs are caught by certain kinds of tests, and so on.

One interesting method for analyzing the coverage of your tests is error injection. First, take your existing code and add bugs to it, keeping track of where the bugs were added. Then run your existing tests on the modified program. By counting the number of added bugs your tests found, you can get an idea of how effective the tests are in uncovering the bugs you haven't yet found.

This method assumes that you can deliberately inject bugs that are of similar varieties to those created naturally by programming errors. If the bugs are too easy or too difficult to find or simply require different types of tests, then bug injection's results will not be relevant. Of course, it is essential that you finally use the correct code, not the code with added bugs.

Performance Testing
Because embedded systems often have real-time deadlines, we must concern ourselves with testing for performance, not just functionality. Performance testing determines whether the required result was generated within a certain amount of time. In many cases, we are interested in the worst-case execution time, although in some cases we may want to verify the best-case or averagecase execution time.

Performance analysis is very important here. Performance analysis can tell us what path causes the worst-case (or other case) execution time. From there we can determine the inputs required and the expected outputs. Real-time code with deadlines must always terminate—we know that the code must finish for any possible set of inputs. Functional testing can help us determine whether the program actually terminates.

How do we measure program performance? Functional testing can be performed on any computer in most cases so long as we can compile the code on that platform. Performance testing is not easy. We may be able to use a simulator to measure performance. However, CPU simulators vary widely in the amount of timing information they provide.

Performance can also be measured directly on the hardware. We can use a timer in the system to count execution time. The program resets the timer at the start and checks the timer's value at the end of execution. We can also use logic analyzers as long as the start and end points of the program are visible on the bus.

To obtain truly accurate performance measurements, we must run the program in the environment in which it will operate. For example, the CPU used must have the same size cache as the target platform will provide. The program should also be run along with other programs that operate concurrently. When several programs (or programs and interrupt drivers) operate concurrently, they can interfere with each other in the cache causing severe performance problems.

Conclusion
The program is a very fundamental unit of embedded systems design and it usually contains tightly interacting code. Because we care about more than just functionality, we need to understand how programs are created. Because today's compilers do not take directions such as "compile this to run in less than 1 microsecon," we have to be able to optimize programs ourselves for speed, power and space.

Our earlier understanding of computer architecture is critical to our ability to perform these optimizations. We also need to test programs to make sure they do what we want. Some of our testing technicques can also be useful in exercising the programs for performance optimization.

To read Part 1, go to  "Program design and analysis."
To read Part 2 , go to "Models of program, assemblers and linkers."
To read Part 3, go to  "Basic Compilation Techniques"
To read Part 4, go to  "The creation of procedures"
To read Part 5, go to  "Register allocation and scheduling"
To read Part 6, go to  "Analysis and optimization of execution time"
To read Part 7, go to  "Trace-Driven Performance Analysis"
To read Part 8, go to  "Analysis and optimization of energy and power."

Used with the permission of the publisher, Newnes/Elsevier, this series of nine articles is based on copyrighted material from "Computers as Components: Principles of Embedded Computer System Design" by Wayne Wolf. The book can be purchased on line.

Wayne Wolf  is currently the Georgia Research Alliance Eminent Scholar holding the Rhesa "Ray" S. Farmer, Jr., Distinguished Chair in Embedded Computer Systems at Georgia Tech's School of Electrical and Computer Engineering (ECE). Previously a professor of electrical engineering at Princeton University, he worked at AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing and of Design Automation for Embedded Systems.

References:
[Aho86]
Alfred VAho, et. al. "Compilers: Principles, Techniques and Tools. Addison Wesley,1986.
[Hor96] 
Joseph  R. Horgan  and Aditya Mathur, "Software testing and reliability" Chapter 13 in Hand book of Software Reliability Engineering," IEEE Computer Society Press/ McGraw Hill 1996.
[How82] W.E. Howden, "Weak mutation testing and the completeness of test cases," IEEE Transactions on Software Engineering.
[McC76] T.J. McCabe, "A complexity measure," IEEE Transactions on Software Engineering.