Verifying embedded software functionality: fault localization, metrics and directed testing -

Verifying embedded software functionality: fault localization, metrics and directed testing


Editor’s Note: In the third in a four part series Abhik Roychoudhury, author of Embedded Systems and software validation, discusses the pros and cons of metric base fault localization and directed testing for assessing software functionality.

So far, in Part 1 andPart 2 in this series, we have presented the dynamic slicing method, which is fully formal and requires examination of the control/data dependencies in an execution trace.

But, the difficulties in using it include (a) time and space overheads for storing/analyzing program traces and (b) potentially large slice sizes. In the preceding, we examined methods to deal with the second problem – comprehension of large slices.

However, we still have to grapple with the time and space overheads of dynamic slicing. As observed earlier, state-of-the-art dynamic slicing tools employ various tricks such as online compaction of the execution trace and program dependence analysis on the compact trace (without decompressing it).

Nevertheless, the time and space overheads for large real-life programs is still substantial, and the quest for lightweight methods remains. In this part in the series, we will discuss a class of such lightweight methods. In the following we use the terms execution trace and execution run interchangeably. Indeed, the existing literature on software debugging also uses these two terms interchangeably. Before proceeding any further, let us first give an illustrative example.

Illutrating the problem with Siemens TCAS
Our example is a fragment of the TCAS program from the Siemens benchmark suite, which has been extensively used in the software engineering community for research in testing/debugging. The TCAS program is an embedded software for altitude control.

In Figure 5.9 below , we show a fragment of the program. Note that Climb and Up are input variables of the program. There is a bug in the following program fragment, namely, lines 2 and 4 are reversed in order. In other words, line 2 should be separation = Up + 100 and line 4 should be separation = Up.

Now, consider an execution of the some program fragment with the inputs Climb = 1and Up = 100. The execution will lead to “Downward” being printed. Clearly, this is unexpected, because the developer would expect “Upward” to be printed for these inputs. Thus, the trace for the inputs Climb = 1, Up = 100 is a failing run that needs to be debugged.

We now have an example of a failing run, but what is a successful run?Asuccessful run is simply one where the program output is as expected. So, if the programmer expects the output to be “Upward,” the program should print “Upward,” and if the programmer expects the output to be “Downward,” the program should print “Downward.”

Consider the program execution with the inputs Climb = 0 and Up = 0. The output in this case is “Downward,” and this matches the developer’s expectations. Hence we deem this as a successful run. Usually, the process of determining whether a given run is failed or successful cannot be fully automated.

This involves matching the program output with the developer’s expectation, so the task of articulating the developer’s expectation remains manual. We have now explained what we mean by failing run and successful run. Our task is to debug a given “failed” run—to explain why it failed, that is, why the program output was not as expected.

We are trying to do so by comparing it with a successful run (where the program output was as expected) in order to gain insights about what went wrong in the failed run. The computed “difference” between the failed run and the chosen successful run is reported to the programmer as the bug report. The key questions now are:

1 – Given a failed run, how do we choose a successful run?
2 – Given a failed and a successful run, how do we compute their difference?

Both the questions have their answers in a evaluation metric for execution runs. A common (and very rough) metric is the set of statements executed in an execution run. If we have a successful run and a failed run, we can compute their difference by computing the difference of the set of statements executed.

The question now is how to get a successful run? In other words, how do we choose a successful run corresponding to a given failed run α f ? We will choose a successful run α s such that the set of statements executed in α s is “close” to the set of statements executed in α f .

In fact, given a program P and failed execution run αf in P, we can do the following:

1 – Typically the program P will be endowed with a test suite (set of test cases) based on some coverage criteria (covering all statements or all branches in the program).We construct the execution runs for the test cases from the test suite. Let this set of execution runs be Runsall (P).

2 – From among the execution runs in Runsall (P), we chose those that are successful, that is, runs where the program output meets the programmer’s expectations.

Let this set be Runsall clearly Succ Runsall (P)sube; Runsall (P). 3 – We choose an execution run αs ∈ Suc Runsall (P) such that the quantity | stmt αf – stmtαs | is minimized. Here stmt(_) is the set of statements in an execution run α and | S| is the cardinality or the number of elements in a set S. Note that for two sets S1 and S2 , the quantity S1 – S2 denotes the set difference, that is, elements appearing in S1 but not in S2 .

Thus, we choose a successful execution run αs , such that there are only a few statements appearing in the failed run α’f , but not in α’s . The idea here is that if a statement appears only in the failed run but not in the successful run, it is a likely error cause.

In our running example, the inputs Climb = 1 and Up = 100 lead to an unexpected output. The set of statements executed in this failed execution run is {1, 2, 5, 7, 8, 9, 13,14}. Furthermore, the inputs Climb = 0 and Up = 0 lead to an expected output. The set of statements executed in this successful execution run is {1, 3, 4, 5, 7, 8, 9, 13,14}. So, the bug report is the difference between these two sets:

{1, 2, 5, 7, 8, 9, 13,14}_{1, 3, 4, 5, 7, 8, 9, 13,14} = {2}

Once this line is pointed out to the developer, he or she should be able to locate the error in line 2. Note here that the choice of successful execution run is crucial. Consider an execution of the program in Figure 5.9 with the inputs Climb = 1 and Up = 200. When executed with these inputs, the program will print “Upward,” which is what the developer expects.

So, the execution run for these inputs is deemed as a successful run. What would have happened if we chose this execution run to compare with our failed execution run (the one resulting from the inputs Climb = 1 and Up = 100)?

The set of statements executed for the inputs Climb = 1 and Up = 200 is {1, 2, 5, 6, 9, 10,11}. So, in this case the bug report (the difference in executed statements between the failed run and the chosen successful run) would have been

{1, 2, 5, 7, 8, 9, 13,14}_{1, 2, 5, 6, 9, 10,11} = {7, 8, 13,14}

The bug report in this case consists of more statements. Moreover, the statements do not pinpoint to the actual error cause in the program; they are only manifestations of the error cause.

This simple example should demonstrate to the reader that the choice of successful run is crucial for the usefulness of the bug report generated by fault localization methods. Figure 5.10 below summarizes the class of trace-based debugging methods we are presenting here – commonly called fault localization.

Figure 5-10. Fault localization methods.

The aim of these methods is to find the error cause in a failed execution run (a run with unexpected behavior) in a given program. This is done by comparing the failed execution run with a successful execution run. The successful execution run is chosen from a test pool, possibly constructed by coverage-based testing. The chosen successful run is then compared with the failed execution run, and the difference is reported back to the developer as a bug report. Of course, there are many metrics by which to compare the execution traces, such as:

Sets of statements executed,
Sequence of branch statements executed,
Sequence of statements executed,
Sequence of variable values taken (for selected variables), and so on.

Trace Comparison Metric Based on Statement Sets
The simplest trace comparison metric (which is very efficient to compute) is simply: Set of statements executed in failed run – Set of statements executed in successful run. Given a failed run, we choose a successful run that minimizes this quantity. However, this trace comparison metric does not distinguish between runs that execute exactly the same statements but in a different order. Consider the schematic program fragment in the following, consisting of an if-then-else within a loop:

for (…) {
if (…) S1 else S2;

Consider two runs of the program S1,S2,S1,S2 and S2,S1,S2,S1. These two runs execute the conditional branch in the if-then-else statement differently in every iteration of the loop. Yet a trace comparison metric based on sets of statements executed cannot distinguish between these two runs. Given a failed run, say, S1,S2,S1,S2, suppose we choose S2,S1,S2,S1 as the successful run with which we compare our failed run. The result is a null bug report, which is not only useless for debugging, but also misleading to the developer.

Trace Comparison Metric Based on Branch Alignments
For the reasons mentioned in the preceding, the software debugging community has studied trace comparison metrics that compare traces of a program by finding out which branches are evaluated differently.

Such a difference metric measures the difference between two execution runs π and π’ of a program, by comparing behaviors of “corresponding” branch statement instances from π and π ‘. The branch statement instances with differing outcomes in π, &pi: are captured in diff (π π’) – the difference between execution run π and execution run π’. In order to find out “corresponding” branch instances, a notion of alignment is defined, to relate statement instances of two execution runs.

Typically, such a branch alignment can be based on dynamic control dependence. Here we illustrate the distance metric with an example. Consider the program fragment in Figure 5.11 below, taken from the Siemens benchmark suite. This piece of code changes all substrings s1 in string lin matching a pattern to another substring s2 .

Figure 5.11. An example program fragment from the Siemens benchmark suite. We use the example to illustrate trace comparison metrics.

Here variable i represents the index to the first unprocessed character in string lin, variable m represents the index to the end of a matched substring s1 in string lin, and variable lastm records variable m in the last loop iteration. The bug in the code lies in the fact that the branch condition in line 3 should be if (m >= 0) && (lastm != m). At the i-th iteration, if variable m is not changed at line 2, line 3 is wrongly evaluated to true, and substring s2 is wrongly returned as output, deemed by programmer as an observable “error.”

In Figure 5.12 below, we show some traces from the program in Figure 5.11. The difference between execution runs π and π ‘ is: diff (π π’) = 3 3 ,714 , as indicated in Figure 5.12 below.

This is because branch instances 33 ,714 are aligned in runs π and π’ and their outcomes are different in π π ‘ If the branches at lines 33 ,714 are evaluated differently, we get π’ from π. Similarly, the difference between execution runs π and π’ is diff (π’ π”) = 76 ,714 .

Clickon image to enlarge.

Figure 5.12. Example to illustrate alignments and difference metrics.

Why do we capture branch event occurrences of π that evaluate differently in π ‘ in the difference diff (π π’)? Recall that we want to choose a successful run for purposes of fault localization. If π is the failing run and π ‘ is a successful run, then diff (π’ π”) tells us which branches in the failing runπ need to be evaluated differently to produce the successful run π’.

Clearly, if we have a choice of successful runs, we would like to make minimal changes to the failing run to produce a successful run. Thus, given a failing run π and two successful runs π’ π” we choose π ‘ over π “ if diff (π, π’) < diff (π, π”). This requires us to compare differences. How we do so is elaborated in the following. Given a failing run π and two successful runs π’ π” we can say that diff (π, π’) < diff (π, π”) based on a combination of the following criteria:

1 – Fewer branches of π need to be evaluated differently to get π” as compared to the number of branches of π that need to be evaluated differently to get π”.

2 – The branches of π that need to be evaluated differently to get π ‘ appear closer to the end of π (where the error is observed), as compared to the branches of π that need to be evaluated differently to get π”.

To illustrate our comparison of differences, consider the example in Figure 5.12. Recall that diff (π π’) = 33 , 712 ,, and diff (ππ') = 76 ,714 , as illustrated in the last two columns of Figure 5.12. Comparing 33 ,,714 , with 76 ,714 ,, we see that 76 , 714 < 33 , 714 ,,because statement instance 76 , occurs after statement instance 33 , in execution run π”.

Summary of Trace Comparison Methods
In summary, which trace comparison metric is chosen and how the traces are compared is a matter of choice. However, based on the metric, we can choose the successful run from a pool of successful runs (say, the test suite of a program). In particular, suppose we have an execution trace αf that is failed, meaning it shows an unexpected behavior.We want to compare it with another execution trace αs such that:1 – α s ,, does not show any unexpected behavior, that is, the program outputs in run α s ,, are as per the developer’s expectations, and2 – α s , is the closest to α f in terms of the comparison metric being used.Thus, based on the trace comparison metric being used, we choose the successful run against which we compare a given failed execution run, and report the difference between the two runs as a bug report.Directed Testing Methods
So far, we have studied different debugging methods which either (a) analyze the dependencies in a failed trace or (b) compare a failed trace with a “chosen” successful trace. These methods can be combined with testing techniques in a postmortem fashion. In other words, given a program P, we generate a test suite (set of test inputs) for P, and the traces for the test inputs in the test suite are subjected to analysis.On the other hand, one could envision testing methods that are more directed to exposing errors. Conventional software testing methods are often driven with the sole goal of coverage. What do we mean by coverage in the context of test generation? Let us take the statement coverage criteria.We say that a set of test inputs S achieves statement coverage if each statement in the program appears in the trace for at least one test input in S. Similarly, one can define other coverage criteria such as branch edge coverage.Standard test coverage criteria such as statement coverage provide very weak guarantees about the software’s reliability. Statement coverage merely says that each statement in the program is executed for some test input in the test suite.However, this does not mean that executing the tests in the test suite will expose the bug in a buggy statement. If a statement is buggy, its execution does not guarantee the manifestation of the bug in the statement.Of course, if a statement in the program is buggy and it is executed for some input, there is more chance of the bug in the statement being manifested in the form of an unexpected output. Ideally, what we would like to do via systematic program testing is to expose the different paths in a program.However, enumerating all paths in a program and finding inputs that exercise these paths is not easy. The number of program paths is exponential on the number of branch instances, and the number of branch instances (individual executions of a branch statement) itself can be very large.Exhaustive testing by trying out all inputs is simply not an option with real-life embedded software, because there are too many inputs to test. Often, there are even infinitely many inputs—consider an image compression program; we cannot test it with all possible input images.So, we are stuck between the frying pan and the fire!

We cannot employ brute force methods such as exhaustive testing of all inputs, because these methods do not scale up for real-world programs. We also cannot hope to cover all program paths by lightweight methods such as random testing; successive experimental studies have shown that random testing leads to poor coverage of program paths.

Nor can we expect to cover all paths (or even a large fraction of them) simply by covering some other code artifact such as statement coverage. How do we proceed?

One answer to this problem seems to be systematic path exploration via directed testing. In this approach, we start testing our program P with a random input i. Suppose executing the program with input i goes along program path π . During the execution of program P with input i, we collect the condition under which path π is executed. This is the path condition of π and captures the set of inputs (one of which is, of course, i) whose executions go along path π. We then slightly modify the path condition π to produce another path conditionπ’, which is solved to produce another test input.

This process goes on, every time modifying the current path condition, thereby getting a new path and then finding a test input that exercises this new path. This is essentially a method of systematic exploration of program paths by finding suitable inputs to exercise the paths. The hope is that, by exploring more program paths, we have a better chance of encountering errors in the program.

Figure 5-13. Example program fragment from Siemens benchmark suite.

We now illustrate this method with an example. Consider the program fragment in Figure 5.9, reproduced here as Figure 5.13 for convenience. The inputs to this program fragment are Climb (boolean variable) and Up (integer variable). Suppose we start with a random input Climb == 0 and Up == 457. This produces the following path:

1. if (Climb)

3. else

4. separation = Up + 100;
5. if (separation > 150)

6. upward = 1;

9. if (upward > 0)

10. …
11. printf(“Upward”);

The path condition (i.e., the conditions on the input on which the foregoing path is executed) is

Climb == 0^Up=100 > 150^upward > 0

The preceding is a conjunction of three primitive constraints. To systematically explore other paths, we can negate the last primitive constraint to get

Climb == 0^Up =100 > 150^ upward ≥ 0

This path turns out to be infeasible—no program input exercising this path. So, we negate the next primitive constraint and get

Climb = = 0^Up +100 ≥150

Solving this constraint we can get a sample input Climb = = 0, Up = = 0, allowing us to explore a new path given as follows.

1. if (Climb)

3. else
4. separation = Up + 100;
5. if (separation > 150)

7. else
8. upward = 1;
9. if (upward > 0)

12. else
13. …
14. printf(“Downward”);

Continuing further in this fashion, we can explore the different paths in the program and get concrete inputs that exercise these program paths. This method is called directed testing, because we modify the path constraint of the current path being explored to explore a new path.

Thus, the method is directed toward exploring more paths in the program. It will achieve significantly more coverage than random testing, where several randomly generated inputs may exercise the same program path.

One can employ such a directed testing approach for exposing more program behaviors in a systematic way, thereby hoping to encounter corner cases leading to exceptions/crashes during the testing phase.

Before concluding our discussion on directed testing, we should mention that several subtle issues in a directed testing algorithm have not been discussed here. In our example, the first path we obtained in the preceding had a path constraint

Climb == 0^Up -100 > 150^upward > 0

Here Climb and Up are input variables. However, upward is a program variable whose value is related to the value of the input variable Up. In fact, the (boolean) value of upward is directly correlated to the condition Up + 100 > 150. Directed testing methods/tools will usually try to detect such correlations. As a result, they can avoid exploring infeasible paths, such as the path with constraint

Climb == 0^Up >100 > 150^upward ≥ 0

in our example. Other issues in directed testing methods include search strategies for exploring paths. The method we outlined here via an example essentially employs depth-first search to explore paths.

New paths are obtained from the current path being explored by backtracking (where we negate the condition corresponding to the last branch encountered). Other search strategies such as breadth-first or best-first have also been explored in various tools.

To read Part 1, go to “Why functional testing is needed?
To read Part 2, go to “Dynamic slicing”
Next in Part 4 : Formal verification and testing

Used with permission from Morgan Kaufmann, a division of Elsevier.Copyright 2009, from “Embedded Systems and Software Validation,” by Abhik Roychoudbury. For more information about this title and other similar books, visit

Abhik Roychoudhury is associate professor at the National Univdrsity of Singapore. He received his Ph.D. in computer science from the State University of New York at Stony Brook. His research interests are system modeling and validation with a specific focus on embedded software and systems.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.