A guide to systematic program debugging: Part 2 - Embedded.com


Editor’s Note: In this two-part series excerpted from Why Programs Fail: A Guide to Systematic Debugging, author Andreas Zeller defines the essential concepts for effective software debugging. Part 2: Following the TRAFFIC rules for debugging.

Let’s put our knowledge about software program states and dependences into practice, following the TRAFFIC steps outlined in Part 1 of this series.

Track the Problem. The first step in debugging is to track the problem—that is, to file a problem report such that the defect will not go unnoticed. In our case, we have already observed the failure symptom: the output of sample, when invoked with arguments 11 and 14, contains a zero.

   $ ./sample 11 14
   Output: 0 11
   $ _

An actual problem report would include this invocation as an instruction on how to reproduce the problem.

Reproduce the Failure. In the case of the sample program, reproducing the failure is easy. All you need to do is reinvoke sample, as shown previously. In other cases, though, reproducing may require control over all possible input sources.

Automate and Simplify the Test Case. If sample were a more complex program, we would have to think about how to automate the failure (in that we want to reproduce the failure automatically) and how to simplify its input such that we obtain a minimal test case. In the case of sample, though, this is not necessary.

Find Possible Infection Origins. Where does the zero in the output come from? We must find possible infection origins. To find possible origins, we need the actual C source code of sample, shown in Example 1.1.

1    /* sample.c - Sample C program to be debugged */
3    #include <stdio.h>
4    #include <stdlib.h>
6    static void shell_sort(int a[], int size)
7    {
8       int i, j;
9       int h = 1;
11     do {
12        h = h * 3 + 1;
13     } while (h <= size);
14     do {
15          h /= 3;
16          for (i = h; i < size; i++)
17          {
18              int v = a[i];
19              for (j = i; j >= h && a[j - h] > v; j -= h)
20                 a[j] = a[j - h];
21              if (i != j)
22                 a[j] = v;
23           }
24       } while (h != 1);
25    }
27    int main(int argc, char *argv[])
28    {
29       int *a;
30       int i;
32       a = (int *)malloc((argc - 1) * sizeof(int));
33       for (i = 0; i < argc - 1; i++)
34          a[i] = atoi(argv[i + 1]);
36       shell_sort(a, argc);
38       printf("Output: ");
39       for (i = 0; i < argc - 1; i++)
40          printf("%d ", a[i]);
41       printf("\n");
43       free(a);
45       return 0;
46    }

EXAMPLE 1.1: The sample program sorts given numbers—that is, mostly.

We quickly see that the program consists of two functions: shell_sort() (which implements the shell sort algorithm) and main(), which realizes a simple test driver around shell_sort(). The main function does the following:

  • Allocates an array a[] (line 32).
  • Copies the command-line arguments into a[] (lines 33–34).
  • Sorts a[] by invoking shell_sort() (line 36).
  • Prints the content of a[] (lines 38–41).

By matching the output to the appropriate code, we find that the 0 printed by sample is the value of the variable a[0], the first element of the array a[]. This is the infection we observe: at line 39 in sample.c, variable a[0] is obviously zero.

Where does the zero in a[0] come from? Working our way backward from line 40, we find in line 36 the call shell_sort(a, argc), where the array a[] is passed by reference. This function might well be the point at which a[0] was assigned the infected value.

Unfortunately, shell_sort() in lines 6–25 is quite obscure. We cannot trace back the value of a[0] to a specific origin simply by deduction from the program code. Instead, we have to observe what actually happens in the failing run.

In principle, we can observe anything about the sample run, as sketched in Figure 1.6. We can even “execute” it on paper. However, this approach does not scale. We must focus on specific parts of the state or on specific moments in time.

Figure 1.6: Observing the sample run. Using observation tools, we can observe the program state as it progresses through time.

Relying on our earlier deduction on the origins of a[0], we focus on the execution of shell_sort(). We can easily find out that shell_sort() does not access any nonlocal variables. Whatever comes out of shell_sort() is determined by its input. If we observe the arguments at the invocation of shell_sort(), two things can happen:

  1. The arguments at the invocation of shell_sort() are sane (i.e., are just as intended). In this case, the infection must take place during the execution of shell_sort(), as sketched in Figure 1.7.
  2. The arguments are already infected. In this case, the infection must have taken place before the execution of shell_sort().

Figure 1.7: Observing a transition from sane to infected. If we know that an earlier state is sane and a later state is infected, we can narrow down our search to isolate the transition between these two states.

To find out how shell_sort() was actually invoked, we need a means of observing the state during execution. Here we will use the simplest of all observation techniques: We insert output statements in the code that log specific variables and their values when executed.

For instance, we could insert the following code in line 10 to have the values of the parameters a[] and size logged on the standard error channel whenever shell_sort() is invoked.

   fprintf(stderr, "At shell_sort\n");
   for (i = 0; i < size; i++)
      fprintf(stderr, "a[%d] = %d\n", i, a[i]);
   fprintf(stderr, "size = %d\n", size);

Focus on the Most Likely Origins. After inserting the code and restarting sample with the arguments 11 and 14, you will find that at shell_sort() the values of the parameters are as follows.

   a[0] = 11
   a[1] = 14
   a[2] = 0
   size = 3

We see that shell_sort() is invoked with three elements—that is, the array a[] to be sorted is [11, 14, 0]. This state is infected—that is, a[] should contain only two elements. As discussed previously, an infected state is likely to cause failures, and this particular state may well be the cause of our failure.

Our hypothesis is that shell_sort() properly sorts the three elements of a[] in place to [0, 11, 14]. Later on, though, only the first two elements of a[] will be printed, resulting in the failure output.

Isolate the Origin of the Infection. According to our earlier reasoning, the infection must have occurred before the invocation of shell_sort(). Obviously, the parameter size is wrong. We can trace back its origin to the point at which shell_sort() is invoked: in line 36, we find the invocation shell_sort(a, argc), where the size parameter gets its value from the argc variable. However, argc is not the number of elements in a[].

It is the number of arguments to the sample program, including the name sample itself (argc is always one more than the number of elements in a[]). Thus, the following is our speculation about what is happening in our program:

1. The array a[] is allocated and initialized with the correct number of elements (2).
2. shell_sort() is invoked such that the size parameter is 3 instead of 2 (the state is infected).
3. size being 3 causes shell_sort() to access a[] beyond the allocated space (namely, at a[2] ).
4. The uninitialized memory at a[2] happens to be 0.
5. During the sort, a[2] is eventually swapped with a[0] , thus setting a[0] to zero (the infection has spread to a[0] ).
6. Thus, the zero value of a[0] is printed, causing the failure.

You may wonder why sample actually worked when invoked with the arguments 9 7 8. The defect was the same, and it caused the same infection. However, as a[3] in this case turned out to be larger than 9, it did not get swapped with another array element. At the return of shell_sort(), the infection was gone, and thus the defect never showed up as a failure.

Correct the Defect. So far, we are still speculating about the failure cause. To deliver the final proof, we have to correct the defect. If the failure no longer occurs, we know that the defect caused the failure. In addition to prohibiting the failure in question, we want to prohibit as many failures as possible. In our case, we achieve this by replacing line 36, shell_sort(a, argc); with the correct invocation shell_sort(a, argc - 1);

Repeating the test with the fixed program, as follows, shows that the original failure no longer occurs.

   $ ./sample 11 14
   Output: 11 14
   $ _

This resolves the sample problem.

Automated debugging techniques
Essentially, we have solved the sample problem manually—that is, without using any specific tools. In principle, all debugging problems can be solved manually—by deduction from the source code and observation of what is going on in a program. (Purists might even argue that deduction alone suffices to prove a program correct, removing the need to fix defects.)

In practice, though, it is unwise to rely on manual debugging alone, as the computer can relieve you of most boring and tedious tasks. In particular, the sample program discussed earlier can be debugged almost automatically. Figure 1.8 depicts the automated debugging techniques available:


Simplified input with delta debugging. This is a technique that automatically narrows down the difference between a passing and a failing run. Applied to program input, delta debugging returns a simplified input wherein each part contributes to the failure. Applied to the failing sample run, delta debugging determines that each of the arguments 11 and 14 is relevant. The failure no longer occurs if sample is called with one argument only.

Program slice. Deducing from the (abstract) program code what can and cannot happen in the (concrete) program run. The most important technique is slicing—separating the part of a program or program run relevant to the failure. In Figure 1.8(a), we can see that only a fraction of the state actually could have caused the failure. Applied to sample, a program slice could determine that a[0] got the zero value because of the values of a[2] and size, which is already a good hint.

Figure 1.8: Some automated debugging techniques: (a) program slice, (b) observing state, (c) watching state, (d) asserting an invariant, (e) anomalies, and (f) cause–effect chain.

Observing entire states using debuggers. A debugger can make a program stop under specific conditions, which allows a programmer to observe the entire state (see Figure 1.8(b)). This allows us to tell the sane program state from the infected state. Using a debugger on sample, we would be able to observe the values of a[] and size at any moment in time without changing or recompiling the program.

Watching state parts. Another important feature of debuggers is that they allow us to watch small parts of the state to determine whether they change during execution. As sketched in Figure 1.8(c), this allows us to identify the precise moment at which a variable becomes infected. Using a debugger on sample, we would be able to watch the value of a[0] to catch the precise statement that assigns the zero from a[2].

Assertions. When observing a program state, the programmer must still compare the observed values with the intended values—a process that takes time and is error prone. Assertions are used to delegate this comparison process to the computer.

The programmer specifies the expected values and has the computer check them at runtime—especially at the beginning and end of functions (i.e., pre- and postconditions). Such assertions typically include invariants over data structures that hold during the entire run.

If all assertions pass, this means that the state is just as expected. When used to check invariants, as sketched in Figure 1.8(d), assertions can mark large parts of the state as “sane,” allowing the programmer to focus on the other parts.

One specific instance of assertions is the memory assertion, which checks whether memory is accessed in a legal way. Applied to sample, tools exist that can easily identify that a[2] is accessed without being allocated or initialized.

Looking for anomalies. In general, we can assume that a program works well most of the time. If a program fails nonetheless, we can use our knowledge about the passing runs and focus on the differences between the passing runs and the failing run. Such differences point out anomalies, as sketched in Figure 1.8(e).

Detecting anomalies. This requires techniques designed to compare program runs, as well as techniques for creating abstractions over multiple runs. Applied to sample, we can, for instance, compare the coverage of the two runs sample 11 (passing) and sample 11 14 (failing).

It turns out that the statements where a[j] is assigned a value are executed only in the failing run, not in the passing run. Thus, if we are looking for a zero value in a[0], these two lines might be a good starting point.

Cause–effect chains. By applying delta debugging to program states, it is possible to identify, for each state, which particular variable(s) caused the failure. This results in a cause–effect chain, as sketched in Figure 1.8(f). Although causes are not necessarily errors, they help to narrow down the relevant elements of a failure.

All of these techniques can be combined to locate the defect systematically—and with a substantial degree of automation. Of all debugging activities, locating the defect (the find–focus–isolate loop in TRAFFIC) is by far the most time consuming.

Correcting a defect is usually simple, unless it involves a major redesign (in which case we call the defect a flaw). Not every defect results in an infection, and not every infection results in a failure. Yet, every failure can be traced back to some infection, which again can be traced back to a defect.

Part 1: Defining the TRAFFIC rules for debugging.



Andreas Zeller is chair of software engineering at the University of Saarland, where his research involves programmer productivity, with a particular interest in finding and fixing problems in code and development processes. He is best known for the visual GNU DDD debugger and the delta debugging technique for automatically isolating failure causes in program code.
