A guide to systematic program debugging: Part 1 - Embedded.com

A guide to systematic program debugging: Part 1

Editor’s Note:  In this two part series excerpted from Why programs fail: A guide to systematic debugging , author Andreas Zeller defines the essential concepts for effective software debugging and uses them to show how to isolate and then fix defects in the program code once a failure has occurred. Part 1: Defining the TRAFFIC rules for debugging.

Your program fails. How can this be? The answer is that the programmer creates a defect in the code. When the code is executed, the defect causes an infection in the program state, which later becomes visible as a failure. To find the defect, one must reason backward, starting with the failure.

This is a common situation that interrupts our routine and requires immediate attention. Because the program mostly worked until now, we assume that something external has crept into our machine—something that is natural and unavoidable; something we are not responsible for—namely, a bug.

If you are a user, you have probably already learned to live with bugs. You may even think that bugs are unavoidable when it comes to software. As a programmer, though, you know that bugs do not creep out of mother nature into our programs.

Defining terms: bugs, faults, or defects?
The word bug suggests something humans can touch and remove—and are probably not responsible for. This is already one reason to avoid the word bug. Another reason is its lack of precision. Applied to programs, a bug can mean:

  • An incorrect program code [“This line is buggy”].
  • An incorrect program state [“This pointer, being null, is a bug”].
  • An incorrect program execution [“The program crashes; this is a bug”].

This ambiguity of the term bug is unfortunate, as it confuses causes with symptoms as in, for example:

The bug in the code caused a bug in the state, which caused a bug in the execution—and when we saw the bug we tracked the bug, and finally found and fixed the bug.

This series of articles uses the following, more precise, terms:

  • Defect: An incorrect program code [a bug in the code].
  • Infection: An incorrect program state [a bug in the state].
  • Failure: An observable incorrect program behavior [a bug in the behavior].

The wording of the previous example thus becomes clearer: The defect caused an infection , which caused a failure —and when we saw the failure we tracked the infection, and finally found and fixed the defect.

The industry uses several synonyms of these terms.The IEEE standards define the term fault as defect as defined here. They also define bug as a fault , thus making it a synonym of defect —and debugging thus becomes the activity of removing defects in the software.

The terms error and fault are frequently used as synonyms of infection, but also for mistakes made by the programmer. Failures are also called issues or problems . In this series,we use problem as a general term for a questionable property of the program run. A problem becomes a failure as soon as it is considered incorrect. Some defects cannot be attributed to a specific location in the software, but rather to its overall design or architecture. I call such defects flaws . In general, flaws are bad news, because they suggest major changes involved in fixing the problem.

So much for our family of bugs and related terms. Actually, your choice of one term over another shows two things. First, your wording shows how seriously you take the quality of your work.

As Humphrey [1] points out, the term bug “has an unfortunate connotation of merely being an annoyance; something you could swat away or ignore with minor discomfort.” He suggests defect instead, and this book follows his advice. Likewise, the word failure is to be taken more seriously than issue . If you find a flaw, you should be truly alarmed.

Second, your choice shows whether you want to attribute failures to individuals. Where bugs seem to creep into the software as if they had a life of their own, Errors and faults are clearly results of human action. These terms were coined by Edsger W. Dijkstra [2] as pejorative alternatives to bug in order to increase the programmers’ sense of responsibility.

After all, who wants to create a fault? However, if a program does not behave as intended this may not be the effect of a human mistake. In fact, even a program that is correct with respect to its specification can still produce surprises. This is why I use the terms defect and infection instead of the guilt-ridden faults and errors.

How defects cause failures, and how can we fix them?
The following is a small program I once produced. The sample program is a very simple sorting tool. Given a list of numbers as command-line arguments, it prints them as a sorted list on the standard output [$ is the command-line prompt].

   $ ./sample 9 7 8
   Output: 7 8 9
   $ _

Unfortunately, sample does not always work properly, as demonstrated by the following failure:

   $ ./sample 11 14
   Output: 0 11
   $ _

Although the sample output is sorted and contains the right number of items, some original arguments are missing and replaced by bogus numbers. Here, 14 is missing and replaced by 0. (Actual bogus numbers and behavior on your system may vary.) From the sample failure, we can deduce that sample has a bug (or, more precisely, a defect). In general, a failure such as that in the sample program comes about in the four stages discussed in the following.

Stage 1. The programmer creates a defect . A defect is a piece of the code that can cause an infection. Because the defect is part of the code, and because every code is initially written by a programmer, the defect is technically created by the programmer. If the programmer creates a defect, does that mean the programmer was at fault? Not necessarily. Consider the following:

  • The original requirements did not foresee future changes. Think about theY2K problem, for instance.
  • A program behavior may become classified as a “failure” only when the user sees it for the first time.
  • In a modular program, a failure may happen because of incompatible interfaces of two modules.
  • In a distributed program, a failure may be the result of some unpredictable interaction of several components.

In such settings, deciding on who is to blame is a political, not a technical, question. Nobody made a mistake, and yet a failure occurred.

Stage 2. The defect causes an infection. The program is executed, and with it the defect. The defect now creates an infection—that is, after execution of the defect, the program state differs from what the programmer intended. A defect in the code does not necessarily cause an infection. The defective code must be executed, and it must be executed under such conditions that the infection actually occurs.

Stage 3. The infection propagates. Most functions result in errors when fed with erroneous input. As the remaining program execution accesses the state, it generates further infections that can spread into later program states. An infection need not, however, propagate continuously. It may be overwritten, masked, or corrected by some later program action.

Stage 4. The infection causes a failure.
A failure is an externally observable error in the program behavior. It is caused by an infection in the program state.

The program execution process is sketched in Figure 1.1 . Each program state consists of the values of the program variables, as well as the current execution position (formally, the program counter). Each state determines subsequent states, up to the final state [at the bottom in the figure],in which we can observe the failure (indicated by the X).

Figure 1.1: A program execution as a succession of states. Each state determines the following states, and where from defect to failure errors propagate to form an infection chain.

Not every defect results in an infection, and not every infection results in a failure. Thus,having no failures does not imply having no defects. This is the curse of testing, as pointed out by Dijkstra. Testing can only show the presence of defects, but never their absence.

In the case of my sample program, though, we have actually experienced a failure. In hindsight, every failure is thus caused by some infection, and every infection is caused by some earlier infection, originating at the defect. This cause–effect chain from defect to failure is called an infection chain.

The issue of debugging is thus to identify the infection chain, to find its root cause [the defect], and to remove the defect such that the failure no longer occurs. This is what we shall do with the sample program.
Seven steps to TRAFFIC control
In general, debugging of aprogram such as sample can be decomposed into seven steps of which theinitial letters form the word TRAFFIC.

  1. Track the problem in the database.
  2. Reproduce the failure.
  3. Automate and simplify the test case.
  4. Find possible infection origins.
  5. Focus on the most likely origins.
  6. Isolate the infection chain.
  7. Correct the defect.

Ofthese steps, tracking the problem in a problem database is merebookkeeping and reproducing the problem is not that difficult fordeterministic programs such as sample. It can be difficult fornon-deterministic programs and long-running programs, though. Automatingthe test case is also rather straightforward, and results in automaticsimplification. The last step, correcting the defect, is usually simpleonce you have understood how the defect causes the failure.

Thefinal three steps—from finding the infection origins to isolating theinfection chain—are the steps concerned with understanding how thefailure came to be.This task requires by far the most time, as well asother resources. Understanding how the failure came to be is what therest of this article is about.

Why is understanding the failureso difficult? Considering Figure 1.1, all one need do to find the defectis isolate the transition from a sane state [i.e., non-infected, asintended] to an infected state. This is a search in space (as we have tofind out which part of the state is infected) as well as in time (as wehave to find out when the infection takes place).

However,examination of space and time are enormous tasks for even the simplestprograms. Each state consists of dozens, thousands, or even millions ofvariables. For example, Figure 1.2 shows a visualization of theprogram state of the GNU compiler [GCC] while compiling a program. Theprogram state consists of about 44,000 individual variables, each with adistinct value, and about 42,000 references between variables.

Figure1.2: A program state of the GNU compiler. The state consists of 44,000individual variables (shown as vertices) and about 42,000 referencesbetween variables (shown as edges).

Not only is asingle state quite large, a program execution consists of thousands,millions, or even billions of such states. Space and time thus form awide area in which only two points are well known (Figure 1.3):initially, the entire state is sane [√], and eventually some part of thestate is infected [x].

Figure1.3: Debugging as search in space and time. Initially, the programstate is sane [√], eventually, it is infected [x]. The aim of debuggingis to find out where this infection originated.

Withinthe area spanned by space and time, the aim of debugging is to locatethe defect—a single transition from sane to infected that eventuallycauses the failure (Figure 1.4 ).

Figure1.4: The defect that is searched. A defect manifests itself as atransition from sane state [√] to infected state [x], where an erroneousstatement causes the initial infection.

Debugging as a search problem
Thinkingabout the dimensions of space and time, this may seem like searchingfor a needle in an endless row of haystacks—and indeed, the fact is thatdebugging is largely a search problem. This search is driven by thefollowing two major principles:

1. Separate sane from infected. If a state is infected, it may be part of the infection propagatingfrom defect to failure. If a state is sane, there is no infection topropagate.

2. Separate relevant from irrelevant. Avariable value is the result of a limited number of earlier variablevalues. Thus, only some part of the earlier state may be relevant to thefailure.

Figure 1.5 illustrates this latter technique.The failure, to reiterate, can only have been caused by a small numberof other variables in earlier states [denoted using the exclamationpoint, !], the values of which in turn can only have come from otherearlier variables. One says that subsequent, variable values depend onearlier values.

This results in a series of dependencies fromthe failure back to earlier variable values. To locate the defect, itsuffices to examine these values only—as other values could not havepossibly caused the failure—and separate these values into sane andinfected.

Figure1.5: Deducing value origins. By analyzing the program code, we can findout that an infected variable value [x] can have originated only from asmall number of earlier variables [!].

If we findan infected value, we must find and fix the defect that causes it.Typically, this is the same defect that causes the original failure.

Whyis it that a variable value can be caused only by a small number ofearlier variables? Good programming style dictates division of the stateinto units such that the information flow between these units isminimized.

Typically, your programming language provides a meansof structuring the state, just as it helps you to structure the programcode. However, whether you divide the state into functions, modules,objects, packages, or components, the principle is the same: a dividedstate is much easier to conquer.

This article was excerpted from Andreas Zeller’s book Why programs fail: A guide to systematic debugging (Second Edition, Copyright 2009), published by Morgan Kauffmann, an imprint of Elsevier Inc.

Part 2: “Following the debugging TRAFFIC rules” will be posted shortly.

Andreas Zeller is chair of software engineering at the University of Saarland wherehis research involves programmer productivity with a particular interestin finding and fixing problems in code and development processes. He isbest known for the visual GNU DDD debugger and the delta debuggingtechnique for automatically isolating failure causes in program code.


1. Humphrey, W.S. [1999], Bugs or defects? Technical Report, Volume 2, Issue 1, Carnegie Mellon Software Engineering Institute.

2. Dijkstra, E.W. [1982], “On Webster, Users, Bugs, and Aristotle,” in Selected Writings on Computing: A Personal Perspective,” Springer-Verlag.

2 thoughts on “A guide to systematic program debugging: Part 1

  1. While I find articles like this interesting, these techniques and definitions are often more theoretical than what is useful in the real world.

    It is incorrect to say that a defect is something that is created by a programmer. Defects are frequently, even

    Log in to Reply
  2. I'll be interested to read the second part of this article. It is true that the process of locating a defect and fixing it is first a search problem. I've seen many engineers go at it different ways – there are those that, based on hunches of likely culpri

    Log in to Reply

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.