Editor's note: In Part 2 of a series excerpted from his book Why Programs Fail , author Andreas Zeller discusses algorithmic methods of debugging.
As an alternative to what was discussed in Part 1 in this series, another way of organizing the debugging process scientifically is to automate it at least partially using an algorithmic approach. The idea of algorithmic debugging (also called declarative debugging) is to have a tool that guides the user along the debugging process interactively. It does so by asking the user about possible infection sources:
1. Assume an incorrect result R has the origins O1,O2, . . . ,On .
2. For each of the origins Oi , algorithmic debugging inquires whether the origin Oi is correct or not.
3. If one of the origins Oi is incorrect, algorithmic debugging restarts at step 1 with R_Oi .
4. Otherwise, all origins Oi are correct. Then, the infection must have originated at the place where R was computed from the origins. The process terminates.
Let’s illustrate algorithmic debugging via an example. Example 6.1 shows a PYTHON sorting function: sort(L) is supposed to return a sorted copy of the list L. Unfortunately, sort() does not work properly: sort([2, 1, 3]) returns [3, 1, 2] rather than [1, 2, 3] .
Our sort() function is based on insertion sort. It thus relies on a function insert(X, L) ,which returns a list where X is inserted between the elements of L : insert(2, [1, 3]) should return [1, 2, 3] .
Figure 6.4 summarizes the execution of sort([2, 1, 3]) (a line stands for functions being invoked). Each invocation of sort(L) first calls itself for the tail of L and then calls insert() to insert the first element of L into the freshly sorted list.
Figure 6.4. Execution tree of the sorting function in Example 6.1. Each computation of a function (top) relies on further function calls (bottom).
The execution tree shown in Figure 6.4 now becomes the base for the algorithmic debugging process. Assume we have a tool that implements algorithmic debugging for PYTHON,working on the console. Such a tool would first ask us whether the end result is correct,which we decline.
sort([2, 1, 3]) = [3, 1, 2]? no
The error can originate either from sort([1, 3]) or the subsequent insert() call. Algorithmic debugging starts querying about the first origin.
sort([1, 3]) = [3, 1] ? no
Again, the error could originate from the earlier sort() call. Is the result correct?
sort() = ? yes
Because the call sort() =  was fine but sort([1, 3]) = [3, 1] is wrong, and the error could have originated in the insert() call. It actually does, because insert(1, ) should return [1, 3] , and not [3, 1] .
insert(1, ) = [3, 1]? no
Because insert(1, ) invokes no further functions, we have isolated the defect. The infection originates at the call insert(1, ) = [3, 1] (shown in gray in the figure). Our algorithmic debugging tool reports:
An error has been localized in the body of insert() .
We can even narrow down the infection to the code executing in the call insert(1, ) . This leaves us with the return statement:
if elem <= head:
return list + [elem]
This statement is wrong. If the element to be inserted is smaller than the head of the list, it should be inserted at the beginning rather than at the end. The statement thus must read:
if elem <= head:
return [elem] + list
This fixes the sort() function from Example 6.1.With this fix applied, it sorts just fine.
The general idea of having an algorithm drive the debugging process is applicable to arbitrary debugging techniques.
Wherever we search an error—and need to rely on human input to decide what is correct, right, or true—algorithmic debugging can drive the search in a systematic way. Unfortunately, algorithmic debugging has not been demonstrated to be effective for real-world applications described below:
The process does not scale.In a large imperative program, there are millions and millions of functions being executed. Many of these functions communicate via shared data structures, rather than simple arguments and returned values.Worse yet, the data structures being accessed are far too huge to be checked manually. Imagine this response when debugging a compiler:
“Are these 4 megabytes of executable code correct (yes/no)?”
For these reasons, algorithmic debugging works best for functional and logical programming languages. Functional and logical programs have few or no side effects—that is, there is no shared state that is updated, and the user does not have to check the entire program state. For logical languages such as PROLOG,an execution tree (Figure 6.4) becomes a proof tree,which is part of every program execution.
Programmers prefer driving to being driven.The algorithmic debugging process, as implemented in early tools, is extremely rigid. Many programmers do not like being instrumented in such a mechanical way. Making the process user friendly in the sense that it provides assistance to programmers (rather than having the programmer assist the tool) is an open research issue.
It is conceivable that future tools, combined with the analysis techniques defined in this article, will provide guidance to the programmer by asking the right questions. In particular, one can think of programmers providing specifications of particular function properties—specifications that can then be reused for narrowing down the incorrect part.
All of these problems disappear if we replace the programmer being queried by an oracle—an automatic device that tells correctness from non-correctness. To determine its result, such an oracle would use an external specification. Here we assume that there is no such specification (except from the final test outcome)—at least not in a form a mechanical device could make use of it. Therefore, scientific method still needs human interaction.Deriving a hypothesis
The scientific method gives us a generalprocess for turning a hypothesis into a theory— or,more specifically,aninitial guess into a diagnosis. But still,within each iteration of thescientific method we must come up with a new hypothesis.
This isthe creative part of debugging: thinking about the many ways a failurecould have come to be. This creative part is more than just mechanicallyenumerating possible origins, as in algorithmic debugging.
Unfortunately,being creative is not enough: We must also be effective. The better ourhypotheses, the less iterations we need, and the faster the diagnosisis done. To be effective, we need to leverage as many knowledge sourcesas possible. These are the ingredients of debugging, as shown in Figure6.1.
The Description of the Problem. Without a concisedescription of the problem, you will not be able to tell whether theproblem is solved or not.A simplified problem report also helps.
The Program Code. The program code is the common abstraction across all possible programruns, including the failing run. It is the basis for almost alldebugging techniques.
Without knowledge about the internals ofthe program, you can only observe concrete runs (if any) without everreferring to the common abstraction. Lack of program code makesunderstanding (and thus debugging) much more difficult. As you cannotrecreate the program from code,you must work around defects, which isfar less satisfactory than fixing the code itself.
As an example,consider the sort() algorithmic debugging session earlier in thisarticle.In principle,we (as users) could have run the session withoutknowing the source code. To determine whether a result is correct ornot, all we need is a specification. However,the tool itself must haveaccess to the code (Example 6.1) in order to trace and instrument theindividual function calls.
The Failing Run. The programcode allows you to speculate about what may be going on in a concretefailing run. If you actually execute the programsuch that the problem isreproduced, you can observe actual facts about the concrete run. Suchfacts include the code being executed and the program state as itevolves. These observation techniques are the bread and butter ofdebugging.
Again, debugging the sort() code in Example 6.1becomes much easier once one can talk about a concrete (failing) run. Inprinciple, one could do without observation. This is fine for provingabstract properties but bad for debugging concrete problems.
Alternate Runs. A single run of a nontrivial program contains a great deal ofinformation, and thus we need a means of focusing on specific aspects ofthe execution. In debugging,we are most interested in anomalies—thoseaspects of the failing run that differ from “normal” passing runs. Forthis purpose,we must know which “normal” runs exist, what their commonfeatures are, and how these differ in the failing run.
In the sort() example, algorithmic debugging has used alternate runs of individual functions to narrow down the defect. From the fact that sort() worked, and sort([1, 3]) failed, algorithmic debugging could deduce that the error must have originated in the insert() call taking place between the two sort() calls.
Inpractice, we seldom have a specification available to tell us whethersome aspect of a run is correct or not. Yet,with a sufficient number ofalternate runs we can classify what is“normal”or not.
Earlier Hypotheses. Depending on the outcome of a scientific method experiment, one musteither refine or reject a hypothesis. In fact, every new hypothesis must
- include all earlier hypotheses that passed (the predictions of which were satisfied), and
- exclude all hypotheses that failed (the predictions of which were not satisfied).
Anynew hypothesis must also explain all earlier observations, regardlessof whether the experiment succeeded or failed—and it should be differentenough from earlier hypotheses to quickly advance toward the target.
Again,the algorithmic debugging session is a straightforward example of howthe results of earlier tests (i.e., answers given by the user) drive thescientific method and thus the debugging process.The final diagnosis ofinsert() having a defect fits all passed hypotheses and explains all earlier observations.
Toautomate the process, we would like to reuse earlier hypotheses withoutasking the user for assistance. If a hypothesis is about a cause (suchas a failure cause),the search for the actual cause can be conductedsystematically by narrowing the difference between a passing and afailing scenario. These techniques can be automated and applied toprogram runs.
Conclusion: Reasoning about programs
Dependingon the ingredients that come into play, humans use different reasoningtechniques to learn about programs. These techniques form a hierarchy,as shown in Figure 6.5.
Deduction. Deductionis reasoning from the general to the particular. It lies at the core ofall reasoning techniques. In program analysis, deduction is used forreasoning from the program code (or other abstractions) to concreteruns— especially for deducing what can or cannot happen.
Thesedeductions take the form of mathematical proofs. If the abstraction istrue, so are the deduced properties. Because deduction does not requireany knowledge about the concrete, it is not required that the program inquestion actually be executed.
A technique is called staticanalysis if it infers findings without executing the program—that is,the technique is based on deduction alone. In contrast, dynamic analysistechniques use actual executions.
As Nethercote (2004) points out, this distinction of whether a program is executed or not maybe misleading. In particular, it raises the issue of what exactly ismeant by “execution.” Instead, he suggests that “static techniquespredict approximations of a program’s future; dynamic analysis remembersapproximations of a program’s past.”
Because in code debuggingwe are typically concerned about the past, most interesting debuggingtechniques fall into the “dynamic” categories,
Observation. Observationallows the programmer to inspect arbitrary aspects of an individualprogram run. Because an actual run is required, the associatedtechniques are called dynamic. Observation brings in actual facts of aprogram execution. Unless the observation process is flawed, these factscannot be denied.
A technique is observational if it generatesfindings or approximations from a single execution of the program. Mostobservational techniques make use of the program code in some form oranother and thus also rely on deduction.
Induction. Induction is reasoning from the particular to the general. In programanalysis, induction is used to summarize multiple program runs (e.g.,atest suite or random testing) to some abstraction that holds for allconsidered program runs.
In this context, a “program” may alsobe a piece of code that is invoked multiple times from within aprogram—that is, some function or loop body. In this book,we call atechnique inductive if it generates findings from multiple executions ofthe program. By definition,every inductive technique makes use ofobservation.
Experimentation.Searching for the cause of a failureusing scientific method requires a series of experiments, refining andrejecting hypotheses until a precise diagnosis is isolated. This impliesmultiple program runs that are controlled by the reasoning process.
Inthe context of programming, a technique experimental if it generatesfindings from multiple executions of the program that are controlled bythe technique. By definition, every experimental technique usesinduction and thus observation.
1. Kerningham, B.W. and Pike, R.(1999). The Practice of Programming. Addison Wesley.
2. Nethercote, N. (2004), Dynamic binary analysis and instrumentation, PhD. Thesis, University of Cambridge, U.K.
Andreas Zeller is chair of software engineering at the University of Saarland wherehis research involves programmer productivity with a particular interestin finding and fixing problems in code and development processes. He isbest known for the visual GNU DDD debugger and the delta debuggingtechnique for automatically isolating failure causes in program code.
This article was excerpted from Andreas Zeller’s book Why programs fail: A guide to systematic debugging (Second Edition, Copyright 2009) , published by Morgan Kauffmann, an Imprint of Elsevier Inc.