Avoiding Heisenberg Debugging

The traditional way to find errors in aprogram using a debugger is to stop the program, examine its state,let it run some more, and so on. Each time you stop the program,aeons go by from the machine's point of view.

This isn't a problem if your program is interacting only withother systems that will wait for it. However, if your code hasimportant real-time deadlines to meet, the system may fall out ofsync. Even worse, if your program is controlling a piece ofmachinery with physical parts (disk heads, robot arms) in motion,depending on your program to control them, you may do real damageby stopping the program. Worse yet, if the system you are debuggingis deployed in the field, and controlling a traffic light, anelevator, or a motor vehicle, lives could be lost while you stepthrough your code!

This amounts to what could be called the “Heisenberg Principleof Software Development”—debugging the system may change itsbehavior.

Debugging aids such as emulators and logic analyzers are verygood at reducing or nearly eliminating this intrusive effect, butthey require expensive hardware, and many developers don't reallyknow how to use them. The learning curve can be steep, since thesetools usually use quite a different metaphor from the traditional“stop, look around, and continue” cycle ofbreakpoint-based debugging. Some of the high-end (expensive)versions of these tools are at least loosely integrated with adebugger, but many of the middle- and low-end ones are not (or onlysuperficially so).

The poor man's approach to debugging time-critical code, coderunning in the field, and sometimes code that only manifests a bugat odd intervals, has often been to instrument the source code byhand with instructions that will write debugging state informationout to a console or to a file or buffer. We call this method“printf debugging”, and it has an ancient and honorablehistory. However, as useful as it may be, it is also cumbersome andlimited. To add or remove a trace, you must go through an entireedit / compile / load cycle. The output is only as readable andinformative as you make it. Unless you're really dedicated to thetechnique (and keep libraries of debugging routines), you mustreinvent the wheel every time you use it, and you usually have toremove it all before you deploy your software to the field.

Still, these methods all share the advantage that you can letyour program run at close to native speed, and then analyze what itdid after the fact. The challenge is to obtain this advantagewithout the disadvantages of expensive hardware, unfamiliarinterface, and non-reusability.

Agent-based systems provide a flexible, efficient, andrelatively non-intrusive means of debugging for embedded systems.They typically consist of code in the debugger and a software‘agent’ running on the target system, which togetherallow developers to specify ‘tracepoints’ (analogous tobreakpoints). A tracepoint is a code location and list of data tobe collected whenever execution reaches that location. You choosewhat data to collect, and when to collect it, interactively atdebug time. You can examine data collected, choose new tracepoints,and begin collection immediately, without recompiling and reloadingthe program.

Data collection using an agent is not instantaneous, but it isquick. Recording the data for a typical tracepoint takes a fewthousand machine cycles—less than the time required to send asingle character over a 9600 baud serial line. This latency issmall enough to be tolerable when debugging many embeddedapplications.

As an example of the power of tracepoints, suppose you want toobserve the behavior of the following code:

struct point {
double x, y;
};

/* A vector is an array of points. N is the number of points,and p points to the first point in the array.*/
struct vector {
int n;
struct point *p;
};
/* A binary tree of vectors, ordered by KEY. */
struct tree {
struct tree *left, *right;
int key;
struct vector *vector;
};
/* Return the node in TREE whose key is KEY. Return zero if thereif no such node. */
struct tree *
find (struct tree *tree, int key)
{
if (! tree)
return 0;
if (key < tree->key)
return find (tree->left, key);
else if (key > tree->key)
return find (tree->right, key);
else
return tree;
}

Each time the ‘find’ function is invoked, you wouldlike to see the call stack, the tree node being visited, and (tomake things interesting), the last point in that tree node'svector.

We could certainly do this with an ordinary debugger. Manydebuggers will let you set a breakpoint at a certain location, andthen list some commands or macros that will be executed when thebreakpoint is hit. You could use these to “collect” thevalues of program variables.

However, most debuggers don't provide any convenient way torecord the data collected. Moreover, although the debugger couldcollect data much faster than a human could, it would often not befast enough for some purposes—communication between thedebugger and the target is usually slow, and the typical debugger'smacro language is not especially efficient.

Analyzing the collected data, however, is something that anormal debugger is good at. Debuggers know how to reinterpret a setof raw register values and memory contents as source-levelconstructs like stack frames, variables with names and types, datastructures, and so on. Given the name of a variable, a debuggerknows how to look up the binding currently in scope for that name,determine its size and type, and find its value in a register, onthe stack, or in static memory.

So we see that a debugger could be used for two of the threetasks involved in trace debugging, but would not really be good atthe central task—the actual data collection. Suppose wedelegate that task to a separate trace collection agent, running onthe target system and configured by the debugger? Then a tracedebugging experiment might look something like this:

Phase 1: Specify the Trace Experiment:

Using the debugger, the user places tracepoints in his program.For each tracepoint, the user specifies the data to be collectedusing source code names and expressions, just as one would use in aprint, watch or display command.

Phase 2: Run the Experiment:

After downloading the tracepoints to the trace collection agent,the debugger allows the program to run. Each time a tracepoint isreached, the trace collection agent (which may in fact be linkeddirectly into the program) wakes up, quickly records the desireddata in a memory buffer on the target board, and allows the programto resume. (Note that this involves no interaction with thedebugger, and therefore no communication over slow seriallinks.)

Phase 3: Analyze the Results:

By querying the trace collection agent, the debugger can accessthe collected data, and “replay” the tracepoint events.The contents of each record in the trace buffer (each correspondingto the execution of a tracepoint) can be displayed in sequence orin any order.

Although intrusive, this method will affect the timing of therunning system far less than would be the case if the user, or eventhe debugger, were involved in collecting the data. No matter howlong the trace collection agent requires to service an interruptand collect the data, it will surely be less time than would berequired to send a message over a serial line, or move a human'sfinger over a keyboard! The degree of intrusiveness can be reducedby careful optimization of the trace collection agent.

Let's look at the three phases in more detail.

Specification Phase

Using the traditional debugging model, you might ask yourdebugger to stop at the beginning of the ‘find’function, display the function call stack, and show the values of‘*tree’ and ‘tree->vector.p [tree->vector.n – 1 ]’.

For instance, using Cygnus' debugger, GDB, you might accomplishthat task using commands like these:

(gdb) break find
(gdb) commands
> where
> print *tree
> print tree->vector.p [ tree->vector.n – 1 ]
> continue
> end
(gdb)

Trace specifications would be similar, but because collecting achunk of the program stack is especially useful during the analysisphase (see below), some trace agent implementations, such asCygnus' Introspect, provide a special syntax for collecting certaincommonly useful sets of data:

>collect $regs // allregisters
>collect $locals // all locals and arguments
>collect $stack // a fixed-size chunk of stack

The analogous commands for a trace experiment, then, might looklike this:

(gdb) trace find
(gdb) actions
> collect $stack
> collect $locals
> collect *tree
> collect tree->vector.p [ tree->vector.n – 1 ]
> end
(gdb)

In both cases, the debugger does not immediately do anything,other than to remember what the user wants to happen later. Nothingreally happens until the program is run, and ‘find’ iscalled. Then, in the case of the breakpoint, the backtrace,‘*tree’, and the vector's point are displayed rightaway. In the case of the tracepoint, the values are stored in thetrace buffer for later retrieval.

Collection Phase:

Each time the program reaches a tracepoint, the agent logs thetracepoint's data in a buffer on the target machine. Each log entryis called an ‘event’; each event contains the number ofthe tracepoint reached, and any register values and memory contentsneeded to evaluate the tracepoint's expressions.

It is important to understand that an event does not simplyrecord the values of each expression to be collected. Rather, itrecords everything the debugger might need to re-evaluate thatexpression later. In the example above, to collect‘*tree’, the event would record both the registercontaining the variable ‘tree’, and the memory the treenode occupies.

To begin collection in this example, we use GDB's‘tstart’ command (which downloads the trace experimentto the trace collection agent), and then let the program run:

(gdb) tstart
(gdb) continue

As the program runs, the agent will collect trace data.

Analysis Phase:

Again using the traditional debugging model, you might:

  1. Run until you reach a breakpoint
  2. Note where you are in the program
  3. Look at the values of data and/or registers
  4. Continue to the next breakpoint

If instead you were debugging the results of a trace experiment,you would:

  1. Select a particular tracepoint event to example
  2. Note where that event occurred
  3. Look at the values of collected data and/or registers
  4. Select another tracepoint event

Continuing the example, we can use the ‘tfind start’command to select the first recorded event:

(gdb) tfind start
Tracepoint 1, find (tree=0x8049a50, key=5) at samp.c:24
24 if (! tree)

Since we have collected ‘$stack,’ we can use GDB's‘where’ command to show the currently active frames.‘$stack’ saves only a fixed (configurable) number ofbytes from the top of the stack, but usually saves enough tocapture the top few frames.

(gdb) where
#0 find (tree=0x8049a50, key=5) at samp.c:24
#1 0x8048744 in main () at main.c:8

Since we have collected ‘*tree,’ we can examine thatdata structure.

(gdb) print *tree
$1 = {left = 0x80499b0, right = 0x8049870, key = 100,
vector = 0x8049a68}
(gdb) print tree->key
$2 = 100
(gdb) print tree->left
$3 = (struct tree *) 0x80499b0

Note that only those objects actually collected are availablefor inspection. Although the left subtree was collected at the nexttracepoint event, it was not collected in this one:

(gdb) print *tree->left
Data not collected.

However, in order to collect ‘tree->vector.p [tree->vector.n – 1 $#093;’, the agent had to collect both‘tree->vector.p’ and‘tree->vector.n’, so the entire‘tree->vector’ structure is covered. Since the datais available, we can print it normally:

(gdb) print *tree->vector
$4 = {n = 2, p = 0x8049a78}

Introspect does not collect the entirety of every objectmentioned in the expression. Rather, it collects the final value ofthe expression, along with any other data needed to evaluate theexpression. Thus, although the last point in the vector wascollected, none of the other points in the vector areavailable—they were never referenced while evaluating theexpression:

(gdb) print tree->vector.p [ 1]
$5 = {x = 3, y = -46}
(gdb) print tree->vector.p( [ 0 ]
Data not collected.

So far, we've been inspecting the first tracepoint event. Let'swalk forward through a few events, to see where the tree searchended. The ‘tfind’ command, given no arguments, selectsthe next trace event record:

(gdb) tfind
Tracepoint 1, find (tree=0x80499b0, key=5) at samp.c:24
24 if (! tree)
(gdb) where
#0 find (tree=0x80499b0, key=5) at samp.c:24
#1 0x80484fa in find (tree=0x8049a50, key=5) at samp.c:28
#2 0x8048744 in main () at main.c:8
(gdb) print *tree
$6 = {left = 0x8049950, right = 0x80498f0, key = 3, vector =0x80499c8}
(gdb) tfind
Tracepoint 1, find (tree=0x80498f0, key=5) at samp.c:24
24 if (! tree)
(gdb) where
#0 find (tree=0x80498f0, key=5) at samp.c:24
#1 0x8048523 in find (tree=0x80499b0, key=5) at samp.c:30
#2 0x80484fa in find (tree=0x8049a50, key=5) at samp.c:28
#3 0x8048744 in main () at main.c:8
(gdb) print *tree
$7 = {left = 0x0, right = 0x0, key = 5, vector = 0x8049908}

Note that successive events record the growing stack as function‘find’ walks the tree recursively. Since we have foundthe tree node we were looking for, this is the last call to‘find’ and the last tracepoint event in the log:

(gdb) tfind
Target failed to find requested trace event.

Because all the tracepoint events are stored in a buffer whichmay be accessed at random, Introspect provides the (somewhat eerie)ability to travel backwards in time. For example, the command‘tfind –’ will select the tracepoint eventimmediately preceding the current event. As earlier events areselected, the program will appear to ‘un-make’ itsrecursive calls to find:

(gdb) tfind –
Tracepoint 1, find (tree=0x80499b0, key=5) at samp.c:24
24 if (! tree)
(gdb) where
#0 find (tree=0x80499b0, key=5) at samp.c:24
#1 0x80484fa in find (tree=0x8049a50, key=5) at samp.c:28
#2 0x8048744 in main () at main.c:8
(gdb) tfind –
Tracepoint 1, find (tree=0x8049a50, key=5) at samp.c:24
24 if (! tree)
(gdb) where
#0 find (tree=0x8049a50, key=5) at samp.c:24
#1 0x8048744 in main () at main.c:8

Since all the events are available in the agent's buffer, youcan examine them in any order you want. After examining one event,you could travel backwards in time, and examine the previousevent.

Using these types of commands in combination with a debugger'sbuilt-in scripting language, the user can generate a listing orreport of the trace results, in any format desired.

Implementing An Agent-Based Debugger

In embedded systems development, the debugger and the programbeing debugged are often separated by a relatively slow serialchannel. In this case it is especially important to have a tracecollection agent that is separate from the debugger, and resides onthe target side of the serial channel. The debugger handlesinteractive requests—creating tracepoints, examining traceevents—while the trace collection agent handles the actualruntime collection of the trace data.

This target-side agent needs its own local tables describing thetracepoints (code locations, what to collect, and so on), and itsown local buffers in which to store the collected data, so that itdoes not need to interact with the debugger in any way while thetrace experiment is running. It is desirable to consume as littletarget memory and interrupt the user program for as short a time aspossible, so the agent should not parse source language expressionsor look up symbols. Therefore, we make the debugger download ahighly simplified version of the trace experiment to thetarget-side agent, with all symbols replaced by addresses,registers and offsets, and all expressions compiled into a compactand efficient bytecode language.

The trace buffer itself is organized as a log of tracepointevents; each log entry records the data collected when the programreached a tracepoint. When the user selects a tracepoint event toexamine, the target-side agent acts as a server, feeding thecollected data back to the debugger on request.

In this example, each time the program reaches the tracepoint atthe beginning of the ‘find’ function, the GDB agentmust record the contents of the tree node, and then evaluate theexpression:

tree->vector.p [ tree->vector.n- 1 ]

to find the element of the vector to record.

Evaluating this expression is a non-trivial task. In order tofind the expression's value, one needs to know:

  • The syntax of C expressions, to parse our expression
  • The list of arguments and local variables for‘find’, to help us associate the identifier‘tree’ with a specific variable in the program
  • The location of the argument ‘tree’', whether in aregister or on the stack,
  • The types of ‘tree’, ‘tree->vector’,‘tree->vector.p’, and so on,

so one can find the offsets of members within their structures,scale integer values appropriately for pointer arithmetic, and soon. Knowledge of C expression semantics, to help us execute thevarious operators.

Representing this sort of information requires relatively largeand complex data structures, and code for lexical analysis andparsing can be bulky. Although a source-level debugger has all thisinformation readily available, such complexity is usually beyondthe scope of the agent, which must execute on a target machine withlimited resources.

So, the critical question is: How can an agent accept arbitrarysource-level expressions from user at debug time, and evaluate themon the target system using only a small amount of code and asquickly as possible?

The solution is to make the agent responsible for carrying outonly machine-level operations—loads, register references,two's-complement arithmetic, and so on—and to isolate allsymbolic processing in the debugger, which has the necessaryinformation and resources. When the user gives GDB a tracepointexpression, for example, it compiles the expression to a simplemachine-independent bytecode language, and sends the bytecode tothe agent for evaluation.

The vocabulary of machine-level operations needed to evaluate Cexpressions is rather small. In the case of Introspect, the currentbytecode interpreter understands around forty bytecodes, each ofwhich is implemented by a line or two of C code. The interpreter,including error checking code, occupies three kilobytes of code onthe target machine.

Consider the first expression collected in the exampleabove:

*tree

In the ‘find’ function, ‘tree’ is thefirst argument. For the SPARC, GDB might compile this expression tothe following bytecode sequence:

reg 8 const8 16 trace end

The bytecode engine maintains a stack of values, which is emptywhen evaluation begins. Most instructions pop an argument or twooff the stack, perform some operation on them, and push the resultback on the stack, where the next instruction can find it. Takingeach of the instructions above in turn:

reg 8
(This bytecode pushes the value of register number eight on thestack. (Each stack element is as wide as the largest value theprocessor directly supports.) In our example, GDB knows that‘tree’ lives in register eight, so this instructionpushes the value of ‘tree’ on the stack.)

const8 16
(This bytecode pushes the constant value 16 on the stack. The‘8’ in the instruction's name indicates that itsoperand is a single byte long; there are bytecode instructionsnamed ‘const16’ and ‘const32’, for pushinglarger instructions. In our example, GDB knows that the structureis sixteen bytes long, so this instruction pushes the size of‘*tree’ on the stack.)

trace
(This bytecode pops an address and a length from the stack, andrecords that memory in the tracepoint event log. In our example,the stack contains the ‘tree’ pointer, and the size ofthe node to which it points, so this instruction will record thefull contents of that node. The trace instruction does not assumethat the address is valid; if the interpreter gets a memory accessfault when it attempts to record the value, it aborts execution ofthat expression.)

end
(This instruction tells the engine to stop executing, marking theend of the bytecode expression.)

In addition to the bytecode expression, GDB also sends the agenta mask of the registers the expression uses. When the agent hits atracepoint, it records all the specified registers, in addition torunning the bytecode expressions.

Thus, after evaluating this expression, the agent will havesaved the value of register eight and the contents of the tree nodein the event log. This gives us both the value of the pointer“tree”, and the value of the node that it points to.Later, if the user selects this tracepoint event and asks GDB to“print *tree”, GDB will:

  1. Ask the target agent for the value of register 8 (the pointer“tree”), and then
  2. Ask the target agent for the contents of memory at the addresspointed to by the value it just fetched (the tree node).

Since both values are in the trace buffer, both requests willsucceed, and the expression “*tree” will be printedsuccessfully.

Let's consider a more complex example:

tree->vector. [ tree->vector.n- 1 ]

GDB will compile this expression to the bytecode:

reg 8
const8 8
(These first instructions push the value of ‘tree’ onthe stack, followed by the constant value 8.)
add
(The ‘add’ instruction pops the top two values off thestack (in this case, ‘tree’ and 8), adds them, andpushes the sum on the stack. This computes the address of‘tree->vector’, since ‘offsetof (struct tree,vector)’ is eight.)
trace_quick 4
(This bytecode logs the value of the ‘n’ bytes whoseaddress is on the top of the stack, without disturbing the stack.In our case, the top of the stack is the address of‘tree->vector’, which is a pointer; since pointersare four bytes long on our target architecture, this records thevalue of ‘tree->vector’ in the event log. Like‘trace’, it aborts evaluation of the expression if theaddress is invalid.)
ref32
(This bytecode pops an address off the stack, and pushes thecontents of the 32-bit word it points to. Thus, the stack nowcontains the value of ‘tree->vector’, which is theaddress of the node's ‘struct vector’. (Naturally,there are analogous instructions, ‘ref16’ and‘ref8’, for fetching values of other sizes.)
const8 4
add
trace_quick 4
ref32
(These instructions compute the address of‘tree->vector.p’, record its contents in the eventlog, and leave its value on the top of the stack. At this point,the value of ‘tree->vector.p’ is the only thing onthe stack.)
reg 8
const8 8
add
trace_quick 4
ref32
trace_quick 4
ref32
(These instructions do the same for‘tree->vector.n’; since ‘n’ is the firstelement of a ‘struct vector’', its offset is zero, andwe do not need to generate an ‘add’ bytecode. At thispoint, the stack contains ‘tree->vector.p’ (on thebottom) and ‘tree->vector.n’ (on top).)
const8 1
sub
(The ‘sub’ instruction pops the top two values off thestack, subtracts the first from the second, and pushes theirdifference back on the stack. Thus, these instructions subtract onefrom the top of the stack, yielding ‘tree->vector.n– 1’).
const8 16
mul
add
(The C language's rules for pointer arithmetic state that, when weadd a pointer to an integer, we must first multiply the integer bythe size of the object pointed to. These bytecodes carry out thatscaling operation, and then perform the addition.)

(The top of the stack now contains the address of‘tree->vector.p [ tree->vector.n – 1 ]’, which isa ‘struct point’.)
const8 16
trace
end

A ‘struct point’ is sixteen bytes long, containingtwo doubles. We record the value of the ‘struct point’whose address is on the top of the stack, and exit.

There are several things to note about this example.

  • Information implicit in the source code (the size of a‘struct tree’; the offset of `p' within a ‘structvector’) is made explicit in the bytecode. This means theagent does not need information about the expression'scontext—types, scopes, etc.—in order to evaluate it.
  • The expression records not just the final value of theexpression, but any data touched in the process of evaluating it.Thus, when GDB evaluates the same expression itself, everything itneeds is sure to be in the agent's buffer.
  • The expression only records the values it actually touches. Forexample, since this expression does not reference‘tree->left’ or ‘tree->right’, thosevalues will not be in the trace buffer. (In the example sessionabove, we actually collected ‘*tree’ as well, sotree->left and tree->right were available.) There arebytecodes available for manipulating 64-bit values, on thosetargets that support them. This example was created on a 32-bitmachine, so the 64-bit instructions do not appear.

Agent-based debugging is as un-intrusive as possible. InIntrospect, for instance, Cygnus currently has a highly portableand configurable version of the target collection agent. It can bebuilt with or without certain components (such as the expressionevaluator) in order to fit in a small memory footprint. The maximumsize of the trace buffer is also configurable.

In addition, improvements in the technology are still possible,and Cygnus is looking into implementing them. The currentimplementation of the expression evaluator in Introspect is limitedto expressions without side effects. Adding operators with sideeffects would not be at all difficult. A somewhat more ambitiousenhancement would be adding function calls to the expressionlanguage. With both side-effect operators and function calls, youwould have a fairly extensive ability to patch your program andmodify its behavior on the fly.

Since we do have the expression evaluator on the target, itwould be possible to attach a condition expression to eachtracepoint, such that data would be collected only if the conditionwere true. The same might be done for breakpoints, makingconditional breakpoints much faster (since the condition would nolonger have to be evaluated by the debugger.)

Finally, the current portable implementation requires that thetarget program be stopped at a breakpoint before GDB can query thetrace buffer. A very powerful enhancement would be to allow GDB toread from the trace buffer even when the target is running.

Even without such improvements, however, agent-based debuggingprovides an inexpensive opportunity for developers to look intotheir software's behavior without the timing impact of traditionalprintf debugging. You can't entirely defeat the Heisenbergprinciple, but agent-based debuggers make it a far less imposingobstacle.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.