Programmers often use visualization tools to investigate bugs they already know exist. But when used in novel, almost casual ways, these tools can be more powerful. Is there a new tool here for finding elusive bugs?
You design a tool to solve a nasty debugging problem, you research the history of the tool, and you test it extensively to ensure it works. Only after your employer has sold it for several years do you discover that the garden sprayer you set out to build turned out to be a crop duster instead.
This is the story of software visualization tools and how they serve as powerful debuggers even though that's not the purpose for which they were created.
A software visualization tool logs the activity of a system and displays a graphical representation of that activity on a timeline. The result looks like a strip chart of system activity.
The first graphical multitasking visualization tool I saw was coded for the BBN Butterfly multiprocessor by John Mellor-Crummey. For him, it was an afterthought. His dissertation turned on a tool called Instant Replay that recorded the details of interactions in the execution of a multiprocessor program. Using the log, he could force the program to execute again with precisely the same interactions. This allowed a programmer to replay a failed execution until the bug was found.
The main purpose of the execution log was to enable replay, but Mellor-Crummey also wrote a program that used the log data to chart the activity of the system at various levels of detail. The most exciting chart hid details about flow of control and just showed access to shared memory, especially locking activity in shared memory. That type of chart exposed problems like deadlocks and starvation as if the tool had been designed for that purpose. Since that time, commercial tools that create a chart of system activity over time have become available.
I know of at least three commercial system-activity visualization tools specifically for real-time/embedded systems.
These tools are designed to find bugs that are hard to find with the usual “breakpoint and explore” debugging paradigm. They work beautifully on “Heisenberg bugs,” so called because they disappear when you start looking for them.
A logging facility has much less impact on a system than other debugging tools (so it is unlikely to disturb the system so much that the bug disappears), but it freezes a record of the bug in a log that can be inspected at leisure. You specify a trigger for the logging component, perhaps a bus fault or the nth invocation of some system service. The logging daemon captures a record of events around the trigger event and transfers them to the tool that formats them graphically for analysis.
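The capture scheme can be pictured as a circular buffer that records events continuously; when the trigger fires, logging continues just long enough to center the trigger in the captured window. A minimal sketch of the idea, with an invented event type and a toy capacity:

```c
#include <stddef.h>

#define LOG_CAPACITY 8          /* real tools hold thousands of events */

/* One logged event: which task did what, and when. */
typedef struct {
    unsigned long timestamp;
    int           task_id;
    int           event_code;   /* e.g. system-call entry, lock acquire */
} log_event;

/* Circular buffer: the oldest events are overwritten until capture stops. */
typedef struct {
    log_event events[LOG_CAPACITY];
    size_t    next;             /* next slot to overwrite */
    size_t    count;            /* total events ever logged */
} event_log;

void log_init(event_log *log) {
    log->next = 0;
    log->count = 0;
}

void log_record(event_log *log, unsigned long ts, int task, int code) {
    log->events[log->next].timestamp  = ts;
    log->events[log->next].task_id    = task;
    log->events[log->next].event_code = code;
    log->next = (log->next + 1) % LOG_CAPACITY;
    log->count++;
}

/* After the trigger fires, keep logging for half the buffer so the
   trigger event ends up in the middle of the captured window. */
size_t post_trigger_budget(void) {
    return LOG_CAPACITY / 2;
}
```

The continuous overwrite is what makes pre-trigger history available: the buffer already holds the events that led up to the trigger when it fires.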
The normal cycle for using a visualization tool starts with a bug. Say the system freezes up. Standard debugging utilities show that two tasks are making no progress; their last activity was to wait for a lock. Inspection of the code shows that they are both supposed to wait for that lock, but not at the same time. In fact, the design specifically calls for those two tasks to hand the lock back and forth between themselves. This is fodder for the visualization tool.
The visualization cycle iterates between refining the trigger specification and inspecting the log until the log contains a clear occurrence of the bug and enough surrounding activity to show the bug's original cause. In this example, it shows that one task tried to acquire the lock twice in a row without allowing the other task to run. The programmer says “Doh!” and hits himself on the forehead.
Several years ago, I designed a visualization tool called HawkEye. Since then I've done most of the coding on it and feel rather possessive about that piece of software. I designed it for two kinds of use: (1) finding bugs caused by race conditions in communication between tasks and (2) performance tuning over short intervals.
I had the charts from Instant Replay at the front of my mind when I designed HawkEye. I was also inspired by hundreds of hours I spent working with an HP logic analyzer. Basically, HawkEye is a logic analyzer for software. You select a trigger and start it running. When the trigger event occurs, HawkEye gives you a chart of the timeline around the trigger. You choose the trigger so the trace includes the time interval that contains the bug, or the period that needs to be optimized. HawkEye builds a graphical presentation of the data it collects. It tries to make race conditions and performance problems easy to see.
Upon completion of the beta testing of a new release of HawkEye, we decided to collect stories: we thought we could use them to refine HawkEye's operation.
It turns out that people use it “wrong.” They seldom follow the suggested methodology; instead they rely on serendipity. In fact, users most often turn HawkEye on systems that have no known bugs. They browse system activity at random and, in the process, find lots of things to fix.
Here are five representative stories: one about “classical” use of the tool in a normal debugging setting, one about classical debugging in a totally nonclassical setting, and three about finding unexpected bugs. Except for the two that I personally recorded, these stories have had their identifying marks removed to preserve the dignity of the bug creators.
First a story about the classical use of a visualization tool. I once wrote a benchmark program designed to measure context-switching times, but it was locking up after a few context switches. It reported that system calls that should have taken microseconds were actually taking seconds. The benchmark was clearly deadlocking soon after it started, so I chose to trigger on the initial fork of the benchmark.
Figure 1: A view to a kill
You can see the bug in Figure 1. Yellow lines are threads waiting for a semaphore. Little yellow arrows are semV (unlock) system calls. Yellow triangles are semP (lock) system calls. Icons with dark outlines mark entry to system calls. Icons with light outlines mark return from system calls. In the middle of the screen you can see threads 11 and 12 locking a semaphore and waiting. That's the deadlock. Just to the left of the semP in thread 12 you can see the cause. There's one semV that releases thread 11 (see the interaction line), then thread 12 releases the semaphore again.
Releasing a binary semaphore twice does no more than releasing it once. Somehow I had thought that thread 11 would wake up immediately when 12 released the lock, and the second semV would not happen until thread 11 had awakened and relocked the semaphore. Maybe those two purple icons (representing a set priority system call) between the semV's in thread 12 had something to do with it.
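The failure mode can be modeled without any threads at all. A binary semaphore saturates at one release, so the second semV in thread 12 is simply lost; when both threads later call semP, only one wakeup is available. A toy model (the semV/semP names follow the article; the struct and return convention are invented for illustration):

```c
/* Toy binary semaphore: value is 0 (locked) or 1 (free). */
typedef struct { int value; } bin_sem;

/* semV: release. Extra releases saturate at 1 -- the bug in the story. */
void semV(bin_sem *s) {
    s->value = 1;
}

/* semP: lock. Returns 1 if the caller got the semaphore, 0 if it
   would have blocked (our stand-in for a thread going to sleep). */
int semP(bin_sem *s) {
    if (s->value == 1) { s->value = 0; return 1; }
    return 0;
}
```

Two semVs in a row leave exactly one wakeup behind: the first semP succeeds, and the second would block, which is the deadlock visible in Figure 1.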
I didn't record how I removed the deadlock, but I did. The benchmark now runs correctly.
This is what visualization tools were designed for. I knew there was a bug that involved the interaction of multiple components in the system. The tool recorded a snapshot of system activity around the problem, which allowed me to find it.
Story number two: John was testing a production set-top box, delivered for use as a sales tool. It would not run Java with graphics. If Java was run without graphics, it would work fine, but no amount of fussing with scripts and environment variables would get Java to open a window. The software that came with the set-top box worked fine. The software John added to it worked fine on other platforms. It looked like something went wrong when they were combined.
John is a good technician, not a programmer, but even a skilled programmer would have had difficulties. Most of the software in the set-top box executed from flash memory programmed at the factory. None of the information needed for symbolic debugging was present.
John had one clue. A process that should have run the window manager was missing. It looked like it should have started, but it was not there.
He added the logging daemons to the system and started them. After checking with a programmer, he set the trigger to place the first instance of a particular system call in the middle of the log.
The problem was right before the trigger. A low-level graphics component was calling for a service from the graphics driver; the request returned an unknown service error. Microseconds later that process got a memory fault on instruction fetch and was terminated by the operating system.
The problem lay in two places. The component calling for the service should have checked the return code, and the driver should have supported the service. John brought the log file to a programmer who immediately identified the problem and solved it.
The key thing here is that John was not a programmer, and the failing system was for all practical purposes a deployed product. A graphical view of system execution changed the bug report from “I cannot get Java to work with graphics,” to “There's something wrong in the implementation of this graphics service or the way it is used.”
Wag the dog
A system was running correctly, but slowly. The system seemed very sluggish and was showing heavy CPU use, but diagnostics found no problem. There were plenty of good explanations for the heavy load, and the performance was good enough. The system was declared correct.
Later, more software was added to the system. A programmer was randomly looking around, playing. He recorded a snapshot from the execution of a test program; looking at this snapshot, his attention was drawn to a specialized logging process. Since the device the logging program was monitoring was not running, the logger should have been blocked in an event wait, but the display showed that it was furiously busy. It was executing event-wait system calls as fast as it could, using up every available CPU cycle. All those event-wait system calls returned immediately. They were all waiting on the same event, and they all came from the same address. The logging process should have been waiting in one event-wait call during the entire execution of the test program; something was clearly wrong.
On a little investigation it turned out that the event was not configured properly. After one event wait had returned, subsequent event waits would return without blocking. Fortunately (or perhaps unfortunately) the program checked to see if whatever it was waiting for had happened and waited again if it had not.
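The misconfiguration can be sketched as an event object that fails to clear its signal when a wait completes. Every subsequent wait then returns immediately, and the "check and wait again" loop becomes a busy loop. A toy model (the struct and flag names are invented; real event APIs differ):

```c
#include <stdbool.h>

/* Toy event object. When auto_clear is false (the misconfiguration
   in the story), the event stays signaled after a wait returns. */
typedef struct {
    bool signaled;
    bool auto_clear;
} event;

/* Returns true if the wait completed without blocking. */
bool event_wait(event *e) {
    if (e->signaled) {
        if (e->auto_clear)
            e->signaled = false;   /* consume the signal */
        return true;               /* returned immediately */
    }
    return false;                  /* a real kernel would block here */
}
```

With the signal never consumed, the logger's loop of event-wait calls spins at full speed: exactly the furiously busy trace the programmer saw.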
The programmer fixed the event configuration. The system sped up wonderfully.
This error might never have been found. It did not cause a software exception, generate incorrect results, or make the system lock up. The system even passed its test suite. Nevertheless, the problem was blatantly obvious as soon as someone looked at the system with a visualization tool.
Not a counting semaphore
Here's another story of a bug that didn't cause an obvious failure or even a major performance problem. The system seemed fine, but a programmer ran a visualization trace just to see what it looked like. The trace had an unusual number of semV operations. Many of them came in tight groups: 10 semV calls in quick succession, all but the first one returning an error.
It turns out that the person who ported that program from UNIX did not notice that he had used binary semaphores where the original program used counting semaphores. A binary semaphore does not become any more released if you semV several times in a row. The program worked with the binary semaphore, but the queue the semaphore was managing never filled beyond one entry. The system was context switching too often, by about a factor of 10. It was still well within its performance requirements, but it ran (and looked) better when the binary semaphore was replaced with a counting semaphore.
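The contrast between the two semaphore types explains the trace pattern exactly. A sketch, with an invented struct that models both flavors (real semaphore APIs differ):

```c
/* Toy semaphore that models both flavors from the story. */
typedef struct {
    int value;    /* number of stored wakeups */
    int binary;   /* nonzero: saturate at 1, like the buggy port */
} sem;

/* semV: release. On a counting semaphore wakeups accumulate; on a
   binary semaphore the extra releases fail, as the trace showed.
   Returns 0 on success, -1 on error. */
int semV(sem *s) {
    if (s->binary) {
        if (s->value == 1) return -1;   /* already released: error */
        s->value = 1;
        return 0;
    }
    s->value++;
    return 0;
}
```

Ten semVs in a burst leave a counting semaphore holding ten wakeups, so the consumer can drain ten queue entries per context switch. The binary semaphore holds one, forcing a context switch per entry: the factor-of-10 overhead in the story.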
Not a bug
One HawkEye trace contained a nice picture of a bug I was chasing, and out toward the end of the trace, a nasty looking wedge. (See Figure 2.) Zooming in on the wedge I could see that it consisted of about a hundred event-set system calls, all on the same event. All those event sets combined to release one event wait. The disturbing thing was that one of the event-set system calls would have been enough to release the waiting task. This happened at the bottom of a write call to a pseudo-TTY device (telnet sessions use these to make a socket connection look like a terminal). The terminal part of the pseudo-TTY was using events to control the movement of data to the socket side of the device. It was setting the event for every character. I was sure I'd found a bug.
Figure 2: A nasty looking trace that's not what it seems
Although the trace has the jarring look of a bug, it is not. It is a graphical depiction of a relatively crude interface between a character device and a packet device. “Fixing” it by letting the operating system context switch on each character slows communication to about 300 bps. Setting the event once per line instead of once per character sounds like a good idea, but a pseudo-TTY is not line oriented. Sending a line at a time breaks programs like full-screen editors.
Sometimes correct implementation looks ugly.
These and other stories show that visualization can find many bugs in “fully-debugged” software. It never failed when it was focused on finding the cause of a particular defect, but it also often seemed to find a problem when the programmer was just looking around. The latter category dominated our log. The bugs that programmers tripped over could all have been found by single-stepping through the software or by making more thorough assertions in the programs. They were not the tricky race conditions that visualization tools are designed to target. They were discovered because visualization tools make it easy to collect data on the execution of the system and present it in a way that brings the pattern-detection part of an engineer's brain to bear on it.
I have to wonder whether these stories are about problems that would be called bugs outside the embedded domain. Perhaps I've been working on a tight CPU and memory budget so long that I've lost my perspective. Most of the serendipity stories here are about systems that only suffered from performance problems. Here are more examples:
- A program allocates dozens of little chunks of memory one after the other. Later it frees all of them in sequence. There's no reason it could not allocate one contiguous chunk of memory for all those structures.
- A program allocates and immediately frees a chunk of memory again and again. It could have allocated and freed it once.
- A program opens and closes a file many times.
- Another program opens a file many times without closing it.
- A program masks signals repeatedly, then unmasks them the same number of times.
- A program tries a system call that returns an unknown service error. Immediately it tries it again. (What was the programmer thinking?)
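The first pattern in the list above is typical of how easy these are to fix once seen. A sketch of the wasteful shape and its replacement (the node type and sizes are invented for illustration):

```c
#include <stdlib.h>

typedef struct { int payload; } node;

/* Wasteful: dozens of little allocations, freed one by one later. */
node **alloc_many(size_t n) {
    node **nodes = malloc(n * sizeof *nodes);
    if (nodes == NULL) return NULL;
    for (size_t i = 0; i < n; i++)
        nodes[i] = malloc(sizeof **nodes);
    return nodes;
}

/* Better: one contiguous chunk holds all the structures, so there is
   one allocation to make and one to free. */
node *alloc_once(size_t n) {
    return malloc(n * sizeof(node));
}
```

On a tight memory budget the contiguous version also avoids per-block allocator overhead and heap fragmentation, which is why the pattern jumps out in an embedded trace.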
Is this kind of behavior acceptable in desktop or mainframe software? Maybe visualization is much more valuable for embedded/real-time programmers than for other programmers.
We've been designing and using visualization tools as if they were logic analyzers for software. The tools collect and present details of execution around a particular bug. That is an important use, but the tools also show high-level patterns in software execution. The user stories tell us that programmers should inspect their products using every tool available, and that we need more and better tools for watching software run.
I expected the user stories to reveal that HawkEye needed to collect and display more detail and that it needed more sophisticated triggering. I learned, however, that I wasn't headed the same way as my users. They were using HawkEye as an easy way to see inside software black boxes. It's fair to say that they were playing with it and finding bugs by accident.
Peter Dibble is a distinguished engineer with TimeSys, a member of the Real-Time for Java Expert Group, and the technical lead for the Real Time Specification for Java. In addition to his Java and real-time research, he has written numerous technical articles and several books. He has a PhD in computer science from the University of Rochester. Contact him at .