The software detective: first-fault data capture

It is a dark and stormy night. At the customer's site, the newly installed system hums quietly. Suddenly, at 3:00 AM, everything stops. An hour later, you are wakened from your bed by the piercing screams of the telephone. As you listen to the support engineer explain the situation, you immediately ask the vital questions: “Are there any error messages? Is the system still in that state? Can I get remote access to take a look?”

As embedded system developers, we produce systems that often fulfill vital needs. Our systems must be both highly reliable and constantly available. Their very criticality shapes the answers to the questions we ask on those late-night emergency calls. Yes, the system produced error messages, but it's no longer in the problem state. The customer couldn't wait around to troubleshoot it. He restarted the system, or it restarted itself, because it needs to be online. In many cases, system downtime is completely unacceptable. Similarly, when the system processes sensitive information, it may still be in the failed state, but the customer won't permit remote access to debug it.

How can developers debug critical applications when, by the time they learn of the problem, the trail is long cold? This question is particularly important to developers of enterprise-class systems and servers, telecommunications and switching boxes, highly distributed networked systems, and mobile devices.

Reading the error messages
Traditionally, the first step in diagnosis is to capture the logs that the system produced and look at the error messages. Good error messages go a long way toward revealing a problem. But error messages by their very nature are inappropriate tools for tracking more subtle, insidious bugs.

Error messages indicate only the symptoms of the underlying problem; they show the immediate cause of the failure. These messages tell developers the system's state when the error is detected. But for complex misbehavior, developers need to know how the system got into the error state. That information is usually produced long before the problem is detected.

In practice, error messages usually leave much to be desired. The most common problem is that the developer doesn't put enough information into the message. Another is that crucial information relevant to the particular problem is omitted because the developer didn't anticipate generating the message under those circumstances. Yet another difficulty of relying on error messages is that many types of problems don't produce them at all.

How to lose a customer
Developers are placed in an awkward position. There can be enormous pressure to find and fix the problem, and quickly. But the bug is invariably transient, subtle, and insidious. All of the obvious problems were found in testing prior to shipping the product. So it's unlikely that the error messages contain the key information needed to solve problems the first time they appear.

Software-development organizations traditionally rely on two approaches to deal with transient bugs in the field. One is to turn on “verbose logging” or “instrumentation code” at the customer site. The additional code logs a greater volume of information as the system runs. Another is to make an “instrumented” version of the software and send it to the customer. This code logs additional information that targets the specific problem being diagnosed.

But both approaches suffer from a key flaw: the problem must occur again! A transient problem may not recur for many months. If the product is shipping to hundreds of customers each month, a delay in finding the problem can spell disaster.

It also places developers in the highly uncomfortable position of having to utter the five magic words guaranteed to infuriate a customer: “Can you reproduce the problem?” Something better is needed, a way to catch the problem the first time it occurs. It shouldn't require changing settings, building instrumented software, or asking a customer to reboot.

First-fault data capture
During development, we can design a software architecture that includes its own trace data recorder (TDR) for the system. Analogous to a commercial aircraft's “black box,” the TDR records information about the system's dynamic behavior. After a crash, the information in the TDR helps to analyze the failure. The data can provide valuable insight into complex behavioral problems that result from sporadic changes in system inputs, unexpected interactions between subsystems, or bugs in low-level software.

The TDR architecture must handle “high-rate” data that would bog down a traditional logging mechanism. Because it's always on, the TDR must be fast, lightweight, and compact to keep it from affecting system performance. If using TDR functions degrades performance, developers won't use them.

At the heart of the system is a library that applications invoke to place time-tagged data into circular trace buffers in system RAM. Another TDR component, the dump agent, catalogs the buffers. In the event of a system failure, the dump agent freezes the buffers to prevent critical data from being overwritten. It then dumps the trace buffers to nonvolatile storage. After a dump occurs, a retrieval agent transports it out of the system. The dump may be analyzed using a software tool generated automatically from the system's source code. Figure 1 illustrates a simplified architecture for a TDR system.
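The trace-and-freeze mechanism can be sketched in C roughly as follows. This is a minimal illustration, not the article's actual library: the names (tdr_trace, tdr_freeze), the fixed-size entry format, and the buffer dimensions are all assumptions made for the example.

```c
#include <assert.h>   /* used only by the self-checks below */
#include <stdint.h>
#include <string.h>

/* Illustrative sizes; a real TDR would tune these per subsystem. */
#define TDR_ENTRIES 256
#define TDR_MSG_LEN 32

typedef struct {
    uint64_t timestamp;          /* time tag, e.g. a free-running cycle counter */
    char     msg[TDR_MSG_LEN];   /* fixed-size payload: no heap, no formatting */
} tdr_entry_t;

typedef struct {
    tdr_entry_t entries[TDR_ENTRIES];
    uint32_t    head;    /* next slot to overwrite */
    uint32_t    count;   /* number of valid entries, saturates at TDR_ENTRIES */
    int         frozen;  /* set by the dump agent when a failure is detected */
} tdr_buffer_t;

void tdr_init(tdr_buffer_t *b)
{
    memset(b, 0, sizeof *b);
}

/* Record one time-tagged entry; the oldest data is silently overwritten.
 * The fast path is a few stores into RAM -- no I/O, no allocation. */
void tdr_trace(tdr_buffer_t *b, uint64_t now, const char *msg)
{
    if (b->frozen)
        return;                      /* preserve the failure snapshot */
    tdr_entry_t *e = &b->entries[b->head];
    e->timestamp = now;
    strncpy(e->msg, msg, TDR_MSG_LEN - 1);
    e->msg[TDR_MSG_LEN - 1] = '\0';
    b->head = (b->head + 1) % TDR_ENTRIES;
    if (b->count < TDR_ENTRIES)
        b->count++;
}

/* Called by the dump agent on failure, before writing the buffer
 * to nonvolatile storage. */
void tdr_freeze(tdr_buffer_t *b)
{
    b->frozen = 1;
}
```

Storing raw, fixed-size entries and deferring all formatting to the offline analysis tool is what keeps the recording path cheap enough to leave on in production.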
