CMP EMBEDDED.COM


The software detective: first-fault data capture



Embedded Systems Design

Develop a software architecture for troubleshooting high-availability systems.

It is a dark and stormy night. At the customer's site, the newly installed system hums quietly. Suddenly, at 3:00 AM, everything stops. An hour later, you are wakened from your bed by the piercing screams of the telephone. As you listen to the support engineer explain the situation, you immediately ask the vital questions, "Are there any error messages? Is the system still in that state? Can I get remote access to take a look?"

As embedded system developers, we produce systems that often fulfill vital needs. Our systems must be both highly reliable and constantly available. Their very criticality shapes the answers to the questions we ask on those late-night emergency calls. Yes, the system produced error messages, but it's no longer in the problem state. The customer couldn't wait around to troubleshoot it. He restarted the system, or it restarted itself, because it needs to be online. In many cases, system downtime is completely unacceptable. Similarly, when the system processes sensitive information, it may still be in the failed state but the customer won't permit remote access to debug it.

How can developers debug critical applications, when by the time they learn of the problem, the trail is long cold? This question is particularly important to developers of enterprise-class systems and servers, telecommunications and switching boxes, highly distributed networked systems, and mobile devices.

Reading the error messages
Traditionally, the first step in diagnosis is to capture the logs that the system produced and look at the error messages. Good error messages go a long way toward revealing a problem. But error messages by their very nature are inappropriate tools to track more subtle, insidious bugs.

Error messages indicate only the symptoms of the underlying problem: they show the immediate cause of the failure, not its origin. These messages tell developers the system's state when the error is detected. But for complex misbehavior, developers need to know how the system got into the error state. That information is usually produced long before the problem is detected.

In practice, error messages usually leave much to be desired. The most common problem is that the developer doesn't put enough information into the message. Another is that crucial information relevant to the particular problem is omitted because the developer didn't anticipate the message being generated under those circumstances. Yet another difficulty of relying on error messages is that many types of problems don't produce them at all.

How to lose a customer
Developers are placed in an awkward position. There can be enormous pressure to find and fix the problem, and quickly. But the bug is invariably transient, subtle, and insidious, because all of the obvious problems were found in testing before the product shipped. So it's unlikely that the error messages contain the key information needed to solve such problems the first time they appear.

Software-development organizations traditionally rely on two approaches to deal with transient bugs in the field. One is to turn on "verbose logging" or "instrumentation code" at the customer site. The additional code logs a greater volume of information as the system runs. Another is to make an "instrumented" version of the software and send it to the customer. This code logs additional information that targets the specific problem being diagnosed.

But both approaches suffer from a key flaw--the problem must occur again! A transient problem may not occur for many months. If the product is shipping to hundreds of customers each month, a delay in finding the problem can spell disaster.

It also places developers in the highly uncomfortable position of having to utter the five magic words guaranteed to infuriate a customer: "Can you reproduce the problem?" Something better is needed, a way to catch the problem the first time it occurs. It shouldn't require changing settings, building instrumented software, or asking a customer to reboot.

First-fault data capture
During development, we can design a software architecture that includes its own trace data recorder (TDR) for the system. Analogous to a commercial aircraft's "black box," the TDR records information about the system's dynamic behavior. After a crash, the information in the TDR helps to analyze the failure. The data can provide valuable insight into complex behavioral problems that result from sporadic changes in system inputs, unexpected interactions between subsystems, or bugs in low-level software.

The TDR architecture must process "high-rate" data that would bog down a traditional logging mechanism. Because it's always on, the TDR must be fast, lightweight, and compact to keep it from affecting system performance. If using TDR functions degrades performance, developers won't use them.

At the heart of the system is a library that applications invoke to place time-tagged data into circular trace buffers in system RAM. Another TDR component, the dump agent, catalogs the buffers. In the event of a system failure, the dump agent freezes the buffers to prevent critical data from being overwritten. It then dumps the trace buffers to nonvolatile storage. After a dump occurs, a retrieval agent transports it out of the system. The dump may be analyzed using a software tool generated automatically from the system's source code. Figure 1 illustrates a simplified architecture for a TDR system.
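The core idea of the trace library and the dump agent's freeze step can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the architecture's actual code: the names tdr_record, tdr_buffer, tdr_trace, tdr_freeze, and tdr_dump, the record layout, and the eight-slot capacity are all hypothetical. The caller supplies the time tag, and the sketch assumes a single writer, so a real system would need to add whatever locking or interrupt masking its platform requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Records per buffer. Hypothetical value; a power of two keeps indexing cheap. */
#define TDR_CAPACITY 8u

/* One fixed-size, time-tagged trace record (layout is an assumption). */
typedef struct {
    uint32_t timestamp;  /* time tag supplied by the caller */
    uint16_t event_id;   /* which trace point fired */
    uint16_t reserved;   /* padding, always written as zero */
    uint32_t arg;        /* one word of event-specific data */
} tdr_record;

/* A circular trace buffer held in RAM. */
typedef struct {
    tdr_record slots[TDR_CAPACITY];
    uint32_t   next;     /* total records written so far */
    bool       frozen;   /* set on failure to stop overwrites */
} tdr_buffer;

void tdr_init(tdr_buffer *b)
{
    b->next = 0;
    b->frozen = false;
}

/* Fast path: no formatting, no I/O, just a store into the next slot.
 * The oldest record is overwritten once the buffer is full. */
void tdr_trace(tdr_buffer *b, uint32_t now, uint16_t event_id, uint32_t arg)
{
    if (b->frozen) {
        return;  /* preserve the evidence after a failure */
    }
    tdr_record *r = &b->slots[b->next & (TDR_CAPACITY - 1u)];
    r->timestamp = now;
    r->event_id = event_id;
    r->reserved = 0;
    r->arg = arg;
    b->next++;
}

/* Freeze step: stop further writes so the dump sees a stable buffer. */
void tdr_freeze(tdr_buffer *b)
{
    b->frozen = true;
}

/* Copy the surviving records, oldest first, into out and return the count.
 * A real dump agent would write them to nonvolatile storage instead.
 * Wraparound of the 32-bit write counter is ignored in this sketch. */
size_t tdr_dump(const tdr_buffer *b, tdr_record *out, size_t out_len)
{
    assert(out != NULL);
    uint32_t count = (b->next < TDR_CAPACITY) ? b->next : TDR_CAPACITY;
    uint32_t first = b->next - count;
    size_t n = 0;
    for (uint32_t i = 0; i < count && n < out_len; i++) {
        out[n++] = b->slots[(first + i) & (TDR_CAPACITY - 1u)];
    }
    return n;
}
```

An application would call tdr_trace at each point of interest during normal operation. On failure, a handler would call tdr_freeze and then tdr_dump. Keeping records fixed-size and binary, with interpretation deferred to an offline tool, is one way to keep the always-on write path cheap.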
