Develop a software architecture for troubleshooting high-availability systems.
It is a dark and stormy night. At the customer's site, the newly installed system hums quietly. Suddenly, at 3:00 AM, everything stops. An hour later, you are wakened from your bed by the piercing screams of the telephone. As you listen to the support engineer explain the situation, you immediately ask the vital questions, “Are there any error messages? Is the system still in that state? Can I get remote access to take a look?”
As embedded system developers, we produce systems that often fulfill vital needs. Our systems must be both highly reliable and constantly available. Their very criticality shapes the answers to the questions we ask on those late-night emergency calls. Yes, the system produced error messages, but it's no longer in the problem state. The customer couldn't wait around to troubleshoot it. He restarted the system, or it restarted itself, because it needs to be online. In many cases, system downtime is completely unacceptable. Similarly, when the system processes sensitive information, it may still be in the failed state but the customer won't permit remote access to debug it.
How can developers debug critical applications, when by the time they learn of the problem, the trail is long cold? This question is particularly important to developers of enterprise-class systems and servers, telecommunications and switching boxes, highly distributed networked systems, and mobile devices.
Reading the error messages
Traditionally, the first step in diagnosis is to capture the logs that the system produced and look at the error messages. Good error messages go a long way toward revealing a problem. But error messages by their very nature are inappropriate tools to track more subtle, insidious bugs.
Error messages indicate only the symptoms of the underlying problem–they show the immediate cause of the failure. These messages tell developers the system's state when the error is detected. But for complex misbehavior, developers need to know how the system got into the error state. That information is usually produced long before the problem is detected.
In practice, error messages usually leave much to be desired. The most common problem is that the developer doesn't put enough information into the message. Another is that crucial information relevant to the particular problem is omitted. The developer didn't anticipate generating the message under the circumstances. Yet another difficulty of relying on error messages is that many types of problems don't produce them.
How to lose a customer
Developers are placed in an awkward position. There can be enormous pressure to find and fix the problem, and quickly. But the bug is invariably transient, subtle, and insidious. All of the obvious problems were found in testing prior to shipping the product. So it's unlikely that the error messages contain the key information needed to solve problems the first time they appear.
Software-development organizations traditionally rely on two approaches to deal with transient bugs in the field. One is to turn on “verbose logging” or “instrumentation code” at the customer site. The additional code logs a greater volume of information as the system runs. Another is to make an “instrumented” version of the software and send it to the customer. This code logs additional information that targets the specific problem being diagnosed.
But both approaches suffer from a key flaw–the problem must occur again! A transient problem may not occur for many months. If the product is shipping to hundreds of customers each month, a delay in finding the problem can spell disaster.
It also places developers in the highly uncomfortable position of having to utter the five magic words guaranteed to infuriate a customer: “Can you reproduce the problem?” Something better is needed, a way to catch the problem the first time it occurs. It shouldn't require changing settings, building instrumented software, or asking a customer to reboot.
First-fault data capture
During development, we can design a software architecture that includes its own trace data recorder (TDR) for the system. Analogous to a commercial aircraft's “black box,” the TDR records information about the system's dynamic behavior. After a crash, the information in the TDR helps to analyze the failure. The data can provide valuable insight into complex behavioral problems that result from sporadic changes in system inputs, unexpected interactions between subsystems, or bugs in low-level software.
The TDR architecture must process “high-rate” data that would bog down a traditional logging mechanism. It's always on, so the TDR must be fast, lightweight, and compact to keep it from affecting system performance. If using TDR functions degrade performance, developers won't use them.
At the heart of the system is a library that applications invoke to place time-tagged data into circular trace buffers in system RAM. Another TDR component, the dump agent, catalogs the buffers. In the event of a system failure, the dump agent freezes the buffers to prevent critical data from being overwritten. It then dumps the trace buffers to nonvolatile storage. After a dump occurs, a retrieval agent transports it out of the system. The dump may be analyzed using a software tool generated automatically from the system's source code. Figure 1 illustrates a simplified architecture for a TDR system.
Triggering the dump
Because disaster strikes in myriad forms, there should be several paths to trigger a TDR dump. It's important to note that the TDR subsystem itself never triggers a dump. The dump occurs only when commanded by another software subsystem.
The most valuable time to generate a dump is when software detects a catastrophic error. This dump can be centralized through the system's logging subsystem. The logging subsystem must allow developers to assign a severity to each message the software logs. The severity indicates the importance of the message. Popular severities include Info, Warning, Error, and Fatal. When software reports a Fatal message, it means that a catastrophic error has occurred and the system is going to reboot. Before this occurs, the logging subsystem triggers a TDR dump.
Another method to trigger a dump is to add code to any panic or exception handler the operating system provides. If the operating system crashes, the panic handler triggers a TDR dump. This method may require the TDR dump code to run in kernel space. The TDR may also need to access the nonvolatile memory using different drivers than are used during normal operation. Linking the dump to the panic handler is particularly valuable if the system has a significant amount of kernel-level code that can cause crashes.
The capture library
The TDR library provides functions that other subsystems call to preserve data about their execution. The data is time-tagged by the library and placed into circular capture buffers. Listing 1 shows an example API for a capture library.
Software subsystems use TDR_BufferInit() to create and initialize a capture buffer. In an RTOS that provides lightweight tasking, the buffer may be dynamically allocated from the heap. For an operating system that institutes heavyweight tasking, the buffer can be created in shared memory or memory mapped to a device driver. The TDR_BufferInit() function also “registers” the buffer with the dump agent so that it'll be included in the next dump. The buffer is marked with a module_ID to indicate the subsystem that created it and a string to help identify the specific trace buffer in the dump. The module_ID can also be used to group and selectively dump buffers in more complex implementations of TDR. The number of entries argument indicates the number of fixed-sized data entries the circular buffer holds. This amount is tuned according to the rate that data is placed into the buffer. In this way, developers control how much history the buffer stores.
The TDR_ThreadAddIdentity() and TDR_ThreadRemoveIdentity() functions are used in a heavyweight tasking environment, such as Linux. In this case, multiple threads in a process may share the same capture buffer. These functions allow each thread to associate a string with the entries it makes.
The TDR_Store() function is used to store the data into the circular buffer. The module_ID indicates the component of the software that stores the data. The entry_ID uniquely identifies the format of the data placed into the buffer. The data itself is a fixed-size entry that the caller formats as desired. The entry_ID is used by software tools to parse the entry when encountered in a dump.
The implementation of the TDR_Store() function is critical. The application software must call the TDR library as part of normal execution so that the data is always available if a problem occurs. As a result, TDR_Store() is called many hundreds of times per second. Because it's called so often, the function must execute quickly. Ideally, it should execute in fast, constant time and not make any system calls that can block the caller's execution. In addition, the library should avoid mutual exclusion primitives that can cause unrelated execution contexts to serialize as they contend for access to a capture buffer.
To meet these performance requirements and minimize unwanted serialization, the TDR library should take advantage of “atomic” functions provided by the hardware and operating system. For example, in Linux, the atomic_set() and atomic_inc_return() functions can index into the buffer and freeze the buffer prior to initiating the dump.
The function TDR_ThreadRemoveIdentity() is used when a thread exits normally. It removes the identifying information from the trace buffer so that its entries won't be confused with a later instance of the same software. The TDR_BufferDestroy() function is used when the system shuts down normally. It removes the buffer from RAM and unregisters it with the TDR system.
If a process/subsystem terminates without calling TDR_BufferDestroy() , the buffer becomes orphaned. The TDR subsystem should detect orphaned buffers but wait several minutes before destroying them. This waiting period gives the system a chance to recognize a catastrophic error that may be related to the orphaned buffer. Two mechanisms for recognizing an orphaned buffer include monitoring the operating system process that created the buffer and inspecting the buffer itself to see if it's still being referenced.
Organizing the buffers
The capture buffers reside in RAM until they're dumped to nonvolatile storage. Each buffer must be accessible both from the process that creates it and the dump agent that generates the dump. For most RTOS environments, the buffers can be allocated dynamically from the heap when the system is started. For heavyweight tasking environments, the buffers may need to be placed into shared memory or mapped into the memory space of a device driver that performs the dump.
Figure 2 illustrates two options for organizing the buffers and how the data is grouped within. Option 1 groups all of the data from the same process into one buffer. This means that if a thread calls a function in library #1 and then a function in library #2, each function places its data into the common capture buffer. For option 2, the data is grouped by module. If a thread calls a function in library #1 and then a function in library #2, each function places its data into a separate capture buffer dedicated to the library.
For debugging, option 1 is frequently the best choice. Developers are most interested in tracing the logic as it executes on the system. Placing all entries from a process into the same buffer makes it easy to follow each thread's execution and observe the ordering of inputs and relationships of actions and outputs. Option 2 can also provide this information, but requires developers to time-correlate entries in different buffers. This increases the amount of processing that must occur during debugging.
The most elegant way to accomplish option 1 is to associate the buffer's address with the operating-system process that owns it. Then, libraries, called by a thread in the process, place data into the correct capture buffer just by calling the capture library. They don't need to know the buffer address into which they place data. This method can be accomplished by storing the capture buffer address in process-specific storage or task control block.
Figure 3 shows a data structure that can be used to implement a capture buffer. Not all of the fields in the capture buffer are dumped to nonvolatile storage. The process owns a header that contains a pointer to the capture buffer shared with the dump agent.
The buffer's body is shared with the dump agent. It identifies the process that created it (PID), the functional subsystem that created it (Module ID), and contains a human-readable string that uniquely identifies the instance of the buffer. Besides the size of the buffer and the number of entries, it contains a next-available field that indicates that next entry to be written/overwritten. When dumped, the value shows the post-processing tool where the oldest and latest entries are in the buffer. The freeze flag is set by the dump agent prior to initiating the dump. It inhibits the capture library from placing further information into the buffer. This restriction preserves the buffer's contents while the dump occurs.
Also included in the buffer's body are statistics about how the buffer is used. These statistics help software engineers tune the buffer's size during development. Helpful values include the number of entries placed into the buffer, the number of times the buffer has rolled over, and the average roll-over period.
Each buffer entry contains the ID of the thread/process that placed it into the buffer. The module ID field indicates the functional component of the software that produced the entry. The Entry ID field identifies the data's format and identifies the unique line of code that produced it. The entry must also contain a highly accurate time-stamp having sub-millisecond resolution. This permits developers to construct a high-fidelity timeline of events leading to the system failure.
Defining entry IDs
Each entry placed into a capture buffer should have a unique ID that parses the entry in the dump and locates the unique line of code that produced the entry. Developers define a new ID each time they write code to place data into the buffer. When defining the ID, the developer can provide information to help generate a parser to post-process the raw dump to produce human-readable output.
One way to do this is to define a preprocessor macro. Developers supply a mnemonic identifier, numeric value, and printf-like formatting information. When compiling the system source code, the macro evaluates to the code in Listing 2.
But this definition can also help produce a dump post-processing tool. If all entry definitions are placed in the same header file, the macro can be conditionally compiled to evaluate to an entry in a lookup table, as shown in Listing 3.
The lookup table can be used as part of a post-processing tool to parse and display the entry when it's encountered in a dump. Note that the post-processing code will have to interpret the format_string field to unpack the data bytes in the entry. Listing 4 shows an example of the macro used to define an entry ID for a buffer entry containing two integers.
The dump agent
The dump agent writes the capture buffers to nonvolatile storage. Dumps are triggered by external subsystems, such as the logging subsystem when a Fatal error is declared. The dump agent usually operates in the execution context of the initiator to produce the dump. As a result, the dump agent is commonly implemented as a library invoked by the logging subsystem, from a panic handler or ISR, or from a command-line interface.
If executing from a panic handler, the dump agent functions that write to nonvolatile storage may have to use different device drivers than those employed when the operating system is still running. This is because the system may not be servicing interrupts or scheduling appropriately.
The main data structure that the dump agent uses is the billboard , which is essentially a table of pointers to the capture buffers. The billboard also contains the size of each capture buffer. The dump agent, in the context of the capture library, checks the size of each newly created buffer. If the sum of the capture buffer sizes exceed the amount of nonvolatile storage available for the dump, the agent generates errors to alert developers that there's not enough storage to contain a potential dump.
When the dump agent is commanded to produce a dump, it traverses the billboard data structure. It freezes each buffer either by setting its freeze flag or setting a global freeze flag in the billboard area. The flag prevents the capture library from placing any more data into the capture buffers. The agent then writes each capture buffer listed in the billboard to nonvolatile storage.
Having a trace data recorder in a high-availability system gives developers a tremendous advantage when diagnosing problems in the field. Its “always on” nature gives developers visibility when problems occur without needing to ask, “Can you reproduce the problem?” Having such a system in place before a product is shipped gives developers a powerful tool in tracking down the subtle, insidious, intermittent, “3:00 AM” bug.
Steven A. Stolper is an embedded software consultant. He specializes in spacecraft flight software but also works with teams that develop commercial systems. Stolper can be contacted at email@example.com.