
Maximizing visibility through logging, as on Mars Pathfinder

David M. Cummings, July 12, 2016

Much has been written over the years about the importance of logging in software systems in general, and in embedded systems in particular. Many discussions of logging in embedded systems focus on how to build an effective set of logging services, including how those services should behave at runtime and what APIs they should offer to users. These are important issues for the development of reliable, robust, and maintainable embedded systems. In this article, however, we will focus instead on the users of the logging services and how they can employ those services to maximize visibility into the runtime behavior of their systems. Using a real-world spacecraft example, we will see how important this can be.

I note that, since embedded systems are extremely diverse, there is no "one-size-fits-all" approach that is right for all systems. Depending upon a variety of factors, different approaches to logging may be appropriate for some systems, and -- while logging is beneficial in many systems -- it is not required in all systems.

One of the main benefits of logging (but not the only benefit) is help in troubleshooting nasty, subtle problems, especially those that cannot be easily reproduced. Often, the only information you have when trying to figure out what went wrong is the data in the log file. In these situations, every bit of information you can extract from the log is critical. This information can mean the difference between figuring out what went wrong versus having to instrument the code further and praying that the error manifests itself again at some point in the future. The design of the logging infrastructure is critical in this regard. Ideally, it would provide information such as timestamps (with as much granularity as practical), sequence numbers (so you can detect log entries that may be missing due to, e.g., buffer overflows within the logging subsystem), and stack trace information (so you can see the call history that led to the error). However, each and every individual call to the logging services is also critical.
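For concreteness, here is a rough sketch in C of what one such log record might look like. The field names and sizes are hypothetical, invented for this illustration rather than drawn from any particular logging subsystem:

#include <stdint.h>

/* Hypothetical log record layout -- a sketch only, not the format of any
   particular logging subsystem.  The idea is that every entry carries a
   fine-grained timestamp, a sequence number for detecting dropped entries,
   and enough context to reconstruct where the entry came from. */
typedef struct
{
    uint64_t timestamp_us;      /* microseconds since boot (as much granularity as practical) */
    uint32_t sequence_number;   /* monotonically increasing; a gap reveals dropped entries */
    uint16_t severity;          /* e.g., fatal, error, warning, informational */
    uint16_t task_id;           /* which task or thread made the logging call */
    uint32_t caller_pc[4];      /* abbreviated stack trace: a few return addresses */
    char     message[128];      /* the formatted text supplied by the caller */
} log_record_t;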

Two important considerations when making a call to the logging services are as follows:

  1. When do you make a logging call?
  2. What information do you provide in that logging call?

As to point (1), my motto is, with tongue only slightly in cheek, "the more the better" (assuming your system is not significantly memory constrained). Some developers worry that excessive logging clutters the code and makes it more difficult to read. This can indeed be an issue, but one that can be mitigated by well-structured code. Moreover, the benefits provided by increased visibility when troubleshooting and debugging can outweigh this concern. Anywhere you see an opportunity for something to go wrong that can easily be checked (e.g., a return code from a function call, an out-of-bounds array index, a null pointer, an out-of-range result, an unexpected queue-full condition, an unexpected timing delay, an unexpected input from another function/thread/process/computer), you should consider putting in a test for that condition with a logging call if the test fails (an assertion). This will not slow you down significantly as you write your code (it becomes almost second nature after a while), and the payback in terms of ease of integration testing, system testing, and debugging more than makes up for extra time spent during the coding phase.
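To illustrate the pattern, here is a brief sketch; every name in it (read_sensor, LOG_ERROR, NUM_SENSORS, and so on) is hypothetical, invented for this illustration rather than taken from any actual API. Each easily testable condition gets its own check, and the logging call on failure captures the relevant values:

#include <stdio.h>

/* Sketch only: hypothetical names and a trivial LOG_ERROR macro, used to
   illustrate testing easily checkable conditions and logging on failure. */
#define LOG_ERROR(...)  (printf("ERROR: " __VA_ARGS__), printf("\n"))

#define NUM_SENSORS 8U

static int read_sensor(unsigned int index, int *reading)
{
    *reading = (int) index * 10;   /* stand-in for real hardware access */
    return 0;                      /* 0 == success in this sketch */
}

int sample_sensor(unsigned int sensor_index)
{
    int reading;
    int status;

    if (sensor_index >= NUM_SENSORS)     /* out-of-bounds index */
    {
        LOG_ERROR("sensor_index out of bounds: sensor_index=%u, NUM_SENSORS=%u.",
                  sensor_index, NUM_SENSORS);
        return -1;
    }

    status = read_sensor(sensor_index, &reading);
    if (status != 0)                     /* unexpected return code */
    {
        LOG_ERROR("read_sensor failed: status=%d, sensor_index=%u.",
                  status, sensor_index);
        return status;
    }

    if (reading < 0 || reading > 1023)   /* out-of-range result */
    {
        LOG_ERROR("reading out of range: reading=%d, sensor_index=%u.",
                  reading, sensor_index);
        return -1;
    }

    return 0;
}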

As to point (2), again my motto is, "the more the better." There are certain items that are obvious candidates for inclusion in a logging call. For example, if you determine that a function call has returned an error code, you typically would include the value of that error code in the logging call. Other items, however, may not be so obvious, but they can sometimes mean the difference between figuring out what went wrong on the spot versus having to try to reproduce the problem, possibly in vain. As an example, I'll describe a real-world bug that occurred during the development of the Mars Pathfinder spacecraft, which I've mentioned in an earlier article in a different context. (This robotic spacecraft consisted of a lander, renamed the Carl Sagan Memorial Station, and a robotic Mars rover named Sojourner.) Because this happened quite some time ago, and I don't have ready access to the correct version of the Pathfinder source code (see acknowledgment below), I'm going to reconstruct the relevant C code in spirit. Although the actual Pathfinder code was undoubtedly different from the code below, the point of this example remains the same.


The Sojourner rover on Mars on sol 22 (Source: NASA)

As described in the earlier article, one of my colleagues, Steve Stolper, had a computation that, in the absence of error conditions, would always produce an even integer result. Although there was no reason to doubt that the result would be even, he checked it anyway; for example:

unsigned int instrument_ID = (instrument_index + instrument_offset) * 2;
if (instrument_ID % 2 != 0)
{
    EVR_LOG_FATAL ("Odd result! instrument_ID=%u, instrument_index=%u, instrument_offset=%u.",
                   instrument_ID, instrument_index, instrument_offset);
}

Now, a parsimonious programmer might instead code the logging call as follows:

    EVR_LOG_FATAL ("instrument_ID is odd!");

This more minimalist version would save memory. It would also require less data to be sent from the spacecraft to Earth if this logging call ever got executed, thereby conserving precious telemetry bandwidth. Additional rationale for coding it this way might be that all we need to know is that something went wrong. If it did, then it likely was some sort of transient hardware malfunction due to radiation effects in space (which we were quite concerned about as we designed the software). All we would need to do in that event is report the error and force a reboot, which is what the parsimonious call to EVR_LOG_FATAL does. However, such a rationale would have been misguided, as I'll now show.

As it turned out, we saw this error once (completely unexpectedly) during testing in our lab, without any exposure to induced radiation. We tried and tried to reproduce it, to no avail. We had to get to the bottom of it. Although it might have been tempting to chalk it up to something that was wrong with our testbed or some other environmental issue, we knew we couldn't leave it at that. What if it was a software problem or a problem with the spacecraft hardware?

Here is where the importance of the logging call comes in. This software system was highly multi-threaded, with a great deal of concurrency, managed by an off-the-shelf multitasking OS (VxWorks). Not surprisingly, there was no hardware memory protection. Therefore, there were many other VxWorks tasks (threads), as well as interrupt service routines, that could potentially corrupt any of the three variables -- instrument_ID, instrument_index, and instrument_offset -- as well as the constant values 2 and 0. By having all three of the variables in the log entry, we could -- with confidence -- rule out a corruption of all but one of the above five items, as follows.

The text of the log entry resulting from this error looked something like this:

Odd result! instrument_ID=29, instrument_index=11, instrument_offset=3.

We could see that the result, which should have been 28 ([11+3]*2), was instead 29. This meant that the if statement executed correctly, so the problem was not in that line of code. We could also see that this result could not be explained by corruption of the variables instrument_index or instrument_offset, nor by corruption of the constant value 2 (there is no modification of that value that would cause the arithmetic result to be 29). That left as a possibility the corruption of the variable instrument_ID by another thread or ISR (interrupt service routine), but that just didn't smell right given that the value was only off by one. Yes, it's possible that a bug caused another thread or ISR to overwrite the value 28 (binary 11100) with the value 29 (binary 11101) by setting the low bit. And perhaps that's a line of investigation we would have ultimately pursued if necessary. But our immediate attention was drawn to the arithmetic computation and how this might produce a result that was off by just one.

This allowed us to eliminate a huge universe of software explanations from the first round of consideration; namely, those explanations that required corruption by another thread or ISR. We focused instead on the arithmetic computation itself. We looked at the object code to make sure the compiler did not have a bug in that area. It did not, although a very nasty and subtle compiler bug did end up biting us later in a completely different area of the code.

We assumed that this error was due to a software problem. (If this assumption had led to a dead end, then we would have started looking at the hardware.) There was only one software explanation we could come up with that seemed credible: the carry bit was not being correctly preserved across interrupts, and an interrupt had occurred between just the right two machine instructions that comprised the arithmetic computation for instrument_ID. We knew that Wind River, the VxWorks supplier, had recently ported VxWorks to the specific processor architecture we were using on the spacecraft. Such an architecture port involves reworking the interrupt handling code and context switching code. Maybe there was a bug in the architecture port that was not correctly preserving the carry bit across interrupts, which Wind River had not yet seen in their testing because the port was so recent? This seemed to fit the symptoms, and it was consistent with the object code we had examined for the arithmetic computation.
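To make this hypothesis concrete, here is a purely illustrative sketch of the failure mode, written in C. It is my own model and not the actual Pathfinder object code, compiler output, or processor behavior: it assumes a two-step instruction sequence in which the second step folds the carry flag into the low bit of the result (as a rotate-left-through-carry would), and shows how a carry flag that is not preserved across an interrupt could turn 28 into 29:

#include <stdio.h>

/* Purely illustrative: models a hypothetical instruction sequence in which a
   later instruction consumes the carry flag, and shows how a carry flag
   clobbered during an interrupt could make the result odd.  This is NOT the
   actual Pathfinder object code or processor behavior. */

static unsigned int carry_flag;   /* stand-in for the CPU carry flag */

static unsigned int add_setting_carry(unsigned int a, unsigned int b)
{
    carry_flag = 0;               /* 11 + 3 = 14 produces no carry */
    return a + b;
}

static unsigned int rotate_left_through_carry(unsigned int x)
{
    return (x << 1) | carry_flag; /* the old carry rotates into the low bit */
}

int main(void)
{
    unsigned int sum = add_setting_carry(11, 3);

    /* Suppose an interrupt occurs here, and the OS fails to save and restore
       the carry flag, so an unrelated computation in an ISR leaves it set. */
    carry_flag = 1;

    unsigned int instrument_ID = rotate_left_through_carry(sum);
    printf("instrument_ID = %u\n", instrument_ID);   /* prints 29 instead of 28 */
    return 0;
}

Again, this is only a model of the failure mode we hypothesized, not a claim about the actual instruction sequence.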

We looked at the VxWorks source code, and found what appeared to be a bug in the preservation of the carry bit across interrupts, which was consistent with the symptoms. We called Wind River, and spoke to one of the engineers responsible for the architecture port. He looked at the source code and confirmed that it was indeed a bug. We breathed a huge sigh of relief.

If Steve's logging call had not contained all the information we needed in order to rule out, from our first round of consideration, corruption of variables by other threads or interrupt service routines, we likely would not have focused on the line of code that performed the arithmetic computation, other than confirming that it contained no bugs and that the object code looked correct. In fact, we wouldn't have known whether the problem was in that line of code or in the if statement. We probably would have started a search for a thread or ISR that could have corrupted one of the relevant variables or constants. (After all, how often do you think of the OS failing to preserve the carry bit as a possible explanation for an error?) In that case, the search for corruption due to a concurrency-related error would have been completely in vain. We might never have found the actual bug prior to launch. That could have had catastrophic consequences if the bug had manifested itself while the spacecraft was traveling to Mars or -- even worse -- while the spacecraft was going through its entry, descent, and landing, all of which required split-second accuracy of the software.

This is just one of many examples I've encountered over the years in which a complete set of information provided in a logging call saved days, weeks, or even months of debugging effort. And because you never know which logging call is going to be the one that gets triggered, it is important to make sure that each and every call contains as much relevant information as practical (assuming you have the memory to support it). Sometimes you only get one chance to diagnose a bug that is not easily reproducible, and you need to be able to make the most of that single opportunity.

Acknowledgment: Many thanks to Steve Stolper and to Glenn Reeves, who led the Pathfinder flight software team and continues to be at the forefront of space exploration at JPL. Glenn and Steve reviewed this article, and they also spent a significant amount of time helping me try to locate the correct version of the Pathfinder source code. That proved to be more time-consuming than we had expected, and we were forced to abandon the effort and get back to our day jobs.

David M. Cummings is the Executive Vice President of the Kelly Technology Group in Santa Barbara, CA. He has over 35 years of experience in the design and implementation of software systems, many of which are embedded systems. Nine of those years were spent at the Jet Propulsion Laboratory, where he designed and implemented flight software for the Mars Pathfinder spacecraft. He holds a bachelor's degree from Harvard University, and a master's degree and a Ph.D. from UCLA.
