Analyzing past failures is one of the best ways to prevent them from happening again. Here, Jack distills some lessons from high-profile disasters.
We're not terribly good at learning from our successes; smug with the satisfaction of a job well done, most of us proceed immediately to the next task at hand. It's a shame that we can't look at a successful project and then dig deeply into what happened, how, and when, to suck the educational content of the project dry.
Ah, but failures are indeed a different story. High-profile disasters inevitably produce an investigation, calls for Congress to “do something,” and, in the best of circumstances, a change in the way things are built so the accident is not repeated.

Isn't it astonishing that airplane travel is so reliable? That we can zip around the sky at 600 knots, seven miles up, in an ineffably complex device created by flawed people? Perhaps aviation's impressive safety record is a by-product of the way the industry manages failures. Every crash is investigated; each yields new training requirements, new design mods, or other system changes to eliminate or reduce the probability of such a disaster striking again.
Though crashes are rare, they do occur, so airliners carry expensive flight data recorders whose sole purpose is to produce post-accident clues to the safety board. What a shame that we firmware folks don't have a similar attitude. Mostly we're astonished when our systems break or a bug surfaces.
I hope that in the future we learn to write code proactively, expecting bugs and problems but finding or trapping them early, and leaving a trail of clues as to what went wrong.

I believe we should examine disasters, our own and others', because so many embedded systems crash in similar ways. I collect embedded disaster stories not from morbid fascination but because I think they offer universal lessons. Here are a few that are instructive.
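That trail of clues needn't be elaborate. Here's a minimal sketch in C of a firmware "flight data recorder": a circular trace buffer that holds the most recent events so they can be dumped after a crash. The names, sizes, and structure here are my own invention, not taken from any of the systems discussed below.

```c
#include <stdint.h>

/* A minimal firmware "flight data recorder": a circular buffer of
 * timestamped event codes. Illustrative only; sizes and fields are
 * assumptions, not from any real project. */
#define TRACE_DEPTH 64U

typedef struct {
    uint32_t timestamp;   /* e.g., a tick count when the event occurred */
    uint16_t event_code;  /* application-defined event identifier */
    uint16_t aux;         /* optional extra data: an error value, a state, ... */
} trace_entry_t;

static trace_entry_t trace_buf[TRACE_DEPTH];
static uint32_t trace_head;   /* total events logged; indexes modulo depth */

void trace_log(uint32_t now, uint16_t code, uint16_t aux)
{
    trace_entry_t *e = &trace_buf[trace_head % TRACE_DEPTH];
    e->timestamp  = now;
    e->event_code = code;
    e->aux        = aux;
    trace_head++;
}

/* How many entries a post-mortem dump can recover. */
uint32_t trace_count(void)
{
    return (trace_head < TRACE_DEPTH) ? trace_head : TRACE_DEPTH;
}
```

A few bytes per event is cheap insurance: after a field failure, the buffer tells you what the system was doing in its final seconds.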
Controllers spent a few days analyzing data to understand what happened, then initiated a series of burns that would ultimately lead to NEAR's successful rendezvous with the asteroid. But two-thirds of the spacecraft's fuel had been dumped, consuming all of the mission's reserves. Enough fuel was left — barely — to complete the original goals of the mission. But reduced fuel means things happen more slowly, so NEAR's rendezvous with 433 Eros would happen 13 months later than planned.
Like so many system failures, a series of events, each not terribly critical, led to the fuel dump. Immediately after the engine fired up for the planned 15-minute burn, accelerometers detected a lateral acceleration that exceeded a limit programmed into the firmware. This momentary under-one-second transient was not out of bounds for the mechanical configuration of the spacecraft. But the propulsion unit is cantilevered from the base of the spacecraft, creating a bending response that, according to the report, “was not appreciated.”1 Quoting further, “In retrospect, the correct thing for the G&C software to have done would have been to ignore (blank out) the accelerometer readings during the brief transient period.” In other words, though the transient wasn't anticipated, the software was too quick to act on momentary accelerometer readings.
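The report's suggestion to blank out readings during a brief transient amounts to persistence filtering: don't act on a limit violation until it has lasted long enough to be a real fault. Here's a sketch in C; the limit and the sample count are arbitrary illustrative values, not NEAR's actual parameters.

```c
#include <stdbool.h>
#include <stdint.h>

/* Persistence filtering: require an out-of-range condition to hold for
 * several consecutive samples before declaring a fault. Values below
 * are illustrative assumptions only. */
#define ACCEL_LIMIT        50       /* allowed lateral acceleration, arbitrary units */
#define FAULT_PERSISTENCE  10U      /* consecutive bad samples before we act */

static uint32_t bad_samples;

/* Returns true only when the limit has been exceeded continuously long
 * enough to be a genuine fault rather than a sub-second transient. */
bool lateral_accel_fault(int32_t accel)
{
    if (accel > ACCEL_LIMIT || accel < -ACCEL_LIMIT) {
        if (++bad_samples >= FAULT_PERSISTENCE)
            return true;
    } else {
        bad_samples = 0;   /* condition cleared; start counting over */
    }
    return false;
}
```

A brief spike resets harmlessly; only a sustained excursion shuts the engine down.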
With the software figuring lateral movement exceeded a pre-programmed limit, it shut the motor down and put the spacecraft into a safe mode. The firmware used thrusters to rotate NEAR to an earth-safe attitude. Code then ran a script designed to change over from thrusters to reaction wheels (heavy spinning wheels that absorb or impart spin to the spacecraft) for attitude control. According to the report, “Due to insufficient review and testing of the clean-up script, the commands needed to make a graceful transition to attitude control using reaction wheels were missing.” Wow!
Excessive spacecraft momentum meant that the reaction wheels just weren't up to the task of putting NEAR into the earth-safe mode. The firmware did try, for the programmed 300 seconds, but then gave up and started warming up thrusters, which offer much more kick than the momentum wheels. Now the only chance to save the spacecraft was to go to the lowest level save mode, “sun-safe,” where it spun slowly around an axis pointing towards the sun. This would keep the batteries charged till ground intervention could help out.
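That escalation ladder (wheels, then thrusters, then sun-safe) can be modeled as a simple state machine with a timeout. The sketch below is my own reconstruction of the behavior the report describes; the state names and function signature are assumptions, though the 300-second figure comes from the article.

```c
#include <stdint.h>

/* Escalating safe-mode logic, loosely modeled on NEAR's behavior.
 * All identifiers here are invented for illustration. */
typedef enum {
    MODE_WHEELS,     /* try to stabilize on reaction wheels */
    MODE_THRUSTERS,  /* wheels failed: escalate to thrusters */
    MODE_SUN_SAFE    /* last resort: slow spin about a sun-pointing axis */
} safe_mode_t;

#define WHEEL_TIMEOUT_S 300U   /* the firmware tried for 300 seconds */

/* Called periodically: decide whether to stay in the current mode or
 * escalate, given time spent in the mode and whether attitude control
 * has succeeded. */
safe_mode_t safe_mode_step(safe_mode_t mode, uint32_t seconds_in_mode,
                           int stabilized)
{
    switch (mode) {
    case MODE_WHEELS:
        if (!stabilized && seconds_in_mode >= WHEEL_TIMEOUT_S)
            return MODE_THRUSTERS;      /* give up on the wheels */
        return MODE_WHEELS;
    case MODE_THRUSTERS:
        return stabilized ? MODE_THRUSTERS : MODE_SUN_SAFE;
    default:
        return MODE_SUN_SAFE;           /* nowhere lower to go */
    }
}
```

The point of structuring it this way is that each fallback level, and the transition into it, becomes an explicit, testable path — exactly the paths NEAR's clean-up script got wrong.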
Seven minutes later an error in a data structure (that is, a parameter stored in the firmware) led the system to conclude that a momentum wheel running at its maximum speed was stopped. A series of race conditions, exacerbated by low batteries, led to some 7,900 seconds of thruster firing over the course of many hours. Eventually NEAR did stabilize in sun-safe mode, though now missing 29 kg of critical propellant.
So NEAR's troubles stem ultimately from a transient due to an odd vibration mode — something the firmware design team could not have anticipated. This rather small transient revealed flaws in the firmware that, in large part, led to a near-catastrophe (pun intended).
The review board inspected some, but not all, of the system's 80,000 lines of code (C, Ada, and assembly). They uncovered nine software bugs and eight data structure errors. Bugs included poorly designed exception handlers and critical variables that could be erroneously overwritten.
Hindsight is certainly a powerful microscope, especially when zooming in on a specific problem that causes a mishap. But I can't help but wonder why the post-failure review board's firmware review was so much more effective than those — if there were any — performed during original design. The report's recommendation 1c insists that from now on all command scripts must be tested, especially those critical to spacecraft safety — including abort cases. Well, duh!
You'd think configuration management would be a no-brainer for a mission costing many megabucks. Turns out the flight software was version 1.11, but two different version 1.11s existed. The one that was not flying had the proper command script to handle the thruster-to-reaction-wheel changeover. Astonished? I sure was. From the report: “Flight code was stored on a network server in an uncontrolled environment.” Version control is not rocket science!
Unfortunately, I've been unable to obtain more detailed information about the nature of the software error. However, there's an enticing — and as yet unavailable — reference in the appendix of the NEAR report to a memo called “How Clementine Really Failed and What NEAR Can Learn.” Is it possible that NEAR's software failure had been anticipated four years earlier?
NASA published a report on the mission that mentions the failure but does not delve into root causes.2 Clementine was a technology demonstrator operated by the Ballistic Missile Defense Organization; NASA was a partner, not the main force behind the mission. The Clementine report, though short on firmware details, does delve into the human price of a schedule that's too tight. A few quotes follow; there's not much one can add!
“The tight time schedule forced swift decisions and lowered costs, but also took a human toll. The stringent budget and the firm limitations on reserves guaranteed that the mission would be relatively inexpensive, but surely reduced the mission's capability, may have made it less cost-effective, and perhaps ultimately led to the loss of the spacecraft before the completion of the asteroid flyby component of the mission.”
“The mission operations phase of the Clementine project appears to have been as much a triumph of human dedication and motivation as that of deliberate organization. The inadequate schedule…ensured that the spacecraft was launched without all of the software having been written and tested.”
“Further, the spacecraft performance was marred by numerous computer crashes. It is no surprise that the team was exhausted by the end of the lunar mapping phase.”

Ariane 5
In May 1998 I described the 1996 failure of Ariane 5 (“Disaster,” p. 113), the large launch vehicle that tumbled and destroyed itself 40 seconds after blast-off. Since then more information has come to my attention.
Shortly after launch, the Inertial Reference System (SRI, the apparently scrambled acronym a result of translation from French to English) detected an overflow when converting a 64-bit floating-point number to a 16-bit signed integer. An exception handler noted the problem and shut the SRI down. Due to the incredible expense of these missions (this maiden flight itself had two commercial spacecraft aboard, each valued at about $100 million) a back-up SRI stood ready to take over in case of the primary's failure. SRI number two did indeed assume navigation responsibility, but it ran identical code, encountered the same error, and shut down as well.
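A range check before the conversion would have let the software substitute a graceful response for a fatal exception. A minimal sketch in C; the function name and interface are mine, not Ariane's:

```c
#include <stdbool.h>
#include <stdint.h>

/* Range-checked conversion from double to a 16-bit signed integer.
 * Instead of letting an out-of-range value raise an operand-error
 * exception, the caller is told explicitly that the value didn't fit
 * and can decide how to respond. Illustrative sketch only. */
bool convert_to_i16(double value, int16_t *out)
{
    if (value > (double)INT16_MAX || value < (double)INT16_MIN)
        return false;          /* out of range: no conversion performed */
    *out = (int16_t)value;
    return true;
}
```

The check costs a couple of compares, which is the crux of the processor-loading trade-off the designers faced.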
Why did the overflow occur? This code had been ported from the much smaller Ariane 4. According to the report “…it is important to note that it was jointly agreed [between project partners at several contractual levels] not to include the Ariane 5 trajectory data in the SRI requirements and specification.”3 Clearly, a decision that doomed the project to failure. Here's a case where the firmware was, in fact, perfect — if perfection is measured by how well the code meets the spec. Again, “The supplier of the SRI was only following the specification given it.”
As with NEAR, Ariane's crash resulted from a series of coupled events rather than any single problem. The exception was largely a result of poor specification. But designers did realize that some variables might go out of range; in fact they specifically wrote code to monitor four of the seven critical variables. Why were three left exposed? An assumption was made that physical limits made it impossible for these three to overflow (an assumption that proved expensively faulty). Further, a target of 80% processor loading meant that checking all calculations would be prohibitively expensive.
But the exception itself didn't cause Ariane's crash. When both SRIs failed, they did so gracefully, and even returned diagnostic data to the vehicle's main computer that indicated the flight data was invalid. But the main computer ignored the diagnostic bit, assumed the data was valid, and used this incorrect information to guide the vehicle.
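The fix on the receiving end is conceptually trivial: tag flight data with a validity flag, and refuse to steer on data flagged bad. A sketch in C, with the message layout and field names invented for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative message format: flight data tagged with a validity
 * flag, the way the SRIs tagged their final diagnostic output.
 * Field names are assumptions for this sketch. */
typedef struct {
    bool    data_valid;   /* false means "diagnostic data; do not fly on this" */
    int32_t pitch_cmd;    /* nozzle deflection command derived from the data */
} sri_message_t;

/* Use the command only if the source flagged it valid; otherwise hold
 * the last known-good value rather than steering on garbage. */
int32_t select_pitch_cmd(const sri_message_t *msg, int32_t last_good)
{
    return msg->data_valid ? msg->pitch_cmd : last_good;
}
```

One conditional on the consumer's side is all it takes to honor the producer's diagnostic bit.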
As a result of trying to use bad data, the computer commanded the engine nozzles to hard-over deflection, resulting in the tumbling and destruction of the rocket.
To complicate the picture further, the floating-point operation that overflowed was not even a calculation required for normal flight operations. It was left-over code, a relic of the firmware's Ariane 4 heritage, code that had meaning only before lift-off. The review board also noted that, though testing of the SRI is hard, it's quite possible and (gasp!) maybe even a good idea: “Had such a test been performed by the supplier or as part of the acceptance test the failure mechanism would have been exposed.”
To summarize: poorly tested code that should not have been running caused a floating-point conversion error because the spec didn't call for an understanding of real flight dynamics. In an effort to keep processor loading low, the variables involved weren't monitored, though others were. Two redundant SRIs running the same code performed identically and shut down. The main computer ignored the SRI “bad data” bit and tried to fly using corrupt information. Another interesting tidbit from the report: “…the view had been taken that software should be considered correct until it is shown to be at fault.” This is the rationale behind using identical code on redundant SRIs. It does raise the question of why sufficient testing to isolate those potential software faults was not performed.
Several common threads run through many of these stories. The first is that of error handling. Look at Ariane: when the software failed, it properly set a diagnostic bit that meant “ignore this data.” Yet the main CPU blithely carried on, ignoring the error bit itself.
Inadequate testing, too, appears repeatedly as a theme in disasters. The NEAR team had simulators and prototypes, but these test platforms worked poorly. Their fidelity was suspect, leaving the engineers to wonder, when problems surfaced, if the element at fault was the simulator or the code. Ariane, too, had poor simulators and thus only partially tested software. On Clementine it appears that some code was not tested at all.
Interprocessor communication is a constant source of trouble. Though I'm a great believer in using multiple CPUs to reduce software complexity and workload, problems result when too much comm is required. NEAR's computers ran into race conditions. Ariane's error bit was disregarded.
Those who don't learn from the past are sentenced to repeat it.
Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. He founded two companies specializing in embedded systems. Contact him at .