Safety First: Avoiding Software Mishaps
Charles Knutson and Sam Carmichael
Accidents happen. That's just part of life. But when mission- or safety-critical systems experience failures due to faulty software, serious questions are raised.
Despite the risks, software is increasingly making its way into mission- and safety-critical embedded devices. This article explores the challenges inherent in balancing the tremendous flexibility and power provided by embedded software against the risks that occur when software failure leads to loss of life or property. This article also explores the root causes of several famous embedded software failures, including the Therac-25, Ariane 5, and recent failed Mars missions.
The problem of safety
And so we learn to live with the inherent risks that surround us, because the cost of avoidance seems too high. But as technology becomes more and more ubiquitous, with more of that technology controlled by software, a greater portion of the risk we face ultimately rests in the hands of software engineers. Most of the time, the risks we face never materialize. When they do, we call the event an accident or a mishap. The kinds of accidents we're primarily concerned with in this article are those that lead to personal injury, loss of life, or unacceptable loss of property.
So how significant has the risk become? Consider that software is now commonly found in the control systems of nuclear power plants, commercial aircraft, automobiles, medical devices, defense systems, and air traffic control systems. It's likely that almost everyone has at one time or another put themselves, their lives, or their property into the hands of the engineers that built the software controlling these systems. The spread of software into safety-critical systems is likely to continue, despite the risks.

Hazards, accidents, and risks
A hazard is a set of conditions, or a state, that could lead to an accident, given the right environmental trigger or set of events. An accident is the realization of the negative potential inherent in a hazard. For example, a pan of hot water on a stove is a hazard if its handle is accessible to a toddler. But there's no accident until the toddler reaches up and grabs the handle.

In software, faults or defects are errors that exist within a system, while a failure is an error or problem that is observable in the behavior of the system. A fault can lie dormant in a software system for years before the right set of environmental conditions causes the problem to manifest itself in the functioning system (think Y2K).1
Our ultimate concern here is not whether hazards should exist, or whether software faults should exist. You can debate whether they should or should not, and you can even argue whether it's theoretically possible to eliminate them at all. But the reality is that they do exist, that they represent risk, and that we have to deal with that risk.
Risks can be addressed at three fundamental levels:

- Avoid or remove the hazard altogether
- If the hazard is unavoidable, minimize the risk of an accident
- If an accident occurs, minimize the loss of life and property

As we build safety-critical software, we need to be concerned with mitigating risk at each of these levels.
Software in safety-critical devices
In assessing the level of risk inherent in turning over safety-critical control functions to software, it is valuable to compare the track record of software with other types of engineering. Software is indeed unique in many ways. As Frederick Brooks pointed out, certain essential characteristics of software are unique and challenging.1 As a result, when compared to other engineering fields, software tends to have more errors, those errors tend to be more pervasive, and they tend to be more troublesome. In addition, it is difficult to predict the failure of software because it doesn't gracefully or predictably degrade with use (such as the way tires or brake shoes will gradually wear out until it's time to replace them). Software may break immediately upon installation due to unforeseen environmental or usage conditions. It may work reliably until a user tries something unexpected. It may work well for years until some operating condition suddenly changes. It may fail intermittently as sporadic environmental conditions come and go.
With all of these legitimate concerns over software, an obvious question arises: "Why use software in safety-critical systems at all?" If a risk is worth taking, it's because some return or advantage accompanies the risk. One of the most significant reasons for using software in embedded devices (whether safety-critical or not) is that a higher level of sophistication and control can be achieved at a lower cost than is possible with hard-wired electronics or custom-designed mechanical features. As we come to expect more out of embedded devices, software is currently the best way to keep up with the steep growth in complexity.
In a changing environment, devices must either adapt or face early obsolescence. Hard-wired devices must either be replaced or undergo expensive upgrades. Software, on the other hand, can be upgraded relatively easily, without swapping out expensive hardware components.
Finally, because of the power and flexibility of software, devices can deliver a great deal of information to users and technicians. Such software-controlled devices can gather useful information, interpret it, perform diagnostics, or present more elegant interfaces to the user, at a more acceptable cost than is possible with hardware.
For these reasons, tremendous value and power lie in using software to control embedded devices. Still, we need to clearly understand the risks. By understanding the nature of software, we may more effectively build embedded control software while minimizing the risks.
In his article, Brooks states that software has “essential” properties as well as “accidental.”1 The essential properties are inherent, and in a sense, unremovable or unsolvable. They represent the nature of the beast. The accidental properties are coincidental, perhaps just the result of an immature field. The accidental properties are those that might be solved over time. The following sections identify some of the essential properties of software.2 In order to build safe software, each of these must be dealt with to minimize risk.
Complexity. Software is generally more complex than hardware. The most complex hardware tends to take the form of general purpose microprocessors. The variety of software that can be written for these hardware systems is almost limitless, and the complexity of such systems can dwarf the complexity of the hardware system on which it depends. Consider that software systems consist not only of programs (which may have an infinite number of possible execution paths), but also data, which may be many orders of magnitude greater than the hardware states present in our most complex integrated circuits.
The most complex special-purpose hardware takes the form of ASICs (application-specific integrated circuits), but these are often essentially general-purpose microprocessors combined with system-specific control software. In such cases, it's still common for the complexity of the software to dwarf that of the hardware.
Error sensitivity. Software can be extremely sensitive to small errors. It has been said that if architects built houses the way software engineers built software, the first woodpecker that came along would destroy civilization. The quip stings, but it captures something essential about software: small errors can have huge impacts. Other fields have a notion of "tolerance." For example, some play typically exists in the acceptable range of tensile strength of a mechanical part. There's little in the way of an analogous concept in software. There's no sense in which the software is still fit if some small percentage of the bits change. In some situations, the change of a single bit in a program can mean the difference between successful execution and catastrophic failure.
Difficult to test. For most real software systems, complete and exhaustive testing is an intractable problem. Even a program consisting of only a few hundred lines of code may require an impractically large number of test cases to cover all possible behaviors. Consider a single loop that waits for a key press. What happens if the user presses a key during the first iteration? The second? The third? One can argue that all subsequent iterations of that loop are part of an equivalence class, and the argument would probably be valid. But what if something catastrophic occurs only if the key is pressed during the one millionth time through? Testing isn't going to discover that until the millionth test case, and that's not likely to happen.
All testing deals with risk management, and all testers understand the near impossibility of exhaustive testing. So they work with equivalence classes based upon assumptions of continuous functions. But when a function suddenly shows itself to be discontinuous (as in the Pentium floating-point bug), you still have a problem.
Correlated failures. Finding the root cause of failures can be extremely challenging with software. Mechanical engineers (and even electrical engineers) are often concerned with manufacturing failures, and the rates and conditions that lead things to wear out. But software doesn't really wear out. The bits don't get weak and break. It is true that certain systems can become cluttered with incidental detritus (think Windows 9x), but they don't wear out in the same way a switch or a hard drive will. Most of the failures in software are actually design errors. One can attempt to avoid these failures with redundant systems, but those systems simply duplicate the same design error, which doesn't help much. One can also attempt to avoid these failures by employing competing designs of a system, but the backup may suffer from the same blind spots as the original, despite the fresh design. Or even more pernicious, the backup may suffer from new and creative blind spots, different from the first, but equally harmful.
Lack of professional standards. Software engineering is very much a fledgling field. Individuals who once proudly proclaimed themselves to be “computer programmers” are now typically mildly insulted at the notion of being only a “programmer” and now tend to prefer to be called “software engineers.” But there are really few, if any, “software engineers” in practice. There are no objective standards for the engineering of software, nor is there any accrediting agency for licensing professional software engineers. In a sense, any programmer can call himself a software engineer, and there's no objective way to argue with that. Steve McConnell argues for the creation of a true discipline of software engineering.3 Given our increasing dependency on the work of these “engineers” for our lives and property, the idea of licensing software engineers is increasingly appealing.
Approaches to safety
So if software is so difficult to get right, do we stand a fighting chance? Of course. But the first step is to understand where the challenges lie. Then we can reasonably pursue solutions. The value of software in safety-critical systems is huge, but it has to be balanced against the risks. The following sections deal with approaches that may hold promise as we seek to improve the quality and safety of software in embedded systems.
Hazard analysis. In the old “George of the Jungle” cartoons, our hero routinely smacked face first into oncoming trees. Yes, he was engaging in a relatively hazardous activity (swinging through the jungle on vines), but the problem was typically one of inattention to the events in process. In other words, he performed relatively poor hazard analysis.
The keys to hazard analysis involve, first of all, being aware of the hazards. As obvious as that is, a certain percentage of accidents could have been avoided by simply being aware in the first place of potential hazards. Once a hazard has been identified, the likelihood of an accident stemming from it needs to be assessed, and the criticality of an accident should one occur. Once the hazards are understood at this level, devices can be designed that either eliminate the hazards or control them to avoid accidents. The process of risk management must be on-going, constantly gauging the derived value against the potential risk.
In order to build safe embedded systems, hazards must be discovered early in the software life cycle. Safety critical areas must be identified so that extra care can be given to exploring the implications of the application of software to this particular domain. Within these safety-critical areas, specific potential hazards must be identified. These analyses become foundational pieces that feed into the design of the system. The software can now be designed in such a way that these potential hazards can either be avoided or controlled.
A number of approaches can be used to discover potential hazards, including subsystem hazard analysis, system hazard analysis, design reviews and walkthroughs, checklists, fault tree analysis, event tree analysis, and cause-consequence diagrams (which combine fault and event trees).3 Once you understand the potential hazards within a system, design decisions can be made to mitigate the risks associated with these hazards. The following are examples:
Testing. Testing involves actually running a system in a specifically orchestrated fashion to see what it will do in a given situation. A number of challenges are inherent in testing, and they strike at the heart of why it's essential that quality be built into software as it's created, rather than tested in after the fact. The following are a number of dangerous assumptions that are frequently made, and which testing will not fix:
Even if we are wary of these dangerous assumptions, we still have to recognize the limitations inherent in testing as a means of bringing quality to a system. First of all, testing cannot prove correctness. Testing can show the presence of defects, but never their absence. The only way to prove correctness via testing would be to exercise all possible states, which, as we've stated previously, is fundamentally intractable.
Second, one can't always make confident predictions about the reliability of software based upon testing. To do so would require accurate statistical models based upon actual operating conditions. The challenge is that such conditions are seldom known with confidence until after a system is installed! Even if a previous system is in place, enough things may change between the two systems to render old data less than valuable.
Third, even if you can test a product against its specification, it does not necessarily speak to the trustworthiness of the software. Trustworthiness has everything to do with the level of trust we place in a system (of course). Testing can give us some idea concerning its relative reliability, but may still leave us wary with respect to the safety-critical areas of the system.
For as many disclaimers as we've just presented, there still is a role for testing. At the very least every boat should be put in water to see if it floats. No matter how much else you do correctly (such as inspections, reviews, formal methods, and so on) there is still a need to run the software through its paces. Ideally this testing exercises the software in a manner that closely resembles its actual operating environment and conditions. It should also focus strongly on the specific areas identified as potential hazards.
Effective testing in safety-critical software should involve independent validation. Where safety is concerned, there should be no risk of the conflict of interest inherent in a development engineer testing his own code. Even when such an engineer is sincere, blind spots can and do exist, and the same blind spot responsible for the creation of a fault will likely prevent the engineer from designing the tests that would find it. Equally important, when independent validation engineers create a test suite, it should not be made available to development engineers. The rationale is the same one that keeps the GRE administrators from handing you the actual test you're going to take as a study guide!
Situations may arise in which developers must be involved in assessing the quality of the software they designed or implemented. In such cases, the test design should be written independently, development should not have access to the design until after implementation, and the results should be independently reviewed.
Reviews and inspections. One of the most effective methods for eliminating problems early in the product life cycle is to employ reviews and inspections of nearly every artifact created during development of the product. That applies to requirements documents, high-level and low-level design documents, product documentation, and so on. In the case of safety-critical software, these reviews and inspections should be particularly focused around areas of potential hazard.
Case study: Titanic
Clearly, sailing across the Atlantic Ocean carries certain risks, most notably icebergs. Ships had previously sunk after striking one.
The designers of the Titanic had analyzed the hazards, and had discovered that no ship had ever experienced a rupture of more than four chambers. So they built their luxury liner with the capacity to survive up to four compartment ruptures. Marketing dubbed the ship unsinkable, they under-equipped it with life boats, the crew got careless, and they struck an iceberg, rupturing five compartments. The rest, as they say, is history.
A number of reasonable solutions to this tragedy could have been applied at any of the levels we described earlier.
Principle 1: avoid or remove the hazard. They could have chosen not to sail across the Atlantic, though that was not a particularly viable option, since people needed to get across the ocean. People were aware of the risk of crossing the ocean and found it acceptable. The crew could have chosen to take a more southerly route, which presumably was longer. This would have added time and cost, perhaps prohibitively, given the business model under which they were operating at the time. Any of these actions would have mitigated the risk of sailing in the first place, but for the sake of argument we'll assume that crossing the Atlantic needed to happen and that the route was economically necessary. They still could have mitigated risk by following the next principle.
Principle 2: if the hazard is unavoidable, minimize the risk of accident. They could have set course just slightly farther south, avoiding the higher-risk iceberg areas without adding considerably to the travel time for the voyage. They could have put in place more accurate and timely reporting mechanisms for spotting icebergs. They could have been more diligent in the radio shack and in the captain's quarters as iceberg reports came in. The crew of the Titanic took nowhere near enough precaution to minimize the risk of accident. Because of their confidence in the ship's ability to withstand an accident, they didn't work diligently enough to avoid the risks in the first place.
Principle 3: if an accident occurs, minimize the loss of life and/or property. The Titanic should have been designed to withstand a more severe collision. To design for the worst previously observed case is to remain ignorant of the future hazards that don't follow past patterns. Even if the designers were completely confident that the liner could withstand virtually any collision, one must still have failsafes in place in the event that the unthinkable occurs. All else being equal, if there had simply been sufficient life boats for all passengers, the loss of life would have been dramatically reduced.
Case study: Therac-25
The case of the Therac-25 is troubling on several counts. First, the level of personal injury is disturbing. There is nothing pleasant about the damage that a massive radiation overdose can do to a human when concentrated on a small area. Second, the attitude of the manufacturer (Atomic Energy of Canada Limited, or AECL) was unbelievably cavalier in the face of such serious claims of injury.
Even after lawsuits were being settled out of court with the families of deceased patients, AECL continued to deny that the Therac-25 could be the cause of the injuries. Their internal investigations were sloppy, their reports to the relevant government agencies inadequate, and their action plans disturbingly naïve. As an example, their earliest investigations into the accidents did not even consider the possibility that software could have been the root cause of the overdoses, even though software controlled virtually all relevant mechanisms in the machine. Finally, a thorough investigation revealed an incredible inattention to rigor in the software development process at AECL. There appeared to be little in the way of formal software product life cycle in place, with much critical work being done in an ad hoc fashion. Even critical system tests were not necessarily documented or repeatable by engineers.
In Leveson's report, she identifies the following causal factors: overconfidence in software; confusing reliability with safety; lack of defensive design; failure to eliminate root causes; complacency; unrealistic risk assessments; inadequate investigation or follow-up on accident reports; inadequate software engineering practices; software reuse; safe vs. friendly user interfaces; and user and government oversight and standards.5 Let's see. Anything significant missing here? We can't think of anything! It's almost unbelievable that such safety-critical software could be developed in such an ineffective fashion.
Case study: Ariane-5
The core problem in the Ariane failure was incorrect software reuse. A critical piece of software had been reused from the Ariane-4 system, but behaved differently in the Ariane-5 because of differences in the operational parameters of the two rockets. During a data conversion from a 64-bit value to a 16-bit value, an overflow occurred, which resulted in an operand error. Since the code was not designed to handle such an error, the inertial reference system in which the error occurred simply shut down. This caused control to pass to a second, redundant inertial reference system, which, operating under the same information as the first one, also shut down! The failure of these two systems led to the on-board computer misinterpreting diagnostic data as proper flight data, causing a deviation in flight path. This deviation in flight path led to the activation of the rocket's self-destruct mechanism. One of the important lessons from the Ariane-5 failure is that the quality of a device's software must be considered in the context of the entire system. Software by itself has no inherent quality. It must be considered as part of a whole system. It is an important lesson to keep in mind as software reuse continues to be an important trend in software engineering.
Case study: Mars missions
On September 23, 1999, the Mars Climate Orbiter stopped communicating with NASA and is presumed to have either been destroyed in the atmosphere by entering orbit too sharply or to have passed by Mars by entering orbit too shallowly. The root cause of this failure to approach orbit at the right angle was discovered to be an inconsistency in the units of measure used by two modules created by separate software groups. Thruster performance data was computed in English units and fed into a module that computed small forces, but which expected data in metric units. As these errors accumulated over the nine-month journey, the Orbiter approached Mars at a far lower altitude than planned.