Accidents happen. That's just part of life. But when mission- or safety-critical systems fail because of faulty software, serious questions are raised. Despite the risks, software is increasingly making its way into mission- and safety-critical embedded devices. This article explores the challenges inherent in balancing the tremendous flexibility and power provided by embedded software against the risks that arise when software failure leads to loss of life or property. It also explores the root causes of several famous embedded software failures, including the Therac-25, Ariane 5, and recent failed Mars missions.
The problem of safety.
Life is full of risks. That much is obvious. And most risks can be avoided if the cost of avoidance is acceptable. We can avoid ever being involved in an automobile accident simply by never traveling by car. Well, that works for drivers and passengers, but still doesn't necessarily help pedestrians. For pedestrians, avoiding any possibility of an automobile accident would involve staying close to home a great deal of the time, and strictly avoiding sidewalks, driveways, and curbs. That is not a particularly palatable set of choices for most of us.
And so we learn to live with the inherent risks that surround us, because the cost of avoidance just seems too high. However, as technology becomes more and more ubiquitous, with more of that technology being controlled by software, a greater portion of the risk we face is ultimately in the hands of software engineers.
Most of the time, the risks we face don't bear fruit. But when they do, we call the event an accident or a mishap. The kinds of accidents we're primarily concerned with in this article are the type that lead to personal injury, loss of life, or unacceptable loss of property.
So how significant has the risk become? Consider that software is now commonly found in the control systems of nuclear power plants, commercial aircraft, automobiles, medical devices, defense systems, and air traffic control systems. It's likely that almost everyone has at one time or another put themselves, their lives, or their property into the hands of the engineers who built the software controlling these systems. The spread of software into safety-critical systems is likely to continue, despite the risks.
Hazards, accidents, and risks
A hazard is a set of conditions, or a state, that could lead to an accident, given the right environmental trigger or set of events. An accident is the realization of the
negative potential inherent in a hazard. For example, a pan of hot water on a stove is a hazard if its handle is accessible to a toddler. But there's no accident until the toddler reaches up and grabs the handle.
In software, faults or defects are errors that exist within a system, while a failure is an error or problem that is observable in the behavior of the system. A fault can lie dormant in a software system for years before the right set of environmental conditions causes the problem to manifest itself in the functioning system (think Y2K). (See endnote 1.)
Our ultimate concern here is not whether hazards should exist, or whether software faults should exist. You can debate whether they should or should not, and you can even argue whether it's theoretically possible to eliminate them at all. But the reality is that they do exist, that they represent risk, and that we have to deal with that risk.
Risks can be addressed at three fundamental levels:
- The likelihood that a hazard will occur
- If a hazard occurs, the likelihood that the hazard will lead to an accident
- If an accident occurs, the level of loss associated with the accident
As we build safety-critical software, we need to be concerned with mitigating risk at each of these levels.
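One way to read the three levels is as factors in a simple expected-loss estimate. The following is a minimal sketch under stated assumptions: the function name, the multiplicative model, and every number are invented for illustration and are not taken from any real system.

```python
def expected_loss(p_hazard, p_accident_given_hazard, loss_per_accident):
    """Expected loss = P(hazard) * P(accident | hazard) * loss per accident."""
    for p in (p_hazard, p_accident_given_hazard):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_hazard * p_accident_given_hazard * loss_per_accident


# Hypothetical baseline figures (all assumed for the example).
baseline = expected_loss(0.10, 0.50, 1_000_000)

# Mitigating at each level in turn, leaving the other two factors unchanged.
fewer_hazards = expected_loss(0.05, 0.50, 1_000_000)    # level 1: hazard less likely
fewer_accidents = expected_loss(0.10, 0.25, 1_000_000)  # level 2: hazard less likely to cause an accident
smaller_losses = expected_loss(0.10, 0.50, 500_000)     # level 3: accident less costly
```

In this toy model, halving any one factor halves the expected loss, which is why mitigation at all three levels is worth considering rather than at only one.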
Software in safety-critical devices
When we build safety-critical software, it is imperative that we ensure an acceptable level of risk. That doesn't mean that risk
won't exist. But we will have taken care at each of the three levels to eliminate the risks where possible and to reduce the risks that are unavoidable. In doing so, we must concern ourselves with the interaction of the controlling software with the rest of the system. Software, by itself, never poses a threat to life or limb. It needs some help from mechanical systems to do that.
In assessing the level of risk inherent in turning over safety-critical control functions to software, it is valuable to compare the track record of software with other types of engineering. Software is indeed unique in many ways. As Frederick Brooks pointed out, certain essential characteristics of software are unique and challenging [1]. As a result, when compared to other engineering fields, software tends to have more errors, those errors tend to be more pervasive, and they tend to be more troublesome. In addition, it is difficult to predict the failure of software because it doesn't gracefully or predictably degrade with use (the way tires or brake shoes gradually wear out until it's time to replace them). Software may break immediately upon installation due to unforeseen environmental or usage conditions. It may work reliably until a user tries something unexpected. It may work well for years until some operating condition suddenly changes. It may fail intermittently as sporadic environmental conditions come and go.
With all of these legitimate concerns over software, the question arises, "Why use software in safety-critical systems at all?" If a risk is worth taking, it's because some return or advantage accompanies the risk. One of the most significant reasons for using software in embedded devices (whether safety-critical or not) is that a higher level of sophistication and control can be achieved at a lower cost than is possible with hard-wired electronics or custom-designed mechanical features. As we come to expect more out of embedded devices, software is currently the best way to keep up with the steep growth in complexity.
In a changing environment, devices must either adapt or face early obsolescence. Hard-wired devices must be either replaced or face expensive upgrades. On the other hand, software can be upgraded relatively easily, without swapping out expensive hardware components.
Finally, because of the power and flexibility of software, devices can deliver a great deal of information to users and technicians. Such
software-controlled devices can gather useful information, interpret it, perform diagnostics, or present more elegant interfaces to the user, at a more acceptable cost than is possible with hardware.
For these reasons, tremendous value and power lie in using software to control embedded devices. Still, we need to clearly understand the risks. By understanding the nature of software, we may more effectively build embedded control software while minimizing the risks.
In his article, Brooks states that software has "essential" properties as well as "accidental" ones [1]. The essential properties are inherent, and in a sense, unremovable or unsolvable. They represent the nature of the beast. The accidental properties are coincidental, perhaps just the result of an immature field, and might be solved over time. The following sections identify some of the essential properties of software (see endnote 2). In order to build safe software, each of these must be dealt with to minimize risk.
Complexity.
Software is generally more complex than hardware. The most complex hardware tends to take the form of general purpose microprocessors. The variety of software that can be written for these hardware systems is almost limitless, and the complexity of such software can dwarf the complexity of the hardware on which it depends. Consider that software systems consist not only of programs (which may have an infinite number of possible execution paths), but also of data, whose possible states may outnumber by many orders of magnitude the hardware states present in our most complex integrated circuits.
Other complex hardware takes the form of ASICs (application-specific integrated circuits), but these are often essentially general purpose microprocessors with accompanying system-specific control software. In such cases, it's still common for the complexity of the software to dwarf that of the hardware.
Error sensitivity.
Software can be extremely sensitive to small errors. It has been said that if architects built houses the way software engineers build software, the first woodpecker that came along would destroy civilization. While the joke stings, it's part of the nature of software that small errors can have huge impacts. In other fields, there is a notion of "tolerance." For example, some play typically exists in the acceptable range of tensile strength of a mechanical part. There's little in the way of an analogous concept in software. There's no concept that the software is still fit if some small percentage of the bits change. In some situations the change of a single bit in a program can mean the difference between successful execution and catastrophic failure.
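The single-bit claim is easy to demonstrate on stored data. The sketch below flips one bit in the IEEE 754 encoding of a double-precision number; the `flip_bit` helper and the `dose` setpoint are hypothetical names chosen for the example.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


dose = 2.0  # hypothetical setpoint, assumed for illustration

low_bit = flip_bit(dose, 0)    # lowest mantissa bit: a negligible change
sign_bit = flip_bit(dose, 63)  # sign bit: the value becomes negative
zeroed = flip_bit(dose, 62)    # top exponent bit: the value collapses to zero
huge = flip_bit(dose, 61)      # another exponent bit: the value becomes enormous
```

Depending on which bit changes, the same one-bit error is either invisible or turns a small setpoint into zero, a negative number, or an astronomically large one. Nothing here resembles the gradual tolerance band of a mechanical part.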
Difficult to test.
For most real software systems, complete and exhaustive testing is an intractable problem. A program consisting of only a few hundred lines of code may require an effectively infinite amount of testing to exhaustively cover all possible cases. Consider a single loop that waits for a key press. What happens if the user presses a key during the first iteration? The second? The third? One can argue that all subsequent iterations of that loop are part of an equivalence class, and the argument would probably be valid. But what if something catastrophic occurs only if the key is pressed during the one millionth time through? Testing isn't going to discover that until the millionth test case. Not likely to happen.
All testing deals with risk management, and all testers understand the near impossibility of exhaustive testing. And so they deal with equivalence classes based upon assumptions of continuous functions. But when functions suddenly show themselves to be non-continuous (as with the Pentium floating-point division bug), you still have a problem.
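The key-press loop can be made concrete with a toy model. In this sketch the handler, the faulty iteration number, and both test plans are hypothetical, made up purely to show how a plan built on equivalence classes can pass while the defect remains.

```python
FAULT_ITERATION = 1_000_000  # hypothetical: the one iteration that misbehaves


def handle_key_press(iteration: int) -> str:
    """Hypothetical handler that is correct on every iteration except one."""
    if iteration == FAULT_ITERATION:
        return "FAILURE"
    return "ok"


def run_tests(iterations) -> bool:
    """Return True if every tested iteration behaves correctly."""
    return all(handle_key_press(i) == "ok" for i in iterations)


# An equivalence-class plan: the first few iterations plus a couple of "later" ones.
sampled_plan_passes = run_tests([1, 2, 3, 10, 1_000])

# Only a plan that actually reaches the faulty iteration exposes the defect.
exhaustive_plan_passes = run_tests(range(1, FAULT_ITERATION + 1))
```

The sampled plan reports success, and nothing in its results hints that the assumption behind the equivalence class is false. In a real system the exhaustive plan is rarely affordable, and the faulty iteration is not known in advance.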
Correlated failures.
Finding the root cause of failures can be extremely challenging with software. Mechanical engineers (and even electrical engineers) are
often concerned with manufacturing failures, and the rates and conditions that lead things to wear out. But software doesn't really wear out. The bits don't get weak and break. It is true that certain systems can become cluttered with incidental detritus (think Windows 9x), but they don't wear out in the same way a switch or a hard drive will. Most of the failures in software are actually design errors. One can attempt to avoid these failures with redundant systems, but those systems simply duplicate the same
design error, which doesn't help much. One can also attempt to avoid these failures by employing competing designs of a system, but the backup may suffer from the same blind spots as the original, despite the fresh design. Or even more pernicious, the backup may suffer from new and creative blind spots, different from the first, but equally harmful.
Lack of professional standards.
Software engineering is very much a fledgling field. Individuals who once proudly proclaimed themselves to be "computer programmers" are now typically mildly insulted at the notion of being only a "programmer" and tend to prefer to be called "software engineers." But there are really few, if any, "software engineers" in practice. There are no objective standards for the engineering of software, nor is there any accrediting agency for licensing professional software engineers. In a sense, any programmer can call themselves a software engineer, and there's no objective way to argue with that. Steve McConnell argues for the creation of a true discipline of software engineering [3]. Given our increasing dependency on the work of these "engineers" for our lives and property, the idea of licensing software engineers is increasingly appealing.
Approaches to safety
The previous section laid out some serious issues associated with software. We would argue that the first four (complexity, error sensitivity, difficulty of testing, and correlated failures) are essential to the nature of software, and aren't going away any time soon. The fifth one (lack of professional standards) can certainly be resolved under the proper social conditions.
So if software is so difficult to get right, do we stand a fighting chance? Of course. But the first step is to understand where the challenges lie. Then we can reasonably pursue solutions. The value of software in safety-critical systems is huge, but it has to be balanced
against the risks. The following sections deal with approaches that may hold promise as we seek to improve the quality and safety of software in embedded systems.
Hazard analysis.
In the old "George of the Jungle" cartoons, our hero routinely smacked face first into oncoming trees. Yes, he was engaging in a relatively hazardous activity (swinging through the jungle on vines), but the problem was typically one of inattention to the events in progress. In other words, he performed relatively poor hazard analysis.
The keys to hazard analysis involve, first of all, being aware of the hazards. As obvious as that is, a certain percentage of accidents could have been avoided simply by being aware of potential hazards in the first place. Once a hazard has been identified, two things need to be assessed: the likelihood of an accident stemming from it, and the criticality of that accident should it occur. Once the hazards are understood at this level, devices can be designed that either eliminate the hazards or control them to avoid accidents. The process of risk management must be ongoing, constantly gauging the derived value against the potential risk.
In order to build safe embedded systems, hazards must be discovered early in the software life cycle. Safety-critical areas must be identified so that extra care can be given to exploring the implications of the application of software to this particular domain. Within these safety-critical areas, specific potential hazards must be identified.
These analyses become foundational pieces that feed into the design of the system. The software can now be designed in such a way that these potential hazards can either be avoided or controlled.
A number of approaches can be used to discover potential hazards, including subsystem hazard analysis, system hazard analysis, design walkthroughs, checklists, fault tree analysis, event tree analysis, and cause-consequence diagrams (which use fault and event trees). (See endnote 3.)
Once you understand the potential hazards within a system, design decisions can be made to mitigate the risks associated with these hazards. The following are examples:
- Automatic controls can be built in to handle hazardous conditions. For example, home electrical systems have breakers that will break a circuit if the draw of current becomes too great. This provides a mechanism to protect against electrocution or fire hazards. Similarly, an embedded device may have hardware or mechanical overrides for certain safety-critical features, rather than depending strictly on software logic for protection.
- Lockouts are mechanisms or logic designed to prevent entrance into an unsafe state. In software, a particular safety-critical section of code may be protected by some access control mechanism that will permit entrance into the critical section only when doing so would not put the system into an unsafe state.
- Lockins are similar mechanisms that enforce the continuation of a safe state. As an example, a lockin might reject any input or stimulus that would cause a currently safe state to be compromised.
- Interlocks are mechanisms that constrain a sequence of events in such a way that a hazard is avoided. As an example, most new automobiles require that the brake pedal be depressed before the key can be turned to start the car. This is designed to avoid the hazard of children turning the key in an ignition when they are too small to control or stop the vehicle.
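The lockout, lockin, and interlock ideas from the list above can be sketched in software as guards on state transitions. The controller below is entirely hypothetical (the class, its states, and its method names are assumptions for illustration), and in a real device such guards would normally sit alongside hardware protections rather than replace them.

```python
class UnsafeOperation(Exception):
    """Raised when a request would violate a safety constraint."""


class BeamController:
    """Hypothetical controller illustrating a lockout, a lockin, and an interlock."""

    def __init__(self):
        self.shield_closed = False
        self.armed = False
        self.beam_on = False

    def close_shield(self):
        self.shield_closed = True

    def open_shield(self):
        # Lockin: while the beam is on, reject input that would break the safe state.
        if self.beam_on:
            raise UnsafeOperation("cannot open shield while beam is on")
        self.shield_closed = False
        self.armed = False

    def arm(self):
        # Interlock: enforce the sequence "close shield, then arm".
        if not self.shield_closed:
            raise UnsafeOperation("close shield before arming")
        self.armed = True

    def fire(self):
        # Lockout: refuse to enter the hazardous state unless preconditions hold.
        if not (self.shield_closed and self.armed):
            raise UnsafeOperation("beam locked out")
        self.beam_on = True

    def stop(self):
        self.beam_on = False
        self.armed = False
```

A caller that follows the safe sequence (`close_shield()`, `arm()`, `fire()`, `stop()`, `open_shield()`) never sees an exception, while any out-of-order request is rejected before the state changes.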
Testing.
Testing involves actually running a system in a specifically orchestrated fashion to see what it will do in a given situation. A number of challenges are inherent in testing, and they strike at the heart of why it's essential that quality be built into software as it's created, rather than tested in after the fact. The following are a number of dangerous assumptions that are frequently made, and which testing will not fix:
- The software specification is correct. If it is not correct, verifying that a software implementation matches its specification may not actually provide information about the risks that result from prospective hazards.
- It is possible to predict the usage environment of the system. Certainly, much can be known about the environment, but it's not possible to predict the actual usage. Failures can happen as a result of changes in things as simple as operator typing speed and ambient room temperature.
- It is possible to create an operational profile to test against and assess reliability. Again, there is a great deal that can be predicted, but one can never completely and accurately predict the actual operational profile before the fact.
Even if we are wary of these dangerous assumptions, we still have to recognize the limitations inherent in testing as a means of bringing quality to a system. First of all, testing cannot prove correctness. In other words, testing can show the presence of defects, but not their absence. The only way to prove correctness via testing would be to hit all possible states, which, as we've stated previously, is fundamentally intractable.
Second, one can't always make confident predictions about the reliability of software based upon testing. To do so would require accurate statistical models based upon actual operating conditions. The challenge is that such conditions are seldom known with confidence until after a system is installed! Even if a previous system is in place, enough
things may change between the two systems to render old data less than valuable.
Third, even if you can test a product against its specification, it does not necessarily speak to the trustworthiness of the software. Trustworthiness has everything to do with the level of trust we place in a system (of course). Testing can give us some idea concerning its relative reliability, but may still leave us wary with respect to the safety-critical areas of the system.
For as many disclaimers as we've just presented, there still is a role for testing. At the very least every boat should be put in water to see if it floats. No matter how much else you do correctly (such as inspections, reviews, formal methods, and so on) there is still a need to run the software through its paces. Ideally this testing exercises the software in a manner that closely resembles its actual operating environment and conditions. It should also focus strongly on the specific areas identified as potential hazards.
Effective testing in safety-critical software should involve independent validation. Where safety is concerned, there should be no risk of the conflict of interest inherent in development engineers testing their own code. Even when such an engineer is sincere, blind spots can and do exist. The same blind spot responsible for the creation of a fault will likely lead the engineer to not find that fault through the design of tests. Equally important, when independent validation engineers create a test suite, it should not be made available to development engineers. The rationale is the same one that keeps the administrators of the GRE from handing you the actual test you're going to take as a study guide.
Situations may arise in which developers must be involved in assessing the quality of the software they designed or implemented. In such cases, the test design should be written independently, development should not have access to the design until after implementation, and the results should be
independently reviewed.
Reviews and inspections.
One of the most effective methods for eliminating problems early in the product life cycle is to employ reviews and inspections of nearly every artifact created during development of the product. That applies to requirements documents, high-level and low-level design documents, product documentation, and so on. In the case of safety-critical software, these reviews and inspections should be particularly focused around areas of potential hazard.
Case study: Titanic
The Titanic provides an interesting case study because it's particularly well known and the root causes are reasonably well understood. It is particularly interesting to look at the three principles of risk analysis and apply them to the famous ocean liner.
Clearly, sailing across the Atlantic Ocean carries certain risks, most notably icebergs. Ships had previously sunk after striking one.
The designers of the Titanic had analyzed the hazards, and had discovered that no ship had ever experienced a rupture of more than four chambers. So they built their luxury liner with the capacity to survive up to four compartment ruptures. Marketing dubbed the ship unsinkable, it was under-equipped with life boats, the crew got careless, and the ship struck an iceberg, rupturing five compartments. The rest, as they say, is history.
A number of reasonable solutions to this tragedy could have been applied at any
of the levels we described earlier.
Principle 1: avoid or remove the hazard.
They could have chosen not to sail across the Atlantic, but that was not a particularly viable option since people needed to get across the ocean. People were aware of the risk of crossing the ocean and found it acceptable. The crew could have chosen to take a more southerly route, which presumably was longer. This would have added time and cost, perhaps prohibitive given the business model under which they were operating at the time. Any of these actions would have mitigated the risk of sailing in the first place, but for the sake of argument we'll assume that crossing the Atlantic needed to happen and that the route was economically necessary. They still could have mitigated risk by following the next principle.
Principle 2: if the hazard is unavoidable, minimize the risk of accident.
They could have set course just slightly farther south, avoided the higher risk areas of icebergs and not added considerably to
the travel time for the voyage. They could have put in place more accurate and timely reporting mechanisms for spotting icebergs. They could have been more diligent in the radio shack and in the captain's quarters as iceberg reports came in. There was not nearly enough precaution taken by the crew of the Titanic to minimize the risk of accident. Because of their confidence in their ability to withstand an accident, they didn't work diligently enough to avoid the risks prior to that point.
Principle 3: if an accident occurs, minimize the loss of life and/or property.
The Titanic should have been designed to withstand a more severe collision. To design for the worst previously observed case is to remain ignorant of the future hazards that don't follow past patterns. Even if the designers were completely confident that the liner could withstand virtually any collision, one must still have failsafes in place in the event that the unthinkable occurs. All else being equal, if there had simply been
sufficient life boats for all passengers, the loss of life would have been dramatically reduced.
Case study: Therac-25
The Therac-25 was the infamous medical linear accelerator that massively overdosed six radiation therapy patients over a two-year period (see endnote 4). Three of those patients died shortly after their treatments from complications immediately attributable to the radiation overdoses.
The case of the Therac-25 is troubling on several counts. First, the level of personal injury is disturbing. There is nothing pleasant about the damage that a massive radiation overdose can do to a human when concentrated on a small area. Second, the attitude of the manufacturer (Atomic Energy of Canada Limited, or AECL) was unbelievably cavalier in the face of such serious claims of injury.
Even after lawsuits were being settled out of court with the families of deceased patients, AECL
continued to deny that the Therac-25 could be the cause of the injuries. Their internal investigations were sloppy, their reports to the relevant government agencies inadequate, and their action plans disturbingly naive. As an example, their earliest investigations into the accidents did not even consider the possibility that software could have been the root cause of the overdoses, even though software controlled virtually all relevant mechanisms in the machine. Finally, a thorough investigation revealed an
incredible inattention to rigor in the software development process at AECL. There appeared to be little in the way of formal software product life cycle in place, with much critical work being done in an ad hoc fashion. Even critical system tests were not necessarily documented or repeatable by engineers.
In her report, Leveson identifies the following causal factors: overconfidence in software; confusing reliability with safety; lack of defensive design; failure to eliminate root causes; complacency; unrealistic risk assessments; inadequate investigation or follow-up on accident reports; inadequate software engineering practices; software reuse; safe vs. friendly user interfaces; and user and government oversight and standards (see endnote 5).
Let's see. Anything significant missing here? We can't think of anything! It's almost unbelievable that such safety-critical software could be developed in such an ineffective fashion.
Case study: Ariane-5
Ariane-5 was the newest in a family of rockets designed to carry satellites into orbit. On its maiden launch on June 4, 1996, it flew for just under 40 seconds before self-destructing, destroying the rocket and its payload of four satellites. In contrast to the slipshod approach taken by AECL in the Therac-25 incident, Ariane immediately set up an Inquiry Board that conducted a thorough investigation and discovered the root cause of the accident [7].
The core problem in the Ariane failure was incorrect software reuse. A critical piece of software had been reused from the Ariane-4 system, but behaved differently in the Ariane-5 because of differences in the operational parameters of the two rockets. During a data conversion from a 64-bit floating-point value to a 16-bit signed integer, an overflow occurred, which resulted in an operand error. Since the code was not designed to handle such an error, the inertial reference system in which the error occurred simply shut down. This caused control to pass to a second, redundant inertial reference system, which, operating under the same information as the first one, also shut down! The failure of these two systems led to the on-board computer misinterpreting diagnostic data as proper flight data, causing a deviation in flight path. This deviation in flight path led to the activation of the rocket's self-destruct mechanism.
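The narrowing conversion at the heart of this failure can be illustrated in a few lines. This is not the flight code; it is a minimal sketch, with hypothetical function names, of three policies a designer can choose among when a wide value must fit into a signed 16-bit integer.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(value: float) -> int:
    """Keep only the low 16 bits, as an unguarded narrowing might; wraps silently."""
    wrapped = int(value) & 0xFFFF
    return wrapped - 0x10000 if wrapped >= 0x8000 else wrapped


def to_int16_saturating(value: float) -> int:
    """Clamp out-of-range values to the nearest representable 16-bit integer."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


def to_int16_checked(value: float) -> int:
    """Raise, so that the caller must decide how to handle an out-of-range value."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return result
```

None of the three is safe by itself. A silent wrap produces a plausible but wrong number, saturation hides the fact that the input left its expected range, and a raised error is only useful if some caller handles it sensibly. Whichever policy is chosen has to be re-examined when the code is reused in a system whose inputs have a different range.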
One of the important lessons from the Ariane-5 failure is
that the quality of a device's software must be considered in the context of the entire system. Software by itself has no inherent quality. It must be considered as part of a whole system. It is an important lesson to keep in mind as software reuse continues to be an important trend in software engineering.
Case study: Mars missions
Two recent back-to-back failures by NASA's Jet Propulsion Laboratory (JPL) have drawn quite a bit of news coverage. In the first case, the Mars Climate Orbiter (MCO) was launched on December 11, 1998 and spent nine months traveling toward Mars. Its purpose upon arrival was to orbit Mars as the first interplanetary weather satellite. In addition, it was to provide a communications relay for the Mars Polar Lander (MPL), which was scheduled to reach Mars three months later, in December of 1999 [8].
On September 23, 1999, the Mars Orbiter stopped communicating with NASA and is presumed either to have been destroyed in the atmosphere by approaching too steeply or to have passed by Mars by approaching too shallowly. The root cause of this failure to enter orbit at the right angle was discovered to be an inconsistency in the units of measure used by two modules created by separate software groups. Thruster performance data was computed in English units and fed into a module that computed small forces, but which expected data to be in metric units. As the resulting errors accumulated over the nine-month journey, the Orbiter approached Mars on an incorrect trajectory and was lost.
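A common defense against this kind of mismatch is to make the unit part of the data type, so that a bare number cannot cross a module boundary. The sketch below assumes the quantity exchanged is an impulse and uses hypothetical names (`Impulse`, `accumulate_small_forces`); it is an illustration of the idea, not a reconstruction of the mission software.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605  # one pound-force second expressed in newton seconds


@dataclass(frozen=True)
class Impulse:
    """An impulse that is always stored in newton seconds (SI)."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens exactly once, at the point where the unit is known.
        return cls(value * LBF_S_TO_N_S)


def accumulate_small_forces(impulses) -> float:
    """Hypothetical consumer: sums impulses and returns the total in newton seconds."""
    total = 0.0
    for impulse in impulses:
        if not isinstance(impulse, Impulse):
            raise TypeError("expected Impulse, got a bare number with no unit")
        total += impulse.newton_seconds
    return total
```

With this design, a producer working in English units has to call `Impulse.from_pound_force_seconds(...)`, and a consumer that receives a plain float fails immediately instead of quietly absorbing a factor-of-4.45 error into months of accumulated data.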
In the preliminary report after the loss of the Mars Orbiter, Dr. Edward Stone, director of JPL, issued the following statement: "Special attention is being directed at navigation and propulsion issues, and a fully independent 'red team' will review and approve the closure of all subsequent actions. We are committed to doing whatever it takes to maximize the prospects for a successful landing on Mars on Dec. 3." [9]
This statement turned out to be an unfortunate bit of foreshadowing.
On January 3, 1999 (approximately one month after the Mars Orbiter was launched), NASA launched three spacecraft using a single launch vehicle: the Mars Polar Lander (MPL) and two Deep Space 2 (DS2) probes. The Mars Lander was to land on the surface of the planet and perform experiments for 90 days. The Deep Space probes were to be released above the planet surface and drop through the atmosphere, embedding themselves beneath the surface. According to plan, these three spacecraft ended communications as they prepared to enter the atmosphere of Mars on December 3, 1999. After arriving on the planet, they were to resume communication on the evening of December 4, 1999. Communication was never reestablished [10].
As in the case of the Mars Orbiter, NASA conducted a very thorough investigation, and explored a number of possible causes. The most probable cause seems to be the generation of spurious signals when the Lander's legs were deployed during descent. Spurious signals could give the Lander a false indication that it had landed, causing the engines to shut down. Of course, shutting down the engines before the Lander had actually landed would result in the spacecraft crashing into the surface of Mars. The following root cause analysis is from NASA's report:
"It is not uncommon for sensors involved with mechanical operations, such as the lander leg deployment, to produce spurious signals. For MPL, there was no software requirement to clear spurious signals prior to using the sensor information to determine that landing had occurred. During the test of the lander system, the sensors were incorrectly wired due to a design error. As a result, the spurious signals were not identified by the systems test, and the systems test was not repeated with properly wired touchdown sensors. While the most probable direct cause of the failure is premature engine shutdown, it is important to note that the underlying cause is inadequate software design and systems test." [11]
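The missing requirement described in the report, clearing spurious signals before trusting the sensor, can be sketched as a small filter. Everything here is an assumption made for illustration: the class name, the idea of an explicit enable step late in the descent, and the choice of two consecutive readings as the threshold.

```python
class TouchdownMonitor:
    """Hypothetical touchdown detector that tolerates transient sensor signals."""

    def __init__(self, required_consecutive: int = 2):
        self.required_consecutive = required_consecutive
        self.enabled = False
        self._consecutive = 0

    def enable(self):
        # Start from a clean count so nothing carried over from an earlier
        # phase of the descent can contribute to a touchdown decision.
        self._consecutive = 0
        self.enabled = True

    def sample(self, sensor_active: bool) -> bool:
        """Feed one sensor reading; return True when engine shutdown is permitted."""
        if not self.enabled:
            # Readings taken before enabling (e.g. during leg deployment) are ignored.
            return False
        self._consecutive = self._consecutive + 1 if sensor_active else 0
        return self._consecutive >= self.required_consecutive
```

A transient seen before `enable()` is discarded, and a single isolated reading after it is not enough to permit shutdown. A test of such logic is only meaningful if the sensors feeding it are wired as they will be in flight, which is the second half of the report's finding.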
The theme of inadequate software design, verification, and validation is certainly a common one in these and most other failures of safety-critical software.
What level of risk?
Few systems are completely free of risk. What is required for a system to be usable is that it have acceptable risk. The level of risk that is acceptable will vary with the type of system and the potential losses [12].
Builders of safety-critical software must be aware of the principles of risk, and understand how to mitigate risk at each of the three levels described earlier. Doing so will almost certainly involve the application of formal verification and validation methods, in addition to effective system tests.
How does building safety-critical software differ from building any other kind of high quality software? In many ways it doesn't differ at all. Many of the same principles and methods should be applied in either case. What sets safety-critical software apart is that the risks involved are potentially huge in terms of life and property. That should mean that the amount worth investing in building it right should be much greater. Schedule must be second to quality, particularly in those parts of a
system that pose the highest risk. Safety-critical software must be held to the highest possible standards of engineering discipline.
Charles D. Knutson is an assistant professor of computer science at Brigham Young University in Provo, Utah. He holds a PhD in computer science from Oregon State University. You can contact him at knutson@cs.byu.edu.
Sam Carmichael is a validation engineer at Micro Systems Engineering. Contact him at carmicha@biotronik.com.
Endnotes
1. It's common to lump defects and failures together into the imprecise term "bug." Most people mean "failure" when they say "bug," because there was an observable problem in the software that could be demonstrated. But we are equally concerned with "bugs" that have not yet manifested themselves.
2. These properties were originally identified in [2] and are still significant problems in safety-critical software today. That they haven't been eliminated as concerns speaks to their essential nature.
3. A detailed discussion of these techniques is beyond the scope of this paper. Refer to Nancy Leveson's book, Safeware: System Safety and Computers, for more information on these and other techniques.
4. In [5], Nancy Leveson gives an extremely detailed and thorough report of the investigation she and Clark Turner conducted [6] into the Therac-25.
5. As a postscript, currently the primary business of AECL is the design and installation of nuclear reactors.
References
[1] Brooks, Frederick P. The Mythical Man-Month, 20th Anniversary Edition. Reading, MA: Addison-Wesley, 1995.
[2] Parnas, David L., A. John van Schouwen, and Shu Po Kwan. "Evaluation of Safety-Critical Software," Communications of the ACM, June 1990, pp. 636-648.
[3] McConnell, Steve. After the Gold Rush: Creating a True Profession of Software Engineering. Redmond, WA: Microsoft Press, 1999.
[4] Leveson, Nancy G. Safeware: System Safety and Computers. Reading, MA: Addison-Wesley, 1995.
[5] Leveson, Nancy G. "Appendix A, Medical Devices: The Therac-25 Story," Safeware: System Safety and Computers. Reading, MA: Addison-Wesley, 1995, pp. 515-553.
[6] Leveson, Nancy G. and Clark S. Turner. "An Investigation of the Therac-25 Accidents," IEEE Computer, July 1993, pp. 18-41.
[7] Lions, Jacques-Louis. "Ariane 5, Flight 501 Failure, Report of the Inquiry Board." www.esrin.esa.it/htdocs/tidc/Press/Press96/ariane5rep.html.
[8] "Mars Climate Orbiter, Mishap Investigation Board, Phase I Report." NASA, November 10, 1999.
[9] "Mars Climate Orbiter Failure Board Releases Report, Numerous NASA Actions Underway in Response." NASA press release, November 10, 1999, mars.jpl.nasa.gov/msp98/news/mco991110.html.
[10] "Report on the Loss of the Mars Polar Lander and Deep Space 2 Missions." JPL, March 22, 2000.
[11] "Mars Program, Independent Assessment Team, Summary Report." NASA, March 14, 2000.
[12] Leveson, Nancy G. "Software Safety in Embedded Computer Systems," Communications of the ACM, February 1991, pp. 35-46.