Applying Bayesian belief networks to fault tree analysis of safety critical software

Chris Hobbs

August 10, 2010

When implementing a safety- or mission-critical application, one must demonstrate that the application meets its availability and reliability requirements: availability concerns how often the application returns a response, and reliability concerns how often that response is correct. Typically, this argument is supported by a combination of hard and soft evidence, and of evidence argued from cause to effect and from effect to cause.

A full safety case contains evidence from many sources, including a failure analysis, commonly expressed either as a Failure Mode, Effects, and Criticality Analysis (FMECA) or as a Fault Tree Analysis (FTA).

FMECA is an inductive analysis of system failure, starting with the presumed failure of a component and analyzing its effect on system stability: “What will happen if valve A sticks open?” In contrast, FTA is a deductive analysis, starting with potential or actual failures and deducing what might have caused them: “What could cause a deadlock in the application?”

Either technique can benefit greatly from the use of a Bayesian Belief Network, which provides a framework for the analysis and reduces the time to reach a quantified result.

The Fault Tree
Fault trees encapsulate the concept that the failure of a (sub)system can result from the failure of lower-level (sub)systems. For instance:

— X fails if both Y and Z fail (Y and Z may be identical units, either of which can carry the system on its own if required).
— X fails if either Y or Z (or both) fails (Y and Z may be units that act serially, and the failure of either breaks the chain).
— X fails if any two of Y, Z, and T fail.
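The three combinations above correspond to an AND gate, an OR gate, and a two-out-of-three voting gate. As a minimal sketch, assuming each input is a Boolean in which `True` means "this unit has failed", they could be written as follows. The function names are illustrative and do not come from any particular fault tree tool.

```python
def and_gate(*inputs):
    """The output fails only if every input has failed."""
    return all(inputs)


def or_gate(*inputs):
    """The output fails if at least one input has failed."""
    return any(inputs)


def k_of_n_gate(k, *inputs):
    """The output fails if at least k of the inputs have failed."""
    return sum(bool(x) for x in inputs) >= k


# Example: Y and T have failed, Z is still working.
y, z, t = True, False, True

print(and_gate(y, z))           # X needs both Y and Z to fail
print(or_gate(y, z))            # X fails if either Y or Z fails
print(k_of_n_gate(2, y, z, t))  # X fails if any two of Y, Z, T fail
```

With these inputs, the redundant pair (AND) survives, while the serial chain (OR) and the two-out-of-three arrangement both fail.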

Figure 1. A very simple fault tree. Numbers represent failures and letters identify the simplest system events.

The example in Figure 1 above illustrates the basic concepts of a fault tree:

— The system will fail if both failure 3 and failure 2 occur.
— Failure 2 will occur if either event C or event D (or both) occurs.
— Failure 3 will occur if either failure 1 or event E (or both) occurs.
— Failure 1 will occur if both event A and event B occur.
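The four rules above can be transcribed directly into a Boolean function. The sketch below assumes only the gate structure stated in the list; the function and variable names are illustrative.

```python
def system_fails(a, b, c, d, e):
    """Evaluate the Figure 1 tree; each argument is True if that event occurs."""
    failure_1 = a and b          # Failure 1: both A and B
    failure_2 = c or d           # Failure 2: either C or D
    failure_3 = failure_1 or e   # Failure 3: either failure 1 or E
    return failure_3 and failure_2  # System: both failure 3 and failure 2


# Events C and E alone are enough to bring the system down...
print(system_fails(a=False, b=False, c=True, d=False, e=True))

# ...whereas A, B, and E together are not, because failure 2 never occurs.
print(system_fails(a=True, b=True, c=False, d=False, e=True))
```

Evaluating the function for a given combination of events answers the question "does this combination cause a system failure?", which is the starting point for the cut-set analysis.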

Generally, the events A to E represent types of failure of system components. Once the tree is drawn, the minimal cut sets can be identified. A cut set is a set of lowest-level events that, if all of them occur, cause the system to fail. In Figure 1, { A, B, C, D, E } is a cut set because, if these events all happened, the system would fail.

A minimal cut set is a cut set that cannot be reduced: removing any one event from it leaves a set that no longer guarantees system failure. For example, in Figure 1 the minimal cut sets are { E, C }, { E, D }, { A, B, C }, { A, B, D }. The events listed in any one of these sets will together cause a system failure. Figure 1 is, of course, very simple; for realistic trees, computer programs are needed to identify minimal cut sets.
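For a tree as small as Figure 1, the minimal cut sets can be found by brute force. The sketch below is an illustration only, not the method used by production fault tree tools: it tries every combination of events in order of increasing size and keeps a combination if it causes failure and contains no smaller cut set already found. This works because the tree uses only AND and OR gates, so adding events can never turn a failure into a success. All names are illustrative.

```python
from itertools import combinations

EVENTS = "ABCDE"


def system_fails(occurred):
    """Evaluate the Figure 1 tree for a set of event names that have occurred."""
    a, b, c, d, e = (name in occurred for name in EVENTS)
    failure_1 = a and b
    failure_2 = c or d
    failure_3 = failure_1 or e
    return failure_3 and failure_2


def minimal_cut_sets():
    """Enumerate event combinations, smallest first, keeping only minimal cut sets."""
    found = []
    for size in range(1, len(EVENTS) + 1):
        for combo in combinations(EVENTS, size):
            candidate = set(combo)
            # Skip any candidate that already contains a smaller cut set.
            if any(smaller <= candidate for smaller in found):
                continue
            if system_fails(candidate):
                found.append(candidate)
    return found


for cut_set in minimal_cut_sets():
    print(sorted(cut_set))
```

The search examines up to 2^n combinations for n events, which is why this approach does not scale to realistic trees and dedicated algorithms are used instead.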

Knowledge of the minimal cut sets allows the analyst to provide whatever protection is required to meet the system's availability and reliability requirements.
