Applying Bayesian belief networks to fault tree analysis of safety critical software



When implementing a safety- or mission-critical application, one must demonstrate that the application meets its availability and reliability requirements; that is, how often it returns a response and how often the response is correct. Typically, this argument is supported by a combination of hard and soft evidence, and of evidence argued from cause to effect and from effect to cause.

A full safety case contains evidence from many sources, including a failure analysis, commonly expressed either as a Failure Mode, Effects, and Criticality Analysis (FMECA) or as a Fault Tree Analysis (FTA).

FMECA is an inductive analysis of system failure, starting with the presumed failure of a component and analyzing its effect on system stability: “What will happen if valve A sticks open?” In contrast, FTA is a deductive analysis, starting with potential or actual failures and deducing what might have caused them: “What could cause a deadlock in the application?”

Either technique can benefit greatly from the use of a Bayesian Belief Network, which provides a framework for the analysis and reduces the time to reach a quantified result.

The Fault Tree
Fault trees encapsulate the concept that the failure of a (sub)system can result from the failure of lower-level (sub)systems. For instance:

— X fails if both Y and Z fail (Y and Z may be identical units, either of which can carry the system on its own if required).
— X fails if either Y or Z (or both) fails (Y and Z may be units that act serially, and the failure of either breaks the chain).
— X fails if any two of Y, Z, and T fail.
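
The three relationships above correspond to AND, OR, and k-of-n (voting) gates. As a minimal sketch, with function names that are illustrative and not taken from any particular fault tree tool, they could be written as:

```python
def and_gate(*inputs: bool) -> bool:
    """Output fails only if every input has failed (redundant units)."""
    return all(inputs)


def or_gate(*inputs: bool) -> bool:
    """Output fails if at least one input has failed (serial units)."""
    return any(inputs)


def k_of_n_gate(k: int, *inputs: bool) -> bool:
    """Output fails if at least k of the inputs have failed."""
    return sum(inputs) >= k


# X fails if any two of Y, Z, and T fail; here Y and T have failed.
x_failed = k_of_n_gate(2, True, False, True)
```

Each argument is True when the corresponding lower-level unit has failed.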

Figure 1. A very simple fault tree. Numbers represent failures and letters identify the simplest system events.

The example in Figure 1 above illustrates the basic concepts of a fault tree:

— The system will fail if both failure 3 and failure 2 occur.
— Failure 2 will occur if either event C or event D (or both) occurs.
— Failure 3 will occur if either failure 1 or event E (or both) occurs.
— Failure 1 will occur if both event A and event B occur.

Generally, the events A to E represent types of failure of system components. Once the tree is drawn, the minimal cut sets can be identified. A cut set is a set of lowest-level events that, if they all occur, cause the system to fail. In Figure 1, { A, B, C, D, E } is a cut set because, if these events all happened, the system would fail.

A minimal cut set is a cut set that cannot be reduced. For example, in Figure 1 the minimal cut sets are { E, C }, { E, D }, { A, B, C }, { A, B, D }. The events listed in any one of these sets will together cause a system failure. Figure 1 is, of course, very simple and, for realistic trees, computer programs are needed to identify minimal cut sets.
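
For a tree as small as Figure 1, the minimal cut sets can be found by brute force. The sketch below is an illustration only: it encodes the four rules listed for Figure 1, tries every combination of events in order of increasing size, and discards any combination that contains a cut set already found. Realistic trees need more efficient algorithms than this exhaustive search.

```python
from itertools import combinations

EVENTS = "ABCDE"


def system_fails(occurred: frozenset) -> bool:
    """Evaluate the Figure 1 tree for a given set of occurred events."""
    a, b, c, d, e = (name in occurred for name in EVENTS)
    failure_1 = a and b          # failure 1: A AND B
    failure_2 = c or d           # failure 2: C OR D
    failure_3 = failure_1 or e   # failure 3: failure 1 OR E
    return failure_3 and failure_2


def minimal_cut_sets() -> list:
    """Return all cut sets that contain no smaller cut set."""
    found = []
    for size in range(1, len(EVENTS) + 1):
        for combo in combinations(EVENTS, size):
            candidate = frozenset(combo)
            # A superset of a known cut set cannot be minimal.
            if any(smaller <= candidate for smaller in found):
                continue
            if system_fails(candidate):
                found.append(candidate)
    return found


for cut_set in minimal_cut_sets():
    print(sorted(cut_set))
```

The search relies on the tree containing only AND and OR gates, so that adding an event can never repair the system.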

Knowledge of the minimal cut sets allows the analyst to provide whatever protection is required to meet the system's availability and reliability requirements.

The Bayesian Belief Network
In theory, rates and distributions can be associated with each lowest-level event in a fault tree, and a computer program can then consolidate this information into a failure rate and failure distribution for the entire system. However, in practice, quantifying the failures of the lowest-level components of an entire system, particularly a software system, is difficult or, in many cases, impossible.

One particular issue is that, often, better failure information is available for a subsystem than for some or any of its components. For instance, in Figure 1, statistics may be available for failure 3 and event E, but not for events A and B. In this case, the tree could be transformed to make failure 3 into a component without substructure.

However, this change would lose information and make the reuse of items in different subtrees impossible, thereby placing significant constraints on how the tree can be created.

Additionally, information may be available at a subsystem level (e.g. failure 3) and for some of the subsystem’s component events (e.g. event B). This scenario is difficult to handle with traditional techniques.

Bayesian Belief Networks (BBNs) handle such complexity in failure information by accepting evidence for the failure rate of any node and by using Bayes' theorem to calculate the a posteriori probabilities of the failure rates of the sub-elements: reasoning from effect to cause.

Figure 2. A BBN for the fault tree in Figure 1. Without additional information, all components are assumed to have a 50/50 chance of having failed.

Using a graphical tool for entering and analyzing BBNs, it is easy to translate the structure of Figure 1 into a BBN, as illustrated in Figure 2 above. This figure shows the BBN before any values for failure rates have been entered.

In the figure, True indicates that the component or (sub)system has failed within the first N hours of operation (where N is a value appropriate for the system under consideration: e.g. 109 hours).

Since we have provided the tool with no better values for the low-level events, we can see from Figure 2 that the tool has assumed a 50% probability that each event has happened. The tool then combines these values, working up the tree, to produce a probability of failure for the whole system.
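
The forward calculation can be sketched in a few lines. This is an illustration under stated assumptions, not a description of how any particular BBN tool works internally: the basic events are taken to be statistically independent, and the gate structure is the one listed for Figure 1.

```python
from math import prod


def p_and(*probabilities: float) -> float:
    """Probability that all independent inputs have failed."""
    return prod(probabilities)


def p_or(*probabilities: float) -> float:
    """Probability that at least one independent input has failed."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def system_failure_probability(p: dict) -> float:
    """Combine basic-event probabilities up the Figure 1 tree."""
    failure_1 = p_and(p["A"], p["B"])
    failure_2 = p_or(p["C"], p["D"])
    failure_3 = p_or(failure_1, p["E"])
    # Failures 2 and 3 share no basic events, so they are independent.
    return p_and(failure_3, failure_2)


uninformed = {name: 0.5 for name in "ABCDE"}
print(system_failure_probability(uninformed))  # 0.46875
```

With every basic event at 50%, this gives 0.25 for failure 1, 0.75 for failure 2, 0.625 for failure 3, and 0.46875 for the system.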

Assuming that we have “soft” values for the failure probability of subsystem 3, we can add those values to the model. As shown in Figure 3 below, the tool then not only recalculates the value for the entire system but also applies Bayes' theorem to calculate new values for subsystem 1 and events A, B, and E. Note that the tool still has no better values for events C and D than 50%.
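
To make the backward step concrete, the sketch below enumerates every combination of the five basic events and reweights them so that subsystem 3 has a chosen failure probability. Two things here are assumptions rather than anything stated about the tool: the soft evidence is applied with Jeffrey's rule (other tools may treat soft evidence as a likelihood instead), and the value 0.1 is purely hypothetical.

```python
from itertools import product

EVENTS = "ABCDE"
PRIOR = {name: 0.5 for name in EVENTS}  # the uninformed 50% priors


def evaluate_tree(state: dict) -> dict:
    """Add the derived nodes of the Figure 1 tree to a basic-event state."""
    f1 = state["A"] and state["B"]
    f2 = state["C"] or state["D"]
    f3 = f1 or state["E"]
    return {**state, "F1": f1, "F2": f2, "F3": f3, "System": f3 and f2}


def joint():
    """Yield (node values, prior probability) for each of the 32 states."""
    for values in product([True, False], repeat=len(EVENTS)):
        state = dict(zip(EVENTS, values))
        weight = 1.0
        for name, occurred in state.items():
            weight *= PRIOR[name] if occurred else 1.0 - PRIOR[name]
        yield evaluate_tree(state), weight


def update_with_soft_evidence(node: str, q_true: float) -> dict:
    """Jeffrey's rule: rescale states so that P(node = True) becomes q_true.

    Assumes the prior probability of the node is strictly between 0 and 1.
    """
    prior_true = sum(w for s, w in joint() if s[node])
    posterior = {}
    for s, w in joint():
        if s[node]:
            scale = q_true / prior_true
        else:
            scale = (1.0 - q_true) / (1.0 - prior_true)
        for name, failed in s.items():
            posterior[name] = posterior.get(name, 0.0) + (w * scale if failed else 0.0)
    return posterior


posterior = update_with_soft_evidence("F3", 0.1)
for name in ("A", "B", "E", "F1", "C", "D", "System"):
    print(name, round(posterior[name], 4))
```

Under these assumptions, the probabilities of A, B, E, and failure 1 all move in response to the evidence on failure 3, while C and D stay at 0.5 because they share no events with that subtree.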

Figure 3. Soft evidence has been entered for the failure probability of Subsystem 3, modifying the failure probability not only of the system but also of the subsystems and components.

Application to an RTOS Microkernel
The example in Figure 2 and Figure 3, while illustrative, is much smaller than the fault tree for real-life software components. Take, for example, an RTOS microkernel, which forms the software foundation of a safety-critical system. Creating a fault tree for a microkernel consists of several steps:

1) Identifying the ways in which the kernel can fail.
2) Using documented product history to create a fault tree of conditions that, in practice, contribute to each failure type.
3) Expressing the fault tree as a BBN so that one can incorporate soft evidence about field failure rates and calculate the resulting posterior probabilities.
4) Combining reports of field failures with field usage figures to estimate the failure rates to be used in the fault tree.
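
Step 4 can be illustrated with a simple point estimate. The sketch below assumes a constant failure rate (that is, exponentially distributed times to failure), which is a simplification and not something the method requires; the failure count and usage hours are invented for illustration and are not real field data.

```python
import math


def failure_rate_per_hour(failures: int, field_hours: float) -> float:
    """Point estimate of the failure rate from field reports and usage."""
    return failures / field_hours


def probability_of_failure_within(rate: float, hours: float) -> float:
    """P(at least one failure in `hours`) for a constant failure rate.

    Computes 1 - exp(-rate * hours); expm1 keeps precision for tiny rates.
    """
    return -math.expm1(-rate * hours)


# Hypothetical figures: 3 reported failures over 2e8 hours of field use.
rate = failure_rate_per_hour(3, 2.0e8)
print(rate)
print(probability_of_failure_within(rate, 10_000))
```

A point estimate of this kind carries no confidence bounds; with few reported failures, the uncertainty in the rate can be large and is itself a candidate for soft evidence in the BBN.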

Once the model is complete, one can carry out a sensitivity analysis to find the events to which the final result is most sensitive (i.e. which values, if changed slightly, would lead to a significantly different outcome). One can then refine these values and repeat the calculation.
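
One simple form of sensitivity analysis is to nudge each basic-event probability in turn and see how far the system result moves. The sketch below does this for the small tree of Figure 1, assuming independent basic events; the one-at-a-time perturbation and the step size are illustrative choices, and BBN tools typically offer their own sensitivity reports.

```python
from math import prod


def system_failure_probability(p: dict) -> float:
    """Forward calculation for the Figure 1 tree, independent events."""
    failure_1 = p["A"] * p["B"]
    failure_2 = 1.0 - prod(1.0 - p[name] for name in "CD")
    failure_3 = 1.0 - (1.0 - failure_1) * (1.0 - p["E"])
    return failure_3 * failure_2


def sensitivities(p: dict, delta: float = 0.01) -> dict:
    """Change in system probability per unit change in each event."""
    base = system_failure_probability(p)
    result = {}
    for name in p:
        bumped_value = min(1.0, p[name] + delta)
        step = bumped_value - p[name]
        if step == 0.0:
            continue  # already at 1.0, nothing to perturb
        bumped = {**p, name: bumped_value}
        result[name] = (system_failure_probability(bumped) - base) / step
    return result


ranking = sensitivities({name: 0.5 for name in "ABCDE"})
for name, slope in sorted(ranking.items(), key=lambda item: -item[1]):
    print(name, round(slope, 4))
```

With all events at 50%, event E has the largest effect on the result, so its value would be the first to refine.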

The result is an estimate, based on justifiable assumptions, of the level of availability and, if correctness information is also available, reliability of the microkernel. This level could then be compared, for example, with the requirements of Safety Integrity Levels 1, 2, 3, and 4 of IEC 61508.
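
As an illustration of such a comparison, the sketch below maps an estimated rate of dangerous failures per hour onto the numerical target bands that IEC 61508 gives for high-demand or continuous-mode operation. The function name is hypothetical, and meeting a numerical target is only one part of what the standard requires for a SIL claim.

```python
# Upper bounds of the IEC 61508 target bands for high-demand or
# continuous mode, in dangerous failures per hour, strictest first.
SIL_UPPER_BOUNDS = [(4, 1e-8), (3, 1e-7), (2, 1e-6), (1, 1e-5)]


def highest_sil_target_met(dangerous_failures_per_hour: float):
    """Return the highest SIL whose numerical target the rate satisfies,
    or None if the rate is too high even for SIL 1."""
    for sil, upper_bound in SIL_UPPER_BOUNDS:
        if dangerous_failures_per_hour < upper_bound:
            return sil
    return None


print(highest_sil_target_met(1.5e-8))  # 3
```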

Statistical Analysis
The methodology described above rests on the applicability of handling software failures statistically. Some have argued that this treatment is inappropriate because, unlike hardware, software doesn’t “wear out”: all failures are design failures, and the software life cycle doesn’t therefore follow the conventional “bathtub curve”.

Nonetheless, software failure rates do in fact follow the bathtub curve. Software typically has a high failure rate when first released because unanticipated usage patterns uncover latent faults. Once these failures have been fixed, the software settles down until changes to it (patches, enhancements, etc.) and to its environment (OS changes, faster processors, etc.) cause the failure rate to start rising again.

As for criticisms of the theoretical underpinnings of the statistical model, one can argue, as have Littlewood et al. (2001), that the random nature of demands provides a genuine statistical element to software failures.

That is, if the complete, multi-dimensional input space (or “demand space”) were known, then the sequence of program invocations could be visualized as “walks” through that multi-dimensional space. Some walks lead to “Heisenbugs” (hard-to-reproduce bugs that occur because of complex interactions) that invoke errors and thereby create failures.

A different trajectory could lead to the very same point in the input space and, because of the nature of Heisenbugs, not cause a failure. The walks that applications make through the input space are unknown and, because the space is so large and multi-dimensional, are to all intents and purposes random. The sequence of failures is random and can therefore be analyzed statistically.

Fault tree analysis is especially applicable to a mature product where field usage figures and problem reports exist. Using a Bayesian Belief Network to express the fault tree allows both hard and soft evidence to be incorporated into the product analysis in a quantifiable way. One can then incorporate the results of this analysis into a larger model that expresses a full, quantified safety case for the product.

Chris Hobbs is an OS kernel developer at QNX Software Systems, where he has particular responsibility for the use of the microkernel in safety-critical systems. He has worked with many open source projects, both professionally and personally. After graduating with a degree in mathematics, focusing on mathematical philosophy, Chris is pursuing a career in software development (the open positions for mathematical philosophers being limited) where he has worked intensively in the area of sufficiently-available software: software that *just* meets its availability requirements. He is a member of the Canadian Advisory Committee for IEC SC65A and 65C, the IEC subcommittee responsible for IEC 61508, the standard for the use of software in safety-critical systems. He has published several technical books including one on the use of WBEM/CIM for device and service management.
