Build Safety-Critical Designs with UML-based Fault Tree Analysis - The basics

Bruce Powel Douglass, IBM/Rational

April 27, 2009

Faults and Failures.
A safety fault is a nonconformance of a system that leads to a hazard. Faults come in two flavors: failure states and errors. A failure is an event that occurs when a component no longer functions properly, and leads to a failed state.

A soft failure is a temporary failure that may be corrected (or correct itself) without replacing the failed component. A hard failure is one in which the component must be replaced to repair the defect.

Failures are distinct from errors. An error is a design or implementation defect. Failures are events that occur at some point in time while errors are omnipresent conditions or states. Errors may not always be apparent; when they become apparent, they are said to manifest.

Mechanical or electronic hardware may have both failures and errors, while software can only have errors. In addition, many (but by no means all) systems have a condition that is known to be always safe; this is called the fail-safe state. In many systems, the fail-safe state is the one in which the device is turned off or power is removed. For example, the fail-safe state for a microwave oven is off. Many other systems have no such fail-safe state.

Faults may be tolerated for a period of time before they lead to an accident. For example, a patient ventilator failure can be tolerated for about five minutes before death occurs. Overpressure can be tolerated for about 250 ms before it causes irreversible lung damage.

A failure in the control of aircraft ailerons and elevators in many modern aircraft must be corrected within 50 ms or less to maintain stability. The period of time the system can tolerate a fault is called the fault tolerance time.

To ensure safety, the system must both detect and handle the fault before the fault tolerance time has elapsed. Also, note that the mean time between failures (MTBF) of the component must be (much) longer than the fault tolerance time. Figure 1 below shows the relevant times related to the handling of the fault.
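The timing rule above can be written down as a small check. The following is only a sketch: the `FaultTiming` class and its field names are hypothetical, it assumes detection and handling happen one after the other, and apart from the 250 ms overpressure tolerance quoted earlier, every number is an illustrative placeholder rather than a figure from a real device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultTiming:
    """Timing figures for one fault, in milliseconds (hypothetical model)."""
    detection_ms: float   # worst-case time to detect the fault
    handling_ms: float    # worst-case time to handle it once detected
    tolerance_ms: float   # fault tolerance time
    mtbf_ms: float        # mean time between failures of the component


def meets_tolerance(t: FaultTiming) -> bool:
    """True if detection plus handling finishes inside the fault tolerance time.

    Assumes handling starts only after detection completes.
    """
    return t.detection_ms + t.handling_ms < t.tolerance_ms


def mtbf_ratio(t: FaultTiming) -> float:
    """How many times longer the MTBF is than the fault tolerance time."""
    return t.mtbf_ms / t.tolerance_ms


# 250 ms is the overpressure tolerance from the text; the rest is made up.
overpressure = FaultTiming(
    detection_ms=50.0,
    handling_ms=100.0,
    tolerance_ms=250.0,
    mtbf_ms=1000.0 * 3600.0 * 1000.0,  # 1000 hours, illustrative only
)

too_slow = FaultTiming(
    detection_ms=200.0,
    handling_ms=100.0,
    tolerance_ms=250.0,
    mtbf_ms=1000.0 * 3600.0 * 1000.0,
)
```

With these placeholder values, `meets_tolerance(overpressure)` holds because 150 ms is inside the 250 ms window, while `too_slow` fails because 300 ms is not. How large `mtbf_ratio` needs to be is a design judgment that this sketch does not settle.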

Figure 1: Fault Timeline

These timeframes have ramifications on the kinds of safety detection and correction measures to be applied. If the detection is to be done with periodic or continuous background testing, then the time to complete the test (including the time to perform the normal device operation during that time) is called the fault detection time.

In many systems, there simply isn't enough processor bandwidth to run the tests in software alongside normal system execution and still detect faults in a timely fashion. When this is true, other means must be added to detect the fault.
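A rough way to see the bandwidth problem is to stretch the test's CPU cost by the share of the processor it is allowed to use. This is a simplified model introduced here, not a formula from the article: the function name, the constant CPU share, and the example numbers are all assumptions.

```python
def fault_detection_time_ms(test_cpu_ms: float, cpu_share_for_test: float) -> float:
    """Estimate the elapsed time for one full pass of a background test.

    test_cpu_ms        -- CPU time the test needs if it ran alone
    cpu_share_for_test -- fraction of the processor left for the test (0..1]

    Assumes the test receives a constant share of the processor while
    normal device operation uses the remainder.
    """
    if not 0.0 < cpu_share_for_test <= 1.0:
        raise ValueError("cpu_share_for_test must be in the range (0, 1]")
    return test_cpu_ms / cpu_share_for_test


# Illustrative numbers: a test needing 400 ms of CPU, given 25% of the processor.
elapsed_ms = fault_detection_time_ms(400.0, 0.25)   # 1600.0 ms
fits_in_250_ms = elapsed_ms < 250.0                  # False
```

Under these assumed numbers the test takes 1600 ms of elapsed time, well beyond a 250 ms fault tolerance time, which is the situation in which some other detection mechanism is needed.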

For example, a periodic RAM test, such as the Abraham walking bit test, can detect various kinds of hard memory failures. However, in a system with several megabytes of memory and a short fault tolerance time, the detection of a safety-relevant fault cannot be guaranteed to occur within the fault tolerance time. A possible solution is to add mirrored memory with built-in parity checking, which eliminates the need for a periodic RAM test.
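To make the idea of a periodic RAM test concrete, here is a minimal walking-ones check run against a simulated memory. It is a generic sketch and not the Abraham algorithm named above; the `SimulatedRam` class, its stuck-at-zero fault injection, and the function name are all hypothetical. It covers only stuck data bits within a cell and does not look for address-line or coupling faults.

```python
class SimulatedRam:
    """Byte-addressable memory with an optional stuck-at-zero bit (test fixture)."""

    def __init__(self, size, stuck_addr=None, stuck_bit=0):
        self._cells = bytearray(size)
        self._stuck_addr = stuck_addr
        self._stuck_mask = 1 << stuck_bit

    def __len__(self):
        return len(self._cells)

    def write(self, addr, value):
        value &= 0xFF
        if addr == self._stuck_addr:
            # Simulate a hard failure: this bit can never hold a 1.
            value &= ~self._stuck_mask & 0xFF
        self._cells[addr] = value

    def read(self, addr):
        return self._cells[addr]


def walking_ones_test(ram):
    """Return the first failing address, or None if every cell passes.

    Each cell is written with a single 1 bit in each of the 8 positions
    and read back. The original contents are restored for passing cells.
    """
    for addr in range(len(ram)):
        saved = ram.read(addr)
        for bit in range(8):
            pattern = 1 << bit
            ram.write(addr, pattern)
            if ram.read(addr) != pattern:
                return addr
        ram.write(addr, saved)
    return None


healthy = SimulatedRam(64)
faulty = SimulatedRam(64, stuck_addr=17, stuck_bit=3)
```

Here `walking_ones_test(healthy)` finds nothing, while `walking_ones_test(faulty)` reports address 17. Even this simple version does eight write and read pairs per byte, so its run time grows in proportion to memory size, which is why a test of this kind may not finish inside a short fault tolerance time on a large memory.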
