Build Safety-Critical Designs with UML-based Fault Tree Analysis - The basics

The Unified Modeling Language (UML) has been used successfully in many real-time and embedded domains, including aerospace, military, and medical markets. Many of the systems within these markets are used in safety-critical contexts.

So far, disparate tools and environments have been used for capturing requirements, creating designs, and analyzing system safety. However, UML is an extremely powerful, extensible language. To this end, I have created a UML profile that supports capturing requirements, creating designs, and analyzing system safety, all within the same UML tool environment.

This series of three articles will discuss the use of Fault Tree Analysis (FTA) for safety analysis in embedded systems design and the use of the UML profiling mechanism to create a safety analysis profile, including the definition of its normative metamodel.

This profile enables developers and analysts to capture safety-related requirements, perform FTA and other safety analyses, create designs that meet those safety concerns, and provide reports showing the relations between the safety analysis, requirements, and design model elements.

What is Safety?
The paucity of material on safety-critical systems has led to widespread misunderstanding of the various terms used to discuss safety. The most basic term is safety. Safety is defined to be freedom from accidents or losses. An accident is an event in time in which an undesirable consequence occurs, such as death, injury, equipment damage, or financial loss.

A safety-critical system is a system, which may contain electronic, mechanical, and software aspects, that presents an opportunity for accidents to occur. For many people, safety-critical systems are only those that present the opportunity for injury or loss of life, but this omits from consideration other systems which might benefit from the techniques and approaches common in safety analysis. Therefore, I prefer to designate a safety-critical system to be any system in which the cost of an accident arising from its use is potentially high.

A hazard is a system state that, when combined with other environmental conditions, inevitably leads to an accident [1]. Hazards are normally classified as to severity. For example, there is a hazard of being shocked when jumping the 12-volt battery in your car, but this is a less severe hazard than slamming into a mountainside at 600 knots while riding in a commercial aircraft. Different standards use different categories for hazard severity.

For example, the FDA [2] uses major (irreversible injury or death), moderate (injury), and minor (no injury) levels of concern for device safety. The German standard DIN 19250 identifies 8 categories, along with required safety measures for each category, while the more recent IEC 61508 [3] defines 4 safety integrity levels (SIL 1 through SIL 4) and, in its example risk classification, 4 consequence categories (catastrophic, critical, marginal, and negligible), although the text notes that the severity of system-presented hazards is actually a continuum.

The risk of a hazard is defined to be the product of the probability of the occurrence of the hazard and its severity:

$$\text{risk}_{\text{hazard}} = \text{probability}_{\text{hazard}} \times \text{severity}_{\text{hazard}}$$

The probability of being shocked by your car battery is relatively high, but when combined with the low severity, the overall risk is low. Similarly, while the consequences of an abrupt release of the kinetic energy of a commercial aircraft are quite severe, its probability is low, again resulting in a low risk. The various standards also identify different risk levels based on both the severity of the hazard and its likelihood of occurrence.
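The risk product can be illustrated with a minimal Python sketch. The probability values and the 1-to-10 severity scale below are invented purely for illustration; real projects would take both from the applicable standard and from field data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    name: str
    probability: float  # assumed likelihood of occurrence, 0.0 to 1.0
    severity: float     # assumed severity score on an arbitrary 1-to-10 scale

    @property
    def risk(self) -> float:
        # Risk is the product of probability and severity.
        return self.probability * self.severity


# Hypothetical numbers: a likely but mild hazard and an unlikely but severe one.
hazards = [
    Hazard("battery shock", probability=0.01, severity=1.0),
    Hazard("aircraft impact", probability=1e-7, severity=10.0),
]

# Rank hazards so that the highest risk is addressed first.
ranked = sorted(hazards, key=lambda h: h.risk, reverse=True)
for h in ranked:
    print(f"{h.name}: risk = {h.risk:.2e}")
```

With these assumed figures both hazards come out as low risk, for opposite reasons: one is mild, the other is rare.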

In the process of system design, hazards must be identified and safety measures must be put in place to reduce the risk.

Faults and Failures
A safety fault is a nonconformance of a system that leads to a hazard. Faults come in two flavors: failure states and errors. A failure is an event that occurs when a component no longer functions properly, and leads to a failed state.

A soft failure is a temporary failure that may be corrected (or correct itself) without replacing the failed component. A hard failure is one in which the component must be replaced to repair the defect.

Failures are distinct from errors. An error is a design or implementation defect. Failures are events that occur at some point in time, while errors are omnipresent conditions or states. Errors may not always be apparent; when they become apparent, they are said to manifest.

Mechanical or electronic hardware may have both failures and errors, while software can only have errors. In addition, many (but by no means all) systems have a condition that is known to be always safe; this is called the fail-safe state. In many systems, this is the state in which the device is turned off or power is removed. For example, the fail-safe state for a microwave oven is off. Many systems do not have such a fail-safe state.

Faults may be tolerated for a period of time before they lead to an accident. For example, a patient ventilator failure can be tolerated for about five minutes before death occurs. Overpressure can be tolerated for about 250 ms before it causes irreversible lung damage.

A failure in the control of aircraft ailerons and elevators in many modern aircraft must be corrected within 50 ms or less to maintain stability. The period of time the system can tolerate a fault is called the fault tolerance time.

To ensure safety, the system must both detect and handle the fault before the fault tolerance time has elapsed. Also, note that the mean time between failures (MTBF) of the component must be (much) longer than the fault tolerance time. Figure 1 below shows the relevant times related to the handling of the fault.

Figure 1: Fault Timeline

These timeframes have ramifications on the kinds of safety detection and correction measures to be applied. If the detection is to be done with periodic or continuous background testing, then the time to complete the test (including the time to perform the normal device operation during that time) is called the fault detection time.
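The timing requirement, that detection plus handling must complete before the fault tolerance time elapses, can be sketched as a simple check. The function names and the detection and handling times below are hypothetical; only the 250 ms overpressure tolerance comes from the discussion above.

```python
def handled_in_time(detection_ms: float, handling_ms: float, tolerance_ms: float) -> bool:
    """Return True if the fault is detected and handled before the tolerance time elapses."""
    return detection_ms + handling_ms < tolerance_ms


def timing_margin_ms(detection_ms: float, handling_ms: float, tolerance_ms: float) -> float:
    """Return the slack left after detection and handling (negative means too slow)."""
    return tolerance_ms - (detection_ms + handling_ms)


# Fault tolerance time for ventilator overpressure, as quoted in the text.
OVERPRESSURE_TOLERANCE_MS = 250.0

# Hypothetical design figures: a 100 ms background test cycle and 50 ms to vent pressure.
print(handled_in_time(100.0, 50.0, OVERPRESSURE_TOLERANCE_MS))
print(timing_margin_ms(100.0, 50.0, OVERPRESSURE_TOLERANCE_MS))
```

A check like this is only a budget calculation; the detection time fed into it must be a worst-case figure that includes normal device operation running alongside the test.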

In many systems, there simply isn't enough processor bandwidth to complete the tests in software, in addition to the normal system execution, to detect the faults in a timely fashion. When this is true, other means must be added to detect the fault.

For example, a periodic RAM test, such as the Abraham walking bit test, can detect various kinds of hard memory failures. However, in a system with several megabytes of memory and a short fault tolerance time, the detection of a safety-relevant fault cannot be guaranteed to occur within the fault tolerance time. A possible solution is to add mirrored memory with built-in parity checking, which eliminates the need for a periodic RAM test.

Reliability and Safety
Reliability and safety are mostly independent concerns. Reliability refers to the probability that a system or component will meet its functional and quality of service (e.g., timeliness) requirements within a specified timeframe. While this sounds similar to our previous definition of safety, the two concepts are importantly different.

A safe system is one that does not lead to accidents. It may fail all the time and still be safe. A reliable system may fail infrequently, but if it fails with catastrophic consequences when it does, such a system is not safe.

A handgun, for example, is a very reliable piece of equipment, but can easily lead to accidents even in the absence of a system fault. On the other hand, my old Plymouth station wagon refuses to turn on at all; it is therefore very safe even though it is unreliable.

In general, reliability is a separate concern from safety, and it is important to maintain the distinction. For the most part, in systems that have a fail-safe state, reliability is an opposing concern to safety.

Reliability is improved when the system continues to provide services, even if it creates a hazardous situation. If the system is creating a hazardous situation, and there is a fail-safe state, then entering the fail-safe state improves system safety but decreases system reliability.

Consider a medical treatment laser. If a memory cell in the controller seems faulty, the safest thing the system can do is to shut down with the laser de-energized (its fail-safe state), even if it is relatively unlikely that the detected fault could lead to a hazard. This decreases the system reliability. In such systems, a pessimistic policy is likely to be safer than an optimistic policy.

Many systems don't have a fail-safe state. If you're flying at 600 knots and 35,000 feet, it is not safe to shut off the jet engine if it is suspected of having a fault.

Similarly, in a drive-by-wire car, the last thing I want to see is an “Abort, Retry, Ignore” message appearing on my dashboard when I'm driving down the freeway at 85 (ah, excuse me, 55) mph. In such systems, increasing reliability (such as by adding redundant delivery channels) also improves the system safety.

Figure 2: Safety vs. Reliability

Types of safety measures
There are several different kinds of things one can do about faults:

Obviation. The hazard can be removed by making it physically impossible. For example, the use of mechanically incompatible fasteners can remove the hazard of connecting a patient oxygen intake to a nitrogen source.

Education. The hazard can be handled by educating the users so that they won't create hazardous conditions through equipment misuse. This is a relatively weak safety measure that depends on the sophistication of the user and may not be appropriate in many circumstances.

Alarming. The hazard can be handled by announcing it to the user when it appears so that they can take appropriate action. This approach requires a fault tolerance time that can take into account the reaction time of monitoring personnel. For example, an ECG monitor can notify an attending physician of an asystole condition.

Interlocks. The hazard can be removed by using secondary devices and/or logic to intercede when a hazard presents itself. For example, a medical treatment laser might automatically disconnect power to the laser when its cover is off.

Transitioning to a fail-safe state. The hazard can be handled by ensuring that a system can detect faults prior to an accident and enter a state which is known to be safe. For example, a cruise control system can shut off, returning to manual control, when a fault is detected.

Switch to a redundant channel. The hazard can be handled by engaging another actuation channel to perform the system action correctly. This approach is generally preferred when the system has no fail-safe state.

Use of additional safety equipment. For example, the use of a drill press may require a light curtain to ensure the user doesn't place his or her limbs in harm's way.

Restricting access. The hazard can be handled by using passwords to prevent users from inadvertently invoking “service mode,” in which safety checks are turned off.

Labeling. The hazard can be handled by labeling, e.g., “High Voltage - DO NOT TOUCH.”

Each of these different approaches may be appropriate in different circumstances. Obviation is usually safest, but not always achievable. Going to a fail-safe state requires both a means for detecting a fault and the presence of a system condition which is both known and achievable.
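As a rough illustration of how two of these measures relate, the following Python sketch picks a response to a detected fault. The `Response` enumeration, the function name, and the ordering of the choices are assumptions made for this example: they follow the remark above that a redundant channel is generally preferred when no fail-safe state exists, and fall back to alarming when neither is available.

```python
from enum import Enum, auto


class Response(Enum):
    CONTINUE = auto()         # no fault detected, keep operating
    ENTER_FAIL_SAFE = auto()  # transition to the known safe state
    SWITCH_CHANNEL = auto()   # engage a redundant actuation channel
    ALARM = auto()            # announce the hazard to the user


def choose_response(fault_detected: bool,
                    has_fail_safe_state: bool,
                    has_redundant_channel: bool) -> Response:
    """Pick a fault response under the assumed policy described in the lead-in."""
    if not fault_detected:
        return Response.CONTINUE
    if has_fail_safe_state:
        return Response.ENTER_FAIL_SAFE
    if has_redundant_channel:
        return Response.SWITCH_CHANNEL
    return Response.ALARM


# A cruise control has a fail-safe state (off, manual control).
print(choose_response(True, has_fail_safe_state=True, has_redundant_channel=False))
# A jet engine in flight has none, so a redundant channel is engaged instead.
print(choose_response(True, has_fail_safe_state=False, has_redundant_channel=True))
```

A real system would usually combine several measures rather than select exactly one, so this is a sketch of the trade-off and not a recommended architecture.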

How Can the UML Help?
UML is a modeling language that is commonly applied to both software and systems development. It provides a semantic basis of fundamental concepts and views (diagrams) that depict the interaction of elements of interest. UML can aid the development of safety-critical systems in a number of ways:

* By providing design clarity
* By modeling architectural redundancy
* By modeling low-level redundancy
* By creating safety-relevant views of the requirements and design
* By aiding in safety analysis

First, UML can provide design clarity by exposing the design of the system in class diagrams (AKA internal block diagrams in SysML) and by explicitly showing the traceability to requirements. If all you have is source code, then it can be extremely difficult to identify the redundant safety measures, traceability to requirements, and other safety-relevant aspects of the design. (SysML is a profile, or specialized version, of UML used in systems engineering.)

The fundamental building block of a UML model is the class (Block in SysML). It contains features such as data (“attributes”), services (“operations”), logic (“state machines”), algorithms (“activity diagrams”), quality of service aspects (“constraints”), interactions (“sequence diagrams”), and connection points (“ports”).

When a class has safety relevance, it is possible to add low-level redundancy, such as using CRC on the class attributes, data replication, precondition and postcondition checking, etc., to ensure that safety-relevant faults are identified and handled appropriately.
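A minimal Python sketch of such low-level redundancy follows. The class `ProtectedSetpoint`, its range limits, and the choice of a CRC-32 plus a bit-inverted copy are all assumptions made for illustration; an embedded implementation would more likely be written in C or C++ and would use whatever CRC the project has qualified.

```python
import struct
import zlib


class ProtectedSetpoint:
    """Hypothetical safety-relevant class guarding one integer attribute."""

    def __init__(self, value: int, low: int, high: int) -> None:
        self._low = low
        self._high = high
        self.set(value)

    @staticmethod
    def _crc(value: int) -> int:
        # CRC-32 over the 32-bit little-endian encoding of the value.
        return zlib.crc32(struct.pack("<i", value))

    def set(self, value: int) -> None:
        # Precondition check: reject values outside the allowed range.
        if not (self._low <= value <= self._high):
            raise ValueError(f"setpoint {value} outside [{self._low}, {self._high}]")
        self._value = value
        self._inverse = ~value           # data replication (bit-inverted copy)
        self._check = self._crc(value)   # CRC over the attribute

    def get(self) -> int:
        # Verify both redundant copies before handing out the value.
        if self._inverse != ~self._value or self._check != self._crc(self._value):
            raise RuntimeError("attribute corruption detected")
        return self._value


pressure = ProtectedSetpoint(120, low=0, high=200)
print(pressure.get())
```

Storing the copy in inverted form means a fault that forces both memory locations to the same bit pattern is still detected, which a plain duplicate would miss.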

One of the big benefits that UML provides is the ability to construct views (diagrams) that focus on narrow aspects of the system structure or design. The same elements can be depicted in many different views and the underlying model repository will ensure that all the views are consistent.

The Harmony/ESW (“embedded software”) process [4,5] identifies five key views of architecture, of which the Safety and Reliability View is one. It typically shows the structurally redundant elements and the interactions among them that achieve the safety goals of the system, and can do this at different levels of abstraction.

This allows the engineering and safety staff to understand how faults propagate through the system, how safety measures interrupt that fault propagation, and to perform safety analysis of the designs.

Fault Tree Analysis (FTA), an analytic approach discussed later in this series, is a common technique for analyzing how faults lead to hazards and how to add safety measures to address these concerns.
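To give a feel for what a fault tree computes, here is a small Python sketch that combines basic event probabilities through AND and OR gates. It assumes the basic events are statistically independent, and the tree, event names, and probabilities are invented for illustration; it is not the profile or notation developed later in this series.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List, Union


@dataclass
class BasicEvent:
    name: str
    probability: float  # assumed probability of the fault occurring

    def evaluate(self) -> float:
        return self.probability


@dataclass
class Gate:
    name: str
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]] = field(default_factory=list)

    def evaluate(self) -> float:
        probs = [node.evaluate() for node in self.inputs]
        if self.kind == "AND":
            # All inputs must occur: multiply (independence assumed).
            return prod(probs)
        if self.kind == "OR":
            # At least one input occurs: complement of none occurring.
            return 1.0 - prod(1.0 - p for p in probs)
        raise ValueError(f"unknown gate kind: {self.kind}")


# Hypothetical tree: the hazard needs a stuck valve AND a failed safety measure.
protection_lost = Gate("protection lost", "OR", [
    BasicEvent("pressure sensor fails", 1e-2),
    BasicEvent("alarm fails", 1e-2),
])
top = Gate("overpressure hazard", "AND", [
    BasicEvent("valve stuck", 1e-3),
    protection_lost,
])
print(f"P({top.name}) = {top.evaluate():.3e}")
```

The AND gate at the top is where a safety measure shows its value: the hazard occurs only if the primary fault and the loss of protection coincide.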

While there are a few FTA tools available, it is also possible to create a safety-critical UML profile that permits the capturing of fault metadata for analysis. (As we shall see shortly, a profile is a specialized version of the UML, consistent with the underlying UML semantics, to meet a specialized need. The Systems Modeling Language (SysML) is one such profile.)

The advantage of this is that the requirements, design model, and safety analysis are all co-located and all interconnected. This allows developers to navigate reliably and easily between these three kinds of views.

“Build safety-critical designs with UML-based fault tree analysis” is a 3-part series:
To read Part 1, go to “The basics”
To read Part 2, go to “Defining a profile”
To read Part 3, go to “Anesthesia ventilator evaluation”

Bruce has worked as a software developer in real-time systems for over 25 years and is a well-known speaker, author, and consultant in the area of real-time embedded systems. He is on the Advisory Board of the Embedded Systems Conference where he has taught courses in software estimation and scheduling, project management, object-oriented analysis and design, communications protocols, finite state machines, design patterns, and safety-critical systems design. He develops and teaches courses and consults in real-time object-oriented analysis and design and project management and has done so for many years. He is the chief evangelist for Rational/IBM. Bruce worked with various UML partners on the specification of the UML, both versions 1 and 2. He is a former co-chair of the Object Management Group's Real-Time Analysis and Design Working Group. He is the author of several other books on software, including Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks and Patterns (Addison-Wesley, 1999), Real-Time Design Patterns: Robust Scalable Architecture for Real-Time Systems (Addison-Wesley, 2002), Real-Time UML 3rd Edition: Advances in the UML for Real-Time Systems (Addison-Wesley, 2004), Real-Time UML Workshop for Embedded Systems (Elsevier Press, 2006) and several others, including a short textbook on table tennis. His latest book on employing agile methods to develop real-time and embedded systems, Real-Time Agility, will appear in June, 2009.

[1] Leveson, Nancy. Safeware: System Safety and Computers. Reading, MA: Addison-Wesley, 1995.

[2] Guidance for FDA Reviewers and Industry: Guidance for the Content of Premarket Submissions for Software Contained in Medical Devices. Washington, D.C.: FDA, 1998.

[3] IEC 65A/1508: Functional Safety: Safety-Related Systems, Parts 1-7. IEC, 1995.

[4] Douglass, Bruce Powel. Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks and Patterns. Reading, MA: Addison-Wesley, 1999.

[5] Douglass, Bruce Powel. Real-Time Agility. Reading, MA: Addison-Wesley, 2009.

[6] Douglass, Bruce Powel. Real-Time Design Patterns: Robust Scalable Architecture for Real-Time Systems. Addison-Wesley, 2002.
