Architecture of safety-critical systems
Safety-critical systems are embedded systems that could cause injury or loss of human life if they fail or encounter errors. Flight-control systems, automotive drive-by-wire, nuclear reactor management, or operating room heart/lung bypass machines naturally come to mind. But devices as common as the power windows in your car are also safety-critical, once you imagine a small child reaching out of the car window at a fast food drive-through to get another packet of ketchup and accidentally leaning on the control switch making the window shut on the child's arm, or worse.
Small system defects or situations can cascade into life-threatening failures very quickly, as shown in Figure 1.
Figure 1: Fault-error-failure cascade can lead to life-threatening hazards
Faults are defects or situations that can lead to failures. They may be quite small, such as a frozen memory bit, an uninitialized variable in software, or a cosmic ray ionizing its way through our embedded system. A fault may (or may not) lead to an error. An error is a manifestation of a fault as an unexpected behavior within our system. It might be something like an incorrect result of a calculation or a mistaken value of a state variable. Errors may (or may not) lead to failure. A failure is a situation in which a system (or part of a system) is not performing its intended function. As we see in Figure 1, a low-level failure in some small part of a system can be viewed as a fault at another level, which can lead to errors at that level that can trigger failures that can themselves be viewed as faults at yet a higher level. If these faults are allowed to "avalanche" to system-level failures, they can lead to hazards that have the potential to threaten injury or loss of life.
Safety vs. high availability
Some readers may be thinking "Hey, this is starting to sound an awful lot like high availability." But while there are a number of points of contact between safety-critical system design and high-availability system design, the objectives of the two are quite different and many of the design architectures they use are quite different.
Many high-availability systems do not threaten human life in cases of failure and instead are designed to maximize "uptime" and minimize "downtime." High-availability systems today strive to be up and running 99.999% of the time (the so-called "five nines availability"), equivalent to a total of about five minutes of downtime per year.
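The "five nines" figure is easy to check with a little arithmetic. The following is only a minimal sketch: the function name `downtime_minutes_per_year` is made up for illustration, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year, 525,600 minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability figure."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for availability in (99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(availability)
        print(f"{availability}% -> {minutes:.2f} minutes/year")
```

Under these assumptions, 99.999% availability leaves a budget of roughly 5.26 minutes per year, which matches the "about five minutes" quoted above.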
On the other hand, safety-critical systems don't always strive to maximize uptime. In fact, they may intentionally take themselves down or bring some subsystems down in situations where there is a threat of injury or loss of life. These systems take themselves to a safe state in order to break the fault-error-failure-hazard sequence of Figure 1, before a life-threatening situation is ever reached. For many safety-critical systems, such as medical infusion pumps and cancer irradiation systems, the safe state is to immediately stop and turn the system off.
For other safety-critical systems, no safe state exists. For these systems, stopping is simply not an option. Examples are aircraft jet-engine controllers and medical respiratory ventilators. For yet other safety-critical systems, safe states do exist but they require a complex and lengthy sequence of activities to bring the system to the safe state. An example of this is an automotive brake-by-wire system. While driving on the Interstate or the Autostrada, you don't want your automotive system to suddenly announce "Memory Parity Error: Brakes Not Available." A much safer design alternative would be "graceful degradation" like "Bring to Garage Now: Only 3-Wheel Braking Available." Another example is the car's power window, which should open completely whenever it detects an obstacle as it closes. These systems must not stop or turn themselves off when a hazard is detected; their embedded services must instead continue to be available while failures and hazards are present.
Figure 2 illustrates the relationship between safety-critical and high-availability systems with regard to hazards and safe states.
Figure 2: Safety-critical vs. high-availability systems
In this Venn diagram, safety-critical medical infusion pumps and cancer irradiation systems would fall into the leftmost section, while jet engine controllers, respiratory ventilators, brake-by-wire systems, and power windows fall into the center section where safety-critical and high-availability systems overlap. The rightmost section of the diagram is for high-availability systems that are not safety-related, such as online banking, stock exchanges, and business-critical websites. Many communication systems fall into this rightmost section, apart from features such as emergency response ("911" in the USA) that fall into the center section of the Venn diagram.
I addressed the rightmost section of Figure 2 in my article "Design Patterns for High Availability" (Embedded Systems Programming, August 2002, p. 24). Many of the design patterns discussed there are also useful building blocks for the safety-critical systems that fall into the center section of Figure 2. In this article, I'll continue to focus on the left and center sections of the diagram, covering safety-critical systems.
As with any embedded system, design is preceded by a system requirements definition, covering physical and functional specification. For safety-critical systems, a thorough hazard analysis and risk analysis must also be done. Only then can architectural design get started.
The objective of hazard analysis is to systematically identify the dangers to human safety that a system may pose, including an evaluation of the likelihood of an accident resulting from each hazard. A popular technique for doing hazard analysis is called fault tree analysis. It takes a top-down hierarchical decomposition approach but it doesn't decompose functions the way we learned in freshman engineering class. Rather, it involves decomposing undesired system events in order to identify which combinations of hardware, software, human, or other errors could cause safety-threatening hazards.
Figure 3: Fault tree analysis example
A fault tree analysis begins by asking "What are the three (or six or seven) most life-threatening things my system could conceivably do?" Each safety threat you come up with will become the top node of its own fault tree as shown in Figure 3. Then ask, "What sorts of things could cause this to happen?" Your answers will be shown as the first level of decomposition of the fault tree. Then ask, "What sorts of things could cause each of these to happen?" Your answers will become the next level of the fault tree; and so on. You can use the logical "AND" and "OR" symbols from digital electronics to provide details of logical combinations in your diagrams, as in Figure 3.
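Once a fault tree has "AND" and "OR" gates, it can also be evaluated numerically. The sketch below is one common way to do that, assuming that the basic events are statistically independent (real systems often violate this). The tree itself, the event names, and every probability are invented for illustration and are not taken from Figure 3.

```python
from math import prod


def and_gate(*probabilities: float) -> float:
    """All inputs must occur: multiply (assumes independent events)."""
    return prod(probabilities)


def or_gate(*probabilities: float) -> float:
    """Any input suffices: 1 minus the chance that none of them occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical tree for "power window closes on an obstacle":
# the top event needs a close command (from either cause) AND a failed
# obstacle detection. All numbers are made-up illustration values.
p_switch_stuck = 0.01
p_software_fault = 0.001
p_detection_fails = 0.005

p_close_command = or_gate(p_switch_stuck, p_software_fault)
p_top_event = and_gate(p_close_command, p_detection_fails)

if __name__ == "__main__":
    print(f"P(top event) = {p_top_event:.3e}")
```

Working from the leaves toward the root like this shows which branches dominate the top-level probability, and therefore where design effort is best spent.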
An alternative, perhaps more systematic, approach to hazard analysis is called event tree analysis. It is a bottom-up approach that examines the results of operation or failure of system components and subsystems. An event tree is often diagrammed horizontally, as in Figure 4.
Figure 4: Event tree analysis example
The top-level safety-threatening event for this event tree is shown on the left and the various system components and subsystems involved in handling this event are shown across the top of the figure. The specific safety-threatening situation analyzed in this event tree is "If medical infusion pump fluid pressure fails, will the system report an alarm as required?" For each component and subsystem involved, probabilities of successful operation and failure are noted. When the probabilities are mathematically combined, the result of this example is that alarms will fail to be reported 16.21% of the time. (This example has been made up to keep the numbers simple. Real medical systems are typically much more reliable.)
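To show how such probabilities combine, here is a minimal sketch of a series event tree in which the alarm is reported only if every stage works. The stage names and reliabilities are hypothetical and are deliberately not the values from Figure 4. Independence between stages is assumed.

```python
def event_tree_outcomes(success_probabilities):
    """Walk a simple series event tree.

    Each stage either works (the branch continues) or fails (the branch
    ends in "alarm not reported"). Returns (p_reported, p_missed).
    """
    p_reached = 1.0  # probability of arriving at the current stage
    p_missed = 0.0   # accumulated probability of all failure branches
    for p_success in success_probabilities:
        p_missed += p_reached * (1.0 - p_success)
        p_reached *= p_success
    return p_reached, p_missed


# Hypothetical stage reliabilities (illustration only):
stages = {
    "pressure sensor": 0.99,
    "monitoring software": 0.98,
    "alarm driver": 0.95,
}

if __name__ == "__main__":
    reported, missed = event_tree_outcomes(stages.values())
    print(f"alarm reported: {reported:.2%}, alarm missed: {missed:.2%}")
```

The two results always sum to one, which is a handy sanity check when building an event tree by hand.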
After hazard analysis, the next step is risk analysis. Risk is the combination of the probability of an undesired event occurring and the severity of its consequences. It might be expressed in units such as "deaths per 100 years of system operation." Once the greatest risks posed by a system have been identified, they must be dealt with in the system design: if possible, the underlying hazards should be avoided or removed. This can often be done using:
- hardware overrides to bypass risky software components
- lockouts to prevent entry into risky states
- lockins to ensure remaining within safe states
- interlocks to constrain sequences of events in order to avoid hazards.
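These mechanisms are often built in hardware, but the logic is easy to picture in software. The following sketch models a lockout and an interlock for an imaginary irradiation room. The class `BeamInterlock` and its methods are hypothetical names chosen for this illustration, not part of any real product.

```python
class BeamInterlock:
    """Hypothetical example: the beam may only be switched on after the
    room door has been closed, and opening the door forces the beam off."""

    def __init__(self):
        self.door_closed = False
        self.beam_on = False

    def close_door(self):
        self.door_closed = True

    def open_door(self):
        # Interlock: the hazard is removed before the door state changes.
        self.beam_on = False
        self.door_closed = False

    def request_beam_on(self) -> bool:
        # Lockout: refuse entry into the risky state while the door is open.
        if not self.door_closed:
            return False
        self.beam_on = True
        return True
```

The point of the design is that there is no sequence of calls that leaves the beam on with the door open.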
Together with the system requirements, the results of the hazard analysis and risk analysis will guide a safety-critical system's architectural design.
Detecting sensor errors
Correct sensor data are so crucial to safe operation that many systems use redundancy in their sensor data acquisition. Redundancy doesn't always mean sensor replication as shown in Figure 5 with two identical sensors. It could also mean functional redundancy, or the measurement of the same real-world value in two different ways. For example, patient respiration rate can be measured both by the expansion and contraction of the rib cage, and by measurement of expiratory CO2 concentration.
Figure 5: Sensor-input comparison
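A two-sensor comparison might look something like the sketch below. The function name and the idea of returning the average of two agreeing readings are assumptions made for illustration. The tolerance would have to come from the sensors' specifications.

```python
def compare_sensor_pair(reading_a: float, reading_b: float, tolerance: float):
    """Return the averaged reading if the two sensors agree, else None.

    None signals a disagreement. The caller cannot tell which sensor is
    wrong and should treat the whole pair as untrustworthy.
    """
    if abs(reading_a - reading_b) <= tolerance:
        return (reading_a + reading_b) / 2.0
    return None
```

For example, `compare_sensor_pair(10.0, 10.5, tolerance=1.0)` yields a usable value, while `compare_sensor_pair(10.0, 20.0, tolerance=1.0)` reports a disagreement.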
You can also implement redundancy as analytic redundancy, which is the comparison of a measured value with a value derived in some other way, as shown in Figure 6.
Figure 6: Analytic redundancy: comparison with other data
For example, the result of a position sensor measurement could be compared with a calculation of the sum of the previous position plus velocity multiplied by elapsed time:
$$x_t = x_0 + v_{avg}\, t$$
If there is known constant acceleration, the formula would instead be:
$$x_t = x_0 + v_0\, t + \tfrac{1}{2}\, a\, t^2$$
High school physics does come in handy! If the calculated and measured values agree pretty closely, we're confident that the sensor is working correctly. Another example: in the medical world, patient heart rate can be extracted from a signal analysis of an arterial blood-pressure waveform. It can then be compared with the value measured directly from the patient's electrocardiograph signal when doing analytic redundancy.
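The position check can be sketched in a few lines. The function names and the default tolerance are assumptions made for illustration. With zero acceleration, `v0` plays the role of the average velocity in the first formula.

```python
def predicted_position(x0, v0, t, a=0.0):
    """Position expected after time t: x0 + v0*t + 0.5*a*t**2."""
    return x0 + v0 * t + 0.5 * a * t ** 2


def sensor_is_plausible(measured, x0, v0, t, a=0.0, tolerance=0.1):
    """Compare a measured position with the kinematic prediction."""
    return abs(measured - predicted_position(x0, v0, t, a)) <= tolerance
```

How "closely" the two values must agree is a design decision: too tight a tolerance produces false alarms, too loose a tolerance lets a drifting sensor go unnoticed.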
These approaches can be combined and embellished. In the approaches discussed so far, if the compared data disagree we know something is wrong with a sensor. But we don't know which sensor is wrong. So it's often best to just shut down the entire redundant pair in what's called a fail-stop. An alternative approach is to add a third redundant element and to replace the two-way comparison with three-way "voting." If you use three-way voting in a strictly replication-based design such as Figure 5, the result is called triple modular redundancy (TMR). But this could also be done in a mix-and-match sort of way, resulting in a combination of several kinds of redundancy. In the various triple redundancy approaches, a faulty sensor can be identified and shut down while the remaining redundant elements can continue to operate safely.
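Three-way voting can be implemented in several ways. The sketch below uses a median-based voter, which is one common choice and is not necessarily what any particular product does. The function name and return convention are assumptions for illustration.

```python
def tmr_vote(readings, tolerance):
    """Three-way vote over analog sensor readings.

    Returns (value, outliers). value is the median reading, or None when
    no two readings agree. outliers lists the indices of readings that
    differ from the median by more than the tolerance.
    """
    if len(readings) != 3:
        raise ValueError("tmr_vote expects exactly three readings")
    median = sorted(readings)[1]
    outliers = [i for i, r in enumerate(readings)
                if abs(r - median) > tolerance]
    if len(outliers) >= 2:
        return None, outliers  # no majority: nothing can be trusted
    return median, outliers
```

With one outlier, the faulty sensor is identified by its index and can be shut down while the other two keep running. With two outliers there is no majority, and the system has to fall back on some other strategy.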
If a safety-critical system has an immediate safe state, as illustrated on the left side of Figure 2, a shutdown system can be used to terminate a hazardous situation as soon as it detects it. The basic shutdown architecture is illustrated in Figure 7.
Figure 7: Basic shutdown architecture
The shutdown system is a dedicated unit with responsibility for identifying dangerous situations. It will force the entire system into a safe state (in other words, off) whenever a hazard is detected and thus lock the system out of a life-threatening state. The shutdown system is independent of the primary system that is normally in control, and operates in parallel with it. To ensure its complete independence, the shutdown system has its own separate sensor(s). A diagnostic subsystem is used to ensure the integrity of operation of the shutdown system itself. If the diagnostic subsystem determines that the shutdown system's decisions may be untrustworthy, it can bring the entire system to an immediate safe state rather than allowing the primary system to continue to operate without trustworthy shutdown monitoring going on in parallel.
A cancer irradiation facility can be designed in this way. The primary system operates a nuclear particle accelerator that directs a highly focused beam into a well-defined area of a patient's body. Its sensors monitor the radiation dosage on target. Its irradiation shutdown system, on the other hand, works with radiation sensors on other parts of the patient's body and in other parts of the treatment room. It will also monitor radiation dosage on target. The irradiation shutdown system itself is evaluated by an irradiation shutdown diagnostic subsystem.
As we will see later on, the primary system in Figure 7 can be designed in a number of ways, some of them quite complex with sophisticated redundancy built in. But as shown thus far, the shutdown system portion of the design has no redundancy and is thus a potential single point of failure. A single faulty shutdown system could also bring a cancer irradiation facility to a standstill, thus endangering its patients in a different way by denying them their medical treatment. So in fact, some safety-critical systems have dual shutdown systems working in parallel (with either "AND" or "OR" logic for deciding when to shut down the primary). In extreme instances, a safety-critical system can be designed with three shutdown systems working in parallel using TMR-style voting among them. In this way, a faulty shutdown system can be identified and itself be shut down while the remaining shutdown systems can continue to operate in redundant and trustworthy fashion and the primary system continues to provide its services.
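The "AND", "OR", and voting arrangements can all be seen as special cases of "k out of n" logic. The sketch below makes that explicit. The function name and parameters are made up for illustration.

```python
def should_shut_down(trip_demands, required_votes):
    """Shut down when at least `required_votes` shutdown systems demand it.

    required_votes=1 with two systems is "OR" logic,
    required_votes=2 with two systems is "AND" logic,
    required_votes=2 with three systems is TMR-style majority voting.
    """
    return sum(bool(demand) for demand in trip_demands) >= required_votes
```

The choice is a trade-off. "OR" logic shuts down if either unit sees a hazard, so a single faulty unit can cause a needless shutdown. "AND" logic avoids that, but a unit that wrongly stays silent can block a shutdown that is really needed. Two-out-of-three voting tolerates one faulty unit in either direction.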
Single channel with actuation monitoring
The idea of a shutdown system can also be applied on a smaller scale within a primary system itself, as shown in Figure 8. The ellipses represent major system activities, which could be implemented as software tasks or processes, either on separate processors or sharing a single processor, depending on the scale of the system. A basic primary system is structured by the simple design pattern of Input-Process-Output, shown here across the top of the figure as the sequence labeled "Data Acquisition," "Processing/Transformations," "Output/Control." To lower costs, the primary system and the sensor data integrity checking "shutdown" monitoring activity (at the lower left) are shown here as sharing the same input sensor(s).
Figure 8: Protected single channel, showing actuation monitoring options
The idea of shutdown monitoring can also be extended to the output side of a system. This is called actuation monitoring, which is illustrated on the right side of Figure 8 for a medical safety-critical system. Actuation monitoring can be done in a number of ways, each with a different balance of costs versus benefits. The most basic form of actuation monitoring is end-around monitoring. It simply checks the commands to the output actuators for validity before they reach the actuators themselves. A more stringent form is wrap-around monitoring, which checks that the output actuators are actually producing valid outputs that will soon reach the patient under treatment. A third, usually more costly, form is actuation-results monitoring that uses an independent set of sensors to verify that the system is actually producing the results it's intended to provide.
A medical infusion pump controller could be designed in this way. Let's assume that a stepper motor is doing the actual pumping of fluid. End-around monitoring could be used to check that the stepper motor is receiving the correct (or at least reasonable) commands. Wrap-around monitoring could use a fluid flow sensor to check that the correct (or reasonable) amount of fluid is being delivered to the patient under treatment. And actuation-results monitoring could use an invasive probe to measure the concentration of specific drugs or other contents in the patient's bloodstream resulting from the operation of the infusion pump.
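The first two checks for the infusion pump might be sketched as follows. The speed limit, the calibration constant, the tolerance, and the function names are all hypothetical values chosen for illustration. Actuation-results monitoring is left out because it depends on clinical measurements.

```python
MAX_STEPS_PER_SECOND = 200  # hypothetical safe upper limit for the motor
ML_PER_STEP = 0.01          # hypothetical pump calibration constant


def command_is_valid(steps_per_second: int) -> bool:
    """End-around check: inspect the command before it reaches the motor."""
    return 0 <= steps_per_second <= MAX_STEPS_PER_SECOND


def flow_is_valid(steps_per_second: int, measured_ml_per_second: float,
                  relative_tolerance: float = 0.10) -> bool:
    """Wrap-around check: compare the flow sensor with the commanded flow."""
    expected = steps_per_second * ML_PER_STEP
    error = abs(measured_ml_per_second - expected)
    return error <= relative_tolerance * expected
```

Note that the wrap-around check can catch faults the end-around check cannot, such as a perfectly valid command sent to a motor that has stalled.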
A significant weakness of both the shutdown system and the single-channel architectures is that they cannot continue to operate safely in the presence of faults. They have single points of failure. You can see them stretching across the top of Figure 8. This means that these architectures can only be used in safety-critical systems that have an immediate safe state, as on the left side of Figure 2.
For safety-critical systems without an immediate safe state, dual-channel architectures can be used to allow a system to continue operation even when one of its channels has "fail stopped." In Figure 9 we see an illustration of a dual-channel architecture in which each of the channels uses the single-channel architecture of Figure 8.
Figure 9: A dual-channel architecture
In a dual-channel architecture, one of the channels serves as the primary, or active, channel and the other is a standby or backup channel, ready to take over system operation if the current primary channel suffers faults or failure. Depending on the needs of the specific safety-critical system, the standby channel, on becoming active, could either continue normal operation of the system or take the system through a possibly long and complex sequence of steps to bring it to its eventual safe state.
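A very reduced sketch of the switchover logic is shown below. It assumes that each channel is a callable that raises an exception when its own monitoring detects a fault. That convention, the class `DualChannel`, and the two demo channels are all invented for illustration.

```python
class DualChannel:
    """Run the active channel and promote the standby if it fail-stops."""

    def __init__(self, primary, standby):
        self.channels = [primary, standby]
        self.active = 0  # index of the channel currently in control

    def step(self, sensor_value):
        while self.active < len(self.channels):
            try:
                return self.channels[self.active](sensor_value)
            except Exception:
                self.active += 1  # fail-stop this channel, try the next
        raise RuntimeError("no healthy channel left")


def healthy_channel(sensor_value):
    return sensor_value * 2  # stand-in for the real control computation


def faulty_channel(sensor_value):
    raise ValueError("channel self-check failed")  # simulated fault
```

A failed channel is never retried here, which mirrors the fail-stop idea: once a channel has shown itself to be untrustworthy, it stays out of the loop.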
For example, an operating room heart/lung bypass machine must continue to deliver its life-sustaining services even if one of its internal embedded processing channels fails. On the other hand, a nuclear reactor control system, in cases of failure of one of its internal embedded processing channels, would be expected to stay in operation long enough to shut down the reactor by proceeding through a lengthy sequence of activities: stepping the control rods down into the full depth of the reactor core while accelerating the flow of coolant through the reactor, and monitoring the gradual slowdown of the nuclear reaction through myriad sensors, until the reactor can be declared safe for human access.
Dual-channel architecture is going to have higher unit costs than previous architectures we've discussed. There will be redundant embedded processing channels using redundant hardware and redundant sensors. But the big benefit of paying this price is the ability to continue to operate in the presence of a fault.
Dual-channel architecture has a number of popular variants. If the two channels shown in Figure 9 use the same replicated software and hardware, the architecture can handle random faults well but it can't handle systematic faults such as software design or coding defects that would be reproduced in both channels. If this is of concern in your system, a heterogeneous dual-channel architecture is preferable. This kind of architecture would consist of two channels implemented in totally different ways. For example, software for the two channels could be implemented by separate software-development teams working from the same software requirements specification, in what is called "n-version programming" or "dissimilar software." Clearly, the development costs as well as the unit cost for doing this would be high.
Another variant of the dual-channel architecture is multi-channel voting architecture. This extends the TMR approach discussed earlier for sensor error detection into the realm of entire replicated processing channels. In this architecture, three (or more) channels operate in parallel. A "voter" compares the outputs of the channels: if a majority of channel outputs agree, the majority value becomes the system's output. Channels whose outputs disagree with the majority are fail-stopped.
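For channels that produce discrete outputs, the voter can be as simple as the sketch below. It assumes a non-empty list of hashable outputs and requires a strict majority. The function name and return convention are assumptions for illustration.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (winning_output, dissenting_channel_indices).

    If no output has a strict majority, returns (None, []) because no
    individual channel can be singled out as faulty.
    """
    winner, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, []
    dissenters = [i for i, out in enumerate(outputs) if out != winner]
    return winner, dissenters
```

The returned indices tell the system which channels to fail-stop. Keep in mind that the voter itself then becomes a critical component whose own integrity has to be considered.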
An example of a multi-channel architecture used in aerospace applications is the dual-dual architecture. Four independent processing channels are organized into two pairs of two channels each. While one pair is active, the members of that pair are continually comparing results. As long as they agree, they will continue to be active. But as soon as they disagree, they will hand over control to the other pair, which will then become the active pair.
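The handover rule can be sketched as follows. Each channel is modeled as a plain callable, and exact equality is used for the comparison. Both are simplifying assumptions, and the class and demo channels are hypothetical.

```python
class DualDual:
    """Two pairs of channels; the active pair checks itself by comparison."""

    def __init__(self, pair_a, pair_b):
        self.pairs = [pair_a, pair_b]
        self.active = 0  # index of the pair currently in control

    def step(self, value):
        while self.active < len(self.pairs):
            first, second = self.pairs[self.active]
            out_first, out_second = first(value), second(value)
            if out_first == out_second:
                return out_first
            self.active += 1  # disagreement: hand over to the other pair
        raise RuntimeError("both pairs have disagreed")


def good_channel(value):
    return value + 1  # stand-in for the real computation


def drifting_channel(value):
    return value + 2  # simulated faulty channel
```

Note that a pair that disagrees does not need to know which of its members is wrong. It simply steps aside.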
Many safety-critical systems do not have an immediate safe state, but can't incur the high costs of a full dual-channel or multiple-channel architecture. A lower-cost compromise solution is the monitor-actuator architecture shown in Figure 10.
Figure 10: A monitor-actuator architecture
This architecture doesn't have replicated identical channels, but instead has heterogeneous channels that differ from one another. It has a single primary actuation channel that normally controls the system, shown in the upper portion of Figure 10. The operation and results of this channel are examined by a separate simpler monitoring channel shown below it. If the monitoring channel detects a fault in the actuation channel, normal operation of the actuation channel can't continue. Instead, control of the system is passed to a separate safety channel shown at the bottom of the figure, which has responsibility for bringing the system to a safe state. Depending on the needs of the specific safety-critical system, the safety channel could take the system through a possibly long and complex sequence of steps to bring it to its eventual safe state.
The monitor-actuator architecture could be a reasonable low-cost compromise for applications such as chemical process control or car power windows. It can also serve in applications appropriate for "graceful degradation" of function such as automotive brake-by-wire. The safety channel would implement the graceful degradation of system function.
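Taking the power window as an example, the three channels could be sketched as below. The class, the function names, and the dictionary-based `demand` format are all invented for illustration. The safety channel simply opens the window, as suggested earlier in this article.

```python
class MonitorActuator:
    """Actuation channel checked by a simpler monitor; on a detected
    fault, control passes permanently to the safety channel."""

    def __init__(self, actuate, monitor, make_safe):
        self.actuate = actuate
        self.monitor = monitor
        self.make_safe = make_safe
        self.in_safety_mode = False  # latched once the monitor trips

    def step(self, demand):
        if not self.in_safety_mode:
            command = self.actuate(demand)
            if self.monitor(demand, command):
                return command
            self.in_safety_mode = True
        return self.make_safe(demand)


def window_actuation(demand):
    return demand["switch"]  # hypothetical primary channel: obey the switch


def window_monitor(demand, command):
    # Independent check: never close while an obstacle is sensed.
    return not (command == "close" and demand["obstacle"])


def window_safety(demand):
    return "open"  # safe state: open the window completely
```

The monitor is kept much simpler than the actuation channel on purpose. It only has to recognize unsafe commands, not compute correct ones.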
Keeping people safe
The selection of a safety-critical system architecture is driven by a rigorous hazard analysis followed by risk analysis, in addition to conventional system requirements definition. System design may include combinations of redundant sensor configurations, shutdown systems, actuation monitoring, multiple channel architectures, and/or monitor-actuator structuring.
These embedded systems architectures are much more valuable than can be measured in dollars and cents. Their true value is in protecting and saving human lives.
David Kalinsky is director of customer education at Enea Embedded Technology. He is a lecturer and seminar leader on technologies for embedded software in North America and Europe. In recent years, David has built high-tech training programs for a number of Silicon Valley companies, on aspects of software engineering for the development of real-time and embedded systems. Before that, he was involved in the design of safety-critical embedded medical and aerospace systems. David holds a Ph.D. in nuclear physics from Yale University. You can reach him at email@example.com.