Practical tips on designing safety-critical software

Embedded systems may need to comply with a variety of safety standards depending on the market and intended use. Many of these are harmonized with international standards: bodies such as the International Electrotechnical Commission (IEC) develop common sets of requirements so that each individual country or market does not end up with completely separate rules.

First, this paper outlines several ways that the safety requirements for a system can be identified, including customer interaction, competitive intelligence, and professional assistance.

Second, it covers the basics of the IEC 61508 standard and the concept of safety integrity levels. It also explains how faults in an embedded system can lead to errors, and how errors can lead to failures, including how that progression unfolds and how multiple instances may be involved.

Finally, it discusses several system architectures and their designs. Each is tied to the safety integrity level of the overall system, and each is discussed for its applicability to embedded systems.

Which safety requirements?
One of the most important aspects of developing safety-critical software is determining which requirements and standards are going to be followed. Depending on the newness of either your product or the intended market, you may need to get outside help to determine what standards need to be met. Consider following the steps below to aid in your safety certification effort:

Step #1 – Customer interaction: If you are entering a new market, the intended customer probably knows the starting point for which safety requirements need to be met. They may be able to provide information on the safety standards that a similar product already meets. If the end customer is just using the product without a lot of technical background, then this step should be skipped.

Step #2 – Similar product in same intended market: It may be more straightforward to see what safety requirements and standards a similar product meets. For instance, if your company or a partner already sells a medical device to the same market, and your product is similar, this may be a good place to start.

Step #3 – Competitive intelligence: Basic research on the Internet, or through freely available marketing materials, may also help determine a good starting point. Often, paperwork listing which specific standards were met must be filed with regulatory agencies.

Step #4 – Professional assistance: Each market or market segment normally has agencies or contract facilities that can aid in determining which standards need to be met. Paying a little up front, especially after gathering the necessary information in steps 1-3 above, will make this step pay off in the long run.

After gathering this information, you should have a good idea about which standards need to be met. During this investigation, also determine whether compliance is a self-certification activity, a standardized assessment, or a full-fledged independent assessment and certification.

For the sets of requirements that must be met, the team should develop a strategy and an initial analysis of how to comply with them and which IEC 61508 SIL safety level should be targeted.

What IEC 61508 requires
One of the more commonly used standards for the design, implementation, and maintenance of a safety-critical system is maintained by the International Electrotechnical Commission (IEC). The standard covers all kinds of industries and isn't specific to any particular one. It defines a characterization called "the equipment under control" or EUC. For the EUC, it further stipulates that the process used to develop it, its usage, and its maintenance can all affect how safety-related risk is reduced.

The standard originated in the process control industry, and it specifies a safety life cycle of sixteen phases, categorized as follows:

  • Phases 1-5: Analysis
  • Phases 6-13: Design and implementation
  • Phases 14-16: Operation and maintenance

The definitions of risk and safety function are also specified in the standard. A risk is a function of the likelihood and severity of any event that can happen within the EUC. A safety function reduces that risk by reducing either the severity or the likelihood of the event. The standard drives home the point that there is no such thing as a zero-risk system: the only option is to manage risks with safety functions.
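
A common informal reading of these definitions (an illustration, not a formula the standard mandates) is:

    risk = likelihood of a hazardous event × severity of its consequences

A safety function lowers this product by attacking either factor.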

Faults, errors, and failures
A fault is a characteristic of an embedded system that could lead to a system error. An example of a fault is a software pointer that is not initialized correctly under specific conditions, where use of the pointer could lead to a system error. There are also faults in software that never manifest themselves as an error and are never seen by the end user.
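
As a minimal C illustration of such a fault (hypothetical names, invented for this sketch), here is a pointer that is assigned on only one path; the fault lies dormant until the other path executes:

```c
#include <stdint.h>

static uint32_t cached_rpm;

/* Hypothetical sensor read: the pointer is assigned on only one
 * branch, so the fault stays dormant until use_cache == 0. */
uint32_t read_rpm(int use_cache)
{
    uint32_t *p;              /* fault: no default initialization */
    if (use_cache)
        p = &cached_rpm;      /* the only path that sets p */
    return *p;                /* undefined behavior when use_cache == 0:
                                 here the fault becomes an error */
}
```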

An error is erroneous system behavior that is unexpected by the end user. It is the behavior the system exhibits when one or more faults occur.

An example is a sub-process that quits running within the system because a software pointer was not initialized correctly. An error does not necessarily lead to a system failure, especially if it has been mitigated, for example by a process that checks whether the sub-task is running and restarts it if necessary.
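
A minimal sketch of such a supervising check, assuming hypothetical task-status and restart services (a real design would use the RTOS's own APIs):

```c
#include <stdbool.h>

/* Hypothetical task services, named for this sketch only; a real
 * design would use the RTOS's own status and restart calls. */
bool task_is_alive(int task_id);
void task_restart(int task_id);
void log_event(const char *msg);

#define SENSOR_TASK_ID 3   /* assumed ID of the monitored sub-task */

/* Periodic supervisor: detects the "sub-process quit running" error
 * and restarts it before it can progress to a system failure. */
void supervisor_poll(void)
{
    if (!task_is_alive(SENSOR_TASK_ID)) {
        log_event("sensor task dead - restarting");
        task_restart(SENSOR_TASK_ID);
    }
}
```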

For an embedded system, a failure is best described as the system not performing its intended function or service as expected by its users at some point in time. Since this is largely based on the user's perception or usage of the system, the issue itself could lie in the initial system requirements or customer specification, not necessarily in the software.

However, a failure can also result from a single error, or from erroneous system functionality caused by multiple errors. Following the example above, the pointer-initialization fault could produce a sub-task error which, if not mitigated, causes a system failure such as a crash or a user interface that stops performing correctly.

An important aspect of this progression is that a fault or error may never manifest itself at the next level. An uninitialized software pointer is a fault, but if it is never used then an error does not occur (and neither does a failure). There may also need to be multiple instances of faults and errors, possibly on completely different fault trees, for the progression to the next state to occur. The diagram in Figure 1 shows the progression from faults to errors to failures:

Figure 1: Fault to error to failure progression

For safety-critical systems, there are techniques that can be used to minimize the progression of faults to errors to failures.
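
One such technique is simple defensive coding that traps a fault before it can become an error. A minimal sketch, reusing the hypothetical pointer example from above:

```c
#include <stdint.h>
#include <stddef.h>

#define RPM_SAFE_DEFAULT 0u   /* assumed safe fallback for this sketch */

/* Defensive variant of the earlier pointer example: the fault is
 * trapped before use, so it never progresses to an error. */
uint32_t read_rpm_defensive(const uint32_t *p)
{
    if (p == NULL)
        return RPM_SAFE_DEFAULT;  /* substitute a safe value */
    return *p;
}
```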

Safety integrity levels
Safety integrity levels (SILs) are used in the IEC 61508 document to classify safety systems. Attaining a particular SIL demonstrates compliance in the areas of improved reliability, mitigation of faults/errors/failures, system architecture design, verification, and validation.

There are two types of systems in the standard to which these levels apply. The first is a "one shot" system (the standard's low-demand mode): a system whose safety function is called upon no more often than twice the "proof test frequency". That frequency defines how often the entire system is completely tested and confirmed to be fully functional.

Figure 2: Safety integrity levels for a one shot system

The second type of system is labeled a "continuous" system: a system that operates for much longer than the proof test interval.

The tables in Figure 2 and Figure 3 show the probability of failure and the risk reduction factors for the two types of safety systems.
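
As a worked reading of these tables (a sketch using the widely published low-demand figures, not a substitute for the standard's own tables): the risk reduction factor (RRF) is the reciprocal of the average probability of failure on demand,

    RRF = 1 / PFDavg

so a SIL 2 function in low-demand mode, with a PFDavg between 10^-3 and 10^-2, provides a risk reduction factor between 100 and 1,000.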

Figure 3: Safety integrity levels for a continuous system

There is also the concept of a "SIL 0" safety component. This is an item where the detection of a fault is not readily measurable, or where driving the component provides no measurement or feedback that it is performing the desired function. These components can be used in a higher-level SIL system, but redundancy and/or other devices that can verify they are working correctly must be used.

Safety-critical architectures
A large part of creating a safety-critical system is deciding on the system/software architecture that is going to be used. Consider the processor architecture in Figure 4:

Figure 4: SIL 0 safety system

In this configuration, if running safety-critical software, what happens if the processor does something unexpected? What if the processor runs something out of a bad memory location, or there is a latent failure that only exhibits itself after some period of time?

This processor alone wouldn't be able to satisfy a truly safety-critical system. This is the classic view of a SIL 0 system. Depending on the safety level required, external components may be added around the processor to perform the desired safety function in parallel if the processor cannot do so.
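
As an illustration of one such external component, here is a minimal sketch of servicing an off-chip watchdog that can force the safe state independently of the CPU. The register address and key are placeholders invented for this sketch:

```c
#include <stdint.h>

/* Placeholder register address and key, invented for this sketch;
 * real values come from the board design and the watchdog IC's
 * datasheet. */
#define WDOG_KICK_REG (*(volatile uint32_t *)0x40002000u)
#define WDOG_KICK_KEY 0xA5A5A5A5u

/* Called once per main-loop pass. If the processor hangs or runs off
 * into bad memory, the kicks stop arriving and the external watchdog
 * drives the outputs to their safe state without the CPU's help. */
void wdog_kick(void)
{
    WDOG_KICK_REG = WDOG_KICK_KEY;
}
```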

As the complexity of the interface goes up, replicating it with circuitry may not provide the mitigation needed for failures that can happen in your system. This is especially true when the critical data is contained within serial messages or Ethernet frames. When the amount of safety-critical data increases, or the number of safety mechanisms increases, it is time for a different architecture.

The following sections outline various architectures that could be used for a safety-critical system. For each architecture, notes are included to outline various aspects, including positives and negatives.

Representative SIL 1 or SIL 2 System
In the architecture shown in Figure 5, one processor still performs the majority of the embedded system work. In this case, a second processor is added to examine the safety-related data and make assessments about it. It then looks at the output of the main processor and decides whether that processor is doing what it is supposed to do.

As an example, say there is a bit of information in the serial stream that means "STOP" and a separate discrete input signal that also means "STOP". Both processors could be designed to have visibility into both pieces of data. The main processor would process the safety-critical "stop" logic along with all of the other operations it performs. The secondary processor would simply check whether the main processor ordered a stop based on this data, and would take action if it did not. The main processor might stop in a more graceful way, where the secondary processor does something more abrupt (like turning the driveline off).
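
A minimal sketch of the secondary processor's check, with the I/O helpers as hypothetical placeholders for whatever the real hardware interfaces would be:

```c
#include <stdbool.h>

/* Hypothetical I/O helpers for this sketch; the real interfaces
 * depend on the hardware design. */
bool serial_stop_bit_set(void);     /* "STOP" bit in the serial stream   */
bool discrete_stop_asserted(void);  /* separate discrete "STOP" input    */
bool main_cpu_commanded_stop(void); /* observed output of main processor */
void driveline_off(void);           /* abrupt, guaranteed-safe action    */

/* Secondary processor's periodic safety check: it watches only the
 * safety-related inputs and the main processor's response to them. */
void monitor_poll(void)
{
    bool stop_required = serial_stop_bit_set() || discrete_stop_asserted();

    if (stop_required && !main_cpu_commanded_stop()) {
        /* Main processor missed the stop condition: take the more
         * abrupt action described in the text. */
        driveline_off();
    }
}
```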

Figure 5: SIL 1/2 safety system

This architecture lends itself to systems where there is a "safe" state that the system can reside in. It is also good because the complexity on the secondary processor side is limited to just the safety functions of the system. The main processor still runs all of the other non-safety code (the secondary does not). When the complexity of the safety case goes up, or the safety-critical level goes up, then a different architecture is needed to process the data.

Representative SIL 3 or SIL 4 System
In a SIL 3 or SIL 4 system architecture (Figure 6), two processors, which could be identical, perform the safety aspects of the system. The processors labeled "A" and "B" perform the same operations and handle the same data. The other processor, labeled "C", performs clean-up tasks and executes code that has nothing to do with the safety aspects of the system.

Figure 6: SIL 3/4 safety system

Various tricks can be used to make the two processors a little different. First, the memory maps for the processors can be shifted, so that a software error dealing with memory on one processor wouldn't involve the same memory on the other processor.

They could also be clocked and operated separately; there may not be a requirement to have the processors execute instructions in lock-step with each other. For this architecture, if the processors disagree, the system goes to its "safe state".
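
A minimal sketch of the A/B cross-check, with the data-exchange and safe-state calls as hypothetical placeholders; it would run identically on both safety processors:

```c
#include <stdint.h>

/* Hypothetical inter-processor services for this sketch only. */
uint32_t compute_safety_output(void);      /* this processor's result */
uint32_t exchange_with_peer(uint32_t own); /* returns peer's result   */
void enter_safe_state(void);               /* system's stop/safe state */
void apply_output(uint32_t out);

/* Runs identically on processors A and B: each computes the safety
 * output from the same data, compares with its peer, and any
 * disagreement drives the system to its safe state. */
void safety_cycle(void)
{
    uint32_t mine = compute_safety_output();
    uint32_t peer = exchange_with_peer(mine);

    if (mine != peer) {
        enter_safe_state();   /* disagreement: fail safe */
        return;
    }
    apply_output(mine);
}
```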

This and the previous architectures assume there is a "stop" or "safe" state for the embedded system. If the system must continue to operate, then a more complex system architecture is needed.

Representative “Voter” SIL 3 or SIL 4 System
The architecture shown in Figure 7 is a "voter" type of system. For this type of system, the processors actually vote on what should be done next. Information is compared among all of them, and the decision with the greatest number of votes wins. Any disagreement among the processors is logged and flagged so that maintenance can be done on the system.

Figure 7: SIL 3/4 safety voting system
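
A minimal sketch of a two-out-of-three vote on the processors' outputs, with hypothetical logging and no-majority handlers; a real voter would be considerably more involved:

```c
#include <stdint.h>

/* Hypothetical handlers for this sketch. */
void log_disagreement(void);    /* flag for maintenance, per the text */
void handle_no_majority(void);  /* system-specific action if no two agree */

/* Two-out-of-three majority vote on the processors' outputs. */
uint32_t vote_2oo3(uint32_t a, uint32_t b, uint32_t c)
{
    if (!(a == b && b == c))
        log_disagreement();  /* any mismatch is logged and flagged */

    if (a == b || a == c)
        return a;            /* a agrees with at least one peer */
    if (b == c)
        return b;            /* a is the odd one out */

    handle_no_majority();    /* all three differ: no valid result */
    return b;                /* value is moot once handling has run */
}
```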

There also needs to be periodic checking of the voting mechanism itself, so that it is known to work and does not have a latent failure.

This type of architecture is a large jump in complexity. There are numerous test cases that need to be performed to evaluate such a system, and the number of possibilities increases sharply. Embedded engineers can spend their entire careers dealing with the intricacies of systems like this, and the development is neither quick nor predictable in schedule.

Selecting the right architecture up front, based on the safety requirements, is extremely important. Having to shift from one architecture to another after development has started is expensive and complicated.

Mark Kraeling is Product Manager at GE Transportation, Melbourne, Florida, where he is involved with advanced product development in real-time controls, wireless, and communication systems. He has been developing embedded software for the automotive and transportation industries since the early 1990s. Mark holds a BSEE from Rose-Hulman, an MBA from Johns Hopkins, and an MSE from Arizona State.

This article was used as part of a class on "Practical design of safety-critical architectures (ESC-330)" that the author conducted at the Embedded Systems Conference.
