Practical tips on designing safety-critical software
Embedded systems may need to comply with a variety of safety standards, depending on the market and intended use. These standards may in turn set out requirements that are drawn from international standards. Standards such as those published by the International Electrotechnical Commission (IEC) attempt to establish a common set of requirements so that each country or market does not end up with completely separate ones.
First, this paper outlines several ways to determine the safety requirements for a system, including customer interaction, competitive intelligence, and professional assistance.
Second, it covers the basics of the IEC 61508 standard and the concept of safety integrity levels. It also explains how faults in an embedded system can lead to errors, how errors can lead to failures, and how multiple instances of each may be involved in that progression.
Finally, it discusses several system architectures and their designs. Each is tied to the safety integrity level of the overall system and is assessed for its applicability to embedded systems.
Which safety requirements?
One of the most important aspects of developing safety-critical software is determining which requirements and standards are going to be followed. Depending on how new your product or the intended market is, you may need outside help to determine which standards need to be met. Consider following the steps below to aid in your safety certification effort:
Step #1 - Customer interaction: If you are entering a new market, the intended customer probably knows the starting point for which safety requirements need to be met. They may be able to provide information on the safety standards that a similar product already meets. If the end customer is just using the product without a lot of technical background, then this step should be skipped.
Step #2 - Similar product in same intended market: It may be more straightforward to see what safety requirements and standards a similar product meets. For instance, if your company or a partner already sells a medical device to the same market, and your product is similar, this may be a good place to start.
Step #3 - Competitive intelligence: Basic research on the Internet or in freely available marketing materials may also help determine a good starting point. Often, paperwork stating which specific standards were met needs to be filed with agencies.
Step #4 - Professional assistance: Each market or market segment normally has agencies or contract facilities that can aid in determining which standards need to be met. Paying a little up front, especially after gathering necessary information from steps 1-3 above, will help make this particular step pay off in the long run.
After gathering this information, you should have a good idea of which standards need to be met. During this investigation, also determine whether compliance is a self-certification activity, a standardized assessment activity, or a full-fledged independent assessment certification.
For each set of requirements that needs to be met, the team should develop a strategy and an initial analysis of how to comply with the requirements and which IEC 61508 safety integrity level (SIL) should be targeted.
What IEC 61508 requires
One of the more commonly used standards for the design, implementation, and maintenance of a safety-critical system is IEC 61508, maintained by the International Electrotechnical Commission (IEC). The standard covers all kinds of industries and isn’t specific to a particular one. It defines a concept called “the equipment under control” or EUC. For the EUC, it further stipulates that the process used to develop it, its usage, and its maintenance all affect how far safety-related risk can be reduced.
The standard originated in the process control industry and specifies a safety life cycle of sixteen phases, categorized as follows:
- Phases 1 - 5: Analysis
- Phases 6 - 13: Design and implementation
- Phases 14 - 16: Operation and maintenance
The definitions of risk and safety function are also specified in the standard. A risk is a function of the likelihood and severity of any event that can happen within the EUC. A safety function reduces the risk, by either reducing the severity of the event or reducing the likelihood of the event. The standard drives home the fact that there can never be a zero risk system – our only option is to manage those risks with safety functions.
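As a rough illustration of these definitions, the sketch below models risk as the product of likelihood and severity and models a safety function as a pair of reduction factors. The multiplicative form, the units, and the names `HazardousEvent`, `risk`, and `apply_safety_function` are assumptions made for illustration only; the standard itself says only that risk is a function of likelihood and severity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HazardousEvent:
    """A hypothetical hazardous event within the EUC."""
    likelihood_per_year: float  # expected occurrences per year (assumed unit)
    severity: float             # consequence per occurrence (assumed unit)


def risk(event: HazardousEvent) -> float:
    """Risk modelled as likelihood x severity (an illustrative assumption)."""
    return event.likelihood_per_year * event.severity


def apply_safety_function(event: HazardousEvent,
                          likelihood_factor: float = 1.0,
                          severity_factor: float = 1.0) -> HazardousEvent:
    """Return the event as it looks once a safety function is in place.

    Each factor must lie in (0, 1]. A factor of zero is rejected because
    it would describe a zero-risk system, which cannot exist.
    """
    for factor in (likelihood_factor, severity_factor):
        if not 0.0 < factor <= 1.0:
            raise ValueError("reduction factors must be in (0, 1]")
    return HazardousEvent(
        likelihood_per_year=event.likelihood_per_year * likelihood_factor,
        severity=event.severity * severity_factor,
    )
```

For example, a safety function that makes an event one hundred times less likely would be expressed as `apply_safety_function(event, likelihood_factor=0.01)`, and the residual risk is whatever `risk()` returns for the result.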
Faults, errors, and failures
A fault is a characteristic of an embedded system that could lead to a system error. An example of a fault is a software pointer that is not initialized correctly under specific conditions, where use of the pointer could lead to a system error. There are also faults that could exist in software that never manifest themselves as an error, and are not necessarily seen by the end user.
An error is erroneous behavior of the system that is unexpected by the end user. It is the behavior the system exhibits when one or more faults are triggered.
An example could be a sub-process that quits running within the system because of a software pointer that is not initialized correctly. An error may not necessarily lead to a system failure, especially if the error has been mitigated by having a process check whether this sub-task is running and restart it if necessary.
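The mitigation described above, a process that checks whether a sub-task is running and restarts it, might be sketched as follows. `SubTask`, `Supervisor`, and the `max_restarts` limit are hypothetical names and choices for illustration; a real embedded system would more likely build this on RTOS task handles or a hardware watchdog.

```python
class SubTask:
    """Hypothetical sub-task that can stop running because of a latent fault."""

    def __init__(self, name):
        self.name = name
        self.running = False
        self.restart_count = 0

    def start(self):
        self.running = True

    def crash(self):
        # Simulates the error: the sub-task quits unexpectedly.
        self.running = False


class Supervisor:
    """Checks sub-tasks and restarts any that have stopped."""

    def __init__(self, tasks, max_restarts=3):
        self.tasks = list(tasks)
        self.max_restarts = max_restarts

    def check(self):
        """Run one supervision pass.

        Returns the names of tasks that could not be recovered. A
        non-empty result means the error was not contained and may
        surface as a system failure.
        """
        unrecovered = []
        for task in self.tasks:
            if task.running:
                continue
            if task.restart_count < self.max_restarts:
                task.restart_count += 1
                task.start()
            else:
                unrecovered.append(task.name)
        return unrecovered
```

Capping the number of restarts is a design choice in this sketch: a sub-task that keeps stopping is reported rather than restarted forever, so a persistent fault is not hidden.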
For an embedded system, a failure is best described as the system not performing its intended function or service as expected by its users at some point in time. Since this is largely based on the user’s perception or usage of the system, the issue itself could lie in the initial system requirements or customer specification, not necessarily in the software itself.
However, a failure could also result from a single error, or from erroneous system functionality caused by multiple errors in the system. Following the example above, the software pointer initialization fault could result in a sub-task running error, which in turn causes a system failure such as a crash or a user interface that does not perform correctly.
An important point is that a fault, error, or failure may never progress to the next level. An uninitialized software pointer is a fault, but if it is never used then an error would not occur (and neither would the failure). Multiple instances of faults and errors, possibly on completely different fault trees, may also be needed in order to progress to the next state. Figure 1 shows the progression for faults, errors, and failures:
For safety-critical systems, there are techniques that can be used to minimize the progression of faults to errors to failures.
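One such technique is a defensive check that stops a fault from turning into an error. The sketch below guards against a handle that was never initialized, the closest Python analogue of the uninitialized pointer used as the example in this section. `SensorHandle`, `read_sensor`, and `SAFE_DEFAULT` are hypothetical, and whether a fallback value is actually safe depends entirely on the system in question.

```python
SAFE_DEFAULT = 0.0  # assumed safe fallback value; system-specific in practice


class SensorHandle:
    """Hypothetical handle to a sensor driver."""

    def __init__(self, value):
        self._value = value

    def read(self):
        return self._value


def read_sensor(handle, fault_log):
    """Read through a handle that may never have been initialized.

    The fault (handle is None) is detected and recorded before it can
    be dereferenced, so it does not become an error seen by the user.
    """
    if handle is None:
        fault_log.append("sensor handle not initialized")
        return SAFE_DEFAULT
    return handle.read()
```

Recording the fault instead of silently returning the fallback keeps it visible to maintenance even though the end user never sees an error.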
Safety integrity levels
Safety integrity levels (SILs) are used to classify the types of safety systems in IEC 61508. Attaining a particular SIL shows compliance in the areas of improved reliability, mitigation of faults/errors/failures, system architecture design, verification, and validation.
There are two types of systems in the standard that these levels apply to. The first type is a “one shot” system. This is a system that operates no more often than twice the “proof test frequency”, which is defined as how often the entire system is completely tested and confirmed to be fully functional.
The second type is labeled a “continuous” system: a system that operates much more often than the proof test frequency.
The tables in Figure 2 and Figure 3 show the probability of failure and the risk reduction factors for the two types of safety systems.
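As a rough companion to those tables, the sketch below encodes the SIL bands as they are commonly published for IEC 61508: average probability of failure on demand for the first type of system (which the standard calls low demand mode) and probability of dangerous failure per hour for continuous systems. The risk reduction factor is taken as the reciprocal of the probability of failure on demand. The function and table names are hypothetical, and the values should be checked against the standard itself before being relied on.

```python
# Bands are (lower bound inclusive, upper bound exclusive), keyed by SIL.
# Average probability of failure on demand ("one shot" / low demand systems).
LOW_DEMAND_PFD = {
    4: (1e-5, 1e-4),
    3: (1e-4, 1e-3),
    2: (1e-3, 1e-2),
    1: (1e-2, 1e-1),
}

# Probability of dangerous failure per hour (continuous systems).
CONTINUOUS_PFH = {
    4: (1e-9, 1e-8),
    3: (1e-8, 1e-7),
    2: (1e-7, 1e-6),
    1: (1e-6, 1e-5),
}


def sil_for(value, bands):
    """Return the SIL whose band contains value, or None if none does."""
    for sil, (low, high) in bands.items():
        if low <= value < high:
            return sil
    return None


def risk_reduction_factor(pfd_avg):
    """Risk reduction factor as the reciprocal of the average PFD."""
    if pfd_avg <= 0.0:
        raise ValueError("probability of failure on demand must be positive")
    return 1.0 / pfd_avg
```

For instance, an average probability of failure on demand of 5e-3 falls in the SIL 2 band and corresponds to a risk reduction factor of 200.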
There is also the concept of a “SIL 0” safety component. This is an item for which the detection of a fault is not readily measurable, or for which driving the component provides no measurement or feedback that it is performing the desired function. These components can be used in a higher-SIL system, but only together with redundancy and/or other devices that can determine whether they are working correctly.
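One way to picture the extra devices mentioned above is an independent cross-check: the component is driven, and separate sensors that observe its effect must agree with the command. The sketch below is an assumption-laden illustration, not a prescribed architecture; `output_confirmed`, the threshold, and the agreement count are all hypothetical.

```python
def output_confirmed(commanded_on, readings, threshold, min_agreeing=2):
    """Cross-check a component that provides no feedback of its own.

    readings holds values from independent sensors that observe the
    component's effect (for example, flow downstream of a valve).
    A reading agrees with the command when it is above the threshold
    while the component is commanded on, or at or below it while the
    component is commanded off.
    """
    agreeing = sum(1 for r in readings if (r > threshold) == commanded_on)
    return agreeing >= min_agreeing
```

Requiring at least two agreeing readings means a single faulty sensor can neither confirm nor veto the command on its own when three sensors are fitted.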