
Software safety by the numbers

When it comes to safety, it's not what you do, but how you do it. The IEC 61508 standard outlines how safety-critical projects should be managed and how safety-critical code should be identified and created.

A client developing a safety-rated embedded device asked me to assess his real-time operating system for its part in the upcoming safety audit. My conversation with the vendor went something like this:

“Were there significant changes between the previous version of your kernel and the current maintenance release, or just bug fixes?” I asked.

“Well, we changed the compiler tool chain and added a new feature,” the vendor responded.

“Were the changes reviewed for possible side-effects on other parts of the system?”

“We tested it before we released it and verified that the bugs were fixed.”

“Is your testing automated?”

“No, testing is done with a developer observing the test output.”

“Do tests that verify bug fixes become part of the regression test suite?”

“No, we have a list of tests that we run for each release.”

“Does your team use a formal version-control system?”

“We make a backup before each release.”

The quality of this vendor's software was unquestionably good: the client had been using this kernel in his devices for years. However, when assessing the functional safety of a piece of software, I must always consider how the software was developed. This vendor's seemingly average development process could introduce a risk of dangerous failure into a safety-critical system, a risk that could have been avoided had the vendor's developers followed the IEC 61508 standard. This article introduces the IEC 61508 standard and explains why you might want to adopt it, even on non-safety-critical projects.

IEC 61508
The International Electrotechnical Commission (IEC) is an international standards body that prepares and publishes standards for all electrical, electronic, and related technologies. IEC 61508 is the standard governing the functional safety of programmable electronic systems. This standard is well established in the industrial process-control and automation industry and is finding a foothold in automotive, heavy machinery, mining, and other fields where safety and reliability are paramount.

Meeting the requirements of IEC 61508 for software development involves a systematic development process, emphasizing requirements traceability, criticality analysis, and validation. These techniques are not new to embedded software developers. They're considered and debated and then often dismissed when cost and deadline considerations come into play. When a software failure could mean the loss of life, however, it's critical to strictly follow a standard such as IEC 61508 that eliminates the possibility of corner-cutting. Even when developing a non-safety-related system, IEC 61508 is an excellent framework for a quality-focused development process.

Software safety lifecycle
Compliance with the IEC 61508 standard is not a single event. Like ISO 9000, this standard requires that developers uphold an ongoing commitment to the users of the product to ensure the continued safe operation of derivative systems. Development of a component of a safety system (such as software) requires a commitment not only to stringent development methods, but also to a thorough and careful approach to maintenance.


Figure 1: Safety lifecycle


Safety systems conforming to IEC 61508 are expected to follow a safety lifecycle similar to the one depicted in Figure 1. Also, developers are expected to follow a similar lifecycle when using or designing the components of a safety system, beginning with a risk analysis to determine the Safety Integrity Level (SIL) required. SIL is a quantification of the magnitude of risk reduction as indicated in Table 1.

Table 1: Risk reduction required for safety integrity levels

Safety integrity level (SIL)    Risk reduction factor required
4                               > 10,000
3                               > 1,000
2                               > 100
1                               > 10
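To put the table in concrete terms (with purely illustrative numbers): if a hazard analysis concludes that a protection function must reduce the frequency of a dangerous event by a factor of 500, that target falls in the SIL2 band, since it requires a risk reduction of more than 100 but not more than 1,000.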

Unlike hardware, software doesn't have well-understood general failure-rate analysis mechanisms to determine the SIL for a system. The IEC 61508 standard recognizes this and instead requires different levels of engineering design and practice to ensure the quality of the software in the system. For example, all four SILs require that the developers test modules that have been modified for a maintenance release. In contrast, comprehensive validation of the entire application after a localized change is only required at SIL3 and higher. Most of the recommendations in this discussion apply to all system integrity levels; however, some of the more stringent requirements are only necessary at SIL2 or higher.


Figure 2: Evolutionary V-model


For software systems, the standard suggests following a V-model development process. The evolutionary V-model shown in Figure 2 depicts the necessary connection between requirements and validation throughout the entire development process. This version of the V-model recognizes that the development process is not linear and that several iterations of design and implementation may be necessary while end users and developers refine the requirements.

Variations of this model are commonly used for non-safety-critical applications; the key difference here is the addition of the safety requirements. These requirements and their supporting implementation will receive the closest scrutiny during the assessment process.

Safety requirements
There are two types of safety requirements in IEC 61508: safety function requirements and safety integrity requirements. The safety function requirements govern the input/output sequences that perform the safety-critical operation. For example, a boiler could have a pressure sensor (input) whose reading is compared against a maximum value (algorithm) before the gas to the burner is shut off (output).

The safety integrity requirements of a system are composed of diagnostics and other fail-safe mechanisms used to ensure that failures of the system are detected and that the system goes to a safe state if it's unable to perform a safety function. Examples of integrity elements in the boiler would be a current-range diagnostic on the pressure sensor or a watchdog timer. If either of these elements detected a failure, it could force the system to a safe state.
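To make the boiler example concrete, here's a minimal sketch in C of how the safety function and the integrity elements might fit together. Every name, limit, and hardware helper in it (read_pressure_raw(), gas_valve_close(), and so on) is invented for illustration; none of this comes from the standard or from any real product.

/* Hypothetical boiler safety task: safety function plus integrity checks. */
#include <stdint.h>

#define PRESSURE_MAX_KPA  800u    /* assumed trip point for the safety function */
#define LOOP_4MA_COUNTS   400u    /* assumed ADC counts at 4 mA (healthy low end) */
#define LOOP_20MA_COUNTS  2000u   /* assumed ADC counts at 20 mA (healthy high end) */

extern uint16_t read_pressure_raw(void);        /* raw counts from the 4-20 mA loop */
extern uint32_t counts_to_kpa(uint16_t counts); /* scaling assumed linear */
extern void gas_valve_close(void);              /* de-energize to trip: the safe state */
extern void watchdog_kick(void);

void boiler_safety_task(void)
{
    uint16_t counts = read_pressure_raw();

    /* Integrity requirement: an out-of-range loop current means the sensor
       or its wiring has failed, so fail safe rather than trust the reading. */
    if (counts < LOOP_4MA_COUNTS || counts > LOOP_20MA_COUNTS) {
        gas_valve_close();
        return;                   /* don't kick the watchdog on a fault */
    }

    /* Safety function requirement: input -> algorithm -> output. */
    if (counts_to_kpa(counts) > PRESSURE_MAX_KPA) {
        gas_valve_close();
        return;
    }

    watchdog_kick();              /* integrity: prove the task is still alive */
}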

As with non-safety-critical development, the design then proceeds from the requirements, for both the safety-critical and non-safety-critical components. This raises the question of the criticality of each software module: if a part of my software system is not safety critical, do I need to have all the same protections and safeguards?

The IEC 61508 standard says no; it recognizes that not every element of a system has the same effect on safe operation and therefore allows some modules to be justified as independent. For this reason, you, the developer, must employ a modular design method and define clear interfaces and protection mechanisms between those modules, so that you can definitively classify subsystems into critical and noncritical categories. You can do this by using hardware memory protection through a memory-management unit or by using a language that enforces such encapsulation (Ada, Java, Modula, and so forth). The strength of the protection required depends on the SIL you've selected.
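As a sketch of the language-level option, here's what encapsulation can look like even in C, a weaker stand-in for an MMU or for the stronger guarantees of Ada or Java; all identifiers are hypothetical. The critical state is file-static, so lower-criticality modules can't name it and must go through the published interface.

/* trip_logic.h -- the only interface lower-criticality code may use */
#include <stdbool.h>
#include <stdint.h>

void trip_logic_update(uint32_t pressure_kpa);
bool trip_logic_is_tripped(void);

/* trip_logic.c -- the critical state is file-static, so no other
   translation unit can reference it by name; only a stray pointer
   write could reach it, which is why the standard also calls for
   pointer analysis or hardware memory protection. */
static bool tripped;

void trip_logic_update(uint32_t pressure_kpa)
{
    if (pressure_kpa > 800u) {    /* assumed trip limit */
        tripped = true;           /* latches until a controlled reset */
    }
}

bool trip_logic_is_tripped(void)
{
    return tripped;
}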

Note that IEC 61508 allows the entire safety system to be decomposed in this fashion, including peripheral hardware, communication paths, computing hardware, and software. Each element can be decomposed and assigned an SIL; together, these determine the SIL of the entire system.

Criticality analysis
One method for determining the SIL from the interaction among safety-critical, safety-related, and non-safety-critical components is called a criticality analysis. The same method can be applied to individual software components and to the system as a whole.

Each module is classified into one of four criticality levels:

  • C3 — Safety Critical: a module where a single deviation from the specification may cause the system to fail dangerously
  • C2 — Safety Related: a module where a single deviation from the specification cannot cause the system to fail dangerously, but in combination with the failure of a second module could cause a dangerous fault
  • C1 — Interference Free: a module that is not safety critical or safety related, but has interfaces with such modules
  • C0 — Not Safety Related: a module that has no interfaces to safety-related or safety-critical modules

At the simplest level, any module that's directly used in implementing a safety function requirement would be C3, and any module implementing a safety-integrity requirement would be C2. Further classification of criticality and SIL design constraints would have to come from a detailed analysis of the design, using techniques such as system fault trees or a software-hazard analysis.

Once you determine the criticality of a module, you can derive the required SIL for that module from Table 2.

Table 2: Relation of SIL, criticality, and required software safety integrity

SIL of the safety function      Criticality of the subsystem or unit
or safety-related system        C0                C1                               C2                 C3

SIL1                            No requirements   No requirements                  SIL1 recommended   SIL1 required
SIL2                            No requirements   No requirements if RIF* is met   SIL1 required      SIL2 required
SIL3                            No requirements   No requirements if RIF* is met   SIL2 required      SIL3 required

*Requirement on Interference-Freeness (RIF): Units of lower safety criticality shall not interfere with units of higher safety criticality. This can be achieved by:

  1. Running software components on separate hardware;
  2. Running software components in separate, hardware-protected memory segments, communicating by message passing only, and taking other measures to protect higher-criticality data and resources online against erroneous write access;
  3. Encapsulating software components with support from the programming language (Modula, Ada, Java, or C#, for example), demonstrated by the linker cross-reference (Xref) listings;
  4. Analyzing pointer use in the lower-criticality components.

The main advantage of this method is that by identifying modules that can be developed at a reduced SIL, you can reduce development overhead and documentation for noncritical elements of the system. This is also a factor when selecting commercial off-the-shelf products for inclusion in any system. For more information about criticality analysis, see the white paper by Rainer Faller.1

Beware of tools
It's easy to overlook the effect that software tools and libraries have in creating safety-critical parts of the design, making them safety related and therefore likely to require an SIL no more than one level below that of the critical elements they support. It's also possible that third-party software is itself part of your safety function, in which case it must meet the requirements of the selected SIL. The most common elements you need to consider are the compiler, language libraries, and any operating-system kernel and services you use.

Tool vendors can take one of two approaches to establish their product's suitability for safety applications. First, they can have the product itself certified to the desired SIL. This is the preferred method because it clearly establishes to what level the software can be used and simplifies the compliance argument for the end user.

If a formal assessment of a tool's compliance is not available, then a tool's SIL can be based on the tool's track record in production and its standards compliance. A vendor can establish that a tool is “proven in use” by demonstrating that the tool has been used in different products over an extended period (millions of cumulative hours of operation, for example) and by recording the defects reported during that time. The applications counted must either be very similar to the new application (stressing the tool in a similar manner) or span a diverse set of uses (establishing broad test coverage).
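As a rough, purely hypothetical illustration of the scale involved: 2,000 fielded devices running around the clock for five years accumulate 2,000 × 5 × 8,760 ≈ 88 million unit-hours of operation, the sort of evidence base a “proven in use” argument draws on.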

The vendor must also have a quality process that covers development and maintenance that meets a subset of the IEC 61508 requirements. These metrics can be consolidated into a report that you can employ during the assessment process to justify using the tool.

Developers can also take responsibility for assessing the tool by establishing that the tool has been proven in use in existing non-safety products that are similar in function to the developers' application. If you go down this path, you must manage tool updates within the established safety process for the product you're developing. While you can update tools without access to the tool's source code, it's much easier to analyze and justify the effect of changes when these changes can be localized.

The compiler is the piece of the system with the most widespread impact—the compiler touches every piece of code in the system. As a result, the compiler is also the most easily and frequently tested piece of software. It's critical, therefore, to establish that an uncertified compiler is proven in use and complies with the IEC 61508 standard. More easily overlooked are the run-time and utility libraries; since not all libraries are used in all applications, you'll need to pay special attention to library calls used in safety-critical and safety-related code.

An operating-system kernel and its application programming interface often play the most important role in a safety-critical system. Process and stack management, scheduling and flow control, and memory protection all have repercussions on the safety function and can be key elements of meeting safety-integrity requirements.

To a lesser extent, you must also consider how other tools you use in software development can affect safety, such as make and build tools, version-control software, and debuggers.

Using existing code
It's not unusual for an existing product to change from a non-safety-critical application to a safety-critical one. Even when developing a new safety-critical product, developers often want to reuse established proprietary libraries and tools, bringing preexisting code into the system.

When using preexisting code in a safety-critical project, you should follow the same methods used for commercial off-the-shelf software: verify that the code is proven in use. Also, you must follow an approved IEC 61508 development and modification process before modifying the software in any way.

Use of this existing code may also require applying or creating new safety-integrity requirements to monitor and verify the integrity of the software's functions. You'll need to add tests to your validation plan to ensure that the software meets the product and safety requirements.

Implementation
The recurring themes in the IEC 61508 process requirements are reproducibility, determinism, and verification. These themes are directly addressed in the IEC 61508 development process by mandating version control, coding standards, language requirements, and a thorough review process.

Formal version control is required not only for the project source code, but for all tools, build scripts, process documents, and outputs. The IEC 61508 standard mandates that a version-control tool must maintain an annotated change history and recommends that the tool provide controlled access to limit check-ins to approved developers. It's essential that the developers can exactly reproduce previous versions of the product, so while the version-control system should support this, the standard also recommends that the developer include a snapshot backup of the entire development environment as part of the release process.

While many development projects have a coding standard or style guide, IEC 61508 has some specific requirements for what must be included in the coding standard. The standard requires some basic style conventions, such as a standard “tombstone” header for source files and a certain level of readability. Additionally, the standard bans any programming-language features that are unspecified or incompletely specified. For languages such as C or C++ (which do have such features), developers must use an acceptable subset of the language such as MISRA C or NRC SafeC. (For more information, see the white paper by John Grebe.3) Furthermore, safe programming practices, such as avoiding pointers and global variables, must be included in the coding standard.
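As one illustration of why such subsets exist, consider the following C fragment, which depends on an evaluation order the language leaves unspecified; a MISRA-style rule would flag it. The example is my own, not drawn from any particular coding standard.

#include <stdio.h>

static int counter;
static int next(void) { return ++counter; }

int main(void)
{
    /* Non-compliant: the order in which function arguments are
       evaluated is unspecified in C, so this may print "1 2" or
       "2 1" depending on the compiler. */
    printf("%d %d\n", next(), next());

    /* Compliant rewrite: sequence the side effects explicitly. */
    int first = next();
    int second = next();
    printf("%d %d\n", first, second);
    return 0;
}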

To assist in verification, developers should use automated tools such as lint and strict compiler flags to enforce compliance with the coding standard. Beyond that, it's necessary to have a formal and documented review process for source code and its supporting documentation. The review should verify that the developers have followed the coding standard and have met the relevant interface requirements. The reviewer should also verify that the software module has a scope and interface suitable for controlled testing.

Testing
The second half of the V-model focuses on verification and validation—in other words, testing at each stage of development. In addition to straightforward requirements validation, developers must perform fault-injection testing at each level. Fault injection goes beyond stress testing: it deliberately disrupts the system to induce a fault and verifies that the safety-integrity requirements are met.

At the lowest level (unit or module testing), it's important to verify that the module responds appropriately to boundary values and out-of-range values. At this stage, it's especially important to verify that unexpected or uncommon execution paths are handled correctly. For example, verify that your CRC routine can indeed detect multiple-bit faults and out-of-order sequence faults.
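A unit-level fault-injection test for the CRC example might look like the following sketch, where crc16_ccitt() stands in for whatever checksum routine the module under test actually provides; the frame contents and fault patterns are arbitrary.

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical routine under test; substitute the project's real CRC. */
extern uint16_t crc16_ccitt(const uint8_t *data, size_t len);

/* Inject a multiple-bit fault and verify the CRC detects it. */
static void test_crc_detects_multibit_fault(void)
{
    uint8_t frame[16];
    memset(frame, 0xA5, sizeof frame);   /* arbitrary payload */
    uint16_t good = crc16_ccitt(frame, sizeof frame);

    frame[2] ^= 0x42;                    /* flip two bits in one byte */
    frame[9] ^= 0x01;                    /* and one bit in another */

    assert(crc16_ccitt(frame, sizeof frame) != good);
}

/* Inject an out-of-order sequence fault: same bytes, wrong order. */
static void test_crc_detects_reordered_blocks(void)
{
    uint8_t in_order[8]  = {1, 2, 3, 4, 5, 6, 7, 8};
    uint8_t reordered[8] = {5, 6, 7, 8, 1, 2, 3, 4};

    assert(crc16_ccitt(in_order, sizeof in_order)
           != crc16_ccitt(reordered, sizeof reordered));
}

int main(void)
{
    test_crc_detects_multibit_fault();
    test_crc_detects_reordered_blocks();
    return 0;
}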

As you integrate software (and hardware) modules, you must conduct further testing to verify their interaction and show that interfaces are operating as specified. Fault injections at this level might verify that one module responds properly when a second module rejects its input. This is also the stage at which you should validate diagnostics and monitoring software.

At the highest level, the system-validation plan must completely test each safety function and safety-integrity requirement. System-level validation should take place in as “life-like” an environment as possible. Here, fault injection can come in the form of communications faults or jabber, input loading, power, and environmental faults.

The impact of change
Once the product has completed validation, you must carefully monitor and control modifications to the system. Requests for enhancements, bug fixes, or functional changes for future releases must be recorded, reviewed, and approved, and each change must be assessed for any potential impact on the safety requirements.

Once you approve the change, you must complete a formal impact analysis. This analysis begins with a description of the existing module and the root cause of the bug in question. You then document the impact of the current (buggy) behavior on the safety requirements and its possible consequences. This step is necessary because you're responsible for documenting any safety issues and reporting them directly to the end users of the product.

You then describe the proposed change and document the safety impact and its possible consequences. When assessing the impact, it's important that you evaluate not only the primary function of the change, but the consequences it may produce on the interfaces, shared data, or flow of execution of a safety-critical module. You can then use the impact analysis to suggest required testing and as a basis for deciding whether or not to include the change in a release. You must also review the list of all outstanding issues again during the release process to justify the exclusion of safety-related bug fixes.

Stay safe
The safety software development process required by IEC 61508 demands a high level of traceability and verification, beyond that of many development environments. While it's probably unrealistic to use such a process for “everyday” development projects, IEC 61508 does introduce some novel concepts to software development, such as criticality analysis and impact analysis.

Jeff Payne is a senior safety engineer with exida.com. He has six years of embedded-systems development experience, including work on safety-critical and high-availability applications. He has a BS in electrical engineering and a BS in computer engineering from Pennsylvania State University. You can contact him at .

References:

  1. Faller, Rainer. “Tailoring of IEC 61508 Requirements using System & Software Criticality Analysis, V1R2,” German national proposal for IEC 61508 maintenance, 2002.
  2. International Electrotechnical Commission, “IEC 61508:2000, Parts 1-7, Functional Safety of Electrical/ Electronic/ Programmable Electronic Safety-Related Systems,” 2000.
  3. Grebe, John. “C/C++ Coding Standard Recommendations for IEC 61508,” exida.com white paper, 2002, www.exida.com/brochures/615081.pdf.
