Using certified operating systems effectively in safety critical embedded designs - Embedded.com

In today's world many potentially dangerous pieces of equipment are controlled by embedded software. This equipment includes cars, trains, airplanes, oil refineries, chemical processing plants, nuclear power plants and medical devices.

As embedded software becomes more pervasive, so, too, do the risks associated with it. As a result, the issue of software safety has become a very hot topic in recent years. The leading international standard in this area is IEC 61508: Functional safety of electrical/electronic/programmable electronic safety-related systems.

This standard is generic and not specific to any industry. It has already spun off a number of industry-specific derived standards, and it can be applied to any industry that does not have its own standard in place. Several industry-specific standards, such as EN 50128 (Railway), DO-178B (Aerospace), IEC 60880 (Nuclear) and IEC 601-1-4 (Medical Equipment), are already in place.

Debra Herrmann (Herrmann, 1999) has found a total of 19 standards related to software safety and reliability that cut across industrial sectors and technologies. These standards' popularity is on the rise, and more and more embedded products are being developed that conform to them.

Since an increasing number of embedded products also use an embedded real-time operating system (RTOS), it has become inevitable that products with an RTOS are being designed to conform to such standards. This creates an important question for designers: how is my RTOS going to affect my certification?

This article explores the challenges and advantages of using an RTOS in products that will undergo certification.

Making RTOSes Safer
Today, using a commercial RTOS is standard practice in many organizations. The benefits of using a commercial RTOS vs. rolling your own or going without one are many.

For starters, it will help you structure your project into encapsulated software parts (tasks and memory blocks) and save your development team the time and effort of creating such infrastructure. With today's tight budgets and schedules, it is preferable to have your team work on application code for your product rather than on the infrastructure that supports your application.

With a commercial RTOS, not only has this work already been done for you, but the code has presumably undergone tens of thousands of hours of actual field usage, making it much more mature and likely more reliable than any new code that would be written.

In addition, a commercial RTOS typically comes with a rich development toolset that will help you to debug and monitor your applications. Having tools that are aware of the operating system and how it works is a tremendous advantage when trying to resolve complex design and debug issues.

Beyond these reasons for using a commercial RTOS in any product, there are other advantages that a commercial RTOS can bring to safety applications. These include memory protection, safe communications and secure task scheduling.

These services allow application programmers to readily include safety measures in their application. Operating systems that have these features will save work and reduce risk for their users. If operating systems do not have these features available, then the application developer must build these features into their application.

The importance of protecting memory
Memory protection protects application and operating system data against hardware faults and illegal writes. Many operating systems will isolate the address spaces of different processes running on the OS. This protection, combined with a hardware memory management unit (MMU), will prevent one process from illegally overwriting the data of another.

This is a key feature that allows applications of different safety integrity levels to run on the same processor. Doing so has tremendous advantages when going through a certification.

First, it allows you to identify non-safety critical tasks that do not need to be developed with the same level of rigor in terms of process and on-line diagnostics. This can greatly simplify the effort involved in the certification.

Second, it allows you to use third-party software in non-safety critical areas without having to certify that software. Having to certify third-party software creates extra challenges if you do not have the source code, lifecycle documentation, and/or cooperation from the vendor.

Without this memory protection, you must either use other measures to assure this isolation, or consider all software to be safety critical and therefore subject to the requirements of the highest safety integrity level of your product. In essence, you must assume that any piece of software can corrupt the memory space of any other, so all of it must be treated as safety critical.

When using this type of memory protection, there is one concern to consider. If you are using a microcontroller that sends data between the core processor and on-chip I/O devices via direct memory access (DMA), then these transactions are not protected by the MMU.

If any of the I/O devices are safety critical, then either the non-safety critical applications must not access these devices, or additional protection measures will have to be added to the code.

Other types of memory protection available include cyclic redundancy checking (CRC) on static areas of data and code, duplicate storage, monitored heap management, and stack overflow detection.

CRC checking will detect hardware failures as well as illegal data writes. While an MMU may protect against illegal data writes from another process, illegal data writes within a process may still occur and should be protected against. For static data that does not change at all or changes very infrequently, a CRC should be used if the data is safety critical.
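As a rough illustration of the CRC idea, the sketch below stores a CRC-32 alongside a block of static data and recomputes it before the data is trusted. The structure and function names (`protected_block_t`, `block_seal`, `block_is_valid`) and the 16-byte payload are assumptions made for this example, not part of any particular RTOS. The bitwise CRC-32 shown is the common reflected 0xEDB88320 variant; a real product would pick the polynomial and check interval from its own safety analysis.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 using the reflected polynomial 0xEDB88320. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Hypothetical block of rarely changing safety critical data. */
typedef struct {
    uint8_t  payload[16];
    uint32_t crc;          /* CRC of payload, stored whenever payload is written */
} protected_block_t;

/* Call after every legitimate update of the payload. */
void block_seal(protected_block_t *b)
{
    b->crc = crc32_compute(b->payload, sizeof b->payload);
}

/* Call before using the payload, or periodically from a diagnostic task. */
bool block_is_valid(const protected_block_t *b)
{
    return crc32_compute(b->payload, sizeof b->payload) == b->crc;
}
```

If `block_is_valid` returns false, the application would normally move to its safe state rather than continue with the suspect data.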

Another alternative for protection of safety critical data is duplicate storage with comparisons every time the data is accessed. If duplicate storage is used, one copy should be inverted in order to detect specific failures of bits being stuck at one or zero.
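A minimal sketch of duplicate storage with an inverted copy might look like the following. The type and function names are hypothetical. In a real design the two copies would typically be placed in different memory regions so that a single fault is unlikely to hit both.

```c
#include <stdbool.h>
#include <stdint.h>

/* A safety critical value kept twice: once as-is, once bit-inverted. */
typedef struct {
    uint32_t value;      /* primary copy */
    uint32_t inverted;   /* bitwise complement of the primary copy */
} dup_u32_t;

/* Every write updates both copies. */
void dup_write(dup_u32_t *d, uint32_t v)
{
    d->value = v;
    d->inverted = ~v;
}

/* Every read compares the copies first. Returns true and stores the
 * value in *out only if each bit differs between the two copies. */
bool dup_read(const dup_u32_t *d, uint32_t *out)
{
    if ((d->value ^ d->inverted) != 0xFFFFFFFFu)
        return false;
    *out = d->value;
    return true;
}
```

Because the second copy is inverted, a fault that forces the same bit position of both copies to the same level (both stuck at one, or both stuck at zero) makes the comparison fail instead of going unnoticed.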

Monitored heap management will ensure that enough memory is available for dynamic memory allocation and can include some checks for corruption. Stack overflow detection will detect the situation where the stack has run out of space and has started overwriting other memory areas.
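One common way to approximate stack overflow detection in software is to fill each task stack with a known pattern and check the words at its far end. The sketch below assumes a descending stack held in a plain array, so index 0 is the last word reached before an overflow. The fill pattern, the number of guard words and the function names are all choices made for this example. Note that a check like this only notices an overflow after the guard words have been written, and a task that happens to store the pattern value itself can mask it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL  0xA5A5A5A5u   /* assumed fill pattern */
#define GUARD_WORDS 4u            /* assumed size of the guard zone */

/* Fill the whole task stack with the pattern before the task starts. */
void stack_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++)
        stack[i] = STACK_FILL;
}

/* Returns true while the guard words at the low end are untouched. */
bool stack_guard_intact(const uint32_t *stack, size_t words)
{
    size_t n = (words < GUARD_WORDS) ? words : GUARD_WORDS;
    for (size_t i = 0; i < n; i++) {
        if (stack[i] != STACK_FILL)
            return false;
    }
    return true;
}

/* Counts the words at the low end that still hold the fill pattern,
 * giving an estimate of how much stack has never been used. */
size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t i = 0;
    while (i < words && stack[i] == STACK_FILL)
        i++;
    return i;
}
```

A monitoring task could call `stack_guard_intact` for each task stack on every cycle and use `stack_unused_words` during testing to size the stacks with some margin.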

Communication of safety critical data is another area where a commercial RTOS can make things easier. This includes both inter-processor communication and intra-process communication. Inter-processor communication becomes more critical when the system and hardware designers decide to partition the system across multiple processors.

This is a trend that can be seen even in smaller electronic control units such as smart transmitters as the price for a single processor drops below the price of specific ICs. Both of these types of communication can be risky when you have tasks of varying priorities that can interrupt each other.

If you are not extremely careful with the implementation, it becomes possible to send inconsistent data because the sending task was interrupted before it had gathered a complete set.

These types of problems are often very difficult to find and fix because they occur only rarely and it is not easy to capture data when they do. However, when they do occur, the system response could be an unsafe undetected failure. If the underlying operating system uses queue-based messaging between specific tasks, with protected sender/receiver information and message body, then this problem can be solved at the operating system level.

Similarly, if the OS uses shared memory areas with controlled read/write access and protected ownership information and content, the OS can take care of this issue for you as well. Otherwise, you would have to include similar measures in your application code, which could create quite a bit of extra work.
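If such measures have to be built in application code, a message with protected sender/receiver information and body could be sketched roughly as follows. The message layout, the field sizes and the names `safe_msg_t`, `msg_build` and `msg_accept` are assumptions for illustration and do not describe any specific RTOS API. A sequence counter is added so that lost or repeated messages can be noticed, and a CRC-16 (polynomial 0x1021, initial value 0xFFFF) covers the header and body together.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MSG_BODY_MAX 8u

typedef struct {
    uint8_t  sender;              /* ID of the sending task */
    uint8_t  receiver;            /* ID of the intended receiving task */
    uint8_t  sequence;            /* incremented by the sender per message */
    uint8_t  length;              /* number of valid bytes in body */
    uint8_t  body[MSG_BODY_MAX];
    uint16_t crc;                 /* CRC over every field above */
} safe_msg_t;

/* CRC-16 with polynomial 0x1021, initial value 0xFFFF, no reflection. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* CRC over all bytes that precede the crc field. */
static uint16_t msg_crc(const safe_msg_t *m)
{
    return crc16_ccitt((const uint8_t *)m, offsetof(safe_msg_t, crc));
}

/* Sender side: fill in the header and body, then seal with the CRC. */
bool msg_build(safe_msg_t *m, uint8_t sender, uint8_t receiver,
               uint8_t sequence, const uint8_t *body, uint8_t length)
{
    if (length > MSG_BODY_MAX)
        return false;
    memset(m, 0, sizeof *m);      /* unused body bytes are zero */
    m->sender = sender;
    m->receiver = receiver;
    m->sequence = sequence;
    m->length = length;
    if (length > 0u)
        memcpy(m->body, body, length);
    m->crc = msg_crc(m);
    return true;
}

/* Receiver side: accept only an intact message from the expected sender,
 * addressed to this task, carrying the expected sequence number. */
bool msg_accept(const safe_msg_t *m, uint8_t self,
                uint8_t expected_sender, uint8_t expected_sequence)
{
    return m->length <= MSG_BODY_MAX
        && m->receiver == self
        && m->sender == expected_sender
        && m->sequence == expected_sequence
        && m->crc == msg_crc(m);
}
```

Because the CRC is computed only after the whole message has been assembled, a message that was copied or modified halfway through is rejected by the receiver instead of being acted on.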

A secure task scheduling and monitoring mechanism that ensures that safety critical tasks run when needed is another advantage that can be offered by a commercial RTOS. This can be done using deterministic scheduling, logical flow control monitoring as per IEC 61508-2, or a time fence, which will terminate the execution of a task if it overruns its allotted execution time or deadline.

These methods, when combined with a windowed external hardware watchdog, are highly effective in assuring that critical functions run at the rate at which they are required.
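To give a feel for how logical flow monitoring and a windowed watchdog can work together, the sketch below has the safety task report numbered checkpoints during each cycle. At the end of the cycle the watchdog would be serviced only if every checkpoint was reached in order and the cycle ended inside the allowed time window. The number of checkpoints, the tick values and all names are assumptions made for this example. The hardware-specific watchdog access is left out and is represented here by the return value of `cycle_ok`.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CHECKPOINTS 3u   /* assumed number of checkpoints per cycle */

typedef struct {
    uint32_t next_expected;  /* checkpoint that must be reported next */
    bool     fault;          /* latched once any check has failed */
} flow_monitor_t;

void flow_reset(flow_monitor_t *m)
{
    m->next_expected = 0u;
    m->fault = false;
}

/* Called by the task as it passes each checkpoint, in order 0, 1, 2. */
void flow_checkpoint(flow_monitor_t *m, uint32_t id)
{
    if (id != m->next_expected)
        m->fault = true;
    else
        m->next_expected++;
}

/* Called at the end of each cycle. elapsed is the time since the
 * watchdog was last serviced. Returns true only if the watchdog may
 * be serviced now; any failure stays latched until flow_reset. */
bool cycle_ok(flow_monitor_t *m, uint32_t elapsed,
              uint32_t window_min, uint32_t window_max)
{
    bool ok = !m->fault
           && m->next_expected == NUM_CHECKPOINTS
           && elapsed >= window_min
           && elapsed <= window_max;
    if (ok)
        m->next_expected = 0u;   /* start the next cycle */
    else
        m->fault = true;
    return ok;
}
```

When `cycle_ok` returns false the watchdog is simply not serviced, so the external hardware brings the system to its safe state. A cycle that finishes too early is rejected just like one that finishes too late, which is the point of using a windowed watchdog.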

IEC 61508 Certification
Currently the IEC 61508 standard does not make any reference to RTOS software or COTS (commercial off-the-shelf) software. This will change in the upcoming second edition of the standard, which will explicitly require that COTS software meet the same requirements as newly developed software.

The current standard essentially implies this already, since it makes no exception for such software, but now it will be specifically called out. Therefore, when certifying your product, you must treat the OS just like any of your own components.

Software certification consists of several different phases. First, the development process used to create the software is analyzed. Then the software design is analyzed to determine potential failure modes and the measures implemented in the software. Herein lies the major challenge of using a commercial operating system. Since the development process and safety measures of the operating system are out of your control, how can you possibly hope to get this product certified? Fortunately, there are several options here.

The simplest option is to use a certified operating system. There are real-time operating systems on the market that have been certified to IEC 61508. Choosing one of these operating systems can take a lot of headaches out of the process. The second option is to use a non-certified operating system and include it as a component in your certification process. This option is more difficult, but it can be done and has been done many times.

Certified Operating Systems
The major advantage of using a certified operating system is the reduction of risk, cost, and time to market. Doing so eliminates the risk that the operating system component cannot be certified without changes that may be outside of your control.

It gets rid of the cost and time involved in certifying the operating system portion of your design. It eliminates the cost and time involved in creating additional measures in your application code to avoid potential faults in the operating system. And it rules out the need to gather proven-in-use data on the operating system.

Another advantage of using a certified RTOS is that it comes with a safety manual, which provides guidance on how to use the operating system safely. This will include information about which features and functions can and can't be used safely, as well as any procedures that must be put in place to ensure safety. Also, a certified RTOS will provide some of the features, such as memory protection, that make it easier to design safety into your application.

These reasons make it quite attractive to use a certified operating system in your device if at all possible. Of course, there are cases when this is just not practical. A common example is the case where you have an already existing product that you are trying to certify. This product may use a non-certified operating system, and the effort to switch operating systems could be quite large. In addition, doing so could disrupt the reliability of a product that has many hours of field-proven runtime to its credit. In this case, switching to a new operating system may add more risk and cost than it saves.

Non-certified Operating Systems
If switching operating systems is not an option, then you must follow the path of implementing measures in your application code to ensure the safety of the operating system. The best way to go about this is to do a software hazard analysis.

This process essentially consists of analyzing all of the components of the operating system to determine what failure modes are possible. For each failure mode found, a measure must exist in the product to ensure that the failure is safe.

A safe failure is defined as one where the outputs can be placed in the state that shuts down the process, which is normally the de-energized state. A hazard analysis is done by going through the attributes of each component one by one and applying guidewords to determine possible deviations.

Possible causes and consequences are then analyzed and possible safety measures are considered. Figure 1 below shows an example hazard analysis for one attribute of an operating system. Note that the purpose of this example is just to give you a feel for how the hazard analysis is done, and it should not be considered complete or correct.

Figure 1. Example Hazard Analysis

The result of the hazard analysis will be a list of safety measures such as those shown in Column 5 of Figure 1 above. These safety measures may be items that are already implemented in your application or they may be new measures that you must add to your application.

The disadvantage of having to do this analysis on the operating system is obvious: it could be a lot of work, especially if it is discovered that many safety measures must be implemented. However, it is not as bad as it might seem at first.

Many of the safety measures that you would need to implement for the operating system would also be beneficial to your application and would end up being required anyway once you performed the hazard analysis on your own application.

In conclusion, there are many well-known advantages to using a commercial RTOS in your product, and as a result their use is quite widespread today. Using an RTOS in a safety critical application brings some significant advantages as well as challenges.

A good operating system will actually have features that make it much easier to implement safety functions and can be a big help in reducing the total amount of work required. However, using a third-party operating system introduces an area of risk that may be out of your control.

Fortunately, there are certified operating systems on the market that mitigate most of this risk and there are accepted methods available for including a non-certified operating system if necessary.

Michael Medoff is a senior safety engineer with Exida.com. He has over 18 years of software development experience, including work on embedded safety-critical and high-availability applications.

References
Herrmann, Debra S., Software Safety and Reliability, IEEE Computer Society Press, Los Alamitos, CA, 1999.
