Self-testing in embedded systems: Software failure

All electronic systems carry the possibility of failure. An embedded system has intrinsic intelligence that facilitates the possibility of predicting failure and mitigating its effects. This two-part series reviews the options for self-testing that are open to the embedded software developer, along with testing algorithms for memory and some ideas for self-monitoring software in multi-tasking and multi-CPU systems. In part one, we looked at self-testing approaches to guard against hardware failure. Here in part two, we look at self-testing methods that address software malfunctions.

As was mentioned in the introduction to part one, the acceptance of possible failure is a key requirement for building robust systems. This is extremely relevant when considering the possibility of software failure. Even when great care has been taken with the design, testing and debugging of code, it is almost inevitable that undiscovered bugs lurk in all but the most trivial code. Predicting a failure mode is tough, as this requires knowledge of the nature of the bug that leads to the failure and, if that knowledge were available, the bug would have been expunged during development.

The best approach is to recognize that there are broadly two types of software malfunction: data corruption and code looping. Some defensive code can be implemented to detect these problems before too much damage is done.

Data corruption
Arguably the most powerful feature of the C language is also the most common cause of errors and faults: pointers. Data is most likely to become corrupted when it is written via a pointer, and the problem is that there is no easy way to detect an invalid pointer. If the pointer is NULL, a dereference results in a trap, so ensuring that a suitable trap handler is installed is a start. A similar trap can handle the situation where the pointer presents an invalid (non-existent) memory address. However, if the address is valid but incorrect, random errors may occur.
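
One modest defensive measure is to check that a pointer at least looks plausible before writing through it. The sketch below is illustrative only: the RAM boundaries and the helper names are invented for the example, and in a real system the limits would come from the target's memory map or linker script.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical RAM boundaries - in a real system these would come
   from the linker script or the device's memory map. */
#define RAM_START ((uintptr_t)0x20000000u)
#define RAM_END   ((uintptr_t)0x20010000u)

/* Return true only if the pointer is non-NULL and points into RAM. */
static bool pointer_is_plausible(const void *p, size_t size)
{
    uintptr_t addr = (uintptr_t)p;

    if (p == NULL)
        return false;

    return (addr >= RAM_START) && ((addr + size) <= RAM_END);
}

/* Defensive store: refuse to write through an implausible pointer. */
static bool safe_write_u32(uint32_t *p, uint32_t value)
{
    if (!pointer_is_plausible(p, sizeof *p))
        return false;   /* caller can log the fault or take other action */

    *p = value;
    return true;
}

A check like this cannot catch a pointer that is valid but simply wrong, as noted above, but it does convert many wild writes into a detectable failure.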

A memory management unit (MMU) provides some options to trap erroneous situations, as it gives software control over what memory is considered to be valid at any given time. The classic use of an MMU is with a process model operating system. In this context, the code of each task can only access the memory specifically allocated to it. Any attempt to access outside of this area causes an error.

There are two special cases where there is a chance to detect pointer errors: stack overflow/underflow and array bound violation.

Stack space allocation is something of a black art. Although there are static analysis tools around that can help, careful testing during development is wise. This may involve filling the stack with a “fingerprint” value and then looking at utilization after some period of code execution, or write-access breakpoints may be employed. Runtime checks for stack usage are often sensible. This simply requires the addition of “guard words” at either end of the allocated stack space. These are pre-loaded with a unique value, which can be recognized as being untouched. It is logical to use an odd number (as addresses are normally even) and to avoid common values like 0, 1 and 0xffffffff; the chance of a corrupting write happening to match the guard value, and so going undetected, is then about one in four billion. Like memory tests, the guard words can be checked from a background task or whenever the CPU has nothing better to do. Another possible way to monitor the guard words would be with an MMU that has fine-grain resolution, but such functionality is not common.
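
The following sketch shows the general idea for a single stack; the stack size, array name and guard value are arbitrary choices for illustration, and a multi-tasking system would apply the same treatment to each task's stack.

#include <stdint.h>
#include <stdbool.h>

#define STACK_WORDS 256               /* illustrative stack size         */
#define GUARD_VALUE 0xAA55A55Bu       /* odd, unlikely-to-occur value    */

static uint32_t task_stack[STACK_WORDS + 2];   /* one guard word at each end */

/* Call once at start-up: plant the guards and fingerprint the stack. */
void stack_init(void)
{
    int i;

    task_stack[0] = GUARD_VALUE;
    task_stack[STACK_WORDS + 1] = GUARD_VALUE;

    for (i = 1; i <= STACK_WORDS; i++)
        task_stack[i] = GUARD_VALUE;  /* fingerprint for utilization checks */
}

/* Call from a background task: false means an overflow/underflow occurred. */
bool stack_guards_intact(void)
{
    return (task_stack[0] == GUARD_VALUE) &&
           (task_stack[STACK_WORDS + 1] == GUARD_VALUE);
}

/* Worst-case usage so far: count fingerprint words still untouched.
   Assumes the stack grows downwards from the high end of the array. */
int stack_words_unused(void)
{
    int unused = 0;
    int i;

    for (i = 1; i <= STACK_WORDS; i++) {
        if (task_stack[i] != GUARD_VALUE)
            break;
        unused++;
    }
    return unused;
}

For simplicity, this sketch uses the same value for the guard words and the fingerprint fill; there is no reason why they could not differ.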

In some languages, access to arrays is carefully controlled so that accesses can only occur within their allocated memory space. One reason why there is normally no such checking in C is that it would introduce a runtime overhead every time an array element is accessed, which is likely to be unacceptable. Bounds checking could be implemented in C++ by overloading the [ ] (array index) operator. However, it would still be possible to make an erroneous access, because pointers can be used instead of the array index operator. Normal array element access like this:

arr[3] = 99;

can also be written thus:

*(arr+3) = 99;

However, the most common array access problem is accidentally iterating off the end, thus:

int arr[4];
int i;

for (i = 0; i <= 4; i++)    /* error: the final iteration writes arr[4], beyond the array */
    arr[i] = 0;

To detect this kind of error, guard words, like those used with a stack, may be placed around the array.
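
A sketch of that arrangement is shown below; the structure, names and guard value are purely illustrative, and the technique relies on the guards sitting immediately adjacent to the data in memory.

#include <stdint.h>
#include <stdbool.h>

#define GUARD_VALUE 0xAA55A55Bu

/* Guard words either side of the data; relies on the compiler not
   inserting padding between the members (true on most 32-bit targets). */
struct guarded_array {
    uint32_t guard_lo;
    int32_t  data[4];
    uint32_t guard_hi;
};

static struct guarded_array ga = { GUARD_VALUE, { 0 }, GUARD_VALUE };

/* Check from a background task; false indicates an out-of-bounds write. */
bool array_guards_intact(void)
{
    return (ga.guard_lo == GUARD_VALUE) && (ga.guard_hi == GUARD_VALUE);
}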


Code looping
Obviously, testing of code logic should eliminate all situations where the execution might get stuck in a loop. However, this is difficult to do exhaustively, particularly with embedded applications, as the termination of a loop may depend on some event that is external to the code under test. Ideally, careful design means that timeouts are included in any context where the code might get stuck, but this is not always possible and may be overlooked.
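
As a simple example of such a timeout, a polling loop that waits for an external event can be bounded so that a missing event cannot hang the code indefinitely; the status register address, ready bit and poll limit below are hypothetical.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical memory-mapped status register and ready bit. */
#define DEVICE_STATUS (*(volatile uint32_t *)0x40001000u)
#define READY_BIT     0x01u

#define POLL_LIMIT 100000u   /* illustrative bound, tuned to the hardware */

/* Wait for the device, but give up rather than loop forever. */
bool wait_for_device_ready(void)
{
    uint32_t polls = 0;

    while ((DEVICE_STATUS & READY_BIT) == 0u) {
        if (++polls >= POLL_LIMIT)
            return false;    /* timed out - report the failure upwards */
    }
    return true;
}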

The usual way to ensure that code is not stuck is to employ a watchdog timer. This is some simple hardware that needs to be addressed periodically to stop it from “biting”. If the preset delay is exceeded and the watchdog bites, an interrupt or reset will result. Watchdog refresh code needs to be included at strategic points in the execution flow.
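
A minimal sketch of the refresh (or “kick”) is shown below; the register address and key value are invented for illustration and would come from the datasheet of the actual watchdog hardware.

#include <stdint.h>

/* Hypothetical watchdog refresh register and key - consult the
   datasheet of the real device for the actual address and value. */
#define WDOG_REFRESH (*(volatile uint32_t *)0x40002000u)
#define WDOG_KEY     0x5A5Au

/* Called at strategic points in the main execution flow. */
static inline void watchdog_kick(void)
{
    WDOG_REFRESH = WDOG_KEY;
}

/* Example main loop: if any step hangs, the kicks stop and the
   watchdog bites, forcing an interrupt or reset. */
void main_loop(void)
{
    for (;;) {
        /* read_sensors();   */
        /* update_outputs(); */
        watchdog_kick();
    }
}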

In a multi-tasking system, a watchdog may be implemented in software. A specific task may be implemented with no other function except to monitor the wellbeing of other tasks. If another task fails to prove that it is executing correctly, the watchdog task takes action. A watchdog task may simply use a bank of event flags, all of which it sets to 1. Each other task is required to clear a specific flag down to 0 periodically. If the watchdog task observes an uncleared flag, action is taken.
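
A bare-bones version of the flag scheme, independent of any particular RTOS, might look like the sketch below; in a real system the flag word would be manipulated through the kernel's event-flag or atomic services, and take_action() is a placeholder for whatever recovery the application chooses.

#include <stdint.h>

#define NUM_TASKS 8u

/* One bit per monitored task; the watchdog task sets them all,
   and each monitored task clears its own bit to prove it is alive. */
static volatile uint32_t alive_flags;

/* Called periodically by each monitored task. */
void task_checkin(unsigned task_id)
{
    alive_flags &= ~(1u << task_id);
}

/* Placeholder recovery action: log, alert the user, or reset. */
static void take_action(unsigned task_id)
{
    (void)task_id;
}

/* Body of the watchdog task, run once per monitoring period. */
void watchdog_task_poll(void)
{
    unsigned id;

    for (id = 0; id < NUM_TASKS; id++) {
        if (alive_flags & (1u << id))
            take_action(id);            /* task failed to check in */
    }
    alive_flags = (1u << NUM_TASKS) - 1u;   /* re-arm all flags */
}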

What to do on detecting a failure
We have established that there are numerous ways in which embedded software might detect an error condition. Having done so, what next? That is a very good question and the answer, as is so often the case with embedded software, depends on the exact nature of the application.

For many deeply embedded applications, there is only one option: reset the CPU and, hence, reboot the software. This means that the device ceases functioning for a period of time, which may be inconvenient, but is better than it functioning incorrectly or unreliably. Imagine, for example, a heart pacemaker. Would you rather it missed a couple of beats or went berserk and adversely affected the heart rhythm?

Sometimes there is the opportunity to inform an operator or user. This would be ideal when an impending fault is detected. The user is given some time to finish up before initiating a reset or other remedial action.

What about undetected failures?
Regardless of how carefully an embedded system is designed, both hardware and software, and how much defensive self-testing code is included, there is always the possibility that the device will lock up. The last line of defense is the ability for the user to effect a reset. This can be implemented using existing controls on the device; a dedicated button is not strictly required. However, instructions like “Press button A three times, while holding down button B” are easily forgotten. A dedicated reset button – even if it needs a paperclip – is preferable.

Colin Walls has over thirty years' experience in the electronics industry, largely dedicated to embedded software. A frequent presenter at conferences and seminars and author of numerous technical articles and two books on embedded software, Colin is an embedded software technologist with Mentor Embedded [the Mentor Graphics Embedded Software Division] and is based in the UK. His regular blog is located at http://blogs.mentor.com/colinwalls. He may be reached by email at colin_walls@mentor.com.

5 thoughts on “Self-testing in embedded systems: Software failure”

  1. IMHO far too many people use the watchdog as a get-out-of-jail card and use it to “fix” bad code/system design. If I was designing a pacemaker, the first thing I would do is split it over two CPUs. One just does the heart rate control and the other…

  2. All good points @Cdhmanning. Maybe I did not make it clear enough that the use of a watchdog should be a last resort, not an excuse for sloppy code or bad system design.

  3. Certainly… A watchdog should only ever be triggered by an “act of god”: micro failure etc. If code ever causes a watchdog to go off, you have problems. Unfortunately, far too many people use the watchdog as a primary robustness mechanism.

  4. I think Colin made it clear that, sometimes, it is difficult to detect an error condition in any other way. In this case, the use of a watchdog is the difference between continuing wrong behaviour and a pause followed by the restoration of correct operation.

