All electronic systems carry the possibility of failure. An embedded system has intrinsic intelligence that makes it possible to predict failure and mitigate its effects. This two-part series reviews the options for self-testing that are open to the embedded software developer, along with testing algorithms for memory and some ideas for self-monitoring software in multi-tasking and multi-CPU systems. In part one, we looked at self-testing approaches to guard against hardware failure. Here in part two, we look at self-testing methods that address software malfunctions.
As was mentioned in the introduction to part one, the acceptance of possible failure is a key requirement for building robust systems. This is extremely relevant when considering the possibility of software failure. Even when great care has been taken with the design, testing and debugging of code, it is almost inevitable that undiscovered bugs lurk in all but the most trivial code. Predicting a failure mode is tough, as this requires knowledge of the nature of the bug that leads to the failure and, if that knowledge were available, the bug would have been expunged during development.
The best approach is to recognize that there are broadly two types of software malfunction: data corruption and code looping. Some defensive code can be implemented to detect these problems before too much damage is done.
Arguably the most powerful feature of the C language is also the most common cause of errors and faults: pointers. Data is most likely to become corrupted if it is written via a pointer. The problem is that there is no easy way to detect an invalid pointer. If the pointer is NULL, a dereference results in a trap on many systems, so ensuring that a suitable trap handler is installed is a start. (On devices where address 0 is ordinary accessible memory, no trap occurs.) A similar trap can handle the situation where an invalid (non-existent) memory address is presented by a pointer. However, if the address is valid, but incorrect, random errors may occur.
A memory management unit (MMU) provides some options to trap erroneous situations, as it gives software control over what memory is considered to be valid at any given time. The classic use of an MMU is with a process model operating system. In this context, the code of each task can only access the memory specifically allocated to it. Any attempt to access outside of this area causes an error.
There are two special cases where there is a chance to detect pointer errors: stack overflow/underflow and array bound violation.
Stack space allocation is something of a black art. Although there are static analysis tools around that can help, careful testing during development is wise. This may involve filling the stack with a “fingerprint” value, and then looking at utilization after some period of code execution, or write access breakpoints may be employed. Runtime checks for stack usage are often sensible. This simply requires the addition of “guard words” at either end of the allocated stack space. These are pre-loaded with a unique value, which can be recognized as being untouched. It is logical to use an odd number (as addresses are normally even) and avoid common values like 0, 1 and 0xffffffff. With a 32-bit guard word, there is then only about a 1 in 4 billion chance that a stray write happens to store the guard value and so goes unnoticed. Like memory tests, the guard words can be checked from a background task or whenever the CPU has nothing better to do. Another possible way to monitor the guard words would be with an MMU that has a fine grain resolution, but such functionality is not common.
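The guard word idea can be sketched in a few lines of C. This is a minimal illustration, not a definitive implementation: the stack here is an ordinary array standing in for a real task stack, and the names STACK_GUARD, STACK_WORDS, task_stack, stack_guard_init and stack_guard_ok are all hypothetical. On a real target the stack location would come from the linker script or the RTOS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guard value: odd, and not a common value such as
   0, 1 or 0xffffffff. */
#define STACK_GUARD 0x5AC3E1B7u

/* Usable stack size in 32-bit words; illustrative only. */
#define STACK_WORDS 64u

/* Simulated task stack with one extra word at each end for the guards. */
static uint32_t task_stack[STACK_WORDS + 2u];

/* Pre-load both guard words before the task starts running. */
void stack_guard_init(void)
{
    task_stack[0] = STACK_GUARD;
    task_stack[STACK_WORDS + 1u] = STACK_GUARD;
}

/* Return true if both guard words still hold the expected value.
   Intended to be called from a background task or idle loop. */
bool stack_guard_ok(void)
{
    return task_stack[0] == STACK_GUARD &&
           task_stack[STACK_WORDS + 1u] == STACK_GUARD;
}
```

A check like this only reports damage after it has happened, so it is best run often enough that corrupted data has little time to propagate.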
In some languages, access to arrays is carefully controlled so that accesses can only occur validly within their allocated memory space. One reason why there is normally no checking in C is that this would introduce a runtime overhead every time an array element is accessed, which is likely to be unacceptable. Bounds checking could be implemented in C++ by overloading the [ ] (array index) operator. However, it would still be possible to make an erroneous access because pointers can be used instead of the array index operator. Normal array element access like this:
arr[3] = 99;
Can also be written thus:
*(arr+3) = 99;
However, the most common array access problem is accidentally iterating off the end, thus:
int arr[4];
for (int i = 0; i <= 4; i++) arr[i] = 0;   /* bug: the last pass writes arr[4], one past the end */
To detect this kind of error, guard words, like those used with a stack, may be used.
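One way to place guard words around an array is to wrap it in a structure, as in the sketch below. The names ARRAY_GUARD, struct guarded_array, guarded_array_init and guarded_array_ok are hypothetical. All members are 32 bits wide here, so no padding is expected between them, but that is an assumption worth verifying for a given compiler and target.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guard value: odd and unlikely to occur by accident. */
#define ARRAY_GUARD 0x6B2D94F1u

/* The array is sandwiched between two guard words. */
struct guarded_array {
    uint32_t guard_lo;   /* catches writes just before the array */
    int32_t  arr[4];
    uint32_t guard_hi;   /* catches writes just past the end */
};

/* Pre-load the guard words; the array contents are left alone. */
void guarded_array_init(struct guarded_array *g)
{
    g->guard_lo = ARRAY_GUARD;
    g->guard_hi = ARRAY_GUARD;
}

/* Return true if neither guard word has been overwritten. */
bool guarded_array_ok(const struct guarded_array *g)
{
    return g->guard_lo == ARRAY_GUARD && g->guard_hi == ARRAY_GUARD;
}
```

This only catches writes that land on the adjacent words, which covers the off-by-one loop shown above but not a wild index far outside the array.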
Obviously, testing of code logic should eliminate all situations where the execution might get stuck in a loop. However, this is difficult to achieve completely, particularly with embedded applications, as the termination of a loop may be dependent on some event that is external to the code under test. Ideally, careful design means that timeouts are included in any context where the code might get stuck, but this is not always possible and may be overlooked.
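A timeout on a polling loop might look something like the following sketch. The function wait_ready and the stand-in fake_device_ready are hypothetical. A poll count is used as the limit purely to keep the example self-contained, whereas real code would more likely compare against a hardware timer or RTOS tick.

```c
#include <stdbool.h>
#include <stdint.h>

/* Type of a function that reports whether the external event has occurred. */
typedef bool (*ready_fn)(void);

/* Poll is_ready() at most max_polls times.
   Returns true if the event occurred, false if the limit was reached. */
bool wait_ready(ready_fn is_ready, uint32_t max_polls)
{
    for (uint32_t n = 0u; n < max_polls; n++) {
        if (is_ready()) {
            return true;
        }
    }
    return false;   /* timed out: the caller decides how to recover */
}

/* Stand-in for a real device: becomes ready once fake_polls_left
   has been counted down to zero. For illustration only. */
static uint32_t fake_polls_left;

bool fake_device_ready(void)
{
    if (fake_polls_left == 0u) {
        return true;
    }
    fake_polls_left--;
    return false;
}
```

The important point is that the loop has a guaranteed exit and the failure is reported to the caller rather than silently ignored.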
The usual way to ensure that code is not stuck is to employ a watchdog timer. This is some simple hardware that needs to be addressed periodically to stop it from “biting”. If the preset delay is exceeded and the watchdog bites, an interrupt or reset will result. Watchdog refresh code needs to be included at strategic points in the execution flow.
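The behavior of a watchdog timer can be modeled in plain C, as sketched below. This is a host-side simulation only. In a real device the countdown is done by hardware and the refresh is typically a write to a device-specific register. All of the names here (WDT_TIMEOUT_TICKS, wdt_refresh, wdt_tick, wdt_has_bitten) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of ticks the watchdog waits before biting; illustrative value. */
#define WDT_TIMEOUT_TICKS 3u

static uint32_t wdt_remaining = WDT_TIMEOUT_TICKS;
static bool wdt_bitten = false;

/* Called by the application at strategic points to restart the countdown. */
void wdt_refresh(void)
{
    wdt_remaining = WDT_TIMEOUT_TICKS;
}

/* Models one tick of the watchdog's own clock. In hardware this
   happens independently of the software being monitored. */
void wdt_tick(void)
{
    if (wdt_remaining > 0u) {
        wdt_remaining--;
    }
    if (wdt_remaining == 0u) {
        wdt_bitten = true;   /* real hardware would interrupt or reset here */
    }
}

/* Reports whether the countdown has ever expired. */
bool wdt_has_bitten(void)
{
    return wdt_bitten;
}
```

As long as wdt_refresh() is called at least once every WDT_TIMEOUT_TICKS ticks, the model never bites. Code that stops refreshing, for example because it is stuck in a loop, lets the countdown expire.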
In a multi-tasking system, a watchdog may be implemented in software. A specific task may be implemented with no other function except to monitor the wellbeing of other tasks. If another task fails to prove that it is executing correctly, the watchdog task takes action. A watchdog task may simply use a bank of event flags, all of which it sets to 1. Each other task is required to clear a specific flag down to 0 periodically. If the watchdog task observes an uncleared flag, action is taken.
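The event flag scheme described above could be sketched as follows. The names NUM_TASKS, alive_flags, watchdog_arm, task_check_in and watchdog_poll are hypothetical, and no particular RTOS API is implied. In a real multi-tasking system the flag updates would need to be atomic, for instance by using the RTOS event flag service or briefly disabling interrupts.

```c
#include <stdint.h>

#define NUM_TASKS 3u
#define ALL_FLAGS ((1u << NUM_TASKS) - 1u)

/* One flag per monitored task; 1 means "has not checked in yet". */
static volatile uint32_t alive_flags;

/* Watchdog task: set every flag to 1 at the start of a monitoring period. */
void watchdog_arm(void)
{
    alive_flags = ALL_FLAGS;
}

/* Monitored task: clear its own flag to show it is still running.
   Assumption: this read-modify-write is made atomic in real code. */
void task_check_in(unsigned task_id)
{
    alive_flags &= ~(1u << task_id);
}

/* Watchdog task: return a bitmask of tasks that failed to check in
   (0 means all is well), then re-arm for the next period. */
uint32_t watchdog_poll(void)
{
    uint32_t stuck = alive_flags;
    alive_flags = ALL_FLAGS;
    return stuck;
}
```

A nonzero result from watchdog_poll() identifies which tasks stayed silent, which helps the watchdog task choose a suitable response.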
What to do on detecting a failure
It is established that there are numerous ways that embedded software might detect an error condition. Having done so, what next? That is a very good question and the answer, as is so often the case with embedded software, depends on the exact nature of the application.
For many deeply embedded applications, there is only one option: reset the CPU and, hence, reboot the software. This means that the device ceases functioning for a period of time, which may be inconvenient, but is better than it functioning incorrectly or unreliably. Imagine, for example, a heart pacemaker. Would you rather it missed a couple of beats or went berserk and adversely affected the heart rhythm?
Sometimes there is the opportunity to inform an operator or user. This would be ideal when an impending fault is detected. The user is given some time to finish up before initiating a reset or other remedial action.
What about undetected failures?
Regardless of how carefully an embedded system is designed, both hardware and software, and how much defensive self-testing code is included, there is always the possibility that the device will lock up. The last line of defense is the ability for the user to effect a reset. This can be implemented using existing controls on the device, so a dedicated button is not necessarily required. However, instructions like “Press button A three times, while holding down button B.” are easily forgotten. A dedicated reset button, even if it needs a paperclip, is preferable.
Colin Walls has over thirty years' experience in the electronics industry, largely dedicated to embedded software. A frequent presenter at conferences and seminars and author of numerous technical articles and two books on embedded software, Colin is an embedded software technologist with Mentor Embedded [the Mentor Graphics Embedded Software Division], and is based in the UK. His regular blog is located at: http://blogs.mentor.com/colinwalls. He may be reached by email at email@example.com