Self-testing in embedded systems: Software failure
All electronic systems carry the possibility of failure. An embedded system has intrinsic intelligence that facilitates the possibility of predicting failure and mitigating its effects. This two-part series reviews the options for self-testing that are open to the embedded software developer, along with testing algorithms for memory and some ideas for self-monitoring software in multi-tasking and multi-CPU systems. In part one, we looked at self-testing approaches to guard against hardware failure. Here in part two, we look at self-testing methods that address software malfunctions.
As was mentioned in the introduction to part one, the acceptance of possible failure is a key requirement for building robust systems. This is extremely relevant when considering the possibility of software failure. Even when great care has been taken with the design, testing and debugging of code, it is almost inevitable that undiscovered bugs lurk in all but the most trivial code. Predicting a failure mode is tough, as this requires knowledge of the nature of the bug that leads to the failure and, if that knowledge were available, the bug would have been expunged during development.
The best approach is to recognize that there are broadly two types of software malfunction: data corruption and code looping. Some defensive code can be implemented to detect these problems before too much damage is done.
Arguably the most powerful feature of the C language is also the most common cause of errors and faults: pointers. Data is most likely to become corrupted if it is written via a pointer. The problem is that there is no easy way to detect an invalid pointer. On many processors, dereferencing a NULL pointer results in a trap, so ensuring that a suitable trap handler is installed is a start. On devices where address zero is valid memory, no trap occurs and the access simply proceeds. A similar trap can handle the situation where an invalid (non-existent) memory address is presented by a pointer. However, if the address is valid, but incorrect, random errors may occur.
A memory management unit (MMU) provides some options to trap erroneous situations, as it gives software control over what memory is considered to be valid at any given time. The classic use of an MMU is with a process model operating system. In this context, the code of each task can only access the memory specifically allocated to it. Any attempt to access outside of this area causes an error.
There are two special cases where there is a chance to detect pointer errors: stack overflow/underflow and array bound violation.
Stack space allocation is something of a black art. Although there are static analysis tools around that can help, careful testing during development is wise. This may involve filling the stack with a “fingerprint” value and then looking at utilization after some period of code execution, or write access breakpoints may be employed. Runtime checks for stack usage are often sensible. This simply requires the addition of “guard words” at either end of the allocated stack space. These are pre-loaded with a unique value, which can be recognized as being untouched. It is logical to use an odd number (as addresses are normally even) and avoid common values like 0, 1 and 0xffffffff. With a 32-bit guard word, the chance that an errant write happens to store the very same value, and so goes unnoticed, is then roughly 1 in 4 billion. Like memory tests, the guard words can be checked from a background task or whenever the CPU has nothing better to do. Another possible way to monitor the guard words would be with an MMU that has a fine grain resolution, but such functionality is not common.
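The fingerprint and guard word ideas can be sketched in a few lines of C. This is only an illustration under stated assumptions, not a definitive implementation: the names STACK_GUARD, STACK_FINGERPRINT, stack_prepare, stack_guards_intact and stack_words_used are hypothetical, the stack is treated as an array of 32-bit words with one guard word at each end, and the stack is assumed to grow downward from the high end of the region.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values chosen for this sketch: the guard is odd and
   neither value is a common one such as 0, 1 or 0xffffffff. */
#define STACK_GUARD       0x5A5AC3C3u
#define STACK_FINGERPRINT 0xA5A5A5A5u

/* Assumed layout: region[0] and region[words - 1] are guard words and
   everything between them is usable stack. Requires words >= 2. */
void stack_prepare(uint32_t *region, size_t words)
{
    for (size_t i = 0; i < words; i++)
        region[i] = STACK_FINGERPRINT;
    region[0] = STACK_GUARD;
    region[words - 1] = STACK_GUARD;
}

/* Cheap runtime check, suitable for a background task or idle loop. */
bool stack_guards_intact(const uint32_t *region, size_t words)
{
    return region[0] == STACK_GUARD && region[words - 1] == STACK_GUARD;
}

/* Development-time utilization estimate. Assumes a stack that grows
   downward, so the untouched fingerprint words sit at the low end. */
size_t stack_words_used(const uint32_t *region, size_t words)
{
    size_t i = 1;
    while (i < words - 1 && region[i] == STACK_FINGERPRINT)
        i++;
    return (words - 1) - i;
}
```

In a real system, stack_prepare would have to run before the task starts using the region, and the utilization figure is only a lower bound, since a task may legitimately store the fingerprint value itself.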
In some languages, access to arrays is carefully controlled so that accesses can only occur validly within their allocated memory space. One reason why there is normally no checking in C is that this would introduce a runtime overhead every time an array element is accessed, which is likely to be unacceptable. Bounds checking could be implemented in C++ by overloading the [ ] (array index) operator. However, it would still be possible to make an erroneous access because pointers can be used instead of the array index operator. Normal array element access like this:
arr[3] = 99;
Can also be written thus:
*(arr+3) = 99;
However, the most common array access problem is accidentally iterating off the end, thus:
int arr[4];
for (i=0; i<=4; i++) arr[i] = 0;   /* final pass writes arr[4], one past the end */
To detect this kind of error, guard words, like those used with a stack, may be placed on either side of the array.
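One way to arrange array guard words in C is to wrap the array in a structure, as in the sketch below. The names ARRAY_GUARD, ARR_LEN, struct guarded_array, guarded_array_init and guarded_array_intact are hypothetical. The sketch also assumes that the compiler inserts no padding between the members, which is typically the case when all members have the same type, so that a write one element past the end lands on the trailing guard.

```c
#include <stdbool.h>
#include <stdint.h>

#define ARRAY_GUARD 0x5A5AC3C3u   /* hypothetical odd, uncommon value */
#define ARR_LEN     4

/* Members of a structure are laid out in declaration order, so the
   guards sit immediately before and after the array (assuming no
   padding between members). */
struct guarded_array {
    uint32_t guard_before;
    uint32_t arr[ARR_LEN];
    uint32_t guard_after;
};

/* Load both guards and clear the array, staying within its bounds. */
void guarded_array_init(struct guarded_array *g)
{
    g->guard_before = ARRAY_GUARD;
    for (int i = 0; i < ARR_LEN; i++)
        g->arr[i] = 0;
    g->guard_after = ARRAY_GUARD;
}

/* Returns false if either guard no longer holds its pre-loaded value. */
bool guarded_array_intact(const struct guarded_array *g)
{
    return g->guard_before == ARRAY_GUARD && g->guard_after == ARRAY_GUARD;
}
```

As with stack guard words, the check can be run periodically from a background task. It detects the damage only after the fact, and only if the stray write changed the guard value.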