Self-testing in embedded systems: Hardware failure
All electronic systems carry the possibility of failure. An embedded system has intrinsic intelligence that can be used to predict failure and mitigate its effects. This two-part series reviews the self-testing options open to the embedded software developer, along with testing algorithms for memory and some ideas for self-monitoring software in multi-tasking and multi-CPU systems. In this first part, we look at self-testing approaches that guard against hardware failure. In part two, we'll look at self-testing methods that address software malfunctions.
Embedded software can be incredibly complex, so the possibilities for something going wrong are extensive. Add to this the complexity and potential unreliability of hardware, and system failure seems almost inevitable. And yet, most systems are amazingly reliable, functioning faultlessly for months at a time. This is no accident. The reliability comes about through careful design in the first place and a tacit acceptance of the possibility of failure in the second.
Writing robust software that is less likely to fail requires three issues to be considered:
- How to reduce the likelihood of failure
- How to handle impending failure (and maybe prevent it)
- How to recover from a failure condition
Four main points of failure
The first task is to identify the possible points of failure. There are broadly four possibilities where failure may occur:
- The CPU (microprocessor or microcontroller) itself
- Circuitry around the CPU
- Memory
- One or more peripheral devices
Each of these may be addressed separately.
Failure of just the CPU (or part of it) in an embedded system is quite rare. In the event of such a failure occurring, it is unlikely that any instructions could be executed, so self-testing code is irrelevant. Fortunately, this kind of failure is most likely to happen on power up, when the dead system is likely to be noticed by a user.
As multicore systems become more common, the possibility arises for CPUs to maintain a health check on one another. A simple handshake communication during initialization would suffice to verify that a power-up failure has not occurred. Unfortunately, if all the cores are on the same chip, one of them is unlikely to fail in isolation, which limits what such mutual checking can catch.
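Such a handshake can be as simple as each core writing a sentinel value into shared memory and polling for its peer's. A minimal sketch, assuming a hypothetical sentinel value and caller-supplied word locations (on real hardware these would be shared RAM or mailbox registers, and the spin count stands in for a proper timeout):

```c
#include <stdint.h>

#define HANDSHAKE_MAGIC 0x00C0FFEEu  /* hypothetical sentinel value */

/* Post our "alive" marker, then poll for the peer's marker.
 * Returns 1 if the peer checked in before the spin count expired,
 * 0 otherwise. */
static int handshake(volatile uint32_t *ours,
                     const volatile uint32_t *peer,
                     uint32_t spins)
{
    *ours = HANDSHAKE_MAGIC;            /* announce that we are alive */
    while (spins--) {
        if (*peer == HANDSHAKE_MAGIC)   /* peer announced too */
            return 1;
    }
    return 0;                           /* peer never checked in */
}
```

The core that fails the handshake can then flag the fault (to a log, a LED, or a watchdog) rather than proceeding with a half-alive system.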
An embedded system may have any amount of other electronics around the CPU and all of that can potentially fail. If it dies totally, the peripheral will most likely “disappear” – it no longer responds to its address and accessing it results in a trap. Suitable trap code is a good precaution.
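The pattern behind such trap code can be sketched with `setjmp`/`longjmp` standing in for a return from the CPU's bus-error vector. All names here are illustrative, not from any particular CPU, and the two simulated "devices" exist only to make the sketch self-contained:

```c
#include <setjmp.h>
#include <stdint.h>

static jmp_buf probe_env;
static volatile int probing = 0;

/* On real hardware this would be installed in the bus-error/hard-fault
 * vector.  During a deliberate probe it abandons the faulting access;
 * at any other time the fault is unexpected and handled elsewhere. */
static void bus_fault_trap(void)
{
    if (probing)
        longjmp(probe_env, 1);
    /* ...otherwise: log, reset, or halt as appropriate... */
}

/* Attempt one read through `reader` (standing in for a raw register
 * read).  Returns 1 if the device responded, 0 if the access trapped. */
static int probe_device(uint32_t (*reader)(void), uint32_t *value)
{
    volatile int ok = 0;
    probing = 1;
    if (setjmp(probe_env) == 0) {
        *value = reader();      /* may "trap" and longjmp back */
        ok = 1;
    }
    probing = 0;
    return ok;
}

/* Simulated devices for illustration: one present, one absent. */
static uint32_t present_device(void) { return 0x1234u; }
static uint32_t missing_device(void) { bus_fault_trap(); return 0; }
```

Probing each peripheral this way at start-up turns a mysterious runtime trap into an orderly "device not found" diagnostic.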
Other possible self-testing is totally device dependent. A communications device, for example, may have a loopback mode, which facilitates some rudimentary testing. A display can be loaded with something distinctive so that an operator might observe any visible failures.
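A loopback self-test usually amounts to transmitting a few distinctive byte patterns and checking that each one is echoed back unchanged. A sketch, with function pointers standing in for the device's transmit and receive paths (enabling loopback mode is device-specific and omitted, and the simulated channel exists only for illustration):

```c
#include <stdint.h>

/* Send each test pattern and verify it comes back unchanged.
 * Returns 1 if every pattern was echoed correctly, 0 on the first
 * mismatch.  The patterns exercise all-zeros, all-ones, and the two
 * alternating-bit cases. */
static int uart_loopback_test(void (*put)(uint8_t), uint8_t (*get)(void))
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0x55, 0xAA };
    for (unsigned i = 0; i < sizeof patterns; i++) {
        put(patterns[i]);
        if (get() != patterns[i])
            return 0;           /* mismatch: device is suspect */
    }
    return 1;
}

/* One-byte simulated loopback channel, for illustration only. */
static uint8_t loop_reg;
static void sim_put(uint8_t b) { loop_reg = b; }
static uint8_t sim_get(void)   { return loop_reg; }
static uint8_t bad_get(void)   { return (uint8_t)(loop_reg ^ 0x01); }
```

On real hardware the same routine would be pointed at the device's data registers after setting its loopback-enable bit.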
Given the enormous amount of memory in modern systems and the tiny geometry of the chip technology, it is surprising that memory failure is not a very frequent occurrence. There are two broad types of possible fault: transient and hard.
Transient faults occur from time to time and are virtually impossible to prevent. They are caused by stray radiation – typically cosmic rays – that randomly flips a single bit of memory. Heavy shielding might reduce their likelihood, but this is not practical for many types of device. There is no reliable way to detect a transient fault itself. It is most likely to become manifest as a software malfunction, because some code or data was corrupted. Strategies for monitoring the health of software are discussed in part two of this two-part series.
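Although the bit-flip itself cannot be caught in the act, corruption of code or constant data can at least be detected after the fact by periodically re-checksumming the region and comparing against a reference value computed when the image was known to be good. A minimal sketch; a real system would prefer a CRC over this simple additive sum, which misses some multi-bit errors:

```c
#include <stdint.h>
#include <stddef.h>

/* Sum every byte of a region, e.g. the code image in flash. */
static uint32_t region_checksum(const uint8_t *start, size_t len)
{
    uint32_t sum = 0;
    while (len--)
        sum += *start++;
    return sum;
}

/* Returns 1 if the region still matches its reference checksum,
 * 0 if it has been corrupted since the reference was taken. */
static int region_intact(const uint8_t *start, size_t len,
                         uint32_t expected)
{
    return region_checksum(start, len) == expected;
}
```

Running such a check from a low-priority background task gives early warning that corrupted code may soon misbehave, instead of waiting for the malfunction itself.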