Self-testing in embedded systems: Hardware failure

All electronic systems carry the possibility of failure. An embedded system has intrinsic intelligence that makes it possible to predict failure and mitigate its effects. This two-part series reviews the options for self-testing that are open to the embedded software developer, along with testing algorithms for memory and some ideas for self-monitoring software in multi-tasking and multi-CPU systems. In this first part, we look at self-testing approaches to guard against hardware failure. In part two, we'll look at self-testing methods that address software malfunctions.

Embedded software can be incredibly complex, so the possibilities for something going wrong are extensive. Add to this the complexity and potential unreliability of hardware, and system failure seems almost inevitable. And yet, most systems are amazingly reliable, functioning faultlessly for months at a time. This is no accident. The reliability comes about through careful design in the first place and a tacit acceptance of the possibility of failure in the second.

Writing robust software that is less likely to exhibit failure requires three issues to be considered:

  • How to reduce the likelihood of failure
  • How to handle impending failure (and maybe prevent it)
  • How to recover from a failure condition

Four main points of failure
The first task is to identify the possible points of failure. Broadly, there are four areas where failure may occur:

  • The CPU (microprocessor or microcontroller) itself
  • Circuitry around the CPU or one or more peripheral devices
  • Memory
  • Software

Each of these may be addressed separately.

CPU failure
Failure of just the CPU (or part of it) in an embedded system is quite rare. If such a failure does occur, it is unlikely that any instructions could be executed, so self-testing code is irrelevant. Fortunately, this kind of failure is most likely to happen at power-up, when the dead system is likely to be noticed by a user.

As multicore systems become more common, the possibility arises for CPUs to maintain a health check on one another. A simple handshake communication during initialization would suffice to verify that power-up failure has not occurred. Unfortunately, if all the cores are on the same chip, there is less likelihood of one of them failing in isolation.
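
By way of illustration, here is a minimal sketch of such a handshake in C, using a shared-memory “mailbox”. The mailbox address, the magic values and the timeout mechanism are all hypothetical and would depend on the actual memory map and boot sequence:

/* Power-up handshake between two cores via a shared-memory mailbox.
   All addresses and values below are illustrative only. */

#include <stdint.h>
#include <stdbool.h>

#define MAILBOX_ADDR  0x20000000u          /* hypothetical shared RAM location */
#define CORE0_ALIVE   0xA55A0001u
#define CORE1_ALIVE   0xA55A0002u

static volatile uint32_t * const mailbox = (volatile uint32_t *)MAILBOX_ADDR;

/* Called by core 1 early in its initialization. */
void core1_announce(void)
{
    *mailbox = CORE1_ALIVE;
}

/* Called by core 0; returns true if core 1 checked in before the timeout. */
bool core0_check_core1(uint32_t timeout_loops)
{
    while (timeout_loops--)
    {
        if (*mailbox == CORE1_ALIVE)
        {
            *mailbox = CORE0_ALIVE;        /* acknowledge, so core 1 can verify core 0 too */
            return true;
        }
    }
    return false;                          /* core 1 never responded - treat as power-up failure */
}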

Peripheral failure
An embedded system may have any number of other electronic devices around the CPU, and any of them can potentially fail. If a peripheral dies completely, it will most likely “disappear” – it no longer responds to its address, and accessing it results in a trap. Suitable trap code is a good precaution.

Other possible self-testing is totally device dependent. A communications device, for example, may have a loopback mode, which facilitates some rudimentary testing. A display can be loaded with something distinctive so that an operator might observe any visible failures.
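
As a rough sketch of the loopback idea, the following C routine pushes a few patterns through a UART that has an internal loopback mode and checks that each one comes back intact. The register layout and bit definitions are hypothetical; a real implementation would follow the vendor's register map and would time out rather than wait forever:

/* Loopback self-test for a (hypothetical) UART peripheral. */

#include <stdint.h>
#include <stdbool.h>

#define UART_BASE      0x40001000u                    /* hypothetical base address */
#define UART_CTRL      (*(volatile uint32_t *)(UART_BASE + 0x00))
#define UART_TX        (*(volatile uint32_t *)(UART_BASE + 0x04))
#define UART_RX        (*(volatile uint32_t *)(UART_BASE + 0x08))
#define UART_STATUS    (*(volatile uint32_t *)(UART_BASE + 0x0C))
#define CTRL_LOOPBACK  (1u << 4)                      /* hypothetical loopback enable bit */
#define STATUS_RXRDY   (1u << 0)                      /* hypothetical "receive ready" flag */

bool uart_loopback_test(void)
{
    static const uint8_t patterns[] = { 0x55, 0xAA, 0x00, 0xFF };
    bool ok = true;

    UART_CTRL |= CTRL_LOOPBACK;                       /* route TX internally back to RX */
    for (unsigned i = 0; ok && i < sizeof(patterns); i++)
    {
        UART_TX = patterns[i];
        while ((UART_STATUS & STATUS_RXRDY) == 0)     /* a real test would time out here */
            ;
        if ((UART_RX & 0xFFu) != patterns[i])
            ok = false;                               /* data did not come back intact */
    }
    UART_CTRL &= ~CTRL_LOOPBACK;
    return ok;
}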

Memory failure
Given the enormous amount of memory in modern systems and the tiny geometry of the chip technology, it is surprising that memory failure is not a very frequent occurrence. There are two broad types of possible fault: transient and hard.

Transient faults occur from time to time and are virtually impossible to prevent. They are caused by stray radiation – typically cosmic rays – randomly flipping a single bit of memory. Heavy shielding might reduce their likelihood, but this is not practical for many types of device. There is no reliable way to detect a transient fault itself. It is most likely to become manifest as a software malfunction, because some code or data has been corrupted. Strategies for monitoring the health of software are discussed in part two of this two-part series.

Hard faults are permanent malfunctions and show up in three forms:

  1. Memory does not respond to being addressed at all.
  2. One or more bits are stuck at 0 or 1.
  3. There is cross-talk; addressing one bit has an effect on one or more others.

(1) results in a trap in the same way as the aforementioned peripheral failure. Ensuring that a suitable trap handler is implemented addresses this issue.
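
For illustration, here is a minimal sketch of such a trap handler, assuming an Arm Cortex-M style bus fault exception; the handler name follows the usual CMSIS startup convention, but the logging and safe-state functions are hypothetical application hooks, and the right recovery action is entirely application-specific:

/* Trap handler for accesses to a non-responding address (Cortex-M bus fault).
   log_fatal_error() and enter_safe_state() are hypothetical application hooks. */

#include <stdint.h>

extern void log_fatal_error(uint32_t code);
extern void enter_safe_state(void);

#define ERR_BUS_FAULT  0x0001u

void BusFault_Handler(void)
{
    /* The faulting address could be extracted from the fault status registers here. */
    log_fatal_error(ERR_BUS_FAULT);
    enter_safe_state();              /* e.g. disable outputs, then reset or halt */
    for (;;)
        ;                            /* never return into the faulting code */
}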

The other forms of hard memory failure, (2) and (3), may be detected using self-test code. This testing may be run at start-up and can also be executed as a background task.

Start-up memory testing
There are a couple of reasons why testing memory on start-up makes a lot of sense. As with most electronic devices, the time when memory is most likely to fail is on power-up. So, testing before use is logical. It is also possible to do more comprehensive testing on memory that does not yet contain meaningful data.

A common power-up memory test is called “moving ones”. This tests for cross-talk – i.e. whether setting or clearing one bit affects any others. Here is the logic of a moving ones test:

set every bit of memory to 0
for each bit of memory
{
    verify that all bits are 0
    set the bit under test to 1
    verify that it is 1
    verify all other bits are 0
    set the bit under test to 0
}

The same idea may be applied to implement a moving zeros test. Ideally, both tests should be used in succession. Coding these tests needs care. The process should not, itself, use any memory – code should be executed from flash and all working data must be stored in CPU registers.
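
For illustration, here is one way the moving ones logic might be rendered in C over a small test region. The region limits are hypothetical, and a production version would normally be written in assembly (or very carefully crafted C) so that the test itself uses no RAM; this sketch simply assumes that the region under test does not include the stack:

/* Moving ones test over a (hypothetical) region of RAM.
   Returns false at the first failure detected. */

#include <stdint.h>
#include <stdbool.h>

#define TEST_START  ((volatile uint32_t *)0x20000000u)    /* hypothetical region */
#define TEST_WORDS  1024u

bool moving_ones_test(void)
{
    volatile uint32_t *mem = TEST_START;
    uint32_t w, b, i;

    for (w = 0; w < TEST_WORDS; w++)                      /* set every bit to 0 */
        mem[w] = 0;

    for (w = 0; w < TEST_WORDS; w++)
    {
        for (b = 0; b < 32; b++)
        {
            mem[w] = (uint32_t)1 << b;                    /* set the bit under test */
            if (mem[w] != ((uint32_t)1 << b))             /* verify it, and its neighbors in the word */
                return false;
            for (i = 0; i < TEST_WORDS; i++)              /* verify all other words are still 0 */
                if (i != w && mem[i] != 0)
                    return false;
            mem[w] = 0;                                   /* clear the bit under test again */
        }
    }
    return true;
}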

With increasingly large amounts of memory, the time taken to perform these tests escalates rapidly – roughly with the square of the memory size – and could result in an unacceptable delay in the start-up time for a device. Knowledge of the memory architecture can enable optimization. For example, cross-talk is more likely within a given memory array. So, if there are multiple arrays, the test can be performed individually on each one. Afterwards, a quick check can be performed to verify that there is no cross-talk between arrays, thus:

fill all of memory with 0s
for each memory array
{
    fill array with 1s
    verify that other arrays still contain just 0s
    fill array with 0s
}

This can then be repeated with all of memory starting full of ones.

Background memory testing
Once “real” code is running, comprehensive memory testing is no longer possible. However, testing of individual bytes/words of memory is possible, so long as tiny interruptions in software execution can be tolerated. Most embedded systems have some idle time or run a background task, when there is no real work to be done. This may be an opportunity to run a memory test.

A simple approach is to write, read and verify a series of bit patterns: all ones, all zeros and alternating one/zero patterns. Here is the logic:

for each byte of memory
{
    turn off interrupts
    save memory byte contents
    for values 0x00, 0xff, 0xaa, 0x55
    {
        write value to byte under test
        verify value of byte
    }
    restore byte data
    turn on interrupts
}

Implementing this code requires a little care, as an optimizing compiler is likely to conclude that some or all of the memory accesses are redundant and optimize them away – the compiler, in effect, is optimistic about memory integrity.
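
Here is a sketch of the byte test in C, assuming that the byte under test is not subject to DMA and that interrupt enable/disable primitives with these (hypothetical) names exist in the target environment. Declaring the pointer volatile is what stops the compiler from removing the apparently redundant accesses; the background task would simply step the address through memory, skipping any regions that must not be disturbed:

/* Background test of a single byte of RAM.
   disable_interrupts()/enable_interrupts() are hypothetical platform primitives. */

#include <stdint.h>
#include <stdbool.h>

extern void disable_interrupts(void);
extern void enable_interrupts(void);

bool test_byte(volatile uint8_t *addr)        /* volatile prevents the accesses being optimized away */
{
    static const uint8_t patterns[] = { 0x00, 0xFF, 0xAA, 0x55 };
    uint8_t saved;
    bool ok = true;

    disable_interrupts();
    saved = *addr;                            /* preserve the real contents */
    for (unsigned i = 0; i < sizeof(patterns); i++)
    {
        *addr = patterns[i];                  /* write the pattern ... */
        if (*addr != patterns[i])             /* ... and verify that it reads back */
        {
            ok = false;
            break;
        }
    }
    *addr = saved;                            /* restore the original data */
    enable_interrupts();
    return ok;
}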

In part two of this two-part series, we'll look at self-testing approaches for mitigating software failures.

Colin Walls has over thirty years' experience in the electronics industry, largely dedicated to embedded software. A frequent presenter at conferences and seminars and author of numerous technical articles and two books on embedded software, Colin is an embedded software technologist with Mentor Embedded [the Mentor Graphics Embedded Software Division], and is based in the UK. His regular blog is located at: http://blogs.mentor.com/colinwalls. He may be reached by email at colin_walls@mentor.com.

7 thoughts on “Self-testing in embedded systems: Hardware failure”

  1. “Don't forget some of your peripherals may have a built-in self test such as MEMS accelerometers. We have had a failure where the core of the device (accelerometer) communicated with the CPU, yet an axis failed to provide live data (output was fixed). Ex

  2. “@MWagner – Absolutely. Part of the point of the article is to make you think about these possibilities. Smart peripheral devices are very likely to have such facilities.”

  3. “The bit that often stumps people is this: Ok, we've detected a problem. Now what do we do? Throwing an exception or rebooting just immediately converts a defect into a failure. Far too often I've seen code with asserts or other measures that chan

  4. “@Cdhmanning – You are correct. This area will be covered in Part 2 of the article. But I'll admit that it won't be totally comprehensive, as so much is application-specific.”

  5. “A couple of years back my car had a failure of its Mass Airflow Sensor. Without MAS numbers, the whole fuel mix/ignition system fails. I dug into this and found out a bit more about this. It turns out that the ECU also models the expected MAS numbers
