Editor's Note: Welcome to AspenCore's Special Project on the safety of autonomous vehicles. This article is part of an in-depth look from a variety of angles at the business and technology of autonomous vehicle safety.
Many systems, including industrial machinery, medical devices, and automobiles, are safety-critical: they must detect their own operational failures in real time and react in a way that avoids harming the people who use them. Creating a processor-based system that provides this functional safety requires a combination of hardware error checking, hardware self-test, and system redundancy to deliver the software-independent fault detection and safe resolution these systems need. Fortunately, processors are available that handle much of the hardware heavy lifting needed for safety-critical systems.
The need for functional safety in processor-based systems is rising, especially in automotive applications. Even setting aside the movement toward autonomous vehicles, automobiles increasingly rely on microprocessors to implement critical functions. Anti-lock braking, engine control, and steering are just a few of the vehicle functions now under processor control that have major safety implications. Should any of these processors make even a single misstep without being caught, the results could be fatal.
Unfortunately, the opportunities for something to go wrong in a processor-based design are legion. As the diagram below shows, proper code execution requires many system elements to work correctly. The processor and all its internal registers, the program and cache memories, the RAM, and the bus interfaces among them, along with the system power and clocks, all must operate flawlessly with precision timing. But as anyone who has had their computer lock up for no apparent reason knows, a single bit change anywhere in this system can derail the entire operation. A noise glitch on any line of the bus, a stray alpha particle or cosmic ray strike (yes, they do happen, and more often than one might think) that alters a bit in memory or a register, low voltage, clock drift, and a host of other sources can cause the system to stumble.
The core of processor-based systems offers many opportunities for noise glitches and other single event upsets to completely derail proper software execution.
Such errors can be made unlikely through careful design, but not eliminated. For a system to be deemed safe, then, it must be able to detect such an error in real time and respond appropriately to mitigate its effects. What constitutes proper mitigation is highly application-dependent, but the methods for detecting an error are well established and common to safety-critical designs. Transactions on the system bus, for instance, can be monitored by including error-correcting code (ECC) or cyclic redundancy check (CRC) data with each transaction. Voltage monitors can keep tabs on power sources, and watchdog timers can help monitor clock signals.
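As a rough illustration of the CRC approach mentioned above, the sketch below appends a check value to a bus transaction and verifies it on receipt. It is a software model only: the frame layout, the function names, and the use of CRC-32 from Python's standard `zlib` module are assumptions made for the example, whereas a real bus would typically generate and check a shorter code in hardware.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload (hypothetical frame layout)."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def check_frame(frame: bytes) -> bool:
    """Return True if the trailing CRC matches the payload it arrived with."""
    payload = frame[:-4]
    received = int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == received


def flip_bit(frame: bytes, bit_index: int) -> bytes:
    """Model a single event upset by inverting one bit of the frame."""
    corrupted = bytearray(frame)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Illustrative transaction: the byte values are arbitrary stand-ins
# for an address and data word.
frame = make_frame(b"\x10\x20\x00\xff")
clean_ok = check_frame(frame)                 # undisturbed transfer
upset_ok = check_frame(flip_bit(frame, 13))   # one bit flipped in transit
```

Here `clean_ok` is `True` and `upset_ok` is `False`: the receiver can tell that the transaction was disturbed, although a CRC alone does not say which bit changed. Correcting the error rather than merely detecting it is the job of an ECC scheme.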
A watchdog timer can also provide a gross indication of proper processor operation by having the processor reset the timer on a regular basis. If the processor fails in that duty, the timer runs out and the watchdog sends a signal to alert the system to the failure. Choosing the timeout involves a tradeoff, however, between the software overhead of frequent timer resets and the delay in signaling a processor failure.
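The tradeoff can be made concrete with a small simulation. The sketch below is a model, not driver code for any particular part: the `Watchdog` class, the `run` helper, and the tick-based notion of time are all assumptions introduced for illustration.

```python
class Watchdog:
    """Simulated countdown watchdog; time advances via explicit tick() calls."""

    def __init__(self, timeout_ticks: int):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.fault = False  # latched once the timer expires

    def kick(self) -> None:
        """Called by the processor to reload the timer."""
        if not self.fault:
            self.remaining = self.timeout_ticks

    def tick(self) -> None:
        """Advance one time unit; latch the fault signal on expiry."""
        if self.fault:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.fault = True


def run(timeout_ticks: int, kick_period: int, hang_at: int, total_ticks: int):
    """Return the tick at which the fault is signalled, or None if it never is.

    The modelled processor kicks the watchdog every kick_period ticks
    and stops doing so from tick hang_at onward.
    """
    wd = Watchdog(timeout_ticks)
    for t in range(1, total_ticks + 1):
        if t < hang_at and t % kick_period == 0:
            wd.kick()
        wd.tick()
        if wd.fault:
            return t
    return None


healthy = run(timeout_ticks=10, kick_period=4, hang_at=1000, total_ticks=100)
short_timeout = run(timeout_ticks=10, kick_period=4, hang_at=21, total_ticks=100)
long_timeout = run(timeout_ticks=50, kick_period=4, hang_at=21, total_ticks=100)
```

With the processor hanging at the same point in both failing runs, the longer timeout signals the fault later than the shorter one, while the healthy run never trips. Shortening the timeout speeds up detection but forces the software to kick the timer more often. A timeout shorter than the kick period would trip even on a healthy processor.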
Yet detecting a failure is only one part of functional safety. The other part is responding to the failure in a way that maintains safe system operation. This response cannot be entirely software-based. You cannot count on a failed processor to mitigate its own problems, or even to react to the alerts. There must be an independent hardware mechanism in place.
A variety of architectures have evolved over the years to provide such an independent mechanism in processor-based systems. These include a single processor paired with a hardware checker, and two-processor designs in which the second processor is either the same type as the main unit or a different one. This second processor can operate independently, running the same or independent software, serving as a touchstone to validate the main processor's behavior on a cycle-by-cycle basis. The more popular alternative, though, is for the second processor to run in lockstep with the main unit, using the same code and data. In that case the secondary processor typically runs on a slight delay from the primary, so that a transient error on the system bus does not affect both processors in the same way.
A variety of architectures have been developed that support the detection and mitigation of random processing errors. (Source: EE Times)
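The delayed-lockstep arrangement described above can be modelled in a few lines. This is only a behavioral sketch under stated assumptions: in a real device the comparison is done by dedicated hardware, and the toy `core_step` function, the two-cycle default delay, and the injected bit flip are all hypothetical choices made for the example.

```python
from collections import deque


def core_step(state: int, data_in: int) -> int:
    """Toy deterministic 'core': one cycle of computation (hypothetical)."""
    return (state * 31 + data_in) & 0xFFFF


def run_lockstep(inputs, delay=2, upset_cycle=None):
    """Return the cycle at which the comparator flags a mismatch, or None.

    The shadow core replays the primary's inputs `delay` cycles later,
    and each shadow result is compared with the buffered primary result.
    """
    primary_state = shadow_state = 0
    in_fifo = deque()   # inputs waiting for the delayed shadow core
    out_fifo = deque()  # primary results waiting to be compared

    for cycle in range(len(inputs) + delay):
        if cycle < len(inputs):
            primary_state = core_step(primary_state, inputs[cycle])
            if cycle == upset_cycle:
                primary_state ^= 0x0004  # transient bit flip, primary only
            in_fifo.append(inputs[cycle])
            out_fifo.append(primary_state)
        if cycle >= delay:
            shadow_state = core_step(shadow_state, in_fifo.popleft())
            if shadow_state != out_fifo.popleft():
                return cycle  # comparator raises the error signal
    return None


clean_run = run_lockstep(list(range(10)))
faulty_run = run_lockstep(list(range(10)), delay=2, upset_cycle=3)
```

In this model the clean run completes without a mismatch, while the upset injected into the primary is caught once the delayed shadow core reaches the same step. The model shows the detection latency that the delay introduces. It does not show the delay's benefit, since the upset is injected into one core only and never reaches both.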