Tips on building & debugging embedded hardware & software designs: Part 2
I’m constantly astonished by the utter reliability of computers. While people complain and fume about various PC crashes and other frustrations, we forget that the machine executes millions of instructions per second, even when sitting in an idle loop.
Smaller device geometries mean that sometimes only a handful of electrons represent a one or zero. A single-bit failure, even for a fleeting instant, can be a disaster.
Yet these failures and glitches are exceedingly rare. Our embedded systems, and even our desktop computers, switch trillions of bits without the slightest problem.
Problems can and do occur, though, due more often to hardware or software design flaws than to glitches. A watchdog timer (WDT) is a good defense for all but the smallest of embedded systems. It’s a mechanism that restarts the program if the software runs amok.
The WDT resets the processor when its countdown expires, typically after a few hundred milliseconds, unless the firmware restarts it first. It’s up to the code to tickle the timer frequently, restarting the countdown interval each time. A code crash means the timer counts down without interruption; at time-out, hardware resets the CPU, ideally bringing the system back on-line.
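As a rough illustration of that countdown-and-tickle behavior, here is a minimal sketch in C. It models the watchdog in software so the logic can be run on a desktop; the type `wdt_model_t` and the functions `wdt_init`, `wdt_kick`, and `wdt_advance` are hypothetical names invented for this example, and on a real part the counter lives in a hardware peripheral rather than a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a hardware watchdog: a down-counter that
   "resets the CPU" when it reaches zero. */
typedef struct {
    uint32_t timeout_ms;   /* full countdown interval */
    uint32_t remaining_ms; /* time left before the reset fires */
    bool     reset_fired;  /* latched once the countdown expires */
} wdt_model_t;

/* Start the watchdog with a full countdown interval. */
void wdt_init(wdt_model_t *w, uint32_t timeout_ms)
{
    assert(w != NULL && timeout_ms > 0);
    w->timeout_ms   = timeout_ms;
    w->remaining_ms = timeout_ms;
    w->reset_fired  = false;
}

/* "Tickle" the watchdog: restart the countdown from the top. */
void wdt_kick(wdt_model_t *w)
{
    assert(w != NULL);
    w->remaining_ms = w->timeout_ms;
}

/* Let simulated time pass. If the countdown runs out, latch the reset. */
void wdt_advance(wdt_model_t *w, uint32_t elapsed_ms)
{
    assert(w != NULL);
    if (w->reset_fired) {
        return;                      /* already timed out */
    }
    if (elapsed_ms >= w->remaining_ms) {
        w->remaining_ms = 0;
        w->reset_fired  = true;      /* real hardware would assert RESET here */
    } else {
        w->remaining_ms -= elapsed_ms;
    }
}
```

In this model a healthy main loop calls `wdt_kick` once per pass, well inside the time-out. Code that hangs simply stops calling it, and `wdt_advance` eventually latches the reset.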
The first rule of watchdog design is to drive the CPU’s reset input, not an interrupt (such as NMI). A WDT time-out means that something awful happened, something that may have left the CPU in an unpredictable scrambled state. Only RESET is guaranteed to bring the part back on-line.
The non-maskable interrupt is seductive to some designers, especially when the pin is unused and there’s a chance to save a few gates. For better or worse, NMI—and all other interrupt inputs—is not fail-safe. Confused internal logic will shut down NMI response on some CPUs.
On other chips a simple software problem can render the non-maskable interrupt unusable. The 68K, for example, will crash if the stack pointer assumes an odd value. If you rely on the WDT to save the day, driving an interrupt while SP is odd results in a double bus fault, which puts the CPU in a dead state until it’s reset.
Next, think through the litigation potential of your system. Life-threatening failure modes mean you’ve got to beware of simple watchdog timers! If a single I/O instruction successfully keeps the WDT alive, then there’s a real chance that the code might crash but continue to tickle the timer.
Some companies (Toshiba, for example) require a more complex sequence of commands to the timer; it’s equally easy to create a PLD yourself that requires a fiendishly complex WDT sequence.
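One way to picture a multi-step tickle is as a small state machine that restarts the timer only after it sees a specific series of writes in order. The sketch below is an illustration only: the key values, the three-step length, and the names `WDT_KEYS`, `wdt_seq_t`, and `wdt_seq_write` are all assumptions made for this example, not the sequence used by any particular vendor's part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key sequence; a real device or PLD defines its own. */
static const uint8_t WDT_KEYS[] = { 0x55u, 0xAAu, 0x3Cu };
#define WDT_KEY_COUNT (sizeof WDT_KEYS / sizeof WDT_KEYS[0])

typedef struct {
    size_t   next;      /* index of the key expected next */
    unsigned restarts;  /* number of times the countdown was restarted */
} wdt_seq_t;

/* Model one write to the watchdog's key register.
   Returns true only when the write completes the full sequence. */
bool wdt_seq_write(wdt_seq_t *s, uint8_t value)
{
    assert(s != NULL && s->next < WDT_KEY_COUNT);

    if (value != WDT_KEYS[s->next]) {
        s->next = 0;        /* wrong key: discard all progress */
        return false;
    }

    s->next++;
    if (s->next == WDT_KEY_COUNT) {
        s->next = 0;
        s->restarts++;      /* full sequence seen: restart the countdown */
        return true;
    }
    return false;           /* correct so far, but sequence not finished */
}
```

Crashed code that keeps hammering a single value at the register never completes the sequence in this model, so the countdown keeps running. Some designs go further and force an immediate reset on a wrong write; this sketch only discards the partial progress.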
It’s also a very bad idea to put the WDT reset code inside of an interrupt service routine. It’s always intriguing, while debugging, to find your code crashed but one or more ISRs still functioning. Perhaps the serial receive routine still accepts characters and echoes them to the sender.
After all, the ISR by definition runs independently of the rest of the code, so will often continue to function when other routines die. If your WDT tickler stays alive as the world collapses around the rest of the code, then the watchdog serves no useful purpose.
This problem multiplies in a system with an RTOS, because a reliable watchdog must monitor all of the tasks. If some of the tasks die but others stay alive, perhaps still tickling the WDT, then the system’s operation is at best degraded.
In this case write the WDT code as its own task, driven by a timer. All other tasks send messages to the watchdog task, indicating “I’m alive.” Only when the watchdog task sees that every task that should have checked in is indeed operating does it service the hardware watchdog.
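The check-in scheme can be sketched without committing to any particular RTOS. In the C fragment below, the names `wdt_monitor_t`, `monitor_init`, `monitor_checkin`, and `monitor_tick` are hypothetical, and a plain function call stands in for the arrival of an “I’m alive” message; in a real system the watchdog task would receive those messages through whatever queue or mailbox the RTOS provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 32u

/* State owned privately by the watchdog task (hypothetical layout). */
typedef struct {
    uint32_t expected;  /* bit n set: task n must check in every period */
    uint32_t seen;      /* bit n set: task n has checked in this period */
    unsigned kicks;     /* times the hardware WDT would have been serviced */
} wdt_monitor_t;

void monitor_init(wdt_monitor_t *m, uint32_t expected_mask)
{
    assert(m != NULL);
    m->expected = expected_mask;
    m->seen     = 0;
    m->kicks    = 0;
}

/* Called when an "I'm alive" message from task_id is received. */
void monitor_checkin(wdt_monitor_t *m, unsigned task_id)
{
    assert(m != NULL && task_id < MAX_TASKS);
    m->seen |= (uint32_t)1u << task_id;
}

/* Called once per timer period by the watchdog task.
   Services the watchdog only if every expected task has reported. */
bool monitor_tick(wdt_monitor_t *m)
{
    assert(m != NULL);
    bool all_alive = (m->seen & m->expected) == m->expected;
    if (all_alive) {
        m->kicks++;     /* real code would tickle the hardware WDT here */
    }
    m->seen = 0;        /* every task must report again next period */
    return all_alive;
}
```

Clearing `seen` every period is a deliberate choice in this sketch: a task that reported once and then died cannot keep the watchdog alive on the strength of an old check-in.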
If you use RTOS-supplied messaging to communicate the tasks’ health, rather than the dreaded though easy global variables, there’s little chance that errant code overwriting RAM can create a false indication that all’s OK.
Suppose the WDT does indeed find a fault and resets the CPU. Then what? A simple reset and restart may not be safe or wise.
One system uses very high-energy gamma rays to measure the thickness of steel. A hardware problem led to a series of watchdog time-outs. I watched, aghast, as this system cycled through WDT resets about once a second, each time opening the safety shield around the gamma ray source! The technicians were understandably afraid to approach close enough to yank the power cord.
If you cannot guarantee that the system will be safe after the watchdog fires, then you simply must add hardware to put it in a reasonable, nondangerous mode.
Even units that have no safety issues suffer from poorly thought-out WDT designs. A sensor company complained that their products were getting slower. Over time, and with several thousand units in the field, response time to user inputs degraded noticeably. A bit of research showed that their system’s watchdog properly drove the CPU’s reset signal, and the code then recognized a warm boot, going directly to the application with no indication to the users that the time-out had occurred.
We tracked the problem down to a floating input on the CPU that caused the software to crash—up to several thousand times per second. The processor was spending most of its time resetting, leading to apparently slow user response.
If your system recovers automatically from a WDT time-out, add an LED or status display so users—or at least the programmers!—know that the system had an unexpected reset. Don’t use a bit of clever watchdog code to compensate for software or hardware glitches.
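A sketch of how startup code might keep track of unexpected resets follows. Everything here is an assumption made for illustration: the names `reset_cause_t`, `reset_log_t`, `RESET_LOG_MAGIC`, and `boot_record_reset` are hypothetical, the way a real CPU reports its reset cause is part-specific, and the log is assumed to live in memory that survives a warm boot (RAM that the startup code does not clear, or nonvolatile storage).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reset causes; real parts report this in a status register. */
typedef enum { RESET_POWER_ON, RESET_WATCHDOG } reset_cause_t;

/* Record assumed to survive a warm boot. */
typedef struct {
    uint32_t magic;       /* marks the record as initialized */
    uint32_t wdt_resets;  /* watchdog resets since the last power-on */
} reset_log_t;

#define RESET_LOG_MAGIC 0x57445431u  /* arbitrary marker value */

/* Call early in startup. Returns true if a fault indicator (an LED or a
   status message) should be shown to the user. */
bool boot_record_reset(reset_log_t *log, reset_cause_t cause)
{
    assert(log != NULL);

    /* After a cold start, or if the record is garbage, start fresh. */
    if (cause == RESET_POWER_ON || log->magic != RESET_LOG_MAGIC) {
        log->magic      = RESET_LOG_MAGIC;
        log->wdt_resets = 0;
    }

    if (cause == RESET_WATCHDOG) {
        log->wdt_resets++;   /* remember that the system restarted itself */
    }

    return log->wdt_resets > 0;
}
```

A running count is more useful than a single flag: a unit that reports thousands of watchdog resets is telling you about a design flaw, not a rare glitch.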

