Li’l Bow Wow

Although a watchdog timer is essential to reliability, a vulnerable dog does not a reliable system make. Here's how you can breed a champion.

Watchdog timers live on the boundary between hardware and software. A watchdog, along with other hardware interlocks found on some equipment, is an essential resource that protects users from rogue code, honest design errors, and acts of God. The humble watchdog timer (WDT) is the last resort when our embedded systems crash; it's the only chance we have to bring the system back online without manual user intervention.

But I believe our philosophy of watchdogging is quite flawed. Our misplaced faith in the perfection of our code leads too many to believe that crashes happen rarely, if at all. As a result, the WDT design is usually an afterthought. My observations suggest that WDTs are called upon often over the life of a system, so their design had better be of the highest quality; higher even than the rest of the application, which may have crashed and invoked the WDT. If you expect your system to run reliably and can't count on a nearby user to hit the reset button when the code goes astray, you should add a killer watchdog.

I did a mini-survey of typical external and internal watchdogs last month, and all of them came up short. None met the exacting standards that I think are needed for high-reliability systems.

What constitutes an awesome watchdog timer? The perfect WDT should detect all erratic and insane software modes. It must not make any assumptions about the condition of the software or the hardware; in the real world, anything that can go wrong will. The ultimate WDT must bring the system back to normal operation no matter what went wrong, whether it was a software defect, a RAM glitch, or a bit flip from cosmic rays.

It's impossible to recover from a hardware failure that keeps the computer from running properly, but at the least the WDT must put the system into a safe state. Finally, it should leave breadcrumbs behind, generating debug information for the developers. After all, a watchdog timeout is the yin and yang of an embedded system. It saves the system, keeping customers happy, yet demonstrates an inherent design flaw that should be addressed. Without debug information, troubleshooting these infrequent and erratic events is close to impossible. What does all this mean in practice? Here's my take.

Best of breed

An effective watchdog is independent from the main system. Though all WDTs are a blend of interacting hardware and software, something external to the processor must always be poised like the sword of Damocles, ready to intervene as soon as a crash occurs. Pure software implementations are simply not reliable.

There's only one kind of intervention that's effective: an immediate reset to the processor and all connected peripherals.

All we really know when the WDT fires is that something awful happened. Software bug? Perhaps. Hardware glitch? Also possible. Can you ensure that the error wasn't something that scrambled the processor's internal logic states? I worked with one system in which a motor in another room induced so much electromagnetic interference that our instrument sometimes went bonkers. We tracked this down to a sub-nanosecond glitch on one CPU input, a glitch so short that the processor went into an undocumented weird mode. Only a reset brought it back to life.

Many embedded systems have a watchdog that initiates a nonmaskable interrupt (NMI). Designers figure that firing off an NMI rather than initiating a reset preserves some of the system's context. It's easy to seed debugging assets in the NMI handler (like a stack capture) to aid in resolving the crash's root cause. That's a great idea, except that it doesn't work.

Some CPUs, notably the 68k and ColdFire, will throw an exception if a software crash causes the stack pointer to go odd. That's not bad, except that any watchdog circuit that then drives the CPU's NMI will invoke code that pushes the system's context, creating a second stack fault. The CPU halts and stays halted until a reset, and only a reset, comes along.

Asserting reset is the only reliable way to bring a confused microprocessor back to lucidity. Some clever designers, though, build circuits that drive NMI first, and then after a short delay pound on reset. If the NMI works, its exception handler can log debug information and then halt. It may also signal other connected devices that this unit is going offline for a while. The pending reset guarantees a clean restart of the code. Don't be tempted, however, to use the NMI handler to make dangerous hardware safe; that task always, in every system, belongs to a circuit external to the possibly confused CPU.
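
Here's a minimal sketch of what such an NMI handler might look like in C. The logging routine, the "going offline" message, and the globals it records are hypothetical placeholders for whatever breadcrumbs your system keeps:

```c
#include <stdint.h>

/* Hypothetical hooks -- substitute routines for your own hardware. */
extern void log_to_nonvolatile(uint32_t tag, uint32_t value); /* e.g., EEPROM   */
extern void tell_peers_we_are_going_down(void);
extern volatile uint32_t last_task_id;      /* updated by the scheduler          */
extern volatile uint32_t loop_counter;      /* incremented each main-loop pass   */

/* Runs when the watchdog asserts NMI; the same circuit asserts reset a short
   time later, so just log and wait. Never try to make dangerous hardware safe
   here -- that job belongs to logic external to the possibly confused CPU.    */
void nmi_handler(void)
{
    log_to_nonvolatile(0x0001, last_task_id);
    log_to_nonvolatile(0x0002, loop_counter);
    tell_peers_we_are_going_down();

    for (;;)
        ;                                   /* halt until the reset arrives      */
}
```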

Don't forget to reset all the hardware; a simple CPU restart may not be enough. Are the peripherals absolutely, positively, in a sane mode? Maybe not. Runaway code may have issued all sorts of I/O instructions that placed complex devices in insane modes. Give every peripheral a hardware reset; software resets may get lost in all of the I/O chatter.

Consider what the system must do to be totally safe after a failure. Maybe a pacemaker needs to reboot in a heartbeat (so to speak), or maybe backup hardware should issue a few ticks if reboots are slow.

Once I saw a thickness gauge that beamed high energy gamma rays through four inches of hot steel fail in a spectacular way. Defective hardware crashed the code. The WDT properly closed the protective lead shutter, blocking off the five curie cesium source. I was present, and watched incredulously as the engineering VP put his head in the path of the beam. The crashed code, still executing something, tricked the watchdog into opening the shutter, beaming high intensity radiation through the veep's forehead. I wonder to this day what eventually became of the man.

A really effective watchdog cannot use the CPU's clock, which may fail. A bad solder joint on the crystal, poor design that doesn't work well over temperature extremes, or numerous other problems can shut down the oscillator. No WDT internal to the CPU is really safe. Unfortunately, all the WDTs that I know of share the processor's clock.

Under no circumstances should the software be able to reprogram the WDT or any of its necessary components (like reset vectors, I/O pins used by the watchdog, and so on). Always assume runaway code runs under the guidance of a malevolent deity.

Build a watchdog that monitors the entire system's operation. Don't assume that things are fine just because some loop or interrupt service routine runs often enough to tickle the WDT. The watchdog's software should look at a variety of parameters to ensure the product is healthy, kicking the dog only if everything is okay. What is a software crash, after all? Occasionally the system executes a HALT and stops, but more often the code vectors off to a random location, continuing to run instructions. Maybe only one task crashed. Perhaps only one is still alive, no doubt the one that kicks the dog.
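
As one illustration of what "healthy" might mean, the tickling code could verify several independent signs of life before kicking. Every name in this sketch is hypothetical; substitute checks appropriate to your own product:

```c
#include <stdint.h>

/* Hypothetical health indicators maintained elsewhere in the code.       */
extern volatile uint32_t isr_count;        /* bumped by a key interrupt    */
extern volatile uint32_t task_alive_flags; /* one bit set by each task     */
extern int  stack_margin_ok(void);         /* checks a stack guard pattern */
extern void kick_hardware_wdt(void);

#define ALL_TASKS_ALIVE  0x0Fu             /* four tasks in this example   */

/* Call periodically from the main loop: kick only when everything checks. */
void watchdog_health_check(void)
{
    static uint32_t last_isr_count;
    int healthy = 1;

    if (isr_count == last_isr_count)       /* interrupts stopped arriving? */
        healthy = 0;
    last_isr_count = isr_count;

    if (task_alive_flags != ALL_TASKS_ALIVE)
        healthy = 0;
    task_alive_flags = 0;                  /* tasks must re-assert by next check */

    if (!stack_margin_ok())
        healthy = 0;

    if (healthy)
        kick_hardware_wdt();               /* anything wrong: let the dog bite   */
}
```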

Think about what can go wrong in your system. Take corrective action when that's possible, but initiate a reset when it's not. For instance, can your system recover from exceptions like floating point overflow or divide by zero? If not, these conditions may well signal the early stages of a crash. Either handle these competently or initiate a WDT timeout. For the cost of a handful of lines of code, you may keep a 60 Minutes camera crew from appearing at your door.
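
Here's a sketch of that idea for an unrecoverable exception; the handler name, how it gets wired to the CPU's exception vector, and the logging routine are all hypothetical:

```c
#include <stdint.h>

extern void log_to_nonvolatile(uint32_t tag, uint32_t value);

/* Hypothetical handler attached to the CPU's divide-by-zero exception.
   If recovery isn't possible, record the event and let the watchdog
   time out rather than limping along with corrupt data.               */
void divide_by_zero_handler(void)
{
    log_to_nonvolatile(0x0010, 0);   /* breadcrumb for the developers     */
    for (;;)
        ;                            /* stop kicking; the WDT resets us   */
}
```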

It's a good idea to light an LED or otherwise indicate that the WDT kicked. A lot of devices automatically recover from timeouts; they quickly come back to life with the customer unaware a crash occurred. Unless you have a debug LED, how do you know if your precious creation is working properly or occasionally invisibly resetting? One outfit I consulted for complained that over time, and with several thousand units in the field, their product's response time to user inputs degraded noticeably. A bit of research showed that their system's watchdog properly drove the CPU's reset signal, and the code then recognized a warm boot, going directly to the application with no indication to the users that the time-out had occurred. We tracked the problem down to a floating input on the CPU that caused the software to crash, up to several thousand times per second. The processor was spending most of its time resetting, leading to apparently slow user response. A simple LED would have flagged the problem during debug, long before customers started yelling.
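
A minimal sketch of the idea: at startup, check whether the reset came from the watchdog (via a reset-cause register or a flag latched by the external circuit), count it, and light the debug LED. The names here are illustrative, not from any particular part:

```c
#include <stdint.h>

/* Hypothetical hardware hooks -- adjust for your CPU and board.           */
extern int  reset_was_watchdog(void);     /* reads the reset-cause register */
extern void debug_led_on(void);
extern uint32_t wdt_timeout_count;        /* e.g., kept in battery-backed RAM */

/* Call once, early in the startup code. */
void startup_check(void)
{
    if (reset_was_watchdog()) {
        wdt_timeout_count++;              /* breadcrumb for the field       */
        debug_led_on();                   /* visible flag during debug      */
    }
}
```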

Everyone knows we should include a jumper to disable the WDT during debugging. But few folks think this through. The jumper should be inserted to enable debugging and removed for normal operation. Otherwise, if manufacturing forgets to install the jumper, or if it falls out during shipment, the WDT won't function. And there's no production test to check the watchdog's operation.

Design the logic so the jumper disconnects the WDT from the reset line (possibly through an inverter so an inserted jumper sets debug mode). Then the watchdog continues to function even while debugging the system. It won't reset the processor but will light the LED. (The light may come on when breakpointing and single-stepping, but should never come on during full-speed testing.)

In the doghouse

Most embedded processors that include high-integration peripherals have some sort of built-in WDT. Avoid using them except in the most cost-sensitive or benign systems. Internal WDTs offer minimal protection from rogue code. Runaway software may reprogram the WDT controller, many internal WDTs will not generate a proper reset, and any failure of the processor will make it impossible to put the peripherals into a safe state. A great WDT must be independent of the CPU it's trying to protect.

However, in systems that must use the internal versions, there's plenty we can do to make them more reliable. The conventional model of kicking a simple timer at erratic intervals is too easily spoofed by runaway code.

A pair of design rules leads to decent WDTs: prove the software is running properly by executing a series of unrelated things, all of which must work, before kicking the dog, and make sure that erratic execution streams that wander into your watchdog routine won't issue incorrect tickles.

This is a great place to use a simple state machine. Suppose we define a global variable named state. At the beginning of the main loop, set state to 0x55AA. Call watchdog routine A, which adds an offset, say 0x1111, to state and then ensures the variable is now 0x66BB. Return if the compare matches; otherwise halt or take other action that will cause the WDT to fire.

Later, maybe at the end of the main loop, add another offset to state, say 0x2222. Call watchdog routine B, which makes sure state is now 0x88DD. Set state to zero. Kick the dog if the compare worked. Return. Halt otherwise.

This is a trivial bit of code, but now runaway code that stumbles into any of the tickling routines cannot errantly kick the dog. Further, no tickles will occur unless the entire main loop executes in the proper sequence. If the code just calls routine B repeatedly, no tickles will occur because it sets state to zero before exiting.
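
In C, the whole scheme might look something like the following sketch; kick_hardware_wdt() and wait_for_wdt_reset() are stand-ins for whatever your particular watchdog hardware requires:

```c
#include <stdint.h>

/* Global sequence variable; any code that corrupts it forces a timeout.  */
static volatile uint16_t state;

/* Hypothetical hardware hooks -- substitute your part's requirements.    */
extern void kick_hardware_wdt(void);   /* restart the watchdog timer       */
extern void wait_for_wdt_reset(void);  /* spin here until the dog bites    */

/* Watchdog routine A: called near the top of the main loop. */
void wdt_a(void)
{
    state += 0x1111;                   /* 0x55AA -> 0x66BB if all is well  */
    if (state != 0x66BB)
        wait_for_wdt_reset();
}

/* Watchdog routine B: called at the end of the main loop, after the
   caller has added 0x2222 to state.                                       */
void wdt_b(void)
{
    int ok = (state == 0x88DD);
    state = 0;                         /* repeated calls to B can't tickle */
    if (ok)
        kick_hardware_wdt();
    else
        wait_for_wdt_reset();
}

void main_loop(void)
{
    for (;;) {
        state = 0x55AA;                /* start the sequence               */
        wdt_a();
        /* ... the application's real work ... */
        state += 0x2222;               /* 0x66BB -> 0x88DD                 */
        wdt_b();
    }
}
```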

Add additional intermediate states as your paranoia or fear of litigation dictates.

Normally I detest global variables, but this is a perfect application. Cruddy code that mucks with the variable, errant tasks doing strange things, or any error that steps on the global should also make the WDT time out.

Put these actions in the program's main loop, not inside an interrupt service routine. It's fun to watch a multithreaded product crash. The entire system might be hung, but one task somehow responds to interrupts. If your tickler stays alive as the world collapses around it, the watchdog serves no useful purpose. (We'll look at multitasking issues in more detail next month.)

If the WDT doesn't generate an external reset pulse (some processors handle the restart internally) make sure the code issues a hardware reset to all peripherals immediately after start-up. That may mean working with the EEs so an output bit resets every resettable peripheral.

In a multiprocessor system it's easy to turn all of the processors into watchdogs. Have them exchange “I'm okay” messages periodically. The receiver resets the transmitter if it stops speaking. This approach checks a lot of hardware and software and requires little circuitry.
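
Here's a sketch of the receiving side, assuming a periodic tick interrupt and an output bit wired to the other processor's reset pin; all of the names are illustrative:

```c
#include <stdint.h>

#define PEER_TIMEOUT_TICKS  100           /* tolerate 100 ticks of silence  */

extern void assert_peer_reset(void);      /* output wired to peer's reset   */
static volatile uint32_t peer_silence;    /* ticks since last "I'm okay"    */

/* Call from the message handler whenever the peer's heartbeat arrives. */
void peer_heartbeat_received(void)
{
    peer_silence = 0;
}

/* Call from a periodic timer interrupt. */
void peer_monitor_tick(void)
{
    if (++peer_silence > PEER_TIMEOUT_TICKS) {
        assert_peer_reset();              /* peer stopped talking: reset it */
        peer_silence = 0;
    }
}
```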

If you must take action to return dangerous hardware to a safe state, stay away from internal watchdogs, because there's no way to guarantee the code will come back to life. Broken hardware will obviously keep the code from restarting, but so can lousy code. A digital camera was recalled recently when users found that turning the device off while in a certain mode meant it could never be turned on again. The code wrote faulty info to flash memory that created a permanent crash.

Let the dogs out

The ideal watchdog is one that doesn't rely on the processor or its software. It's external to the CPU, shares no resources, and is utterly simple, thus devoid of latent defects.

Use a low-cost PIC, a Z8, or other similar dirt-cheap processor as a system health monitor. These parts have an independent clock, on-chip memory, and the built-in timers we need to build a truly great WDT. Being external, you can connect an output to hardware interlocks that put dangerous machinery into safe states.

Tickle it using the same sort of state-machine described above. Like the windowed watchdogs I mentioned last month (TI's TPS3813 and Maxim's MAX6323), define both min and max tickle intervals, to further limit the chances that a runaway program deludes the WDT into avoiding a reset.
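
On the monitor's side, the windowed check might look like this sketch, with the tick counts chosen to bracket your expected tickle interval; every name here is an assumption:

```c
#include <stdint.h>

#define MIN_TICKS   8                    /* tickle earlier than this: suspect */
#define MAX_TICKS  12                    /* tickle later than this: crash     */

extern void assert_main_cpu_reset(void); /* output to the main CPU's reset    */
extern void make_hardware_safe(void);    /* close shutters, stop motors, etc. */
static volatile uint16_t ticks_since_tickle;

/* Called from the monitor processor's timer interrupt. */
void monitor_tick(void)
{
    if (++ticks_since_tickle > MAX_TICKS) {
        make_hardware_safe();
        assert_main_cpu_reset();
        ticks_since_tickle = 0;
    }
}

/* Called when the main CPU's tickle arrives (a pulse or a message). */
void tickle_received(void)
{
    if (ticks_since_tickle < MIN_TICKS) { /* too soon: likely runaway code    */
        make_hardware_safe();
        assert_main_cpu_reset();
    }
    ticks_since_tickle = 0;
}
```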

Perhaps it seems extreme to add an entire processor just for the sake of a decent watchdog. We'd be fools to add extra hardware to a highly cost-constrained product. Most of us, though, build lower volume, higher margin systems. A 50-cent part that prevents the loss of an expensive mission or that even saves the cost of one customer support call might make a lot of sense.

I'd hoped to conclude this discussion in just two columns, but there's more. Lots more. I'm passionate about reliable embedded systems, and watchdogs are an integral part of these. Stay tuned for next month's thoughts on watchdogs in multitasking systems and some ways to fix applications that only partially crash.

Jack G. Ganssle is a lecturer and consultant on embedded development issues. He conducts seminars on embedded systems and helps companies with their embedded challenges. Contact him at .

Reference

For the basics on watchdog timers, see Murphy, Niall and Michael Barr, “Beginner's Corner: Watchdog Timers,” Embedded Systems Programming, October 2001, p.79.
